---
name: batch-pipeline-designer
description: |
Design batch data processing pipelines for large-scale, bounded datasets processed offline. Use when building ETL workflows, processing logs or clickstream data at scale, generating ML feature pipelines or search indexes, or joining two large datasets that cannot fit in memory. Trigger phrases: "design a batch pipeline", "should I use Spark or MapReduce", "how do I join two large datasets", "build an ETL workflow", "process server logs at scale", "how do I handle skewed data in joins", "implement PageRank on a distributed graph", "design an offline processing job". Covers MapReduce vs dataflow engines (Spark, Flink, Tez), three join strategies (sort-merge, broadcast hash, partitioned hash) with selection criteria, graph processing via the Pregel/BSP model, and fault tolerance via materialization vs recomputation. Does not apply to unbounded input streams (see stream-processing-designer) or low-latency OLTP query serving. Produces a pipeline architecture recommendation with engine choice, join strategy, and fault tolerance approach.
model: sonnet
context: 1M
execution:
tier: 2
mode: hybrid
inputs:
- type: document
description: "System requirements, data flow diagrams, schema definitions, or architecture descriptions"
- type: codebase
description: "Existing pipeline code or infrastructure configs to understand current setup"
- type: none
description: "Skill can work from a verbal description of the processing requirements"
tools-required: [Read, TodoWrite]
tools-optional: [Grep, Bash]
environment: "Can run from any directory; codebase access enables analysis of existing pipeline code"
depends-on:
- oltp-olap-workload-classifier
---
# Batch Pipeline Designer
## When to Use
Use this skill when you are designing or evaluating a **batch processing pipeline** — a system that reads a bounded, fixed-size dataset, runs a job over it, and produces output. Batch jobs are scheduled periodically (hourly, daily), tolerate high latency, and are measured by throughput rather than response time.
This skill applies when:
- Processing large datasets offline (logs, clickstream, database snapshots, file dumps)
- Building ETL workflows to move data between systems
- Generating derived datasets: search indexes, ML training features, recommendation model inputs, aggregated reports
- Implementing multi-step data transformations across distributed storage
- Joining two or more large datasets that cannot fit in memory on a single machine
- Processing graph-structured data (social graphs, link graphs, dependency graphs) in bulk
**This skill does NOT apply to:**
- Stream processing where input is unbounded (see `stream-processing-designer`)
- Online query serving where sub-second latency is required
- Transactional workloads (see `oltp-olap-workload-classifier` to confirm the workload type)
**Dependency check:** If the workload classification is uncertain, invoke `oltp-olap-workload-classifier` first to confirm this is a batch workload before proceeding.
## Context & Input Gathering
### Input Sufficiency Check
```
User prompt → Extract: data sources, transformations required, output destination
↓
Environment → Check: existing pipeline code, schema files, infrastructure configs
↓
Gap analysis → Do I know the input volume, join requirements, latency tolerance?
↓
Missing critical info? ──YES──→ ASK (one question at a time)
│
NO
↓
PROCEED with pipeline design
```
### Required Context (must have — ask if missing)
- **Input data characteristics:** What datasets are being processed? What are their approximate sizes?
→ Check prompt for: dataset names, "gigabytes", "terabytes", "millions of records", log file descriptions
→ Check environment for: schema files, data flow docs, existing pipeline configs
→ If still missing, ask: "What are the input datasets, and roughly how large is each one? (e.g., 100 GB user event log + 10 GB user profile database)"
- **Required transformations and joins:** What operations need to happen — filtering, aggregation, joins between datasets?
→ Check prompt for: SQL-like language ("group by", "join on", "aggregate"), descriptions of data enrichment or correlation
→ If still missing, ask: "What does the pipeline need to produce? For example: 'count page views per URL', 'join user events with user profiles', or 'build a feature matrix for ML training'."
- **Output destination:** Where does the processed data go — a database, search index, file system, another pipeline?
→ Check prompt for: destination systems, "load into", "write to", "serve via"
→ If still missing, ask: "Where does the output of this pipeline need to go, and how will it be consumed? (e.g., loaded into Elasticsearch, written to HDFS for downstream jobs, served via a key-value store)"
- **Latency tolerance:** Is this a daily job, hourly, or near-real-time? How long can the job take to complete?
→ Check prompt for: "daily", "hourly", "SLA", "must finish by", scheduling language
→ If still missing, ask: "What is the acceptable job completion time? For example: 'must finish within 6 hours of midnight' or 'runs weekly, no strict deadline'."
### Observable Context (gather from environment)
- **Existing infrastructure:** Look for Hadoop, Spark, Flink, or Airflow configurations that constrain engine choice.
→ Look for: `pom.xml` with spark/hadoop deps, `requirements.txt` with pyspark, `airflow/dags/`, `Dockerfile` with cluster images
→ If unavailable: assume greenfield, no constraint on engine choice
- **Data partitioning:** Check whether input datasets are already partitioned and by what key.
→ Look for: HDFS directory structures, partition specs in Hive metastore configs, dataset metadata files
→ If unavailable: assume unpartitioned, sort-merge join is safe default
- **Skew indicators:** Check whether any join keys are known hot keys, i.e. a few key values that account for a disproportionate share of records (e.g., celebrity users in a social graph, viral content IDs).
→ Look for: comments about hot keys, prior skew handling code, "top user" or "trending" concepts in data model
→ If unavailable: assume uniform distribution until data profile reveals otherwise
### Default Assumptions
- If engine is unspecified and infrastructure allows it: recommend a dataflow engine (Spark or Flink) over raw MapReduce — they execute significantly faster for most workloads.
- If join dataset sizes are unspecified: default to sort-merge join (safest for unknown sizes, no memory assumptions).
- If output destination is unspecified: write to distributed filesystem (HDFS, S3) — composable, immutable, safe.
- If fault tolerance requirement is unspecified: use materialization for jobs longer than 30 minutes; use in-memory recomputation for shorter jobs.
### Questioning Guidelines
Ask ONE question at a time, most critical first (input characteristics before transformation details before output). Show what you already know from the environment before asking. State WHY you need the information.
## Process
Use `TodoWrite` to track steps before beginning.
```
TodoWrite([
{ id: "1", content: "Confirm workload is batch; gather input/output characteristics", status: "pending" },
{ id: "2", content: "Select processing engine: MapReduce vs dataflow engine", status: "pending" },
{ id: "3", content: "Design join strategy for any cross-dataset operations", status: "pending" },
{ id: "4", content: "Design multi-step workflow and intermediate state handling", status: "pending" },
{ id: "5", content: "Assess graph processing needs (if applicable)", status: "pending" },
{ id: "6", content: "Design fault tolerance strategy", status: "pending" },
{ id: "7", content: "Specify output destination and loading strategy", status: "pending" },
{ id: "8", content: "Produce pipeline architecture document", status: "pending" }
])
```
---
### Step 1: Confirm Workload Classification
**ACTION:** Verify the workload is genuinely batch before proceeding. Determine input volume, update frequency, and latency tolerance.
**WHY:** The boundary between batch and stream processing determines which design space applies. A pipeline that processes data "every hour" sounds like batch — but if the latency requirement is under 5 minutes, it is stream processing in disguise. Getting this wrong leads to building the wrong system. The key distinguishing property of batch is that the input is **bounded** — it has a known, fixed size at the time the job starts. A job knows when it has finished reading all input, so it can eventually complete.
**Classification signals:**
- **Batch:** Input is a file dump, database snapshot, or accumulated log from a fixed time window. Job runs periodically. Latency measured in minutes to hours. Primary metric is throughput.
- **Stream:** Input arrives continuously and is never "done." Latency measured in milliseconds to seconds. See `stream-processing-designer` instead.
**IF** workload classification is ambiguous → invoke `oltp-olap-workload-classifier` first
**IF** clearly batch → proceed to Step 2
Mark Step 1 complete in TodoWrite.
---
### Step 2: Select the Processing Engine
**ACTION:** Choose between MapReduce and a dataflow engine (Spark, Flink, or Tez) for this pipeline.
**WHY:** MapReduce is simple and robust: it fully materializes intermediate state to disk, tolerates frequent task failures without data loss, and runs safely on preemptible, low-priority compute. This robustness has a cost: every intermediate result is written to the distributed filesystem and read back by the next stage, so for pipelines with many stages disk I/O dominates runtime. Dataflow engines (Spark, Flink, Tez) treat the entire workflow as one job, pass data between operators in memory or on local disk rather than through the distributed filesystem, and skip sorting where it is not required. They typically run the same computation significantly faster, with the largest gains on multi-stage pipelines, while still recovering from failures via lineage tracking (Spark's Resilient Distributed Datasets) or operator checkpointing (Flink).
**Decision framework:**
| Factor | Prefer MapReduce | Prefer Dataflow Engine (Spark/Flink/Tez) |
|--------|-----------------|------------------------------------------|
| Pipeline stages | 1-3 stages | 4+ stages (many intermediate materializations in MR) |
| Task preemption risk | High (shared cluster, MR runs low-priority) | Low to moderate (dedicated cluster) |
| Existing infrastructure | Hadoop-only environment | Any modern cluster |
| Fault tolerance model | Must tolerate very frequent task kills | Standard hardware failure tolerance |
| Processing model | Batch only | Batch + can evolve toward streaming |
| Join complexity | Simple reduce-side joins | Complex multi-way joins, broadcast joins |
**Engine-specific selection within dataflow engines:**
- **Spark:** Batch processing, machine learning (MLlib), iterative algorithms. Rich ecosystem. Best default choice.
- **Flink:** Lowest latency for near-real-time batch; strong streaming integration; best when pipeline must eventually handle stream input.
- **Tez:** Thin optimization layer over YARN; lowest overhead when Pig/Hive workflows already exist and just need faster execution.
**Note on high-level languages:** In most cases, you do not write raw MapReduce or dataflow operator code. Higher-level tools (Hive, Spark SQL, Pig, Cascading, Crunch) compile to the underlying engine and include query optimizers that automatically select the best join algorithm. Declare joins relationally and let the optimizer decide — this is preferred unless fine-grained control is needed.
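As a rough illustration, the decision framework above can be collapsed into a small rule function. This is only a sketch: `PipelineProfile`, its field names, and the rule ordering are hypothetical simplifications of the table, not part of any engine's API, and real decisions usually weigh several factors at once.

```python
from dataclasses import dataclass


@dataclass
class PipelineProfile:
    """Hypothetical summary of the factors in the decision table."""
    high_preemption: bool = False       # tasks are killed very frequently
    hadoop_only: bool = False           # no dataflow engine is available
    may_become_streaming: bool = False  # pipeline may later take stream input
    existing_hive_or_pig: bool = False  # Hive/Pig workflows already exist


def recommend_engine(profile: PipelineProfile) -> str:
    """Return an engine name using a simplified reading of the table."""
    if profile.hadoop_only or profile.high_preemption:
        return "MapReduce"
    if profile.may_become_streaming:
        return "Flink"
    if profile.existing_hive_or_pig:
        return "Tez"
    return "Spark"  # default choice when nothing else constrains the decision
```

For example, `recommend_engine(PipelineProfile(may_become_streaming=True))` returns `"Flink"` under these assumed rules.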
Mark Step 2 complete in TodoWrite.
---
### Step 3: Design the Join Strategy
**ACTION:** For any operation that correlates records from two or more datasets, select and specify the appropriate join algorithm from the three available strategies.
**WHY:** Joins are the most expensive and failure-prone operation in batch pipelines. The naive approach — querying a remote database for every record — is orders of magnitude slower than the pipeline's normal throughput, and it makes the job nondeterministic (the remote database may change during the job). The correct approach is to bring all the data needed for a join to the same place as a local operation. Three strategies accomplish this, each suited to different data size and partitioning conditions.
**Strategy 1 — Sort-Merge Join (reduce-side join)**
Use when: both datasets are large, sizes are unknown or similar, or datasets have no pre-existing partitioning guarantees.
How it works: Both datasets go through a mapper that emits records keyed by the join key. The shuffle/sort phase collects all records with the same key to the same reducer partition. The reducer sees all records for a given key together and performs the join. A secondary sort ensures that records from one dataset (e.g., a "profile" record) arrive before records from the other dataset (e.g., "activity events"), so the reducer can hold the profile record in memory while iterating through activity events without buffering everything.
Cost: All data is shuffled over the network and sorted. Expensive for large datasets but correct and scalable with no assumptions about data layout.
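The sort-merge mechanics can be sketched in plain Python on a single machine. This is a toy model, not MapReduce code: the `sorted` call stands in for the distributed shuffle/sort, the `PROFILE`/`EVENT` tags implement the secondary sort, and the record shapes (`(user_id, country)` and `(user_id, url)`) are assumptions chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter

PROFILE, EVENT = 0, 1  # secondary-sort tags: profiles sort before events


def sort_merge_join(profiles, events):
    """Inner-join (user_id, country) profiles with (user_id, url) events."""
    # "Map" phase: tag every record with its source dataset.
    tagged = [(user_id, PROFILE, country) for user_id, country in profiles]
    tagged += [(user_id, EVENT, url) for user_id, url in events]

    # Stand-in for shuffle/sort: order by join key, then by tag.
    tagged.sort(key=itemgetter(0, 1))

    # "Reduce" phase: each group holds all records for one join key.
    for _user_id, group in groupby(tagged, key=itemgetter(0)):
        country = None
        for _, tag, value in group:
            if tag == PROFILE:
                country = value          # held in memory for this key only
            elif country is not None:
                yield (value, country)   # (url, viewer_country)
```

Because the profile record arrives first for each key, the reducer keeps only one small value in memory while streaming through that key's events.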
**Strategy 2 — Broadcast Hash Join (map-side join, small table)**
Use when: one dataset is small enough to fit entirely in memory on each mapper (typically no more than a few GB).
How it works: Each mapper loads the small dataset into an in-memory hash table at startup (the small dataset is "broadcast" to all mappers). The mapper then scans the large dataset record by record, looking up each record's join key in the hash table. No reducer is needed; no shuffle occurs. This is the fastest possible join when the small table fits in memory.
Cost: Requires the small table to fit in memory. The small table is replicated to every mapper — memory pressure scales with the number of map tasks running concurrently.
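A minimal single-process sketch of the same idea, assuming the small table is a list of `(user_id, country)` pairs with unique keys and each mapper sees one partition of `(user_id, url)` events (both shapes are illustrative assumptions):

```python
def broadcast_hash_join(large_partition, small_table):
    """Map-side inner join: probe an in-memory hash table per record."""
    lookup = dict(small_table)  # built once per mapper from the broadcast copy
    for user_id, url in large_partition:
        country = lookup.get(user_id)
        if country is not None:
            yield (url, country)
```

Each mapper would run this function independently on its own partition of the large dataset, which is why no shuffle is involved.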
**Strategy 3 — Partitioned Hash Join (map-side join, co-partitioned tables)**
Use when: both datasets are large BUT are already partitioned by the same key, using the same hash function, into the same number of partitions (e.g., both output from prior MapReduce jobs with the same partitioning scheme).
How it works: Because both datasets are partitioned identically, the mapper for partition N of one dataset only needs partition N of the other. Each mapper loads its partition of the smaller dataset into an in-memory hash table and scans the corresponding partition of the larger dataset. No shuffle or reducer is needed.
Cost: Requires both datasets to be co-partitioned — a strong prerequisite. This metadata (partitioning key, function, partition count) must be known and consistent between datasets.
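The co-partitioning prerequisite can be illustrated with a small sketch. The helper names are hypothetical, and `zlib.crc32` is used only as an example of a hash function that is stable across processes (Python's built-in `hash` for strings is randomized per process, so it would not satisfy the "same hash function" requirement).

```python
import zlib


def partition_of(key: str, num_partitions: int) -> int:
    """Stable partition assignment shared by both datasets."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_dataset(records, num_partitions):
    """Split (key, value) records into partitions by hashed key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


def partitioned_hash_join(left_parts, right_parts):
    """Join partition N of the left side with partition N of the right side."""
    if len(left_parts) != len(right_parts):
        raise ValueError("datasets must have the same number of partitions")
    for left, right in zip(left_parts, right_parts):
        lookup = {}
        for key, value in right:          # only one partition is held in memory
            lookup.setdefault(key, []).append(value)
        for key, value in left:
            for match in lookup.get(key, []):
                yield (key, value, match)
```

If either side had been partitioned with a different function or partition count, matching keys could land in different partition numbers and the join would silently miss them, which is why the partitioning metadata has to be tracked.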
**Handling skew in joins:**
If a join key has extreme hot spots (a "celebrity" user ID with millions of events, a viral content ID), a standard sort-merge join will send all records with that key to a single reducer, causing a straggler that blocks the entire pipeline.
Mitigation options:
- **Skewed join (Pig):** Run a sampling job first to identify hot keys. Route hot key records to multiple reducers randomly; replicate the other dataset's records for that key to all those reducers.
- **Sharded join (Crunch):** Specify hot keys explicitly; framework handles replication automatically.
- **Two-stage aggregation:** For GROUP BY on hot keys, first reduce to a partial aggregate per reducer (reducing volume), then combine across reducers.
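The first mitigation can be sketched as key salting: records for a hot key get a random salt on the large side, and the matching record on the other side is replicated once per salt value. This is a single-process illustration with hypothetical function names, not Pig's implementation; the fixed seed is there only to keep the sketch deterministic.

```python
import random


def salt_large_side(records, hot_keys, fanout, seed=0):
    """Spread hot-key records across `fanout` composite keys."""
    rng = random.Random(seed)
    for key, value in records:
        salt = rng.randrange(fanout) if key in hot_keys else 0
        yield ((key, salt), value)


def replicate_small_side(records, hot_keys, fanout):
    """Copy hot-key records to every salt value so each reducer can join."""
    for key, value in records:
        salts = range(fanout) if key in hot_keys else (0,)
        for salt in salts:
            yield ((key, salt), value)


def skewed_join(events, profiles, hot_keys, fanout=4):
    """Inner join on the composite (key, salt); assumes unique profile keys."""
    lookup = dict(replicate_small_side(profiles, hot_keys, fanout))
    for salted_key, url in salt_large_side(events, hot_keys, fanout):
        if salted_key in lookup:
            yield (salted_key[0], url, lookup[salted_key])
```

In a distributed run the composite `(key, salt)` would be the shuffle key, so the hot key's records land on up to `fanout` reducers instead of one, at the cost of replicating the other side's record `fanout` times.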
Mark Step 3 complete in TodoWrite.
---
### Step 4: Design the Multi-Step Workflow
**ACTION:** Decompose the full pipeline into stages, specify the data flow between stages, and decide how intermediate state is managed.
**WHY:** A single MapReduce or dataflow job can only solve a limited range of problems. Complex pipelines require chaining multiple jobs — for example, a recommendation system may use 50-100 MapReduce jobs. The key architectural decision is how intermediate results are passed between stages: full materialization to the distributed filesystem (MapReduce style) vs. streaming through memory/local disk (dataflow engine style). This choice determines both performance and fault recovery behavior.
**Workflow structure decisions:**
1. **Identify job dependencies:** Which jobs produce the input for which other jobs? Draw this as a directed acyclic graph (DAG). A job can only start once all its upstream dependencies have completed successfully — partial output from failed jobs is discarded.
2. **Intermediate state strategy:**
- **Full materialization (MapReduce):** Each job writes output to a named directory in HDFS; the next job reads from that directory. The intermediate data is durable, replicated, and independently inspectable. Cost: extra I/O, extra network replication, straggler delays (must wait for all tasks in a stage before next stage starts). Benefit: any job can be restarted independently; output can be read by multiple downstream jobs; easy debugging (inspect intermediate files).
- **Pipelining (Spark/Flink):** Intermediate state between operators is kept in memory or local disk. Operators start as soon as their upstream operator has produced some output — no waiting for the entire stage to complete. Cost: if a node fails, the operator's output must be recomputed from the last checkpoint or original input. Benefit: much lower latency for multi-stage pipelines.
3. **Workflow scheduler:** For pipelines made up of many separate jobs (typical for MapReduce), use a workflow scheduler (Apache Airflow, Luigi, Azkaban, Oozie) to manage job dependencies, retries, backfills, and monitoring. Dataflow engines track dependencies between operators within a single job internally, but recurring runs and dependencies across jobs still call for a scheduler.
4. **Unix philosophy applied:** Design each pipeline stage to do one thing well, read from a well-defined input format, and write to a well-defined output format. Inputs are treated as immutable — a stage never modifies its input. This means any stage can be rerun safely, old outputs can be kept in parallel directories while new outputs are being computed, and rollbacks are possible by switching back to a previous output directory.
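The dependency rule from item 1 can be sketched with Python's standard-library `graphlib` (Python 3.9+). The job names are made up for illustration; the function groups jobs into "waves" where every job in a wave has all of its upstream jobs completed in an earlier wave.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """dependencies maps each job to the set of upstream jobs it reads from.

    Returns a list of waves; jobs within one wave can run in parallel.
    Raises graphlib.CycleError if the dependencies are not a DAG.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all upstream jobs have finished
        waves.append(ready)
        sorter.done(*ready)                 # mark as completed successfully
    return waves
```

A workflow scheduler does essentially this bookkeeping, plus retries and monitoring; the sketch only shows the ordering constraint.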
Mark Step 4 complete in TodoWrite.
---
### Step 5: Assess Graph Processing Needs (if applicable)
**ACTION:** If the pipeline operates on graph-structured data (social graphs, web link graphs, dependency graphs), determine whether the Pregel/bulk synchronous parallel model is appropriate.
**WHY:** Many graph algorithms are iterative — they traverse edges, propagate information, and repeat until convergence (PageRank, shortest path, connected components, transitive closure). Implementing these in plain MapReduce is inefficient because MapReduce reads the entire input dataset and writes the entire output dataset on every iteration, even if only a small part of the graph changed. The Pregel model (also called bulk synchronous parallel, or BSP) addresses this by keeping vertex state in memory across iterations and only processing vertices that have received new messages.
**How Pregel works:**
- The graph is stored in a distributed filesystem as vertex and edge lists.
- Each iteration (called a "superstep"): the framework calls a function for each vertex, passing it all messages sent to that vertex in the previous superstep. The vertex updates its local state and sends messages to neighboring vertices along graph edges.
- Between supersteps, the framework delivers all messages: every message sent in superstep N is guaranteed to arrive at its destination vertex in superstep N+1.
- A vertex deactivates itself by voting to halt and is reactivated if a message arrives for it in a later superstep. The algorithm completes when all vertices are inactive and no messages are in transit.
- Vertices communicate only by message passing, not by querying each other directly — this enables batching and provides a clean fault tolerance boundary.
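The superstep loop described above can be sketched in a few lines for single-source shortest paths. This is a single-process toy, not Giraph or GraphX code: the `inbox`/`outbox` dictionaries stand in for the framework's message delivery, and the adjacency-list input shape and non-negative edge weights are assumptions.

```python
import math


def pregel_shortest_paths(edges, source):
    """edges: {vertex: [(neighbor, weight), ...]} -> {vertex: distance}."""
    vertices = set(edges) | {n for outs in edges.values() for n, _ in outs}
    distance = {v: math.inf for v in vertices}  # per-vertex state kept in memory

    inbox = {source: [0]}  # superstep 0: only the source has a message
    while inbox:           # stop once no vertex has pending messages
        outbox = {}
        for vertex, messages in inbox.items():  # only messaged vertices run
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best
                for neighbor, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(best + weight)
            # otherwise the vertex sends nothing, i.e. it votes to halt
        inbox = outbox     # barrier: delivered at the start of the next superstep
    return distance
```

Vertices that receive no messages do no work in a superstep, which is the saving over rereading the whole graph on every MapReduce iteration.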
**Fault tolerance in Pregel:** The framework periodically checkpoints the state of all vertices to durable storage at superstep boundaries. If a node fails, the computation rolls back to the last checkpoint and resumes from there. If the algorithm is deterministic and messages are logged, it is also possible to recover only the lost partition instead of rolling back the whole computation.
**When Pregel is appropriate:**
- Graph is too large to fit in memory on a single machine
- Algorithm requires many iterations over the graph structure
- Examples: PageRank, single-source shortest path, strongly connected components, community detection
**When Pregel is NOT appropriate:**
- Graph fits in memory on a single machine: a single-machine (even single-threaded) algorithm will likely outperform a distributed one, because it avoids cross-machine communication overhead
- Graph fits on a single machine's disks — frameworks like GraphChi enable single-machine processing of graphs larger than RAM
- Algorithm is not naturally expressed as vertex-centric message passing
**Implementations:** Apache Giraph, Spark GraphX, Flink Gelly API.
**SKIP THIS STEP** if the pipeline does not operate on graph-structured data.
Mark Step 5 complete in TodoWrite.
---
### Step 6: Design the Fault Tolerance Strategy
**ACTION:** Specify how the pipeline recovers from task failures, job failures, and node failures.
**WHY:** Batch jobs process large datasets over long runtimes, making failures statistically inevitable. A pipeline that cannot recover from partial failures forces full restarts — wasting hours of compute. The key insight is that fault tolerance in batch processing is much simpler than in online systems because inputs are immutable: a failed task can always be retried on the same input without risk of double-writes or corrupted state. The challenge is minimizing the work lost when a failure occurs.
**Fault tolerance mechanisms:**
1. **Task-level retry (MapReduce and dataflow engines):** If an individual map or reduce task fails, the framework automatically retries it on another machine using the same input. This is safe because inputs are immutable and task outputs from failed attempts are discarded — only the successful attempt's output counts. This handles transient failures (network glitches, temporary disk errors) transparently.
2. **Materialization to durable storage (MapReduce):** Because every job's output is written to HDFS before downstream jobs start, a failed job can be restarted at that job's boundary without recomputing upstream jobs. For a 50-job pipeline, a failure in job 35 requires rerunning only jobs 35-50, not the entire pipeline.
3. **Lineage-based recomputation (Spark):** Spark tracks the lineage of each Resilient Distributed Dataset — which input partitions were used and which operators were applied. If a partition is lost due to node failure, it is recomputed from the lineage graph using the original input data. No intermediate state needs to be persisted to HDFS. Trade-off: recomputation is expensive if the lineage chain is long; checkpoint RDDs periodically for pipelines with many transformation stages.
4. **Operator checkpointing (Flink):** Flink periodically checkpoints operator state, allowing recovery to the last checkpoint without recomputing from original inputs. Particularly important for stateful operators (running aggregations, graph iteration state).
5. **Determinism requirement:** Recomputation is only correct if operators are deterministic — given the same input, they produce the same output. Non-deterministic behaviors to avoid: using system clocks in computations, iterating over hash tables (ordering is not guaranteed in most languages), using random numbers without a fixed seed, reading from external mutable sources during the job.
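The restart boundary from the materialization mechanism (item 2) can be computed from the job DAG: the failed job and everything downstream of it must rerun, while upstream outputs are reused. A minimal sketch, assuming a hypothetical `downstream` mapping from each job to the jobs that consume its output:

```python
from collections import deque


def jobs_to_rerun(downstream, failed_job):
    """Return the failed job plus every job that transitively reads from it."""
    to_rerun = {failed_job}
    queue = deque([failed_job])
    while queue:
        for consumer in downstream.get(queue.popleft(), ()):
            if consumer not in to_rerun:
                to_rerun.add(consumer)
                queue.append(consumer)
    return to_rerun
```

For a linear 50-job chain, `jobs_to_rerun({i: [i + 1] for i in range(1, 50)}, 35)` yields jobs 35 through 50, matching the example in item 2.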
**Selecting the fault tolerance strategy:**
| Scenario | Recommendation |
|----------|----------------|
| Job runtime < 30 minutes | In-memory recomputation (Spark lineage); full materialization is overkill |
| Job runtime > 1 hour with many stages | Checkpoint intermediate state periodically; full restart from scratch is too expensive |
| Multi-team pipeline (other teams read intermediate outputs) | Full materialization to named HDFS paths; allows independent debugging and reuse |
| High preemption rate (shared cluster, low-priority batch) | MapReduce-style full materialization; task-level retry handles frequent kills efficiently |
| Iterative graph algorithm | Pregel checkpoint at each superstep boundary |
Mark Step 6 complete in TodoWrite.
---
### Step 7: Specify Output Destination and Loading Strategy
**ACTION:** Define where the pipeline writes its final output and how that output is safely loaded for serving.
**WHY:** Writing pipeline output directly to a production database record-by-record is an anti-pattern: it is orders of magnitude slower than batch throughput, it overwhelms the database with parallel writes from hundreds of tasks, and it introduces partial-write visibility (consumers may read incomplete results mid-job). The correct pattern is to build the output as immutable files inside the batch job, then load them atomically.
**Output patterns:**
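1. **Distributed filesystem (HDFS, S3, GCS):** Write output to a new directory and make it visible only when the job succeeds. On HDFS, an atomic rename at the end achieves this; object stores such as S3 have no native atomic directory rename, so publish through a success marker or manifest that consumers check before reading. Downstream jobs read the directory by name. This enables rollback (keep the previous directory; switch back if the new output is wrong) and supports multiple consumers. Preferred for intermediate pipeline outputs and for outputs consumed by other batch jobs.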
2. **Search index build:** Mappers partition documents by shard, each reducer builds the index for its partition, and index files are written to the distributed filesystem. Index files are then atomically swapped into the search cluster (Lucene/Solr, Elasticsearch). The batch job produces a complete, correct index; the serving layer does a single atomic switch.
3. **Key-value store bulk load:** Build the key-value store files inside the batch job (the mapper extracts keys and sorts by key, which is already most of the work for building an index). Write the database files to the distributed filesystem. Load them into the serving system (Voldemort, HBase, Terrapin, ElephantDB) via bulk load, which copies files and atomically switches the server to query the new files. The old files remain available for rollback.
4. **Anti-pattern — writing to a live database per-record:** Do not use a database client library inside a mapper or reducer to write records one at a time to a production database. Reasons: (a) network round-trip per record kills throughput, (b) parallel tasks overwhelm the database's write capacity, (c) job failure leaves partial results visible to consumers, breaking the all-or-nothing guarantee.
**The all-or-nothing guarantee:** A batch job's output is only visible when the job completes successfully. If the job fails, no output is produced. This property is what makes batch outputs safe to depend on. Preserving it for external outputs requires writing to a temporary location and atomically promoting the output on success.
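The write-then-promote pattern can be sketched on a local POSIX filesystem, where `os.replace` is atomic within one filesystem. The function name and the line-per-record output format are assumptions for illustration; distributed filesystems and object stores need their own commit mechanism.

```python
import os
import tempfile
from pathlib import Path


def publish_atomically(records, final_path):
    """Write records to a temp file, then promote it only on success."""
    final_path = Path(final_path)
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the target directory so the rename stays on
    # one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            for record in records:
                tmp.write(record + "\n")
        os.replace(tmp_name, final_path)  # readers see old or new, never partial
    except BaseException:
        os.unlink(tmp_name)               # failed attempt leaves no output
        raise
```

If producing the records fails midway, the previous file at `final_path` is left untouched and the temporary file is removed.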
Mark Step 7 complete in TodoWrite.
---
### Step 8: Produce the Pipeline Architecture Document
**ACTION:** Write a structured pipeline architecture document covering all decisions from Steps 1-7.
**WHY:** The architecture document captures not just what the pipeline does but why each design decision was made. This is essential for future engineers who will modify the pipeline — knowing the reasoning prevents undoing sound decisions. It also serves as the basis for the implementation plan.
**HANDOFF TO HUMAN:** The agent produces the architecture document; the human reviews trade-offs, confirms constraints, and proceeds to implementation.
**Output format:**
```markdown
# Batch Pipeline Architecture: [Pipeline Name]
## Workload Classification
- Type: Batch (bounded input, periodic execution)
- Input: [datasets, sizes, formats]
- Output: [destination, format, consumers]
- Latency tolerance: [acceptable job duration]
## Processing Engine
- Selected: [MapReduce / Spark / Flink / Tez]
- Rationale: [key factors from decision framework]
## Pipeline Stages (DAG)
[Stage 1] → [Stage 2] → [Stage 3] → [Output]
### Stage N: [Name]
- Input: [source dataset or upstream stage output]
- Transformation: [what this stage does]
- Output: [format, partitioning key]
- WHY this is a separate stage: [reason]
## Join Strategy
- Join type: [sort-merge / broadcast hash / partitioned hash]
- Datasets joined: [A (size) × B (size)]
- Selection rationale: [why this strategy fits the data characteristics]
- Skew handling: [if applicable]
## Intermediate State Management
- Strategy: [full materialization / pipelining / hybrid]
- Checkpoint points: [where state is durably saved]
## Graph Processing (if applicable)
- Model: [Pregel/BSP, implementation]
- Algorithm: [what graph computation is being performed]
- Iteration convergence: [how the algorithm knows when to stop]
## Fault Tolerance
- Task failure: [retry policy]
- Job failure: [restart boundary — which stages must rerun]
- Determinism: [operator determinism guarantees]
## Output Loading
- Pattern: [filesystem / search index build / key-value bulk load]
- Atomicity mechanism: [how partial output is prevented from being visible]
- Rollback strategy: [how to revert to previous output if new output is wrong]
## Trade-offs and Risks
- [Key trade-off 1]: [why this decision was made, what is given up]
- [Risk 1]: [failure mode and mitigation]
```
Mark Step 8 complete in TodoWrite.
## Inputs
- **Data source descriptions:** Dataset names, sizes, formats, update frequency
- **Required transformations:** Join conditions, aggregations, filters, enrichment steps
- **Output destination:** Where results go, how they will be consumed
- **Latency tolerance:** Acceptable job completion time
- **Infrastructure constraints:** Existing cluster type, available tools, team expertise
## Outputs
- **Pipeline architecture document** — engine choice, DAG of stages, join strategies, fault tolerance, output loading pattern
- **Join strategy selection** — which of the three join algorithms applies and why
- **Engine recommendation** — MapReduce vs Spark/Flink/Tez with decision rationale
- **Fault tolerance design** — materialization vs recomputation checkpointing strategy
## Key Principles
- **Co-locate the data a join needs instead of looking it up remotely**: the reason for copying datasets into the distributed filesystem before joining is that random remote lookups (querying a database per record) cost orders of magnitude more than local reads. MapReduce handles all network communication; the application code only sees local data. This separation also shields application code from partial failures.
- **Treat inputs as immutable and outputs as complete-or-nothing** — immutable inputs make retries safe (a failed task reruns on the same input, cannot corrupt it). Complete-or-nothing outputs make pipelines composable and debuggable (a job either produced a correct output or produced no output; there is no third state). These two properties together enable fault tolerance without distributed transactions.
- **Materialization is the trade-off between durability and performance** — fully materializing intermediate state to HDFS maximizes durability and restartability at the cost of extra I/O. Keeping state in memory maximizes performance at the cost of recomputation on failure. The right balance depends on job duration, cluster reliability, and whether intermediate outputs have other consumers.
- **Join strategy depends on data layout, not just data size** — the fastest join is not always the one that handles the largest data; it is the one that matches the data's existing partitioning. A broadcast hash join is fastest when one table is small. A partitioned hash join is fastest when both tables are co-partitioned. A sort-merge join is always correct but always pays the shuffle cost. Knowing what assumptions you can make about data layout unlocks faster strategies.
- **Skew breaks the "work is evenly distributed" assumption** — most batch processing performance analysis assumes uniform key distribution. Hot keys (celebrities, viral content, common stop words) concentrate work on a single reducer, making that reducer the bottleneck for the entire job. Identify potential hot keys during design and plan skew handling explicitly.
- **Graph algorithms on distributed systems pay a high communication cost** — distributed graph processing sends messages across the network for every edge crossing a machine boundary. If a graph fits on one machine, a single-machine algorithm almost always outperforms a distributed one. Use distributed graph processing (Pregel) only when the graph genuinely cannot fit on a single machine or disk.
- **Write output as files, bulk-load atomically into serving systems** — building the serving database inside the batch job (as sorted, indexed files on the distributed filesystem) and loading it atomically produces correct, rollback-able results. Writing per-record to a live database during the job trades away the all-or-nothing guarantee for implementation convenience, which is a bad trade.
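The complete-or-nothing output principle can be sketched on a local filesystem. This is a minimal illustration, assuming a POSIX-like filesystem where `os.replace` is atomic within one volume. The function name `publish_atomically` and the file name are hypothetical, and distributed filesystems and object stores have their own commit mechanisms.

```python
import json
import os
import tempfile


def publish_atomically(records, final_path):
    """Write all records to a temp file, then rename it into place.

    Readers of final_path see either the previous complete output or the
    new complete output, never a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    # Keep the temp file in the same directory so the rename stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            for record in records:
                tmp.write(json.dumps(record) + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, final_path)  # atomic swap into the final name
    except BaseException:
        # A failed attempt leaves no new output behind, so a retry starts clean.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def failing_records():
    """Simulates a task that dies after emitting part of its output."""
    yield {"url": "/a", "count": 1}
    raise RuntimeError("simulated task failure")


workdir = tempfile.mkdtemp()
output = os.path.join(workdir, "top_urls.jsonl")

publish_atomically([{"url": "/a", "count": 3}, {"url": "/b", "count": 1}], output)
with open(output) as f:
    published = [json.loads(line) for line in f]

try:
    publish_atomically(failing_records(), output)  # fails midway
except RuntimeError:
    pass
with open(output) as f:
    after_failure = [json.loads(line) for line in f]
```

After the failed second run, `after_failure` still equals `published`: the earlier complete output is untouched and no partial file is left in the directory.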
## Examples
**Scenario: Web server log analysis — top URLs by traffic**
Trigger: "We have 500 GB of daily nginx access logs in S3. We need a daily report of the top 1000 most-requested URLs."
Process:
1. Workload: batch — daily job, logs are a bounded daily dump, 24-hour latency acceptable.
2. Engine: Spark — the job is simple enough for plain MapReduce, but the team already has Spark infrastructure, so there is no reason to write raw MapReduce.
3. Stage 1 (map): Parse log lines, extract URL field. Emit (url, 1) pairs.
4. Stage 2 (reduce): Group by URL, sum counts. Emit (url, count).
5. Stage 3 (sort): Sort by count descending, take top 1000.
6. No joins required — single dataset.
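7. Output: Write the top-1000 list to S3 as a single JSON object. S3 has no atomic rename, but a single-object PUT becomes visible only once it completes, so readers never see a partial file.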
8. Fault tolerance: Spark lineage — job is short enough (< 30 min) that full recomputation from S3 is acceptable if a task fails.
Output: Architecture doc specifying a 3-operator Spark pipeline with no joins, output to S3 JSON, daily schedule via Airflow.
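The three operators in this scenario (extract, count, top-N) can be sketched in plain Python on a handful of lines. This is a single-process illustration of the logic only, not Spark code. The sample log lines and the helper names `extract_url` and `top_urls` are made up for the example, and the parsing assumes the common nginx "combined" layout where the request is the first quoted field.

```python
import heapq
from collections import Counter

LOG_LINES = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /about HTTP/1.1" 200 128',
    '10.0.0.3 - - [01/Jan/2024:00:00:03 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:00:00:04 +0000] "GET /contact HTTP/1.1" 200 64',
    '10.0.0.4 - - [01/Jan/2024:00:00:05 +0000] "GET /home HTTP/1.1" 304 0',
    '10.0.0.5 - - [01/Jan/2024:00:00:06 +0000] "GET /about HTTP/1.1" 200 128',
    'garbage line without a request field',
]


def extract_url(line):
    """Map step: pull the request path out of one access-log line."""
    try:
        request = line.split('"')[1]   # e.g. 'GET /home HTTP/1.1'
        return request.split()[1]
    except IndexError:
        return None                    # malformed line; skip it


def top_urls(lines, n):
    """Reduce step (count per URL) followed by the top-N selection."""
    counts = Counter()
    for line in lines:
        url = extract_url(line)
        if url is not None:
            counts[url] += 1
    # nlargest keeps only n candidates instead of sorting every URL.
    return heapq.nlargest(n, counts.items(), key=lambda item: item[1])


result = top_urls(LOG_LINES, 2)
```

In a distributed engine the counting step is where the shuffle happens; the top-N step only needs each partition's local top N, which keeps the final merge small.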
---
**Scenario: User activity enrichment — joining events with profile database**
Trigger: "We have 2 TB of daily clickstream events (user_id, timestamp, url) and a 50 GB user profile database snapshot (user_id, age, country). We need to produce a dataset of (url, viewer_age, viewer_country) for analytics."
Process:
1. Workload: batch — daily job, both datasets are fixed snapshots, latency up to 6 hours acceptable.
2. Engine: Spark — multi-stage workflow benefits from dataflow execution.
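3. Join strategy: broadcast hash join — the 2 TB clickstream is the large dataset and the user profile snapshot is the small one. Broadcast only the three needed columns (user_id, age, country), and confirm the projected table fits in each executor's memory; the raw 50 GB snapshot would not fit the 1-4 GB per-task budget typical for broadcast joins. Each executor loads the profile hash table at startup, scans clickstream records, and looks up the profile by user_id. No shuffle required.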
4. Skew check: if any user_id has pathologically many events (bot traffic), filter or handle separately — but for typical users, assume uniform distribution.
5. Stage 1: Load both datasets. Broadcast profile table.
6. Stage 2 (map-side join): For each clickstream record, look up user profile by user_id. Emit (url, age, country).
7. Stage 3: Write output partitioned by date to HDFS.
8. Output: HDFS directory, Parquet format, read by Hive/Spark SQL for analytics queries.
9. Fault tolerance: Spark lineage; profile table is fully in memory per executor so recomputation is fast.
Output: Architecture doc specifying broadcast hash join, Spark execution, Parquet output to HDFS, with note that if profile table grows beyond 200 GB, re-evaluate to sort-merge join.
---
**Scenario: PageRank computation on a web link graph**
Trigger: "We have a 10 TB graph of web pages and hyperlinks (300 billion edges). We need to compute PageRank scores for all pages to power search result ranking."
Process:
1. Workload: batch — graph is a periodic snapshot, computation may take hours to days, latency not critical.
2. Engine: Spark GraphX (implements Pregel/BSP model) — algorithm is iterative and graph-structured; plain MapReduce would re-read 10 TB per iteration.
3. Graph processing: Pregel model. Each vertex holds its current PageRank score. Each superstep: vertex sends its score divided by out-degree to all neighbors. Vertex receives scores from all incoming neighbors and updates its own score as weighted sum. Repeat until convergence (change in scores below threshold, typically 0.001).
4. Fault tolerance: Checkpoint vertex state to HDFS every 5 supersteps (configurable). On node failure, roll back to last checkpoint and resume. Algorithm is deterministic, so checkpointing + rollback is safe.
5. Skew: in the Pregel model, vertices push messages along their outgoing edges, so high in-degree "celebrity" pages receive many messages per superstep. This is handled by the framework's partitioning and message delivery, not by skew join handling.
6. Single-machine feasibility check: 10 TB graph — does not fit on a single machine. Distributed processing is required.
7. Output: Write (page_id, pagerank_score) to HDFS. Bulk-load into serving key-value store for search ranking lookups.
Output: Architecture doc specifying GraphX Pregel implementation, 5-superstep checkpoint interval, bulk load to serving store, with note to evaluate GraphChi if graph shrinks to single-machine scale.
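The superstep loop described in step 3 can be sketched in single-process Python. This is a toy illustration of the message-passing structure, not GraphX code. The damping factor of 0.85, the function name `pagerank_pregel`, and the four-edge graph are assumptions for the example, and the sketch assumes every vertex has at least one outgoing edge.

```python
def pagerank_pregel(out_edges, damping=0.85, tolerance=0.001, max_supersteps=100):
    """Run PageRank as a sequence of supersteps.

    out_edges maps each vertex to the list of vertices it links to.
    Assumes every vertex has at least one outgoing edge.
    """
    vertices = list(out_edges)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}

    for superstep in range(1, max_supersteps + 1):
        # Send phase: each vertex sends rank / out-degree to every neighbor.
        inbox = {v: [] for v in vertices}
        for v in vertices:
            share = rank[v] / len(out_edges[v])
            for target in out_edges[v]:
                inbox[target].append(share)

        # Receive phase: each vertex combines its incoming messages.
        new_rank = {
            v: (1 - damping) / n + damping * sum(inbox[v]) for v in vertices
        }

        # Convergence check: stop once no score moved by more than tolerance.
        delta = max(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if delta < tolerance:
            break

    return rank, superstep


GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores, supersteps_used = pagerank_pregel(GRAPH)
```

In a real deployment the `inbox` dictionary is what crosses the network: every edge whose endpoints live on different machines becomes a message, which is why the communication cost dominates.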
## References
- For detailed join algorithm decision criteria and size thresholds, see [join-strategy-reference.md](references/join-strategy-reference.md)
- For MapReduce vs dataflow engine comparison with benchmark data, see [engine-comparison.md](references/engine-comparison.md)
- For skew handling patterns (skewed join, sharded join, two-stage aggregation), see [join-strategy-reference.md](references/join-strategy-reference.md)
- Cross-reference: for stream processing (unbounded input), see `stream-processing-designer`
- Cross-reference: for workload classification (batch vs OLTP vs OLAP), see `oltp-olap-workload-classifier`
- Source: *Designing Data-Intensive Applications*, Martin Kleppmann, Chapter 10 (pages 389-430)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — *Designing Data-Intensive Applications* by Martin Kleppmann.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-oltp-olap-workload-classifier`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/engine-comparison.md
# Batch Processing Engine Comparison
Detailed comparison of MapReduce, Spark, Flink, and Tez for batch pipeline selection.
## The Core Distinction: Materialization vs Pipelining
The most important difference between MapReduce and dataflow engines is how they handle data flow between stages.
**MapReduce — full materialization:**
- Every job writes all output to HDFS before the next job starts.
- The next job reads from HDFS.
- Intermediate data is replicated across machines (durability overhead).
- Each stage waits for 100% of the previous stage to complete before starting.
- Straggler tasks (slow tasks in a stage) delay the entire pipeline.
**Dataflow engines (Spark, Flink, Tez) — pipelining:**
- Intermediate state between operators is kept in memory or written to local disk.
- Operators start processing as soon as upstream operators produce output.
- No HDFS replication of intermediate data.
- No waiting for full stage completion before downstream starts.
- Fault recovery requires recomputation (Spark lineage) or checkpoint rollback (Flink).
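The difference between the two execution styles can be shown with Python generators standing in for operators. This is only an analogy under that assumption: `parse` and `keep_non_empty` are hypothetical operators, and the log lists exist purely to record the order in which work happens.

```python
def parse(lines, log):
    """Upstream operator: normalize each raw line."""
    for line in lines:
        log.append("parse")
        yield line.strip().lower()


def keep_non_empty(records, log):
    """Downstream operator: drop empty records."""
    for record in records:
        log.append("filter")
        if record:
            yield record


RAW = ["GET /a\n", "\n", "GET /b\n"]

# Full materialization: stage 1 finishes and stores all output before stage 2 starts.
materialized_log = []
stage1_output = list(parse(RAW, materialized_log))
materialized = list(keep_non_empty(stage1_output, materialized_log))

# Pipelining: the downstream operator consumes each record as soon as it is produced.
pipelined_log = []
pipelined = list(keep_non_empty(parse(RAW, pipelined_log), pipelined_log))
```

Both runs produce the same records. The materialized log reads parse, parse, parse, filter, filter, filter, while the pipelined log alternates parse and filter, with no intermediate list ever held in full.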
---
## MapReduce
**Strengths:**
- Extremely robust — designed to run on preemptible, low-priority compute alongside production workloads.
- Task-level fault tolerance: failed tasks are retried on the same input; output from failed attempts is discarded.
- Simple programming model — mapper and reducer callbacks are the entire API.
- Mature tooling, extensive operational experience.
- Correct for any batch workload; no assumptions about data layout or memory.
**Weaknesses:**
- Multi-stage pipelines require full materialization of each stage to HDFS — expensive for complex workflows.
- Sorting is always performed between map and reduce, even when not needed.
- A new JVM process is launched per task — startup overhead adds up for short tasks.
- Cannot express arbitrary operator graphs; only map-then-reduce topology (mitigated by chaining jobs).
**When MapReduce is the right choice:**
- Cluster runs mixed workloads with high task preemption rates (Google-style environment).
- Job must tolerate very frequent task kills without any state loss.
- Simple 1-3 stage pipelines where materialization overhead is not significant.
- Existing Hadoop-only environment with no Spark/Flink infrastructure.
- Compliance/audit requirements demand fully durable intermediate state.
**Typical performance vs Spark:** 2-10x slower for multi-stage pipelines. Comparable for single-stage jobs.
---
## Apache Spark
**Strengths:**
- Significantly faster than MapReduce for multi-stage pipelines — intermediate data stays in memory or local disk.
- Resilient Distributed Dataset (RDD) abstraction tracks data lineage for fault recovery without HDFS materialization.
- Rich API: batch (RDD, DataFrame), SQL (Spark SQL), streaming (Structured Streaming), ML (MLlib), graph (GraphX).
- Spark SQL optimizer (Catalyst) automatically selects join algorithms (broadcast, sort-merge) based on table statistics.
- Strong ecosystem: PySpark, Scala, Java, R APIs; deep integration with Delta Lake, Iceberg, Hudi.
**Weaknesses:**
- Memory-intensive: keeping intermediate state in memory requires careful memory tuning.
- Long lineage chains accumulate recomputation cost on failure — checkpoint RDDs for very long pipelines.
- Non-deterministic operations (system time, Python hash randomization, external reads) make lineage-based recovery unsafe without explicit handling.
- Higher operational complexity than MapReduce.
**When Spark is the right choice:**
- Multi-stage pipelines (4+ stages) where MapReduce would materialize to HDFS between each.
- Machine learning pipelines (MLlib integration).
- Iterative algorithms (graph processing via GraphX, ML training).
- Mixed batch + near-real-time processing (Structured Streaming runs the same code on streams).
- Teams already familiar with Spark; rich Python (PySpark) ecosystem.
**Fault tolerance model:** Lineage tracking. Each RDD/DataFrame partition knows its lineage — which operators and input partitions produced it. On node failure, lost partitions are recomputed from lineage. For very long lineage chains (many transformation stages), periodically call `rdd.checkpoint()` to materialize to HDFS and truncate the lineage.
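Lineage-based recovery can be illustrated with a toy model. The `ToyDataset` class below is hypothetical and is not the Spark API; it assumes one-to-one (narrow) dependencies between partitions and deterministic operators, which is the case where recomputation is cheapest and safest.

```python
class ToyDataset:
    """A dataset that remembers how each partition was derived."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self.compute = compute    # recipe: (index, parent_rows) -> rows
        self.parent = parent      # lineage pointer to the upstream dataset
        self.cache = {}           # partitions currently held in memory
        self.computations = 0     # how many partitions were (re)computed

    def map(self, fn):
        """Derive a new dataset; nothing is computed until get() is called."""
        return ToyDataset(
            self.num_partitions,
            lambda index, rows: [fn(row) for row in rows],
            parent=self,
        )

    def get(self, index):
        """Return one partition, recomputing it from lineage if it is missing."""
        if index not in self.cache:
            parent_rows = self.parent.get(index) if self.parent else None
            self.cache[index] = self.compute(index, parent_rows)
            self.computations += 1
        return self.cache[index]


SOURCE_BLOCKS = [[1, 2, 3], [4, 5, 6]]
source = ToyDataset(2, lambda index, _rows: list(SOURCE_BLOCKS[index]))
squares = source.map(lambda x: x * x)

first_run = [squares.get(i) for i in range(squares.num_partitions)]
squares.cache.pop(1)          # simulate losing one partition on a failed node
recovered = squares.get(1)    # rebuilt from the parent partition, not from scratch
```

Only the lost partition is recomputed, and its parent partition is reused because it was still available. With a long chain of parents that are also lost, the same call would walk back through every ancestor, which is the cost that checkpointing truncates.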
**Key configuration levers:**
- `spark.executor.memory`: memory per executor — must accommodate working set of intermediate data.
- `spark.sql.autoBroadcastJoinThreshold`: maximum table size for automatic broadcast hash join (default 10 MB; increase if small table fits in memory).
- `spark.default.parallelism`: number of partitions for RDD operations.
---
## Apache Flink
**Strengths:**
- Lowest latency among dataflow engines — pipelined execution (Flink does not wait for a stage to complete before starting the next).
- Strongest streaming integration — same code runs on bounded (batch) and unbounded (stream) input; designed for eventual stream processing migration.
- Stateful operators with exactly-once guarantees via periodic checkpointing.
- Gelly API for graph processing (Pregel/BSP model).
- Lower memory footprint than Spark for many workloads (Flink manages memory outside JVM heap for serialized data).
**Weaknesses:**
- Smaller ecosystem than Spark; fewer pre-built ML libraries.
- More complex operational tuning (checkpoint configuration, state backend selection).
- Less SQL ecosystem maturity than Spark SQL (catching up with Flink SQL).
**When Flink is the right choice:**
- Batch pipeline that will evolve into a stream pipeline — same job graph works for both.
- Lowest latency batch execution is required (minimize time from job submission to output availability).
- Stateful batch computations (running aggregations, sessionization, iterative graph processing).
- Already running Flink for stream processing; use same cluster for batch.
**Fault tolerance model:** Periodic operator checkpointing. Flink snapshots the state of all operators to durable storage at configurable intervals. On node failure, Flink rolls back the entire job to the last checkpoint and resumes execution. Unlike Spark's lineage recomputation, Flink's recovery time is proportional to checkpoint interval, not lineage length.
**Key configuration levers:**
- `execution.checkpointing.interval`: how often to checkpoint (trade-off between recovery time and checkpoint overhead).
- `state.backend`: how operator state is held locally and snapshotted (RocksDB for large state, in-memory heap for small state).
- `taskmanager.memory.managed.fraction`: fraction of memory managed by Flink outside JVM heap.
---
## Apache Tez
**Strengths:**
- Thin optimization layer over YARN — minimal overhead, leverages existing YARN infrastructure.
- Enables Pig and Hive jobs to execute faster without rewriting job code (swap MapReduce execution engine for Tez via configuration).
- Good for organizations with large existing Pig/Hive codebases that need faster execution without migration.
**Weaknesses:**
- Does not provide its own high-level API — you use Tez through Pig, Hive, or Cascading.
- Relies on YARN shuffle service for data copying between nodes — less control over data transfer than Spark or Flink.
- Smaller community than Spark or Flink.
**When Tez is the right choice:**
- Existing Pig or Hive workflows need faster execution without code changes.
- Team is not ready to migrate to Spark/Flink but needs immediate performance improvement.
- Hadoop cluster with YARN is the only available infrastructure.
---
## High-Level Languages and Query Optimizers
In practice, engineers rarely write raw MapReduce, Spark, or Flink operator code for most batch workloads. High-level languages compile to the underlying engine and include query optimizers that automatically select the best execution plan.
| Language / Tool | Underlying engine | Notes |
|----------------|------------------|-------|
| Hive | MapReduce, Tez, or Spark | SQL-like; Hive query optimizer selects joins automatically |
| Pig | MapReduce or Tez | Dataflow scripting language |
| Spark SQL / DataFrames | Spark | Catalyst optimizer; column pruning, join reordering, predicate pushdown |
| Flink SQL | Flink | Growing SQL support; same optimizer principles |
| Cascading | MapReduce or Tez | Java API |
| Crunch | MapReduce | Java/Scala API |
The advantage of high-level declarative interfaces: you declare WHAT joins you need, and the query optimizer decides HOW to execute them (which join algorithm, in what order). Hive, Spark, and Flink all have cost-based optimizers that can even reorder joins to minimize intermediate state.
**Recommendation:** Use Spark SQL or Hive for most batch workloads unless fine-grained control over join algorithms or operator behavior is required. Reserve raw RDD/operator API for cases the SQL optimizer cannot handle (custom aggregations, iterative algorithms, graph processing).
---
## Performance Comparison Summary
For a representative 10-stage MapReduce pipeline on the same hardware:
| Engine | Relative speed | Primary reason |
|--------|---------------|----------------|
| MapReduce | 1x (baseline) | Full HDFS materialization between all stages |
| Tez | 2-4x | Eliminates unnecessary map stages; reduces HDFS I/O |
| Spark | 5-20x | In-memory intermediate state; pipelined execution; no sort unless needed |
| Flink | 5-20x | Pipelined execution; lower memory overhead than Spark |
Note: actual speedup depends heavily on the pipeline's bottleneck. Disk-bound pipelines see larger Spark/Flink improvements; CPU-bound pipelines see smaller differences. For single-stage jobs with no intermediate state, the difference is minimal.
---
## Engine Selection Decision Tree
```
Is the cluster shared with production workloads and task preemption is frequent?
├── YES → MapReduce (designed for this; tolerates frequent task kills)
└── NO
    └── Does an existing Hive/Pig codebase need faster execution without rewriting?
        ├── YES → Tez (swap execution engine via config)
        └── NO
            └── Will this pipeline eventually process streaming (unbounded) input?
                ├── YES → Flink (same code for batch + stream)
                └── NO
                    └── Is Python ecosystem important? (data science, ML)
                        ├── YES → Spark (PySpark, MLlib, GraphX)
                        └── NO → Spark or Flink (comparable; choose based on team familiarity)
```
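The same decision tree can be written as a small function, which is convenient when the choice needs to be recorded or reviewed alongside the requirements. The function and parameter names are hypothetical; the branch order mirrors the tree above.

```python
def choose_engine(
    frequent_preemption,
    existing_hive_or_pig,
    will_become_streaming,
    python_ecosystem_important,
):
    """Walk the engine selection tree top to bottom; the first match wins."""
    if frequent_preemption:
        return "MapReduce"
    if existing_hive_or_pig:
        return "Tez"
    if will_become_streaming:
        return "Flink"
    if python_ecosystem_important:
        return "Spark"
    return "Spark or Flink"


recommendation = choose_engine(
    frequent_preemption=False,
    existing_hive_or_pig=False,
    will_become_streaming=True,
    python_ecosystem_important=True,
)
```

Because the checks are ordered, a pipeline that will become streaming is steered to Flink even when the Python ecosystem also matters, exactly as in the tree.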
FILE:references/join-strategy-reference.md
# Join Strategy Reference
Detailed decision criteria, size thresholds, and skew handling patterns for batch pipeline joins.
## The Three Join Strategies at a Glance
| Strategy | Also called | Where join logic runs | Input requirements | Best for |
|----------|-------------|----------------------|-------------------|----------|
| Sort-merge join | Reduce-side join | Reducer | No assumptions | Default; both datasets large |
| Broadcast hash join | Map-side join (replicated) | Mapper | One dataset fits in memory | One small + one large dataset |
| Partitioned hash join | Bucketed map join (Hive) | Mapper | Both co-partitioned identically | Both datasets large, same partitioning |
---
## Strategy 1: Sort-Merge Join (Reduce-Side)
### How it works
1. **Map phase:** Both input datasets go through mappers. Each mapper extracts the join key and emits `(join_key, record)` pairs. Records from dataset A and dataset B both emit using the same join key.
2. **Shuffle/sort phase:** The framework partitions mapper output by join key (using hash-of-key to assign to a reducer partition) and sorts within each partition. All records with the same join key — from both datasets — arrive at the same reducer, adjacent in sort order.
3. **Reduce phase:** The reducer iterates over all records with a given key. Using a secondary sort (dataset A records before dataset B records, or vice versa), the reducer can hold one dataset's record in memory while streaming the other, rather than buffering everything. The reducer outputs joined records.
### Secondary sort technique
Secondary sort arranges records so that the "small side" of the join (e.g., one profile record per user) arrives first. The reducer stores this record in a local variable, then iterates through "large side" records (e.g., many activity events per user) emitting a joined output for each. This keeps memory usage constant — only one record from the small side is held at a time.
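A minimal single-process sketch of the reduce-side join with secondary sort follows. One in-memory `sorted` call stands in for the distributed shuffle, and the tags `PROFILE` and `EVENT`, the helper names, and the sample records are assumptions for the example. It assumes at most one profile record per user and performs an inner join.

```python
from itertools import groupby
from operator import itemgetter

PROFILE, EVENT = 0, 1  # secondary sort tag: profile records sort before events


def map_profiles(profiles):
    for user_id, birth_year in profiles:
        yield (user_id, PROFILE), birth_year


def map_events(events):
    for user_id, url in events:
        yield (user_id, EVENT), url


def sort_merge_join(profiles, events):
    # Shuffle/sort stand-in: order by the composite key (join key, tag).
    shuffled = sorted(
        list(map_profiles(profiles)) + list(map_events(events)),
        key=itemgetter(0),
    )
    joined = []
    # Reduce: one group per join key, with the profile record arriving first.
    for _user_id, group in groupby(shuffled, key=lambda pair: pair[0][0]):
        profile = None  # the only state held in memory per key
        for (_key, tag), value in group:
            if tag == PROFILE:
                profile = value
            elif profile is not None:
                joined.append((value, profile))  # (url, birth_year)
    return joined


PROFILES = [(104, 1989), (173, 1975)]
EVENTS = [(173, "/x"), (104, "/a"), (104, "/b"), (999, "/z")]
joined = sort_merge_join(PROFILES, EVENTS)
```

The reducer never buffers the events for a key; it holds one profile value and streams the rest, which is what keeps memory use constant.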
### Size thresholds
- Both datasets can be arbitrarily large.
- Network and disk I/O scale with total data volume. Budget 2-3x the input data size for shuffle traffic.
- Optimize by filtering and projecting early in the map phase to reduce shuffle volume.
### When to use
- Default choice when dataset sizes are unknown or both are large (> a few GB each).
- When no assumptions can be made about data partitioning.
- When the join key is not known in advance to be uniformly distributed (safe starting point; add skew handling if needed).
### Cost profile
- Expensive: full shuffle of both datasets over the network, full sort.
- Reducer start is delayed until all mappers for both datasets have finished.
- Intermediate data is written to local disk on each mapper, then transferred to reducers.
---
## Strategy 2: Broadcast Hash Join (Map-Side, Small Table)
### How it works
1. Before map tasks start, the small dataset is loaded from the distributed filesystem into each mapper's memory as a hash table (keyed by join key).
2. Each mapper reads one block of the large dataset and, for each record, looks up the join key in the in-memory hash table.
3. The mapper emits joined records directly. No reducer needed. No shuffle.
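The three steps above can be sketched as follows. This is a plain-Python illustration rather than any engine's API: a dictionary plays the role of the broadcast hash table, each list in `BLOCKS` stands for one block of the large dataset handled by one mapper, and all names and sample data are made up.

```python
def build_hash_table(small_records):
    """Step 1: load the small dataset into memory, keyed by the join key."""
    return {user_id: (age, country) for user_id, age, country in small_records}


def map_block(block, profile_table):
    """Steps 2 and 3: scan one block of the large dataset and emit joined records."""
    for user_id, url in block:
        profile = profile_table.get(user_id)
        if profile is not None:  # inner join: drop events with no profile
            age, country = profile
            yield url, age, country


PROFILES = [(1, 34, "DE"), (2, 27, "VN")]
BLOCKS = [
    [(1, "/home"), (2, "/cart")],
    [(3, "/home"), (1, "/checkout")],
]

table = build_hash_table(PROFILES)  # every mapper holds the same copy
# Each block is processed independently; no data moves between mappers.
outputs = [list(map_block(block, table)) for block in BLOCKS]
```

Each mapper's output depends only on its own block and the shared table, which is why no reducer and no shuffle are needed.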
### Size thresholds
- Small table must fit in memory on each mapper (including other mapper state and JVM overhead).
- Practical limits: 1-4 GB per mapper is typical. Check executor memory configuration.
- If the small table is larger than memory but fits on local disk, an alternative is to store it as a read-only index on each node's local disk. The frequently used parts of the index stay in the OS page cache, which gives near-memory lookup performance.
### Multi-way broadcast joins
If multiple small tables need to be joined with one large table, all small tables can be broadcast simultaneously. Each mapper loads all small tables at startup. This is efficient as long as the total memory footprint of all small tables stays within the mapper's memory budget.
### Supported in
- Pig: "replicated join"
- Hive: "MapJoin" (also auto-detected by Hive optimizer when table is below `hive.mapjoin.smalltable.filesize`)
- Cascading, Crunch: broadcast join operators
- Spark: `broadcast()` hint on small DataFrame; optimizer uses `BroadcastHashJoin`
- Impala: used by default for small tables in analytic queries
### When to use
- One dataset is clearly small (user profile DB, product catalog, lookup table) and the other is large (event log, transaction history).
- When the small table is frequently joined with many large datasets (load once, reuse across multiple map tasks).
### When NOT to use
- Small table grows beyond memory budget — switch to sort-merge join.
- Joins where both sides are large — broadcasting covers only small inputs, so a large-large join still needs a shuffle-based strategy (or co-partitioned inputs).
---
## Strategy 3: Partitioned Hash Join (Map-Side, Co-Partitioned)
### How it works
Both datasets must be partitioned using the same key, the same hash function, and the same number of partitions. This is the "co-partitioned" or "bucketed" property.
1. Mapper N reads partition N of the large dataset.
2. Mapper N also loads partition N of the small dataset into memory (only partition N, not the full dataset).
3. Mapper N performs a hash join between the two partitions locally. No shuffle or reducer needed.
Because the co-partitioning guarantee means that any record in partition N of dataset A can only match records in partition N of dataset B, each mapper handles a fully independent subset of the join problem.
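A small sketch can make the co-partitioning guarantee concrete. It assumes both datasets are partitioned with the same function (`zlib.crc32` here, chosen only because it is deterministic) and the same partition count. The helper names and sample data are hypothetical.

```python
import zlib


def partition_of(key, num_partitions):
    """Shared partitioning function: same key, same partition, for both datasets."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition(records, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


def join_partition(small_part, large_part):
    """One mapper: hash only its own partition of the smaller dataset."""
    table = {}
    for key, value in small_part:
        table.setdefault(key, []).append(value)
    return [
        (key, small_value, large_value)
        for key, large_value in large_part
        for small_value in table.get(key, [])
    ]


def partitioned_hash_join(small_parts, large_parts):
    if len(small_parts) != len(large_parts):
        raise ValueError("inputs are not co-partitioned")
    joined = []
    for small_part, large_part in zip(small_parts, large_parts):
        joined.extend(join_partition(small_part, large_part))  # independent tasks
    return joined


USERS = [(1, "alice"), (2, "bob"), (3, "carol")]
EVENTS = [(1, "/a"), (3, "/b"), (1, "/c"), (4, "/d")]
NUM_PARTITIONS = 3

joined = partitioned_hash_join(
    partition(USERS, NUM_PARTITIONS), partition(EVENTS, NUM_PARTITIONS)
)
```

If the two sides used different partition counts or hash functions, matching keys could land in different partitions and the join would silently drop rows, which is why the prerequisite is strict.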
### Size thresholds
- Neither dataset needs to fit entirely in memory — only one partition at a time.
- Memory requirement: (total small dataset size) / (number of partitions).
- With 1000 partitions and a 100 GB small dataset: each mapper loads 100 MB. Very manageable.
- Large dataset can be arbitrarily large as long as one partition fits on local disk.
### Prerequisites
Co-partitioning is a strong prerequisite. The datasets must:
- Be partitioned by the same join key (e.g., both partitioned by `user_id`)
- Use the same hash function
- Have the same number of partitions
This is most naturally satisfied when both datasets were produced by prior MapReduce jobs with the same output partitioning scheme. The partitioning metadata (key, function, partition count) must be tracked — Hadoop ecosystem uses HCatalog / Hive metastore for this.
### Known as
- Hive: "bucketed map join"
- General: "co-partitioned hash join", "partition hash join"
### When to use
- Both datasets are large, but one is smaller than the other.
- Both datasets were produced by prior pipeline stages that happen to partition by the join key.
- The join key is the same as the partitioning key already in use.
### When NOT to use
- Datasets are partitioned by different keys, different hash functions, or different partition counts — co-partitioning guarantee does not hold.
- One dataset is small enough for broadcast hash join (strategy 2 is simpler and faster).
---
## Map-Side Merge Join (Bonus Variant)
If both datasets are not only co-partitioned but also **sorted by the join key within each partition**, a mapper can perform a merge join by reading both partition files incrementally in ascending key order and matching records with the same key — exactly like the merge step of a merge sort.
This requires no in-memory hash table, making it suitable for very large partitions. It works like the reduce phase of a sort-merge join but runs in a mapper (no shuffle required).
This situation often arises naturally when the datasets were produced by prior MapReduce jobs or sort-merge joins, whose reducer output is already partitioned and sorted by key.
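The merge step can be sketched for one pair of partitions. This is a simplified illustration that assumes both inputs are sorted ascending by key and that keys are unique on the left side (for example, one profile per user); the function name and data are made up.

```python
def merge_join(left_sorted, right_sorted):
    """Join two key-sorted streams in one pass, holding one left record at a time."""
    left_iter = iter(left_sorted)
    left = next(left_iter, None)
    for right_key, right_value in right_sorted:
        # Advance the left side until it catches up with the current right key.
        while left is not None and left[0] < right_key:
            left = next(left_iter, None)
        if left is None:
            break  # left side exhausted; no further matches are possible
        if left[0] == right_key:
            yield right_key, left[1], right_value


LEFT = [(1, "a"), (3, "c"), (4, "d")]
RIGHT = [(1, "x"), (1, "y"), (2, "z"), (4, "w"), (5, "v")]
joined = list(merge_join(LEFT, RIGHT))
```

Neither input is loaded into a hash table, so partition size is limited by disk throughput rather than memory.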
---
## Skew Handling Patterns
### When to suspect skew
- A social network with celebrity accounts
- Content platforms with viral items (a single video_id with 10x the events of the average)
- Natural language data (stop words like "the", "a" as join keys)
- Time-partitioned data where some time windows have disproportionate traffic
### Detecting skew
Run a preliminary sampling job (or query the dataset statistics if available) to estimate the top-N key frequencies. If the top key has 10x the frequency of the median key, skew handling is warranted.
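A sampling check of this kind might look like the sketch below. The function name, the default sample rate, and the synthetic key list are assumptions; the 10x-over-median rule is the threshold stated above.

```python
import random
from collections import Counter
from statistics import median


def detect_hot_keys(keys, sample_rate=0.1, ratio_threshold=10, seed=0):
    """Return keys whose sampled frequency is at least ratio_threshold x the median."""
    rng = random.Random(seed)
    sample = Counter(key for key in keys if rng.random() < sample_rate)
    if not sample:
        return []
    median_count = median(sample.values())
    return sorted(
        key for key, count in sample.items()
        if count >= ratio_threshold * median_count
    )


# Synthetic input: one celebrity key with 1000 events, 100 users with 5 events each.
KEYS = ["celebrity"] * 1000 + [f"user{i}" for i in range(100) for _ in range(5)]

# sample_rate=1.0 scans everything, which keeps this small example deterministic.
hot_keys = detect_hot_keys(KEYS, sample_rate=1.0)
```

On real data a lower sample rate keeps the preliminary job cheap; hot keys are frequent by definition, so they still show up in a small sample.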
### Pattern 1: Skewed Join (Pig)
1. **Sampling job:** Scan a sample of the large dataset to identify hot keys.
2. **Hot key handling:** Records with hot keys are sent to a randomly chosen subset of reducers (spreading the load). Records from the other dataset for those hot keys must be replicated to all reducers handling that key.
3. **Cold key handling:** Standard sort-merge join.
Cost: data for hot-key matching side is replicated (increases network volume). Benefit: the hot-key reducer becomes a set of reducers, eliminating the straggler.
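The routing logic of a skewed join can be sketched in one process. The pair `(key, shard)` stands for "the reducer group that handles this key"; the function names, the fan-out of 4, and the sample data are assumptions, and the sketch assumes one record per key on the replicated side.

```python
import random
from collections import Counter


def route_large_side(records, hot_keys, fanout, rng):
    """Hot-key records go to one randomly chosen shard; cold keys use shard 0."""
    for key, value in records:
        shard = rng.randrange(fanout) if key in hot_keys else 0
        yield (key, shard), value


def route_small_side(records, hot_keys, fanout):
    """Hot-key records are replicated to every shard that may receive that key."""
    for key, value in records:
        shards = range(fanout) if key in hot_keys else range(1)
        for shard in shards:
            yield (key, shard), value


def skewed_join(large, small, hot_keys, fanout=4, seed=0):
    rng = random.Random(seed)
    small_by_group = dict(route_small_side(small, hot_keys, fanout))
    load = Counter()  # records handled per reducer group
    joined = []
    for group, value in route_large_side(large, hot_keys, fanout, rng):
        load[group] += 1
        if group in small_by_group:
            joined.append((group[0], small_by_group[group], value))
    return joined, load


PROFILES = [("celebrity", "verified"), ("alice", "standard")]
EVENTS = [("celebrity", f"follow-{i}") for i in range(1000)] + [("alice", "follow-x")]

joined, load = skewed_join(EVENTS, PROFILES, hot_keys={"celebrity"})
hot_loads = [count for (key, _shard), count in load.items() if key == "celebrity"]
```

The join result is the same as an unsharded join, but the 1000 celebrity events are spread over several groups instead of landing on one reducer.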
### Pattern 2: Sharded Join (Crunch)
Same approach as skewed join but hot keys are specified explicitly by the developer rather than discovered by sampling. Requires domain knowledge of which keys are hot.
### Pattern 3: Two-Stage Aggregation for GROUP BY Skew
When grouping by a hot key (not joining), a two-stage approach avoids the straggler reducer:
1. **Stage 1:** Each mapper sends records to a randomly chosen reducer (not the canonical hash-of-key reducer). Each reducer computes a partial aggregate for its subset of records.
2. **Stage 2:** The partial aggregates for each key (now much smaller in volume) are sent to the canonical reducer for final aggregation.
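This is only correct for aggregations that are commutative and associative, so that partial results can be merged (sum, count, min, max). It is not correct for holistic aggregations such as an exact median, which cannot be computed from partial aggregates.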
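The two stages can be sketched for a count aggregation. This is a single-process illustration with hypothetical names; each `Counter` in `partials` stands for the state of one stage-1 reducer.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers=4, seed=0):
    rng = random.Random(seed)

    # Stage 1: spread records over random reducers; each keeps a partial count.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1

    # Stage 2: the partial counts for each key are merged at its canonical reducer.
    totals = Counter()
    for partial in partials:
        for key, count in partial.items():
            totals[key] += count
    return totals, partials


KEYS = ["the"] * 1000 + ["pregel", "shuffle", "pregel"]
totals, partials = two_stage_count(KEYS)
largest_partial = max(partial["the"] for partial in partials)
```

Stage 2 sees at most one record per key per stage-1 reducer, so the hot key arrives as a handful of partial counts rather than 1000 raw records.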
### Hive skewed join optimization
Hive's approach: hot keys are declared in the table metadata. Records for hot keys are stored in separate files. At join time, the hot-key records are processed using a broadcast hash join (map-side) while cold-key records use a standard sort-merge join. The join strategy is automatically selected per-key based on the metadata.
---
## Join Output and Downstream Partitioning
The join strategy affects the output's partitioning, which matters for downstream pipeline stages:
- **Sort-merge join output:** Partitioned and sorted by the join key (reducer output). Downstream stages that need the data sorted by join key can skip re-sorting.
- **Broadcast hash join output:** Partitioned the same way as the large input dataset (since one map task per block of the large input). Not sorted by join key.
- **Partitioned hash join output:** Partitioned the same way as both input datasets.
Knowing the output partitioning of each stage enables downstream stages to choose faster join strategies (e.g., if the sort-merge output is already sorted, the next join on the same key can use a map-side merge join).
Implement the Visitor pattern to define new operations on an object structure without changing the classes of the elements it operates on. Use when you have...
---
name: visitor-pattern-implementor
description: |
Implement the Visitor pattern to define new operations on an object structure without changing the classes of the elements it operates on. Use when you have a stable class hierarchy but frequently add new operations, when many unrelated operations need to be performed on an object structure, or when you want to accumulate state across a traversal. Includes the critical stability decision rule, double-dispatch mechanism (Accept/Visit), Iterator integration for traversal, and the encapsulation trade-off warning.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/design-patterns-gof/skills/visitor-pattern-implementor
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- behavioral-pattern-selector
source-books:
- id: design-patterns-gof
title: "Design Patterns: Elements of Reusable Object-Oriented Software"
authors: ["Erich Gamma", "Richard Helm", "Ralph Johnson", "John Vlissides"]
chapters: [5]
pages: [70-80, 306-318]
tags: [design-patterns, behavioral, gof, visitor, double-dispatch, iterator, traversal, object-structure]
execution:
tier: 2
mode: full
inputs:
- type: code
description: "Existing code with a stable element class hierarchy needing multiple new operations, or a description of the object structure and the operations to add"
tools-required: [TodoWrite, Read]
tools-optional: [Grep, Edit]
mcps-required: []
environment: "Any codebase. Language-agnostic — examples use Python but the pattern applies universally."
---
# Visitor Pattern Implementor
## When to Use
You have an object structure (a hierarchy of related element classes) and need to perform operations on those objects without modifying the element classes themselves. This skill applies when:
- **The element class hierarchy is stable** — it rarely gains new subclasses, but you frequently need to add new operations over the structure. This is the **stability decision rule**: if the hierarchy changes often, do not use Visitor.
- **Many distinct and unrelated operations** need to be performed on the same object structure, and you want to avoid polluting element classes with those operations.
- **State must accumulate across a traversal** — the operation builds up a result by visiting multiple elements in sequence (e.g., collecting all misspellings, computing a total price).
- The object structure is shared across multiple applications and you want each application to add only its own operations, not affect shared element code.
**The go/no-go decision before proceeding:**
| Element hierarchy stability | Operation change frequency | Verdict |
|-----------------------------|---------------------------|---------|
| Stable (rare new elements) | High (many new operations) | Use Visitor |
| Unstable (frequent new elements) | Any | Avoid Visitor — each new element requires adding an abstract `Visit` method to every Visitor and implementing it in every ConcreteVisitor |
| Stable | Low (1-2 operations total) | Consider simpler approach — Visitor may be over-engineering |
If the hierarchy is unstable, it is easier to define the operation in the element classes directly. If you use Visitor anyway, every new element type forces a cascade of changes through all existing Visitor classes — exactly the problem Visitor was meant to solve in the other direction.
Before starting, confirm this is not better handled by:
- **Iterator alone** — if the operation is simple and all elements are the same type, an Iterator may suffice
- **Template Method** — if the operation structure is fixed and only small steps vary per element type
---
## Process
### Step 1: Set Up Tracking and Apply the Stability Test
**ACTION:** Use `TodoWrite` to track progress, then evaluate whether Visitor is the right pattern.
**WHY:** The stability test is the most critical decision in the entire workflow. Applying Visitor to an unstable hierarchy produces a brittle system — every new element class requires modifying the abstract Visitor interface and all ConcreteVisitor classes. This is a worse maintenance burden than the original problem. Confirming stability first prevents committing to the wrong pattern.
```
TodoWrite:
- [ ] Step 1: Apply stability test and confirm Visitor is appropriate
- [ ] Step 2: Define the Visitor interface
- [ ] Step 3: Add Accept methods to element classes
- [ ] Step 4: Implement ConcreteVisitor classes
- [ ] Step 5: Implement traversal (ObjectStructure or Iterator)
- [ ] Step 6: Wire up the client and validate
```
Document the element hierarchy:
| Element | How often new subclasses are added | Operations needed now | Planned future operations |
|---------|-----------------------------------|----------------------|--------------------------|
| (list each concrete element class) | rarely / sometimes / often | N | estimate |
If "often" appears in column 2, stop and use `behavioral-pattern-selector` to reconsider.
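The stop rule above can be written as a small check. This is a minimal sketch, assuming a hypothetical `ElementRow` record and `stability_verdict` helper (neither is part of this skill); the "two operations or fewer" threshold mirrors the "Low (1-2 operations total)" row of the stability table.

```python
from dataclasses import dataclass


@dataclass
class ElementRow:
    """One row of the element hierarchy table (hypothetical helper type)."""
    name: str
    new_subclass_frequency: str  # "rarely", "sometimes", or "often"
    operations_now: int
    planned_operations: int


def stability_verdict(rows: list[ElementRow]) -> str:
    """Apply the Step 1 rule: any 'often' in column 2 means stop and reconsider."""
    if any(row.new_subclass_frequency == "often" for row in rows):
        return "reconsider"
    # Largest number of operations (current plus planned) over any element
    total_ops = max((r.operations_now + r.planned_operations for r in rows), default=0)
    if total_ops <= 2:
        return "consider simpler approach"
    return "use visitor"


rows = [
    ElementRow("Character", "rarely", 2, 3),
    ElementRow("Row", "rarely", 2, 3),
]
print(stability_verdict(rows))
```

The function only encodes the table; the judgment about what counts as "rarely" or "often" still has to come from the team that owns the hierarchy.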
---
### Step 2: Define the Visitor Interface
**ACTION:** Create an abstract Visitor class (or interface) that declares one `Visit` method for every ConcreteElement class in the structure.
**WHY:** The Visitor interface is the contract that lets each ConcreteElement call back to the visitor with its concrete type. Having a separate `Visit` method per element type — rather than a single generic method — allows the visitor to access element-specific state through each element's own interface without type checks or casts. The operation's name encodes the element type it operates on, making the code self-documenting and enabling compiler-enforced completeness.
```python
# Abstract Visitor: one Visit method per ConcreteElement
class DocumentVisitor:
    def visit_character(self, character: "Character") -> None:
        """Called when a Character element accepts this visitor."""
        pass  # Default: do nothing (allows selective override)

    def visit_row(self, row: "Row") -> None:
        pass

    def visit_image(self, image: "Image") -> None:
        pass
```
**Design decisions for the Visitor interface:**
| Decision | Options | Guidance |
|----------|---------|---------|
| **Default implementations** | Empty default vs. abstract (forced override) | Empty defaults allow ConcreteVisitors to handle only the elements they care about, reducing boilerplate |
| **Method naming** | `visit_character` vs. overloaded `visit(character)` | Distinct names are clearer in languages without overloading; overloading reduces noise in languages that support it well |
| **Return value** | void vs. typed return | Void with accumulated state in the visitor is simpler; typed returns work if each visit produces an independent result |
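The "typed return" option in the table can look like the following. This is a sketch only: `ReturningVisitor`, `Leaf`, `Pair`, and `DepthVisitor` are hypothetical names, and here the visitor recurses into children itself so that each visit can return an independent result.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class ReturningVisitor(Generic[T]):
    """Hypothetical visitor whose visit methods return a value of type T."""

    def visit_leaf(self, leaf: "Leaf") -> T:
        raise NotImplementedError

    def visit_pair(self, pair: "Pair") -> T:
        raise NotImplementedError


class Leaf:
    def accept(self, visitor: ReturningVisitor[T]) -> T:
        return visitor.visit_leaf(self)


class Pair:
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def accept(self, visitor: ReturningVisitor[T]) -> T:
        return visitor.visit_pair(self)


class DepthVisitor(ReturningVisitor[int]):
    """Each visit yields its own result, so no accumulator field is needed."""

    def visit_leaf(self, leaf: Leaf) -> int:
        return 1

    def visit_pair(self, pair: Pair) -> int:
        # The visitor asks each child for its depth and combines the answers
        return 1 + max(pair.left.accept(self), pair.right.accept(self))


tree = Pair(Leaf(), Pair(Leaf(), Leaf()))
print(tree.accept(DepthVisitor()))  # 3
```

With void visits the same computation would need a depth counter stored on the visitor, which is the simpler default recommended in the table.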
---
### Step 3: Add Accept Methods to Element Classes
**ACTION:** Add an `Accept(visitor)` method to the abstract Element class and implement it in each ConcreteElement.
**WHY:** This is where double-dispatch is achieved. In single-dispatch languages (Python, Java, C++), when you call `element.Accept(visitor)`, only `element`'s type is used to select the method. Inside `Accept`, when the element calls `visitor.visit_character(self)`, now *both* the visitor's type and the element's concrete type determine which `Visit` implementation runs. This two-step dispatch is the mechanism that lets a single `Accept(visitor)` call route to the right combination of visitor-and-element behavior — without any `isinstance` checks or casts.
The pattern for every ConcreteElement is identical:
```python
class Element:
    """Abstract base: declares the Accept protocol."""
    def accept(self, visitor: DocumentVisitor) -> None:
        raise NotImplementedError


class Character(Element):
    def __init__(self, char_code: str):
        self._char_code = char_code

    def get_char_code(self) -> str:
        return self._char_code

    def accept(self, visitor: DocumentVisitor) -> None:
        # Double-dispatch: calls the Visit method specific to Character
        visitor.visit_character(self)


class Row(Element):
    def __init__(self):
        self._children: list[Element] = []

    def add(self, element: Element) -> None:
        self._children.append(element)

    def get_children(self) -> list[Element]:
        return self._children

    def accept(self, visitor: DocumentVisitor) -> None:
        visitor.visit_row(self)


class Image(Element):
    def accept(self, visitor: DocumentVisitor) -> None:
        visitor.visit_image(self)
```
**The double-dispatch mechanism step by step:**
1. Client calls `element.accept(visitor)` — dispatch 1: resolved by `element`'s concrete type
2. Inside `Character.accept`, calls `visitor.visit_character(self)` — dispatch 2: resolved by `visitor`'s concrete type
3. The specific `Visit` body that runs depends on **both** the element type AND the visitor type
This is the key to adding new operations: adding a new `ConcreteVisitor` subclass provides a new set of `Visit` implementations without touching any element class.
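The two dispatch steps can be observed directly with two element types and two visitors. The sketch below is illustrative: `DescribeVisitor` and `CostVisitor` are hypothetical, and `accept` returns the visit result (unlike the `None`-returning version above) purely so the outcome is easy to print.

```python
class Character:
    def accept(self, visitor):
        # Dispatch 1 has already happened: Python selected Character.accept.
        # Dispatch 2 happens here: the visitor's type selects the method body.
        return visitor.visit_character(self)


class Image:
    def accept(self, visitor):
        return visitor.visit_image(self)


class DescribeVisitor:
    def visit_character(self, element):
        return "describe character"

    def visit_image(self, element):
        return "describe image"


class CostVisitor:
    def visit_character(self, element):
        return "cost character"

    def visit_image(self, element):
        return "cost image"


# One call site, four different method bodies, and no isinstance checks.
results = [
    element.accept(visitor)
    for element in (Character(), Image())
    for visitor in (DescribeVisitor(), CostVisitor())
]
print(results)
```

Each of the four (element type, visitor type) combinations reaches its own method body through the same `element.accept(visitor)` expression.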
**Encapsulation warning:** For the visitor to do its work, element classes must expose the state the visitor needs through public accessors (e.g., `get_char_code()`). If elements keep their state private with no accessors, the visitor pattern forces you to add those accessors — potentially breaking encapsulation. Evaluate which internal state each planned visitor operation needs, and add only those accessors. This is the main liability of the pattern.
---
### Step 4: Implement ConcreteVisitor Classes
**ACTION:** For each distinct operation, create a ConcreteVisitor class that implements the Visitor interface and defines what to do for each element type.
**WHY:** Each ConcreteVisitor encapsulates one complete operation over the entire object structure. Gathering all per-element logic for a single operation in one class keeps the algorithm coherent — you can read `SpellingCheckingVisitor` and understand the entire spelling algorithm in one place. This contrasts with distributing the logic across element classes, where the algorithm for a single operation is split across every element class and difficult to follow. The visitor also carries its own state (e.g., the accumulated list of misspellings), which would otherwise require global state or awkward parameter-passing.
```python
class SpellingCheckingVisitor(DocumentVisitor):
    """Accumulates misspelled words during traversal."""
    def __init__(self):
        self._current_word: list[str] = []
        self._misspellings: list[str] = []

    def visit_character(self, character: "Character") -> None:
        ch = character.get_char_code()
        if ch.isalpha():
            self._current_word.append(ch)
        else:
            # Non-alphabetic character terminates a word, so check it
            word = "".join(self._current_word)
            if word and self._is_misspelled(word):
                self._misspellings.append(word)
            self._current_word = []

    def visit_row(self, row: "Row") -> None:
        pass  # Rows have no direct spelling content; children are handled by traversal

    def visit_image(self, image: "Image") -> None:
        pass  # Images are not text, so skip them

    def get_misspellings(self) -> list[str]:
        return list(self._misspellings)

    def _is_misspelled(self, word: str) -> bool:
        # Plug in actual dictionary lookup here
        return False


class HyphenationVisitor(DocumentVisitor):
    """Inserts discretionary hyphens into the document structure."""
    def visit_character(self, character: "Character") -> None:
        ch = character.get_char_code()
        if ch.isalpha():
            # Accumulate word characters; when complete, apply hyphenation algorithm
            # and insert Discretionary glyphs at computed break points
            pass

    def visit_row(self, row: "Row") -> None:
        pass

    def visit_image(self, image: "Image") -> None:
        pass
```
**State management in ConcreteVisitors:**
- State that accumulates across the traversal (word buffer, running total, inventory count) lives as instance variables on the visitor — this is the designed purpose of ConcreteVisitor state
- State should be reset between traversals if the same visitor instance is reused
- Algorithm-specific data structures that would pollute element classes are natural residents in the ConcreteVisitor
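The reset guidance above can be illustrated with a visitor that is reused across two traversals. This is a sketch under assumptions: `WordCountVisitor`, its `reset` method, and the `count_words` helper are hypothetical and not part of the original example.

```python
class Character:
    def __init__(self, char_code: str):
        self._char_code = char_code

    def get_char_code(self) -> str:
        return self._char_code

    def accept(self, visitor) -> None:
        visitor.visit_character(self)


class WordCountVisitor:
    """Hypothetical visitor: counts words, keeping all state on the instance."""

    def __init__(self):
        self.reset()

    def reset(self) -> None:
        # Called before each traversal so counts do not leak between runs
        self._in_word = False
        self._count = 0

    def visit_character(self, character: Character) -> None:
        if character.get_char_code().isalpha():
            if not self._in_word:
                self._count += 1  # first letter of a new word
            self._in_word = True
        else:
            self._in_word = False

    def get_count(self) -> int:
        return self._count


def count_words(text: str, visitor: WordCountVisitor) -> int:
    visitor.reset()
    for ch in text:
        Character(ch).accept(visitor)
    return visitor.get_count()


visitor = WordCountVisitor()
first = count_words("to be or not", visitor)
second = count_words("to be", visitor)
print(first, second)  # 4 2
```

Without the `reset()` call the second traversal would report 6, because the count from the first run would carry over.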
---
### Step 5: Implement Traversal
**ACTION:** Decide who is responsible for traversal — the ObjectStructure, the elements themselves (composite recursion), or an external Iterator — and implement accordingly.
**WHY:** Traversal must be separated from the visit action because different operations often require the same traversal order, and you want to reuse the traversal without duplicating it. The three options have different trade-offs for flexibility, reuse, and coupling.
**Traversal option 1 — ObjectStructure drives iteration (simplest for collections):**
```python
class DocumentStructure:
    """Owns and iterates the element collection."""
    def __init__(self):
        self._elements: list[Element] = []

    def add(self, element: Element) -> None:
        self._elements.append(element)

    def accept(self, visitor: DocumentVisitor) -> None:
        for element in self._elements:
            element.accept(visitor)
```
**Traversal option 2 — Composite elements recurse (natural for tree structures):**
```python
class Row(Element):
    def accept(self, visitor: DocumentVisitor) -> None:
        # Visit children first, then self (postorder); swap the order for preorder
        for child in self._children:
            child.accept(visitor)
        # Then visit self
        visitor.visit_row(self)
```
**Traversal option 3 — External Iterator (maximum flexibility, required when traversal order varies per operation):**
```python
# Using an Iterator to drive traversal externally
# Allows different visitors to use different traversal orders on the same structure
def traverse_with_visitor(root: Element, visitor: DocumentVisitor) -> None:
    iterator = PreorderIterator(root)
    for element in iterator:
        element.accept(visitor)
```
**Iterator integration — why it complements Visitor:**
An Iterator traverses a structure but cannot distinguish between element types. A Visitor distinguishes between element types but needs a mechanism to visit them all. The combination — Iterator for traversal order, Visitor for per-type behavior — separates two independent concerns cleanly. The Lexi case study (GoF pages 70-80) demonstrates this: a `PreorderIterator` walks the glyph structure, and a `SpellingCheckingVisitor` is passed to each glyph via `CheckMe`/`Accept` as the iterator advances.
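The snippets above use `PreorderIterator` without defining it. One possible shape is sketched below, assuming every element exposes a `get_children()` method and that `Row.accept` does not recurse on its own (the iterator owns the traversal). `TraceVisitor` is a hypothetical visitor used only to show the visiting order.

```python
from typing import Iterator


class Element:
    def get_children(self) -> list["Element"]:
        return []  # leaves have no children

    def accept(self, visitor) -> None:
        raise NotImplementedError


class Character(Element):
    def __init__(self, char_code: str):
        self.char_code = char_code

    def accept(self, visitor) -> None:
        visitor.visit_character(self)


class Row(Element):
    def __init__(self, children):
        self._children = list(children)

    def get_children(self) -> list[Element]:
        return self._children

    def accept(self, visitor) -> None:
        visitor.visit_row(self)  # no recursion here: the iterator drives traversal


class PreorderIterator:
    """Yields a node before its children, children in left-to-right order."""

    def __init__(self, root: Element):
        self._root = root

    def __iter__(self) -> Iterator[Element]:
        stack = [self._root]
        while stack:
            node = stack.pop()
            yield node
            # Push children reversed so the leftmost child is popped first
            stack.extend(reversed(node.get_children()))


class TraceVisitor:
    def __init__(self):
        self.trace = []

    def visit_character(self, character: Character) -> None:
        self.trace.append(character.char_code)

    def visit_row(self, row: Row) -> None:
        self.trace.append("Row")


root = Row([Character("a"), Row([Character("b")]), Character("c")])
visitor = TraceVisitor()
for element in PreorderIterator(root):
    element.accept(visitor)
print(visitor.trace)  # ['Row', 'a', 'Row', 'b', 'c']
```

A postorder iterator could be swapped in without touching `TraceVisitor` or any element class, which is the separation of concerns described above.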
**Choose traversal placement:**
| Situation | Recommended traversal placement |
|-----------|--------------------------------|
| Structure is a flat collection (list, set) | ObjectStructure drives iteration |
| Structure is a tree (Composite pattern) | Elements recurse in `Accept` |
| Traversal order varies by operation | External Iterator |
| Traversal algorithm is complex (irregular, depends on intermediate results) | ConcreteVisitor controls its own traversal |
---
### Step 6: Wire Up the Client and Validate
**ACTION:** Write client code that creates a ConcreteVisitor, passes it to the ObjectStructure's `Accept` method, and retrieves results. Then validate the implementation.
**WHY:** The client is the proof of the pattern's value. The client creates a visitor and passes it into the structure — it never needs to change when new visitor types are added. Validating this "closed to modification" property confirms the implementation is correct.
```python
# Client usage — adding a new operation (spelling check) without modifying element classes
structure = DocumentStructure()
structure.add(Character("a"))
structure.add(Character("l"))
structure.add(Character("u"))
structure.add(Character("_")) # underscore terminates "alu" — not a word
structure.add(Character("m"))
spelling_visitor = SpellingCheckingVisitor()
structure.accept(spelling_visitor)
misspellings = spelling_visitor.get_misspellings()
# Later: add hyphenation — zero changes to Document, Character, Row, Image
hyphenation_visitor = HyphenationVisitor()
structure.accept(hyphenation_visitor)
```
**Validation checklist:**
- [ ] No element class was modified to add the new operation
- [ ] Each `Accept` method calls exactly the matching `Visit` method (`visitor.visit_character(self)` in `Character`, etc.) — not a generic method
- [ ] The ConcreteVisitor accumulates all state it needs without accessing global variables
- [ ] The traversal visits all elements in the structure (no elements skipped unless intentional)
- [ ] Adding a second ConcreteVisitor requires zero changes to element classes
- [ ] If a new ConcreteElement were added, you can enumerate all Visitor classes that would need updating — and confirm this list is acceptably small
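The second checklist item (each `Accept` calls exactly the matching `Visit` method) can be checked mechanically. This is a sketch, assuming a hypothetical `RecordingVisitor` test double and `find_dispatch_problems` helper; the element classes are reduced to their `accept` methods.

```python
class Character:
    def accept(self, visitor):
        visitor.visit_character(self)


class Row:
    def accept(self, visitor):
        visitor.visit_row(self)


class Image:
    def accept(self, visitor):
        visitor.visit_image(self)


class RecordingVisitor:
    """Hypothetical test double: records which visit method each element calls."""

    def __init__(self):
        self.calls = []

    def visit_character(self, element):
        self.calls.append(("visit_character", element))

    def visit_row(self, element):
        self.calls.append(("visit_row", element))

    def visit_image(self, element):
        self.calls.append(("visit_image", element))


EXPECTED = {Character: "visit_character", Row: "visit_row", Image: "visit_image"}


def find_dispatch_problems(expected):
    """Return one message per element type whose accept() dispatches incorrectly."""
    problems = []
    for element_type, method_name in expected.items():
        element = element_type()
        visitor = RecordingVisitor()
        element.accept(visitor)
        # Exactly one call, to the matching method, passing the element itself
        if visitor.calls != [(method_name, element)]:
            problems.append(
                f"{element_type.__name__}.accept should call {method_name} exactly once"
            )
    return problems


print(find_dispatch_problems(EXPECTED))  # []
```

An empty list means every element dispatched to its own visit method; a copy-paste mistake such as `Image.accept` calling `visit_row` shows up as a message in the list.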
---
## Examples
### Example 1: Lexi Document Editor (GoF Case Study — Spelling + Hyphenation)
**Scenario:** Lexi is a document editor representing text as a hierarchy of Glyph objects (Character, Row, Image, etc.). The Glyph class hierarchy is stable — it changes infrequently because it represents the document model. But new analytical capabilities are added often: spelling checking, hyphenation, word count, forward search, grammar checking.
**Trigger:** "We need to add spelling checking and hyphenation to Lexi without wiring these algorithms into the Glyph class hierarchy, and we want to add more analyses in the future."
**Stability assessment:** Glyph hierarchy is stable (Character, Row, Image, Discretionary, all defined by the document model). Operations are frequent and varied. Visitor is appropriate.
**Design:**
```
Abstract: Visitor — VisitCharacter, VisitRow, VisitImage
Concrete A: SpellingCheckingVisitor — accumulates _currentWord, checks IsMisspelled
Concrete B: HyphenationVisitor — assembles words, inserts Discretionary glyphs
Elements: Glyph hierarchy — each subclass has Accept(Visitor&)
Traversal: PreorderIterator drives Accept calls glyph-by-glyph
```
**Double-dispatch in action:**
1. Iterator reaches a `Character` glyph
2. `character.accept(visitor)` → dispatch 1: Character's Accept executes
3. Inside `Character.accept`: `visitor.visit_character(self)` → dispatch 2: SpellingCheckingVisitor's `visit_character` executes
4. `SpellingCheckingVisitor.visit_character` calls `character.get_char_code()`, the element-specific accessor, to read the character
**Output:** SpellingCheckingVisitor produces `get_misspellings()`. HyphenationVisitor modifies the document structure in-place. Neither operation touched any Glyph class. Adding `WordCountVisitor` is one new class — no other code changes.
---
### Example 2: Abstract Syntax Tree Compiler (GoF Motivation)
**Scenario:** A compiler represents programs as an abstract syntax tree (AST) with node classes: `AssignmentNode`, `VariableRefNode`, `LiteralNode`, `BinaryOpNode`. The node classes are fixed by the grammar. Operations needed: type-checking, code generation, pretty-printing, flow analysis, constant folding. These are added as the compiler evolves.
**Trigger:** "TypeCheck, GenerateCode, and PrettyPrint are methods on every AST node class. Adding flow analysis means touching 12 node classes again."
**Stability assessment:** AST nodes are stable (defined by the grammar; grammar rarely changes for a given language). Compiler passes are frequent. Visitor is well-suited.
**Before (the problem — operations distributed across node classes):**
```
AssignmentNode: TypeCheck(), GenerateCode(), PrettyPrint()
VariableRefNode: TypeCheck(), GenerateCode(), PrettyPrint()
```
**After (Visitor — operations centralized in visitor classes):**
```python
class NodeVisitor:
    def visit_assignment(self, node: "AssignmentNode") -> None: pass
    def visit_variable_ref(self, node: "VariableRefNode") -> None: pass
    def visit_literal(self, node: "LiteralNode") -> None: pass


class TypeCheckingVisitor(NodeVisitor):
    def visit_assignment(self, node): ...    # type-check assignment
    def visit_variable_ref(self, node): ...  # verify variable is declared
    def visit_literal(self, node): ...       # validate literal type


class CodeGeneratingVisitor(NodeVisitor):
    def visit_assignment(self, node): ...
    def visit_variable_ref(self, node): ...
    def visit_literal(self, node): ...


class AssignmentNode:
    def accept(self, visitor: NodeVisitor) -> None:
        visitor.visit_assignment(self)


class VariableRefNode:
    def accept(self, visitor: NodeVisitor) -> None:
        visitor.visit_variable_ref(self)
```
**Output:** Adding `FlowAnalysisVisitor` is one new class. `AssignmentNode` and `VariableRefNode` are never modified. Each compiler pass is readable as a coherent whole in its own class.
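To make "one new class" concrete, here is a small runnable pass added to a reduced version of the node classes. The node constructors and fields (`target`, `value`, `name`) and the `UndefinedUseVisitor` pass are assumptions made for illustration; `AssignmentNode.accept` is assumed to visit its right-hand side before itself.

```python
class NodeVisitor:
    def visit_assignment(self, node): pass
    def visit_variable_ref(self, node): pass
    def visit_literal(self, node): pass


class VariableRefNode:
    def __init__(self, name: str):
        self.name = name

    def accept(self, visitor: NodeVisitor) -> None:
        visitor.visit_variable_ref(self)


class LiteralNode:
    def __init__(self, value):
        self.value = value

    def accept(self, visitor: NodeVisitor) -> None:
        visitor.visit_literal(self)


class AssignmentNode:
    def __init__(self, target: str, value):
        self.target = target  # name being assigned (hypothetical field)
        self.value = value    # right-hand-side node (hypothetical field)

    def accept(self, visitor: NodeVisitor) -> None:
        self.value.accept(visitor)      # right-hand side first
        visitor.visit_assignment(self)  # then the assignment itself


class UndefinedUseVisitor(NodeVisitor):
    """New pass: reports variables read before any assignment to them."""

    def __init__(self):
        self._assigned = set()
        self.undefined = []

    def visit_variable_ref(self, node: VariableRefNode) -> None:
        if node.name not in self._assigned:
            self.undefined.append(node.name)

    def visit_assignment(self, node: AssignmentNode) -> None:
        self._assigned.add(node.target)


program = [
    AssignmentNode("x", LiteralNode(1)),
    AssignmentNode("y", VariableRefNode("x")),
    AssignmentNode("z", VariableRefNode("w")),
]
checker = UndefinedUseVisitor()
for statement in program:
    statement.accept(checker)
print(checker.undefined)  # ['w']
```

The node classes are identical for every pass; only `UndefinedUseVisitor` is new.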
---
### Example 3: Equipment Pricing and Inventory (GoF Sample Code)
**Scenario:** A hardware product catalog uses a Composite structure: `Chassis` (composite) contains `FloppyDisk`, `Card`, `Bus` (leaves). Two unrelated operations need to run over the structure: compute total price and compute inventory count.
**Trigger:** "We need both `GetTotalPrice()` and `GetInventory()` on the equipment structure, and the equipment class hierarchy is finished — we don't want to add more methods to it."
**Design:**
```python
class EquipmentVisitor:
    def visit_floppy_disk(self, disk: "FloppyDisk") -> None: pass
    def visit_card(self, card: "Card") -> None: pass
    def visit_chassis(self, chassis: "Chassis") -> None: pass
    def visit_bus(self, bus: "Bus") -> None: pass


class PricingVisitor(EquipmentVisitor):
    def __init__(self): self._total = 0.0

    def visit_floppy_disk(self, disk):
        self._total += disk.net_price()          # leaf: net price

    def visit_chassis(self, chassis):
        self._total += chassis.discount_price()  # composite: discounted

    def get_total_price(self) -> float:
        return self._total


class InventoryVisitor(EquipmentVisitor):
    def __init__(self): self._inventory: list = []

    def visit_floppy_disk(self, disk):
        self._inventory.append(disk)

    def visit_chassis(self, chassis):
        self._inventory.append(chassis)

    def get_inventory(self) -> list:
        return self._inventory


# Chassis traverses children before visiting itself (composite Accept)
class Chassis(Equipment):
    def accept(self, visitor: EquipmentVisitor) -> None:
        for part in self._parts:
            part.accept(visitor)
        visitor.visit_chassis(self)  # visit self after children
```
**State accumulation:** `PricingVisitor._total` and `InventoryVisitor._inventory` grow as the traversal proceeds. Without the Visitor pattern, this state would need to be passed as parameters through every element's pricing method or stored as globals — the Visitor makes it a natural instance variable.
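**Output:** Create a `PricingVisitor`, pass it to `component.accept(...)`, and then call `get_total_price()` on it to read the total cost. Do the same with an `InventoryVisitor` and `get_inventory()` to get the component list. Keep a reference to the visitor, because `accept` itself returns nothing. Equipment classes were not modified.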
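A runnable version of the pricing traversal is sketched below. The constructors, the integer-cent prices, and the way `discount_price()` is computed are assumptions made for illustration; only the visitor and `accept` shapes follow the design above.

```python
class EquipmentVisitor:
    def visit_floppy_disk(self, disk): pass
    def visit_chassis(self, chassis): pass


class FloppyDisk:
    def __init__(self, price_cents: int):
        self._price_cents = price_cents

    def net_price(self) -> int:
        return self._price_cents

    def accept(self, visitor: EquipmentVisitor) -> None:
        visitor.visit_floppy_disk(self)


class Chassis:
    def __init__(self, price_cents: int, discount_cents: int):
        self._price_cents = price_cents
        self._discount_cents = discount_cents
        self._parts = []

    def add(self, part) -> None:
        self._parts.append(part)

    def discount_price(self) -> int:
        # Hypothetical rule: the chassis's own price minus a fixed discount
        return self._price_cents - self._discount_cents

    def accept(self, visitor: EquipmentVisitor) -> None:
        for part in self._parts:
            part.accept(visitor)
        visitor.visit_chassis(self)  # visit self after children


class PricingVisitor(EquipmentVisitor):
    def __init__(self):
        self._total = 0

    def visit_floppy_disk(self, disk: FloppyDisk) -> None:
        self._total += disk.net_price()

    def visit_chassis(self, chassis: Chassis) -> None:
        self._total += chassis.discount_price()

    def get_total_price(self) -> int:
        return self._total


chassis = Chassis(price_cents=10_000, discount_cents=1_000)
chassis.add(FloppyDisk(2_000))
chassis.add(FloppyDisk(2_000))

pricing = PricingVisitor()  # keep a reference: accept() returns None
chassis.accept(pricing)
print(pricing.get_total_price())  # 13000
```

The total is two disks at 2,000 cents each plus the discounted chassis at 9,000 cents.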
---
## Key Principles
- **Stability test is a prerequisite, not a suggestion** — applying Visitor to an unstable hierarchy creates a maintenance cascade. Every new element type requires a new abstract method in `Visitor` and a new implementation in every `ConcreteVisitor`. If this will happen often, the pattern costs more than it saves.
- **Double-dispatch is the mechanism, not a coincidence** — the two-step `Accept` → `Visit` call is intentional. Without it, single-dispatch languages would route all elements to the same `Visit` method, requiring type checks. The `Accept` method in each ConcreteElement is boilerplate by design: it exists solely to tell the visitor the element's concrete type.
- **ConcreteVisitor state is accumulation, not side-effect leakage** — the Visitor pattern is specifically designed to carry state across a traversal. Use visitor instance variables freely for this purpose. Each traversal run should get a fresh visitor instance (or reset the visitor's state) to avoid contamination between runs.
- **Encapsulation is explicitly traded for extensibility** — the Visitor pattern assumes element interfaces are powerful enough for visitors to do their work. If elements are too encapsulated, visitors cannot access what they need. The design obligation is to expose element state through well-defined accessors, accepting that internal state becomes part of the public contract.
- **Iterator + Visitor is a natural pairing** — Iterator solves "how to traverse" (order, data structure independence), Visitor solves "what to do at each element" (type-specific behavior). The two patterns are complementary, not competing. Use Iterator for the traversal loop, Visitor for the per-element dispatch.
- **Adding a new ConcreteVisitor is free; adding a new ConcreteElement is expensive** — this asymmetry is the fundamental trade-off. Design the element hierarchy to completion before adopting Visitor. If the hierarchy grows regularly, the cost of updating all existing Visitors may outweigh the benefit.
---
## Reference Files
| File | When to read |
|------|-------------|
| `references/visitor-implementation-guide.md` | Participants reference, full consequences catalog, language-specific notes (Java, TypeScript, Python), traversal placement decision tree, comparison with Interpreter pattern |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-behavioral-pattern-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/visitor-implementation-guide.md
# Visitor Pattern — Implementation Guide
> Reference for `visitor-pattern-implementor`. Read this when you need full consequences detail, participant role definitions, language-specific implementation notes, or the traversal placement decision tree.
---
## Participants and Their Roles
| Participant | Also called | Role |
|-------------|-------------|------|
| **Visitor** | NodeVisitor, DocumentVisitor, EquipmentVisitor | Declares one `Visit` operation per ConcreteElement class. The signature of each Visit operation encodes the concrete type. |
| **ConcreteVisitor** | TypeCheckingVisitor, SpellingCheckingVisitor, PricingVisitor | Implements each operation declared by Visitor. Provides the algorithm and carries accumulating state. Each ConcreteVisitor is a self-contained operation over the full structure. |
| **Element** | Node, Glyph, Equipment | Defines an `Accept` operation that takes a Visitor as an argument. |
| **ConcreteElement** | AssignmentNode, Character, FloppyDisk | Implements `Accept` by calling the matching `Visit` method on the visitor: `visitor.visit_assignment(self)`. |
| **ObjectStructure** | Program, DocumentStructure | Enumerates its elements. Provides a high-level interface to let visitors visit its elements. May be a Composite (see Composite pattern) or a simple collection. |
### Collaboration sequence
1. Client creates a ConcreteVisitor instance
2. Client calls `object_structure.accept(visitor)` (or starts traversal directly)
3. ObjectStructure iterates and calls `element.accept(visitor)` on each element
4. Each element calls the matching `visitor.visit_X(self)` back on the visitor (double-dispatch)
5. ConcreteVisitor executes the operation fragment for that element type, using the element's interface to access its state
6. After traversal: client retrieves accumulated results from the visitor
---
## Complete Consequences
### Benefits
**1. Easy to add new operations**
Adding a new operation over an object structure requires only defining a new ConcreteVisitor subclass. None of the element classes change. In contrast, if the operation were distributed across element classes, every element class would need modification.
**2. Related operations are gathered, unrelated ones separated**
All the behavior for one operation (e.g., pricing) lives in one class: `PricingVisitor`. The element classes stay clean — they only define structure. Algorithm-specific data structures are hidden inside the visitor.
**3. Visits across class hierarchies**
An Iterator requires all elements to share a common type (the iterator's type parameter). A Visitor imposes no such requirement — `VisitMyType` and `VisitYourType` can operate on completely unrelated classes. This makes Visitor useful when the structure contains elements that are not related by inheritance.
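A short illustration of visiting classes that share no base class follows. `Invoice`, `Sensor`, and `SummaryVisitor` are hypothetical names. In Python a plain list can already hold unrelated objects, so the point matters most in statically typed languages; the sketch only shows that the visitor needs nothing in common between the two classes beyond an `accept` method.

```python
class Invoice:
    """Business object with no relation to Sensor."""

    def __init__(self, amount: int):
        self.amount = amount

    def accept(self, visitor) -> None:
        visitor.visit_invoice(self)


class Sensor:
    """Hardware reading with no relation to Invoice."""

    def __init__(self, reading: float):
        self.reading = reading

    def accept(self, visitor) -> None:
        visitor.visit_sensor(self)


class SummaryVisitor:
    def __init__(self):
        self.lines = []

    def visit_invoice(self, invoice: Invoice) -> None:
        self.lines.append(f"invoice {invoice.amount}")

    def visit_sensor(self, sensor: Sensor) -> None:
        self.lines.append(f"sensor {sensor.reading}")


summary = SummaryVisitor()
for item in (Invoice(250), Sensor(21.5)):
    item.accept(summary)
print(summary.lines)  # ['invoice 250', 'sensor 21.5']
```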
**4. State accumulation during traversal**
Visitors naturally carry state across a traversal as instance variables. Without Visitor, this state would need to be threaded through method parameters or stored globally.
### Liabilities
**5. Adding new ConcreteElement classes is hard**
Each new ConcreteElement subclass requires a new abstract `Visit` method in the Visitor interface, and a new concrete implementation in every ConcreteVisitor. In a codebase with many ConcreteVisitors, this is costly. This is the central trade-off: Visitor makes adding operations easy and adding element types hard.
**The key consideration (GoF, page 310):**
> "The key consideration in applying the Visitor pattern is whether you are mostly likely to change the algorithm applied over an object structure or the classes of objects that make up the structure. The Visitor class hierarchy can be difficult to maintain when new ConcreteElement classes are added frequently. In such cases, it's probably easier just to define operations on the classes that make up the structure. If the Element class hierarchy is stable, but you are continually adding operations or changing algorithms, then the Visitor pattern will help you manage the changes."
**6. Breaking encapsulation**
Visitor's approach assumes the ConcreteElement interface is powerful enough for visitors to do their job. This often forces elements to expose internal state through public accessors that would otherwise be private. Once those accessors exist, the element's internal representation becomes part of its public contract — harder to change later.
Mitigation: design element accessors deliberately. Only expose state that a well-defined visitor operation legitimately needs. Do not expose internal mutable state unnecessarily.
---
## Traversal Placement Decision Tree
```
Is traversal order the same for all operations?
├─ YES → Is the structure a Composite (tree)?
│ ├─ YES → Composite elements recurse in Accept (children first, then self)
│ └─ NO → ObjectStructure drives a flat iteration
└─ NO → Do different operations need different traversal orders?
├─ YES → Use external Iterator — each ConcreteVisitor picks its iterator
└─ Is the traversal algorithm complex (irregular, result-dependent)?
├─ YES → ConcreteVisitor controls its own traversal
└─ NO → External Iterator is still cleanest
```
**Composite element traversal (postorder: children first, then self):**
```python
class CompositeElement(Element):
    def accept(self, visitor: Visitor) -> None:
        for child in self._children:
            child.accept(visitor)      # recurse first
        visitor.visit_composite(self)  # then self
```
**External iterator + visitor (maximum flexibility):**
```python
def apply_visitor(root: Element, visitor: Visitor, order: str = "preorder") -> None:
    iterator = PreorderIterator(root) if order == "preorder" else PostorderIterator(root)
    for element in iterator:
        element.accept(visitor)
```
**When the ConcreteVisitor controls traversal:**
The REMatchingVisitor example from GoF (pages 316-317) illustrates this: a `RepeatExpression` must repeatedly traverse its component. The visitor controls the traversal because the traversal algorithm depends on the intermediate match results — something only the visitor can know. This is the exception; prefer external traversal for most cases.
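The idea of a visitor that drives its own traversal can be sketched with a much-simplified matcher. This is not the GoF `REMatchingVisitor`: the `Literal`, `Sequence`, and `Repeat` classes and the `PrefixMatchVisitor` below are hypothetical, the repeat is greedy with no backtracking into it, and `accept` returns the visit result for brevity.

```python
class Literal:
    def __init__(self, text: str):
        self.text = text

    def accept(self, visitor):
        return visitor.visit_literal(self)


class Sequence:
    def __init__(self, *parts):
        self.parts = parts

    def accept(self, visitor):
        return visitor.visit_sequence(self)  # no recursion: the visitor decides


class Repeat:
    def __init__(self, body):
        self.body = body

    def accept(self, visitor):
        return visitor.visit_repeat(self)


class PrefixMatchVisitor:
    """Hypothetical matcher: tracks how much of `text` has been consumed."""

    def __init__(self, text: str):
        self.text = text
        self.pos = 0

    def visit_literal(self, literal: Literal) -> bool:
        if self.text.startswith(literal.text, self.pos):
            self.pos += len(literal.text)
            return True
        return False

    def visit_sequence(self, sequence: Sequence) -> bool:
        start = self.pos
        for part in sequence.parts:
            if not part.accept(self):
                self.pos = start  # undo partial progress on failure
                return False
        return True

    def visit_repeat(self, repeat: Repeat) -> bool:
        # Re-traverse the body until it stops matching or stops consuming input.
        # How many times to traverse depends on the match state only the visitor has.
        while True:
            before = self.pos
            if not repeat.body.accept(self) or self.pos == before:
                break
        return True  # zero or more repetitions always succeed


pattern = Sequence(Repeat(Literal("ab")), Literal("c"))
matcher = PrefixMatchVisitor("ababc")
print(pattern.accept(matcher), matcher.pos)  # True 5
```

Neither an external iterator nor a recursive `accept` could know in advance how many times to traverse the body of `Repeat`, which is why the visitor controls the traversal here.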
---
## Language-Specific Notes
### Python
Manual `isinstance`-based dispatch is sometimes used in Python as an alternative to Visitor, but it is usually an anti-pattern. The GoF authors note the problem explicitly (page 76):
```python
# Anti-pattern: manual type dispatch in the visitor
def check(self, glyph):
if isinstance(glyph, Character): ...
elif isinstance(glyph, Row): ...
elif isinstance(glyph, Image): ...
```
This is brittle — adding a new element type requires finding and updating every such `isinstance` chain. The `accept/visit` protocol eliminates this by letting the element announce its own type through method dispatch.
Python supports overloaded `visit` via `functools.singledispatchmethod`:
```python
from functools import singledispatchmethod


class PricingVisitor:
    def __init__(self):
        self._total = 0.0  # running total must exist before the first visit

    @singledispatchmethod
    def visit(self, element):
        # Fallback for element types with no registered handler
        raise NotImplementedError(f"No visit handler for {type(element)}")

    @visit.register(FloppyDisk)
    def _(self, disk: FloppyDisk):
        self._total += disk.net_price()

    @visit.register(Chassis)
    def _(self, chassis: Chassis):
        self._total += chassis.discount_price()
```
This reduces boilerplate at the cost of slightly less explicit dispatch naming. The `accept` method in each element can then call `visitor.visit(self)` uniformly.
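A self-contained version of the `singledispatchmethod` variant, including the uniform `accept`, might look like this. The element constructors and prices are assumptions for illustration; `functools.singledispatchmethod` is available from Python 3.8.

```python
from functools import singledispatchmethod


class FloppyDisk:
    def __init__(self, price: int):
        self._price = price

    def net_price(self) -> int:
        return self._price

    def accept(self, visitor) -> None:
        visitor.visit(self)  # uniform call: dispatch happens inside visit()


class Chassis:
    def __init__(self, price: int, parts):
        self._price = price
        self._parts = list(parts)

    def discount_price(self) -> int:
        return self._price  # hypothetical: price is already discounted

    def accept(self, visitor) -> None:
        for part in self._parts:
            part.accept(visitor)
        visitor.visit(self)


class PricingVisitor:
    def __init__(self):
        self._total = 0

    @singledispatchmethod
    def visit(self, element):
        raise NotImplementedError(f"No visit handler for {type(element)}")

    @visit.register(FloppyDisk)
    def _(self, disk):
        self._total += disk.net_price()

    @visit.register(Chassis)
    def _(self, chassis):
        self._total += chassis.discount_price()

    def get_total_price(self) -> int:
        return self._total


pricing = PricingVisitor()
Chassis(50, [FloppyDisk(20), FloppyDisk(30)]).accept(pricing)
print(pricing.get_total_price())  # 100
```

One consequence of this variant is that a missing handler is only detected at runtime, when the fallback raises `NotImplementedError` for an unregistered element type.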
### Java
Java uses the accept/visit naming convention directly. Abstract classes or interfaces work for both Visitor and Element:
```java
public interface DocumentVisitor {
    void visitCharacter(Character character);
    void visitRow(Row row);
    void visitImage(Image image);
}

public class Character implements Element {
    @Override
    public void accept(DocumentVisitor visitor) {
        visitor.visitCharacter(this);
    }
}
```
Because `DocumentVisitor` above declares only abstract methods, every ConcreteVisitor must implement all `visit` methods. When ConcreteVisitors care about only a subset of element types, either give the interface `default` no-op methods (available since Java 8) or provide an abstract adapter class with empty implementations that ConcreteVisitors extend.
### TypeScript
TypeScript supports both the pattern directly and a discriminated union approach:
```typescript
interface DocumentVisitor {
  visitCharacter(node: Character): void;
  visitRow(node: Row): void;
  visitImage(node: Image): void;
}

abstract class Element {
  abstract accept(visitor: DocumentVisitor): void;
}

class Character extends Element {
  constructor(public readonly charCode: string) { super(); }
  accept(visitor: DocumentVisitor): void {
    visitor.visitCharacter(this);
  }
}
```
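For discriminated unions as an alternative (when the type set is truly closed and small), TypeScript's exhaustive `switch` on a `kind` field can replace the accept/visit protocol. This is simpler and needs no `accept` boilerplate, but it shares Visitor's cost on the other axis: adding a new type requires updating every `switch`, just as it requires updating every ConcreteVisitor. Prefer discriminated unions when the type set is small and operations are plain functions; prefer Visitor when operations carry state or the elements are already classes. If types are added often, neither approach fits well, and the operation is better defined on the element types themselves.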
---
## Comparison with Related Patterns
### Visitor vs. Iterator
| Dimension | Iterator | Visitor |
|-----------|----------|---------|
| Purpose | Traverse a structure uniformly | Perform type-specific operations during traversal |
| Element type requirement | All elements must share a common type | Elements can be unrelated |
| Operation diversity | Single operation (advance, current) | Multiple operations via ConcreteVisitor subclasses |
| State accumulation | No (stateless traversal) | Yes (ConcreteVisitor carries state) |
| Extension axis | Add new traversal orders | Add new operations |
They are complementary: use Iterator for the traversal loop, Visitor for the per-element dispatch body.
### Visitor vs. Interpreter
The Interpreter pattern uses a hierarchy of expression classes, each implementing an `Interpret` method. Adding a new interpretation (e.g., pretty-printing) requires adding `PrettyPrint` to every expression class. Applying Visitor to an Interpreter structure externalizes each interpretation as a ConcreteVisitor, making it easy to add new interpretations without modifying expression classes.
GoF: "Visitor may be applied to do the interpretation." (page 317)
### Visitor vs. Composite
Visitor is typically used together with Composite. The Composite pattern defines the object structure (tree of elements). Visitor provides the mechanism to define operations over that structure without modifying element classes. The `CompositeElement::Accept` implementation that recurses over children and then calls `visitor.visit_composite(self)` is the standard integration point.
---
## Known Uses
- **Smalltalk-80 compiler:** `ProgramNodeEnumerator` — a Visitor class used primarily for algorithms that analyze source code
- **IRIS Inventor 3D toolkit:** Uses "actions" (visitors) for rendering, event handling, searching, filing, and determining bounding boxes over a scene graph hierarchy
- **Java compiler (javac):** Tree visitors (`com.sun.source.tree.TreeVisitor`) walk the AST for different compilation phases
- **Most IDE analysis tools:** Linters, type-checkers, and code formatters typically implement AST visitors over a stable grammar-defined node hierarchy
Choose the right structural design pattern (Adapter, Bridge, Composite, Decorator, Facade, Flyweight, or Proxy) for a structural design problem. Use when you...
---
name: structural-pattern-selector
description: |
Choose the right structural design pattern (Adapter, Bridge, Composite, Decorator, Facade, Flyweight, or Proxy) for a structural design problem. Use when you need to adapt interfaces between incompatible classes, decouple an abstraction from its implementation so both can vary independently, build part-whole hierarchies where individual objects and compositions are treated uniformly, add responsibilities to objects dynamically without subclassing, simplify access to a complex subsystem, share large numbers of fine-grained objects efficiently, or control and mediate access to another object. Disambiguates commonly confused patterns: Adapter vs Bridge (timing — after design vs before design), Composite vs Decorator vs Proxy (intent — structure vs embellishment vs access control, recursion pattern). Also use when someone asks "should I use Adapter or Bridge?", "what's the difference between Decorator and Proxy?", "when should I use Facade vs Adapter?", or "do I need Flyweight here?"
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/design-patterns-gof/skills/structural-pattern-selector
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- design-pattern-selector
source-books:
- id: design-patterns-gof
title: "Design Patterns: Elements of Reusable Object-Oriented Software"
authors: ["Erich Gamma", "Richard Helm", "Ralph Johnson", "John Vlissides"]
chapters: [4]
tags: [design-patterns, structural-patterns, adapter, bridge, composite, decorator, facade, flyweight, proxy, object-oriented, gof, disambiguation]
execution:
tier: 1
mode: full
inputs:
- type: none
description: "A structural design problem description — the design challenge, existing code pain points, or a scenario requiring structural composition"
tools-required: [Read, Write, TodoWrite]
tools-optional: [Grep, Glob]
mcps-required: []
environment: "Any agent environment. Scanning an existing codebase for context is helpful but not required."
---
# Structural Pattern Selector
## When to Use
You have a **structural** design problem — one about how classes or objects are composed to form larger structures. This skill applies when:
- Two classes need to work together but have incompatible interfaces
- An abstraction and its implementation are tightly bound and you need them to vary independently
- You need to build tree-like hierarchies where individual items and groups behave the same way
- You need to add capabilities to objects at runtime without subclassing
- A complex subsystem needs a simpler unified entry point
- You're creating large numbers of similar lightweight objects and memory is a concern
- You need a stand-in, gateway, or access-control layer in front of another object
- You're unsure which structural pattern fits, especially between Adapter/Bridge or Composite/Decorator/Proxy
**Not this skill if:** The problem is about object creation (use a creational pattern) or communication/algorithms between objects (use a behavioral pattern). If you haven't yet classified the problem as structural, invoke `design-pattern-selector` first.
## Context & Input Gathering
### Required Context (must have before proceeding)
- **Problem description:** What structural challenge exists? A rough description is fine — "I need these two incompatible classes to work together" or "I'm adding logging/auth to objects without changing them"
  - Check prompt for: words like "interface mismatch", "add responsibility", "part-whole", "subsystem", "access control", "share objects", "decouple abstraction"
  - If missing, ask: "Describe the structural design problem you're facing — what's currently rigid, mismatched, or needs composing?"
- **Design timing:** Is this a new design (before classes exist) or an integration problem (classes already designed and built)?
  - This is the single most important disambiguator between Adapter and Bridge
  - Check prompt for: "existing", "third-party", "can't modify", "library" → likely post-design; "designing now", "new system", "will vary" → likely pre-design
  - If unclear, ask: "Are the classes you're working with already designed and built, or are you designing the structure from scratch?"
### Observable Context (gather from environment if available)
- **Existing code:** If a codebase is available, scan for structural signals:
  - `implements`/`extends` mismatches between incompatible classes → Adapter candidate
  - Deep class hierarchies (subclass count growing combinatorially) → Bridge candidate
  - Tree-structured data with uniform operations → Composite candidate
  - Wrapper classes that add behavior before/after delegation → Decorator candidate
  - A "manager" or "controller" class that forwards to many subsystem classes → Facade candidate
  - Massive collections of identical-except-for-position objects (particles, characters, icons) → Flyweight candidate
  - Classes that gate, cache, or delay access to another class → Proxy candidate
### Default Assumptions
- Object scope (composition) preferred over class scope (inheritance) unless inheritance is clearly appropriate — object patterns are more flexible at runtime
- When timing is ambiguous, lean toward asking — the Adapter vs Bridge decision hinges entirely on this
---
## Process
### Step 1: Set Up Tracking and Frame the Problem
**ACTION:** Use `TodoWrite` to track steps, then extract the core structural problem.
**WHY:** Structural problems often present with symptoms ("this class is too hard to change") rather than causes ("these two abstractions are coupled"). Forcing a structured frame up front prevents jumping to the wrong pattern before the actual problem is clear.
```
TodoWrite:
- [ ] Step 1: Frame the problem
- [ ] Step 2: Classify the structural intent
- [ ] Step 3: Apply the disambiguation decision tree
- [ ] Step 4: Check pattern-specific applicability conditions
- [ ] Step 5: Produce recommendation with rationale
```
Capture these elements from the description:
| Element | Question to answer |
|---------|-------------------|
| **What exists** | What classes/objects are already in play? |
| **What is wrong** | What is currently rigid, mismatched, or inefficient? |
| **What must be possible** | What should the design enable that isn't possible now? |
| **Timing** | Are these classes already designed, or is this a fresh design? |
---
### Step 2: Classify the Structural Intent
**ACTION:** Identify which of the seven structural intents best matches the problem.
**WHY:** Each structural pattern solves a fundamentally different structural problem. Getting the intent right eliminates 5–6 patterns immediately. The intent is the core discriminator — not the implementation mechanics.
**Structural intent map:**
| Intent | Core problem | Pattern |
|--------|-------------|---------|
| **Interface translation** | Two existing, incompatible interfaces must collaborate | Adapter |
| **Abstraction/implementation decoupling** | An abstraction must vary independently of how it is implemented | Bridge |
| **Part-whole uniformity** | Individual objects and compositions must be treated identically | Composite |
| **Dynamic responsibility addition** | Responsibilities must be added/removed at runtime without subclassing | Decorator |
| **Subsystem simplification** | A complex set of classes needs one simplified entry point | Facade |
| **Instance sharing for efficiency** | Large numbers of fine-grained objects must share state to reduce memory cost | Flyweight |
| **Controlled/mediated access** | Access to an object must be gated, deferred, or monitored | Proxy |
**OUTPUT:** Identify the 1–2 most matching intents. If two intents match, proceed to Step 3 for disambiguation.
---
### Step 3: Apply the Disambiguation Decision Tree
**ACTION:** For each commonly confused pair, apply the specific discriminator below.
**WHY:** The structural patterns share implementation mechanics (all use indirection, all involve forwarding calls) but differ sharply in intent and timing. Using the wrong one produces structurally valid but semantically wrong code — Bridge code that should have been Adapter won't adapt; Proxy code that should have been Decorator won't stack.
#### Adapter vs Bridge
**The single question that separates them:**
> "Were the classes being connected designed independently before this problem arose, or are you designing the structure now so they can vary independently later?"
- **Answer: already designed, now incompatible** → **Adapter**
  - Adapter makes things work *after* they're designed. The coupling was unforeseen.
  - The goal is to make two independently developed classes collaborate without rewriting either.
  - The interface you're adapting *to* already exists and is owned by someone else.
- **Answer: designing now, need future flexibility** → **Bridge**
  - Bridge makes things work *before* they're designed. The separation is intentional.
  - The goal is to keep an abstraction hierarchy independent from its implementation hierarchy so both can grow without a subclass explosion.
  - You are designing the structure up-front knowing it will need to vary in two dimensions.
**Facade vs Adapter distinction:** Facade is sometimes described as an adapter for a set of objects. The difference: Facade defines a *new* interface over an existing subsystem. Adapter *reuses* an existing interface by making another class conform to it. If clients already know the target interface and you need another class to satisfy it → Adapter. If there is no target interface and you're creating a convenient entry point → Facade.
#### Composite vs Decorator vs Proxy
These three share a structural mechanism: each holds a reference to one or more objects that conform to its own interface and forwards calls to them. Differentiate by intent and recursion:
| Question | Composite | Decorator | Proxy |
|----------|-----------|-----------|-------|
| **What is the purpose?** | Represent part-whole hierarchy (structure) | Add responsibilities dynamically (embellishment) | Control or mediate access (access management) |
| **Does it hold multiple children?** | Yes — a Composite holds a list of Components | No — a Decorator wraps exactly one Component | No — a Proxy references exactly one Subject |
| **Is recursive composition essential?** | Yes — the pattern requires it (tree structure) | Yes — decorators can be nested (stacked behaviors) | No — the proxy-subject relationship is static and one-to-one |
| **Does the wrapper change behavior?** | Yes, by aggregating children | Yes, by adding before/after behavior | Sometimes (protection, logging), but the *subject* defines core behavior |
| **Is the relationship known at compile time?** | No, determined at runtime (tree built dynamically) | No, determined at runtime (decoration can be applied/removed) | Often yes (proxy-subject relationship is fixed) |
**Quick diagnostic:**
- "I need to build a tree of objects where nodes and leaves behave the same" → **Composite**
- "I need to add logging/auth/caching to objects at runtime without changing them" → **Decorator**
- "I need a stand-in because the real object is remote, expensive to create, or needs access control" → **Proxy**
---
### Step 4: Check Pattern-Specific Applicability Conditions
**ACTION:** For the leading candidate, verify its specific applicability conditions are met. For Flyweight, ALL five conditions must be true — it is an AND-gate, not an OR-gate.
**WHY:** Each pattern has conditions that make it inappropriate even when the intent matches. Verifying these prevents applying a pattern that will cause more problems than it solves.
**Adapter — use when:**
- You want to use an existing class but its interface does not match the one you need
- You want to create a reusable class that cooperates with unrelated classes that don't have compatible interfaces
- (Object adapter only) You need to adapt multiple existing subclasses — class adapter would require subclassing each one
**Bridge — use when:**
- You want to avoid a permanent binding between an abstraction and its implementation (implementation must be selectable or switchable at runtime)
- Both the abstractions and their implementations should be extensible by subclassing and combined independently
- Changes to the implementation of an abstraction should not require recompiling clients
- You have a class proliferation where subclasses exist for each combination of abstraction variant and implementation variant (the "nested generalizations" smell)
**Composite — use when:**
- You want to represent part-whole hierarchies of objects
- You want clients to be able to ignore the difference between compositions of objects and individual objects — clients treat all objects in the composite structure uniformly
**Decorator — use when:**
- You need to add responsibilities to individual objects dynamically and transparently, without affecting other objects
- You need responsibilities that can be withdrawn (not just added)
- Extension by subclassing is impractical because it produces too many classes or classes are unavailable for subclassing
**Facade — use when:**
- You want to provide a simple interface to a complex subsystem (subsystem complexity has grown such that most clients need a simple default view)
- There are many dependencies between clients and the implementation classes of an abstraction — introducing a Facade decouples the subsystem from clients and promotes subsystem independence
- You want to layer your subsystems — use a facade as a defined entry point at each layer, simplifying inter-subsystem dependencies
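A minimal sketch of the first Facade condition, under stated assumptions: the subsystem classes `Tokenizer`, `Parser`, and `Evaluator` and the `CalculatorFacade` entry point are hypothetical and exist only for illustration. The facade defines a new interface (`compute`) rather than conforming to one that clients already use.

```python
from __future__ import annotations


class Tokenizer:
    def tokenize(self, source: str) -> list[str]:
        return source.split()


class Parser:
    def parse(self, tokens: list[str]) -> dict:
        # Expects exactly "<int> <op> <int>"; enough for the illustration.
        return {"op": tokens[1], "left": int(tokens[0]), "right": int(tokens[2])}


class Evaluator:
    def evaluate(self, tree: dict) -> int:
        if tree["op"] == "+":
            return tree["left"] + tree["right"]
        if tree["op"] == "*":
            return tree["left"] * tree["right"]
        raise ValueError(f"unsupported operator: {tree['op']}")


class CalculatorFacade:
    """New, simplified entry point; subsystem classes remain usable directly."""

    def __init__(self) -> None:
        self._tokenizer = Tokenizer()
        self._parser = Parser()
        self._evaluator = Evaluator()

    def compute(self, source: str) -> int:
        tokens = self._tokenizer.tokenize(source)
        tree = self._parser.parse(tokens)
        return self._evaluator.evaluate(tree)


result = CalculatorFacade().compute("2 + 3")
```

Clients that only need the default path depend on `CalculatorFacade` alone, which is what decouples them from the three subsystem classes.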
**Flyweight — ALL five conditions must be true (AND-gate):**
1. The application uses a large number of objects
2. Storage costs are high because of the sheer quantity of objects
3. Most object state can be made extrinsic (passed to methods rather than stored per-instance)
4. Many groups of objects may be replaced by relatively few shared objects once extrinsic state is removed
5. The application does not depend on object identity — since flyweight objects are shared, identity tests (`==`) will return true for conceptually distinct objects. **If your code needs to distinguish between individual instances that represent different logical entities, Flyweight will break that logic.**
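The five conditions, and the identity caveat in particular, can be illustrated with a small sketch. `Glyph` and `GlyphFactory` are hypothetical names; position is treated as extrinsic state and the character as intrinsic state.

```python
class Glyph:
    """Flyweight: stores only intrinsic state (the character)."""

    def __init__(self, char: str) -> None:
        self.char = char

    def render(self, x: int, y: int) -> str:
        # Position is extrinsic state, supplied by the caller on each call.
        return f"{self.char}@({x},{y})"


class GlyphFactory:
    """Hands out one shared Glyph per distinct character."""

    def __init__(self) -> None:
        self._pool = {}

    def get(self, char: str) -> Glyph:
        if char not in self._pool:
            self._pool[char] = Glyph(char)
        return self._pool[char]

    def pool_size(self) -> int:
        return len(self._pool)


factory = GlyphFactory()
glyphs = [factory.get(c) for c in "banana"]
rendered = [g.render(x, 0) for x, g in enumerate(glyphs)]

# The two 'a' glyphs at positions 1 and 3 are the same object.
first_a, second_a = glyphs[1], glyphs[3]
shared = first_a is second_a
```

Six logical characters are served by three objects. The last line shows condition 5 in practice: any code that relies on `first_a` and `second_a` being distinct objects would break.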
**Proxy — use when:**
- A remote proxy: you need a local representative for an object in a different address space
- A virtual proxy: you need to create expensive objects on demand (lazy initialization)
- A protection proxy: access to the original object requires access-rights control
- A smart reference: you need additional behavior triggered when an object is accessed (reference counting, loading a persistent object, locking)
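Of the four variants, the virtual proxy is the easiest to show in a few lines. The sketch below assumes hypothetical `Image`, `RealImage`, and `ImageProxy` classes; the `load_count` counter stands in for an expensive load and exists only to make the deferral observable.

```python
from abc import ABC, abstractmethod
from typing import Optional


class Image(ABC):
    @abstractmethod
    def draw(self) -> str: ...


class RealImage(Image):
    load_count = 0  # counts expensive loads, for illustration only

    def __init__(self, path: str) -> None:
        self.path = path
        RealImage.load_count += 1  # stands in for an expensive disk read

    def draw(self) -> str:
        return f"drawing {self.path}"


class ImageProxy(Image):
    """Virtual proxy: defers creating the RealImage until first use."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._real: Optional[RealImage] = None

    def draw(self) -> str:
        if self._real is None:
            self._real = RealImage(self._path)
        return self._real.draw()


proxy = ImageProxy("photo.png")
loads_before_draw = RealImage.load_count
first = proxy.draw()
second = proxy.draw()
loads_after_draw = RealImage.load_count
```

The proxy refers to exactly one subject and shares its interface, so clients cannot tell whether they hold an `ImageProxy` or a `RealImage`.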
---
### Step 5: Produce the Recommendation
**ACTION:** Deliver a structured recommendation, explicitly naming the disambiguation rationale if the problem involved a confused pair.
**WHY:** A recommendation without the disambiguation rationale leaves the developer unable to defend the choice to teammates, and unable to re-evaluate if the situation changes (e.g., if a design-time Bridge later needs to become an Adapter because existing classes were introduced).
**Recommendation format:**
```
## Structural Pattern Recommendation: [Pattern Name]
**Intent match:** [One sentence connecting the pattern's intent to the specific problem]
**Why not [confused alternative]:** [The specific discriminator that ruled it out — timing, recursion, identity, etc.]
**Applicability check:** [Confirm the specific conditions from Step 4 are met]
**What it enables:** [What the design can do after applying this pattern]
**Key trade-off:** [What complexity or constraint this pattern introduces]
**Patterns to combine with:** [Any structural or behavioral patterns that complement this, if applicable]
```
---
## Examples
### Example 1: Third-Party Payment SDK with Wrong Interface
**Scenario:** A team is integrating a payment provider's SDK. The SDK's `PaymentGateway` class has a `processCharge(amount, token)` method. The application's internal interface is `IPaymentProcessor` with a `charge(request: PaymentRequest)` method. The SDK cannot be modified.
**Trigger:** "We need to use this payment SDK but it doesn't match our internal interface."
**Process:**
- Step 2: Interface translation — two existing incompatible interfaces
- Step 3: Adapter vs Bridge discriminator — classes are already designed and built; the incompatibility is unforeseen. Bridge is for pre-design. This is post-design. → Adapter
- Step 4: Adapter applicability — want to use an existing class whose interface doesn't match; object adapter appropriate since we don't own the SDK
**Output:**
```
## Structural Pattern Recommendation: Adapter
**Intent match:** The PaymentGateway SDK already exists with a fixed interface;
Adapter translates the existing IPaymentProcessor interface to the SDK's API
without modifying either.
**Why not Bridge:** Bridge is used before design, to keep two hierarchies from being
permanently coupled. These classes are already built — the incompatibility is
discovered, not anticipated. "Adapter makes things work after they're designed;
Bridge makes them work before."
**Applicability check:** Existing class (SDK) with non-matching interface — confirmed.
Object adapter used because subclassing PaymentGateway is not appropriate.
**What it enables:** The application only ever sees IPaymentProcessor. Swapping
the payment provider means swapping the Adapter class, with zero changes to application code.
**Key trade-off:** Each method added to IPaymentProcessor requires a corresponding
translation in the Adapter. If the interfaces diverge significantly, the Adapter
grows. Consider whether the internal interface can be kept stable.
**Patterns to combine with:** Factory Method (to instantiate the correct Adapter
based on configured payment provider)
```
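A minimal object-adapter sketch for this scenario follows. Only `PaymentGateway`, `processCharge(amount, token)`, `IPaymentProcessor`, `charge`, and `PaymentRequest` come from the scenario; the field names `amount_cents` and `card_token`, the string return values, and the `PaymentGatewayAdapter` and `checkout` names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


class PaymentGateway:
    """Stand-in for the third-party SDK class, which cannot be modified."""

    def processCharge(self, amount: int, token: str) -> str:
        return f"charged {amount} with {token}"


@dataclass
class PaymentRequest:
    amount_cents: int  # hypothetical field
    card_token: str    # hypothetical field


class IPaymentProcessor(Protocol):
    def charge(self, request: PaymentRequest) -> str: ...


class PaymentGatewayAdapter:
    """Object adapter: holds the SDK instance and translates calls to it."""

    def __init__(self, gateway: PaymentGateway) -> None:
        self._gateway = gateway

    def charge(self, request: PaymentRequest) -> str:
        return self._gateway.processCharge(request.amount_cents, request.card_token)


def checkout(processor: IPaymentProcessor, request: PaymentRequest) -> str:
    # Application code depends only on IPaymentProcessor.
    return processor.charge(request)


receipt = checkout(PaymentGatewayAdapter(PaymentGateway()), PaymentRequest(1999, "tok_123"))
```

Swapping providers would mean writing another class with a `charge` method; `checkout` stays as it is.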
---
### Example 2: UI Component Tree with Uniform Operations
**Scenario:** A document editor has two kinds of graphic objects: primitive shapes (Line, Rectangle, Text) and grouped graphics (Picture, which contains other graphics). The editor needs to move, resize, and draw both primitive and grouped objects through the same interface. Nested groups must work (a Picture can contain another Picture).
**Trigger:** "We need to treat individual shapes and groups of shapes the same way, including nested groups."
**Process:**
- Step 2: Part-whole uniformity — individual and composed objects must behave identically
- Step 3: Composite vs Decorator vs Proxy discriminator:
  - Need to hold multiple children? Yes (Picture holds a list of Graphic)
  - Recursive composition essential? Yes (nested groups required)
  - Purpose: structuring a hierarchy, not adding behavior or controlling access → Composite
- Step 4: Composite applicability — part-whole hierarchy and uniform client treatment — both confirmed
**Output:**
```
## Structural Pattern Recommendation: Composite
**Intent match:** The editor needs to treat individual shapes and grouped shapes
uniformly through a Graphic interface, with arbitrarily deep nesting of groups.
**Why not Decorator:** Decorator also uses recursive composition but its purpose
is embellishment — adding behaviors. Composite's purpose is representation —
organizing objects into a tree where nodes and leaves are interchangeable.
No responsibilities are being added here; a structure is being organized.
**Why not Proxy:** Proxy is a one-to-one stand-in for one Subject, not a
container for multiple children.
**Applicability check:** Part-whole hierarchy (Picture contains Graphics) and
uniform client treatment (editor calls draw() on any Graphic) — both confirmed.
**What it enables:** The editor code can call draw(), move(), or resize() on any
Graphic without knowing whether it's a primitive or a group. Adding new primitive
types requires no changes to the editor.
**Key trade-off:** Can make the design overly general: it becomes hard to restrict
what can be added to a Composite (e.g., preventing a Line from being added as a
child of a TextGroup). If type-safety in the hierarchy matters, extra validation
is needed.
**Patterns to combine with:** Decorator (to add responsibilities to individual
Graphic objects without subclassing), Iterator (to traverse the tree)
```
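The recommendation can be sketched as below. `Graphic`, `Line`, `Picture`, `draw`, and `move` follow the scenario; returning a list of strings from `draw` and the `add` method name are assumptions chosen to keep the example observable.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Graphic(ABC):
    @abstractmethod
    def draw(self) -> list[str]: ...

    @abstractmethod
    def move(self, dx: int, dy: int) -> None: ...


class Line(Graphic):
    """Leaf: a primitive with no children."""

    def __init__(self, x: int, y: int) -> None:
        self.x, self.y = x, y

    def draw(self) -> list[str]:
        return [f"Line at ({self.x}, {self.y})"]

    def move(self, dx: int, dy: int) -> None:
        self.x += dx
        self.y += dy


class Picture(Graphic):
    """Composite: forwards every operation to its children."""

    def __init__(self) -> None:
        self._children: list[Graphic] = []

    def add(self, graphic: Graphic) -> None:
        self._children.append(graphic)

    def draw(self) -> list[str]:
        out: list[str] = []
        for child in self._children:
            out.extend(child.draw())
        return out

    def move(self, dx: int, dy: int) -> None:
        for child in self._children:
            child.move(dx, dy)


inner = Picture()
inner.add(Line(0, 0))
outer = Picture()
outer.add(Line(5, 5))
outer.add(inner)  # a Picture nested inside a Picture

outer.move(1, 2)
drawn = outer.draw()
```

The editor calls `move` and `draw` on `outer` exactly as it would on a single `Line`; the nesting depth is invisible to it.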
---
### Example 3: Rate Limiting and Logging Added to Service Calls
**Scenario:** A developer has a `ReportService` that generates reports. They need to add rate limiting (no more than 10 calls per minute), request logging, and authentication checking — all without modifying `ReportService`. The behaviors may need to be applied in different combinations to different service instances.
**Trigger:** "I need to add rate limiting, logging, and auth to our service without touching the service class, and I need to be able to mix and match which ones apply."
**Process:**
- Step 2: Dynamic responsibility addition — behaviors added at runtime without subclassing, and must be composable
- Step 3: Composite vs Decorator vs Proxy discriminator:
  - Adding/removing responsibilities dynamically? Yes → Decorator candidate
  - Controlling access to the real object? Rate limiting and auth do gate access → Proxy candidate
  - Key discriminator: need to stack multiple behaviors and mix-and-match → Proxy is one-to-one and static; Decorator is designed for stacking → **Decorator**
- Step 4: Decorator applicability — responsibilities to add/remove dynamically, subclassing impractical for all combinations — confirmed
**Output:**
```
## Structural Pattern Recommendation: Decorator
**Intent match:** Rate limiting, logging, and auth are responsibilities that need
to be added dynamically and in different combinations — exactly what Decorator
is designed for. Each concern becomes a decorator wrapping the service.
**Why not Proxy:** Proxy controls access through a static one-to-one relationship
to one Subject. It is not designed for recursive composition or for mixing and
matching multiple behaviors. A ProtectionProxy could handle auth alone, but cannot
stack with a RateLimitingProxy without compromising both patterns' intent.
**Applicability check:** Responsibilities are to be added transparently without
affecting other objects; responsibilities can be withdrawn (remove a decorator to
remove the behavior); subclassing for all combinations is impractical (3 behaviors
= 7 possible combinations). Confirmed.
**What it enables:** RateLimitingDecorator(LoggingDecorator(AuthDecorator(reportService)))
composes all three behaviors. Each can be added or removed independently at
construction time.
**Key trade-off:** A decorator and its component are not identical objects — code
that uses object identity (==) to find a specific layer will fail. Also, many small
objects can be harder to debug (stepping through the decorator chain). Keep
decorator chains short and document the wrapping order.
**Patterns to combine with:** Factory Method (to construct the right decorator
combination based on configuration)
```
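A sketch of the stacking described in this example, under stated assumptions: the `generate(user)` method, the `ServiceDecorator` base class, and the constructor parameters are hypothetical, and the rate limiter uses a simple call budget instead of the per-minute window from the scenario so that the behavior stays deterministic.

```python
from __future__ import annotations


class ReportService:
    def generate(self, user: str) -> str:
        return f"report for {user}"


class ServiceDecorator:
    """Base decorator: same interface as ReportService, forwards by default."""

    def __init__(self, inner) -> None:
        self._inner = inner

    def generate(self, user: str) -> str:
        return self._inner.generate(user)


class LoggingDecorator(ServiceDecorator):
    def __init__(self, inner, log: list[str]) -> None:
        super().__init__(inner)
        self._log = log

    def generate(self, user: str) -> str:
        self._log.append(f"generate({user})")
        return super().generate(user)


class AuthDecorator(ServiceDecorator):
    def __init__(self, inner, allowed: set[str]) -> None:
        super().__init__(inner)
        self._allowed = allowed

    def generate(self, user: str) -> str:
        if user not in self._allowed:
            raise PermissionError(user)
        return super().generate(user)


class RateLimitingDecorator(ServiceDecorator):
    def __init__(self, inner, max_calls: int) -> None:
        super().__init__(inner)
        self._remaining = max_calls  # simplification: call budget, not a time window

    def generate(self, user: str) -> str:
        if self._remaining <= 0:
            raise RuntimeError("rate limit exceeded")
        self._remaining -= 1
        return super().generate(user)


log: list[str] = []
service = RateLimitingDecorator(
    LoggingDecorator(AuthDecorator(ReportService(), allowed={"ana"}), log),
    max_calls=2,
)
report = service.generate("ana")

# A different combination for another instance: logging only.
logged_only = LoggingDecorator(ReportService(), log)
```

Because every layer exposes the same `generate` method, any subset of the three decorators can wrap a service in any order, which is the mix-and-match requirement from the trigger.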
---
## Key Principles
- **Timing is the Adapter/Bridge axis** — "Adapter makes things work *after* they're designed; Bridge makes them work *before*." If the incompatibility was discovered (classes already exist), that's Adapter territory. If the separation is intentional design-time planning, that's Bridge territory.
- **Intent beats mechanics** — Composite, Decorator, and Proxy all wrap objects and forward calls. The right choice is determined by what you're trying to accomplish: structure a hierarchy (Composite), add behaviors (Decorator), or control access (Proxy). Choosing by implementation similarity produces structurally valid but semantically wrong code.
- **Flyweight is an AND-gate, not an OR-gate** — All five conditions must hold simultaneously. The most commonly missed is identity: if any part of the application needs to distinguish between two flyweight instances representing different logical entities, shared objects will cause silent correctness bugs.
- **Facade defines new; Adapter reuses existing** — Facade creates a convenient new interface over a subsystem where none existed. Adapter makes a class conform to an existing interface someone else owns. If clients already have a target interface they code to, it's Adapter. If you're inventing the interface, it's Facade.
- **Recursive composition signals Composite or Decorator, not Proxy** — Proxy's subject relationship is static and one-to-one. If the design requires stacking, nesting, or building trees of wrapped objects, the choice is between Composite (structural hierarchy) and Decorator (behavioral layering), not Proxy.
---
## References
- For full applicability conditions and disambiguation detail for all seven patterns, see [structural-disambiguation.md](references/structural-disambiguation.md)
- For the complete 23-pattern catalog (purpose, scope, intent), consult `design-pattern-selector/references/pattern-catalog.md`
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-design-pattern-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/structural-disambiguation.md
# Structural Pattern Disambiguation Reference
> Source: Design Patterns: Elements of Reusable Object-Oriented Software (Gamma, Helm, Johnson, Vlissides), Chapter 4 — Structural Patterns, pages 133–207.
> This file contains detailed applicability conditions, disambiguation criteria, and cross-pattern relationships for the seven structural GoF patterns. Load this file when the SKILL.md decision tree produces ambiguity or when you need to explain a recommendation in depth.
---
## Overview: The Seven Structural Patterns
Structural patterns are concerned with how classes and objects are composed to form larger structures.
- **Class patterns** use inheritance to compose interfaces or implementations (Adapter class variant)
- **Object patterns** use composition to realize new functionality, gaining the ability to change composition at runtime
| Pattern | Core structural problem | Mechanism |
|---------|------------------------|-----------|
| Adapter | Interface incompatibility between existing classes | Wraps one object to present another interface |
| Bridge | Abstraction and implementation tightly bound | Separates into two independent hierarchies |
| Composite | Part-whole hierarchies need uniform treatment | Recursive tree of Component objects |
| Decorator | Adding responsibilities without subclassing | Wraps a Component and adds before/after behavior |
| Facade | Complex subsystem needs a simple unified entry point | Single object that delegates to subsystem classes |
| Flyweight | Large numbers of fine-grained objects are expensive | Shared objects with extrinsic state passed in |
| Proxy | Need a stand-in, gateway, or access-control layer | One-to-one wrapper that controls access to Subject |
---
## Adapter
### Intent
Convert the interface of a class into another interface clients expect. Adapter lets classes work together that couldn't otherwise because of incompatible interfaces.
Also known as: **Wrapper**
### Applicability
Use Adapter when:
- You want to use an existing class, and its interface does not match the one you need
- You want to create a reusable class that cooperates with unrelated or unforeseen classes that don't have compatible interfaces
- *(Object adapter only)* You need to use several existing subclasses, but it's impractical to adapt their interface by subclassing each one — an object adapter can adapt the interface of its parent class
### Class Adapter vs Object Adapter
| | Class Adapter | Object Adapter |
|--|--------------|---------------|
| Mechanism | Multiple inheritance | Composition (holds Adaptee reference) |
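| Adapts | A single, concrete Adaptee class | An Adaptee and any of its subclasses |
| Can override Adaptee behavior | Yes | Not directly (requires subclassing Adaptee and having the adapter refer to that subclass) |
| Language support | C++ (multiple inheritance); in Java/C# only when the Target is an interface | Universal |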
| When to prefer | When you need to override specific Adaptee behavior | When you need to work with Adaptee subclasses |
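Python allows multiple inheritance, so both variants in the table can be sketched side by side. All class names here (`Target`, `Adaptee`, `LoudAdaptee`, `ClassAdapter`, `ObjectAdapter`) are generic placeholders rather than names from a real library.

```python
class Target:
    def request(self) -> str:
        raise NotImplementedError


class Adaptee:
    def specific_request(self) -> str:
        return "adaptee"


class LoudAdaptee(Adaptee):
    def specific_request(self) -> str:
        return "ADAPTEE"


class ClassAdapter(Target, Adaptee):
    """Class adapter: inherits from both, so it is bound to Adaptee itself."""

    def request(self) -> str:
        return self.specific_request()


class ObjectAdapter(Target):
    """Object adapter: holds a reference, so any Adaptee subclass works."""

    def __init__(self, adaptee: Adaptee) -> None:
        self._adaptee = adaptee

    def request(self) -> str:
        return self._adaptee.specific_request()


results = [
    ClassAdapter().request(),
    ObjectAdapter(Adaptee()).request(),
    ObjectAdapter(LoudAdaptee()).request(),
]
```

`ClassAdapter` could override `specific_request` directly because it is an `Adaptee`; `ObjectAdapter` cannot, but it accepts `LoudAdaptee` without any new adapter class.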
### Disambiguation: Adapter vs Facade
Both Adapter and Facade wrap other objects and present a different interface. The distinction:
- **Adapter** reuses an *existing* target interface — there is already a contract (e.g., `IPaymentProcessor`) that other code depends on, and you need to make a new class satisfy it
- **Facade** defines a *new* interface — there is no pre-existing contract; you are creating a convenient entry point for a subsystem that previously had no unified access point
> From the GoF (p.206): "You might think of a facade as an adapter to a set of other objects. But that interpretation overlooks the fact that a facade defines a *new* interface, whereas an adapter reuses an old interface. Remember that an adapter makes two *existing* interfaces work together as opposed to defining an entirely new one."
### Disambiguation: Adapter vs Bridge (THE KEY DISTINCTION)
> From the GoF (p.206): "The Adapter pattern makes things work *after* they're designed; Bridge makes them work *before*."
| Axis | Adapter | Bridge |
|------|---------|--------|
| **When applied** | Post-design — classes exist independently and must be integrated | Pre-design — structure is intentionally separated before classes are built |
| **Origin of incompatibility** | Unforeseen — coupling was not anticipated | Intentional — separation is a deliberate architectural decision |
| **Goal** | Make two independently designed classes work together without reimplementing either | Keep abstraction and implementation hierarchies separately extensible |
| **Relationship to change** | Resolves a fixed incompatibility | Accommodates ongoing independent variation |
Both patterns provide a level of indirection by forwarding requests. The difference is entirely in intent and timing.
---
## Bridge
### Intent
Decouple an abstraction from its implementation so that the two can vary independently.
### Applicability
Use Bridge when:
- You want to avoid a permanent binding between an abstraction and its implementation — for example, when the implementation must be selected or switched at runtime
- Both the abstractions and their implementations should be extensible by subclassing, and the different combinations should be composable independently
- Changes in the implementation of an abstraction should have no impact on clients — their code should not have to be recompiled
- *(C++ context)* You want to hide the implementation of an abstraction completely from clients
- You have a proliferation of classes ("nested generalizations") — a class hierarchy with subclasses for each combination of abstraction variant and implementation variant. Bridge breaks this into two independent hierarchies
- You want to share an implementation among multiple objects and this fact should be hidden from the client
### The "Nested Generalization" Smell That Bridge Solves
If you see a class hierarchy like:
```
Shape
├── RedCircle
├── BlueCircle
├── RedSquare
└── BlueSquare
```
And it is growing as both shapes and colors are added, Bridge is likely the solution. Separate the two dimensions:
```
Shape (abstraction)      Color (implementation)
├── Circle               ├── Red
└── Square               └── Blue
```
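Circle and Square each hold a reference to a Color implementation. With 2 shapes and 2 colors, both designs happen to need 4 concrete classes (2 × 2 versus 2 + 2), but the growth differs: *m* shapes and *n* colors need *m* × *n* subclasses in the nested hierarchy and only *m* + *n* classes with Bridge. At 3 shapes and 3 colors that is 9 versus 6.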
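The two-hierarchy split can be sketched as follows. The `fill` and `describe` method names are assumptions made for illustration; the class names follow the diagram.

```python
from abc import ABC, abstractmethod


class Color(ABC):
    """Implementor side of the bridge."""

    @abstractmethod
    def fill(self) -> str: ...


class Red(Color):
    def fill(self) -> str:
        return "red"


class Blue(Color):
    def fill(self) -> str:
        return "blue"


class Shape(ABC):
    """Abstraction side: holds a reference to a Color implementation."""

    def __init__(self, color: Color) -> None:
        self.color = color

    @abstractmethod
    def describe(self) -> str: ...


class Circle(Shape):
    def describe(self) -> str:
        return f"{self.color.fill()} circle"


class Square(Shape):
    def describe(self) -> str:
        return f"{self.color.fill()} square"


# Four combinations from four classes, with no RedCircle-style subclasses.
shapes = [Circle(Red()), Circle(Blue()), Square(Red()), Square(Blue())]
descriptions = [s.describe() for s in shapes]

# The implementation can also be switched at runtime.
circle = Circle(Red())
circle.color = Blue()
```

Adding a `Green` color or a `Triangle` shape is one new class each, and every existing class on the other side works with it unchanged.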
### Participants
- **Abstraction** — defines the abstraction's interface; maintains a reference to an Implementor
- **RefinedAbstraction** — extends the Abstraction interface
- **Implementor** — defines the interface for implementation classes (does not need to match Abstraction's interface; typically provides primitive operations that Abstraction uses to define higher-level operations)
- **ConcreteImplementor** — implements the Implementor interface
---
## Composite
### Intent
Compose objects into tree structures to represent part-whole hierarchies. Composite lets clients treat individual objects and compositions of objects uniformly.
### Applicability
Use Composite when:
- You want to represent part-whole hierarchies of objects
- You want clients to be able to ignore the difference between compositions of objects and individual objects — clients will treat all objects in the composite structure uniformly
### Participants
- **Component** — declares the interface for objects in the composition; implements default behavior for the interface common to all classes
- **Leaf** — represents leaf objects in the composition; a Leaf has no children
- **Composite** — defines behavior for components having children; stores child components; implements child-related operations
- **Client** — manipulates objects in the composition through the Component interface
### Key Consequence: The Uniformity Principle
The power of Composite is that clients never need to ask "is this a leaf or a composite?" — they call the same interface either way. The Composite propagates operations through its children automatically (`for g in children: g.Operation()`).
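As a rough illustration of that propagation, the sketch below uses a hypothetical file-system example (`File` as Leaf, `Directory` as Composite, `size()` as the shared operation); the names are assumptions, not part of the original.

```python
from abc import ABC, abstractmethod


class Component(ABC):
    @abstractmethod
    def size(self) -> int: ...


class File(Component):
    """Leaf: has no children."""

    def __init__(self, nbytes: int) -> None:
        self.nbytes = nbytes

    def size(self) -> int:
        return self.nbytes


class Directory(Component):
    """Composite: stores children and forwards the operation to them."""

    def __init__(self) -> None:
        self.children: list = []

    def add(self, child: Component) -> "Directory":
        self.children.append(child)
        return self

    def size(self) -> int:
        # The client never asks whether a child is a leaf or a composite.
        return sum(child.size() for child in self.children)


nested = Directory().add(File(5)).add(File(7))
root = Directory().add(File(10)).add(nested)
total = root.size()
```

The client calls `size()` on `root` exactly as it would on a single `File`.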
### Disambiguation: Composite vs Decorator vs Proxy
See dedicated section below.
---
## Decorator
### Intent
Attach additional responsibilities to an object dynamically. Decorator provides a flexible alternative to subclassing for extending functionality.
Also known as: **Wrapper**
### Applicability
Use Decorator:
- To add responsibilities to individual objects dynamically and transparently, without affecting other objects
- For responsibilities that can be withdrawn
- When extension by subclassing is impractical — sometimes a large number of independent extensions are possible, producing an explosion of subclasses to support every combination; or a class definition may be hidden or otherwise unavailable for subclassing
### Why Decorator Instead of Subclassing
Subclassing adds behavior statically — all instances of the subclass have the additional behavior, and it cannot be removed. Decorator adds behavior at object level, dynamically. Three separate responsibilities (logging, rate limiting, auth) would require 7 subclasses to cover all combinations ($2^3 - 1$). With Decorator, you wrap the object with whichever decorators you need.
### The Recursive Composition Structure
A Decorator conforms to the interface of the Component it decorates. This means decorators can be nested:
```
AuthDecorator(LoggingDecorator(RateLimitDecorator(realService)))
```
Each layer adds its behavior before/after calling the next layer's method. The real object is at the center. The nesting depth is open-ended — this is the essential feature that distinguishes Decorator from Proxy.
### Important Consequence: Identity Breaks
A Decorator and its Component are not the same object. Code that uses object identity (`decorator == component`) will behave unexpectedly. This is a documented trade-off: Decorator sacrifices identity in exchange for compositional flexibility.
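The nesting and the identity trade-off can be sketched together. This is a minimal illustration with a hypothetical `Service.handle()` interface and a token check standing in for real authentication.

```python
class Service:
    def handle(self, request: str) -> str:
        return f"handled {request}"


class ServiceDecorator(Service):
    """Conforms to Service and forwards to the wrapped component."""

    def __init__(self, inner: Service) -> None:
        self._inner = inner

    def handle(self, request: str) -> str:
        return self._inner.handle(request)


class LoggingDecorator(ServiceDecorator):
    def __init__(self, inner: Service, log: list) -> None:
        super().__init__(inner)
        self._log = log

    def handle(self, request: str) -> str:
        self._log.append(f"log:{request}")  # behavior added before delegating
        return super().handle(request)


class AuthDecorator(ServiceDecorator):
    def handle(self, request: str) -> str:
        if not request.startswith("token:"):
            raise PermissionError("missing token")
        return super().handle(request)


real = Service()
log: list = []
wrapped = AuthDecorator(LoggingDecorator(real, log))
result = wrapped.handle("token:ping")
same_object = wrapped is real  # identity is not preserved by wrapping
```

Because `wrapped` and `real` are distinct objects, any code that keys on identity (for example a registry of live services) should store one or the other consistently.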
---
## Facade
### Intent
Provide a unified interface to a set of interfaces in a subsystem. Facade defines a higher-level interface that makes the subsystem easier to use.
### Applicability
Use Facade when:
- You want to provide a simple interface to a complex subsystem. Subsystems often get more complex as they evolve. Most patterns, when applied, result in more and smaller classes. This makes the subsystem more reusable and easier to customize, but also harder to use for clients that don't need to customize it. A facade can provide a simple default view of the subsystem that is good enough for most clients.
- There are many dependencies between clients and the implementation classes of an abstraction. Introduce a facade to decouple the subsystem from clients and other subsystems, thereby promoting subsystem independence and portability.
- You want to layer your subsystems. Use a facade to define an entry point to each subsystem level. If subsystems are dependent, you can simplify the dependencies between them by making them communicate with each other solely through their facades.
### Facade Does Not Encapsulate the Subsystem
Unlike other structural patterns that fully hide the wrapped object, Facade does not prevent clients from using subsystem classes directly if they need to. Facade only provides a convenient default — clients needing advanced customization can bypass the Facade and work with subsystem classes directly.
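A small sketch of that "convenient default, not a wall" property, assuming a toy three-class subsystem (`Scanner`, `Parser`, `Evaluator`) and a `Calculator` facade; all four names are invented for illustration.

```python
class Scanner:
    def tokenize(self, source: str) -> list:
        return source.split()


class Parser:
    def parse(self, tokens: list) -> dict:
        return {"op": tokens[1], "args": [int(tokens[0]), int(tokens[2])]}


class Evaluator:
    def evaluate(self, tree: dict) -> int:
        left, right = tree["args"]
        return left + right if tree["op"] == "+" else left - right


class Calculator:
    """Facade: a single call that covers the common case."""

    def __init__(self) -> None:
        self._scanner = Scanner()
        self._parser = Parser()
        self._evaluator = Evaluator()

    def run(self, source: str) -> int:
        tokens = self._scanner.tokenize(source)
        return self._evaluator.evaluate(self._parser.parse(tokens))


# Most clients use the facade only.
simple = Calculator().run("2 + 3")

# Clients needing customization can still reach subsystem classes directly.
tokens = Scanner().tokenize("7 - 4")
```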
### Disambiguation: Facade vs Adapter (Summary)
| | Facade | Adapter |
|--|--------|---------|
| Interface it provides | A new, simpler interface that didn't exist before | An existing interface that another class should satisfy |
| What it wraps | A subsystem (multiple classes) | A single class (or class hierarchy) |
| Goal | Simplify usage | Resolve incompatibility |
| Pre-existing target interface? | No — Facade creates the interface | Yes — Adapter maps to an existing interface |
---
## Flyweight
### Intent
Use sharing to support large numbers of fine-grained objects efficiently.
### The AND-Gate: All Five Conditions Must Hold
The Flyweight pattern's effectiveness depends heavily on how and where it's used. Apply Flyweight only when **all** of the following are true:
1. **Large number of objects** — the application creates and manages thousands or more instances
2. **High storage cost** — memory consumption from this quantity of objects is measurably significant
3. **Most state is extrinsic** — the majority of each object's state can be computed or passed in from outside, rather than stored per-instance
4. **Many objects can be replaced by few shared objects** — once extrinsic state is separated out, many "different" objects turn out to be identical in their intrinsic state
5. **No object identity dependency** — the application does not use identity tests (`==`, reference equality) to distinguish objects that represent conceptually different entities
**Condition 5 is the most commonly missed.** Flyweight objects are shared — multiple logical "instances" map to the same physical object. If any code does:
```python
character_a1 = flyweight_factory.get('a')
character_a2 = flyweight_factory.get('a')
assert character_a1 is not character_a2 # FAILS — they're the same object
```
or relies on object identity for tracking, ordering, or comparison, Flyweight will produce silent correctness bugs that are difficult to diagnose.
### Intrinsic vs Extrinsic State
| | Intrinsic state | Extrinsic state |
|--|----------------|----------------|
| Stored in | The flyweight object itself | Passed to flyweight methods by the client |
| Shared? | Yes — all clients sharing this flyweight use the same intrinsic state | No — different for each context |
| Example (character glyph) | Character code ('a', 'b', etc.) | Position, font, color in the document |
Separating intrinsic from extrinsic state is the central design task when implementing Flyweight.
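One way to picture the split, following the character-glyph row of the table: the flyweight stores only the character code, and position and font arrive as arguments. `Glyph` and `GlyphFactory` are hypothetical names for this sketch.

```python
class Glyph:
    """Flyweight: stores only intrinsic state (the character code)."""

    def __init__(self, char: str) -> None:
        self.char = char

    def draw(self, x: int, y: int, font: str) -> str:
        # Extrinsic state is passed in per call and never stored.
        return f"{self.char}@({x},{y}) in {font}"


class GlyphFactory:
    """Hands out one shared Glyph per distinct character."""

    def __init__(self) -> None:
        self._pool: dict = {}

    def get(self, char: str) -> Glyph:
        if char not in self._pool:
            self._pool[char] = Glyph(char)
        return self._pool[char]

    def __len__(self) -> int:
        return len(self._pool)


factory = GlyphFactory()
text = "banana"
rendered = [factory.get(c).draw(i, 0, "serif") for i, c in enumerate(text)]
pool_size = len(factory)
```

Six logical characters are drawn here, but only three `Glyph` objects exist, one per distinct character.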
---
## Proxy
### Intent
Provide a surrogate or placeholder for another object to control access to it.
### Four Types of Proxy
1. **Remote proxy** — provides a local representative for an object in a different address space. The proxy handles network communication transparently; clients see a local interface.
2. **Virtual proxy** — creates expensive objects on demand. The proxy stands in until the real object is actually needed, then creates and delegates to it (lazy initialization). Example: ImageProxy that shows a placeholder until the image is loaded.
3. **Protection proxy** — controls access to the original object based on access rights. Useful when objects should have different access rights.
4. **Smart reference** — replacement for a bare pointer that performs additional actions when an object is accessed: reference counting, locking, logging, caching.
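The virtual proxy (type 2) is the easiest to sketch. The example below assumes a hypothetical `Image` class whose construction stands in for an expensive load, and counts constructions to show the deferral.

```python
from typing import Optional


class Image:
    """Real subject: expensive to create (loading is only simulated here)."""

    load_count = 0

    def __init__(self, path: str) -> None:
        Image.load_count += 1
        self.path = path

    def render(self) -> str:
        return f"pixels of {self.path}"


class ImageProxy:
    """Virtual proxy: same render() interface, creates the Image on first use."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._image: Optional[Image] = None

    def render(self) -> str:
        if self._image is None:
            self._image = Image(self._path)  # lazy initialization
        return self._image.render()


proxy = ImageProxy("photo.png")
loads_before = Image.load_count   # nothing has been loaded yet
first = proxy.render()
second = proxy.render()
loads_after = Image.load_count    # loaded once, reused afterwards
```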
### Disambiguation: Proxy vs Decorator
This is the most subtle distinction among structural patterns.
> From the GoF (p.207): "Like Decorator, the Proxy pattern composes an object and provides an identical interface to clients. Unlike Decorator, the Proxy pattern is not concerned with attaching or detaching properties dynamically, and it's not designed for recursive composition."
| Axis | Proxy | Decorator |
|------|-------|-----------|
| **Purpose** | Control or mediate access to the Subject | Add responsibilities to the Component |
| **Who defines core behavior** | The Subject — Proxy only provides/refuses access to it | Both Component and Decorators — Decorator adds to what Component provides |
| **Recursive composition** | Not designed for it — relationship is static one-to-one | Essential to the pattern — decorators stack |
| **Known at compile time** | Often — the proxy-subject relationship is frequently fixed | Typically no — decoration applied at runtime |
| **Can be stacked** | Not the intent — multiple proxies can exist but each is independent | Yes — stacking is the primary value proposition |
**Decision rule:** If you need one wrapper that gates, defers, or monitors access to one object → Proxy. If you need to stack multiple wrappers that each add a piece of behavior → Decorator.
### Proxy vs Adapter
Both Proxy and Adapter wrap an object. The difference:
- **Proxy** implements the *same interface* as its Subject — it is a transparent stand-in
- **Adapter** implements a *different interface* than its Adaptee — it is an interface translator
---
## Cross-Pattern Relationships Summary
| Pattern pair | Relationship | When both are used together |
|-------------|-------------|----------------------------|
| Composite + Decorator | Complementary | Build a tree of objects (Composite), then add behaviors to individual nodes (Decorator) without modifying the tree structure |
| Composite + Iterator | Complementary | Define a Composite tree structure, use Iterator to traverse it |
| Decorator + Strategy | Alternatives for behavioral extension | Decorator changes an object's *skin* (wraps it); Strategy changes an object's *guts* (replaces its algorithm). Use Decorator when the behavior wraps/extends the component's own behavior; use Strategy when the component's behavior needs to be replaced entirely |
| Facade + Singleton | Often combined | A Facade is often implemented as a Singleton — one unified entry point per subsystem |
| Proxy + Decorator | Can be combined (carefully) | A proxy-decorator adds functionality to a proxy, or a decorator-proxy embellishes a remote object. The GoF notes this is possible but recommends treating them separately unless needed |
| Adapter + Factory Method | Common combination | Use Factory Method to instantiate the right Adapter based on runtime configuration |
| Bridge + Abstract Factory | Common combination | Use Abstract Factory to create the ConcreteImplementor for a Bridge |
---
## Quick Disambiguation Card
```
STRUCTURAL PROBLEM
│
├─ Incompatible interface?
│   ├─ Classes already designed, incompatibility unforeseen? → ADAPTER
│   └─ Designing now, need long-term independent variation? → BRIDGE
│
├─ Need part-whole tree (nodes and leaves behave the same)? → COMPOSITE
│
├─ Need to add/remove behaviors dynamically without subclassing?
│   ├─ Need to STACK multiple behaviors? → DECORATOR
│   └─ Need ONE gateway (access control, lazy init, remote)? → PROXY
│
├─ Complex subsystem needs a simple unified entry point? → FACADE
│   └─ (Note: Facade creates NEW interface; Adapter reuses EXISTING interface)
│
└─ Thousands of fine-grained objects, memory is the problem,
   ALL 5 conditions hold (including no identity dependency)? → FLYWEIGHT
```
Implement the Strategy pattern to encapsulate a family of interchangeable algorithms behind a common interface. Use when you have multiple conditional branch...
---
name: strategy-pattern-implementor
description: |
Implement the Strategy pattern to encapsulate a family of interchangeable algorithms behind a common interface. Use when you have multiple conditional branches selecting between algorithm variants, when algorithms need different space-time trade-offs, when a class has multiple behaviors expressed as conditionals, or when you need to swap algorithms at runtime. Detects the key code smell — switch/if-else chains selecting behavior — and refactors to Strategy objects.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/design-patterns-gof/skills/strategy-pattern-implementor
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- behavioral-pattern-selector
source-books:
- id: design-patterns-gof
title: "Design Patterns: Elements of Reusable Object-Oriented Software"
authors: ["Erich Gamma", "Richard Helm", "Ralph Johnson", "John Vlissides"]
chapters: [5]
pages: [48-50, 292-300]
tags: [design-patterns, behavioral, gof, strategy, refactoring, algorithms, composition]
execution:
tier: 2
mode: full
inputs:
- type: code
description: "Existing code with conditional branches selecting between algorithm variants, or a description of the algorithmic variation that needs encapsulation"
tools-required: [TodoWrite, Read]
tools-optional: [Grep, Edit]
mcps-required: []
environment: "Any codebase. Language-agnostic — examples use Python but the pattern applies universally."
---
# Strategy Pattern Implementor
## When to Use
You have a class that selects between algorithm variants and need to refactor it so algorithms can vary independently from the clients that use them. This skill applies when:
- A class contains `switch` statements or `if/else` chains that select between behavioral variants — **the primary code smell**
- Multiple related classes differ only in their behavior (same structure, different computation)
- You need different space-time trade-offs for the same operation (fast-and-approximate vs. slow-and-precise)
- An algorithm uses data clients should not know about (complex internal data structures that would leak through the interface)
- You need to swap the algorithm at runtime without changing the client
Before starting, confirm:
- Is this a **client-selected** algorithm switch (Strategy) rather than an **object self-transitioning** based on internal state (State)? If unsure, use `behavioral-pattern-selector` first.
- Does the variation belong in the class itself, or is it better expressed via a superclass skeleton with subclass steps (Template Method via inheritance)? Strategy uses composition — the algorithm is a separate object, not a subclass.
---
## Process
### Step 1: Set Up Tracking and Identify the Conditional
**ACTION:** Use `TodoWrite` to track progress, then locate the conditional code that signals the need for Strategy.
**WHY:** The conditional block is the concrete artifact you are eliminating. Identifying it precisely — its location, what it selects between, and what data it touches — drives every subsequent decision about interface design and data passing. Without pinning this down first, the refactoring risks producing a Strategy structure that does not fit the actual variation.
```
TodoWrite:
- [ ] Step 1: Locate the conditional and characterize the variation
- [ ] Step 2: Define the Strategy interface
- [ ] Step 3: Implement ConcreteStrategy classes
- [ ] Step 4: Refactor the Context class
- [ ] Step 5: Validate and apply optional optimizations
```
Locate the conditional using `Read` or `Grep`, then document:
| Element | What to capture |
|---------|----------------|
| **Location** | File, class, method where the conditional lives |
| **Variants** | How many branches? What does each compute? |
| **Data required** | What data does each branch use from the context? |
| **Selection point** | Who chooses the variant — the caller, a config value, or a runtime condition? |
| **Change frequency** | How often are new variants added? Does this class change every time? |
If working from a description rather than existing code, draft a minimal "before" sketch showing the conditional structure.
---
### Step 2: Define the Strategy Interface
**ACTION:** Design the abstract Strategy interface — the single method (or small set of methods) that all ConcreteStrategies must implement.
**WHY:** The Strategy interface is the contract that decouples the Context from all algorithm implementations. If it is too narrow (missing data the strategy needs), ConcreteStrategies will have to reach into the Context through back-channels, increasing coupling. If it is too wide (exposing data no strategy uses), the communication overhead between Strategy and Context grows. The interface must be general enough that you will not have to change it when adding new ConcreteStrategies — this is the central design decision of the pattern.
**Two data-passing approaches — choose one:**
| Approach | How it works | When to prefer |
|----------|-------------|----------------|
| **Take data to the strategy** | Context passes all needed data as parameters to the strategy method | Algorithm needs a bounded, well-known set of inputs; keeps Strategy and Context decoupled |
| **Pass context as argument** | Strategy receives the Context object itself and requests what it needs | Algorithm needs variable or large amounts of context data; reduces parameter proliferation but couples Strategy to Context's interface |
Design the interface method signature:
```python
# Approach 1: explicit parameters (preferred when the data set is bounded)
class SortStrategy:
    def sort(self, data: list) -> list:
        raise NotImplementedError


# Approach 2: context reference (an alternative definition; pick one of the two)
class SortStrategy:
    def sort(self, context: "Sorter") -> None:
        raise NotImplementedError
```
**Check:** Will a new ConcreteStrategy ever need data this interface does not provide? If yes, revise — you should not need to change the interface when adding strategies.
---
### Step 3: Implement ConcreteStrategy Classes
**ACTION:** Extract each conditional branch into its own ConcreteStrategy class that implements the Strategy interface.
**WHY:** Each branch in the original conditional is an independently varying algorithm. Making each one a class gives it a name (making the code self-documenting), an isolated scope for its data structures (preventing algorithm-specific complexity from polluting the Context), and a testable boundary (each strategy can be tested in isolation without instantiating the full Context).
For each branch in the original conditional:
1. Create a class named after the algorithm it encapsulates (not after what it replaces — use domain terminology, e.g., `BubbleSortStrategy`, not `OldSortBranch`)
2. Implement the interface method with the branch's logic
3. Move any algorithm-specific private data into the ConcreteStrategy (not the Context)
```python
class BubbleSortStrategy(SortStrategy):
    """Simple sort: O(n²), stable. Sorts a copy, so the caller's list is untouched."""

    def sort(self, data: list) -> list:
        result = data[:]
        n = len(result)
        for i in range(n):
            for j in range(0, n - i - 1):
                if result[j] > result[j + 1]:
                    result[j], result[j + 1] = result[j + 1], result[j]
        return result


class QuickSortStrategy(SortStrategy):
    """Fast sort: O(n log n) average, not stable. Partitions a copy in place."""

    def sort(self, data: list) -> list:
        result = data[:]
        self._quicksort(result, 0, len(result) - 1)
        return result

    def _quicksort(self, arr, low, high):
        if low < high:
            pi = self._partition(arr, low, high)
            self._quicksort(arr, low, pi - 1)
            self._quicksort(arr, pi + 1, high)

    def _partition(self, arr, low, high):
        # Lomuto partition: the last element is the pivot
        pivot = arr[high]
        i = low - 1
        for j in range(low, high):
            if arr[j] <= pivot:
                i += 1
                arr[i], arr[j] = arr[j], arr[i]
        arr[i + 1], arr[high] = arr[high], arr[i + 1]
        return i + 1
```
**Stateless strategies as Flyweights:** If a ConcreteStrategy holds no instance state (all data comes from parameters), it can be shared across multiple Context instances. A single shared instance replaces per-context allocation. This optimization applies when strategies are lightweight and stateless — do not force it if the strategy needs per-invocation state.
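A minimal sketch of that sharing, assuming a hypothetical stateless `UppercaseStrategy` and a module-level instance named `SHARED_UPPERCASE` (both invented for this illustration):

```python
class UppercaseStrategy:
    """Stateless: everything it needs arrives through parameters."""

    def apply(self, text: str) -> str:
        return text.upper()


# One instance for the whole module instead of one per context.
SHARED_UPPERCASE = UppercaseStrategy()


class Formatter:
    """Context that reuses the shared strategy unless told otherwise."""

    def __init__(self, strategy: UppercaseStrategy = SHARED_UPPERCASE) -> None:
        self._strategy = strategy

    def format(self, text: str) -> str:
        return self._strategy.apply(text)


a, b = Formatter(), Formatter()
shared = a._strategy is b._strategy  # both contexts hold the same object
```

Sharing is safe here only because `apply()` reads and writes no instance attributes; a strategy that caches per-call data on `self` would need its own instance per context.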
---
### Step 4: Refactor the Context Class
**ACTION:** Replace the conditional in the Context with a reference to a Strategy object and a delegation call.
**WHY:** This is the core structural change. The Context no longer *knows* how to perform the algorithm — it only knows *that* an algorithm exists and how to invoke it. The conditional disappears entirely, replaced by a single delegation call. The Context becomes open to extension (new strategies) without modification, satisfying the open-closed principle.
**Refactoring steps:**
1. Add a `strategy` attribute (initialized via constructor or setter)
2. Replace the entire conditional block with a single call to `self.strategy.algorithm_method(...)`
3. Remove any algorithm-specific data that has moved into ConcreteStrategy classes
4. Optionally provide a default strategy so clients that do not care about the choice are unaffected
```python
from typing import Optional


# BEFORE (conditional in Context)
class DataProcessor:
    def __init__(self, algorithm: str):
        self._algorithm = algorithm

    def process(self, data: list) -> list:
        if self._algorithm == "bubble":
            # bubble sort logic, 10 lines
            ...
        elif self._algorithm == "quick":
            # quicksort logic, 20 lines
            ...
        elif self._algorithm == "merge":
            # merge sort logic, 25 lines
            ...


# AFTER (Strategy delegation)
class DataProcessor:
    def __init__(self, strategy: Optional[SortStrategy] = None):
        # Default strategy makes Strategy optional for clients
        self._strategy = strategy if strategy is not None else BubbleSortStrategy()

    def set_strategy(self, strategy: SortStrategy) -> None:
        """Swap algorithm at runtime without reconstructing the context."""
        self._strategy = strategy

    def process(self, data: list) -> list:
        return self._strategy.sort(data)
```
**Making Strategy optional:** If it is meaningful for the Context to have no strategy, check before delegating and execute default behavior when none is set. This keeps clients that do not need strategy selection from having to deal with Strategy objects at all.
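A sketch of the "no strategy set" variant, under the assumption that ascending order is the sensible built-in default; `ReverseSortStrategy` is a hypothetical strategy added only to show the contrast.

```python
from typing import Optional


class SortStrategy:
    def sort(self, data: list) -> list:
        raise NotImplementedError


class ReverseSortStrategy(SortStrategy):
    def sort(self, data: list) -> list:
        return sorted(data, reverse=True)


class DataProcessor:
    def __init__(self, strategy: Optional[SortStrategy] = None) -> None:
        self._strategy = strategy

    def process(self, data: list) -> list:
        if self._strategy is None:
            # Default behavior: no Strategy object is involved at all.
            return sorted(data)
        return self._strategy.sort(data)


plain = DataProcessor().process([3, 1, 2])
custom = DataProcessor(ReverseSortStrategy()).process([3, 1, 2])
```

Compared with installing a default strategy object, this keeps the default path free of any Strategy class, at the cost of one conditional in the Context.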
---
### Step 5: Validate and Apply Optional Optimizations
**ACTION:** Verify the refactoring is complete, then apply optimizations if warranted.
**WHY:** The refactoring is only complete when the original conditional is fully gone and all variants are covered. Partial migration — leaving one branch in the Context "for now" — defeats the pattern and re-introduces the code smell. Validation catches this before new code is written against the incomplete structure.
**Validation checklist:**
- [ ] The original conditional block is entirely removed from the Context
- [ ] Every branch of the original conditional has a corresponding ConcreteStrategy
- [ ] The Context invokes the algorithm only through the Strategy interface; any reference to a ConcreteStrategy is limited to an optional default or a selection registry
- [ ] Each ConcreteStrategy can be instantiated and tested independently
- [ ] Adding a new algorithm requires only a new class, not a change to the Context
**Optional optimizations:**
| Optimization | When to apply |
|--------------|--------------|
| **Flyweight sharing** | ConcreteStrategies are stateless — share one instance across all Contexts instead of allocating per-Context |
| **Default strategy** | Most clients use one algorithm — set it as default so callers who do not care are unaffected |
| **Factory for selection** | Strategy selection logic is complex — extract it into a factory rather than leaving it scattered in callers |
| **Abstract base with shared logic** | Multiple ConcreteStrategies share common steps — add a base ConcreteStrategy class with shared implementation (not the interface) |
**Drawback awareness — address before finishing:**
- **Client awareness:** Clients must now know which ConcreteStrategy to instantiate. If this knowledge is non-trivial (depends on configuration, environment, or runtime data), introduce a factory or builder to centralize selection logic.
- **Communication overhead:** The Strategy interface is shared by all ConcreteStrategies regardless of complexity. Simple strategies may receive parameters they do not use (e.g., `ArrayCompositor` in the Lexi case ignores stretchability data). If this overhead is significant, consider tighter coupling between specific strategies and the context — but accept that this reduces generality.
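For the client-awareness drawback, a selection factory might look like the sketch below. The function name `choose_sort_strategy`, the size threshold of 16, and both strategy classes are assumptions made for illustration.

```python
class SortStrategy:
    def sort(self, data: list) -> list:
        raise NotImplementedError


class InsertionSortStrategy(SortStrategy):
    """Low overhead; reasonable for short inputs."""

    def sort(self, data: list) -> list:
        result: list = []
        for item in data:
            i = len(result)
            # Walk left past every element larger than the new item.
            while i > 0 and result[i - 1] > item:
                i -= 1
            result.insert(i, item)
        return result


class BuiltinSortStrategy(SortStrategy):
    def sort(self, data: list) -> list:
        return sorted(data)


def choose_sort_strategy(size: int, small_threshold: int = 16) -> SortStrategy:
    """Hypothetical factory: the only place that knows the selection rule."""
    if size <= small_threshold:
        return InsertionSortStrategy()
    return BuiltinSortStrategy()


data = [3, 1, 2]
strategy = choose_sort_strategy(len(data))
ordered = strategy.sort(data)
```

Callers ask the factory for a strategy and never name a ConcreteStrategy themselves, so changing the selection rule touches one function.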
---
## Examples
### Example 1: Payment Processing (Classic Conditional to Strategy)
**Scenario:** An e-commerce checkout has a `PaymentService` with a large `if/else` chain selecting between credit card, PayPal, and bank transfer logic. A new payment method is requested every quarter, and each addition requires modifying `PaymentService` directly.
**Trigger:** "Every time we add a payment provider, we touch the same `process_payment` method and it keeps growing."
**Before (the code smell):**
```python
class PaymentService:
    def process_payment(self, amount: float, method: str, details: dict):
        if method == "credit_card":
            # validate card, charge via Stripe, handle 3DS...
            ...
        elif method == "paypal":
            # OAuth flow, PayPal API call, webhook...
            ...
        elif method == "bank_transfer":
            # IBAN validation, SEPA/ACH routing...
            ...
        else:
            raise ValueError(f"Unknown method: {method}")
```
**After (Strategy):**
```python
from dataclasses import dataclass


@dataclass
class PaymentResult:
    """Placeholder result type for this sketch; the real fields are not specified."""
    success: bool


class PaymentStrategy:
    def process(self, amount: float, details: dict) -> PaymentResult:
        raise NotImplementedError


class CreditCardStrategy(PaymentStrategy):
    def process(self, amount: float, details: dict) -> PaymentResult:
        # Stripe integration, 3DS, card validation
        ...


class PayPalStrategy(PaymentStrategy):
    def process(self, amount: float, details: dict) -> PaymentResult:
        # OAuth, PayPal API, webhook handling
        ...


class BankTransferStrategy(PaymentStrategy):
    def process(self, amount: float, details: dict) -> PaymentResult:
        # IBAN validation, SEPA/ACH routing
        ...


class PaymentService:
    def __init__(self, strategy: PaymentStrategy):
        self._strategy = strategy

    def process_payment(self, amount: float, details: dict) -> PaymentResult:
        return self._strategy.process(amount, details)


# Caller selects the strategy; PaymentService never changes
service = PaymentService(CreditCardStrategy())
result = service.process_payment(99.99, {"card_number": "..."})
```
**Output:** `PaymentService` is closed to modification. Adding `CryptoStrategy` requires zero changes to existing code.
---
### Example 2: Lexi Document Formatter (GoF Case Study)
**Scenario:** Lexi, a WYSIWYG document editor, must break text into lines. Several algorithms exist: a simple greedy line-breaker, a full TeX-quality algorithm optimizing paragraph-level color, and a fixed-interval breaker for icon grids. The algorithms have different speed-quality trade-offs and the user may switch between them.
**Trigger:** "The formatting algorithm needs to change at runtime depending on document type, and we need to add new algorithms without touching the document structure."
**Design (from GoF, pages 48-50 and 296-298):**
```
Context: Composition (holds content glyphs, line width)
Strategy: Compositor (abstract — Compose() interface)
ConcreteA: SimpleCompositor — greedy, line-at-a-time, fast
ConcreteB: TeXCompositor — paragraph-at-a-time, optimizes whitespace color
ConcreteC: ArrayCompositor — fixed interval, for icon grids
```
The Compositor interface takes all layout data as explicit parameters (Approach 1 — data to the strategy):
```cpp
virtual int Compose(
    Coord natural[], Coord stretch[], Coord shrink[],
    int componentCount, int lineWidth, int breaks[]
) = 0;
```
This interface is wide enough to support all three algorithms. `SimpleCompositor` ignores stretchability; `ArrayCompositor` ignores everything except component count. This is the **communication overhead trade-off** — some ConcreteStrategies receive data they do not use. The GoF accept this because the alternative (passing `this` as a context reference) would couple Compositor subclasses to Composition's full interface.
**Runtime swap:**
```cpp
// Switching quality at runtime — no reconstruction of Composition
composition->SetCompositor(new TeXCompositor());
composition->Repair(); // delegates to compositor->Compose(...)
```
**Output:** Adding a new linebreaking algorithm is a new `Compositor` subclass. Neither `Composition` nor the glyph classes are ever modified.
---
### Example 3: Report Exporter with Runtime Selection
**Scenario:** A reporting tool exports data as CSV, JSON, or PDF. The format is determined by user selection at runtime. Currently the export logic lives in a single `export()` method with a `format` string parameter driving a conditional.
**Trigger:** "Users can choose the export format from a dropdown — the format isn't known until they click Export."
**Strategy interface (data to strategy, Approach 1):**
```python
import csv
import json


class ExportStrategy:
    def export(self, records: list[dict], output_path: str) -> None:
        raise NotImplementedError


class CsvExportStrategy(ExportStrategy):
    def export(self, records: list[dict], output_path: str) -> None:
        # An empty record list has no first row to take column names from
        fieldnames = list(records[0].keys()) if records else []
        with open(output_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(records)


class JsonExportStrategy(ExportStrategy):
    def export(self, records: list[dict], output_path: str) -> None:
        with open(output_path, "w") as f:
            json.dump(records, f, indent=2)


class PdfExportStrategy(ExportStrategy):
    def export(self, records: list[dict], output_path: str) -> None:
        # reportlab or weasyprint integration
        ...


class ReportExporter:
    # Maps the UI's format key to the strategy class to instantiate
    _STRATEGIES = {
        "csv": CsvExportStrategy,
        "json": JsonExportStrategy,
        "pdf": PdfExportStrategy,
    }

    def __init__(self):
        self._strategy: ExportStrategy = CsvExportStrategy()  # default

    def set_format(self, format_key: str) -> None:
        """Runtime swap, called when the user selects a format in the UI."""
        cls = self._STRATEGIES.get(format_key)
        if cls is None:
            raise ValueError(f"Unknown format: {format_key}")
        self._strategy = cls()

    def export(self, records: list[dict], output_path: str) -> None:
        self._strategy.export(records, output_path)
```
**Key decision:** The `_STRATEGIES` registry centralizes selection logic in the Context, keeping callers simple. An alternative is to move this registry to a factory, which is preferable when selection logic becomes complex (e.g., involving feature flags or user tier).
**Output:** The UI dropdown maps to `set_format()`. Adding XML export is a new `XmlExportStrategy` class plus one entry in `_STRATEGIES` — the rest of the system is unchanged.
---
## Strategy vs. Template Method — When to Use Which
Both patterns encapsulate algorithmic variation. The choice is **composition vs. inheritance**:
| Dimension | Strategy | Template Method |
|-----------|----------|----------------|
| Mechanism | Composition — context holds a strategy object | Inheritance — subclass overrides abstract steps |
| Runtime swap | Yes — install a different strategy object | No — fixed at class definition |
| Granularity | Entire algorithm is replaced | Only specific steps are replaced; skeleton is fixed |
| Class count | One class per algorithm variant | One subclass per algorithm variant |
| Coupling | Strategy and Context are loosely coupled | Subclass is tightly coupled to superclass skeleton |
| Evolution path | Start with Template Method → migrate to Strategy when inheritance becomes limiting | — |
**Choose Strategy when** the algorithm must be swappable at runtime, when you do not control the class hierarchy, or when the number of variants is large and growing.
**Choose Template Method when** the algorithm skeleton is fixed and only specific steps vary, and when compile-time binding is acceptable.
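The contrast in the table can be sketched side by side. The report classes below are hypothetical; the only point is where the varying step lives and whether it can be swapped on a live object.

```python
from abc import ABC, abstractmethod


# Template Method: the skeleton lives in the base class; a subclass fills in a step.
class Report(ABC):
    def render(self, rows: list) -> str:
        return self.header() + "\n" + "\n".join(rows)

    @abstractmethod
    def header(self) -> str: ...


class SalesReport(Report):
    def header(self) -> str:
        return "SALES"


# Strategy: the varying step is a separate object held by the context.
class HeaderStrategy(ABC):
    @abstractmethod
    def header(self) -> str: ...


class SalesHeader(HeaderStrategy):
    def header(self) -> str:
        return "SALES"


class AuditHeader(HeaderStrategy):
    def header(self) -> str:
        return "AUDIT"


class FlexibleReport:
    def __init__(self, strategy: HeaderStrategy) -> None:
        self.strategy = strategy

    def render(self, rows: list) -> str:
        return self.strategy.header() + "\n" + "\n".join(rows)


fixed = SalesReport().render(["a", "b"])

flexible = FlexibleReport(SalesHeader())
before_swap = flexible.render(["a"])
flexible.strategy = AuditHeader()  # runtime swap on the same context instance
after_swap = flexible.render(["a"])
```

A `SalesReport` instance cannot become an audit report without constructing a different subclass, whereas `flexible` changes behavior by replacing one attribute.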
---
## Reference Files
| File | Contents |
|------|----------|
| `references/strategy-implementation-guide.md` | Participants, full consequences catalog, interface design decision tree, Flyweight optimization, language-specific notes |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-behavioral-pattern-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/strategy-implementation-guide.md
# Strategy Pattern — Implementation Guide
Full reference for the `strategy-pattern-implementor` skill. Contains the complete participants catalog, all 7 consequences, the interface design decision tree, Flyweight optimization details, and language-specific implementation notes.
---
## Participants
### Strategy (e.g., `Compositor`, `SortStrategy`, `PaymentStrategy`)
- Declares an interface common to all supported algorithms
- Context uses this interface exclusively — it never imports a ConcreteStrategy directly
- The interface method signature is the central design decision (see Interface Design section below)
### ConcreteStrategy (e.g., `SimpleCompositor`, `TeXCompositor`, `QuickSortStrategy`)
- Implements the algorithm using the Strategy interface
- Each class encapsulates one algorithm and its algorithm-specific data
- ConcreteStrategies do not know about each other (unlike State, where state objects may initiate transitions to siblings)
### Context (e.g., `Composition`, `DataProcessor`, `PaymentService`)
- Configured with a ConcreteStrategy object at construction or via a setter
- Maintains a reference to a Strategy object
- May define an interface that lets Strategy access its data (relevant for Approach 2 — passing context as argument)
- Forwards algorithm execution to the Strategy; does not implement the algorithm itself
---
## Collaborations
- Strategy and Context interact to implement the chosen algorithm
- A context may pass all required data to the strategy when the algorithm is called (Approach 1), or pass itself as an argument and let the strategy request what it needs (Approach 2)
- Clients usually create and pass a ConcreteStrategy object to the context; thereafter, clients interact with the context exclusively
- There is often a family of ConcreteStrategy classes for a client to choose from — a factory or registry can centralize this selection
---
## All 7 Consequences
### Benefits
**1. Families of related algorithms**
Strategy hierarchies define a reusable family of algorithms or behaviors. Inheritance can factor out common functionality of the algorithms into a shared base ConcreteStrategy class (not the interface). This keeps common logic out of both the Context and the individual concrete implementations.
**2. An alternative to subclassing**
Without Strategy, you would subclass the Context directly to give it different behaviors (e.g., `BubbleSortDataProcessor`, `QuickSortDataProcessor`). This hard-wires the algorithm into the Context: it mixes algorithm implementation with Context logic, makes Context harder to understand and extend, and prevents runtime algorithm swapping. You end up with many related subclasses whose only difference is the algorithm they employ. Strategy via composition avoids all of this.
**3. Strategies eliminate conditional statements**
When different behaviors are lumped into one class, conditional statements to select the right behavior are unavoidable. Encapsulating behaviors in separate Strategy classes eliminates these conditionals entirely.
Example — without Strategy (the code smell):
```cpp
void Composition::Repair() {
    switch (_breakingStrategy) {
    case SimpleStrategy:
        ComposeWithSimpleCompositor();
        break;
    case TeXStrategy:
        ComposeWithTeXCompositor();
        break;
    // ...
    }
    // merge results with existing composition, if necessary
}
```
With Strategy (the refactored form):
```cpp
void Composition::Repair() {
    _compositor->Compose();
    // merge results with existing composition, if necessary
}
```
Code containing many conditional statements often indicates the need to apply the Strategy pattern.
**4. A choice of implementations**
Strategies can provide different implementations of the same behavior. The client can choose among strategies with different time and space trade-offs — for example, a fast-but-approximate algorithm vs. a slower but globally optimal one (SimpleCompositor vs. TeXCompositor).
### Drawbacks
**5. Clients must be aware of different Strategies**
The pattern has a potential drawback: a client must understand how Strategies differ before it can select the appropriate one. Clients may be exposed to implementation details. Therefore, use Strategy only when the variation in behavior is relevant to clients. If clients do not care about algorithm selection and a sensible default exists, provide a default strategy so they never have to deal with strategy objects at all.
**6. Communication overhead between Strategy and Context**
The Strategy interface is shared by all ConcreteStrategy classes whether the algorithms they implement are trivial or complex. Hence it is likely that some ConcreteStrategies will not use all the information passed to them through this interface — simple ConcreteStrategies may use none of it.
In the Lexi case, `SimpleCompositor` ignores stretchability data; `ArrayCompositor` ignores everything. This means the context creates and initializes parameters that never get used. If this overhead is an issue, tighter coupling between Strategy and Context is needed — but this reduces the generality of the pattern.
**7. Increased number of objects**
Strategies increase the number of objects in an application. Sometimes you can reduce this overhead by implementing strategies as stateless objects that contexts can share (see Flyweight Optimization below). Any residual state is maintained by the context, which passes it in each request to the Strategy object.
---
## Interface Design Decision Tree
Use this to choose between Approach 1 (pass data as parameters) and Approach 2 (pass context reference):
```
Does the strategy need a bounded, well-known set of inputs?
├── YES → Use Approach 1: explicit parameters
│         Keeps Strategy and Context decoupled.
│         Context passes data[], counts, and output buffers.
│         Risk: some ConcreteStrategies receive parameters they don't use.
│
└── NO → Does the data set vary significantly across ConcreteStrategies?
    ├── YES → Use Approach 2: pass context as argument
    │         Strategy calls back on the context to get what it needs.
    │         Context must expose a richer interface to its data.
    │         Risk: tighter coupling between Strategy and Context.
    │
    └── Can Strategy store a reference to Context at construction?
        YES → Initialize strategy with context reference.
              Eliminates per-call parameter passing entirely.
              Strategy can request exactly what it needs on demand.
              Risk: tightest coupling; Strategy is bound to one Context type.
```
**General rule:** The needs of the particular algorithm and its data requirements determine the best technique. Prefer Approach 1 when parameters are bounded and stable. Move to Approach 2 when parameter lists grow unwieldy or vary significantly across strategies.
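The two interface styles can be sketched side by side. This is an illustrative sketch only: `Document`, `GreedyBreaker`, and `OneWordPerLine` are hypothetical names, and the greedy line-breaking rule is a stand-in for a real compositor algorithm.

```python
from typing import List, Protocol


class Document:
    """Hypothetical Context that exposes its data for Approach 2."""

    def __init__(self, words: List[str], line_width: int):
        self.words = words
        self.line_width = line_width


# Approach 1: the context passes everything as explicit parameters.
class ExplicitBreaker(Protocol):
    def break_lines(self, words: List[str], line_width: int) -> List[str]: ...


class GreedyBreaker:
    def break_lines(self, words: List[str], line_width: int) -> List[str]:
        lines, current = [], ""
        for word in words:
            candidate = f"{current} {word}".strip()
            # Keep filling the line while it fits; an over-long first word
            # still gets a line of its own.
            if len(candidate) <= line_width or not current:
                current = candidate
            else:
                lines.append(current)
                current = word
        if current:
            lines.append(current)
        return lines


# Approach 2: the context passes itself and the strategy pulls what it needs.
class OneWordPerLine:
    def break_lines(self, doc: Document) -> List[str]:
        return list(doc.words)  # never reads doc.line_width
```

With Approach 1, `GreedyBreaker` knows nothing about `Document`. With Approach 2, `OneWordPerLine` depends on the `Document` type but never receives data it would ignore.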
---
## Flyweight Optimization
Strategies increase object count. When strategies are stateless — all working data comes from parameters, nothing is stored between calls — they can be shared as Flyweights:
**When it applies:**
- The ConcreteStrategy has no instance variables
- All state is passed in via the interface method parameters
- The same algorithm is used by many Context instances simultaneously
**When it does NOT apply:**
- The ConcreteStrategy stores configuration (e.g., `ArrayCompositor` stores an interval count — this is constructor-injected state, not per-invocation state, but it makes the strategy non-shareable unless all contexts want the same interval)
- The algorithm accumulates state across multiple invocations
**Implementation pattern:**
```python
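import json
from abc import ABC, abstractmethod

class ExportStrategy(ABC):  # minimal interface, assumed for this sketch
    @abstractmethod
    def export(self, records, output_path): ...

# Stateless strategy: safe to share
class JsonExportStrategy(ExportStrategy):
    # No instance variables; all data comes through export()
    def export(self, records, output_path):
        with open(output_path, "w") as f:
            json.dump(records, f, indent=2)

# Key-to-class table; the "json" key is an illustrative assumption
_STRATEGY_CLASSES = {"json": JsonExportStrategy}

# Shared instance registry: one object per algorithm
class StrategyRegistry:
    _instances: dict[str, ExportStrategy] = {}

    @classmethod
    def get(cls, key: str) -> ExportStrategy:
        if key not in cls._instances:
            cls._instances[key] = _STRATEGY_CLASSES[key]()
        return cls._instances[key]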
```
Shared strategies should not maintain state across invocations. The Flyweight pattern (GoF p. 195) describes this approach in detail.
---
## Making Strategy Optional
The Context class can be simplified when it is meaningful to have no strategy. The Context checks for a strategy before delegating; if none is set, it executes default behavior:
```python
from typing import Optional

class DataProcessor:
    # "SortStrategy" is quoted so the annotation does not require the class to be defined here
    def __init__(self, strategy: Optional["SortStrategy"] = None):
        self._strategy = strategy  # None is valid

    def process(self, data: list) -> list:
        if self._strategy is None:
            # Default behavior: no strategy object required
            return sorted(data)
        return self._strategy.sort(data)
```
The benefit: clients who are satisfied with the default behavior never need to deal with Strategy objects at all. Only clients that need non-default behavior need to know strategies exist.
---
## Strategy as Template Parameters (C++ / Generics)
In statically typed languages with generics, Strategy can be expressed as a template/generic parameter rather than a runtime interface:
```cpp
template <class AStrategy>
class Context {
public:  // Operation must be public for clients to call it
    void Operation() { theStrategy.DoAlgorithm(); }
    // ...
private:
    AStrategy theStrategy;
};

// Configured at compile time: no virtual dispatch overhead
Context<MyStrategy> aContext;
```
This approach is applicable only when:
1. The strategy is selected at compile time (not runtime)
2. It does not need to be changed at runtime
The benefit: static binding eliminates virtual dispatch overhead. Strategy and Context are bound at compile time — this can increase efficiency significantly in performance-critical code.
---
## Strategy vs. State — The Critical Distinction
Strategy and State have identical class diagrams. The difference is intent and who initiates the swap:
| Question | Strategy | State |
|----------|----------|-------|
| Who changes the behavior? | The **client** installs a strategy explicitly | The **context object** transitions itself based on internal logic |
| Do concrete variants know about each other? | No — strategies are independent | Yes — state objects often initiate transitions to sibling states |
| Is the variation about selecting an algorithm? | Yes — same interface, different computation | No — about representing a lifecycle that drives behavior |
| Can the same "strategy" be active indefinitely? | Yes | No — states have transition logic |
**Rule:** If the behavioral swap is driven by external selection (caller chooses the algorithm), it is Strategy. If the behavioral swap is driven by the object's internal state machine (the object decides when to change its own behavior), it is State.
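A small sketch may make the rule concrete. The shipping and traffic-light examples are hypothetical and not taken from the source. In the first, the caller picks the behavior; in the second, the object moves itself between states.

```python
# Strategy: the client selects the algorithm; the variants are independent.
def flat_rate(weight_kg: float) -> float:
    return 5.0


def per_kg(weight_kg: float) -> float:
    return 2.0 * weight_kg


class Shipping:
    def __init__(self, cost_fn):
        self.cost_fn = cost_fn  # installed explicitly by the caller

    def cost(self, weight_kg: float) -> float:
        return self.cost_fn(weight_kg)


# State: each state names its successor, and the context transitions itself.
class Green:
    def next(self):
        return Yellow()


class Yellow:
    def next(self):
        return Red()


class Red:
    def next(self):
        return Green()


class TrafficLight:
    def __init__(self):
        self._state = Green()

    def tick(self) -> None:
        self._state = self._state.next()  # swap decided internally

    @property
    def color(self) -> str:
        return type(self._state).__name__
```

`flat_rate` and `per_kg` never reference each other, while `Green`, `Yellow`, and `Red` each know which sibling comes next. That matches the "do concrete variants know about each other?" row in the table.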
---
## Known Uses
- **Lexi document editor (GoF case study):** `Composition`/`Compositor` — formatting algorithm encapsulated as interchangeable Compositor strategies (SimpleCompositor, TeXCompositor, ArrayCompositor)
- **RTL compiler optimizer:** `RegisterAllocator` and `RISCScheduler`/`CISCScheduler` — register allocation and instruction scheduling as strategies, enabling retargeting to different machine architectures
- **ObjectWindows dialog validation:** `Validator` objects encapsulate field validation strategies (`RangeValidator`, custom validators) — attached optionally to entry fields
- **ET++SwapsManager:** Financial instrument pricing strategies — `YieldCurve` subclasses encapsulate discount factor computation strategies, mixed and matched per instrument type
- **Java `Comparator`:** `Collections.sort(list, comparator)` — the Comparator is a Strategy object encapsulating the comparison algorithm
- **Python `key=` parameter in `sorted()`:** A function reference as a lightweight Strategy (first-class function as ConcreteStrategy, no subclass required)
---
## Language-Specific Notes
### Python
- First-class functions eliminate the need for explicit ConcreteStrategy classes in simple cases: `processor.set_strategy(lambda data: sorted(data))` is a valid Strategy
- Use `abc.ABC` and `@abstractmethod` for the Strategy interface; Python then raises `TypeError` when a subclass that does not implement the method is instantiated
- Stateless strategies can be module-level singletons (avoid repeated instantiation)
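The notes above can be sketched as follows. The class names `Incomplete` and `Descending` and the constant `DESCENDING` are hypothetical. Note that the `abc` machinery raises its error when an incomplete subclass is instantiated, not when the class statement runs.

```python
from abc import ABC, abstractmethod


class SortStrategy(ABC):
    @abstractmethod
    def sort(self, data: list) -> list: ...


class Incomplete(SortStrategy):
    pass  # sort() is missing; defining the class still succeeds


class Descending(SortStrategy):
    def sort(self, data: list) -> list:
        return sorted(data, reverse=True)


# Stateless, so a single module-level instance can be shared by all contexts.
DESCENDING = Descending()

# Instantiating the incomplete subclass is what triggers the TypeError.
try:
    Incomplete()
    error = None
except TypeError as exc:
    error = exc
```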
### Java
- Use an interface (not abstract class) for the Strategy unless shared implementation is needed
- Java lambdas (single-method functional interfaces) make lightweight strategies trivial: `sorter.setStrategy(data -> Arrays.sort(data))`
- `java.util.Comparator` is the canonical Strategy in the standard library
### TypeScript / JavaScript
- Strategy interface expressed as a TypeScript interface or type alias with a single call signature
- Functions as strategies are idiomatic: `context.setStrategy((data) => [...data].sort())` (the copy avoids mutating the caller's array, since `Array.prototype.sort` sorts in place)
### Go
- Strategy interface is a Go interface — implicit implementation, no declaration required
- Structs implementing the interface are ConcreteStrategies
- Function types can serve as lightweight strategies for single-method interfaces
Detect common OO design smells in a codebase and recommend corrective design patterns. Use when reviewing code for design quality, investigating why changes...
---
name: oo-design-smell-detector
description: |
Detect common OO design smells in a codebase and recommend corrective design patterns. Use when reviewing code for design quality, investigating why changes are difficult, or auditing coupling and inheritance depth. Scans for 8 categories of design fragility — hardcoded object creation, operation dependencies, platform coupling, representation leaks, algorithm dependencies, tight coupling, subclass explosion, and inability to alter classes — each mapped to specific GoF patterns that fix them. Trigger when the user asks about design review, code smells, design anti-patterns, why a class is hard to change, why adding a feature requires modifying many files, class hierarchy review, inheritance overuse, why changes cascade, tight coupling, poor encapsulation, object creation rigidity, platform-dependent code, algorithmic coupling, or difficulty extending a third-party class.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/design-patterns-gof/skills/oo-design-smell-detector
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
source-books:
- id: design-patterns-gof
title: "Design Patterns: Elements of Reusable Object-Oriented Software"
authors: ["Erich Gamma", "Richard Helm", "Ralph Johnson", "John Vlissides"]
chapters: [1]
pages: [29-31, 34-36, 41]
tags: [design-patterns, design-smells, code-review, oo-design, refactoring, gof-patterns, inheritance, coupling, anti-patterns]
depends-on: []
execution:
tier: 2
mode: full
inputs:
- type: codebase
description: "A software project or set of class/module descriptions to scan for design smells"
- type: none
description: "Alternatively, a textual description of a design problem or class structure"
tools-required: [Read, Grep, Write, TodoWrite]
tools-optional: [Glob, Bash]
mcps-required: []
environment: "Best results inside a codebase directory. Can also work from user-provided class descriptions."
---
# OO Design Smell Detector
## When to Use
You are auditing an object-oriented codebase — or a specific class or subsystem within one — for structural design problems that make change expensive. Typical triggers:
- "Every time I add a feature, I have to change 10 files"
- "This class hierarchy is getting out of control"
- "Changing the database means rewriting half the codebase"
- "I can't swap out this algorithm without touching everything that calls it"
- "We need to extend this library class but can't modify it"
- "New platforms require total rewrites"
- A pre-refactoring audit before a redesign sprint
- A code review specifically targeting design quality, not just style
Before starting, verify:
- Is there a specific class, subsystem, or entire codebase to scan?
- Is the user looking for an exhaustive audit or a targeted check on one smell category?
## Context & Input Gathering
### Required Context (must have before proceeding)
- **Scan target:** What is to be analyzed — a directory, a single class, a described design?
-> Check prompt for: class names, package/directory references, file paths, language mentions
-> Check environment for: src/ directories, class files, module structures
-> If still missing, ask: "Which directory, class, or codebase should I scan for design smells?"
- **Programming language:** Determines what syntax patterns and idioms to look for.
-> Check prompt for: language mentions ("Java", "Python", "C#", "TypeScript", etc.)
-> Check environment for: file extensions (.java, .py, .cs, .ts), build files (pom.xml, setup.py)
-> If still missing, ask: "What programming language is the codebase written in?"
### Observable Context (gather from environment if available)
- **Class hierarchy depth:** How deep is the inheritance tree?
-> Look for: files with `extends`, `implements`, `: BaseClass` patterns; count hierarchy levels
-> If unavailable: note as unobservable, flag for user to check manually
- **Object creation sites:** Where are concrete classes instantiated?
-> Look for: `new ConcreteClass(`, constructor calls using specific class names (not interfaces)
-> If unavailable: rely on user description
- **Platform or environment references:** Direct OS/API calls embedded in business logic?
-> Look for: imports of platform-specific packages, direct file system or OS calls in domain classes
-> If unavailable: assume unknown, flag for inspection
- **Algorithm implementations:** Are algorithms embedded inside data-holding classes?
-> Look for: sort methods, transformation logic, parsing routines inside entity/model classes
-> If unavailable: rely on user description
### Default Assumptions
- If no codebase is accessible -> perform assessment from user's description and produce a report with "described" rather than "observed" evidence
- If language cannot be determined -> ask (language affects detection patterns significantly)
- If no specific scope is given -> scan the primary source directory
### Sufficiency Threshold
```
SUFFICIENT when ALL of these are true:
- Scan target is identified (directory, class, or description)
- Programming language is known
- At least one source of evidence exists (code or description)
PROCEED WITH DEFAULTS when:
- Target is identified
- Language is known or strongly implied by file extensions
- Some source files are readable even if the full codebase is unavailable
MUST ASK when:
- No scan target is specified at all
- Language is ambiguous and detection patterns differ significantly between candidates
```
## Process
Use TodoWrite to track progress through the 8 smell categories. Complete all steps before producing the report.
### Step 1: Initialize Smell Scan Checklist
**ACTION:** Create a todo list tracking each of the 8 smell categories. Mark each as pending.
**WHY:** With 8 independent detection passes, it is easy to skip one when a particularly significant smell is found early. The checklist ensures systematic coverage and prevents the cognitive bias of stopping after the first hit. The final report is only credible if all 8 categories were actively checked, not just the ones that happened to be obvious.
Use TodoWrite to create tasks:
1. Detect hardcoded object creation (Smell DS-01)
2. Detect operation/request hardcoding (Smell DS-02)
3. Detect hardware/software platform coupling (Smell DS-03)
4. Detect representation and implementation leaks (Smell DS-04)
5. Detect algorithmic dependencies (Smell DS-05)
6. Detect tight coupling (Smell DS-06)
7. Detect subclass explosion from inheritance overuse (Smell DS-07)
8. Detect inability to alter classes (Smell DS-08)
### Step 2: Scan for Hardcoded Object Creation (DS-01)
**ACTION:** Search for `new ConcreteClass(` patterns (or language equivalent) outside of dedicated factory or builder classes. Look for concrete class names used as variable types instead of interface types. Note every location where object creation is hardwired to a specific implementation.
**WHY:** When code names a concrete class at instantiation, it commits to that implementation. If the class needs to change — for testing, configuration, or runtime variation — every call site must change too. This is not a hypothetical risk: it is one of the most common causes of "I can't write unit tests without running the whole stack." Detecting it now enables targeted introduction of Abstract Factory, Factory Method, or Prototype to centralize and isolate creation decisions.
**IF** codebase is accessible -> use Grep to search for `new [A-Z][a-zA-Z]+\(` patterns; filter out creation inside factory/builder classes; list remaining call sites
**ELSE** -> ask the user to describe how objects of key types are created in their design
Mark DS-01 todo item complete. Note findings.
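When Grep output needs post-filtering, the scan can be approximated in a few lines of Python. This is a rough sketch under stated assumptions: it targets Java-style `new` expressions, treats any file whose name contains `Factory` or `Builder` as a legitimate creation site (a naming convention, not a guarantee), and will also match `new` inside comments or string literals. The function name `find_hardcoded_creation` is hypothetical.

```python
import re

# "new Name(" or "new Name<...>(" where Name starts with an uppercase letter.
NEW_CALL = re.compile(r"\bnew\s+([A-Z][A-Za-z0-9_]*)\s*(?:<[^>]*>)?\s*\(")
# Assumption: factories and builders are recognizable by file name.
FACTORY_HINT = re.compile(r"(Factory|Builder)\b")


def find_hardcoded_creation(sources: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (path, line number, class name) for each creation site found."""
    hits = []
    for path, text in sources.items():
        if FACTORY_HINT.search(path):
            continue  # creation inside a factory or builder is expected
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in NEW_CALL.finditer(line):
                hits.append((path, lineno, match.group(1)))
    return hits
```

The result is a candidate list, not a verdict. Each hit still needs a manual check of whether the created type ever needs to vary.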
### Step 3: Scan for Operation Dependencies (DS-02)
**ACTION:** Look for direct, hardcoded method calls that could instead be requests dispatched through a generic mechanism. Specifically: conditionals that branch on operation type (`if action == "save" ... else if action == "delete"`), hardcoded event handling with explicit method names per event type, or long chains of if/switch that encode what to do for each case.
**WHY:** Hardcoded request handling means every new operation requires a source change — you cannot add behavior at runtime or configuration time. This is how systems accumulate "god switches" where a central coordinator knows about every possible action. Chain of Responsibility and Command patterns address this by turning requests into objects and dispatching them polymorphically, which lets you add, remove, or reorder operations without touching existing code.
**IF** codebase accessible -> Grep for large switch/case blocks on operation type, long if/else-if chains on string or enum values representing actions
**ELSE** -> ask: "Are there places where the code explicitly branches on what operation to perform, using if/switch statements?"
Mark DS-02 complete. Note findings.
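For reference, this is roughly what the smell and a Command-style alternative look like in Python. The action names and the `COMMANDS` registry are hypothetical, and plain callables stand in for full Command objects.

```python
from typing import Callable, Dict


# Smell: every new action means editing this function.
def handle_hardcoded(action: str, doc: list) -> None:
    if action == "save":
        doc.append("saved")
    elif action == "delete":
        doc.clear()
    else:
        raise ValueError(action)


# Alternative: requests are objects (here callables) held in a registry.
COMMANDS: Dict[str, Callable[[list], None]] = {
    "save": lambda doc: doc.append("saved"),
    "delete": lambda doc: doc.clear(),
}


def handle(action: str, doc: list) -> None:
    try:
        command = COMMANDS[action]
    except KeyError:
        raise ValueError(action) from None
    command(doc)


# A new operation is registered without touching handle().
COMMANDS["archive"] = lambda doc: doc.append("archived")
```

When reporting DS-02, a branch like the one in `handle_hardcoded` only matters if new actions are actually added over time. A fixed two-case branch is usually fine as it is.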
### Step 4: Scan for Platform Coupling (DS-03)
**ACTION:** Look for platform-specific imports, API calls, or OS dependencies embedded directly in business-logic classes. Examples: file system access in a domain model, direct database driver calls in a service class, OS-specific path constructions, platform API calls (Windows registry, Unix signals) in application code.
**WHY:** Platform coupling is among the most expensive smells to fix late. Every time the operating environment changes (a cloud migration, a new OS target, a framework upgrade), classes that directly reference platform APIs must be rewritten rather than reconfigured. Abstracting platform access behind an interface (Abstract Factory for platform families, Bridge for independent variation) keeps platform changes contained in adapters instead of spreading them across business logic.
**IF** codebase accessible -> Read key business/domain classes; check their import sections for OS, file system, or driver-level packages; Grep for direct platform API patterns
**ELSE** -> ask: "Do your domain or business-logic classes contain direct OS calls, file I/O, or database driver code?"
Mark DS-03 complete. Note findings.
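A possible shape for the fix, sketched in Python: domain logic depends on a small interface, and the platform-specific code sits in one adapter. `BlobStore`, `InMemoryStore`, `LocalDiskStore`, and `invoice_total` are hypothetical names for this illustration.

```python
from pathlib import Path
from typing import Dict, Protocol


class BlobStore(Protocol):
    """The only storage interface the domain logic is allowed to see."""

    def read(self, name: str) -> bytes: ...


class LocalDiskStore:
    """Adapter that contains the file-system dependency."""

    def __init__(self, root: str):
        self._root = Path(root)

    def read(self, name: str) -> bytes:
        return (self._root / name).read_bytes()


class InMemoryStore:
    """Adapter for tests, or for environments without a local disk."""

    def __init__(self, blobs: Dict[str, bytes]):
        self._blobs = dict(blobs)

    def read(self, name: str) -> bytes:
        return self._blobs[name]


def invoice_total(store: BlobStore, name: str) -> int:
    # Domain logic: no imports of os, pathlib, or a database driver here.
    return sum(int(line) for line in store.read(name).decode().splitlines())
```

A move to a different storage platform would then add one more adapter and leave `invoice_total` untouched.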
### Step 5: Scan for Representation and Implementation Leaks (DS-04)
**ACTION:** Identify places where a class's internal data structure or implementation details are exposed to clients. Signs: returning internal collection references directly, public fields on objects that should be opaque, clients checking the internal state of an object to decide how to use it, or clients knowing which concrete implementation is behind an interface.
**WHY:** When clients know how an object is represented internally, any change to that representation requires changing every client. This is the classic encapsulation failure. It is also how simple refactors — "let's change from ArrayList to LinkedList" — become large multi-file changes. Patterns that address this: Abstract Factory (clients know only the interface), Bridge (separates interface from implementation), Memento (exposes state through a controlled opaque token), Proxy (controls access to the real object).
**IF** codebase accessible -> Look for public fields in domain classes; methods that return raw internal data structures (List, Map) rather than copies or views; clients using instanceof checks on returned objects
**ELSE** -> ask: "Do your classes expose internal data structures directly? Can clients see the concrete type behind an abstraction?"
Mark DS-04 complete. Note findings.
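The difference between a leak and a controlled view can be shown with a short Python sketch. `LeakyOrder` and `Order` are hypothetical classes used only for illustration.

```python
class LeakyOrder:
    def __init__(self):
        # Public field: every client now depends on this being a mutable list.
        self.items = []


class Order:
    def __init__(self):
        self._items = []  # representation stays private

    def add(self, item: str) -> None:
        self._items.append(item)

    @property
    def items(self) -> tuple:
        # Read-only snapshot: callers cannot mutate internal state through it,
        # and the internal container can change without breaking them.
        return tuple(self._items)
```

With `Order`, switching the internal list to another container is a one-class change. With `LeakyOrder`, every client that appends to or indexes `items` has to be reviewed.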
### Step 6: Scan for Algorithmic Dependencies (DS-05)
**ACTION:** Find algorithms embedded inside classes that are not specifically about algorithm management. Look for: sorting logic inside entity classes, serialization/parsing code mixed with domain logic, encryption or hashing routines inlined in services, data transformation code duplicated across multiple classes.
**WHY:** Algorithms change more often than the data they operate on. When an algorithm is embedded in a class, changing the algorithm means modifying the class — even if the class's real responsibility is elsewhere. This also prevents swapping algorithms at runtime (e.g., choosing a compression strategy based on data size). Strategy, Template Method, Iterator, Builder, and Visitor all isolate algorithms from the data they process, so algorithms can be extended, replaced, or composed without touching the data classes.
**IF** codebase accessible -> Grep for sort, parse, serialize, transform, encrypt, compress inside entity/model/domain classes; Read identified files to confirm algorithm embedding
**ELSE** -> ask: "Are there algorithms (sorting, parsing, serialization, encryption) embedded directly inside your data or entity classes?"
Mark DS-05 complete. Note findings.
### Step 7: Scan for Tight Coupling (DS-06)
**ACTION:** Identify classes with high fan-out (depending on many other concrete classes), classes that cannot be unit-tested without instantiating many collaborators, or classes that reference downstream classes by concrete type rather than interface. Also look for bidirectional dependencies between classes (A knows about B and B knows about A).
**WHY:** Tight coupling creates monolithic systems in disguise. Even if the code is in separate files, high concrete coupling means you cannot change, test, or reuse any class in isolation. The cost is paid every time: harder testing, cascading changes, inability to port or reuse components. The patterns that address this — Abstract Factory, Bridge, Chain of Responsibility, Command, Facade, Mediator, Observer — all share a common mechanism: they introduce an indirection layer that decouples the parties that need to collaborate.
**IF** codebase accessible -> Count imports in each class (high import counts with concrete class names are a signal); look for bidirectional references; identify classes that require many constructor arguments of concrete types
**ELSE** -> ask: "Which classes have the most dependencies on other concrete classes? Are there bidirectional references between classes?"
Mark DS-06 complete. Note findings.
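For Python codebases, import counting can be automated with the standard `ast` module. The sketch below is one possible approach, not a prescribed tool: `import_fan_out` is a hypothetical helper, it counts distinct top-level modules only, and it skips relative imports such as `from . import x`.

```python
import ast


def import_fan_out(source: str) -> int:
    """Count distinct top-level modules imported by one Python source file."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return len(modules)
```

A high count is only a signal. Imports of interfaces or standard-library utilities are not the concrete coupling this step is looking for, so flagged files still need to be read.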
### Step 8: Scan for Subclass Explosion (DS-07)
**ACTION:** Count the number of direct subclasses for key base classes. Flag any class with more than 5-7 direct subclasses as a candidate. Look for subclass names that suggest Cartesian products of features (e.g., `RedCircle`, `BlueCircle`, `RedSquare`, `BlueSquare`). Check whether inheritance is being used where composition or delegation would serve better.
**WHY:** This is one of the most common misuses of inheritance. When a class is extended to customize behavior for each variation of each dimension, the number of subclasses grows multiplicatively: shapes that vary by kind and color need (kinds × colors) subclasses. Adding a new dimension (e.g., size) multiplies that count by the number of sizes, and even one new value in an existing dimension (e.g., one new color) requires a new subclass for every combination of the other dimensions. GoF describes this as inheritance "leading to an explosion of classes." The diagnostic: if subclass names look like they encode combinations of features, inheritance is doing work that composition (Bridge, Decorator, Strategy, Composite, Observer) should be doing. Delegation provides the same power as inheritance without the class-count explosion.
**IF** codebase accessible -> Glob for all class files; Grep for `extends BaseClass` or `: BaseClass` per base class; measure hierarchy depth using a depth-first scan; flag hierarchies more than 3-4 levels deep or base classes with more than 5-7 direct subclasses
**ELSE** -> ask: "Are there base classes with many direct subclasses? Do subclass names suggest combinations of features?"
Also check: are there abstract classes with no behavior that exist solely to create a naming hierarchy? This is a sign of "classification for classification's sake" rather than behavior reuse.
Mark DS-07 complete. Note findings.
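For Python sources, the direct-subclass count can be taken from the syntax tree instead of Grep output. This is a sketch under assumptions: `direct_subclass_counts` is a hypothetical helper, it handles a single source string, and it counts only bases written as plain names (a base such as `pkg.Base` is ignored).

```python
import ast
from collections import Counter


def direct_subclass_counts(source: str) -> Counter:
    """Map each base-class name to the number of classes that extend it directly."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            for base in node.bases:
                if isinstance(base, ast.Name):
                    counts[base.id] += 1
    return counts
```

Base classes whose count exceeds the threshold used in this step are the ones to inspect for names that encode feature combinations.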
### Step 9: Scan for Inability to Alter Classes (DS-08)
**ACTION:** Identify classes that cannot be modified easily — either because they are third-party/library classes without source access, because any change would require modifying many existing subclasses in unison, or because the class is so entangled that isolating a change is risky. Look for: direct inheritance from library classes (rather than wrapping them), large numbers of subclasses that would all need updating, or classes referenced so widely that any change is feared.
**WHY:** Some classes simply cannot be changed: you lack source access, or changes would break a versioned API. When this happens, you need patterns that let you extend or adapt without modification. Adapter wraps the class to present a different interface. Decorator adds behavior by wrapping rather than subclassing. Visitor adds operations to a class hierarchy without changing the hierarchy. Detecting this smell early prevents teams from building deep inheritance chains on top of classes that cannot absorb change.
**IF** codebase accessible -> Grep for direct extension of library/framework classes; identify classes that appear in many files as imports but whose source is in a vendor or dependency directory
**ELSE** -> ask: "Are there classes (from libraries or elsewhere) that you need to extend or modify behavior for, but cannot change directly?"
Mark DS-08 complete. Note findings.
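As an illustration of extending without modifying, the following sketch wraps a class that is assumed to be off-limits. `VendorClient` is a hypothetical stand-in for a library class, and `CachingClient` is a Decorator-style wrapper written for this example.

```python
class VendorClient:
    """Stand-in for a third-party class whose source cannot be changed."""

    def fetch(self, key: str) -> str:
        return f"value-for-{key}"


class CachingClient:
    """Decorator: same fetch() interface, added behavior, no subclassing."""

    def __init__(self, inner):
        self._inner = inner
        self._cache = {}
        self.misses = 0  # exposed so the effect of caching is observable

    def fetch(self, key: str) -> str:
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._inner.fetch(key)
        return self._cache[key]
```

Because `CachingClient` holds the vendor object instead of inheriting from it, a new vendor release that changes internals affects only the single call inside `fetch`.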
### Step 10: Assess Severity and Produce Report
**ACTION:** For each detected smell, assess severity using the two-factor rubric:
- **Spread:** How many locations in the codebase exhibit this smell? (1-2: isolated / 3-10: moderate / more than 10: pervasive)
- **Friction:** How often does this smell cause actual change pain? (Low: rarely touched / Medium: touched occasionally / High: changed frequently)
Severity = Spread x Friction:
- Isolated spread + Low friction = **Advisory** (document, address opportunistically)
- Pervasive spread + High friction = **Critical** (address before next feature work, blocks velocity)
- Any other combination = **Warning** (plan to address in next refactoring sprint)
Produce the Design Smell Report (see Outputs section).
**WHY:** Not all smells deserve equal urgency. A class instantiated concretely in 100 places that nobody ever needs to swap out is less urgent than a tightly coupled pair of classes that must change together every sprint. Severity assessment prevents the common failure mode of fixing cosmetically obvious smells while ignoring the high-friction ones that actually slow the team down.
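The rubric can be written down as a small function so that every finding is scored the same way. This is one reading of the rubric, not a definitive one: it assumes that exactly 10 locations counts as moderate and that any combination the rubric does not name explicitly is a Warning. The function names are hypothetical.

```python
def spread_level(locations: int) -> str:
    """Bucket a location count into the rubric's spread levels."""
    if locations < 1:
        raise ValueError("a detected smell has at least one location")
    if locations <= 2:
        return "isolated"
    if locations <= 10:  # assumption: exactly 10 is still moderate
        return "moderate"
    return "pervasive"


def severity(locations: int, friction: str) -> str:
    """Combine spread and friction ('low', 'medium', 'high') into a severity."""
    if friction not in ("low", "medium", "high"):
        raise ValueError(f"unknown friction level: {friction}")
    spread = spread_level(locations)
    if spread == "isolated" and friction == "low":
        return "Advisory"
    if spread == "pervasive" and friction == "high":
        return "Critical"
    return "Warning"  # assumption: every other combination
```

Under this reading, a class instantiated concretely in 100 places that nobody needs to swap scores `severity(100, "low")`, which is a Warning and not a Critical.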
### Step 11: Apply Pattern Overuse Warning
**ACTION:** For each recommended pattern, add a calibration note: is the flexibility the pattern provides actually needed? If the smell affects code that is stable and rarely changes, note this explicitly and downgrade the recommendation to "low priority."
**WHY:** Design patterns introduce indirection, and indirection can add complexity and cost some performance. A pattern should only be applied when the flexibility it provides is genuinely needed, not as a demonstration of pattern knowledge. GoF explicitly states: "A design pattern should only be applied when the flexibility it affords is actually needed." If a class creates a concrete object that will never need to be swapped, wrapping it in an Abstract Factory adds complexity with no benefit. The report must distinguish between smells that need fixing now (high friction) and theoretical smells in code that works fine (leave alone).
## Inputs
- Source directory or class files to scan
- Programming language of the codebase
- Optionally: specific concerns from the user (e.g., "focus on coupling" or "we're seeing subclass explosion")
## Outputs
### Design Smell Report
```markdown
# OO Design Smell Report: {System/Scope Name}
**Date:** {date}
**Scope:** {what was scanned}
**Language:** {language}
**Scan method:** {codebase analysis / description-based / hybrid}
## Executive Summary
{1-3 sentence summary of overall design health and top priority smell}
## Smell Findings
### DS-01: Hardcoded Object Creation
**Status:** {Detected / Not detected / Unable to assess}
**Severity:** {Advisory / Warning / Critical}
**Locations:** {list of specific files/classes/line ranges}
**Evidence:** {what was found — e.g., "new OrderRepository() called in 14 places across service layer"}
**Recommended patterns:** {Abstract Factory / Factory Method / Prototype}
**Justification:** {why this pattern fits this specific instance}
**Pattern overuse check:** {Is this flexibility actually needed? If the object never varies, note this.}
### DS-02: Operation Dependencies
**Status:** {Detected / Not detected / Unable to assess}
**Severity:** {Advisory / Warning / Critical}
**Locations:** {list of files/classes}
**Evidence:** {e.g., "45-line switch statement on event type in EventDispatcher.java"}
**Recommended patterns:** {Chain of Responsibility / Command}
**Justification:** {why this pattern fits}
**Pattern overuse check:** {If operations are fixed and never extended, flag as low priority.}
### DS-03: Platform Coupling
[same structure]
### DS-04: Representation Leaks
[same structure]
### DS-05: Algorithmic Dependencies
[same structure]
### DS-06: Tight Coupling
[same structure]
### DS-07: Subclass Explosion
**Status:** {Detected / Not detected / Unable to assess}
**Severity:** {Advisory / Warning / Critical}
**Hierarchy map:** {BaseClass -> list of subclasses, depth}
**Feature dimensions identified:** {e.g., "Shape x Color x Size"}
**Inheritance vs composition diagnosis:** {are subclasses encoding combinations?}
**Recommended patterns:** {Bridge / Decorator / Strategy / Composite / Observer}
**Refactoring sketch:** {high-level approach — e.g., "extract Color as a strategy, inject into Shape"}
**Pattern overuse check:** {If hierarchy is stable and combinations are fixed, note it.}
### DS-08: Inability to Alter Classes
[same structure]
## Priority Matrix
| Smell | Spread | Friction | Severity | Action |
|-------|:------:|:--------:|:--------:|--------|
| DS-01 Hardcoded creation | {isolated/moderate/pervasive} | {low/medium/high} | {Advisory/Warning/Critical} | {action} |
| DS-02 Operation coupling | ... | ... | ... | ... |
| DS-03 Platform coupling | ... | ... | ... | ... |
| DS-04 Representation leak | ... | ... | ... | ... |
| DS-05 Algorithmic dependency | ... | ... | ... | ... |
| DS-06 Tight coupling | ... | ... | ... | ... |
| DS-07 Subclass explosion | ... | ... | ... | ... |
| DS-08 Frozen classes | ... | ... | ... | ... |
## Top 3 Recommended Actions
1. **[Highest priority smell]** — {specific refactoring action} using {pattern}
Expected benefit: {what changes become easier once this is addressed}
2. **[Second priority]** — {specific action}
Expected benefit: {concrete improvement}
3. **[Third priority]** — {specific action}
Expected benefit: {concrete improvement}
## What NOT to Fix (Pattern Overuse Caution)
{List any detected smells that do not warrant pattern application because the flexibility is not needed, the code is stable, or the cost of introducing indirection exceeds the benefit.}
```
## Key Principles
- **Smells signal future pain, not present failure** — A design smell does not mean the code is broken today. It means that when the inevitable change arrives, this code will resist it. Severity must account for how frequently that change arrives. Dead code with every smell is still lower priority than a hot path with even one moderate smell.
- **Fix the cause of redesign, not the symptom** — Each smell maps to a reason why the system will require redesign. The goal is not to apply patterns everywhere but to remove the specific fragility that the pattern addresses. Always state which redesign risk is being mitigated.
- **Favor composition over inheritance for variation** — Inheritance is a compile-time, white-box relationship. When classes vary along multiple independent dimensions, composition lets each dimension vary independently without class proliferation. Detect this early: subclass names that combine attributes are the clearest signal that inheritance is doing composition's job.
- **Program to interfaces, not implementations** — The common thread across all 8 smells is coupling to something concrete: concrete classes, operations, platforms, representations, or algorithms. Every pattern recommendation in this skill replaces a dependency on a concrete thing with a dependency on an abstract interface.
- **Do not apply patterns when flexibility is not needed** — Patterns introduce indirection. Indirection adds complexity. The pattern is justified only when the design needs to vary in a specific way. An Abstract Factory for an object that will never change is engineering overhead, not engineering quality. Always evaluate need before recommending.
- **Inheritance depth is a risk multiplier** — Every level of inheritance depth increases the probability that a change in the parent ripples unexpectedly to a descendant. Hierarchies deeper than 3-4 levels should be examined for whether intermediate levels are adding behavior or merely classification names. Classification-only intermediate classes are candidates for removal.
- **Detection evidence must be specific** — A smell report that says "this class might have coupling issues" is useless. A report that says "CustomerService.java imports 12 concrete classes and cannot be instantiated in a unit test without a real database connection" is actionable. Always produce file names, line references, or concrete observations — not impressions.
## Examples
**Scenario: E-commerce codebase pre-refactoring audit**
Trigger: "We're about to add a second payment provider. Before we start, can you review our checkout code for design problems?"
Process: Scanned the checkout directory. DS-01: Found `new StripePaymentGateway()` called in 5 places across CheckoutService, OrderService, and RefundService — Critical. DS-02: Found switch on `paymentMethod` string in PaymentRouter — Warning. DS-06: CheckoutService imports 9 concrete classes including specific gateway, tax calculator, and inventory service — Warning. DS-07: No subclass explosion detected. DS-04: PaymentResult returned as raw HashMap to callers — Advisory. Applied pattern overuse check: flexibility in payment provider is confirmed needed (stated goal), so Abstract Factory recommendation is fully justified.
Output: Report flagging DS-01 as Critical (5 sites, high friction as gateway swap is imminent), recommending Abstract Factory for payment gateway creation with a sketch showing PaymentGatewayFactory interface with StripeGatewayFactory and the new provider's factory as implementations.
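The factory sketch mentioned in this output might look roughly like the following in Python. `PaymentGatewayFactory` and `StripeGatewayFactory` come from the scenario; `AcmePayGateway`, `AcmePayGatewayFactory`, `charge`, `create_gateway`, and the `CheckoutService` constructor are hypothetical stand-ins, since the scenario does not name the second provider or any method signatures.

```python
from abc import ABC, abstractmethod


class PaymentGateway(ABC):
    @abstractmethod
    def charge(self, amount_cents: int) -> str: ...


class StripePaymentGateway(PaymentGateway):
    def charge(self, amount_cents: int) -> str:
        return f"stripe:{amount_cents}"  # placeholder for a real API call


class AcmePayGateway(PaymentGateway):  # hypothetical second provider
    def charge(self, amount_cents: int) -> str:
        return f"acmepay:{amount_cents}"


class PaymentGatewayFactory(ABC):
    @abstractmethod
    def create_gateway(self) -> PaymentGateway: ...


class StripeGatewayFactory(PaymentGatewayFactory):
    def create_gateway(self) -> PaymentGateway:
        return StripePaymentGateway()


class AcmePayGatewayFactory(PaymentGatewayFactory):
    def create_gateway(self) -> PaymentGateway:
        return AcmePayGateway()


class CheckoutService:
    def __init__(self, factory: PaymentGatewayFactory):
        # The only place a gateway is obtained; no provider class is named here.
        self._gateway = factory.create_gateway()

    def pay(self, amount_cents: int) -> str:
        return self._gateway.charge(amount_cents)
```

With this shape, switching providers means passing a different factory at configuration time; the services that previously called `new StripePaymentGateway()` would not need to change.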
**Scenario: UI component library hierarchy review**
Trigger: "We have Button, PrimaryButton, SecondaryButton, DisabledButton, LargeButton, LargePrimaryButton, SmallButton, SmallPrimaryButton... the list keeps growing. Something is wrong."
Process: Mapped the hierarchy. Base: Button (1 class). Dimension 1: Size (Small, Medium, Large). Dimension 2: Variant (Primary, Secondary, Ghost). Dimension 3: State (Default, Disabled, Loading). Current subclass count: 27 and growing multiplicatively. DS-07 detected as Critical. Inheritance encodes combinations of independent dimensions. Adding any new size requires 9 new subclasses; any new state requires 9 new subclasses. Recommended Bridge pattern: separate Button from its rendering variant; use composition (size strategy + variant strategy + state strategy) instead of subclassing each combination. Applied pattern overuse check: combinations do grow (stated), so the flexibility is needed.
Output: Report with hierarchy map showing the combinatorial explosion, Bridge pattern sketch reducing 27 classes to Button + 3 strategy interfaces + concrete strategy implementations.
**Scenario: Targeted single-class review**
Trigger: "I think our ReportGenerator class is badly designed. Can you look at it?"
Process: Read ReportGenerator.java. Found: direct instantiation of PDFWriter (DS-01, isolated), Excel export logic embedded in the class alongside SQL query building (DS-05, isolated), direct file system calls to write output paths (DS-03, isolated), and formatter class imported concretely (DS-06, isolated). All smells are isolated but co-located in one class. Assessed friction: this class is modified every time a new report format or output destination is added — high friction despite low spread. Applied pattern overuse check: format and output destination do vary (confirmed by the user's change pattern), so the flexibility is genuinely needed. Recommended: Strategy for report formatting (isolates Excel/PDF/CSV algorithms), Abstract Factory for output destination (file system vs S3 vs email), Builder for report construction.
Output: Report for a single class showing 4 smells all rated Warning (isolated but high friction), with a refactoring roadmap prioritizing the algorithm extraction first as it affects the most frequent changes.
## References
- For the complete 8-smell detection catalog with grep patterns, code examples in multiple languages, and full pattern mapping tables, see [references/design-smell-catalog.md](references/design-smell-catalog.md)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/design-smell-catalog.md
# Design Smell Catalog — OO Design Smell Detector
Complete detection reference for all 8 design smells. Each entry includes: definition, detection criteria, language-specific grep patterns, code examples showing the smell and the fix, severity calibration guidance, and pattern mapping.
Source: "Design Patterns: Elements of Reusable Object-Oriented Software" (Gamma, Helm, Johnson, Vlissides), Chapter 1, pages 34-36 (8 causes of redesign), pages 29-31 (inheritance vs composition).
---
## DS-01: Hardcoded Object Creation
### Definition
Object creation is coupled to a specific concrete class at the call site. Code that creates objects by naming a concrete class at instantiation commits to that implementation — changing the implementation requires changing every creation site.
### Why It Causes Redesign
When a new variant of the created object is needed (e.g., a mock for testing, a different database implementation, a platform-specific adapter), every call site must be updated. Systems with many creation sites cannot swap implementations without a large, risky multi-file edit.
### Detection Criteria
- `new ConcreteClass()` used outside a factory or builder class
- Variable declared as a concrete type: `StripeGateway gateway = new StripeGateway()` instead of `PaymentGateway gateway = ...`
- Constructor calls where the class name encodes a specific implementation choice (database vendor, external service, specific algorithm)
### Grep Patterns
**Java / Kotlin:**
```
grep -rn "= new [A-Z][a-zA-Z]*(" src/
grep -rn "new [A-Z][a-zA-Z]*Repository(" src/
grep -rn "new [A-Z][a-zA-Z]*Service(" src/
```
**Python:**
```
grep -rn "[a-z_]\+ = [A-Z][a-zA-Z]*(" src/
# Look for instantiation without a factory function
```
**TypeScript / JavaScript:**
```
grep -rn "= new [A-Z][a-zA-Z]*(" src/
```
**C#:**
```
grep -rn "= new [A-Z][a-zA-Z]*(" src/
```
### False Positive Filters
- Creation inside a class named `*Factory`, `*Builder`, `*Provider`, `*Creator` — these are intended creation sites
- Creation of value objects (String, Integer, simple DTOs) — these rarely need to vary
- Test setup code — concrete instantiation in tests is expected
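The grep patterns and filters above can be combined into one pass. The following is a minimal sketch for Java-style sources, not a complete scanner: the function name, the `sources` mapping (path to file text), the `/test/` path convention, and the contents of `VALUE_TYPES` are all assumptions made for illustration.

```python
import re

CREATION = re.compile(r"\bnew\s+([A-Z][A-Za-z0-9]*)\s*\(")
CREATOR_SUFFIXES = ("Factory", "Builder", "Provider", "Creator")
VALUE_TYPES = {"String", "Integer", "Long", "BigDecimal"}  # assumed list, extend per codebase


def find_creation_sites(sources: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (path, line_number, class_name) for each suspicious `new X(`."""
    hits = []
    for path, text in sources.items():
        stem = path.rsplit("/", 1)[-1].split(".")[0]
        # Skip intended creation sites and test setup code.
        if stem.endswith(CREATOR_SUFFIXES) or "/test/" in path or stem.endswith("Test"):
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in CREATION.finditer(line):
                if match.group(1) not in VALUE_TYPES:
                    hits.append((path, lineno, match.group(1)))
    return hits
```

The result is still a list of candidates, not findings: each hit needs a human or model judgment about whether the created class is ever likely to vary.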
### Code Example: Smell
```java
// Java — hardcoded creation scattered across service layer
public class OrderService {
    public void processOrder(Order order) {
        PaymentResult result = new StripePaymentGateway().charge(order.getTotal());
        // ^ commits to Stripe everywhere this method is called
        new SQLOrderRepository().save(order);
        // ^ commits to SQL everywhere
    }
}
```
### Code Example: Fixed (Constructor Injection)
```java
// Java: the class receives its collaborators instead of creating them
public class OrderService {
    private final PaymentGateway gateway;       // interface type
    private final OrderRepository repository;   // interface type

    public OrderService(PaymentGateway gateway, OrderRepository repository) {
        this.gateway = gateway;
        this.repository = repository;
    }
    // Concrete classes are chosen once at configuration time (for example by a
    // factory at the composition root), not scattered at use sites
}
```
```
### Severity Calibration
- **Advisory:** 1-2 creation sites for an implementation that is unlikely to change
- **Warning:** 3-10 creation sites, or any creation site for an external dependency (database, payment, notification)
- **Critical:** More than 10 creation sites, or any number of sites when a change to the created class is already planned or needed
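The calibration above can be written down as a small decision function. This is a sketch under assumptions: the function and parameter names are hypothetical, and exactly 10 creation sites is treated as Warning because the Warning and Critical ranges meet at that value.

```python
def ds01_severity(creation_sites: int,
                  external_dependency: bool = False,
                  change_planned: bool = False) -> str:
    """Map DS-01 evidence to a severity label (call only when the smell is detected)."""
    if creation_sites > 10 or change_planned:
        return "Critical"
    if creation_sites >= 3 or external_dependency:
        return "Warning"
    return "Advisory"
```

For instance, 5 creation sites with a provider swap already planned evaluates to Critical, which matches the e-commerce scenario in the skill's examples.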
### Applicable Patterns
| Pattern | When to use |
|---------|-------------|
| **Abstract Factory** | Creating families of related objects (e.g., UI widgets for a platform, persistence objects for a database vendor) |
| **Factory Method** | Subclasses should decide which class to instantiate (e.g., a framework that lets users plug in their own creator) |
| **Prototype** | When the class to instantiate is specified at runtime (clone a prototype rather than naming a class) |
---
## DS-02: Operation Dependencies
### Definition
Code explicitly encodes which operations to perform using conditionals on operation type. Instead of dispatching requests polymorphically, the code names specific operations in if/switch branches.
### Why It Causes Redesign
Every new operation requires modifying the dispatch logic. There is a central point of knowledge about all possible operations, which becomes a change magnet. Operations cannot be reordered, added, or removed without editing the dispatcher.
### Detection Criteria
- Switch statements on an operation type, action name, command string, or event type
- Long if/else-if chains where each branch calls a different operation
- String-based dispatch: `if (action.equals("save")) ... else if (action.equals("delete")) ...`
- Integer/enum-based dispatch where the enum represents what to do (not what state to be in)
### Grep Patterns
**Java:**
```
grep -rn "switch.*action\|switch.*command\|switch.*event\|switch.*operation" src/
grep -rn "\.equals(\"save\")\|\.equals(\"delete\")\|\.equals(\"update\")" src/
```
**Python:**
```
grep -rn "if action ==\|elif action ==" src/
grep -rn "if command ==\|elif command ==" src/
```
**TypeScript:**
```
grep -rn "switch.*action\|switch.*type\|switch.*event" src/
grep -rn "case 'save'\|case 'delete'\|case 'update'" src/
```
### Code Example: Smell
```java
// Java — operation hardcoded as string dispatch
public class RequestHandler {
    public void handle(String action, Request request) {
        if (action.equals("save")) {
            saveRecord(request);
        } else if (action.equals("delete")) {
            deleteRecord(request);
        } else if (action.equals("validate")) {
            validateRecord(request);
        }
        // Adding "export" requires editing this class
    }
}
```
### Code Example: Fixed (Command)
```java
// Java — operations as objects
import java.util.Map;

public interface Command {
    void execute(Request request);
}
public class SaveCommand implements Command { ... }
public class DeleteCommand implements Command { ... }
public class RequestHandler {
    private final Map<String, Command> commands;

    // Commands are registered once, e.g. Map.of("save", new SaveCommand(), ...)
    public RequestHandler(Map<String, Command> commands) {
        this.commands = commands;
    }

    public void handle(String action, Request request) {
        Command command = commands.get(action);
        if (command != null) command.execute(request);
        // Adding "export" = register a new ExportCommand. No change here.
    }
}
```
### Severity Calibration
- **Advisory:** Switch on a stable, closed set of operations that will not grow
- **Warning:** Switch with 5+ cases, or cases that represent extensible operations
- **Critical:** Central dispatcher that all request handling flows through, with operations actively being added
### Applicable Patterns
| Pattern | When to use |
|---------|-------------|
| **Command** | Turn operations into first-class objects for queuing, undoing, logging, or dynamic registration |
| **Chain of Responsibility** | Operations should be tried in sequence until one handles the request; handlers don't need to know about each other |
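The Command fix is shown above; for comparison, here is a minimal Python sketch of the Chain of Responsibility alternative from the table. Handlers are plain functions that return `None` to pass the request along. The handler names, the request keys (`cached`, `path`), and the returned strings are hypothetical.

```python
from typing import Callable, Optional

Handler = Callable[[dict], Optional[str]]


def handle_cached(request: dict) -> Optional[str]:
    return "served from cache" if request.get("cached") else None


def handle_static(request: dict) -> Optional[str]:
    is_static = request.get("path", "").startswith("/static/")
    return "served static file" if is_static else None


def handle_default(request: dict) -> Optional[str]:
    return "served dynamically"  # last link accepts everything


def dispatch(chain: list[Handler], request: dict) -> str:
    """Try each handler in order until one returns a result."""
    for handler in chain:
        result = handler(request)
        if result is not None:
            return result
    raise LookupError("no handler accepted the request")
```

Adding or reordering behavior means editing the list passed to `dispatch`, not a central conditional, and no handler refers to any other handler.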
---
## DS-03: Platform and Hardware Coupling
### Definition
Business logic classes contain direct references to platform-specific APIs: operating system interfaces, specific database drivers, file system calls, hardware APIs, or external service SDKs embedded in domain code.
### Why It Causes Redesign
Platform changes — cloud migrations, OS upgrades, switching ORMs, moving from local storage to S3 — require rewriting classes whose real job is not platform management. Platform-coupled code also resists testing: you cannot run unit tests without the real platform available.
### Detection Criteria
- Domain or service classes importing platform-specific packages (`java.io.File`, `os.path`, `System.IO`)
- Direct JDBC/SQL calls in domain classes (not in repository classes)
- OS-specific constructs: path separators, environment variable access, process spawning
- External service SDK calls (`stripe.charge()`, `s3.putObject()`) in classes whose primary responsibility is business logic
### Grep Patterns
**Java:**
```
grep -rn "import java.io\.\|import java.nio\." src/main/ # in non-IO classes
grep -rn "import com.mysql\.\|import org.postgresql\." src/domain/
grep -rn "System.getenv\|Runtime.getRuntime" src/domain/
```
**Python:**
```
grep -rn "import os\|import subprocess\|import sys" src/domain/
grep -rn "open(\|os.path\." src/domain/
```
**TypeScript / Node.js:**
```
grep -rn "require('fs')\|require('os')\|require('path')" src/domain/
grep -rn "import.*from 'fs'" src/domain/
```
### Code Example: Smell
```python
# Python — file system hardcoded in domain class
class ReportGenerator:
    def generate(self, report_data):
        result = self._build_report(report_data)
        with open(f"/tmp/reports/{report_data.id}.pdf", "wb") as f:
            f.write(result)
        # Cannot test without a real file system; cannot switch to S3 without changing this class
```
### Code Example: Fixed (Abstract Factory / Bridge)
```python
# Python — output destination abstracted
class ReportGenerator:
    def __init__(self, output_store: "OutputStore"):  # interface, not concrete
        self.output_store = output_store

    def generate(self, report_data):
        result = self._build_report(report_data)
        self.output_store.write(report_data.id, result)
        # FileOutputStore, S3OutputStore, InMemoryOutputStore (for tests) all work
```
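The fixed example relies on an `OutputStore` interface that is not defined in the catalog. One possible shape, offered as a sketch only, uses `typing.Protocol`; the `write` signature is inferred from the call in `generate`, and the constructor arguments, the `.pdf` suffix, and the `saved` attribute are assumptions.

```python
from typing import Protocol


class OutputStore(Protocol):
    def write(self, report_id: str, content: bytes) -> None: ...


class FileOutputStore:
    def __init__(self, directory: str):
        self.directory = directory

    def write(self, report_id: str, content: bytes) -> None:
        with open(f"{self.directory}/{report_id}.pdf", "wb") as f:
            f.write(content)


class InMemoryOutputStore:
    """Test double: keeps reports in a dict instead of touching the file system."""

    def __init__(self):
        self.saved: dict[str, bytes] = {}

    def write(self, report_id: str, content: bytes) -> None:
        self.saved[report_id] = content
```

A unit test can then construct the generator with an `InMemoryOutputStore` and assert on `saved`, with no real file system involved.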
### Severity Calibration
- **Advisory:** Platform calls in a dedicated infrastructure/adapter class (may be acceptable)
- **Warning:** Platform calls in a service class that has non-platform responsibilities
- **Critical:** Platform calls directly in domain model or business logic classes
### Applicable Patterns
| Pattern | When to use |
|---------|-------------|
| **Abstract Factory** | The platform difference involves families of related objects (UI widgets, persistence layer, messaging) |
| **Bridge** | The abstraction and implementation vary independently along separate dimensions |
---
## DS-04: Representation and Implementation Leaks
### Definition
A class exposes its internal data structure, concrete type, or implementation details to clients. Clients make decisions based on how the object is represented, not what it can do.
### Why It Causes Redesign
Any change to the internal representation requires changing every client that knows about it. This is how `ArrayList` to `LinkedList` migrations become 20-file changes. It is also how schema migrations break the entire application instead of staying in the persistence layer.
### Detection Criteria
- Public fields (non-final, non-constant) on domain objects
- Methods that return raw collection references (`List<Item>` returned directly, not a copy or unmodifiable view)
- Clients using `instanceof` to branch behavior based on concrete type
- Getter methods that expose internal implementation details (returning a connection object, an internal cache, a raw byte array)
- Clients knowing the concrete class behind an interface (e.g., casting an interface to its implementation)
### Grep Patterns
**Java:**
```
grep -rnE "public [A-Za-z<>]+ [a-z][A-Za-z0-9]* *(=|;)" src/domain/ # public fields (skips method declarations)
grep -rn "instanceof [A-Z]" src/ # type-checking callers
grep -rn "return [a-z]*List\b\|return [a-z]*Map\b" src/ # returning internal collections
```
**TypeScript:**
```
grep -rn "public [a-z]" src/domain/ # public properties
grep -rn "instanceof [A-Z]" src/
```
### Code Example: Smell
```java
// Java — internal List exposed directly
public class ShoppingCart {
    public List<Item> items = new ArrayList<>(); // public field: clients can mutate internals
    // OR:
    public List<Item> getItems() {
        return items; // returns live reference: external code can clear the list
    }
}
// Client: depends on ArrayList-ness
((ArrayList<Item>) cart.getItems()).trimToSize(); // casts to concrete type
```
### Code Example: Fixed
```java
// Java — representation hidden, behavior exposed
public class ShoppingCart {
    private List<Item> items = new ArrayList<>();

    public void addItem(Item item) { items.add(item); }
    public void removeItem(Item item) { items.remove(item); }
    public List<Item> getItems() { return Collections.unmodifiableList(items); }
    public int getItemCount() { return items.size(); }
    // Switching to LinkedList requires no client changes
}
```
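The same idea carries over to Python, where privacy is by convention. The sketch below is an illustrative equivalent, not a translation of the Java API: the `items` property returns a tuple, which is an immutable snapshot, whereas Java's `Collections.unmodifiableList` returns a read-only live view.

```python
class ShoppingCart:
    def __init__(self):
        self._items = []  # internal representation, free to change

    def add_item(self, item):
        self._items.append(item)

    def remove_item(self, item):
        self._items.remove(item)

    @property
    def items(self):
        # Callers get a copy they cannot use to mutate the cart.
        return tuple(self._items)

    def __len__(self):
        return len(self._items)
```

Callers that need to change the cart must go through `add_item` and `remove_item`, so the list could later become a dict or a deque without breaking them.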
### Severity Calibration
- **Advisory:** Representation leaks in internal utility classes not exposed across module boundaries
- **Warning:** Public fields or raw collection returns in domain objects
- **Critical:** Clients using `instanceof` or casting to concrete types — this breaks the moment the implementation changes
### Applicable Patterns
| Pattern | When to use |
|---------|-------------|
| **Abstract Factory** | Hide which concrete family of objects is in use |
| **Bridge** | Separate the interface from its implementation so each can vary independently |
| **Memento** | Expose object state through a controlled, opaque token rather than direct field access |
| **Proxy** | Control and mediate access to the underlying object |
---
## DS-05: Algorithmic Dependencies
### Definition
Algorithms are embedded inside classes whose primary responsibility is data or domain logic — not algorithm management. The class cannot change data structure without changing the algorithm, and the algorithm cannot be replaced without changing the class.
### Why It Causes Redesign
Algorithms are among the most frequently replaced things in software: sorting strategies change for performance, serialization formats change for compatibility, encryption algorithms change for security. When an algorithm is embedded in a class, any algorithm change requires modifying the class — even if the class's job has nothing to do with the algorithm.
### Detection Criteria
- Sort routines inside entity or model classes
- Serialization/deserialization logic in domain classes (not in serializer classes)
- Encryption, hashing, or compression routines in service classes with other responsibilities
- Data transformation logic (mapping, filtering, aggregating) duplicated across multiple classes
- Template methods that mix structural logic with variable algorithms in the same class
### Grep Patterns
**Java:**
```
grep -rn "\.sort(\|Collections.sort\|Comparator" src/domain/ # sorting in domain
grep -rn "serialize\|deserialize\|toJson\|fromJson\|marshal\|unmarshal" src/domain/
grep -rn "encrypt\|decrypt\|hash\|compress" src/service/ # look for mixed responsibilities
```
**Python:**
```
grep -rn "sorted(\|\.sort(\|json.dumps\|json.loads\|pickle" src/domain/
grep -rn "hashlib\|hmac\|cryptography" src/domain/
```
### Code Example: Smell
```python
# Python — sorting algorithm embedded in domain class
class ProductCatalog:
    def __init__(self, products):
        self.products = products

    def get_products(self):
        # Bubble sort hardcoded — changing algorithm requires changing this class
        items = self.products[:]
        for i in range(len(items)):
            for j in range(len(items) - 1 - i):
                if items[j].price > items[j+1].price:
                    items[j], items[j+1] = items[j+1], items[j]
        return items
```
### Code Example: Fixed (Strategy)
```python
# Python — sorting strategy injected
from abc import ABC, abstractmethod


class SortStrategy(ABC):
    @abstractmethod
    def sort(self, items): ...


class PriceSortStrategy(SortStrategy):
    def sort(self, items):
        return sorted(items, key=lambda x: x.price)


class PopularitySortStrategy(SortStrategy):
    def sort(self, items):
        return sorted(items, key=lambda x: x.view_count, reverse=True)


class ProductCatalog:
    def __init__(self, products, sort_strategy: SortStrategy):
        self.products = products
        self.sort_strategy = sort_strategy

    def get_products(self):
        return self.sort_strategy.sort(self.products)
        # Changing the sort = swap the strategy, no change to ProductCatalog
```
### Severity Calibration
- **Advisory:** Algorithm embedded but the algorithm is fixed by the problem domain and will never change
- **Warning:** Algorithm embedded and has changed at least once, or is known to be implementation-specific
- **Critical:** Algorithm duplicated across multiple classes — any change must be made in multiple places
### Applicable Patterns
| Pattern | When to use |
|---------|-------------|
| **Strategy** | Choose an algorithm at runtime from a family of interchangeable algorithms |
| **Template Method** | Define the skeleton of an algorithm in a base class; let subclasses override specific steps |
| **Builder** | Construct complex objects step-by-step; separate construction algorithm from the representation |
| **Iterator** | Abstract iteration algorithm from the collection being iterated |
| **Visitor** | Add new operations to a class hierarchy without changing the classes |
---
## DS-06: Tight Coupling
### Definition
Classes depend directly on many other concrete classes, making it impossible to change, test, or reuse any class in isolation. Tightly coupled systems are monolithic in practice even if they appear modular in structure.
### Why It Causes Redesign
A class that imports 12 concrete collaborators cannot be unit-tested without instantiating all 12. Changing any one of the 12 may require changing the dependent. When coupling is bidirectional, the system becomes a dense graph where changes propagate unpredictably. GoF: "The system becomes a dense mass that's hard to learn, port, and maintain."
### Detection Criteria
- High import count (10+ concrete class imports in a single class)
- Classes that cannot be instantiated without setting up a full object graph
- Bidirectional dependencies (A imports B, B imports A)
- Concrete type references in method signatures (parameter types or return types are concrete classes, not interfaces)
- No use of dependency injection — all collaborators instantiated internally
### Grep Patterns
**Java:**
```
# Count imports per file — high counts (10+) warrant inspection
grep -c "^import" src/service/OrderService.java
# Find bidirectional dependencies
grep -rn "import com.example.serviceA" src/service/ServiceB.java
grep -rn "import com.example.serviceB" src/service/ServiceA.java
```
**Python:**
```
# Count imports
grep -c "^from\|^import" src/service/order_service.py
# Find non-interface dependencies
grep -rn "from services\." src/domain/
```
**Manual check:** Can this class be instantiated in a unit test without a database, network, or file system?
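For Python codebases, the bidirectional-dependency check can be automated instead of grepping pairs by hand. This is a minimal sketch: the function name and the `modules` mapping (module name to source text) are hypothetical, and it only matches imports whose dotted name is exactly a key of that mapping, so relative imports and submodule imports are not resolved.

```python
import re

IMPORT = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_][\w.]*)", re.MULTILINE)


def mutual_imports(modules: dict[str, str]) -> set[frozenset[str]]:
    """Return the pairs of modules that import each other."""
    graph = {
        name: {t for t in IMPORT.findall(text) if t in modules and t != name}
        for name, text in modules.items()
    }
    return {
        frozenset((a, b))
        for a, targets in graph.items()
        for b in targets
        if a in graph[b]
    }
```

Each returned pair is a candidate for the Warning or Critical rating, depending on whether the two modules are core services.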
### Code Example: Smell
```java
// Java — tightly coupled service
public class OrderService {
    private StripePaymentGateway stripeGateway = new StripePaymentGateway(); // concrete
    private MySQLOrderRepository mysqlRepository = new MySQLOrderRepository(); // concrete
    private TwilioNotificationService twilioService = new TwilioNotificationService(); // concrete
    private PDFReportGenerator pdfGenerator = new PDFReportGenerator(); // concrete
    // Unit test requires Stripe, MySQL, Twilio, and PDF generation
}
```
### Code Example: Fixed (Dependency Injection + Interfaces)
```java
// Java — loosely coupled service
public class OrderService {
    private final PaymentGateway gateway;            // interface
    private final OrderRepository repository;        // interface
    private final NotificationService notifications; // interface
    private final ReportGenerator reports;           // interface

    public OrderService(PaymentGateway gateway, OrderRepository repository,
                        NotificationService notifications, ReportGenerator reports) {
        this.gateway = gateway;
        this.repository = repository;
        this.notifications = notifications;
        this.reports = reports;
    }
    // Unit test: inject mocks for all four interfaces — no real dependencies needed
}
```
### Severity Calibration
- **Advisory:** High import count but imports are all stable library types (collections, utils)
- **Warning:** Concrete imports of domain or infrastructure classes; any bidirectional dependency
- **Critical:** Bidirectional dependencies between core services; class cannot be tested without full infrastructure
### Applicable Patterns
| Pattern | When to use |
|---------|-------------|
| **Abstract Factory** | Decouple clients from concrete collaborator families |
| **Bridge** | Decouple an abstraction from its implementation |
| **Chain of Responsibility** | Decouple request senders from handlers |
| **Command** | Decouple operation invokers from operation implementors |
| **Facade** | Provide a simple interface to a complex, tightly coupled subsystem |
| **Mediator** | Centralize complex interactions between objects so they don't reference each other directly |
| **Observer** | Decouple subjects from observers — subjects don't know who is listening |
---
## DS-07: Subclass Explosion (Inheritance Overuse)
### Definition
Class hierarchies grow into an unmanageable number of subclasses because inheritance is used to encode combinations of independent variations. Each new variation along any dimension multiplies the number of subclasses required.
### Why It Causes Redesign
GoF (p. 36): "Customizing an object by subclassing often isn't easy. Every new class has a fixed implementation overhead. Subclassing can lead to an explosion of classes, because you might have to introduce many new subclasses for even a simple extension."
If a hierarchy varies along N independent dimensions, the subclass count grows as the product of the dimensions' cardinalities. Adding a new dimension with k values multiplies the hierarchy by k (a two-valued dimension doubles it). This is unsustainable.
Also relevant (p. 30-31): "Designers overuse inheritance as a reuse technique, and designs are often made more reusable (and simpler) by depending more on object composition." Composition provides the same reuse without the class-count cost, and it allows variation at runtime rather than compile time.
### Detection Criteria
- More than 7 direct subclasses of a single base class
- Subclass names that encode combinations: `{Dimension1}{Dimension2}{BaseName}` (e.g., `LargeRedButton`, `SmallBlueButton`)
- Inheritance depth greater than 4 levels
- Abstract intermediate classes that add only naming, not behavior
- Hierarchies where adding a new "color" requires adding one new class per "shape"
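The depth criterion above could be checked against a parent map extracted from the codebase. A minimal sketch follows; the `parents` mapping (class name to direct superclass) and both function names are hypothetical, and building the map from source files is left out.

```python
def inheritance_depth(parents: dict[str, str], cls: str) -> int:
    """Number of ancestors between `cls` and the root of its hierarchy."""
    depth = 0
    seen = {cls}
    while cls in parents:
        cls = parents[cls]
        if cls in seen:
            raise ValueError("cycle in parent map")
        seen.add(cls)
        depth += 1
    return depth


def deep_classes(parents: dict[str, str], max_depth: int = 4) -> list[str]:
    """Classes whose inheritance depth exceeds `max_depth`."""
    return sorted(c for c in parents if inheritance_depth(parents, c) > max_depth)
```

Classes returned by `deep_classes` are where to look for intermediate levels that only add a name and no behavior.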
### Grep Patterns
**Java:**
```
# Find all classes that extend a specific base
grep -rn "extends Button\|extends Widget\|extends Shape" src/
# Count subclasses per base class
grep -rho "extends [A-Z][a-zA-Z]*" src/ | sort | uniq -c | sort -rn | head -20
```
**Python:**
```
grep -rn "class [A-Z][a-zA-Z]*(Button)\|class [A-Z][a-zA-Z]*(Widget)" src/
```
**TypeScript:**
```
grep -rho "extends [A-Z][a-zA-Z]*" src/ | sort | uniq -c | sort -rn
```
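When the shell pipeline is not available, the same count can be sketched in Python. The function names and the `sources` list (file contents as strings) are hypothetical; the regular expression covers the `class X extends Y` form used by Java and TypeScript, and the default threshold of 7 comes from the detection criteria above.

```python
import re
from collections import Counter

EXTENDS = re.compile(r"\bclass\s+\w+(?:<[^>]*>)?\s+extends\s+([A-Z]\w*)")


def subclass_counts(sources: list[str]) -> Counter:
    """Count direct subclasses per base class across all source texts."""
    counts = Counter()
    for text in sources:
        counts.update(EXTENDS.findall(text))
    return counts


def explosion_candidates(sources: list[str], threshold: int = 7) -> list[tuple[str, int]]:
    """Base classes with more than `threshold` direct subclasses, largest first."""
    return [(base, n) for base, n in subclass_counts(sources).most_common() if n > threshold]
```

A high count alone is not the smell; the follow-up is the dimension analysis, which checks whether the subclass names encode combinations.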
### Dimension Analysis
When subclass explosion is detected, identify the independent dimensions:
1. List all subclass names
2. Extract adjectives/modifiers (Large, Small, Red, Blue, Disabled, Loading)
3. Group modifiers by type (size group: Large/Small/Medium; color group: Red/Blue/Green; state group: Default/Disabled/Loading)
4. Count dimensions and their cardinalities
5. Current subclass count ≈ cardinality(dim1) × cardinality(dim2) × ... × cardinality(dimN)
**Example:** 27 Button subclasses = 3 sizes × 3 variants × 3 states. Adding a 4th state requires 9 new classes with inheritance, vs 1 new state implementation with composition.
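The arithmetic in the example can be checked with a short sketch. The function names `inheritanceClassCount` and `compositionClassCount` are illustrative and not from the source; the sketch assumes every combination of dimension values gets its own subclass under inheritance, and every dimension value gets one implementation class under composition.

```typescript
// Hypothetical helpers: compare class-count growth under the two approaches.

// Inheritance: one subclass per combination, so the count is the product.
function inheritanceClassCount(cardinalities: number[]): number {
  return cardinalities.reduce((product, n) => product * n, 1);
}

// Composition: one implementation class per dimension value, so the count is the sum.
function compositionClassCount(cardinalities: number[]): number {
  return cardinalities.reduce((sum, n) => sum + n, 0);
}

const currentDims = [3, 3, 3];  // sizes, variants, states
const extendedDims = [3, 3, 4]; // a 4th state is added

// 36 - 27 = 9 new subclasses
const addedWithInheritance =
  inheritanceClassCount(extendedDims) - inheritanceClassCount(currentDims);

// 10 - 9 = 1 new state implementation
const addedWithComposition =
  compositionClassCount(extendedDims) - compositionClassCount(currentDims);
```

The gap widens with every further dimension value, which is why the product estimate in step 5 is a useful early warning.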
### Code Example: Smell
```typescript
// TypeScript: 9 subclasses to cover 3 variants and 2 sizes (Ghost is not yet combined with a size)
class Button {}
class PrimaryButton extends Button {}
class SecondaryButton extends Button {}
class GhostButton extends Button {}
class LargeButton extends Button {}
class LargePrimaryButton extends LargeButton {} // combining dimensions
class LargeSecondaryButton extends LargeButton {}
class SmallButton extends Button {}
class SmallPrimaryButton extends SmallButton {}
class SmallSecondaryButton extends SmallButton {}
// Adding "Disabled" state: 9 more classes. Adding "XL" size: 3 more classes.
```
### Code Example: Fixed (Bridge + Strategy composition)
```typescript
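// TypeScript: dimensions composed, not subclassed
// Style and metric values below are illustrative placeholders.
type VariantStyles = { background: string; color: string };
type SizeMetrics = { height: number; paddingX: number };
interface ButtonVariant { getStyles(): VariantStyles; }
interface ButtonSize { getDimensions(): SizeMetrics; }
class PrimaryVariant implements ButtonVariant {
  getStyles(): VariantStyles { return { background: "blue", color: "white" }; }
}
class SecondaryVariant implements ButtonVariant {
  getStyles(): VariantStyles { return { background: "gray", color: "black" }; }
}
class GhostVariant implements ButtonVariant {
  getStyles(): VariantStyles { return { background: "transparent", color: "blue" }; }
}
class LargeSize implements ButtonSize {
  getDimensions(): SizeMetrics { return { height: 48, paddingX: 24 }; }
}
class SmallSize implements ButtonSize {
  getDimensions(): SizeMetrics { return { height: 32, paddingX: 12 }; }
}
class Button {
  constructor(
    private variant: ButtonVariant,
    private size: ButtonSize,
    private disabled: boolean = false
  ) {}
  // Adding new variant: 1 new class. Adding new size: 1 new class.
  // Total: 3 + 2 = 5 classes instead of 9, and scales linearly not multiplicatively.
}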
```
### Severity Calibration
- **Advisory:** Hierarchy with 5-7 subclasses, single dimension of variation
- **Warning:** Hierarchy with 8-20 subclasses or 2+ independent variation dimensions
- **Critical:** Hierarchy with more than 20 subclasses, clear combinatorial pattern, actively growing
### Applicable Patterns
| Pattern | When to use |
|---------|-------------|
| **Bridge** | Two independent dimensions of variation: separate the abstraction from its implementation |
| **Decorator** | Adding responsibilities dynamically by wrapping, not subclassing |
| **Strategy** | Vary one dimension of behavior at runtime by injecting a strategy object |
| **Composite** | Tree structures where branches and leaves share an interface |
| **Observer** | Notify multiple dependents without subclassing the subject |
---
## DS-08: Inability to Alter Classes
### Definition
A class that needs new behavior or a different interface cannot be modified: source code is unavailable (third-party library), modification would require changing many existing subclasses, or the class is so widely depended upon that any change is too risky.
### Why It Causes Redesign
Sometimes change is not possible through the class itself. A commercial library provides a class with the wrong interface. An internal "foundation" class has 50 downstream dependents. The class was designed without extension points. Without patterns to address this, the team either freezes their design (no new behavior) or copies the class (maintenance nightmare).
### Detection Criteria
- Classes that inherit directly from third-party or framework classes to add behavior
- Classes in vendor/dependency directories being subclassed in application code
- Statements in the codebase that suggest fear: "don't touch this class," "this is from the library," "we can't change this"
- Classes with no extension points (no abstract methods, no strategy injection, no event hooks) that callers need to customize
- Large numbers of existing subclasses that would all need updating if the base class changed
### Grep Patterns
**Java:**
```
# Find classes extending vendor/library classes
grep -rn "extends com\.\|extends org\.\|extends javax\." src/main/
# Find application files that import third-party packages (com.thirdparty is a placeholder); check each hit for subclassing
grep -rn "import com\.thirdparty\." src/ | grep -v vendor/
```
**Python:**
```
# Find classes extending third-party bases
grep -rnE "class [A-Z][A-Za-z]*\((django|flask|sqlalchemy)\." src/app/
```
**General signal:** Any class in `node_modules/`, Maven dependencies, pip packages that application classes inherit from directly.
### Code Example: Smell
```java
// Java — extending a library class to change its behavior
// Library: ThirdPartyLogger — source unavailable, cannot modify
public class ApplicationLogger extends ThirdPartyLogger {
// overriding to add our formatting
@Override
public void log(String message) {
super.log("[APP] " + message);
}
// If ThirdPartyLogger changes its log() signature: our class breaks
// If we need logging in tests: must have ThirdPartyLogger available
}
```
### Code Example: Fixed (Adapter + Decorator)
```java
// Java — wrap the library class, don't inherit it
public interface Logger {
void log(String message);
}
public class ThirdPartyLoggerAdapter implements Logger {
private final ThirdPartyLogger delegate;
public ThirdPartyLoggerAdapter(ThirdPartyLogger delegate) {
this.delegate = delegate;
}
@Override
public void log(String message) {
delegate.log("[APP] " + message);
}
// ThirdPartyLogger changes: only this adapter needs updating
// Tests: mock the Logger interface — no dependency on the library
}
```
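The heading names both Adapter and Decorator, while the Java block shows a single wrapper that does both jobs. A minimal TypeScript sketch of the two roles kept separate follows. All names are hypothetical: `ThirdPartyLogger` stands in for a library class, and its `write()` method is an assumption made so the adapter has an interface to convert.

```typescript
// Application-owned interface; client code depends only on this.
interface Logger {
  log(message: string): void;
}

// Stand-in for a library class we cannot modify (hypothetical API).
class ThirdPartyLogger {
  readonly lines: string[] = [];
  write(text: string): void {
    this.lines.push(text);
  }
}

// Adapter: converts the library's write() into the Logger interface.
class ThirdPartyLoggerAdapter implements Logger {
  private readonly delegate: ThirdPartyLogger;
  constructor(delegate: ThirdPartyLogger) {
    this.delegate = delegate;
  }
  log(message: string): void {
    this.delegate.write(message);
  }
}

// Decorator: adds a prefix to any Logger by wrapping it, not subclassing it.
class PrefixingLogger implements Logger {
  private readonly inner: Logger;
  private readonly prefix: string;
  constructor(inner: Logger, prefix: string) {
    this.inner = inner;
    this.prefix = prefix;
  }
  log(message: string): void {
    this.inner.log(this.prefix + message);
  }
}

const libraryLogger = new ThirdPartyLogger();
const appLogger: Logger = new PrefixingLogger(
  new ThirdPartyLoggerAdapter(libraryLogger),
  "[APP] "
);
appLogger.log("started");
// libraryLogger.lines is now ["[APP] started"]
```

Splitting the roles means a library API change touches only the adapter, while formatting changes touch only the decorator.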
### Severity Calibration
- **Advisory:** Application extends a stable, well-maintained framework class for a well-defined extension point (e.g., extending a servlet, controller base class) — this is often intentional
- **Warning:** Application extends a third-party class to change behavior rather than use an extension point
- **Critical:** A class that many subclasses depend on needs to change, but changing it requires updating all subclasses; or vendor dependency is extended and the vendor is actively changing their API
### Applicable Patterns
| Pattern | When to use |
|---------|-------------|
| **Adapter** | Convert the interface of a class you can't change into the interface your code expects |
| **Decorator** | Add behavior to a class you can't modify by wrapping it, not subclassing it |
| **Visitor** | Add new operations to a class hierarchy without modifying the hierarchy's classes |
---
## Pattern Overuse Warning (from GoF p. 41)
Design patterns achieve flexibility and variability by introducing additional levels of indirection. This can complicate a design and cost performance.
**A design pattern should only be applied when the flexibility it affords is actually needed.**
Before recommending any pattern, apply this check:
| Question | If YES | If NO |
|----------|--------|-------|
| Does this code actually need to vary in the way the pattern enables? | Pattern is justified | Flag as low priority or skip |
| Has this code already needed to change in the way the smell predicts? | Raise priority | Keep as Advisory |
| Would a developer without pattern knowledge understand the result? | Acceptable trade-off | Consider simpler approach |
| Does the performance cost of indirection matter here? | Evaluate Consequences section of the pattern | Proceed |
The Consequences section of each GoF pattern entry describes its liabilities explicitly. Always review liabilities before recommending.
---
## Severity Reference Matrix
| Spread | Friction | Severity | Recommended Action |
|:------:|:--------:|:--------:|-------------------|
| Isolated (1-2 sites) | Low | Advisory | Document; address opportunistically |
| Isolated (1-2 sites) | High | Warning | Plan for next refactoring sprint |
| Moderate (3-10 sites) | Low | Warning | Plan for next refactoring sprint |
| Moderate (3-10 sites) | High | Critical | Address before next feature work |
| Pervasive (11+ sites) | Low | Warning | Plan phased refactoring |
| Pervasive (11+ sites) | High | Critical | Address immediately; the smell is blocking velocity |
**Friction:** How often does this smell cause actual change pain?
- Low: Code is stable; this area is rarely modified
- Medium: Code changes occasionally; smell has caused rework 1-2 times
- High: Code changes frequently; smell is actively slowing the team
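The matrix can be encoded as a lookup table, sketched below. The type and function names are illustrative. Two assumptions are made: a count of exactly 10 sites is treated as Moderate, and only Low and High friction are accepted because the matrix has no Medium rows.

```typescript
type Spread = "isolated" | "moderate" | "pervasive";
type Friction = "low" | "high"; // the matrix defines no Medium rows
type Severity = "Advisory" | "Warning" | "Critical";

// Direct transcription of the matrix rows.
const SEVERITY_MATRIX: Record<Spread, Record<Friction, Severity>> = {
  isolated: { low: "Advisory", high: "Warning" },
  moderate: { low: "Warning", high: "Critical" },
  pervasive: { low: "Warning", high: "Critical" },
};

// Assumption: exactly 10 sites counts as Moderate.
function classifySpread(siteCount: number): Spread {
  if (Math.floor(siteCount) !== siteCount || siteCount < 1) {
    throw new RangeError("siteCount must be a positive integer");
  }
  if (siteCount <= 2) return "isolated";
  if (siteCount <= 10) return "moderate";
  return "pervasive";
}

function severityFor(siteCount: number, friction: Friction): Severity {
  return SEVERITY_MATRIX[classifySpread(siteCount)][friction];
}
```

For example, `severityFor(5, "high")` returns `"Critical"`, matching the Moderate/High row.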
---
## Pattern Quick Reference (smell-to-pattern mapping)
| Smell | Primary Patterns | Secondary Patterns |
|-------|-----------------|-------------------|
| DS-01 Hardcoded creation | Abstract Factory, Factory Method | Prototype |
| DS-02 Operation dependencies | Command, Chain of Responsibility | — |
| DS-03 Platform coupling | Abstract Factory, Bridge | — |
| DS-04 Representation leaks | Abstract Factory, Bridge | Memento, Proxy |
| DS-05 Algorithmic dependencies | Strategy, Template Method | Builder, Iterator, Visitor |
| DS-06 Tight coupling | Abstract Factory, Bridge, Mediator | Command, Chain of Responsibility, Facade, Observer |
| DS-07 Subclass explosion | Bridge, Decorator, Strategy | Composite, Observer |
| DS-08 Frozen classes | Adapter, Decorator | Visitor |
Evaluate object-oriented designs against the two foundational GoF principles: 'Program to an interface, not an implementation' and 'Favor object composition...
---
name: oo-design-principle-evaluator
description: |
Evaluate object-oriented designs against the two foundational GoF principles: 'Program to an interface, not an implementation' and 'Favor object composition over class inheritance.' Use when reviewing class hierarchies, assessing reuse strategies, or deciding between inheritance and composition for a specific design problem. Identifies violations like white-box reuse, broken encapsulation from subclassing, concrete class coupling, delegation overuse, and inheritance hierarchies that should be flattened. Use when someone says 'should I use inheritance or composition here', 'is this class hierarchy right', 'my subclass is breaking when the parent changes', 'how do I make this more flexible', 'is this design too rigid', 'I want to change behavior at runtime', or 'review my OO design for best practices'. Produces a design principle compliance report with specific violations and recommended refactoring.
model: sonnet
context: 1M
execution:
tier: 1
mode: hybrid
inputs:
- type: codebase
description: "Class definitions, inheritance hierarchies, or design descriptions to evaluate"
- type: none
description: "Skill can also work from a verbal description of the design"
tools-required: [Read, TodoWrite]
tools-optional: [Grep, Bash]
environment: "Can run from any directory; codebase access improves analysis depth"
---
# OO Design Principle Evaluator
## When to Use
Use this skill when you are:
- **Reviewing a class hierarchy** — assessing whether inheritance is being used correctly or overused
- **Deciding between inheritance and composition** — need a principled recommendation for a specific design choice
- **Diagnosing brittleness** — a subclass is breaking when parent classes change, indicating broken encapsulation
- **Designing for flexibility** — want behavior that can change at runtime, which inheritance cannot provide
- **Code reviewing OO designs** — evaluating whether a design follows established reuse best practices
Preconditions: you have at least one of:
- Source code with class definitions and inheritance relationships
- A verbal or written description of the class hierarchy and how classes relate
- A UML diagram or design document describing the object model
**Agent:** Before starting, confirm whether you have access to source code or only a description. Code access enables deeper analysis (scanning for concrete class instantiation, checking method override patterns). A description enables principle-level analysis.
## Context & Input Gathering
### Input Sufficiency Check
```
User prompt → Extract the design being evaluated (classes, relationships, context)
↓
Environment → Scan for class files, inheritance declarations, interface definitions
↓
Gap analysis → Do I know WHAT is being evaluated and WHY they're concerned?
↓
Missing critical info? ──YES──→ ASK (one question at a time)
│
NO
↓
PROCEED with analysis
```
### Required Context (must have — ask if missing)
- **The design being evaluated:** What classes, relationships, or hierarchy is under review?
→ Check prompt for: class names, inheritance descriptions, "extends"/"inherits" language, code snippets
→ Check environment for: source files with class definitions, UML files, design docs
→ If still missing, ask: "Can you describe the class hierarchy or share the relevant code? I need to know which classes inherit from which, and what behavior each class is trying to reuse or extend."
- **The design goal:** What is the design trying to achieve? What problem does the hierarchy solve?
→ Check prompt for: stated purpose, feature description, problem context
→ If still missing, ask: "What behavior are you trying to reuse or share across these classes? Understanding the goal helps me assess whether inheritance is the right mechanism."
### Observable Context (gather from environment)
- **Concrete class instantiation:** Look for `new ConcreteClass()` in client code — signals coupling to implementation rather than interface.
→ Look for: constructor calls, direct class references in field declarations, factory-less instantiation
→ If unavailable: assess from description
- **Abstract class / interface usage:** Check whether clients depend on abstractions or concrete types.
→ Look for: abstract classes, interfaces, protocol declarations, abstract base classes
→ If unavailable: assume based on user description
- **Override patterns:** Identify whether subclasses merely add behavior or override parent internals.
→ Look for: method overrides that call `super`, overrides that replace parent logic entirely
→ If unavailable: note as unverifiable
### Default Assumptions
- If language is not specified: assume a statically typed OO language (Java, C#, C++) where the inheritance-vs-composition distinction is most consequential.
- If no design context given: assume production code under active development (not a throwaway prototype), so flexibility and maintainability matter.
- If no explicit goal stated: assume the design should support future extension without modification.
### Questioning Guidelines
Ask ONE question at a time, most critical first. Show what you already know before asking. State why you need the information.
## Process
Use `TodoWrite` to track steps before beginning.
```
TodoWrite([
{ id: "1", content: "Gather design context and identify all classes/relationships", status: "pending" },
{ id: "2", content: "Apply Principle 1: Program to interface, not implementation", status: "pending" },
{ id: "3", content: "Apply Principle 2: Favor composition over inheritance", status: "pending" },
{ id: "4", content: "Evaluate delegation usage (simplifies vs complicates)", status: "pending" },
{ id: "5", content: "Assess run-time vs compile-time flexibility needs", status: "pending" },
{ id: "6", content: "Produce compliance report with violations and recommendations", status: "pending" }
])
```
---
### Step 1: Map the Design
**ACTION:** Build a structural map of the design — all classes, their inheritance relationships, interfaces or abstract types they implement, and how clients reference them.
**WHY:** You cannot evaluate OO principles without first understanding the structure. Many violations are invisible until you see the full inheritance tree. A class that looks reasonable in isolation may reveal a 5-level deep hierarchy when mapped completely — a strong signal of inheritance overuse.
**AGENT: EXECUTES** — read source files or extract from user description
If code is available:
- Identify all classes and their parent classes
- Identify all interfaces/abstract classes used
- Note which types appear in field declarations, method parameters, and return types in client code
- Note any `new ConcreteClass()` calls in clients
If only description is available:
- Reconstruct the hierarchy from user's explanation
- Note any ambiguities explicitly in the report
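When code is available, part of the mapping can be mechanized. The sketch below is a rough aid rather than a parser: the helper names `mapHierarchy` and `depthOf` are hypothetical, and the regular expression only recognizes the plain `class Child extends Parent` form, so it misses generic classes and `implements` clauses.

```typescript
// Extract "class Child extends Parent" pairs from source text.
function mapHierarchy(source: string): Record<string, string> {
  const parentOf: Record<string, string> = {};
  const pattern = /class\s+([A-Za-z_]\w*)\s+extends\s+([A-Za-z_][\w.]*)/g;
  let match = pattern.exec(source);
  while (match !== null) {
    parentOf[match[1]] = match[2];
    match = pattern.exec(source);
  }
  return parentOf;
}

// Inheritance depth: 0 for a root class, 1 for its direct subclass, and so on.
function depthOf(className: string, parentOf: Record<string, string>): number {
  let depth = 0;
  let current = className;
  const seen: Record<string, boolean> = {}; // guards against malformed cycles
  while (Object.prototype.hasOwnProperty.call(parentOf, current) && !seen[current]) {
    seen[current] = true;
    current = parentOf[current];
    depth += 1;
  }
  return depth;
}

const sampleSource = `
class Animal {}
class Bird extends Animal {}
class Duck extends Bird {}
`;
const hierarchy = mapHierarchy(sampleSource);
const duckDepth = depthOf("Duck", hierarchy); // 2
```

A depth that keeps climbing as the map grows is the signal to look more closely at that branch in Step 3.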
Mark Step 1 complete in TodoWrite.
---
### Step 2: Apply Principle 1 — Program to an Interface, Not an Implementation
**ACTION:** For each relationship where a client holds a reference to another object, determine whether the client is coupled to an interface/abstract type or a concrete class.
**WHY:** When clients depend on concrete classes, changes to those classes force changes in clients. This creates cascading modifications — the opposite of reusable design. Programming to an interface means clients only know about the abstract contract, not the specific class behind it. Two benefits follow: (1) clients remain unaware of the specific type of object they use, and (2) clients remain unaware of which class implements that object. This greatly reduces implementation dependencies between subsystems.
**Violation signals to look for:**
| Signal | What it means |
|--------|---------------|
| Field declared as `ConcreteClass foo` | Client coupled to implementation |
| Method parameter typed as `ConcreteClass` | Caller must provide a specific class, not any compatible type |
| `new ConcreteClass()` scattered through client code | Object creation mixed with business logic; no interface boundary |
| No abstract class or interface in the hierarchy | Nothing defines a contract independent of implementation |
| Subclass inheriting from a concrete class | Inheriting implementation details, not just interface |
**Class vs Interface Inheritance — critical distinction:**
- **Class inheritance** defines an object's implementation in terms of another's implementation. It is a mechanism for code and representation sharing.
- **Interface inheritance** (subtyping) describes when an object can be used in place of another — it defines substitutability, not implementation sharing.
Confusing these two is a root cause of brittle designs. A class can inherit an interface (what it can do) without inheriting implementation (how it does it). Pure interface inheritance in C++ means inheriting from abstract classes with pure virtual functions. In Java/C#, it means implementing interfaces. In Python/duck-typed languages, structural compatibility serves the same role.
**Detection question:** "Is the client bound to HOW this object does its job, or only to WHAT it promises to do?"
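The contrast between the first three violation signals and a compliant client can be sketched as follows. All class names are hypothetical; `EmailNotifier` only records messages in memory so the example stays self-contained.

```typescript
// The contract: WHAT a notifier promises to do.
interface Notifier {
  notify(recipient: string, text: string): void;
}

// One implementation: HOW it is done (recorded in memory here).
class EmailNotifier implements Notifier {
  readonly outbox: string[] = [];
  notify(recipient: string, text: string): void {
    this.outbox.push("mail to " + recipient + ": " + text);
  }
}

// Violation: the field type and the `new` call both name the concrete class.
class CoupledOrderService {
  private readonly notifier: EmailNotifier = new EmailNotifier();
  placeOrder(customer: string): void {
    this.notifier.notify(customer, "Order received");
  }
}

// Compliant: depends only on Notifier; the caller chooses the implementation.
class OrderService {
  private readonly notifier: Notifier;
  constructor(notifier: Notifier) {
    this.notifier = notifier;
  }
  placeOrder(customer: string): void {
    this.notifier.notify(customer, "Order received");
  }
}

// A test double substitutes without any change to OrderService.
class RecordingNotifier implements Notifier {
  readonly sent: string[] = [];
  notify(recipient: string, text: string): void {
    this.sent.push(recipient + ": " + text);
  }
}

const recorder = new RecordingNotifier();
new OrderService(recorder).placeOrder("ada");
// recorder.sent is now ["ada: Order received"]
```

`CoupledOrderService` cannot be tested or reconfigured without `EmailNotifier`, which is the practical cost the detection question is probing for.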
Mark Step 2 complete in TodoWrite.
---
### Step 3: Apply Principle 2 — Favor Composition over Inheritance
**ACTION:** For each use of class inheritance (not interface inheritance), assess whether it is justified or whether object composition would serve better.
**WHY:** Inheritance is defined at compile-time and cannot change at run-time. It also breaks encapsulation: parent classes often define at least part of a subclass's physical representation, exposing internal details. This makes subclasses so bound to their parent's implementation that any change in the parent forces a change in the subclass — the "fragile base class" problem. Object composition, by contrast, is defined dynamically at run-time. Objects acquired through composition are accessed only through their interfaces, so encapsulation is preserved. Any object can be replaced at run-time by another of the same type.
**Four inheritance failure modes to check:**
1. **Cannot change behavior at run-time** — Does the design require behavior to switch dynamically? Inheritance fixes behavior at compile-time; composition allows swapping collaborators at run-time.
2. **Broken encapsulation** — Does the subclass depend on parent internals? Inheriting from a concrete class often means the subclass must understand the parent's implementation, not just its interface. This is white-box reuse.
3. **Subclass permanently bound to parent** — If ANY aspect of the inherited implementation is inappropriate for new problem domains, the parent must be rewritten or replaced. This limits reusability.
4. **Implementation dependency chain** — Does the design have deeply nested concrete inheritance? Each level adds a dependency. The cure: inherit only from abstract classes, which provide little or no implementation and therefore create no implementation dependencies.
**White-box vs black-box reuse:**
- **White-box reuse (inheritance):** Internal details of the parent are visible to the subclass. The subclass knows HOW the parent works. Changes to parent internals break subclasses.
- **Black-box reuse (composition):** Composed objects are accessed only through their interfaces. The composing object knows only WHAT the component does, not HOW.
**Assessment questions:**
- "If the parent class changes its internal implementation, does the subclass break?"
- "Does the subclass need to override methods that call other parent methods (fragile override chains)?"
- "Could this behavior be achieved by holding a reference to a collaborator object instead of inheriting from it?"
- "Does this hierarchy keep each class encapsulated and focused on one task, or does it create large entangled hierarchies?"
Mark Step 3 complete in TodoWrite.
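The first two assessment questions can be made concrete with a small sketch of a fragile override chain. The classes are hypothetical. The point is that `addAll()` calling `add()` is an internal detail of the parent, yet the subclass's correctness depends on it.

```typescript
// Parent class. That addAll() is implemented via add() is an internal detail.
class ItemBag {
  protected items: string[] = [];
  add(item: string): void {
    this.items.push(item);
  }
  addAll(newItems: string[]): void {
    for (const item of newItems) {
      this.add(item); // self-call, invisible in the public interface
    }
  }
  size(): number {
    return this.items.length;
  }
}

// White-box reuse: counts additions by overriding both methods.
class CountingBag extends ItemBag {
  added = 0;
  add(item: string): void {
    this.added += 1;
    super.add(item);
  }
  addAll(newItems: string[]): void {
    this.added += newItems.length;
    super.addAll(newItems); // re-enters the overridden add()
  }
}

// Black-box reuse: holds an ItemBag and forwards to it.
class CountingBagWrapper {
  added = 0;
  private readonly bag: ItemBag = new ItemBag();
  add(item: string): void {
    this.added += 1;
    this.bag.add(item);
  }
  addAll(newItems: string[]): void {
    this.added += newItems.length;
    this.bag.addAll(newItems);
  }
  size(): number {
    return this.bag.size();
  }
}

const inherited = new CountingBag();
inherited.addAll(["a", "b"]); // inherited.added is 4: each item is counted twice

const composed = new CountingBagWrapper();
composed.addAll(["a", "b"]); // composed.added is 2
```

If `ItemBag.addAll()` were later rewritten to push items directly, `CountingBag` would silently change behavior again, while the wrapper would be unaffected.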
---
### Step 4: Evaluate Delegation Usage
**ACTION:** Identify any use of delegation (a class forwarding requests to a component it holds a reference to). Assess whether each delegation point simplifies or complicates the design.
**WHY:** Delegation is a way to make composition as powerful as inheritance for reuse. The key test is: "Does this delegation simplify more than it complicates?" Delegation makes software more flexible but also harder to understand — dynamic, highly parameterized software is harder to read than more static software. There are also run-time inefficiencies. Delegation is a good design choice only when it simplifies more than it complicates. It works best when used in well-established patterns (Strategy, State, Visitor) rather than ad hoc.
**How delegation works:** In delegation, two objects handle a request: the receiver delegates the operation to its delegate. The receiver passes itself to the delegate so the delegated operation can refer back to the receiver. The main advantage: behaviors can be composed and changed at run-time by replacing the delegate object (as long as the replacement has the same type/interface).
**The Window/Rectangle test:** Instead of making Window a subclass of Rectangle (because windows are rectangular), Window holds a Rectangle instance and delegates area calculations to it. Window "has a" Rectangle rather than "is a" Rectangle. This is delegation enabling composition. The question: is this clearer than inheritance would have been?
**Delegation assessment signals:**
- Delegation that enables run-time behavior change → justified
- Delegation following a named pattern (Strategy, State, Visitor, Bridge) → justified; this is delegation used in the stylized way that standard patterns provide
- Ad hoc delegation chains with many small forwarding methods → may complicate more than simplify
- Delegation introduced where a simple method call would suffice → likely overengineered
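The Window/Rectangle test can be sketched as below. The window class is named `AppWindow` here only to avoid clashing with the browser's global `Window` type. `Shape` and `Circle` are illustrative additions, and this simplified version does not pass the receiver to the delegate because the area calculation does not need it.

```typescript
interface Shape {
  area(): number;
}

class Rectangle implements Shape {
  private readonly width: number;
  private readonly height: number;
  constructor(width: number, height: number) {
    this.width = width;
    this.height = height;
  }
  area(): number {
    return this.width * this.height;
  }
}

class Circle implements Shape {
  private readonly radius: number;
  constructor(radius: number) {
    this.radius = radius;
  }
  area(): number {
    return Math.PI * this.radius * this.radius;
  }
}

// "Has a" shape rather than "is a" rectangle.
class AppWindow {
  private shape: Shape;
  constructor(shape: Shape) {
    this.shape = shape;
  }
  setShape(shape: Shape): void {
    this.shape = shape; // the delegate can be replaced at run-time
  }
  area(): number {
    return this.shape.area(); // forward the request to the delegate
  }
}

const appWindow = new AppWindow(new Rectangle(4, 3));
const rectangularArea = appWindow.area(); // 12
appWindow.setShape(new Circle(1));
const circularArea = appWindow.area(); // Math.PI
```

Here the single forwarding method buys run-time shape changes, so by the first signal above the delegation is justified.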
Mark Step 4 complete in TodoWrite.
---
### Step 5: Assess Run-Time vs Compile-Time Structure
**ACTION:** Identify where the design's flexibility requirements are run-time vs compile-time, and check whether the mechanism used matches the need.
**WHY:** An OO program's run-time structure often bears little resemblance to its compile-time code structure. Compile-time structure is frozen in inheritance relationships. Run-time structure is the network of communicating objects — which is far more dynamic. Inheritance cannot provide run-time flexibility; composition can. Designs that try to use inheritance to achieve run-time variability end up requiring new subclasses for every behavioral variant — a combinatorial explosion.
**Also assess aggregation vs acquaintance:**
- **Aggregation:** One object owns or is responsible for another; they share identical lifetimes. Stronger relationship — changes to the aggregate affect the component.
- **Acquaintance (association):** One object merely knows about another; they aren't responsible for each other. Weaker relationship, more dynamic — acquaintances are made and remade more frequently, sometimes only for the duration of one operation.
Both are often implemented the same way in code (as references), but their INTENT differs. Designs that treat acquaintances as if they were aggregates create unnecessary coupling.
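A small sketch of the difference in intent follows. All names are hypothetical. `PurchaseOrder` creates and owns its line items (aggregation), while the printer is known only for the duration of one call (acquaintance).

```typescript
class LineItem {
  readonly sku: string;
  readonly quantity: number;
  constructor(sku: string, quantity: number) {
    this.sku = sku;
    this.quantity = quantity;
  }
}

interface ReceiptPrinter {
  print(line: string): void;
}

class PurchaseOrder {
  // Aggregation: the order creates and owns its items; they share its lifetime.
  private readonly items: LineItem[] = [];

  addItem(sku: string, quantity: number): void {
    this.items.push(new LineItem(sku, quantity));
  }

  // Acquaintance: the printer is passed in and not stored beyond this call.
  printTo(printer: ReceiptPrinter): void {
    for (const item of this.items) {
      printer.print(item.quantity + " x " + item.sku);
    }
  }
}

class BufferPrinter implements ReceiptPrinter {
  readonly output: string[] = [];
  print(line: string): void {
    this.output.push(line);
  }
}

const purchaseOrder = new PurchaseOrder();
purchaseOrder.addItem("A-1", 2);
const bufferPrinter = new BufferPrinter();
purchaseOrder.printTo(bufferPrinter);
// bufferPrinter.output is now ["2 x A-1"]
```

Storing the printer in a field would turn the acquaintance into a long-lived dependency, which is the unnecessary coupling this step looks for.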
Mark Step 5 complete in TodoWrite.
---
### Step 6: Produce the Compliance Report
**ACTION:** Write a structured design principle compliance report covering all findings from Steps 2-5.
**WHY:** The report must be specific and actionable. Identifying "uses inheritance" is not useful. Identifying "class `ReportGenerator` inherits from `DatabaseConnection` to reuse connection pooling, but connection strategy cannot change at run-time and a change to `DatabaseConnection`'s pool sizing logic will force changes in `ReportGenerator`" is useful — and immediately points to the refactoring (extract an interface, inject a connection strategy via composition).
**HANDOFF TO HUMAN** — the agent prepares the report; the human reviews, prioritizes, and executes refactoring.
**Report format:**
```markdown
# OO Design Principle Compliance Report
## Design Under Review
[Brief description of what was evaluated]
## Principle 1: Program to an Interface, Not an Implementation
**Compliance:** [Compliant / Partial / Violated]
### Violations Found
- [Class/relationship]: [Specific violation description and why it matters]
### Recommendations
- [Specific refactoring: what to change and what pattern/structure to use instead]
## Principle 2: Favor Composition over Inheritance
**Compliance:** [Compliant / Partial / Violated]
### Inheritance Failure Modes Detected
- [ ] Cannot change behavior at run-time
- [ ] Broken encapsulation (white-box reuse)
- [ ] Subclass permanently bound to parent implementation
- [ ] Implementation dependency chain (concrete inheritance depth > 1)
### Violations Found
- [Class/relationship]: [Specific violation with failure mode]
### Recommendations
- [Specific refactoring: what composition structure to use, which pattern applies]
## Delegation Assessment
**Usage:** [Not present / Appropriate / Overused / Underused]
- [Finding]: [Simplifies or complicates, and why]
## Run-Time vs Compile-Time Structure
**Flexibility match:** [Matched / Mismatched]
- [Finding]: [Where run-time flexibility is needed but compile-time mechanism was used, or vice versa]
## Aggregation vs Acquaintance
- [Any relationships where intent doesn't match implementation]
## Summary
**Overall compliance:** [High / Medium / Low]
**Priority refactorings:**
1. [Most impactful change]
2. [Second priority]
3. [Third priority]
**What to preserve:** [Aspects of the design that are already well-structured]
```
Mark Step 6 complete in TodoWrite.
## Inputs
- **Class definitions or design description:** source code, UML diagram, verbal description of the hierarchy
- **Design context:** what the hierarchy is trying to achieve (reuse goal, feature being modeled)
- **Language/framework:** helps calibrate which mechanisms are available (interfaces, abstract classes, generics/templates)
## Outputs
- **Design Principle Compliance Report** (Markdown) — structured assessment with violations, failure modes, and specific refactoring recommendations
- **Decision rationale** — for each recommendation, the WHY (which principle is violated, which failure mode applies)
## Key Principles
- **Distinguish class inheritance from interface inheritance** — class inheritance shares implementation and breaks encapsulation; interface inheritance defines substitutability. A design can use interface inheritance without white-box reuse. The violation is class inheritance where only interface inheritance is needed.
- **Inheritance is a compile-time commitment; composition is a run-time commitment** — when behavior must vary at run-time, composition is not optional. Inheritance cannot fulfill this need without creating new subclasses for every variant. Evaluate what flexibility the design actually requires before prescribing either mechanism.
- **White-box reuse is a red flag, not a feature** — inheriting from a concrete class means the subclass knows the parent's internals. This is usually described as "convenient" but is actually a coupling problem. Evaluate whether the convenience is worth the brittleness.
- **Delegation is good only when it simplifies more than it complicates** — not every method forwarding is delegation in the GoF sense. True delegation passes `self` to the delegate so the delegate can refer back to the receiver. Assess whether the indirection is earning its complexity cost.
- **Small, encapsulated classes with clear interfaces beat deep hierarchies** — favoring composition keeps each class focused on one task. Hierarchies grow into unmanageable monsters when inheritance is the default reuse mechanism. The question is not "can I inherit?" but "should I inherit, or does holding a reference serve better?"
- **Abstract classes are the safe inheritance target** — inheriting from abstract classes creates minimal implementation dependencies because abstract classes provide little or no implementation. When inheritance is genuinely warranted, the parent should be abstract.
## Examples
**Scenario: Animal hierarchy with behavior coupling**
Trigger: "I have `Animal → Bird → Duck` with `fly()` defined on `Bird`. Now I need `Penguin` and `Ostrich` and I'm stuck."
Process:
1. Map hierarchy: `Animal` (abstract) → `Bird` (concrete, has `fly()`) → `Duck`, `Penguin`, `Ostrich`
2. Principle 1 check: `Bird` is concrete with `fly()` implementation — subclasses inherit implementation, not just interface. If clients reference `Bird`, they're coupled to the flying implementation.
3. Principle 2 check: `fly()` behavior cannot change at run-time. `Penguin` must override `fly()` to throw or do nothing — a violation of substitutability. The hierarchy is trying to share behavior that not all subtypes share.
4. Failure modes: broken encapsulation (Penguin must override parent internals), subclass bound to parent (Penguin must know Bird's flying assumption).
5. Recommendation: Extract a `FlyingBehavior` interface. `Bird` holds a reference to `FlyingBehavior` (composition). `Duck` gets `StandardFlight`, `Penguin` gets `NoFlight`, and new birds get the right behavior without touching the hierarchy. A bird whose flight ability changes at run-time (for example, an injured duck) can swap its behavior object.
Output: Compliance report identifying white-box reuse violation, recommending Strategy pattern for flying behavior.
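The recommendation in this scenario could look roughly like the sketch below. `FlyingBehavior`, `StandardFlight`, and `NoFlight` come from the recommendation. The method bodies and the `setFlyingBehavior` name are assumptions, and species are modeled as `Bird` instances rather than subclasses to keep the example short.

```typescript
interface FlyingBehavior {
  fly(): string;
}

class StandardFlight implements FlyingBehavior {
  fly(): string {
    return "flaps wings and flies";
  }
}

class NoFlight implements FlyingBehavior {
  fly(): string {
    return "cannot fly";
  }
}

class Bird {
  readonly species: string;
  private flying: FlyingBehavior;
  constructor(species: string, flying: FlyingBehavior) {
    this.species = species;
    this.flying = flying;
  }
  // Strategy swap: behavior can change at run-time.
  setFlyingBehavior(flying: FlyingBehavior): void {
    this.flying = flying;
  }
  fly(): string {
    return this.species + " " + this.flying.fly();
  }
}

const duck = new Bird("Duck", new StandardFlight());
const penguin = new Bird("Penguin", new NoFlight());

const duckBefore = duck.fly(); // "Duck flaps wings and flies"
const penguinResult = penguin.fly(); // "Penguin cannot fly"

duck.setFlyingBehavior(new NoFlight()); // e.g., the duck is injured
const duckAfter = duck.fly(); // "Duck cannot fly"
```

No class has to override `fly()` to throw or do nothing, so every `Bird` remains substitutable for any other.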
---
**Scenario: Window inheriting from Rectangle**
Trigger: "My `Window` class extends `Rectangle` because windows are rectangular and I wanted the area calculation."
Process:
1. Map hierarchy: `Window extends Rectangle`. Client code uses `Window` directly.
2. Principle 1 check: `Window` is coupled to `Rectangle`'s implementation. If window shape changes (not all windows are rectangular), `Window` must change its parent or override geometry methods.
3. Principle 2 check: Window "is a" Rectangle is the claim, but "Window has a shape" is more accurate. The reuse is purely for `Area()` — implementation reuse, not subtype substitutability. If `Rectangle`'s area formula changes (e.g., to account for different coordinate systems), `Window` inherits the change whether it wants it or not.
4. Failure modes: cannot change shape at run-time (a `Window` is always a `Rectangle`); white-box reuse (Window depends on Rectangle's implementation of area).
5. Recommendation: `Window` holds a `Shape` interface reference. Area calculation is delegated to the `Shape`. At run-time, `Window` can be circular by replacing its `Shape` delegate. This is delegation as GoF describe it; delegation is a technique that several patterns rely on, not a catalog pattern in its own right.
Output: Report recommending delegation over inheritance; notes this is exactly the Window/Rectangle canonical delegation example.
---
**Scenario: Plugin system using abstract base class correctly**
Trigger: "We have an abstract `DataExporter` with concrete `CsvExporter`, `JsonExporter`, `XmlExporter`. Is this design OK?"
Process:
1. Map hierarchy: `DataExporter` (abstract, defines interface) → `CsvExporter`, `JsonExporter`, `XmlExporter` (concrete, all siblings, no further inheritance).
2. Principle 1 check: If client code references `DataExporter` (the abstract type), not the concrete exporters — compliant. Creational patterns or factories handle instantiation, keeping client code decoupled.
3. Principle 2 check: Each concrete exporter inherits from an abstract class (no implementation to inherit, only interface). No white-box reuse. Each exporter is focused and encapsulated. No deep hierarchy.
4. Failure modes: None detected. Runtime flexibility is preserved — a new exporter can be added without modifying clients.
5. Recommendation: Compliant design. Verify client code uses `DataExporter` type, not concrete types. If new exporters need shared utility methods, add them to `DataExporter` as concrete methods carefully, or extract to a separate utility class to avoid introducing implementation coupling.
Output: Report with High compliance rating; notes the design correctly uses abstract class as a pure interface target.
## References
- For detailed comparison of all reuse mechanisms with decision criteria, see [reuse-mechanisms.md](references/reuse-mechanisms.md)
- Source: *Design Patterns: Elements of Reusable Object-Oriented Software*, GoF (Gang of Four), Chapter 1, pages 28-34
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
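Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills), based on *Design Patterns: Elements of Reusable Object-Oriented Software* by Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides (GoF).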
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/reuse-mechanisms.md
# Reuse Mechanisms in Object-Oriented Design
Detailed comparison of the four mechanisms for reusing functionality in OO systems, plus delegation as a specific form of object composition. Read this when evaluating which mechanism to recommend for a specific design, or when a violation requires a specific refactoring path.
Source: *Design Patterns: Elements of Reusable Object-Oriented Software* (GoF), Chapter 1, pages 28-34.
---
## Overview: The Four Mechanisms
| Mechanism | Defined at | Encapsulation | Run-time flexibility | Implementation dependency |
|-----------|-----------|---------------|---------------------|--------------------------|
| Class inheritance | Compile-time | Broken (white-box) | None | High |
| Interface inheritance | Compile-time | Preserved | None | None |
| Object composition | Run-time | Preserved (black-box) | High | Low |
| Parameterized types (generics/templates) | Compile-time | Preserved | None | Low |
---
## 1. Class Inheritance (White-Box Reuse)
**What it is:** Defining one class's implementation in terms of another's. The subclass inherits the parent's data representation and operations.
**Why it's called white-box:** The internals of the parent class are often visible to (and depended on by) the subclass. The subclass can see HOW the parent works, not just WHAT it promises.
**Advantages:**
- Defined statically at compile-time — straightforward and directly supported by the language
- Easy to modify the implementation being reused — subclass can override specific operations
- Less code — inheriting most of what you need from an existing class
**Disadvantages:**
1. **Cannot change at run-time** — the class relationship is fixed at compile-time. If you need a different behavior, you need a different class.
2. **Breaks encapsulation** — parent classes often define part of subclasses' physical representation. Because inheritance exposes a subclass to details of its parent's implementation, it is said that "inheritance breaks encapsulation." (Snyder, 1986)
3. **Subclass bound to parent implementation** — the implementation of a subclass becomes so bound up with the implementation of its parent class that any change in the parent's implementation will force the subclass to change.
4. **Implementation dependency** — if any aspect of the inherited implementation is not appropriate for new problem domains, the parent class must be rewritten or replaced. This dependency limits flexibility and ultimately reusability.
**When to use:**
- The subclass is truly a subtype of the parent (satisfies the Liskov Substitution Principle)
- The parent is abstract (provides no implementation — only interface) — this is the safe case
- You need to create families of objects with identical interfaces (polymorphism)
- Implementation sharing is secondary; the primary goal is subtype relationship
**The cure for inheritance misuse:** Inherit only from abstract classes, since they usually provide little or no implementation. This eliminates implementation dependencies while preserving the interface relationship.
---
## 2. Interface Inheritance (Subtyping)
**What it is:** Describing when an object can be used in place of another — defining substitutability, not implementation sharing.
**How it differs from class inheritance:** Class inheritance defines implementation. Interface inheritance defines type compatibility. An object can have many types; objects of different classes can have the same type.
**In practice:**
- **C++/Eiffel:** Inherit from pure abstract classes (classes with only pure virtual functions). Public inheritance from a class with pure virtual functions approximates interface inheritance.
- **Java/C#:** Implement interfaces (Java uses the `implements` keyword; C# lists the interface after a colon in the class declaration).
- **Python/Smalltalk/duck-typed languages:** Structural compatibility — if an object responds to the messages expected, it satisfies the type.
**Benefits:**
- Clients remain unaware of the specific types of objects they use (as long as objects adhere to the expected interface)
- Clients remain unaware of which class implements those objects
- Greatly reduces implementation dependencies between subsystems
**The GoF principle it enables:** *Program to an interface, not an implementation.*
**Design patterns that depend on this distinction:**
- Chain of Responsibility — objects must have a common type but usually don't share a common implementation
- Composite — Component defines a common interface; Composite often defines a common implementation
- Command, Observer, State, Strategy — often implemented with abstract classes that are pure interfaces
---
## 3. Object Composition (Black-Box Reuse)
**What it is:** Assembling or composing objects to get more complex functionality. New functionality is obtained by assembling existing objects rather than defining it in a new class.
**Why it's called black-box:** Objects are accessed solely through their interfaces. No internal details of composed objects are visible to the composing object. Objects appear only as "black boxes."
**How it works:** Requires that objects being composed have well-defined interfaces. The composing object holds a reference to a component object and calls its interface to get behavior. The component can be replaced at run-time by any other object with the same interface/type.
**Advantages:**
1. **Run-time flexibility** — any object can be replaced at run-time by another as long as it has the same type. Behavior can change dynamically.
2. **Preserved encapsulation** — objects are accessed only through interfaces; no white-box visibility into internals.
3. **Fewer implementation dependencies** — because implementation is written in terms of object interfaces, there are substantially fewer implementation dependencies.
4. **Smaller, focused classes** — favoring composition keeps each class encapsulated and focused on one task. Class hierarchies remain small and manageable.
5. **Behavior defined by interrelationships** — a design based on object composition has more objects (if fewer classes), and the system's behavior depends on their interrelationships instead of being defined in one large class.
**Disadvantages:**
- More objects to manage and understand
- Dynamic, highly parameterized software is harder to understand than more static software
- Requires carefully designed interfaces that don't stop you from using one object with many others
**When to use:**
- Behavior needs to vary at run-time
- The reuse relationship is "has-a" rather than "is-a"
- You want to keep classes encapsulated and focused on single responsibilities
- You want to avoid the fragile base class problem
- New components need to be composed with old ones (inheritance and composition work together here)
**The GoF principle it enables:** *Favor object composition over class inheritance.*
**Practical note from GoF:** Ideally, you shouldn't have to create new components to achieve reuse — just assemble existing components through composition. In practice, available components are never quite rich enough. Reuse by inheritance makes it easier to create new components that can be composed with old ones. Inheritance and composition thus work together.
---
## 4. Delegation
**What it is:** A specific form of composition where a receiving object delegates operations to a delegate object. Delegation makes composition as powerful for reuse as inheritance.
**How it differs from simple composition:** In true delegation, the receiver passes itself (`this` in C++, `self` in Smalltalk) to the delegate, so the delegated operation can refer back to the receiver. This is analogous to subclasses deferring requests to parent classes, but without inheritance.
**The Window/Rectangle canonical example:**
- Instead of: `class Window extends Rectangle` (Window is a Rectangle — class inheritance)
- Use: `class Window { private Rectangle rectangle; }` (Window has a Rectangle — delegation)
- Window delegates area calculations to its Rectangle instance
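- Window can become circular at run-time by replacing its Rectangle with a Circle instance, provided Rectangle and Circle have the same type (for example, the field is declared as a common `Shape` interface that both implement)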
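
A minimal C++ sketch of the Window example, under stated assumptions: the shared `Shape` type, the `SetShape` method, and the constructor signatures are illustrative names introduced here. The sketch shows plain forwarding; it omits passing `this` to the delegate because `Area()` does not need to refer back to the window.

```cpp
#include <memory>
#include <utility>

// Common type shared by Rectangle and Circle (assumed name: Shape).
class Shape {
public:
    virtual ~Shape() = default;
    virtual double Area() const = 0;
};

class Rectangle : public Shape {
public:
    Rectangle(double width, double height) : width_(width), height_(height) {}
    double Area() const override { return width_ * height_; }
private:
    double width_, height_;
};

class Circle : public Shape {
public:
    explicit Circle(double radius) : radius_(radius) {}
    double Area() const override { return 3.141592653589793 * radius_ * radius_; }
private:
    double radius_;
};

// Window has a Shape and forwards Area() to it instead of inheriting it.
class Window {
public:
    explicit Window(std::unique_ptr<Shape> shape) : shape_(std::move(shape)) {}
    double Area() const { return shape_->Area(); }
    // Replacing the delegate changes the window's geometry at run-time.
    void SetShape(std::unique_ptr<Shape> shape) { shape_ = std::move(shape); }
private:
    std::unique_ptr<Shape> shape_;
};
```

Declaring the field as `Shape` rather than `Rectangle` is what makes the run-time swap possible; a field typed as `Rectangle` could not hold a `Circle`.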
**Main advantage:** Easy to compose behaviors at run-time and to change the way they're composed. Behaviors can be swapped by swapping the delegate.
**The critical test:** "Does this delegation simplify more than it complicates?"
- Delegation has a disadvantage it shares with other composition techniques: dynamic, highly parameterized software is harder to understand than more static software
- There are also run-time inefficiencies
- Delegation is a good design choice only when it simplifies more than it complicates
- It isn't easy to give rules for when to use delegation — it depends on context and experience
- **Delegation works best when it's used in highly stylized ways — that is, in standard patterns**
**Patterns that use delegation heavily:**
- **State** — an object delegates requests to a State object representing its current state. As state changes, the delegate changes, so behavior changes at run-time without changing the object's class.
- **Strategy** — an object delegates a specific request to an object representing a strategy for carrying out that request. An object can have many strategies for different requests.
- **Visitor** — the operation performed on each element of an object structure is always delegated to the Visitor object.
**Patterns that use delegation lightly:**
- **Mediator** — sometimes implements operations by forwarding to other objects; other times passes a reference to itself (true delegation).
- **Chain of Responsibility** — forwards requests from one object to another; sometimes carries a reference to the original receiver (delegation).
- **Bridge** — if abstraction and implementation are closely matched, the abstraction may simply delegate operations to that implementation.
---
## 5. Parameterized Types (Generics / Templates)
**What it is:** A third way (in addition to class inheritance and object composition) to compose behavior in OO systems. Lets you define a type without specifying all the other types it uses — unspecified types are supplied as parameters at the point of use.
**Known as:**
- **Generics** in Ada, Eiffel, Java, C#
- **Templates** in C++
- Not natively supported in Smalltalk (which doesn't have compile-time type checking)
**Example:** A `List` class parameterized by element type. `List<Integer>` vs `List<String>` — the List algorithm is the same; the element type is a parameter.
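
As a small illustration in C++ templates, assuming a `std::vector` for storage and the method names `Append`, `Count`, and `Get` (none of which come from the source):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One algorithm, written once; the element type T is supplied at the point of use.
template <typename T>
class List {
public:
    void Append(const T& item) { items_.push_back(item); }
    std::size_t Count() const { return items_.size(); }
    const T& Get(std::size_t index) const { return items_.at(index); }
private:
    std::vector<T> items_;  // storage choice is an assumption of this sketch
};
```

`List<int>` and `List<std::string>` are distinct types fixed at compile-time, which is why this mechanism cannot vary behavior at run-time.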
**Key comparison with inheritance and composition:**
| | Class inheritance | Object composition | Parameterized types |
|---|---|---|---|
| When bound | Compile-time | Run-time | Compile-time |
| Can change behavior at run-time | No | Yes | No |
| Default implementations | Yes (inherited) | No (must delegate) | Language-specific |
| Changes types used | No | No | Yes |
**Limitations:** Neither inheritance nor parameterized types can change at run-time. Which approach is best depends on design and implementation constraints.
**GoF note:** None of the GoF patterns concern parameterized types, though they are used occasionally to customize C++ implementations. Parameterized types are not needed at all in languages like Smalltalk that lack compile-time type checking.
---
## Decision Guide: Which Mechanism to Use
```
Need to vary behavior at RUN-TIME?
  YES → Object composition (or delegation)
  NO  → Continue below

Is this a genuine IS-A subtype relationship?
  YES → Consider interface inheritance (define the contract)
      → If implementation sharing is also needed: class inheritance from ABSTRACT class only
  NO  → Object composition ("has-a" relationship)

Is the parent class concrete (not abstract)?
  YES → Red flag: white-box reuse. Consider extracting interface + composition.
  NO  → Abstract inheritance is safe.

Does the subclass need to understand parent internals to work correctly?
  YES → White-box reuse violation. Refactor to composition.
  NO  → Inheritance may be appropriate.

Does the hierarchy have more than 2 levels of concrete inheritance?
  YES → Likely overuse of inheritance. Evaluate flattening with composition.
  NO  → Acceptable if each level represents a genuine subtype.

Does the reuse need to vary by type parameter (generic algorithm, different element types)?
  YES → Parameterized types (generics/templates)
  NO  → Inheritance or composition as above.
```
---
## Run-Time vs Compile-Time Structures
An OO program's **run-time structure** (the network of communicating objects) often bears little resemblance to its **compile-time structure** (classes in fixed inheritance relationships).
- **Compile-time:** Frozen in inheritance relationships. Static. Determined by the code.
- **Run-time:** Rapidly changing networks of communicating objects. Dynamic. Determined by object interactions.
The two structures are largely independent. Understanding one from the other is difficult — like trying to understand the dynamism of a living ecosystem from the static taxonomy of plants and animals.
Many design patterns (especially those with object scope) capture the distinction between compile-time and run-time structures explicitly:
- **Composite** and **Decorator** — especially useful for building complex run-time structures
- **Observer** — involves run-time structures that are often hard to understand unless you know the pattern
- **Chain of Responsibility** — results in communication patterns that inheritance doesn't reveal
**Implication for evaluation:** When assessing a design, ask not just "what does the class hierarchy look like?" but "what does the run-time object graph look like?" A design that looks clean statically may have brittle or inflexible run-time behavior if inheritance was used where composition was needed.
---
## Aggregation vs Acquaintance
Both are forms of object relationships implemented as references, but their intent differs significantly.
**Aggregation:**
- One object *owns* or is *responsible for* another
- An object *has* or is *part of* another
- Aggregate object and its owner have identical lifetimes
- Stronger relationship — fewer, more permanent
- Notation: arrowhead line with a diamond at the base
**Acquaintance (association / "using" relationship):**
- One object merely *knows of* another
- Acquainted objects may request operations of each other, but aren't responsible for each other
- Weaker coupling — acquaintances are made and remade frequently, sometimes only for the duration of an operation
- More dynamic, harder to discern in source code
- Notation: plain arrowhead line
**Why the distinction matters:** Acquaintance and aggregation are often implemented the same way in code (both as pointers or references). The distinction is determined more by intent than by explicit language mechanisms. Treating an acquaintance as if it were an aggregate creates unnecessary coupling. Treating an aggregate as a mere acquaintance risks lifetime management bugs.
In evaluation: ask whether a reference represents ownership (aggregate — shared lifetime, owning class responsible) or knowledge (acquaintance — loose, dynamic, temporary). Designs that conflate these tend to have lifecycle bugs or unnecessarily tight coupling.
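
One way to make the intent visible in C++ is to encode ownership in the member type. The following is a sketch under assumptions: `Car`, `Engine`, and `Driver` are hypothetical classes invented for illustration, with an owning `std::unique_ptr` standing in for aggregation and a raw non-owning pointer for acquaintance.

```cpp
#include <memory>
#include <string>
#include <utility>

class Engine {
public:
    std::string Start() const { return "engine started"; }
};

class Driver {
public:
    explicit Driver(std::string name) : name_(std::move(name)) {}
    const std::string& Name() const { return name_; }
private:
    std::string name_;
};

class Car {
public:
    // Aggregation: Car creates and owns its Engine; the Engine dies with the Car.
    Car() : engine_(new Engine()) {}

    // Acquaintance: Car only knows the Driver; it never deletes it.
    void SetDriver(const Driver* driver) { driver_ = driver; }

    std::string Describe() const {
        std::string who = driver_ ? driver_->Name() : "nobody";
        return engine_->Start() + ", driven by " + who;
    }

private:
    std::unique_ptr<Engine> engine_;  // owning reference
    const Driver* driver_ = nullptr;  // non-owning reference
};
```

The acquaintance can be set, replaced, or cleared at any time, while the aggregate part shares the owner's lifetime. The caller remains responsible for keeping the `Driver` alive while the `Car` refers to it.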
Implement the Observer pattern to establish one-to-many dependencies where changing one object automatically notifies and updates all dependents. Use when a...
---
name: observer-pattern-implementor
description: |
Implement the Observer pattern to establish one-to-many dependencies where changing one object automatically notifies and updates all dependents. Use when a spreadsheet model needs to update multiple chart views, when UI components must react to data changes, when event systems need publish-subscribe, or when you need to decouple a subject from an unknown number of observers. Handles 10 implementation concerns including push vs pull models, dangling reference prevention, update cascade avoidance, and the ChangeManager compound pattern.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/design-patterns-gof/skills/observer-pattern-implementor
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- design-pattern-selector
- behavioral-pattern-selector
source-books:
- id: design-patterns-gof
title: "Design Patterns: Elements of Reusable Object-Oriented Software"
authors: ["Erich Gamma", "Richard Helm", "Ralph Johnson", "John Vlissides"]
chapters: [5]
pages: [273, 282]
tags: [design-patterns, behavioral, gof, observer, publish-subscribe, event-system, mvc, change-manager, mediator, singleton]
execution:
tier: 2
mode: full
inputs:
- type: code
description: "Existing code where one object's state changes need to propagate to dependents — tightly coupled notification logic, direct method calls between model and views, or a missing event/callback system"
tools-required: [TodoWrite, Read, Write]
tools-optional: [Grep, Glob]
mcps-required: []
environment: "Requires a codebase or a sufficiently detailed design description. Code reading tools required to inspect existing structure before refactoring."
---
# Observer Pattern Implementor
## When to Use
A data source is directly calling update methods on its consumers, making it impossible to add or remove consumers without modifying the source. Apply this skill when:
- An abstraction has two separable aspects — a subject (data) and observers (views/reactions) — and tightly coupling them limits reuse of each independently
- A change to one object requires notifying an unknown or variable number of other objects, and you don't want to hard-code who those are
- Objects should be able to notify others without making assumptions about who those objects are — you want zero coupling from subject to observer
Do not apply if:
- Only one consumer will ever exist and that is certain by design (direct call is simpler)
- Observers trigger further changes that trigger more observers, and you do not plan to add ChangeManager (Steps 8 and 9); the plain Notify loop cannot control such cascades
- The language/framework already provides this mechanism natively (e.g., React state, RxJS, Qt signals) — use those instead
Before starting, verify you can answer: What is the subject (the thing that changes)? What are the observers (the things that react)? What state do observers need from the subject after a change?
---
## Process
### Step 1: Set Up Tracking and Audit Existing Coupling
**ACTION:** Use `TodoWrite` to track progress, then read the codebase to map current subject-observer coupling.
**WHY:** Observer refactoring touches at least three participants (Subject, ConcreteSubject, ConcreteObserver) and requires consistent interface changes across all of them. Without tracking, it is easy to partially refactor — adding the Observer interface while leaving direct calls in the subject — which produces a hybrid that is worse than either approach. The audit reveals whether existing coupling is simple (one direct call) or complex (multiple consumers with different update logic), which determines whether ChangeManager is needed.
```
TodoWrite:
- [ ] Step 1: Audit existing coupling — map subjects and consumers
- [ ] Step 2: Define Subject and Observer interfaces
- [ ] Step 3: Implement ConcreteSubject with Attach/Detach/Notify
- [ ] Step 4: Resolve trigger strategy (who calls Notify)
- [ ] Step 5: Ensure state consistency before Notify
- [ ] Step 6: Choose push vs pull model and implement Update
- [ ] Step 7: Apply aspect-based registration if multiple event types
- [ ] Step 8: Evaluate whether ChangeManager is needed
- [ ] Step 9: Implement ChangeManager if warranted
- [ ] Step 10: Wire up and verify — no dangling references
```
Use `Read` and `Grep` to answer:
- Which class(es) change state and need to broadcast changes? (Subject candidates)
- Which class(es) react to those changes? (Observer candidates)
- How many consumers are there? Are they stable or do they change at runtime?
- Are there multiple types of events, or one general "something changed" notification?
---
### Step 2: Define the Subject and Observer Interfaces
**ACTION:** Create abstract `Subject` and `Observer` base classes (or interfaces). Do not write any concrete logic yet — define only the contracts.
**WHY:** The power of Observer comes from the subject knowing only the abstract `Observer` interface — not any concrete observer type. If you skip this and call concrete methods directly, you recreate the coupling you're trying to eliminate. Defining the interfaces first forces you to commit to the notification contract before any implementation details leak in.
**Subject interface** — three responsibilities:
```cpp
class Observer;  // forward declaration; the Observer interface is defined separately

class Subject {
public:
    virtual ~Subject();
    virtual void Attach(Observer*);  // register a new observer
    virtual void Detach(Observer*);  // deregister an observer
    virtual void Notify();           // broadcast change to all observers
protected:
    Subject();
private:
    List<Observer*>* _observers;     // per-subject list (or an external hash map; see Step 3)
};
```
**Observer interface** — one responsibility:
```cpp
class Observer {
public:
    virtual ~Observer();
    virtual void Update(Subject* theChangedSubject) = 0;
protected:
    Observer();
};
```
Note: `Update` receives the subject pointer (not state data). This is the **pull model** default — observers call back on the subject for what they need. See Step 6 for the push vs pull decision.
---
### Step 3: Implement ConcreteSubject and the Observer List
**ACTION:** Implement the concrete subject with state, state accessors (`GetState`/`SetState`), and a working `Notify`. Choose the storage strategy for the observer list.
**WHY:** The naive approach — storing observers directly in every subject instance — works when subjects are few and observers are many per subject. When subjects are many and observers are few (e.g., many small model objects each observed by one dashboard), the per-subject list wastes memory. The hash map trade-off is real: choose based on the ratio of subjects to observers in your system.
**Storage decision:**
| Situation | Storage | Trade-off |
|-----------|---------|-----------|
| Few subjects, many observers each | `List<Observer*>` in Subject | Simple; O(1) attach; O(n) notify |
| Many subjects, few observers each | External hash map (subject → observer list) | Eliminates per-subject overhead; increases access cost |
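
The second row of the table can be sketched as follows. This is an illustration only: `ObserverRegistry`, `SubjectCount`, and `CountingObserver` are hypothetical names, and `std::unordered_map` with `std::vector` stands in for whatever associative container the codebase already uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

class Subject;

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update(Subject* theChangedSubject) = 0;
};

// Hypothetical registry: one shared table instead of a list inside every Subject.
class ObserverRegistry {
public:
    void Attach(Subject* s, Observer* o) { table_[s].push_back(o); }

    void Detach(Subject* s, Observer* o) {
        auto it = table_.find(s);
        if (it == table_.end()) return;
        auto& list = it->second;
        list.erase(std::remove(list.begin(), list.end(), o), list.end());
        if (list.empty()) table_.erase(it);  // subjects without observers use no table entry
    }

    void Notify(Subject* s) {
        auto it = table_.find(s);
        if (it == table_.end()) return;
        std::vector<Observer*> snapshot = it->second;  // copy: observers may detach during Update
        for (Observer* o : snapshot) o->Update(s);
    }

    std::size_t SubjectCount() const { return table_.size(); }

private:
    std::unordered_map<Subject*, std::vector<Observer*>> table_;
};

class Subject {
public:
    explicit Subject(ObserverRegistry& registry) : registry_(registry) {}
    virtual ~Subject() = default;
    void Attach(Observer* o) { registry_.Attach(this, o); }
    void Detach(Observer* o) { registry_.Detach(this, o); }
    void Notify() { registry_.Notify(this); }
private:
    ObserverRegistry& registry_;  // each subject stores one reference, no list
};

// Example observer used only to exercise the sketch.
class CountingObserver : public Observer {
public:
    void Update(Subject*) override { ++updates; }
    int updates = 0;
};
```

The cost moves from memory to time: every `Notify` now pays for a hash lookup, which is the access cost the table refers to.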
**Standard Subject implementation:**
```cpp
void Subject::Attach(Observer* o) { _observers->Append(o); }
void Subject::Detach(Observer* o) { _observers->Remove(o); }
void Subject::Notify() {
    ListIterator<Observer*> i(_observers);
    for (i.First(); !i.IsDone(); i.Next()) {
        i.CurrentItem()->Update(this);
    }
}
```
**ConcreteSubject** adds only state and accessors:
```cpp
class ConcreteSubject : public Subject {
public:
    SubjectState GetState() const;
    void SetState(SubjectState s);
private:
    SubjectState _subjectState;
};
```
---
### Step 4: Decide Who Triggers Notify
**ACTION:** Choose and implement one of two trigger strategies: automatic (subject triggers) or manual (client triggers).
**WHY:** This is the most common source of Observer bugs. Automatic triggering ensures observers are never out of sync but fires notifications on every state mutation — including intermediate states in a multi-step update. Manual triggering gives the client control to batch multiple mutations before broadcasting, but adds an obligation: every code path that mutates state must remember to call `Notify`. Forgetting one call produces stale observers that are hard to debug.
**Option A — Automatic trigger (subject calls Notify in SetState):**
```cpp
void ConcreteSubject::SetState(SubjectState s) {
    _subjectState = s;
    Notify();  // automatic: client never calls Notify
}
```
Use when: mutations are typically single-step, or missing a notification is worse than extra notifications.
**Option B — Manual trigger (client calls Notify):**
```cpp
// Client code
subject->SetState(s1);
subject->SetBoundary(b); // multiple mutations
subject->Notify(); // one notification for both changes
```
Use when: performance matters and multiple state changes should batch into one notification round.
**Recommendation:** Default to Option A. Switch to Option B only if profiling shows notification overhead is significant, or if the use case explicitly requires batching (e.g., multi-step transaction commit).
---
### Step 5: Ensure Subject State Is Self-Consistent Before Notify
**ACTION:** Audit every operation that calls `Notify` (directly or via `SetState`) and verify that all subject state is fully updated before `Notify` fires.
**WHY:** Observers call `GetState` on the subject inside their `Update` method to synchronize. If `Notify` fires while the subject is in a partially-updated state — for example, after a base class operation but before a subclass operation completes — observers will read inconsistent data and enter an inconsistent state of their own. This is a silent bug: observers receive a valid-looking notification but pull stale or partial data.
**Common violation in inheritance:**
```cpp
// WRONG: notification fires before subclass state is updated
void MySubject::Operation(int newValue) {
    BaseClassSubject::Operation(newValue);  // triggers Notify here
    _myInstVar += newValue;                 // too late: observers already ran
}
```
**Fix — use Template Method to control notification timing:**
```cpp
// In the abstract Subject: Notify is last
void Text::Cut(TextRange r) {
    ReplaceRange(r);  // redefined in subclasses; all state is updated here
    Notify();         // fires only after the complete operation
}
```
Always ensure `Notify` is the **last operation** in any state-mutating method. Document which operations trigger notifications in the class interface comment.
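
The Template Method fix can be exercised end to end with a small self-contained sketch. The names `Counter`, `AuditedCounter`, `DoAdd`, and `OnChange` are hypothetical, and `std::function` callbacks stand in for the `Observer` interface purely to keep the example short.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Base class owns the notification timing (Template Method).
class Counter {
public:
    virtual ~Counter() = default;

    void OnChange(std::function<void()> callback) {
        callbacks_.push_back(std::move(callback));
    }

    // Template method: subclasses customize DoAdd, never the Notify call.
    void Add(int amount) {
        DoAdd(amount);  // all state, including subclass state, is updated here
        Notify();       // fires last, so observers read consistent state
    }

    int Total() const { return total_; }

protected:
    virtual void DoAdd(int amount) { total_ += amount; }

private:
    void Notify() { for (auto& cb : callbacks_) cb(); }
    int total_ = 0;
    std::vector<std::function<void()>> callbacks_;
};

class AuditedCounter : public Counter {
public:
    int Calls() const { return calls_; }
protected:
    void DoAdd(int amount) override {
        Counter::DoAdd(amount);  // the base step does not notify
        ++calls_;                // subclass state is complete before Notify runs
    }
private:
    int calls_ = 0;
};
```

A callback registered through `OnChange` that reads both `Total()` and `Calls()` sees both values already updated, because the subclass cannot run code after `Notify`.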
---
### Step 6: Choose and Implement the Push or Pull Model
**ACTION:** Decide whether the subject sends state data to observers (push) or observers query the subject (pull). Implement `Update` accordingly.
**WHY:** This is the core trade-off between observer convenience and subject-observer coupling. The pull model keeps the subject maximally ignorant of its observers (it makes no assumptions about what they need), but it can be inefficient because each observer must work out what changed on its own. The push model is efficient for observers with narrow needs, but it makes the subject assume it knows what observers care about, which reduces reusability when different observers need different data.
**Pull model** — subject sends only its identity; observers pull what they need:
```cpp
void DigitalClock::Update(Subject* theChangedSubject) {
    if (theChangedSubject == _subject) {  // guard: verify it's the right subject
        int hour = _subject->GetHour();
        int minute = _subject->GetMinute();
        Draw();  // re-render with pulled values
    }
}
```
**Push model** — subject sends relevant state as Update arguments:
```cpp
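// Subject: pass the changed value along with the notification
void Subject::Notify(int newValue) {
    ListIterator<Observer*> i(_observers);
    for (i.First(); !i.IsDone(); i.Next()) {
        i.CurrentItem()->Update(this, newValue);
    }
}
// Observer: Update now carries the pushed value
virtual void Update(Subject*, int newValue) = 0;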
```
**Decision:**
| Prefer pull when... | Prefer push when... |
|--------------------|---------------------|
| Different observers need different parts of subject state | All observers need the same data |
| Subject does not know what observers care about | The changed data is cheap to pass and expensive to re-query |
| Maximum observer reusability is important | Observer should not need to hold a reference to the subject |
Pull is the default. Push reduces reusability because it encodes assumptions about observer needs into the Subject interface.
---
### Step 7: Apply Aspect-Based Registration for Multiple Event Types
**ACTION:** If the subject produces multiple distinct event types and observers care about only some, extend `Attach` to accept an "aspect" (event category) parameter.
**WHY:** Without aspect filtering, every observer is notified of every change — even changes it does not care about. This wastes cycles and can cause incorrect behavior if an observer's `Update` logic assumes it is only called for its relevant event. Aspect-based registration lets observers declare their interest at subscription time, so `Notify` only calls the relevant observers.
**Extended interfaces:**
```cpp
// Subject registration with aspect
void Subject::Attach(Observer* o, Aspect& interest);
// Observer Update with aspect parameter
void Observer::Update(Subject* s, Aspect& interest);
```
**Notify implementation with aspect filtering:**
```cpp
void Subject::Notify(Aspect& changedAspect) {
    // Only notify observers registered for this aspect.
    // Assumes _observerMap holds (Observer*, Aspect) pairs and Aspect supports ==.
    for (auto& entry : _observerMap) {
        if (entry.second == changedAspect) {
            entry.first->Update(this, changedAspect);
        }
    }
}
```
Use when: subjects produce heterogeneous events (e.g., a document that can have text changes, layout changes, and selection changes), and observers specialize in only one type.
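
A self-contained sketch of aspect-based registration follows. It makes several assumptions not present in the source: `Aspect` is modeled as an enum with three hypothetical values, observers are stored in a `std::map` keyed by aspect, and `CountingObserver` exists only to show the filtering.

```cpp
#include <map>
#include <vector>

// Hypothetical aspect type; the source leaves Aspect unspecified.
enum class Aspect { Text, Layout, Selection };

class Subject;

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update(Subject* s, Aspect interest) = 0;
};

class Subject {
public:
    virtual ~Subject() = default;

    // Observers declare interest in one aspect at subscription time.
    void Attach(Observer* o, Aspect interest) {
        observers_[interest].push_back(o);
    }

    // Only observers registered for the changed aspect are called.
    void Notify(Aspect changedAspect) {
        auto it = observers_.find(changedAspect);
        if (it == observers_.end()) return;
        std::vector<Observer*> snapshot = it->second;  // copy guards against changes during Update
        for (Observer* o : snapshot) o->Update(this, changedAspect);
    }

private:
    std::map<Aspect, std::vector<Observer*>> observers_;
};

// Example observer that records how many updates it received.
class CountingObserver : public Observer {
public:
    void Update(Subject*, Aspect) override { ++updates; }
    int updates = 0;
};
```

Keying the container by aspect avoids scanning every registration on each `Notify`; an observer interested in several aspects simply attaches once per aspect.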
---
### Step 8: Evaluate Whether ChangeManager Is Needed
**ACTION:** Assess whether the subject-observer dependency graph is simple or complex. If complex, implement ChangeManager (Step 9). If simple, skip Step 9.
**WHY:** The standard Notify-loop works well when subjects and observers are independent. It breaks down in two scenarios: (1) a single operation modifies multiple interdependent subjects, causing observers to be notified multiple times for what is logically one change; (2) observers themselves observe multiple subjects, creating diamond-shaped dependency graphs where redundant updates proliferate. ChangeManager exists to handle these cases — it absorbs the complexity so subjects and observers don't have to.
**Use ChangeManager if any of the following are true:**
- One high-level operation modifies two or more subjects whose observers overlap
- Any observer observes more than one subject (creating potential for redundant updates)
- Notification order matters and must be controlled globally
- You need a specific update strategy (e.g., "update all after all subjects have changed")
**If none of the above apply:** ChangeManager adds complexity for no benefit. Keep the simple direct Notify loop from Step 3.
---
### Step 9: Implement ChangeManager (Observer + Mediator + Singleton)
**ACTION:** If Step 8 determined ChangeManager is needed, implement it as a mediator that owns the subject-observer mapping, controls update strategy, and exists as a singleton.
**WHY:** ChangeManager combines three patterns with a specific purpose: it acts as a **Mediator** (centralizing the subject-observer communication logic), uses the **Observer** protocol (subjects and observers still implement the same interfaces), and is a **Singleton** (there is exactly one ChangeManager per subsystem — subjects and observers both need to find the same instance). This combination eliminates redundant updates when multiple subjects change in one operation, and allows the update strategy to be changed without touching Subject or Observer classes.
**ChangeManager responsibilities:**
1. Maintains the subject-to-observer mapping (replaces per-subject observer lists)
2. Defines the update strategy — when and in what order observers are notified
3. Updates all dependent observers when a subject requests it
**Two built-in strategies:**
| Strategy | Behavior | Use when |
|----------|----------|----------|
| `SimpleChangeManager` | For each subject, notify all its observers immediately | Observers observe only one subject; no redundant update risk |
| `DAGChangeManager` | Mark all affected observers; update each marked observer exactly once | Observers observe multiple subjects (directed acyclic graph dependencies) |
**ChangeManager interface:**
```cpp
class ChangeManager {
public:
static ChangeManager* Instance(); // Singleton access
virtual void Register(Subject*, Observer*);
virtual void Unregister(Subject*, Observer*);
virtual void Notify() = 0; // strategy-defined update dispatch
protected:
ChangeManager();
private:
static ChangeManager* _instance;
};
```
**Subject delegates to ChangeManager:**
```cpp
// Subject no longer stores observer list — delegates to ChangeManager
void Subject::Attach(Observer* o) {
ChangeManager::Instance()->Register(this, o);
}
void Subject::Notify() {
ChangeManager::Instance()->Notify();
}
```
**DAGChangeManager Notify — prevents redundant updates:**
```cpp
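void DAGChangeManager::Notify() {
    // Assumes a member _mapping of type
    // std::map<Subject*, std::set<Observer*>> (hypothetical name).
    std::map<Observer*, Subject*> pending;
    // Pass 1: mark all observers that need updating
    for (auto& [subject, observers] : _mapping)
        for (Observer* o : observers)
            pending.emplace(o, subject);  // already marked: no-op
    // Pass 2: update each marked observer exactly once
    for (auto& [o, subject] : pending)
        o->Update(subject);
}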
```
---
### Step 10: Wire Up and Prevent Dangling References
**ACTION:** Wire all observers to their subjects via `Attach`. Add lifecycle management to prevent dangling references when subjects are deleted. Verify the full notification chain manually or with a test.
**WHY:** Deleting a subject that observers still hold a reference to produces dangling pointers — observers will attempt to call `GetState` on freed memory. Simply deleting the observers is not a valid solution because they may be observing other subjects or be referenced by other objects. The subject must notify observers of its own deletion so they can nullify their references.
**Deletion protocol in subject destructor:**
```cpp
Subject::~Subject() {
// Notify observers that this subject is going away
// so they can reset their stored reference
ListIterator<Observer*> i(_observers);
for (i.First(); !i.IsDone(); i.Next()) {
i.CurrentItem()->SubjectDeleted(this);
}
}
// Observer handles deletion
void ConcreteObserver::SubjectDeleted(Subject* s) {
if (s == _subject) {
_subject = nullptr; // nullify — do not delete
}
}
```
**Wiring example:**
```cpp
ClockTimer* timer = new ClockTimer;
DigitalClock* digitalClock = new DigitalClock(timer); // Attach in constructor
AnalogClock* analogClock = new AnalogClock(timer); // Attach in constructor
// When timer ticks, both clocks update automatically
```
**Verification checklist:**
- [ ] Attach is called before any state changes that should be observed
- [ ] Detach is called in every observer destructor
- [ ] Deleting a subject does not crash observers
- [ ] Adding a new observer at runtime receives subsequent updates correctly
- [ ] Notifications fire only when subject state is fully consistent (Step 5)
---
## Pitfalls
**1. Cascade updates.** Observer A's `Update` changes subject B, which notifies observer C, which changes subject D... This is the most serious Observer failure mode, and it is difficult to detect because it appears during runtime. Prevention: avoid state changes inside `Update` that would notify other observers. If cascade is unavoidable, use `DAGChangeManager` to ensure each observer updates exactly once per logical change.
**2. Dangling references to deleted subjects.** Deleting a subject without notifying observers leaves them with stale pointers. The crash appears on the next notification cycle, not at the point of deletion, making it hard to locate. Prevention: always implement the deletion-notification protocol in Step 10.
**3. Inconsistent state notification.** Calling `Notify` before all subject state is updated, typically in inherited operations (see Step 5). Observers then pull stale data during `Update`. Prevention: always use Template Method to put `Notify` last, and audit inheritance chains for calls to the base-class operation (e.g., `BaseClass::Operation()`) that trigger notifications mid-update.
**4. Trigger ambiguity.** Mixing automatic and manual trigger strategies across different subjects in the same system. Developers expect one behavior and get the other; observers are sometimes notified twice, sometimes not at all. Prevention: pick one strategy and apply it uniformly across the system. Document it in the Subject class comment.
**5. Push overspecification.** The push model embeds assumptions about what observers need directly into the Subject's `Notify` signature. When observer needs diverge (one needs the full new state, another only needs a delta, a third only needs to know that *something* changed), the push signature becomes a compromise that serves none of them well. Prevention: default to pull. Only add push parameters that are universally needed by all observers, and only after profiling shows that the extra pull calls are a measurable cost.
---
## Examples
### Example 1: Spreadsheet Model with Multiple Chart Views
**Scenario:** A spreadsheet has a data model that three chart widgets (bar chart, pie chart, line chart) read directly. Currently the model calls `barChart.refresh()`, `pieChart.refresh()`, and `lineChart.refresh()` in its `SetCell` method. Adding a fourth chart requires modifying the model.
**Trigger:** "Every new chart type requires touching the spreadsheet model."
**Process:**
- Step 1: Audit reveals `SpreadsheetModel` directly references three concrete chart classes
- Step 2: `SpreadsheetModel` becomes `ConcreteSubject`; each chart implements `Observer`
- Step 3: `List<Observer*>` in Subject (few subjects, multiple observers)
- Step 4: Automatic trigger — `SetCell` calls `Notify()` after updating data
- Step 5: `Notify` is last in `SetCell` — state is fully updated first
- Step 6: Pull model — charts call `GetCellValue(row, col)` in their `Update`
- Step 8: No complex dependencies — skip ChangeManager
**Output:** `SpreadsheetModel` has no reference to any chart class. New charts attach themselves; model never changes.
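The outcome of this example can be sketched as follows. This is a minimal illustration, not the original application's code: `BarChart` is a stand-in that records the value it would draw, `lastDrawn` is a hypothetical field, and `std::vector` replaces the GoF `List` class for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <utility>
#include <vector>

class Subject;

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update(Subject* changed) = 0;
};

class Subject {
public:
    virtual ~Subject() = default;
    void Attach(Observer* o) { observers_.push_back(o); }
    void Detach(Observer* o) {
        observers_.erase(std::remove(observers_.begin(), observers_.end(), o),
                         observers_.end());
    }
protected:
    void Notify() {
        for (Observer* o : observers_) o->Update(this);
    }
private:
    std::vector<Observer*> observers_;
};

// The model knows only the abstract Observer, never a chart class.
class SpreadsheetModel : public Subject {
public:
    void SetCell(int row, int col, double value) {
        cells_[std::make_pair(row, col)] = value;  // update state first
        Notify();                                  // automatic trigger, last
    }
    double GetCellValue(int row, int col) const {
        auto it = cells_.find(std::make_pair(row, col));
        return it == cells_.end() ? 0.0 : it->second;
    }
private:
    std::map<std::pair<int, int>, double> cells_;
};

// Stand-in for a chart widget: pulls one cell in Update (pull model).
class BarChart : public Observer {
public:
    explicit BarChart(SpreadsheetModel* model) : model_(model) {
        assert(model_ != nullptr);
        model_->Attach(this);
    }
    ~BarChart() override { model_->Detach(this); }
    void Update(Subject* changed) override {
        if (changed == model_) lastDrawn = model_->GetCellValue(0, 0);
    }
    double lastDrawn = 0.0;
private:
    SpreadsheetModel* model_;
};
```

Adding a fourth chart type under this sketch means writing another `Observer` subclass; `SpreadsheetModel` is not edited.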
---
### Example 2: Clock with Multiple Display Observers
**Scenario:** A `ClockTimer` (concrete subject) notifies time displays. A `DigitalClock` and `AnalogClock` both observe the same timer. This is the canonical sample from the GoF text.
**Trigger:** "Both clock displays must always show the same time, but neither should know about the other."
**Key implementation details:**
- `DigitalClock` inherits from both `Widget` (UI toolkit) and `Observer`, using multiple inheritance to mix the Observer interface into an existing class hierarchy (the same mechanism concern 10 uses to combine the Subject and Observer roles)
- Constructor calls `_subject->Attach(this)`; destructor calls `_subject->Detach(this)` — full lifecycle management
- `Update` guards on subject identity: `if (theChangedSubject == _subject)` — handles potential future case of observing multiple subjects
```cpp
void DigitalClock::Update(Subject* theChangedSubject) {
if (theChangedSubject == _subject) {
Draw();
}
}
void DigitalClock::Draw() {
int hour = _subject->GetHour();
int minute = _subject->GetMinute();
// render digital display
}
```
**Output:** `ClockTimer::Tick()` calls `Notify()`; both displays update independently without knowing each other exist.
---
### Example 3: Multi-Subject Dashboard with ChangeManager
**Scenario:** A financial dashboard has three data subjects: `PriceSubject`, `VolumeSubject`, and `PositionSubject`. A `RiskSummaryObserver` observes all three. When a trade executes, all three subjects change in one operation. Without ChangeManager, `RiskSummaryObserver.Update` fires three times for one logical event.
**Trigger:** "Our risk view recalculates three times per trade because it observes three subjects."
**Process:**
- Step 8: Observer observes multiple subjects — DAG dependency. ChangeManager required.
- Step 9: Implement `DAGChangeManager`. The trade operation changes all three subjects and then triggers a single `ChangeManager::Instance()->Notify()` (manual trigger, Step 4). In its marking pass, `DAGChangeManager` marks `RiskSummaryObserver` for `PriceSubject`, finds it already marked when it reaches `VolumeSubject` and `PositionSubject` (no-op), then updates it once.
**Output:** One trade → three subject changes → `RiskSummaryObserver.Update` called exactly once.
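A minimal sketch of this scenario, assuming each subject only reports that it changed and the trade operation calls `Notify` once at the end. `DashboardChangeManager`, `MarkChanged`, and `RecalculationsPerTrade` are illustrative names rather than part of the GoF interface, and Singleton access is omitted for brevity.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

class Subject {};  // state and accessors omitted in this sketch

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update(Subject* changed) = 0;
};

// Hypothetical DAG-style manager: marks observers, then updates each once.
class DashboardChangeManager {
public:
    void Register(Subject* s, Observer* o) {
        assert(s != nullptr && o != nullptr);
        observers_[s].insert(o);
    }
    void MarkChanged(Subject* s) { changed_.push_back(s); }
    void Notify() {
        // Pass 1: mark each affected observer, remembering the first
        // changed subject that reached it.
        std::map<Observer*, Subject*> pending;
        for (Subject* s : changed_) {
            for (Observer* o : observers_[s]) {
                pending.emplace(o, s);  // no-op if o is already marked
            }
        }
        changed_.clear();
        // Pass 2: update each marked observer exactly once.
        for (const auto& entry : pending) {
            entry.first->Update(entry.second);
        }
    }
private:
    std::map<Subject*, std::set<Observer*>> observers_;
    std::vector<Subject*> changed_;
};

class RiskSummaryObserver : public Observer {
public:
    void Update(Subject*) override { ++recalculations; }
    int recalculations = 0;
};

// One trade changes price, volume, and position, then notifies once.
int RecalculationsPerTrade() {
    Subject price, volume, position;
    RiskSummaryObserver risk;
    DashboardChangeManager manager;
    for (Subject* s : {&price, &volume, &position}) manager.Register(s, &risk);
    for (Subject* s : {&price, &volume, &position}) manager.MarkChanged(s);
    manager.Notify();
    return risk.recalculations;
}
```

Separating "mark changed" from "notify" is one way to guarantee a single update per trade. Other designs, such as a batching scope, would also work.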
---
## Reference Files
| File | Contents |
|------|----------|
| `references/observer-implementation-guide.md` | All 10 implementation concerns in full detail, ChangeManager class diagrams, push/pull protocol variants, aspect-based registration patterns, and language-specific notes |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-design-pattern-selector`
- `clawhub install bookforge-behavioral-pattern-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/observer-implementation-guide.md
# Observer Pattern — Complete Implementation Guide
Source: Design Patterns: Elements of Reusable Object-Oriented Software (GoF), pages 293–303.
---
## The 10 Implementation Concerns
### 1. Mapping Subjects to Their Observers
The simplest approach is for a subject to store references to its observers directly in a `List<Observer*>`. This incurs storage overhead per subject instance regardless of how many observers are registered.
When subjects outnumber observers significantly — for example, many small model objects each observed by one centralized view — the per-instance list wastes memory for the common case of zero or one observer.
**Alternative:** Use an external associative structure (hash table, dictionary) that maps subjects to their observer lists. This trades per-instance storage overhead for a small increase in lookup cost. The hash table approach is global — it centralizes all subject-observer mappings outside the subjects themselves.
```
// Conceptual: external mapping
Map<Subject*, List<Observer*>*> observerMap;
// Attach
observerMap[subject]->Append(observer);
// Notify (subject-side)
observerMap[this]->ForEach(o => o->Update(this));
```
Because every mapping lives in one structure, this approach can also answer "which subjects does observer O observe?" by scanning the map. With in-subject lists, the same reverse query requires visiting every subject in the system.
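The external mapping can be sketched in standard C++ as shown below. `ObserverRegistry`, `SubjectsOf`, and `CountingObserver` are hypothetical names introduced for illustration, and `std::unordered_map` stands in for the hash table mentioned above.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

class Subject {};  // carries no observer list of its own

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update(Subject* changed) = 0;
};

// Hypothetical registry that owns every subject-to-observer mapping.
class ObserverRegistry {
public:
    void Attach(Subject* s, Observer* o) {
        assert(s != nullptr && o != nullptr);
        map_[s].push_back(o);  // creates the list on first use
    }
    void Notify(Subject* s) {
        auto it = map_.find(s);
        if (it == map_.end()) return;  // unobserved subjects store nothing
        for (Observer* o : it->second) o->Update(s);
    }
    // Reverse query: a linear scan over all mappings.
    std::vector<Subject*> SubjectsOf(const Observer* o) const {
        std::vector<Subject*> result;
        for (const auto& entry : map_) {
            for (const Observer* candidate : entry.second) {
                if (candidate == o) {
                    result.push_back(entry.first);
                    break;
                }
            }
        }
        return result;
    }
private:
    std::unordered_map<Subject*, std::vector<Observer*>> map_;
};

class CountingObserver : public Observer {
public:
    void Update(Subject*) override { ++updates; }
    int updates = 0;
};
```

A subject with no observers occupies no entry in the map, which is the memory saving this concern describes. The cost is one hash lookup per `Notify`.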
---
### 2. Observing More Than One Subject
An observer may legitimately depend on more than one subject. A spreadsheet cell that aggregates two data streams, or a dashboard panel that shows a composite of price and volume data, is observing multiple subjects.
The basic `Update()` signature does not tell the observer which subject triggered the notification. The observer must be able to discriminate.
**Solution:** Pass the subject itself as a parameter to `Update`. The observer compares the received pointer against its stored subject references:
```cpp
void CompositeObserver::Update(Subject* theChangedSubject) {
if (theChangedSubject == _priceSubject) {
// handle price change
} else if (theChangedSubject == _volumeSubject) {
// handle volume change
}
}
```
This is why the GoF sample code declares the base Observer operation as `Update(Subject*)` rather than a parameterless `Update()`.
---
### 3. Who Triggers the Update
There are exactly two options, and each has a failure mode.
**Option A — Subject triggers automatically (SetState calls Notify)**
```cpp
void Subject::SetState(State s) {
_state = s;
Notify();
}
```
- Advantage: Clients never forget to trigger. Observers are always in sync.
- Disadvantage: If a client calls `SetA()`, then `SetB()`, then `SetC()`, three separate notification rounds fire for what is logically one composite change. Each intermediate state triggers observer updates, which may be incorrect or wasteful.
**Option B — Client triggers manually**
```cpp
subject->SetA(...);
subject->SetB(...);
subject->SetC(...);
subject->Notify(); // one round for all three changes
```
- Advantage: Batching is explicit. One notification per logical operation.
- Disadvantage: Every code path that mutates state must remember to call `Notify`. Missing one call silently leaves observers stale. This is a correctness hazard.
**Recommendation:** Default to Option A. Move to Option B only when profiling reveals notification overhead is a measurable cost, or when a use case genuinely requires atomic multi-step state changes (e.g., a transaction commit that sets multiple fields before broadcasting).
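The difference between the two options can be made concrete by counting notification rounds. The sketch below is illustrative only: `Settings`, its `autoNotify` flag, and `RoundsForCompositeChange` are hypothetical names, and a real subject would normally commit to one policy rather than switch at construction.

```cpp
#include <cassert>
#include <vector>

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update() = 0;
};

class CountingObserver : public Observer {
public:
    void Update() override { ++rounds; }
    int rounds = 0;
};

// Hypothetical subject whose trigger policy is chosen at construction.
class Settings {
public:
    explicit Settings(bool autoNotify) : autoNotify_(autoNotify) {}
    void Attach(Observer* o) {
        assert(o != nullptr);
        observers_.push_back(o);
    }
    void SetA(int v) { a_ = v; Changed(); }
    void SetB(int v) { b_ = v; Changed(); }
    void SetC(int v) { c_ = v; Changed(); }
    int Sum() const { return a_ + b_ + c_; }
    void Notify() {
        for (Observer* o : observers_) o->Update();
    }
private:
    void Changed() { if (autoNotify_) Notify(); }  // Option A path
    bool autoNotify_;
    int a_ = 0, b_ = 0, c_ = 0;
    std::vector<Observer*> observers_;
};

// Applies one logical change made of three setter calls and returns
// how many notification rounds the observer saw.
int RoundsForCompositeChange(bool autoNotify) {
    Settings settings(autoNotify);
    CountingObserver observer;
    settings.Attach(&observer);
    settings.SetA(1);
    settings.SetB(2);
    settings.SetC(3);
    if (!autoNotify) settings.Notify();  // Option B: client triggers once
    return observer.rounds;
}
```

With the automatic policy the observer sees three rounds, two of them on intermediate state. With the manual policy it sees one, but only because the caller remembered the final `Notify`.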
---
### 4. Dangling References to Deleted Subjects
When a subject is deleted, observers may still hold pointers to it. On the next notification cycle — or when any observer attempts to call `GetState` — the program accesses freed memory. The crash appears after the deletion, not at it, making the bug hard to locate.
**Wrong approach:** Deleting the observers when the subject is deleted. Observers may be registered with multiple subjects, or referenced by other objects. Deleting them unilaterally is incorrect.
**Correct approach:** The subject notifies its observers of its own deletion so each observer can nullify its stored reference:
```cpp
Subject::~Subject() {
ListIterator<Observer*> i(_observers);
for (i.First(); !i.IsDone(); i.Next()) {
i.CurrentItem()->SubjectDeleted(this);
}
}
void Observer::SubjectDeleted(Subject* s) {
// Default: no-op. Subclasses override if they store the subject.
}
void ConcreteObserver::SubjectDeleted(Subject* s) {
if (s == _subject) {
_subject = nullptr; // nullify — observer still exists, subject is gone
}
}
```
Observers must guard `_subject != nullptr` before using it in `Update`.
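A self-contained sketch of the full protocol, including the null guard, might look like this. `Display`, `HasSubject`, and `shown` are hypothetical names, and `std::vector` replaces the GoF `List` and `ListIterator` classes.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

class Subject;

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update(Subject* changed) = 0;
    virtual void SubjectDeleted(Subject*) {}  // default: no-op
};

class Subject {
public:
    virtual ~Subject() {
        // Tell every observer this subject is going away.
        for (Observer* o : observers_) o->SubjectDeleted(this);
    }
    void Attach(Observer* o) {
        assert(o != nullptr);
        observers_.push_back(o);
    }
    void Detach(Observer* o) {
        observers_.erase(std::remove(observers_.begin(), observers_.end(), o),
                         observers_.end());
    }
    int GetState() const { return state_; }
    void SetState(int s) {
        state_ = s;
        for (Observer* o : observers_) o->Update(this);
    }
private:
    std::vector<Observer*> observers_;
    int state_ = 0;
};

class Display : public Observer {
public:
    explicit Display(Subject* s) : subject_(s) {
        assert(subject_ != nullptr);
        subject_->Attach(this);
    }
    ~Display() override {
        if (subject_ != nullptr) subject_->Detach(this);  // skip if gone
    }
    void Update(Subject* changed) override {
        if (subject_ != nullptr && changed == subject_) {
            shown = subject_->GetState();
        }
    }
    void SubjectDeleted(Subject* s) override {
        if (s == subject_) subject_ = nullptr;  // nullify, do not delete
    }
    bool HasSubject() const { return subject_ != nullptr; }
    int shown = 0;
private:
    Subject* subject_;
};
```

The guard in `~Display` matters as much as the one in `Update`: without it, an observer that outlives its subject would call `Detach` on freed memory.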
---
### 5. Making Sure Subject State Is Self-Consistent Before Notification
Observers call `GetState` on the subject inside `Update`. If `Notify` fires while the subject is partially updated, observers will read a mix of old and new state.
This most commonly occurs in inheritance chains where a subclass operation calls an inherited operation that triggers `Notify`, then continues updating subclass-specific state:
```cpp
// VIOLATION — notification fires at BaseClassSubject::Operation,
// before _myInstVar is updated
void MySubject::Operation(int newValue) {
BaseClassSubject::Operation(newValue); // Notify fires here via SetState
_myInstVar += newValue; // already too late
}
```
**Fix using Template Method:** Define an abstract primitive for subclasses to override, and put `Notify` at the end of the template method in the base class:
```cpp
// Base class controls notification timing
void Subject::DoOperation(int newValue) {
SetInternalState(newValue); // calls subclass override — all state updated
Notify(); // fires only when self-consistent
}
// Subclass — overrides the primitive, not the template method
void MySubject::SetInternalState(int newValue) {
BaseClassSubject::SetInternalState(newValue);
_myInstVar += newValue; // subclass state updated before Notify
}
```
**Rule:** Always document which Subject operations trigger notifications. This is part of the public contract.
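The effect of the Template Method fix can be checked with an observer that records what it sees at notification time. In this sketch `Account`, `AuditedAccount`, and `Probe` are hypothetical classes chosen for illustration; the base class is its own subject to keep the example short.

```cpp
#include <cassert>
#include <vector>

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update() = 0;
};

class Account {
public:
    virtual ~Account() = default;
    void Attach(Observer* o) {
        assert(o != nullptr);
        observers_.push_back(o);
    }
    // Template method: non-virtual, so subclasses cannot reorder it.
    void Deposit(int amount) {
        ApplyDeposit(amount);  // primitive step; subclasses extend this
        Notify();              // fires only after all state is updated
    }
    int Balance() const { return balance_; }
protected:
    virtual void ApplyDeposit(int amount) { balance_ += amount; }
private:
    void Notify() {
        for (Observer* o : observers_) o->Update();
    }
    int balance_ = 0;
    std::vector<Observer*> observers_;
};

class AuditedAccount : public Account {
public:
    int DepositCount() const { return depositCount_; }
protected:
    void ApplyDeposit(int amount) override {
        Account::ApplyDeposit(amount);
        ++depositCount_;  // subclass state updated before Notify runs
    }
private:
    int depositCount_ = 0;
};

// Records the state visible at notification time.
class Probe : public Observer {
public:
    explicit Probe(const AuditedAccount* a) : account_(a) {}
    void Update() override {
        seenBalance = account_->Balance();
        seenCount = account_->DepositCount();
    }
    int seenBalance = 0;
    int seenCount = 0;
private:
    const AuditedAccount* account_;
};
```

Because `Notify` is private to the base class and `Deposit` is non-virtual, a subclass in this sketch has no way to fire a notification before its own state is updated.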
---
### 6. Avoiding Observer-Specific Update Protocols — Push and Pull Models
The subject can pass more or less information to observers during notification. The two extremes are:
**Push model:** Subject sends detailed change information as arguments to `Update`.
```cpp
// Subject pushes what it assumes observers need
void Subject::Notify(int changedValue, int oldValue) {
    ListIterator<Observer*> i(_observers);
    for (i.First(); !i.IsDone(); i.Next()) {
        i.CurrentItem()->Update(this, changedValue, oldValue);
    }
}
```
- Advantage: Observer does not need to query the subject — efficient when all observers need the same data.
- Disadvantage: The subject must assume it knows what observers need. If different observers need different data, the push parameters become a lowest-common-denominator compromise. Subjects become less reusable because they're coupled to observer data needs.
**Pull model:** Subject sends only its identity. Observers query what they need.
```cpp
// Subject sends only itself
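void Subject::Notify() {
    ListIterator<Observer*> i(_observers);
    for (i.First(); !i.IsDone(); i.Next()) {
        i.CurrentItem()->Update(this);
    }
}
// Observer pulls what it needs
void ConcreteObserver::Update(Subject* s) {
    // The downcast assumes s really is a ConcreteSubject
    _cachedValue = static_cast<ConcreteSubject*>(s)->GetRelevantState();
    Render();
}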
```
- Advantage: Subject remains ignorant of observer needs. Maximum reusability.
- Disadvantage: Observer must know the concrete Subject type to call specific accessors (introduces a cast). Less efficient if determining what changed requires re-reading large state.
**Recommendation:** Pull is the default. It enforces the subject's ignorance of its observers. Add push parameters only when a specific piece of changed data is needed by all observers and the cost of polling for it would be significant.
---
### 7. Specifying Modifications of Interest Explicitly (Aspect-Based Registration)
In the basic pattern, every observer is notified of every change. This is wasteful when a subject produces multiple types of events and a given observer cares about only one.
**Aspect-based registration** lets observers declare which events they care about at subscription time:
```cpp
// Extended Attach — observer registers for a specific aspect
void Subject::Attach(Observer* o, Aspect& interest);
// Extended Update — aspect passed at notification time
void Observer::Update(Subject* s, Aspect& interest);
```
At notification time, the subject supplies the changed aspect and notifies only the observers registered for that aspect:
```cpp
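void Subject::Notify(Aspect& changedAspect) {
    // Assumes _observerMap is an iterable container of
    // (Observer*, Aspect) pairs and that Aspect defines operator==.
    for (auto& [observer, registeredAspect] : _observerMap) {
        if (registeredAspect == changedAspect) {
            observer->Update(this, changedAspect);
        }
    }
}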
```
This is particularly useful for document-based systems where a single document may produce text changes, layout changes, selection changes, and save-state changes — and different views care about different subsets.
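A runnable sketch of aspect-based registration for such a document system is shown below. The `Aspect` enumerators, `Document`, and `CountingView` are hypothetical names, and indexing observers by aspect in a `std::map` is one possible layout rather than the one the source prescribes.

```cpp
#include <cassert>
#include <map>
#include <vector>

enum class Aspect { Text, Layout, Selection };

class Document;

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update(Document* changed, Aspect aspect) = 0;
};

class Document {
public:
    // Observers declare their interest at subscription time.
    void Attach(Observer* o, Aspect interest) {
        assert(o != nullptr);
        byAspect_[interest].push_back(o);
    }
    // Only observers registered for the changed aspect are notified.
    void Notify(Aspect changedAspect) {
        auto it = byAspect_.find(changedAspect);
        if (it == byAspect_.end()) return;
        for (Observer* o : it->second) o->Update(this, changedAspect);
    }
private:
    std::map<Aspect, std::vector<Observer*>> byAspect_;
};

class CountingView : public Observer {
public:
    void Update(Document*, Aspect) override { ++updates; }
    int updates = 0;
};
```

Keying the container by aspect avoids scanning uninterested observers on every change. An observer that cares about several aspects simply attaches once per aspect.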
---
### 8 and 9. Encapsulating Complex Update Semantics — the ChangeManager
When the dependency relationships between subjects and observers are complex, the standard direct-Notify approach breaks down in two ways:
1. **Redundant notifications:** An operation changes subjects A and B. Observer O observes both. When A notifies, O runs. When B notifies, O runs again. O has been updated twice for one logical change.
2. **Ordering problems:** If notification order matters (observer C must run before observer D because D reads data that C computes), the simple notification loop cannot express this.
**ChangeManager** addresses both. It is a compound of three patterns:
- **Observer:** subjects and observers still use the standard protocol
- **Mediator:** ChangeManager centralizes the subject-observer communication logic
- **Singleton:** exactly one ChangeManager exists per subsystem; all subjects and observers route through the same instance
**ChangeManager's three responsibilities:**
1. Maintains the subject-to-observer mapping — eliminates per-subject observer storage
2. Defines the update strategy — when and how observers are notified
3. Updates all dependent observers at the request of any subject
**Class structure:**
```
Subject ←──────────────── ChangeManager ────────────────→ Observer
  Attach(Observer)          Register(Subject, Observer)     Update(Subject)
  Detach(Observer)          Unregister(Subject, Observer)
  Notify() ───────────────→ Notify()
    chman->Notify()              │
                     ┌───────────┴───────────┐
            SimpleChangeManager       DAGChangeManager
```
**SimpleChangeManager:**
```
Notify():
for each subject s in subjects:
for each observer o of s:
o->Update(s)
```
Naive — always notifies all observers. Correct when observers observe exactly one subject.
**DAGChangeManager:**
```
Notify():
Pass 1 — mark:
for each subject s in subjects:
for each observer o of s:
mark o as pending
Pass 2 — update once each:
for each marked observer o:
o->Update(relevantSubject)
unmark o
```
Handles directed acyclic graph dependencies. When an observer observes multiple subjects, it receives exactly one `Update` call regardless of how many subjects changed. `DAGChangeManager` is preferable to `SimpleChangeManager` when any observer observes more than one subject.
**Using ChangeManager:**
```cpp
// Subject delegates to ChangeManager instead of maintaining its own list
void Subject::Attach(Observer* o) {
ChangeManager::Instance()->Register(this, o);
}
void Subject::Notify() {
ChangeManager::Instance()->Notify();
}
```
The Singleton `ChangeManager::Instance()` can be initialized with either strategy at startup, making it possible to change the update strategy without touching Subject or Observer classes.
---
### 10. Combining the Subject and Observer Roles
Languages without multiple inheritance (originally Smalltalk) sometimes define Subject and Observer together in a single root class, making all objects potential subjects and observers without any additional inheritance.
In languages with multiple inheritance (C++), a class can inherit from both `Subject` and `Observer`, producing an object that can both publish changes and react to changes from other subjects. The GoF clock example uses the same mechanism to mix `Observer` into a UI class:
```cpp
class DigitalClock : public Widget, public Observer {
// Widget provides UI rendering capability
// Observer provides Update() — reacts to ClockTimer changes
// DigitalClock itself is not a Subject, but could be if it had dependents
};
```
This also applies to systems where an observer itself is a subject for downstream observers — common in reactive pipelines. In that case, the observer's `Update` method mutates its own state and then calls its own `Notify`, propagating the change downstream.
**Warning:** When an object is both subject and observer and part of a cycle, cascade updates are possible. Use `DAGChangeManager` to prevent redundant updates, or add a change-guard to prevent re-notification while an update is in progress:
```cpp
void ComboSubjectObserver::Update(Subject* s) {
if (_updating) return; // prevent re-entrant notification
_updating = true;
// update own state based on s
Notify(); // propagate to own observers
_updating = false;
}
```
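To see the change-guard break a cycle, consider two objects that observe each other. The sketch below uses a single hypothetical `Node` class that plays both roles; `Set`, `Value`, and `updates` are illustrative names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical node that both observes peers and notifies its own observers.
class Node {
public:
    void Attach(Node* observer) {
        assert(observer != nullptr);
        observers_.push_back(observer);
    }
    void Set(int v) {
        value_ = v;
        Notify();
    }
    void Update(Node* changed) {
        if (updating_) return;  // already mid-update: break the cycle
        updating_ = true;
        value_ = changed->value_;
        ++updates;
        Notify();               // propagate to own observers
        updating_ = false;
    }
    int Value() const { return value_; }
    int updates = 0;
private:
    void Notify() {
        for (Node* o : observers_) o->Update(this);
    }
    int value_ = 0;
    bool updating_ = false;
    std::vector<Node*> observers_;
};
```

With nodes `a` and `b` attached to each other, `a.Set(5)` terminates: `b` updates once, echoes back to `a`, and the guard in `b` stops the next round. Note that the flag is not reset if `Notify` throws; a scope guard would be needed for exception safety.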
---
## Participants Summary
| Participant | Role | Knows about |
|-------------|------|-------------|
| `Subject` | Maintains observer list; provides Attach/Detach/Notify | Abstract `Observer` only |
| `Observer` | Defines Update interface | Nothing about Subject state |
| `ConcreteSubject` | Stores subject state; calls Notify on change | Its own state and the abstract Observer |
| `ConcreteObserver` | Implements Update; holds reference to ConcreteSubject | ConcreteSubject (for GetState queries) |
| `ChangeManager` | Mediates subject-observer mapping; controls update strategy | Both Subject and Observer interfaces |
---
## Collaborations Sequence
```
aConcreteSubject          aConcreteObserver        anotherConcreteObserver
       │                          │                           │
       │←────── SetState() ───────│                           │
       │                          │                           │
       │── Notify() (on itself)   │                           │
       │                          │                           │
       │─────── Update() ────────→│                           │
       │←────── GetState() ───────│                           │
       │                          │                           │
       │─────── Update() ────────────────────────────────────→│
       │←────── GetState() ───────────────────────────────────│
       │                          │                           │
```
Note: the observer that initiates the change (`SetState`) postpones its own update until it receives notification from the subject. `Notify` is not always called by the subject — it can be called by an observer or by an external client (see Implementation Concern 3).
---
## Consequences
### 1. Abstract Coupling Between Subject and Observer
The subject knows only that it has a list of observers conforming to the abstract `Observer` interface. It does not know the concrete class of any observer, so the coupling is minimal and abstract. This allows Subject and Observer to belong to different layers of abstraction in a system: a lower-level subject can communicate with a higher-level observer without violating layering. If the two were lumped into one object, that object would either span both layers (violating the layering) or be forced into one of them (which might compromise the layering abstraction).
### 2. Support for Broadcast Communication
Unlike an ordinary request with one sender and one receiver, a subject's notification is broadcast to all subscribed observers. The subject does not care how many observers exist. Observers can be added and removed at any time without modifying the subject. It is up to each observer whether to handle or ignore a notification.
### 3. Unexpected Updates
Because observers have no knowledge of each other's presence, they cannot predict the ultimate cost of a change to the subject. A seemingly innocuous state change may cause a cascade of updates through multiple observers and their dependent objects. Furthermore, the simple Update protocol provides no information about *what* changed — only that *something* changed. Observers must work to discover what changed (pull model overhead), or the subject must provide that information (push model coupling). Ill-defined dependency criteria lead to spurious updates that are difficult to track down.
---
## Related Patterns
| Pattern | Relationship |
|---------|-------------|
| **Mediator** | ChangeManager acts as a Mediator between subjects and observers. It encapsulates the update communication so neither subject nor observer needs to know the other's full protocol. |
| **Singleton** | ChangeManager is typically a Singleton — there is one per subsystem, and it must be globally accessible to both subjects and observers. |
---
## Known Uses
- **Smalltalk Model/View/Controller (MVC):** The first and best-known application. Model is Subject; View is Observer. Controller mediates input to model changes. View re-renders on model notification.
- **InterViews:** Defines explicit `Observer` and `Observable` (for Subject) classes.
- **Andrew Toolkit:** Calls subjects "data objects" and observers "views."
- **Unidraw:** Splits graphical editor objects into View (observer) and Subject parts.
- **Qt (signals/slots):** A type-safe, language-integrated implementation of Observer with aspect-based registration (signals map to aspects).
- **Java EventListener model:** Observer implemented as interface; aspect-based (event type per listener interface).
- **RxJS / Reactive Extensions:** Observer generalized to streams — an observable (subject) emits values; subscribers (observers) react to each emission.
Design a system using multiple coordinated design patterns, moving beyond single-pattern application to pattern composition. Use when facing a complex system...
---
name: multi-pattern-system-designer
description: |
Design a system using multiple coordinated design patterns, moving beyond single-pattern application to pattern composition. Use when facing a complex system with multiple design problems — object creation inflexibility, structural rigidity, and behavioral coupling occurring simultaneously. Guides through the Lexi document editor methodology: decompose the system into design problems, map each to a pattern, identify pattern interaction points (e.g., Abstract Factory configures Bridge, Command uses Composite for macros, Iterator enables Visitor), and verify the patterns work together coherently.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/design-patterns-gof/skills/multi-pattern-system-designer
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- design-pattern-selector
- abstract-factory-implementor
- composite-pattern-implementor
- command-pattern-implementor
- visitor-pattern-implementor
source-books:
- id: design-patterns-gof
title: "Design Patterns: Elements of Reusable Object-Oriented Software"
authors: ["Erich Gamma", "Richard Helm", "Ralph Johnson", "John Vlissides"]
chapters: [2]
tags: [design-patterns, object-oriented, gof, system-design, pattern-composition, capstone, lexi, multi-pattern]
execution:
tier: 2
mode: full
inputs:
- type: none
description: "System description with multiple co-existing design problems — or a codebase to analyze"
tools-required: [Read, Write, TodoWrite]
tools-optional: [Grep, Glob]
mcps-required: []
environment: "Any agent environment. A codebase is useful but not required — a verbal system description is sufficient."
---
# Multi-Pattern System Designer
## When to Use
Your system has several design problems at once, and applying patterns in isolation is not enough — the patterns need to work together. This skill is appropriate when:
- You're designing a non-trivial system from scratch and want a principled way to select and arrange patterns
- You're refactoring a complex system and need to map multiple pain points to coordinated solutions
- You understand individual patterns but aren't sure how to combine them without conflicts
- You want to validate that your chosen patterns interact correctly rather than fighting each other
- You're learning how expert designers think at the system level, not just the class level
**Prerequisite check before starting:**
- Do you have a description of the system being designed? (Even informal is fine — "a WYSIWYG document editor" is sufficient)
- Has the user already identified some design problems, or should you surface them? (This skill does both)
- Are you working with object-oriented design? (Pattern composition assumes classes, inheritance, and delegation)
This skill is a **capstone** — it assumes you can apply individual patterns. If a specific pattern is unfamiliar, consult the relevant `*-implementor` skill after the design map is complete.
---
## Context & Input Gathering
### Required Context
- **System description:** What is the system being designed or refactored? What does it do?
- **At least one known pain point or design goal:** "It needs to support multiple platforms" or "we want to add behaviors without modifying existing classes"
### Observable Context (gather from environment if available)
- **Existing codebase:** Scan for design smells — large `if/else` on types, hardcoded constructors, deep inheritance hierarchies, classes that both store data and run algorithms on it
- **Prior pattern usage:** Note which patterns are already in place — the new patterns must compose with them
- **Platform and language:** Some patterns are provided by the platform (Java `Iterator`, Python context managers); skip those and note the gap is already covered
### Sufficiency Check
You have enough to proceed with one well-described system and at least two distinguishable design problems (e.g., structure + behavior, or creation + extensibility). If only one problem is visible, proceed — the decomposition step usually surfaces others.
---
## Process
### Step 1: Decompose the System into Design Problems
**ACTION:** Use `TodoWrite` to track all steps, then list every distinct design problem in the system — not symptoms, but underlying design challenges.
**WHY:** Multi-pattern design fails when problems are not clearly separated. A single vague problem statement produces a single pattern choice, which leaves the rest of the system unaddressed. Decomposing first ensures every axis of flexibility gets its own solution rather than one pattern being stretched to cover everything.
```
TodoWrite:
- [ ] Step 1: Decompose system into design problems
- [ ] Step 2: Map each problem to a candidate pattern
- [ ] Step 3: Identify pattern interaction points
- [ ] Step 4: Verify pattern coherence
- [ ] Step 5: Produce multi-pattern design document
```
**Decomposition approach — ask these seven questions about the system:**
| Question | Design problem it surfaces | Lexi example |
|----------|---------------------------|--------------|
| How are complex objects represented internally? | Structure — how to treat composites and primitives uniformly | Documents contain characters, lines, columns — all need uniform treatment |
| How are algorithms applied to that structure? | Algorithm inflexibility — algorithm embedded in structure | Multiple formatting algorithms (simple, TeX-quality) must be swappable |
| How is behavior added to objects over time? | Embellishment rigidity — inheritance causes class explosion | Adding scroll bars + borders via subclassing → combinatorial explosion |
| How are families of related objects created? | Creational coupling — code names concrete classes | UI widgets must match look-and-feel (Motif vs. Mac) without hardcoded constructors |
| How does the system run on multiple platforms? | Platform dependency — business logic knows too much about infrastructure | Window drawing logic tied to X11, PM, Mac APIs |
| How do users trigger and undo operations? | Behavioral coupling — UI directly calls operation logic | Menu items, buttons, keyboard shortcuts all trigger the same operations; undo needed |
| How are analytical operations added to a stable structure? | Extensibility — adding analysis modifies every class | Adding spell check, hyphenation, word count should not change the structure classes |
For your system, fill in the right two columns. Not all seven will apply — typically 3–6 design problems exist in a well-scoped system. If fewer than 3 are visible, the system may be simpler than a multi-pattern design warrants.
**Output:** A numbered list of design problems, each stated as a design challenge (not a symptom). Example:
```
1. Document structure: need uniform treatment of characters, images, and composite containers
2. Formatting: formatting algorithm must be changeable independently of document structure
3. Embellishment: scroll bars and borders must be addable without subclassing
4. Look-and-feel: widget family must match platform standard without hardcoded constructors
5. Window platform: window drawing must run on multiple OS window systems
6. User operations: operations must be triggerable from multiple UI surfaces, with undo
7. Analytical operations: analysis (spell check, hyphenation) must be extensible without modifying structure classes
```
---
### Step 2: Map Each Problem to a Pattern
**ACTION:** For each design problem from Step 1, select a single primary pattern. Apply the heuristics below.
**WHY:** The mapping is not arbitrary — each pattern was discovered because it solves a recurring class of structural problem. Using the right pattern for each problem ensures the solutions are independent and can be reasoned about separately. Using the wrong pattern creates entanglement that defeats the purpose of the pattern.
**Problem-to-pattern heuristics:**
| Design problem type | Signal in the code | Primary pattern | Why this pattern |
|--------------------|--------------------|-----------------|-----------------|
| Uniform treatment of composites and primitives | Code has `if (isLeaf) ... else ...` on structure | **Composite** | Gives leaves and composites a common interface; clients never distinguish |
| Algorithm must vary independently of the structure it operates on | Algorithm is a method on the data class; swapping it means modifying that class | **Strategy** | Encapsulates each algorithm as an object; context delegates to it |
| Adding responsibilities without subclassing | Subclass names encode combinations: `ScrollableBorderedPanel` | **Decorator** | Wraps objects at run-time; responsibilities compose via chaining |
| Creating families of related objects without naming concrete classes | `new MotifButton()` scattered through client code | **Abstract Factory** | Client calls factory interface; concrete factory encapsulates platform decisions |
| Two independent hierarchies that must evolve separately | One class hierarchy tries to cover both "what" and "how" | **Bridge** | Separates abstraction hierarchy from implementation hierarchy; they vary independently |
| Decoupling request issuance from request execution, with undo | Menu items subclass per operation; no undo mechanism | **Command** | Encapsulates each request as an object with Execute/Unexecute; history enables undo |
| Traversal of structure without coupling to internal representation | Traversal logic lives inside structure classes; changing containers breaks traversal | **Iterator** | Encapsulates traversal algorithm; clients navigate without knowing the storage structure |
| Adding open-ended operations to a stable class hierarchy | New operation requires adding a method to every class in the hierarchy | **Visitor** | Encapsulates operations as visitor objects; adding analysis = new Visitor subclass |
**For each problem in your list:** state the primary pattern and a one-sentence justification connecting the pattern's intent to the specific problem. If two patterns seem equally applicable, note both and add a trade-off sentence.
**Pattern application table (fill in for your system):**
```
| # | Design Problem | Pattern | Justification |
|---|---------------|---------|---------------|
| 1 | [problem] | [name] | [one sentence] |
```
**Lexi reference:** The completed table for Lexi is in `references/lexi-case-study.md`.
---
### Step 3: Identify Pattern Interaction Points
**ACTION:** Review each pair of patterns from Step 2 and determine: do they interact? If so, how?
**WHY:** This is the step most developers skip — and where multi-pattern designs break down. Patterns are not isolated. When one pattern creates objects that another pattern uses, or when one pattern's structure is traversed by another, there is an interaction point. Failing to design these interactions explicitly leads to tight coupling between the patterns themselves, defeating their purpose.
**Three canonical interaction types:**
**Type 1 — Creational configures structural (factory creates implementation):**
A creational pattern (Abstract Factory, Factory Method) produces the concrete objects that a structural or behavioral pattern needs to function. The factory encapsulates which concrete variant is selected, and the structural pattern receives the result without knowing what it is.
- Lexi example: `WindowSystemFactory` (Abstract Factory) creates `WindowImp` objects. `Window` (Bridge abstraction) receives a `WindowImp` from the factory at construction time. The Bridge pattern doesn't know which window system is in use; the factory decides.
- Generalized: When your Bridge or Strategy uses platform-specific or environment-specific implementations, use Abstract Factory to configure it at startup.
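A minimal Python sketch of this interaction, assuming simplified classes: `Window`, `WindowImp`, and `WindowSystemFactory` follow the Lexi names, while `XWindowSystemFactory`, `MacWindowSystemFactory`, and `device_rect` are hypothetical names introduced for illustration. The point to notice is that `Window` never names a concrete `WindowImp`.

```python
from abc import ABC, abstractmethod


class WindowImp(ABC):
    """Implementor side of the Bridge: platform-level drawing."""

    @abstractmethod
    def device_rect(self, x0, y0, x1, y1):
        ...


class XWindowImp(WindowImp):
    def device_rect(self, x0, y0, x1, y1):
        return f"X11 rect {x0},{y0},{x1},{y1}"


class MacWindowImp(WindowImp):
    def device_rect(self, x0, y0, x1, y1):
        return f"Mac rect {x0},{y0},{x1},{y1}"


class WindowSystemFactory(ABC):
    """Abstract Factory: the only place that names a concrete WindowImp."""

    @abstractmethod
    def create_window_imp(self):
        ...


class XWindowSystemFactory(WindowSystemFactory):  # hypothetical concrete factory
    def create_window_imp(self):
        return XWindowImp()


class MacWindowSystemFactory(WindowSystemFactory):  # hypothetical concrete factory
    def create_window_imp(self):
        return MacWindowImp()


class Window:
    """Abstraction side of the Bridge; its implementation comes from the factory."""

    def __init__(self, factory):
        # The explicit interaction point between the two patterns.
        self._imp = factory.create_window_imp()

    def draw_rect(self, x0, y0, x1, y1):
        return self._imp.device_rect(x0, y0, x1, y1)


print(Window(XWindowSystemFactory()).draw_rect(0, 0, 10, 5))
print(Window(MacWindowSystemFactory()).draw_rect(0, 0, 10, 5))
```

Swapping the factory passed at startup changes the platform without touching `Window`.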
**Type 2 — Behavioral pattern uses structural pattern (command composes commands):**
A behavioral pattern's objects are themselves organized using a structural pattern. This allows single behavioral objects and groups of behavioral objects to be treated uniformly.
- Lexi example: `MacroCommand` is a `Command` subclass that also plays the Composite role: it holds a list of child commands. Executing a macro executes its children in order; undoing it unexecutes them in reverse order. The Command pattern gains the benefits of Composite: macros can be nested, undone as a unit, and stored in history alongside atomic commands.
- Generalized: Whenever you need to batch or sequence Command objects, use Composite to build macro commands. The same Execute/Unexecute interface applies to both atomic and composite commands.
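The macro idea can be sketched in Python as follows. `AppendText` is a hypothetical atomic command invented for this example, and a plain list stands in for the document; only the `MacroCommand` shape comes from the text above.

```python
class Command:
    """Common interface shared by atomic and composite commands."""

    def execute(self):
        raise NotImplementedError

    def unexecute(self):
        raise NotImplementedError


class AppendText(Command):  # hypothetical atomic command
    def __init__(self, doc, text):
        self.doc = doc
        self.text = text

    def execute(self):
        self.doc.append(self.text)

    def unexecute(self):
        self.doc.pop()


class MacroCommand(Command):
    """A Command that is also a Composite of other Commands."""

    def __init__(self, children):
        self.children = list(children)

    def execute(self):
        for child in self.children:
            child.execute()

    def unexecute(self):
        # Undo in reverse order so each child sees the state it left behind.
        for child in reversed(self.children):
            child.unexecute()


doc = []
inner = MacroCommand([AppendText(doc, "b"), AppendText(doc, "c")])
macro = MacroCommand([AppendText(doc, "a"), inner])  # macros nest
macro.execute()
print(doc)
macro.unexecute()
print(doc)
```

Because `MacroCommand` exposes the same two operations as any other command, callers such as a history list need no special case for it.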
**Type 3 — Traversal enables analysis separation (iterator enables visitor):**
An Iterator provides traversal over a structure independently of how it is stored. A Visitor receives each element during that traversal and performs analysis. The two patterns divide responsibility cleanly: Iterator owns *how to move* through the structure; Visitor owns *what to do* at each element.
- Lexi example: A `PreorderIterator` traverses the Glyph structure. A `SpellingCheckingVisitor` is carried along and called at each node via `Accept(visitor)`. Adding hyphenation analysis = new Visitor subclass, not a change to Glyph or Iterator.
- Generalized: If your Visitor needs to be applied across a Composite structure, pair it with an external Iterator. The Iterator decouples traversal order from the Visitor's analysis logic — either can change independently.
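A small Python sketch of the division of labor, under simplifying assumptions: a generator function `preorder` stands in for `PreorderIterator`, and `WordCountVisitor` and `CharacterCountVisitor` are illustrative visitors with deliberately naive logic. The same traversal drives both analyses unchanged.

```python
class Glyph:
    """Element interface: every glyph has children and an accept() hook."""

    def __init__(self, children=()):
        self.children = list(children)

    def accept(self, visitor):
        raise NotImplementedError


class Character(Glyph):
    def __init__(self, char):
        super().__init__()
        self.char = char

    def accept(self, visitor):
        visitor.visit_character(self)


class Row(Glyph):
    def accept(self, visitor):
        visitor.visit_row(self)


def preorder(glyph):
    """Stand-in for PreorderIterator: parent first, then children left to right."""
    yield glyph
    for child in glyph.children:
        yield from preorder(child)


class CharacterCountVisitor:
    def __init__(self):
        self.count = 0

    def visit_character(self, character):
        self.count += 1

    def visit_row(self, row):
        pass  # rows contribute nothing to this analysis


class WordCountVisitor:
    def __init__(self):
        self.words = 0
        self._in_word = False

    def visit_character(self, character):
        if character.char.isalpha():
            if not self._in_word:
                self.words += 1
            self._in_word = True
        else:
            self._in_word = False

    def visit_row(self, row):
        self._in_word = False  # a row boundary ends the current word


def analyze(root, visitor):
    # The traversal decides how to move; the visitor decides what to do.
    for glyph in preorder(root):
        glyph.accept(visitor)
    return visitor


document = Row([Row([Character(c) for c in "hi all"]),
                Row([Character(c) for c in "ok"])])
print(analyze(document, WordCountVisitor()).words)
print(analyze(document, CharacterCountVisitor()).count)
```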
**How to identify interactions in your system:**
Go through your pattern map from Step 2 and ask:
1. Does any pattern produce objects that another pattern consumes? → Creational-configures-structural interaction
2. Does any behavioral pattern need to operate on groups as well as individuals? → Behavioral-uses-structural interaction
3. Does any traversal need to carry analysis logic that should be changeable? → Traversal-enables-analysis interaction
**Output:** A list of interaction points, each with: Pattern A → Pattern B, interaction type, and the class or method that links them (called the "linking class" here to avoid confusion with the Bridge pattern).
---
### Step 4: Verify Pattern Coherence
**ACTION:** Apply four coherence checks to the full pattern map.
**WHY:** Individual patterns can each be correctly chosen while still conflicting at the system level. Coherence verification surfaces these conflicts before implementation begins — when they are cheap to fix. The four checks cover the most common failure modes in multi-pattern design.
**Check 1 — No duplicate responsibilities:**
Each design problem should be assigned to exactly one pattern. If two patterns both claim to solve the same problem, one is misapplied. Example: using both Decorator and Bridge to handle "multiple rendering implementations" duplicates responsibility; only one should own that axis.
**Check 2 — Interaction points are explicit:**
Every place where one pattern's output becomes another pattern's input must be a named class or method, not an implicit assumption. The `WindowSystemFactory::CreateWindowImp()` method is explicit. An interaction that exists only in the developer's head is not a design — it's a future bug.
**Check 3 — Extension paths are clear:**
For each pattern, ask: "What happens when we add a new variant?" The answer should be a new class, not a modification to existing ones. If adding a new window system requires modifying the `Window` class, the Bridge is mis-configured. If adding a new analysis requires modifying `Glyph`, the Visitor is not set up correctly.
**Check 4 — No pattern is doing too much:**
If a single pattern (especially Composite or Strategy) appears to solve three or four design problems, that is a signal that the decomposition in Step 1 was not fine-grained enough. Re-examine whether those "problems" are actually one problem stated too broadly.
**Fail conditions:** If any check fails, return to the step where the problem originates (usually Step 1 or Step 2) and revise. Do not proceed to Step 5 with known coherence failures.
---
### Step 5: Produce the Multi-Pattern Design Document
**ACTION:** Assemble the complete design as a structured document with four sections.
**WHY:** A multi-pattern design is too complex to keep in working memory. Writing it down serves three purposes: it forces precision (vague plans don't survive structured documentation), it creates a reviewable artifact, and it acts as the blueprint for implementation. The document should be specific enough that a developer unfamiliar with the decisions can implement it correctly.
**Document structure:**
---
```markdown
# [System Name] — Multi-Pattern Design
## System Overview
[2–3 sentences: what the system does, what design challenges drove the pattern choices]
## Design Problem Map
| # | Design Problem | Pattern | Key Participant Classes |
|---|---------------|---------|------------------------|
| 1 | ... | ... | ... |
## Pattern Interaction Diagram
[Text diagram or description of the interaction points]
Key interactions:
1. [Pattern A] → configures → [Pattern B]: [class/method that links them]
2. [Pattern A] → uses structure of → [Pattern B]: [class that is both]
3. [Pattern A] → enables → [Pattern B]: [traversal mechanism]
## Extension Guide
| To add... | Action required | Classes to create |
|------------------------------------|------------------------------------|-------------------|
| New [variant of problem 1] | Subclass [participant] | [class name] |
| New [variant of problem 2] | Subclass [participant] | [class name] |
| New analysis operation | New Visitor subclass | [class name] |
## Coherence Notes
[Any trade-offs or constraints the implementation must respect]
```
---
## Examples
### Example 1: WYSIWYG Document Editor (Lexi)
**Scenario:** Building a document editor that must represent rich document structure, support multiple formatting algorithms, allow UI embellishments (borders, scroll bars), support multiple look-and-feel standards, run on multiple window systems, support undo, and allow extensible analysis (spell check, hyphenation).
**Trigger:** "We need to design Lexi, a WYSIWYG document editor. It has to handle everything from basic text rendering to platform portability to undoable editing."
**Process summary:**
Step 1 decomposed this into 7 design problems. Step 2 mapped each to a pattern. Step 3 found 3 interaction points. Step 4 confirmed coherence — each pattern owns exactly one problem, interactions are explicit through named factory and composite classes, every new variant is a new class.
**Output (abbreviated):**
```
# Lexi — Multi-Pattern Design
## Design Problem Map
| # | Design Problem | Pattern | Key Participant Classes |
|---|-------------------------------------|------------------|---------------------------------|
| 1 | Uniform treatment of doc elements | Composite | Glyph, Character, Row, Column |
| 2 | Swappable formatting algorithms | Strategy | Compositor, Composition |
| 3 | Adding UI embellishments at runtime | Decorator | MonoGlyph, Border, Scroller |
| 4 | Multiple look-and-feel widget sets | Abstract Factory | GUIFactory, MotifFactory |
| 5 | Multiple window system platforms | Bridge | Window, WindowImp, XWindowImp |
| 6 | Undoable user operations | Command | Command, MenuItem, History |
| 7 | Extensible document analysis | Visitor | Visitor, SpellingCheckingVisitor |
## Pattern Interaction Diagram
┌───────────────────────┐
│   Abstract Factory    │
│ (WindowSystemFactory) │
└───────────┬───────────┘
            │ creates WindowImp
            ▼
┌────────────────────────┐
│ Bridge │
│ Window ←→ WindowImp │
└────────────────────────┘
┌────────────────┐ uses Composite
│ Command │ ──────────────────► MacroCommand
│ (MenuItem) │ (Command + Composite)
└────────────────┘
┌─────────────┐ traversal
│ Iterator │ ────────────────► ┌─────────┐
│(PreorderIt.)│ │ Visitor │
└─────────────┘ │(Spelling│
│Checker) │
└─────────┘
Key interactions:
1. Abstract Factory → configures → Bridge: WindowSystemFactory::CreateWindowImp()
2. Command → uses structure of → Composite: MacroCommand contains List<Command*>
3. Iterator → enables → Visitor: PreorderIterator carries Visitor to each Glyph::Accept()
## Extension Guide
| To add... | Action required | Classes to create |
|------------------------------|-------------------------------------|---------------------------|
| New formatting algorithm | Subclass Compositor | TeXCompositor |
| New UI embellishment | Subclass MonoGlyph | ShadowDecorator |
| New look-and-feel standard | Subclass GUIFactory | MacFactory |
| New window system platform | Subclass WindowImp | WaylandWindowImp |
| New undoable operation | Subclass Command | IndentCommand |
| New document analysis | Subclass Visitor | WordCountVisitor |
```
Full details: `references/lexi-case-study.md`
---
### Example 2: Financial Portfolio System
**Scenario:** A financial system needs to represent portfolios that contain accounts and sub-portfolios (hierarchical structure), support multiple risk calculation algorithms, log every transaction with undo, and generate various regulatory reports without modifying account classes.
**Trigger:** "We need to redesign our portfolio system. It needs hierarchical holdings, pluggable risk models, an audit trail, and extensible reporting."
**Process summary:**
Step 1 surfaces 4 design problems: (1) hierarchical portfolio structure, (2) multiple risk algorithms, (3) auditable transactions with undo, (4) extensible reporting over a stable hierarchy.
Step 2 maps: Composite for hierarchy (portfolios contain portfolios and accounts uniformly), Strategy for risk models (each model is interchangeable), Command for transactions (undo/redo + audit log), Visitor for reporting (new report type = new Visitor subclass).
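Step 3 interactions: Command uses Composite (a `BatchTransaction` is a MacroCommand holding a list of individual Commands). Visitor is enabled by Iterator traversal over the Composite hierarchy; the Iterator is a supporting pattern introduced for this interaction, so it has no row of its own in the design problem map.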
```
# Portfolio System — Multi-Pattern Design
## Design Problem Map
| # | Design Problem | Pattern | Key Participant Classes |
|---|--------------------------------------|-----------|--------------------------------------------|
| 1 | Uniform treatment of holdings | Composite | Holding, Account, Portfolio |
| 2 | Swappable risk calculation | Strategy | RiskModel, VaRModel, StressTestModel |
| 3 | Auditable transactions with undo | Command | Transaction, TradeCommand, BatchTransaction |
| 4 | Extensible regulatory reporting | Visitor | ReportVisitor, BaselIIIReportVisitor |
Key interactions:
1. Command → uses → Composite: BatchTransaction contains List<Transaction>
2. Iterator → enables → Visitor: PreorderIterator carries ReportVisitor through Portfolio tree
```
---
### Example 3: Game Engine with Multiple Platforms
**Scenario:** A game engine needs a scene graph (hierarchical game objects), pluggable rendering backends (OpenGL, Vulkan, Metal), UI behaviors that can be stacked (glow + shadow + border), and a scriptable action system with undo for level editing.
**Trigger:** "We're building a cross-platform game engine. We need a scene graph, multiple renderers, stackable UI effects, and undoable editor commands."
**Process summary:**
Step 1: (1) scene hierarchy, (2) rendering platform independence, (3) stackable visual effects, (4) undoable level-editor operations.
Step 2: Composite (scene graph), Bridge (game object abstraction vs. renderer implementation), Decorator (stackable visual effects), Command (undoable editor operations).
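Step 3 interactions: an Abstract Factory (`PlatformFactory`) creates the concrete `RendererImpl` that the Bridge uses; it is a supporting pattern introduced for this interaction, so it has no row of its own in the design problem map. Command uses Composite for compound editor operations (`GroupMoveCommand`).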
```
# Game Engine — Multi-Pattern Design
## Design Problem Map
| # | Design Problem | Pattern | Key Participant Classes |
|---|---------------------------------|------------------|----------------------------------------------|
| 1 | Scene hierarchy | Composite | SceneNode, MeshNode, GroupNode |
| 2 | Renderer platform independence | Bridge | Renderer, RendererImpl, VulkanRendererImpl |
| 3 | Stackable visual effects | Decorator | EffectNode, GlowEffect, ShadowEffect |
| 4 | Undoable level-editor ops | Command | EditorCommand, MoveCommand, GroupMoveCommand |
Key interactions:
1. Abstract Factory → configures → Bridge: PlatformFactory::CreateRendererImpl()
2. Command → uses → Composite: GroupMoveCommand contains List<EditorCommand>
```
---
## Reference Files
| File | Contents |
|------|----------|
| `references/lexi-case-study.md` | Full Lexi pattern map: 7 problems, 8 patterns, 3 interaction points, extension guide, summary from GoF Chapter 2 |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-design-pattern-selector`
- `clawhub install bookforge-abstract-factory-implementor`
- `clawhub install bookforge-composite-pattern-implementor`
- `clawhub install bookforge-command-pattern-implementor`
- `clawhub install bookforge-visitor-pattern-implementor`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/lexi-case-study.md
# Lexi Case Study — Full Pattern Interaction Map
**Source:** Design Patterns: Elements of Reusable Object-Oriented Software, Chapter 2 (Gamma, Helm, Johnson, Vlissides)
This reference documents the complete multi-pattern design of Lexi, a WYSIWYG document editor. Eight patterns solve seven design problems in one coherent system. This is the canonical demonstration that patterns are not isolated solutions — they form a coordinated architecture.
---
## The Seven Design Problems
Lexi's designers identified these seven distinct design challenges before choosing any pattern:
1. **Document structure:** Internal representation must support text and graphics uniformly, handle arbitrary nesting (characters → rows → columns → pages), and allow traversal for rendering, formatting, and analysis.
2. **Formatting:** The algorithm that breaks text into lines and columns must be independent of the document structure it operates on. Lexi needs multiple formatting algorithms (fast/simple vs. slow/high-quality) that clients can swap at runtime.
3. **Embellishing the user interface:** Scroll bars and borders must be addable and removable without modifying the document view class. Adding them via subclassing causes combinatorial class explosion (BorderedScrollableView, etc.).
4. **Supporting multiple look-and-feel standards:** Widget glyphs (buttons, scroll bars, menus) must match the platform's look-and-feel (Motif, Presentation Manager, Mac) without hardcoded constructor calls scattered through client code.
5. **Supporting multiple window systems:** The window drawing abstraction must work across incompatible window system APIs (X11, Presentation Manager, Mac OS) without requiring a different build per platform or a parallel class hierarchy explosion.
6. **User operations:** Operations triggered from menus, buttons, and keyboard shortcuts must share a uniform mechanism. Undo and redo must be supported for document-modifying operations without an arbitrary limit on levels.
7. **Spelling checking and hyphenation:** Analysis operations (spell checking, hyphenation, word counting, etc.) must be extensible without modifying the Glyph class hierarchy every time a new analysis is needed.
---
## The Eight Pattern Solutions
### Problem 1 → Composite
**Design problem:** Uniform treatment of document elements at all levels of nesting.
**Solution:** Define `Glyph` as an abstract base class with a common interface (Draw, Bounds, Intersects, Insert, Remove, Child, Parent). Leaf glyphs (`Character`, `Image`, `Rectangle`, `Polygon`) and composite glyphs (`Row`, `Column`, `Page`) both subclass `Glyph`. Clients treat all glyphs uniformly — a `Row::Draw()` iterates over children and calls `Draw()` on each without knowing whether any child is a leaf or a composite.
**Key participants:**
- `Glyph` — Component (abstract base class)
- `Character`, `Image`, `Rectangle`, `Polygon` — Leaf
- `Row`, `Column`, `Page` — Composite
- `Window` — Client (calls Draw on the root Glyph)
**Why this works:** The tenth element in a row could be a single character or an intricate diagram with hundreds of subelements. As long as it can `Draw()` and report its `Bounds()`, its internal complexity is irrelevant to the compositor, the formatter, and the display system.
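As an illustration, here is a reduced Python sketch of the Composite structure. It assumes a trimmed interface: `draw()` returns a string instead of painting to a window, and `width()` is a simplified stand-in for `Bounds()`. Both simplifications are assumptions for this example, not the interface described above.

```python
from abc import ABC, abstractmethod


class Glyph(ABC):
    """Component: the interface shared by leaves and composites."""

    @abstractmethod
    def draw(self):
        ...

    @abstractmethod
    def width(self):
        ...


class Character(Glyph):  # Leaf
    def __init__(self, char):
        self.char = char

    def draw(self):
        return self.char

    def width(self):
        return 1


class Image(Glyph):  # Leaf
    def __init__(self, name, width):
        self.name = name
        self._width = width

    def draw(self):
        return f"[{self.name}]"

    def width(self):
        return self._width


class Row(Glyph):  # Composite
    def __init__(self, children=()):
        self._children = list(children)

    def insert(self, glyph):
        self._children.append(glyph)

    def draw(self):
        # No type checks: each child may be a leaf or another composite.
        return "".join(child.draw() for child in self._children)

    def width(self):
        return sum(child.width() for child in self._children)


row = Row([Character("a"), Image("fig", 5), Row([Character("b"), Character("c")])])
print(row.draw(), row.width())
```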
---
### Problem 2 → Strategy
**Design problem:** Formatting algorithm must be swappable independently of document structure.
**Solution:** Define a `Compositor` abstract class with a `Compose()` operation. Introduce a `Composition` Glyph subclass that delegates formatting to a `Compositor` object. An unformatted `Composition` holds only the visible glyphs. When formatting is needed, it calls `compositor->Compose()`, which inserts `Row` and `Column` glyphs according to its algorithm. Three concrete strategies are defined: `SimpleCompositor` (fast, no hyphenation), `TeXCompositor` (full TeX algorithm that optimizes the document's "color", meaning an even distribution of text and white space), and `ArrayCompositor` (fixed-interval breaks for tables).
**Key participants:**
- `Compositor` — Strategy interface
- `SimpleCompositor`, `TeXCompositor`, `ArrayCompositor` — Concrete strategies
- `Composition` — Context (holds a reference to a Compositor)
**Why this works:** Adding a new formatting algorithm requires only a new `Compositor` subclass. The `Glyph` class hierarchy is never touched. The algorithm can be changed at runtime by calling `Composition::SetCompositor()`.
**Key insight:** The Strategy pattern requires designing interfaces general enough to support the full range of algorithms without the context (Composition) needing to know which one is running. The basic Glyph child-access interface is sufficient for all three Compositor strategies.
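A hedged Python sketch of the swap, assuming a much simpler model than Lexi's: words stand in for glyphs, `compose()` returns rows as lists instead of inserting `Row` glyphs, and the greedy line breaker in `SimpleCompositor` is an illustrative algorithm chosen for this example.

```python
from abc import ABC, abstractmethod


class Compositor(ABC):
    """Strategy interface: turns a flat sequence into rows."""

    @abstractmethod
    def compose(self, words):
        ...


class SimpleCompositor(Compositor):
    """Greedy line breaking: fill each row up to line_width characters."""

    def __init__(self, line_width):
        self.line_width = line_width

    def compose(self, words):
        rows, current, used = [], [], 0
        for word in words:
            needed = len(word) + (1 if current else 0)  # +1 for the space
            if current and used + needed > self.line_width:
                rows.append(current)
                current, used = [], 0
                needed = len(word)
            current.append(word)
            used += needed
        if current:
            rows.append(current)
        return rows


class ArrayCompositor(Compositor):
    """Fixed-interval breaks: a row after every per_row items."""

    def __init__(self, per_row):
        self.per_row = per_row

    def compose(self, words):
        return [words[i:i + self.per_row]
                for i in range(0, len(words), self.per_row)]


class Composition:
    """Context: owns the content and delegates formatting to a Compositor."""

    def __init__(self, words, compositor):
        self.words = list(words)
        self._compositor = compositor

    def set_compositor(self, compositor):
        self._compositor = compositor

    def rows(self):
        return self._compositor.compose(self.words)


text = Composition("the quick brown fox jumps".split(), SimpleCompositor(10))
print(text.rows())
text.set_compositor(ArrayCompositor(3))  # swap the algorithm at runtime
print(text.rows())
```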
---
### Problem 3 → Decorator
**Design problem:** Add scroll bars and borders to the document view without class explosion via inheritance.
**Solution:** Introduce `MonoGlyph` as a `Glyph` subclass that stores a reference to one component glyph and forwards all operations to it by default. `Border` and `Scroller` subclass `MonoGlyph` and override `Draw()` to first delegate to their component (drawing the content), then draw their embellishment on top. Embellishments are composed at runtime: `Border(Scroller(Composition))` adds both a border and scroll bars.
**Key participants:**
- `MonoGlyph` — Decorator base (transparent enclosure)
- `Border` — Concrete Decorator (draws border after delegating)
- `Scroller` — Concrete Decorator (clips and offsets content)
- Any `Glyph` subclass — Component
**Why this works:** The order of composition matters and is under application control. `Border(Scroller(c))` gives a border that does not scroll; `Scroller(Border(c))` gives scroll bars that move both content and border. Neither requires a new class — just a different composition order.
**Why not inheritance:** If you add a border by subclassing `Composition`, you get `BorderedComposition`. Add scroll bars the same way and you need `BorderedComposition`, `ScrollableComposition`, and `BorderedScrollableComposition`. Every new embellishment doubles the number of combinations that might need their own subclass. Transparent enclosure keeps embellishment classes small and their count linear: one class per embellishment, combined at runtime.
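The effect of composition order can be sketched in Python. This assumes `draw()` returns a nested string as a stand-in for real drawing, so the wrapping order is visible in the output; the class names follow the participants listed above.

```python
class Glyph:
    def draw(self):
        raise NotImplementedError


class Composition(Glyph):
    def draw(self):
        return "text"


class MonoGlyph(Glyph):
    """Transparent enclosure: forwards to its single component by default."""

    def __init__(self, component):
        self._component = component

    def draw(self):
        return self._component.draw()


class Border(MonoGlyph):
    def draw(self):
        # Delegate to the component first, then add the embellishment around it.
        return f"border({super().draw()})"


class Scroller(MonoGlyph):
    def draw(self):
        return f"scroll({super().draw()})"


fixed_border = Border(Scroller(Composition()))      # border stays put
scrolling_border = Scroller(Border(Composition()))  # border scrolls with content
print(fixed_border.draw())
print(scrolling_border.draw())
```

Two embellishment classes cover both arrangements; no combined subclass is needed.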
---
### Problem 4 → Abstract Factory
**Design problem:** Create platform-specific widget glyphs (buttons, scroll bars, menus) without embedding platform names in client code.
**Solution:** Define a `GUIFactory` abstract class with factory methods: `CreateScrollBar()`, `CreateButton()`, `CreateMenu()`, etc. Concrete subclasses `MotifFactory`, `PMFactory`, `MacFactory` implement each method to return the appropriate platform-specific widget glyph. Client code holds only a `GUIFactory*` reference. The concrete factory is instantiated once at startup based on the platform environment.
**Key participants:**
- `GUIFactory` — Abstract Factory
- `MotifFactory`, `PMFactory`, `MacFactory` — Concrete Factories
- `ScrollBar`, `Button`, `Menu` — Abstract Products
- `MotifScrollBar`, `PMButton`, `MacMenu`, etc. — Concrete Products
**Why this works:** Once `guiFactory` is set to `new MotifFactory`, all subsequent widget creation calls return Motif widgets without any code change. Switching to Mac requires only `guiFactory = new MacFactory` at startup. No Motif or Mac names appear anywhere in the client code.
**Constraint:** Abstract Factory works here because Lexi controls its own widget class hierarchies. It would not work if the widget hierarchies came from incompatible vendor libraries without common abstract base classes — that is why a different approach was needed for window system portability (Problem 5).
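A compact Python sketch of the factory arrangement, assuming only two products and two platforms. The `describe()` method and the `build_toolbar` client function are hypothetical additions for illustration; the factory and product names follow the participants above.

```python
from abc import ABC, abstractmethod


class Button(ABC):  # Abstract Product
    @abstractmethod
    def describe(self):
        ...


class ScrollBar(ABC):  # Abstract Product
    @abstractmethod
    def describe(self):
        ...


class MotifButton(Button):
    def describe(self):
        return "Motif button"


class MotifScrollBar(ScrollBar):
    def describe(self):
        return "Motif scroll bar"


class MacButton(Button):
    def describe(self):
        return "Mac button"


class MacScrollBar(ScrollBar):
    def describe(self):
        return "Mac scroll bar"


class GUIFactory(ABC):  # Abstract Factory
    @abstractmethod
    def create_button(self):
        ...

    @abstractmethod
    def create_scroll_bar(self):
        ...


class MotifFactory(GUIFactory):
    def create_button(self):
        return MotifButton()

    def create_scroll_bar(self):
        return MotifScrollBar()


class MacFactory(GUIFactory):
    def create_button(self):
        return MacButton()

    def create_scroll_bar(self):
        return MacScrollBar()


def build_toolbar(factory):
    """Hypothetical client: no platform name appears in this function."""
    return [factory.create_button().describe(),
            factory.create_scroll_bar().describe()]


gui_factory = MotifFactory()  # chosen once at startup
print(build_toolbar(gui_factory))
```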
---
### Problem 5 → Bridge
**Design problem:** Window drawing must work on multiple incompatible window systems without duplicating the Window class hierarchy per platform.
**Solution:** Separate the `Window` class hierarchy (application-level windowing: `ApplicationWindow`, `IconWindow`, `DialogWindow`) from the `WindowImp` class hierarchy (platform-level implementation: `XWindowImp`, `PMWindowImp`, `MacWindowImp`). `Window` holds a `WindowImp*` reference. All platform-specific drawing operations are delegated to `WindowImp`. Application code uses `Window`; it never calls `WindowImp` directly.
**Key participants:**
- `Window` — Abstraction
- `ApplicationWindow`, `IconWindow`, `DialogWindow` — Refined Abstractions
- `WindowImp` — Implementor interface
- `XWindowImp`, `PMWindowImp`, `MacWindowImp` — Concrete Implementors
**Why not Abstract Factory here:** Window system APIs are incompatible — they come from different vendors and cannot be made to share abstract product base classes. Abstract Factory requires common abstract products. Bridge does not — it only requires that the abstraction (`Window`) call through to `WindowImp`'s interface.
**Key insight:** `Window`'s interface caters to application programmers. `WindowImp`'s interface caters to window system engineers. They serve different masters and should evolve independently. Bridge enables both.
**Interaction with Abstract Factory:** `WindowSystemFactory` (an Abstract Factory) provides the means to create the right `WindowImp` subclass. `Window`'s constructor calls `windowSystemFactory->CreateWindowImp()` to initialize its `_imp` member. This is the canonical example of Abstract Factory configuring a Bridge.
---
### Problem 6 → Command
**Design problem:** Operations must be triggerable from multiple UI surfaces (menu items, buttons, keyboard shortcuts) with a uniform mechanism, and document-modifying operations must be undoable.
**Solution:** Define a `Command` abstract class with `Execute()` and `Unexecute()` operations, plus `Reversible()` to determine if undo is applicable. `MenuItem` stores a `Command*` reference. When clicked, `MenuItem` calls `command->Execute()`. A command history (list of Commands with a "present" pointer) supports multi-level undo: step backward calls `Unexecute()` on the command to the left of present; step forward calls `Execute()` on the command to the right.
**Key participants:**
- `Command` — abstract command interface
- `FontCommand`, `PasteCommand`, `SaveCommand`, `QuitCommand` — Concrete Commands
- `MacroCommand` — Composite Command (see interaction with Composite below)
- `MenuItem` — Invoker
- `CommandHistory` — stores the history list and present pointer
**Undo/redo mechanism:**
```
Past: [cmd1] [cmd2] [cmd3] [cmd4] ← present
↑
Unexecute() moves present left
Execute() moves present right
```
**Selective undo:** `Reversible()` returns false for `SaveCommand` and `QuitCommand` — these are skipped during undo traversal.
**Interaction with Composite:** `MacroCommand` is a `Command` subclass that contains a list of `Command*` objects. `MacroCommand::Execute()` calls `Execute()` on each child. `MacroCommand::Unexecute()` calls `Unexecute()` on each child in reverse. This means macros can be stored in history and undone as a unit, exactly like atomic commands.
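The history mechanism can be sketched in Python as below. Several choices here are assumptions made for the example rather than details from the text: non-reversible commands are simply not recorded (which has the same effect as skipping them during undo), a new command discards any redo tail, and `SetFont` is a hypothetical command operating on a dict that stands in for the document.

```python
class Command:
    def execute(self):
        raise NotImplementedError

    def unexecute(self):
        raise NotImplementedError

    def reversible(self):
        return True


class SetFont(Command):  # hypothetical concrete command
    def __init__(self, doc, font):
        self.doc = doc
        self.font = font
        self._previous = None

    def execute(self):
        self._previous = self.doc["font"]  # remember state for undo
        self.doc["font"] = self.font

    def unexecute(self):
        self.doc["font"] = self._previous


class CommandHistory:
    def __init__(self):
        self._commands = []
        self._present = 0  # number of commands currently applied

    def do(self, command):
        command.execute()
        if command.reversible():
            del self._commands[self._present:]  # assumption: drop the redo tail
            self._commands.append(command)
            self._present += 1

    def undo(self):
        if self._present == 0:
            return False
        self._present -= 1
        self._commands[self._present].unexecute()
        return True

    def redo(self):
        if self._present == len(self._commands):
            return False
        self._commands[self._present].execute()
        self._present += 1
        return True


doc = {"font": "Times"}
history = CommandHistory()
history.do(SetFont(doc, "Helvetica"))
history.do(SetFont(doc, "Courier"))
history.undo()
print(doc["font"])
```

Since the history only calls `execute()` and `unexecute()`, a `MacroCommand` can be stored in it without any change to this class.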
---
### Problem 7a → Iterator
**Design problem:** Traverse the Glyph structure to access elements without exposing whether the underlying storage is an array, linked list, or other structure.
**Solution:** Define an `Iterator` abstract class with `First()`, `Next()`, `IsDone()`, and `CurrentItem()` operations. Glyph subclasses override `CreateIterator()` to return the appropriate concrete iterator. `ArrayIterator` and `ListIterator` handle the data structure differences. `PreorderIterator` and `PostorderIterator` implement traversal strategies in terms of the structural iterators. Leaf glyphs return a `NullIterator` whose `IsDone()` always returns true.
**Key participants:**
- `Iterator` — abstract interface
- `ArrayIterator`, `ListIterator` — structure-specific iterators
- `PreorderIterator`, `PostorderIterator` — traversal-specific iterators
- `NullIterator` — for leaf glyphs
- `Glyph::CreateIterator()` — factory method returns the right iterator
**Why external iteration:** Putting traversal in Glyph would require adding `First()`, `Next()`, `IsDone()` to the Glyph interface, biasing it toward arrays. External iterators encapsulate traversal without modifying the traversed class hierarchy. Multiple concurrent traversals can run over the same structure — each iterator object holds its own traversal state.
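A minimal Python sketch of external iteration, assuming snake_case versions of the four operations and a list-backed `Row`. The `items` helper is a hypothetical convenience added for the example. Note how two iterators over the same row keep independent positions.

```python
class ArrayIterator:
    """Structure-specific iterator over a list-backed glyph."""

    def __init__(self, items):
        self._items = items
        self._index = 0

    def first(self):
        self._index = 0

    def next(self):
        self._index += 1

    def is_done(self):
        return self._index >= len(self._items)

    def current_item(self):
        if self.is_done():
            raise IndexError("iterator is past the last item")
        return self._items[self._index]


class NullIterator:
    """Iterator for leaf glyphs: always done, never yields an item."""

    def first(self):
        pass

    def next(self):
        pass

    def is_done(self):
        return True

    def current_item(self):
        raise IndexError("NullIterator has no items")


class Character:
    def __init__(self, char):
        self.char = char

    def create_iterator(self):
        return NullIterator()


class Row:
    def __init__(self, children):
        self._children = list(children)

    def create_iterator(self):
        return ArrayIterator(self._children)


def items(iterator):
    """Hypothetical helper: drain an iterator using only the four operations."""
    result = []
    iterator.first()
    while not iterator.is_done():
        result.append(iterator.current_item())
        iterator.next()
    return result


row = Row([Character("a"), Character("b"), Character("c")])
fast, slow = row.create_iterator(), row.create_iterator()
fast.next()
fast.next()
print(fast.current_item().char, slow.current_item().char)
```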
---
### Problem 7b → Visitor
**Design problem:** Support multiple extensible analysis operations (spell check, hyphenation, word count, search) over the Glyph structure without adding operations to every Glyph subclass.
**Solution:** Define a `Visitor` abstract class with a `Visit` operation for each Glyph subclass: `VisitCharacter(Character*)`, `VisitRow(Row*)`, `VisitImage(Image*)`, etc. Add a single `Accept(Visitor&)` operation to `Glyph`. Each Glyph subclass implements `Accept` by calling the appropriate `Visit` method on the visitor, passing `this`. New analyses are new `Visitor` subclasses: `SpellingCheckingVisitor`, `HyphenationVisitor`, `WordCountVisitor`.
**Key participants:**
- `Visitor` — abstract visitor interface
- `SpellingCheckingVisitor`, `HyphenationVisitor` — Concrete Visitors
- `Glyph::Accept(Visitor&)` — the single dispatch hook added to Glyph
- `Character`, `Row`, `Image` — Visited elements (call the appropriate Visit method)
**Double dispatch:** `glyph->Accept(visitor)` causes a virtual call on Glyph that resolves to the correct subclass (e.g., `Character::Accept`). `Character::Accept` then calls `visitor.VisitCharacter(this)` — a second dispatch on the Visitor type. This double dispatch allows the analysis to specialize by both the element type and the visitor type without type casts.
**Interaction with Iterator:** A `PreorderIterator` traverses the Glyph structure. At each glyph, the analysis calls `glyph->Accept(spellingChecker)`. The Iterator provides *how to move*; the Visitor provides *what to do at each node*. Either can be changed independently.
**Visitor applicability constraint:** Visitor is most valuable when the class structure (Glyph hierarchy) is stable and the operations on it are expected to grow. If new Glyph subclasses are added frequently, every Visitor must be updated. For Lexi, new Glyph types are rare; new analyses (spell check, grammar check, word count) are common — Visitor is the right fit.
---
## Pattern Interaction Map (Summary from GoF Chapter 2)
The eight patterns applied to Lexi and how they relate:
```
┌──────────────────────────────────────────────────────┐
│ LEXI ARCHITECTURE │
└──────────────────────────────────────────────────────┘
DOCUMENT REPRESENTATION PLATFORM ADAPTATION
┌───────────────────┐ ┌────────────────────────────┐
│ Composite │ │ Abstract Factory │
│ (Glyph tree: │ │ (GUIFactory creates │
│ Character, Row, │ │ platform widgets) │
│ Column, Page) │ └─────────────┬──────────────┘
└────────┬──────────┘ │ creates WindowImp
│ structure for ▼
│ traversal ┌────────────────────────────────────┐
▼ │ Bridge │
┌─────────────────────┐ │ (Window ↔ WindowImp) │
│ Strategy │ │ Window: app abstraction │
│ (Compositor algo │ │ WindowImp: OS implementation │
│ runs on Glyph │ └────────────────────────────────────┘
│ tree to produce │
│ Row/Column layout)│
└─────────────────────┘ BEHAVIORAL
┌────────────────────────────────────┐
EMBELLISHMENT │ Command │
┌─────────────────────┐ │ (MenuItem → Command → Execute) │
│ Decorator │ │ History enables undo/redo │
│ (MonoGlyph: │ └─────────────────┬──────────────────┘
│ Border, Scroller │ │ uses
│ wrap Composition) │ ▼
└─────────────────────┘ ┌────────────────────────────────────┐
│ Composite (MacroCommand) │
│ Command list = undoable macro │
└────────────────────────────────────┘
ANALYSIS
┌─────────────────────┐ ┌────────────────────────────────────┐
│ Iterator │ ────────► │ Visitor │
│ (PreorderIterator │ enables │ (SpellingCheckingVisitor, │
│ traverses Glyph │ │ HyphenationVisitor carry │
│ tree) │ │ analysis over the structure) │
└─────────────────────┘ └────────────────────────────────────┘
```
### The Three Interaction Points Explained
**Interaction 1: Abstract Factory configures Bridge**
`WindowSystemFactory` is a concrete Abstract Factory. `Window`'s constructor calls:
```cpp
Window::Window() {
    _imp = windowSystemFactory->CreateWindowImp();
}
```
The Bridge (`Window` ↔ `WindowImp`) does not know which window system is in use. The Abstract Factory decides this at startup. Swapping the factory object changes the entire window system — without modifying a single Window class.
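A self-contained sketch of this interaction, with some assumptions: the factory is passed to the `Window` constructor instead of being read from a global, `DeviceRect()` returns a string so the effect is observable, and `XWindowSystemFactory` / `PMWindowSystemFactory` are illustrative concrete factories.

```cpp
#include <memory>
#include <string>

// Implementor side of the Bridge.
class WindowImp {
public:
    virtual ~WindowImp() = default;
    virtual std::string DeviceRect() const = 0;
};

class XWindowImp : public WindowImp {
public:
    std::string DeviceRect() const override { return "X11 rect"; }
};

class PMWindowImp : public WindowImp {
public:
    std::string DeviceRect() const override { return "PM rect"; }
};

// Abstract Factory that decides which implementor gets created.
class WindowSystemFactory {
public:
    virtual ~WindowSystemFactory() = default;
    virtual std::unique_ptr<WindowImp> CreateWindowImp() const = 0;
};

class XWindowSystemFactory : public WindowSystemFactory {
public:
    std::unique_ptr<WindowImp> CreateWindowImp() const override {
        return std::make_unique<XWindowImp>();
    }
};

class PMWindowSystemFactory : public WindowSystemFactory {
public:
    std::unique_ptr<WindowImp> CreateWindowImp() const override {
        return std::make_unique<PMWindowImp>();
    }
};

// Abstraction side of the Bridge: never names a concrete WindowImp.
class Window {
public:
    explicit Window(const WindowSystemFactory& factory)
        : _imp(factory.CreateWindowImp()) {}
    std::string DrawRect() const { return _imp->DeviceRect(); }
private:
    std::unique_ptr<WindowImp> _imp;
};
```

Constructing `Window` with a different factory changes the platform behaviour while the `Window` class itself stays untouched.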
**Interaction 2: Command uses Composite**
`MacroCommand` inherits from `Command` and holds a `List<Command*>`:
```cpp
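#include <vector>

class MacroCommand : public Command {
public:
    void Execute() override {
        // Run the children in the order they were added.
        for (Command* cmd : _commands) cmd->Execute();
    }
    void Unexecute() override {
        // Undo the children in reverse order.
        for (auto it = _commands.rbegin(); it != _commands.rend(); ++it)
            (*it)->Unexecute();
    }
private:
    std::vector<Command*> _commands;  // std::vector stands in for List<Command*>
};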
```
The command history stores `MacroCommand` instances alongside atomic commands. Undo/redo works uniformly on both. No special-casing is needed for macros vs. individual commands.
**Interaction 3: Iterator enables Visitor**
The traversal and the analysis are strictly separated:
```cpp
// Traversal via Iterator
PreorderIterator i(composition);
// Analysis via Visitor
SpellingCheckingVisitor spellingChecker;
for (i.First(); !i.IsDone(); i.Next()) {
    Glyph* g = i.CurrentItem();
    g->Accept(spellingChecker);  // Visitor double dispatch
}
```
Adding hyphenation analysis = new `HyphenationVisitor` subclass. Changing traversal order = different `Iterator` subclass. Neither change touches the Glyph hierarchy.
---
## Complete Extension Guide
| To add... | Implement... | Extend... | Touch... |
|------------------------------------|-------------------------------------|--------------------|-----------------|
| New formatting algorithm | `TeXCompositor : Compositor` | Nothing | Nothing |
| New UI embellishment | `ShadowDecorator : MonoGlyph` | Nothing | Nothing |
| New look-and-feel standard | `MacFactory : GUIFactory` + products| Nothing | Nothing |
| New window system platform | `WaylandWindowImp : WindowImp` | `WindowSystemFactory` | Nothing |
| New undoable operation | `IndentCommand : Command` | Nothing | Nothing |
| New macro (compound operation) | `MacroCommand` + fill with Commands | Nothing | Nothing |
| New document analysis              | `WordCountVisitor : Visitor`        | Nothing            | Nothing (`Glyph::Accept` already exists) |
| New traversal order | `PostorderIterator : Iterator` | Nothing | Nothing |
| New glyph type (rare) | `Table : Glyph` | All Visitor classes (add VisitTable) | Nothing |
The last row illustrates the Visitor trade-off: adding a new Glyph type requires updating every Visitor. This is acceptable when new analyses (new Visitors) are added far more frequently than new Glyph types. For Lexi, this is the correct assumption.
---
## GoF Chapter 2 Summary (verbatim list)
From page 80 of the source text — the eight patterns applied to Lexi:
1. **Composite (163)** to represent the document's physical structure
2. **Strategy (315)** to allow different formatting algorithms
3. **Decorator (175)** for embellishing the user interface
4. **Abstract Factory (87)** for supporting multiple look-and-feel standards
5. **Bridge (151)** to allow multiple windowing platforms
6. **Command (233)** for undoable user operations
7. **Iterator (257)** for accessing and traversing object structures
8. **Visitor (331)** for allowing an open-ended number of analytical capabilities without complicating the document structure's implementation
The GoF note: "None of these design issues is limited to document editing applications like Lexi. Indeed, most nontrivial applications will have occasion to use many of these patterns, though perhaps to do different things."
Select the right GoF design pattern for a specific object-oriented design problem. Use when facing any of these situations: object creation inflexibility (to...
---
name: design-pattern-selector
description: |
Select the right GoF design pattern for a specific object-oriented design problem. Use when facing any of these situations: object creation inflexibility (too many concrete class references), tight coupling between classes, subclass explosion from trying to extend behavior, inability to modify a class you don't own, need to vary an algorithm or behavior at runtime, want to add operations to a stable class hierarchy without modifying it, need to decouple a sender from its receivers, need undo/redo or event notification, or any time someone asks "which design pattern should I use?", "is there a pattern for this?", or "how do I design this more flexibly?" Analyzes the problem through 6 complementary selection approaches — intent scanning, purpose/scope classification, variability analysis, redesign cause diagnosis, pattern relationship navigation, and category comparison — to produce a scored pattern recommendation with trade-off analysis.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/design-patterns-gof/skills/design-pattern-selector
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on: []
source-books:
- id: design-patterns-gof
title: "Design Patterns: Elements of Reusable Object-Oriented Software"
authors: ["Erich Gamma", "Richard Helm", "Ralph Johnson", "John Vlissides"]
chapters: [1]
tags: [design-patterns, object-oriented, gof, creational, structural, behavioral, refactoring, software-design]
execution:
tier: 1
mode: full
inputs:
- type: none
description: "Design problem description from user — existing code, pain points, or a design scenario"
tools-required: [Read, Write, TodoWrite]
tools-optional: [Grep, Glob]
mcps-required: []
environment: "Any agent environment. If a codebase exists, scanning it for context is helpful but not required."
---
# Design Pattern Selector
## When to Use
You've encountered a concrete object-oriented design problem and need a proven structural solution. This skill is appropriate when:
- You're designing a new component and want to get the structure right before writing code
- Existing code has a design smell — tightly coupled classes, a proliferating subclass hierarchy, or brittle object creation — and you want a pattern to address it
- You know roughly what you want (e.g., "something like a factory") but need to pick the right variant or confirm it fits
- You're reviewing someone else's code and want to identify what pattern(s) would improve it
- You want to understand trade-offs between two candidate patterns before committing
Before starting, verify:
- Do you have a description of the design problem, even informally? (Ask the user if not)
- Are you working with object-oriented design? (These patterns are OO-specific — they assume classes, objects, and inheritance)
- Is this about pattern *selection* (this skill) or pattern *implementation*? If the user already knows which pattern they want, move directly to the application process in Step 6
## Context & Input Gathering
### Required Context (must have before proceeding)
- **Problem description:** What design challenge are you solving? Even a rough description works — "I have a class that does too many things" or "I need multiple algorithms that clients can swap at runtime"
- **Design intent:** What should the result allow? (e.g., "swap implementations without changing clients", "add behaviors without modifying existing classes")
### Observable Context (gather from environment if available)
- **Existing code:** If a codebase is present, scan for the problematic class or module
- Look for: large switch/if-else dispatch blocks (Cause 2), `new ConcreteClass()` proliferation (Cause 1), deep inheritance hierarchies (Cause 7), classes doing too many unrelated things (Cause 6)
- Use `Grep` for `new `, `extends`, `implements` patterns to map existing structure
- **Prior patterns:** Check if related patterns are already in use — the new pattern should compose well with them
- **Language/framework:** Some patterns are idiomatically provided by the language (e.g., Java's `Iterator`, Python's context managers as a form of Template Method); note where native constructs already cover the need
### Sufficiency Check
You have enough to proceed if you can articulate: (a) what problem exists, and (b) what the solution should enable. If only (a) is known, the redesign-cause diagnosis (Step 3) will derive (b).
---
## Process
### Step 1: Frame the Problem
**ACTION:** Use `TodoWrite` to track the steps, then articulate the design problem in structured form.
**WHY:** Vague problems produce vague pattern matches. Forcing a structured framing surfaces the real constraints and eliminates patterns that technically apply but don't fit the actual need.
```
TodoWrite:
- [ ] Step 1: Frame the problem
- [ ] Step 2: Classify by purpose and scope
- [ ] Step 3: Diagnose redesign causes
- [ ] Step 4: Check variability requirements
- [ ] Step 5: Compare candidate patterns
- [ ] Step 6: Produce recommendation with trade-offs
```
Capture these four elements from the user's description:
| Element | Question to answer |
|---------|-------------------|
| **Context** | What type of system/component is this? |
| **Problem** | What is currently wrong or inflexible? |
| **Forces** | What constraints must the solution respect? (performance, simplicity, existing code) |
| **Goal** | What must the solution make possible? |
If any element is unclear, ask one targeted question before proceeding.
---
### Step 2: Classify by Purpose and Scope
**ACTION:** Narrow the candidate set using the purpose × scope classification matrix from `references/pattern-catalog.md`.
**WHY:** The 23 GoF patterns are classified by three purposes and two scopes. Getting this classification right eliminates most of them immediately: fixing the purpose alone rules out 12 to 18 patterns, and fixing the scope narrows the set further. It also forces precision about *what kind* of problem you're solving. A creation problem is fundamentally different from a communication problem.
**Decision rules:**
**Purpose:**
- **Creational** — the problem is about *how objects are created*. Objects are being created from concrete classes; you need flexibility in what gets created, or how.
- **Structural** — the problem is about *how classes or objects are composed*. You need a different interface, a flexible structure, or want to add responsibilities.
- **Behavioral** — the problem is about *how objects communicate and distribute work*. You need to decouple who initiates operations from who carries them out, or vary algorithms/state.
**Scope (within the chosen purpose):**
- **Class scope** — the relationship is fixed at compile-time via inheritance. Behavioral class patterns use inheritance to describe algorithms. Use when the variation can be determined at compile-time.
- **Object scope** — the relationship is dynamic at run-time via composition/delegation. Most patterns are object-scoped. Use when you need run-time flexibility.
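The scope distinction can be illustrated with a small sketch. The names `Report`, `SalesReport`, and `Formatter` are hypothetical; the first pair fixes its variation through inheritance (Template Method style), while `Formatter` receives its behavior as an object that can be replaced at run-time (Strategy style, here using `std::function` as the strategy type).

```cpp
#include <functional>
#include <string>
#include <utility>

// Class scope: the variation is bound at compile-time through inheritance.
class Report {
public:
    virtual ~Report() = default;
    // Fixed skeleton of the algorithm.
    std::string Render() const { return "[" + Body() + "]"; }
protected:
    virtual std::string Body() const = 0;  // step deferred to subclasses
};

class SalesReport : public Report {
protected:
    std::string Body() const override { return "sales"; }
};

// Object scope: the variation is a collaborator that can change at run-time.
class Formatter {
public:
    using Style = std::function<std::string(const std::string&)>;
    explicit Formatter(Style style) : _style(std::move(style)) {}
    void SetStyle(Style style) { _style = std::move(style); }
    std::string Format(const std::string& text) const { return _style(text); }
private:
    Style _style;
};
```

A `SalesReport` always renders the same way once compiled, whereas a `Formatter` instance can switch styles through `SetStyle` while the program runs.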
**Output:** Identify 1–2 most likely purposes. If the problem spans purposes (e.g., creation inflexibility AND tight coupling), note both — you may need two patterns, or one pattern that addresses both.
---
### Step 3: Diagnose Redesign Causes
**ACTION:** Match the problem description against the 8 causes of redesign in `references/redesign-causes.md`.
**WHY:** The GoF identified that most design problems stem from dependencies that make change expensive. Each cause maps directly to the patterns that eliminate that specific dependency. This is the fastest path from "something is wrong" to "here are the patterns that fix it."
**Diagnostic questions to ask:**
1. Does the code create objects by naming concrete classes? → Cause 1 (object creation)
2. Is there hard-coded dispatch logic for handling requests? → Cause 2 (specific operations)
3. Is there platform-specific code mixed into business logic? → Cause 3 (platform dependence)
4. Does the client know too much about another object's internal structure? → Cause 4 (representation dependence)
5. Is an algorithm embedded in a class that should not own it? → Cause 5 (algorithmic dependence)
6. Can you not unit-test a class without instantiating many others? → Cause 6 (tight coupling)
7. Is the inheritance hierarchy growing combinatorially? → Cause 7 (subclassing for extension)
8. Does the class need changing but cannot be modified (no source, or too many dependents)? → Cause 8 (unmodifiable classes)
**Output:** Identify the 1–3 most relevant causes. Cross-reference with the patterns mapped to each cause — this produces a candidate shortlist.
---
### Step 4: Check Variability Requirements
**ACTION:** Consult `references/variability-table.md` to identify which patterns encapsulate the aspects of the design that must remain variable.
**WHY:** This is the "opposite" approach to cause diagnosis — instead of asking what is breaking, ask what must stay flexible. The variability table maps directly from "aspect I want to vary" to pattern. It catches cases where the user knows what they need to keep variable but cannot articulate the pain yet.
**Key question:** "What concept in this design must be encapsulated so it can change independently?"
Common variability needs and their patterns:
| I need to vary... | Candidates |
|-------------------|------------|
| Which algorithm runs | Strategy, Template Method |
| Which object handles a request | Chain of Responsibility |
| How objects notify each other | Observer |
| The interface exposed to clients | Adapter, Facade |
| An object's behavior by state | State |
| How and when a request executes | Command |
| The structure of a part-whole hierarchy | Composite |
| Responsibilities without subclassing | Decorator |
| How an object is located/accessed | Proxy |
| Which concrete class gets instantiated | Abstract Factory, Factory Method, Prototype |
**Output:** Add any new candidates surfaced by variability analysis to the shortlist from Step 3.
---
### Step 5: Compare Candidate Patterns
**ACTION:** For each candidate pattern on the shortlist (aim for 2–4), evaluate it against the problem using these lenses:
**WHY:** Multiple patterns often address the same problem. The right choice depends on trade-offs specific to this context — complexity budget, run-time flexibility needs, relationship to existing code, and what the solution must remain open to in the future.
**Evaluation lens for each candidate:**
| Lens | Question |
|------|----------|
| **Fit** | Does the pattern's intent directly address the core problem? |
| **Scope match** | Does it operate at class or object scope — which is right here? |
| **Complexity cost** | How many new classes/interfaces does it introduce? Is that justified? |
| **Run-time flexibility** | Does it need to vary at run-time or compile-time? |
| **Composition** | Does it work with patterns already in use, or conflict? |
| **Over-engineering risk** | Does the flexibility it provides actually apply to this problem's future? |
**Pattern relationship check:** Consult the "Pattern Relationships" section in `references/pattern-catalog.md` to identify whether candidates are alternatives (choose one), complements (use both), or sequenced (use first one, then the other as the system evolves).
**Output:** A ranked shortlist with a 1–2 sentence rationale and a key trade-off noted for each.
---
### Step 6: Produce Recommendation
**ACTION:** Deliver a structured recommendation in the following format, then offer to walk through the 7-step application process if the user wants to implement it.
**WHY:** A recommendation without trade-offs is advice without accountability. The user needs to understand not just what to do but why this pattern beats the alternatives, and what they give up. This enables informed commitment rather than cargo-culting.
**Recommendation format:**
```
## Pattern Recommendation: [Pattern Name]
**Category:** [Creational / Structural / Behavioral] — [Class / Object] scope
**Why this pattern:** [1–2 sentences connecting the pattern's intent to the specific problem]
**What it enables:** [What becomes possible that wasn't before]
**Key trade-off:** [What you give up or what additional complexity you accept]
**Alternative considered:** [Runner-up pattern] — [Why it was not chosen]
**Patterns to combine with:** [Any complementary patterns, if applicable]
```
**If the user wants to implement:** Walk through the 7-step application process:
1. **Read the pattern** — focus on Applicability and Consequences sections to confirm fit
2. **Study the structure** — understand all participants (classes/objects) and how they collaborate
3. **Review sample code** — see a concrete implementation to understand how it works
4. **Name participants contextually** — rename pattern roles to application-specific names (e.g., `LayoutStrategy` not just `Strategy`)
5. **Define the classes** — declare interfaces, establish inheritance, identify existing classes the pattern will affect
6. **Name operations** — use application-specific names for pattern operations; follow consistent naming conventions (e.g., `Create-` prefix for factory methods)
7. **Implement** — write the operations, following hints in the pattern's Implementation section
---
## Examples
### Example 1: Subclass Explosion for UI Rendering
**Scenario:** A UI library has a `Button` class. Over time, the team added `RoundButton`, `FlatButton`, `DarkRoundButton`, `LightFlatButton` — 12 subclasses for 3 shapes × 4 themes, and it keeps growing.
**Trigger:** "Our button hierarchy is exploding. Every new theme or shape multiplies the subclass count."
**Process:**
- Step 2: Structural (composition problem), Object scope (need run-time flexibility)
- Step 3: Cause 7 — extending functionality by subclassing
- Step 4: Need to vary "responsibilities of an object without subclassing" → Decorator; also "implementation of an object" → Bridge
- Step 5: Decorator adds responsibilities dynamically but doesn't separate two orthogonal dimensions. Bridge separates abstraction (shape) from implementation (theme) — exactly the two dimensions causing explosion.
**Output:**
```
## Pattern Recommendation: Bridge
**Category:** Structural — Object scope
**Why this pattern:** Bridge separates the Button abstraction (shape: Round, Flat)
from its implementation (theme: Dark, Light) into two independent hierarchies.
Adding a new theme requires one new class, not N new subclasses.
**What it enables:** Shape and theme can vary independently. 3 shapes + 4 themes = 7 concrete classes, instead of 3 × 4 = 12 subclasses.
**Key trade-off:** Increases design complexity upfront; adds one level of indirection
(button delegates to its renderer). Justified when both dimensions are genuinely
expected to grow.
**Alternative considered:** Decorator — works well for adding single responsibilities
but doesn't solve two orthogonal variation axes as cleanly as Bridge.
```
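A minimal sketch of what the recommended Bridge could look like for this scenario. All class names and the string-returning `Render()` are assumptions for illustration; a real UI library would draw instead of building a string.

```cpp
#include <memory>
#include <string>
#include <utility>

// Implementation hierarchy: one class per theme.
class Theme {
public:
    virtual ~Theme() = default;
    virtual std::string Name() const = 0;
};

class DarkTheme : public Theme {
public:
    std::string Name() const override { return "dark"; }
};

class LightTheme : public Theme {
public:
    std::string Name() const override { return "light"; }
};

// Abstraction hierarchy: one class per shape, holding a Theme by composition.
class Button {
public:
    explicit Button(std::shared_ptr<const Theme> theme)
        : _theme(std::move(theme)) {}
    virtual ~Button() = default;
    std::string Render() const {
        return _theme->Name() + " " + Shape() + " button";
    }
protected:
    virtual std::string Shape() const = 0;
private:
    std::shared_ptr<const Theme> _theme;
};

class RoundButton : public Button {
public:
    using Button::Button;  // reuse the base constructor
protected:
    std::string Shape() const override { return "round"; }
};

class FlatButton : public Button {
public:
    using Button::Button;
protected:
    std::string Shape() const override { return "flat"; }
};
```

Any shape can be paired with any theme at construction time, so a new theme is one new `Theme` subclass and no new `Button` subclasses.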
---
### Example 2: Hardcoded Payment Processing
**Scenario:** An e-commerce checkout class contains a large `if/else` block selecting between Stripe, PayPal, and bank transfer logic. Adding a new payment method means modifying the checkout class and its tests.
**Trigger:** "Every new payment provider requires touching the core checkout code."
**Process:**
- Step 2: Behavioral (communication/delegation problem), Object scope
- Step 3: Cause 2 — dependence on specific operations (hard-coded request handling)
- Step 4: Need to vary "an algorithm" → Strategy
- Step 5: Strategy replaces the `if/else` dispatch with a family of interchangeable payment-processing objects. Command could encapsulate payment as an undoable request — but undo is not required here and adds unnecessary complexity.
**Output:**
```
## Pattern Recommendation: Strategy
**Category:** Behavioral — Object scope
**Why this pattern:** Strategy replaces the dispatch block with a PaymentStrategy
interface. Each provider (Stripe, PayPal, bank transfer) is a ConcreteStrategy.
Checkout receives a strategy object at runtime and delegates payment to it.
**What it enables:** New payment providers are new classes, not modifications to Checkout.
Checkout can be tested with a MockPaymentStrategy.
**Key trade-off:** Clients must know which strategy to instantiate. If selection logic
is complex, pair with Abstract Factory to manage strategy creation.
**Alternative considered:** Command — adds request queuing and undo capability not
needed here; over-engineers this use case.
**Patterns to combine with:** Abstract Factory (for strategy instantiation by region/config)
```
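The recommendation above might translate into code along these lines. This is a sketch only: the `Pay(int cents)` signature and the returned strings are invented for illustration, and no real Stripe or PayPal API is called.

```cpp
#include <memory>
#include <string>
#include <utility>

// Strategy interface: one operation that every provider implements.
class PaymentStrategy {
public:
    virtual ~PaymentStrategy() = default;
    virtual std::string Pay(int cents) = 0;
};

// Placeholder provider; a real one would call the provider's SDK here.
class StripePayment : public PaymentStrategy {
public:
    std::string Pay(int cents) override {
        return "stripe:" + std::to_string(cents);
    }
};

class PayPalPayment : public PaymentStrategy {
public:
    std::string Pay(int cents) override {
        return "paypal:" + std::to_string(cents);
    }
};

// Context: delegates to whichever strategy it was given, with no if/else.
class Checkout {
public:
    explicit Checkout(std::unique_ptr<PaymentStrategy> strategy)
        : _strategy(std::move(strategy)) {}
    std::string Complete(int cents) { return _strategy->Pay(cents); }
private:
    std::unique_ptr<PaymentStrategy> _strategy;
};
```

Adding bank transfer would be one more `PaymentStrategy` subclass; `Checkout` and its tests stay unchanged.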
---
### Example 3: Cannot Modify a Third-Party Logger
**Scenario:** A team is integrating a third-party analytics library whose `Logger` class has an interface incompatible with the `IEventTracker` interface the rest of the application uses. Source code is unavailable.
**Trigger:** "We need to use this library's logger but it doesn't match our interface and we can't change it."
**Process:**
- Step 2: Structural (interface composition), Object scope
- Step 3: Cause 8 — inability to alter classes conveniently
- Step 4: Need to vary "interface to an object" → Adapter
- Step 5: Adapter is the canonical solution. Decorator adds behavior but requires the same interface. Facade simplifies a subsystem but doesn't translate interfaces.
**Output:**
```
## Pattern Recommendation: Adapter
**Category:** Structural — Object scope
**Why this pattern:** An Adapter class wraps the third-party Logger and implements
IEventTracker, translating calls between the two interfaces. The rest of the
application only ever sees IEventTracker.
**What it enables:** Library can be swapped for a different provider by swapping the
Adapter, with zero changes to application code.
**Key trade-off:** Each new method in IEventTracker requires a corresponding
translation in the Adapter. If the interfaces diverge significantly, the Adapter
becomes a maintenance burden — consider whether the interfaces can be aligned upstream.
**Alternative considered:** Facade — would simplify the logger API but cannot make it
implement IEventTracker.
```
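An object-adapter sketch for this scenario. The `Logger` class below is a stand-in written for the example (the real third-party interface is unknown), and the `Track(event, detail)` signature of `IEventTracker` and the `LoggerEventTracker` name are likewise assumptions.

```cpp
#include <string>
#include <vector>

// Stand-in for the third-party class; its real interface is not known.
class Logger {
public:
    void WriteLine(const std::string& line) { _lines.push_back(line); }
    const std::vector<std::string>& Lines() const { return _lines; }
private:
    std::vector<std::string> _lines;
};

// Interface the rest of the application expects (signature assumed).
class IEventTracker {
public:
    virtual ~IEventTracker() = default;
    virtual void Track(const std::string& event,
                       const std::string& detail) = 0;
};

// Adapter: implements IEventTracker by translating calls to Logger.
class LoggerEventTracker : public IEventTracker {
public:
    explicit LoggerEventTracker(Logger& logger) : _logger(logger) {}
    void Track(const std::string& event,
               const std::string& detail) override {
        _logger.WriteLine(event + ": " + detail);
    }
private:
    Logger& _logger;  // wrapped, not modified
};
```

Application code depends only on `IEventTracker`, so replacing the library means writing a different adapter and nothing else.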
---
## Key Principles
- **Start from the problem, not the pattern.** The most common mistake is picking a familiar pattern and forcing it onto a problem. The 6-step process starts with the problem and narrows to the pattern. WHY: Pattern-first thinking produces over-engineered designs with unnecessary indirection.
- **Patterns introduce indirection — only add it when needed.** Every pattern adds a layer of abstraction. Apply only when the flexibility it affords is genuinely required. WHY: The GoF authors warn explicitly against indiscriminate pattern application (p41).
- **Use the redesign causes as a diagnostic, not just a lookup.** The 8 causes of redesign are symptoms, not just labels. When reviewing code, check whether the system exhibits any of these causes — that reveals which patterns to consider. WHY: Symptom-driven selection finds patterns you wouldn't think to look for.
- **Table 1.2 is the reverse lookup.** When you know what must stay flexible, Table 1.2 tells you which pattern encapsulates that aspect. When you know what's breaking, the redesign causes table tells you which pattern prevents that fragility. Use both. WHY: Two complementary angles catch more candidates than either alone.
- **Patterns evolve.** Designs often start with Factory Method and evolve toward Abstract Factory or Prototype as flexibility needs emerge. Don't over-design from the start — pick the simplest pattern that solves the current problem and know the evolution path.
## Reference Files
| File | Contents |
|------|----------|
| `references/pattern-catalog.md` | Table 1.1: all 23 patterns by purpose × scope, one-line intents, pattern relationships |
| `references/variability-table.md` | Table 1.2: what each pattern lets you vary independently |
| `references/redesign-causes.md` | 8 causes of redesign with symptoms and mapped patterns |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/pattern-catalog.md
# GoF Pattern Catalog — Purpose × Scope Matrix (Table 1.1)
Source: Design Patterns: Elements of Reusable Object-Oriented Software (GoF, 1994), Table 1.1 and catalog intents pp. 20–21.
## Classification Axes
**Purpose** — What the pattern does:
- **Creational** — concerns the process of object creation
- **Structural** — deals with the composition of classes or objects
- **Behavioral** — characterizes the ways classes or objects interact and distribute responsibility
**Scope** — What the pattern applies to:
- **Class** — relationships between classes and subclasses (fixed at compile-time, via inheritance)
- **Object** — relationships between objects (dynamic at run-time, via composition/delegation)
---
## Table 1.1: Design Pattern Space
| Scope | Creational | Structural | Behavioral |
|-------|-----------|------------|------------|
| **Class** | Factory Method | Adapter (class) | Interpreter, Template Method |
| **Object** | Abstract Factory, Builder, Prototype, Singleton | Adapter (object), Bridge, Composite, Decorator, Facade, Flyweight, Proxy | Chain of Responsibility, Command, Iterator, Mediator, Memento, Observer, State, Strategy, Visitor |
---
## Full Catalog: All 23 Patterns with One-Line Intent
### Creational Patterns
| Pattern | Scope | Intent |
|---------|-------|--------|
| **Abstract Factory** | Object | Provide an interface for creating families of related or dependent objects without specifying their concrete classes. |
| **Builder** | Object | Separate the construction of a complex object from its representation so that the same construction process can create different representations. |
| **Factory Method** | Class | Define an interface for creating an object, but let subclasses decide which class to instantiate. Factory Method lets a class defer instantiation to subclasses. |
| **Prototype** | Object | Specify the kinds of objects to create using a prototypical instance, and create new objects by copying this prototype. |
| **Singleton** | Object | Ensure a class only has one instance, and provide a global point of access to it. |
### Structural Patterns
| Pattern | Scope | Intent |
|---------|-------|--------|
| **Adapter** | Class / Object | Convert the interface of a class into another interface clients expect. Adapter lets classes work together that couldn't otherwise because of incompatible interfaces. |
| **Bridge** | Object | Decouple an abstraction from its implementation so that the two can vary independently. |
| **Composite** | Object | Compose objects into tree structures to represent part-whole hierarchies. Composite lets clients treat individual objects and compositions of objects uniformly. |
| **Decorator** | Object | Attach additional responsibilities to an object dynamically. Decorators provide a flexible alternative to subclassing for extending functionality. |
| **Facade** | Object | Provide a unified interface to a set of interfaces in a subsystem. Facade defines a higher-level interface that makes the subsystem easier to use. |
| **Flyweight** | Object | Use sharing to support large numbers of fine-grained objects efficiently. |
| **Proxy** | Object | Provide a surrogate or placeholder for another object to control access to it. |
### Behavioral Patterns
| Pattern | Scope | Intent |
|---------|-------|--------|
| **Chain of Responsibility** | Object | Avoid coupling the sender of a request to its receiver by giving more than one object a chance to handle the request. Chain the receiving objects and pass the request along the chain until an object handles it. |
| **Command** | Object | Encapsulate a request as an object, thereby letting you parameterize clients with different requests, queue or log requests, and support undoable operations. |
| **Interpreter** | Class | Given a language, define a representation for its grammar along with an interpreter that uses the representation to interpret sentences in the language. |
| **Iterator** | Object | Provide a way to access the elements of an aggregate object sequentially without exposing its underlying representation. |
| **Mediator** | Object | Define an object that encapsulates how a set of objects interact. Mediator promotes loose coupling by keeping objects from referring to each other explicitly, and it lets you vary their interaction independently. |
| **Memento** | Object | Without violating encapsulation, capture and externalize an object's internal state so that the object can be restored to this state later. |
| **Observer** | Object | Define a one-to-many dependency between objects so that when one object changes state, all its dependents are notified and updated automatically. |
| **State** | Object | Allow an object to alter its behavior when its internal state changes. The object will appear to change its class. |
| **Strategy** | Object | Define a family of algorithms, encapsulate each one, and make them interchangeable. Strategy lets the algorithm vary independently from clients that use it. |
| **Template Method** | Class | Define the skeleton of an algorithm in an operation, deferring some steps to subclasses. Template Method lets subclasses redefine certain steps of an algorithm without changing the algorithm's structure. |
| **Visitor** | Object | Represent an operation to be performed on the elements of an object structure. Visitor lets you define a new operation without changing the classes of the elements on which it operates. |
---
## Quick-Reference: Pattern Relationships (Figure 1.1 summary)
Key relationships between patterns that often appear together:
- **Composite** is often used with **Iterator** or **Visitor**
- **Prototype** is an alternative to **Abstract Factory**
- **Abstract Factory** is often implemented using **Factory Method**
- **Abstract Factory**, **Builder**, and **Prototype** can use **Singleton**
- **Decorator** and **Proxy** have similar structures but different intents
- **Chain of Responsibility**, **Command**, **Mediator**, and **Observer** all address how senders and receivers are coupled
- **Strategy** and **State** both delegate behavior to helper objects; State's helper knows about the context, Strategy's typically does not
- **Template Method** uses inheritance; **Strategy** uses delegation — both vary algorithm parts
- **Facade** often uses **Singleton** for the facade object
- **Flyweight** often uses **Composite** for hierarchical structure
FILE:references/redesign-causes.md
# 8 Causes of Redesign — Mapped to Design Patterns
Source: Design Patterns: Elements of Reusable Object-Oriented Software (GoF, 1994), pp. 34–36, "Designing for Change."
## Purpose
Use this table when you can articulate **what pain you are experiencing** or **what risk you are trying to avoid**. Each cause describes a design dependency that makes software fragile — brittle to change, hard to reuse, or expensive to maintain. The mapped patterns eliminate or reduce that specific dependency.
Key principle from the GoF: *"Each design pattern lets some aspect of system structure vary independently of other aspects, thereby making a system more robust to a particular kind of change."*
---
## The 8 Causes
### 1. Creating an object by specifying a class explicitly
**The problem:** Specifying a concrete class name when creating an object commits you to a particular implementation rather than an interface. Future changes to that concrete class ripple through all call sites.
**Symptom:** `new ConcreteProductX()` scattered throughout client code. Swapping implementations requires changing many files.
**Design patterns:**
- **Abstract Factory** — creates families of related objects without naming concrete classes
- **Factory Method** — lets subclasses decide which class to instantiate
- **Prototype** — creates new objects by cloning a prototypical instance; the class is never named explicitly
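As a minimal sketch of how Factory Method removes the hard-coded class name, the following uses hypothetical `Exporter`, `CsvExporter`, `JsonExporter`, and `Report` classes. None of these names come from the GoF text; they are assumptions for illustration only.

```python
from abc import ABC, abstractmethod
import json


class Exporter(ABC):
    """Product interface: clients depend on this, not on a concrete class."""

    @abstractmethod
    def export(self, rows: list[dict]) -> str: ...


class CsvExporter(Exporter):
    def export(self, rows: list[dict]) -> str:
        if not rows:
            return ""
        header = ",".join(rows[0])
        lines = [",".join(str(v) for v in row.values()) for row in rows]
        return "\n".join([header, *lines])


class JsonExporter(Exporter):
    def export(self, rows: list[dict]) -> str:
        return json.dumps(rows)


class Report(ABC):
    """Creator: subclasses are the only place a concrete Exporter is named."""

    @abstractmethod
    def make_exporter(self) -> Exporter: ...

    def render(self, rows: list[dict]) -> str:
        # No concrete class appears here, so swapping formats never touches this code.
        return self.make_exporter().export(rows)


class CsvReport(Report):
    def make_exporter(self) -> Exporter:
        return CsvExporter()


class JsonReport(Report):
    def make_exporter(self) -> Exporter:
        return JsonExporter()
```

Adding a new format under this sketch means adding one exporter and one creator subclass; `Report.render` and its callers stay unchanged.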
---
### 2. Dependence on specific operations
**The problem:** Hard-coding the way a request is satisfied commits you to one implementation at both compile-time and run-time. Changing how the request is handled requires modifying code that should not care.
**Symptom:** `if (requestType == "A") doA(); else if (requestType == "B") doB();` — business logic tangled with dispatch logic. Cannot add new request handling without modifying existing code.
**Design patterns:**
- **Chain of Responsibility** — decouples sender from receiver; any object in the chain can handle the request
- **Command** — encapsulates a request as an object; handling can be changed, queued, logged, or undone independently
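The `if`/`else` dispatch in the symptom can be replaced by registered commands. This is a rough sketch in which plain callables stand in for Command objects; `Dispatcher`, `register`, and `dispatch` are hypothetical names, not part of any library.

```python
from typing import Callable, Dict

# A command here is any callable that takes a payload and returns a result.
Command = Callable[[dict], str]


class Dispatcher:
    def __init__(self) -> None:
        self._commands: Dict[str, Command] = {}

    def register(self, request_type: str, command: Command) -> None:
        self._commands[request_type] = command

    def dispatch(self, request_type: str, payload: dict) -> str:
        # No if/else chain: new request types are registered, not hard-coded.
        try:
            command = self._commands[request_type]
        except KeyError:
            raise ValueError(f"no handler registered for {request_type!r}") from None
        return command(payload)


dispatcher = Dispatcher()
dispatcher.register("A", lambda payload: f"did A with {payload['x']}")
dispatcher.register("B", lambda payload: f"did B with {payload['x']}")
```

Supporting a new request type is then a `register` call rather than an edit to existing dispatch code.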
---
### 3. Dependence on hardware and software platform
**The problem:** Direct dependencies on OS interfaces, APIs, or platform-specific features make software hard to port and hard to keep current even on its native platform.
**Symptom:** `#ifdef WIN32` / platform-specific API calls in core logic. Cannot run on a new OS or swap a database without touching business code.
**Design patterns:**
- **Abstract Factory** — isolates platform-specific object creation behind an interface
- **Bridge** — separates the abstraction (what the system does) from its platform-specific implementation (how it does it)
---
### 4. Dependence on object representations or implementations
**The problem:** Clients that know how an object is represented (stored, located, or implemented) must change whenever that object changes — changes cascade unnecessarily.
**Symptom:** Client code accesses internal fields directly, or knows whether an object is local vs. remote, or knows whether it is cached.
**Design patterns:**
- **Abstract Factory** — hides how product families are created and structured
- **Bridge** — hides implementation behind a stable abstraction interface
- **Memento** — externalizes and restores state without exposing internal representation
- **Proxy** — controls access to an object and hides its location or construction
---
### 5. Algorithmic dependencies
**The problem:** Algorithms are frequently extended, optimized, or replaced during development and reuse. Objects that depend on a specific algorithm must change when the algorithm changes.
**Symptom:** A sorting routine, a layout engine, or a compression algorithm embedded directly in a class. Cannot swap algorithms or test alternatives without modifying the class.
**Design patterns:**
- **Builder** — isolates the algorithm for building a complex object from the object's representation
- **Iterator** — abstracts the traversal algorithm over an aggregate, decoupling it from the aggregate's structure
- **Strategy** — defines a family of interchangeable algorithms; clients switch algorithms without touching the code that uses them
- **Template Method** — defines the skeleton of an algorithm in a base class, letting subclasses override specific steps
- **Visitor** — adds operations to an object structure without modifying the classes in that structure
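To illustrate the Strategy entry, here is a small sketch where the sorting algorithm is injected rather than embedded. `Leaderboard`, `builtin_sort`, and `insertion_sort` are hypothetical names chosen for this example.

```python
from typing import Callable, Sequence

# A strategy here is any callable that returns a sorted copy of its input.
SortStrategy = Callable[[Sequence[int]], list]


def builtin_sort(items: Sequence[int]) -> list:
    return sorted(items)


def insertion_sort(items: Sequence[int]) -> list:
    result: list = []
    for item in items:
        # Walk left until the insertion point keeps `result` ordered.
        i = len(result)
        while i > 0 and result[i - 1] > item:
            i -= 1
        result.insert(i, item)
    return result


class Leaderboard:
    """Context: depends on the SortStrategy signature, not on one algorithm."""

    def __init__(self, sort: SortStrategy = builtin_sort) -> None:
        self._sort = sort

    def top(self, scores: Sequence[int], n: int) -> list:
        # Sort ascending with the injected strategy, then take the n highest.
        return self._sort(scores)[::-1][:n]
```

Because `Leaderboard` only knows the callable's signature, an alternative algorithm can be tested by passing it to the constructor, with no change to the class.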
---
### 6. Tight coupling
**The problem:** Tightly coupled classes depend heavily on each other — they cannot be reused in isolation, cannot be changed without understanding many other classes, and the system becomes a dense mass that is hard to learn, port, and maintain.
**Symptom:** Fan-out: a class directly references many other concrete classes. Cannot unit-test a class without instantiating half the system.
**Design patterns:**
- **Abstract Factory** — clients depend on factory interface, not concrete products
- **Bridge** — separates abstraction from implementation so neither depends directly on the other
- **Chain of Responsibility** — decouples sender from all potential receivers
- **Command** — decouples the object that invokes the operation from the one that knows how to perform it
- **Facade** — provides a simple interface to a complex subsystem, reducing coupling between clients and the subsystem's internals
- **Mediator** — replaces a web of object-to-object references with a single mediator object; objects only know the mediator
- **Observer** — decouples subjects from observers; subjects do not know observer types
---
### 7. Extending functionality by subclassing
**The problem:** Customizing behavior through inheritance requires an in-depth understanding of the parent class, every new class carries a fixed implementation overhead (initializing, finalizing, and so on), and subclassing can lead to an explosion of classes: many new subclasses for even a simple extension.
**Symptom:** A class hierarchy that grows by one new leaf class for every combination of features. Overriding one method requires understanding ten others. Cannot add behavior at run-time.
**Design patterns:**
- **Bridge** — lets abstraction and implementation vary independently without deep hierarchies
- **Chain of Responsibility** — adds new handlers without modifying existing ones
- **Composite** — lets clients add new components to a tree structure uniformly
- **Decorator** — attaches responsibilities to objects dynamically, avoiding the combinatorial subclass explosion
- **Observer** — lets you add new observers without modifying subjects
- **Strategy** — defines behavior as a composable object rather than a subclass
---
### 8. Inability to alter classes conveniently
**The problem:** Sometimes you must modify a class you cannot change — no source code, a commercial library, too many existing subclasses that would all need updating.
**Symptom:** A third-party class with the wrong interface, or a core class that is too widely subclassed to change safely.
**Design patterns:**
- **Adapter** — converts an incompatible interface into one clients expect, without touching the original class
- **Decorator** — adds responsibilities to objects without subclassing; works even with sealed or third-party classes
- **Visitor** — adds new operations to an existing class structure without modifying those classes
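The Adapter entry can be sketched as follows, assuming a hypothetical third-party `LegacyThermometer` that cannot be edited and a hypothetical `CelsiusSensor` interface that the rest of the system expects.

```python
class LegacyThermometer:
    """Stand-in for a third-party class we cannot modify (hypothetical)."""

    def read_fahrenheit(self) -> float:
        return 212.0


class CelsiusSensor:
    """Target interface expected by client code (hypothetical)."""

    def read_celsius(self) -> float:
        raise NotImplementedError


class ThermometerAdapter(CelsiusSensor):
    def __init__(self, adaptee: LegacyThermometer) -> None:
        self._adaptee = adaptee

    def read_celsius(self) -> float:
        # Translate both the call and the unit; LegacyThermometer stays untouched.
        return (self._adaptee.read_fahrenheit() - 32.0) * 5.0 / 9.0


sensor: CelsiusSensor = ThermometerAdapter(LegacyThermometer())
```

Clients hold a `CelsiusSensor` and never see the legacy interface, so the original class needs no changes.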
---
## Quick Diagnostic: Which Cause Matches Your Pain?
| Symptom You're Experiencing | Most Likely Cause |
|-----------------------------|------------------|
| `new ConcreteClass()` everywhere, hard to swap | Cause 1 |
| Request dispatch logic tangled in client code | Cause 2 |
| Platform-specific code mixed into business logic | Cause 3 |
| Client knows too much about another object's internals | Cause 4 |
| Cannot swap or test algorithms independently | Cause 5 |
| Cannot unit-test a class without the whole system | Cause 6 |
| Subclass hierarchy exploding with combinations | Cause 7 |
| Can't modify the class (no source, too many dependents) | Cause 8 |
FILE:references/variability-table.md
# Variability Table — What Each Pattern Lets You Vary (Table 1.2)
Source: Design Patterns: Elements of Reusable Object-Oriented Software (GoF, 1994), Table 1.2, p. 40.
## Purpose
This table answers the question: **"What do I want to be able to change without redesign?"**
Instead of diagnosing what is causing pain (see `redesign-causes.md`), approach pattern selection from the other direction — identify the concept you want to keep variable, and this table points directly to the pattern that encapsulates it.
Key principle from the GoF: *"Encapsulate the concept that varies."* Each pattern isolates a specific aspect of your design so it can vary independently of everything else.
---
## Table 1.2: Design Aspects That Design Patterns Let You Vary
| Purpose | Pattern | Aspect(s) That Can Vary |
|---------|---------|------------------------|
| **Creational** | Abstract Factory | families of product objects |
| | Builder | how a composite object gets created |
| | Factory Method | subclass of object that is instantiated |
| | Prototype | class of object that is instantiated |
| | Singleton | the sole instance of a class |
| **Structural** | Adapter | interface to an object |
| | Bridge | implementation of an object |
| | Composite | structure and composition of an object |
| | Decorator | responsibilities of an object without subclassing |
| | Facade | interface to a subsystem |
| | Flyweight | storage costs of objects |
| | Proxy | how an object is accessed; its location |
| **Behavioral** | Chain of Responsibility | object that can fulfill a request |
| | Command | when and how a request is fulfilled |
| | Interpreter | grammar and interpretation of a language |
| | Iterator | how an aggregate's elements are accessed, traversed |
| | Mediator | how and which objects interact with each other |
| | Memento | what private information is stored outside an object, and when |
| | Observer | number of objects that depend on another object; how dependent objects stay up to date |
| | State | states of an object |
| | Strategy | an algorithm |
| | Template Method | steps of an algorithm |
| | Visitor | operations that can be applied to object(s) without changing their class(es) |
---
## How to Use This Table
**Selection approach:** When you know what aspect of your design must remain flexible, scan the "Aspect(s) That Can Vary" column for a match, then apply the corresponding pattern.
**Examples of working backwards from variability needs:**
| You want to vary... | Start with... |
|--------------------|---------------|
| Which algorithm runs at runtime | Strategy |
| How objects notify each other of changes | Observer |
| The interface presented to clients | Adapter or Facade |
| Which object handles a request | Chain of Responsibility |
| An object's behavior based on its state | State |
| Whether an object is created once or many times | Singleton, Prototype, Factory Method |
| How deep/complex an object graph gets built | Builder, Composite |
| Operations on a stable object structure | Visitor |
| Undo/redo capability | Command, Memento |
Implement the Decorator pattern to attach additional responsibilities to objects dynamically, providing a flexible alternative to subclassing. Use when you n...
---
name: decorator-pattern-implementor
description: |
Implement the Decorator pattern to attach additional responsibilities to objects dynamically, providing a flexible alternative to subclassing. Use when you need to add behaviors like logging, caching, validation, authentication, compression, or UI embellishments without creating a subclass for every combination. Addresses the subclass explosion problem through transparent enclosure — decorators conform to the component interface so clients cannot distinguish decorated from undecorated objects. Covers composition order effects and the lightweight-decorator optimization. Triggers when: adding optional behaviors to objects at runtime, wrapping streams or middleware, layering cross-cutting concerns, needing to mix and match responsibilities without combinatorial subclass growth.
model: sonnet
context: 1M
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/design-patterns-gof/skills/decorator-pattern-implementor
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- structural-pattern-selector
source-books:
- id: design-patterns-gof
title: "Design Patterns: Elements of Reusable Object-Oriented Software"
authors: ["Erich Gamma", "Richard Helm", "Ralph Johnson", "John Vlissides"]
chapters: [4]
pages: [51-54, 166-174]
tags: [design-patterns, structural, gof, decorator, composition, wrapper, embellishment, middleware]
execution:
tier: 2
mode: full
inputs:
- type: code
description: "Existing code showing a class that needs optional behaviors added, or a description of the responsibilities to layer dynamically"
tools-required: [TodoWrite, Read]
tools-optional: [Grep, Edit]
mcps-required: []
environment: "Any codebase. Language-agnostic — examples use Python but the pattern applies universally."
---
# Decorator Pattern Implementor
## When to Use
You need to add responsibilities to individual objects — not to an entire class — and you want to do it dynamically, without modifying the component or generating a subclass for every possible combination of responsibilities.
Apply this skill when:
- **Subclass explosion** is imminent or already present: a class needs behaviors like `Logged`, `Cached`, `Validated`, `Compressed` in varying combinations, and creating `LoggedCachedValidator` subclasses has become unmanageable.
- **Responsibilities are optional or withdrawable** at runtime: a feature toggles on per-object, not per-class.
- **Individual objects** need embellishment while other instances of the same class stay plain.
- **You cannot subclass** the target: the class definition is sealed, final, or belongs to a third-party library.
- You want to **compose responsibilities incrementally** — add one, try it, add another — rather than building a monolithic class.
Before starting, confirm this is Decorator and not another structural pattern:
- Need to adapt an incompatible interface? → Use Adapter instead.
- Need to aggregate multiple children, not wrap one? → Use Composite instead.
- Need to change the algorithm inside the object (its "guts") rather than its visible behavior (its "skin")? → Consider Strategy instead.
- Do you need to vary responsibilities at the *class* level (compile-time), not per-object at runtime? → Use inheritance directly.
---
## Process
### Step 1: Set Up Tracking and Assess the Subclass Explosion Problem
**ACTION:** Use `TodoWrite` to track all steps, then identify the component class and enumerate the responsibilities that need to be added in combination.
**WHY:** The central problem Decorator solves is subclass explosion. Before reaching for the pattern, you need to see the problem concretely: if N responsibilities can be independently combined, inheritance needs one class per combination, which is 2^N classes counting the undecorated one (2^N − 1 new subclasses). Mapping this out makes the trade-off visible and confirms that Decorator is worth the additional structural complexity it introduces.
```
TodoWrite:
- [ ] Step 1: Assess the combination space and confirm Decorator applies
- [ ] Step 2: Define or confirm the Component interface
- [ ] Step 3: Define the abstract Decorator base class
- [ ] Step 4: Implement ConcreteDecorator classes
- [ ] Step 5: Determine composition order and wire the chain at runtime
- [ ] Step 6: Validate and address implementation concerns
```
Map the combination space:
| Responsibility | Independent? | Withdrawable? | Optional per-object? |
|---------------|:---:|:---:|:---:|
| e.g., Logging | Yes | Yes | Yes |
| e.g., Caching | Yes | Yes | Yes |
| e.g., Validation | Yes | No | Yes |
If each row answers Yes to at least one column, Decorator applies.
Also note: **how many responsibilities** will be combined. If there is only ever one additional responsibility and it is always present, plain inheritance may suffice — the Decorator trade-off (more classes, "lots of little objects" debugging complexity) is not worth it for a single static enhancement.
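The size of the combination space from the WHY above can be checked with a few lines. This sketch enumerates the subclass names inheritance would need for four example responsibilities; the `Handler` suffix and the `subclass_names` helper are made up for illustration.

```python
from itertools import combinations

responsibilities = ["Logged", "Cached", "Validated", "Compressed"]


def subclass_names(features: list[str]) -> list[str]:
    """Every non-empty feature combination that inheritance would need a class for."""
    names = []
    for size in range(1, len(features) + 1):
        for combo in combinations(features, size):
            names.append("".join(combo) + "Handler")
    return names


names = subclass_names(responsibilities)
print(len(names))             # 15, i.e. 2**4 - 1 subclasses via inheritance
print(len(responsibilities))  # 4 decorator classes cover the same space
```

The count excludes the plain, undecorated class, and it ignores ordering; if wrapping order matters, inheritance would need even more classes.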
---
### Step 2: Define or Confirm the Component Interface
**ACTION:** Identify the abstract interface (class or protocol) that both the real object and all decorators must implement. If it does not exist, extract one.
**WHY:** Transparent enclosure — the core mechanism of the pattern — depends entirely on this interface. Clients pass a `Component*` reference; they cannot tell whether they hold the real component or a chain of three decorators. This transparency is what allows decorators to be nested recursively and layered at runtime without clients knowing. If the Component class is too heavyweight (stores data, has many non-interface methods), decorators inherit that bulk and the pattern becomes expensive to use in quantity.
**Two situations:**
**A — Interface already exists** (common when working with existing architecture):
- Confirm the ConcreteComponent implements it.
- Confirm all operations that decorators need to forward are declared on the interface, not just the ConcreteComponent.
**B — Interface needs to be extracted** (common when refactoring legacy code):
- Identify every operation clients call on the component.
- Extract those into an abstract base class or interface.
- Keep the Component class **lightweight** — define the interface only, defer data representation to subclasses.
```python
# Component interface: lightweight, defines the interface only
from abc import ABC, abstractmethod


class TextComponent(ABC):
    @abstractmethod
    def draw(self) -> str: ...

    @abstractmethod
    def resize(self, factor: float) -> None: ...


# ConcreteComponent: data and real behavior live here, not on Component
class TextView(TextComponent):
    def __init__(self, content: str):
        self._content = content

    def draw(self) -> str:
        return self._content

    def resize(self, factor: float) -> None:
        # resize logic
        pass
```
**Pitfall:** If `TextComponent` had stored `_content` itself, every decorator instance would carry that unused field. Keep data out of the Component base.
---
### Step 3: Define the Abstract Decorator Base Class
**ACTION:** Create an abstract `Decorator` class that (1) implements the Component interface and (2) holds a reference to a wrapped Component. Each operation in the interface delegates to that wrapped component by default.
**WHY:** This class is what makes the chain work. Because it implements the Component interface, a Decorator can wrap another Decorator — arbitrary nesting is possible. Because it delegates by default, ConcreteDecorators only need to override the operations they augment; all other operations pass through transparently. Without this base, each ConcreteDecorator would have to manually implement every forwarding call, creating boilerplate and the risk of forgetting to forward an operation.
**Note on omitting the abstract Decorator class:** If you only need to add *one* responsibility and are adapting an existing class hierarchy (not designing from scratch), you can merge the Decorator base into a single ConcreteDecorator. Omit the abstraction when the abstraction adds no value — there will never be a second ConcreteDecorator to share it.
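As a sketch of the merged form described in the note, the single decorator below holds the wrapped component and does its own forwarding. `Notifier`, `EmailNotifier`, and `TimestampedNotifier` are hypothetical names used only for this example.

```python
from abc import ABC, abstractmethod


class Notifier(ABC):
    """Hypothetical Component interface with two operations."""

    @abstractmethod
    def send(self, message: str) -> str: ...

    @abstractmethod
    def name(self) -> str: ...


class EmailNotifier(Notifier):
    def send(self, message: str) -> str:
        return f"email: {message}"

    def name(self) -> str:
        return "email"


class TimestampedNotifier(Notifier):
    """The only decorator in this design, so it stores the wrapped component
    and forwards by hand; there is no shared Decorator base class to extend."""

    def __init__(self, component: Notifier, stamp: str) -> None:
        self._component = component
        self._stamp = stamp

    def send(self, message: str) -> str:
        # Augmented operation: prefix the message, then forward.
        return self._component.send(f"[{self._stamp}] {message}")

    def name(self) -> str:
        # Plain forwarding, written explicitly because no base class provides it.
        return self._component.name()


notifier = TimestampedNotifier(EmailNotifier(), stamp="t0")
```

If a second decorator appears later, the forwarding code in `TimestampedNotifier` is the part to lift into a shared base.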
```python
class Decorator(TextComponent):
    """Abstract Decorator: conforms to the Component interface, delegates by default."""

    def __init__(self, component: TextComponent):
        # WHY: store as a Component reference, not as TextView; this allows
        # decorating decorators (recursive wrapping) transparently.
        self._component = component

    def draw(self) -> str:
        # Default: pure forwarding. ConcreteDecorators override to augment.
        return self._component.draw()

    def resize(self, factor: float) -> None:
        self._component.resize(factor)
```
---
### Step 4: Implement ConcreteDecorator Classes
**ACTION:** For each responsibility to be added, create a ConcreteDecorator subclass that overrides the relevant operations, calls the parent Decorator (to forward to the wrapped component), and adds its behavior before or after the forwarding call.
**WHY:** The before/after placement is a meaningful design decision. Running added behavior *before* forwarding (pre-processing) is appropriate when you need to gate or transform the request before the component handles it — e.g., validation, authentication, input transformation. Running behavior *after* forwarding (post-processing) is appropriate when you need to augment or annotate the component's output — e.g., drawing a border after the component draws itself, logging the result of an operation, caching the return value. Either order is valid; the choice is determined by what the decoration is semantically doing.
```python
import datetime


class BorderDecorator(Decorator):
    """Draws a border around the component's rendered output."""

    def __init__(self, component: TextComponent, border_width: int = 1):
        super().__init__(component)
        self._border_width = border_width

    def draw(self) -> str:
        # Post-processing: let the component draw first, then add the border
        inner = super().draw()  # forwards to the wrapped component
        return self._draw_border(inner)

    def _draw_border(self, content: str) -> str:
        # Size the border to the widest line; default=0 avoids an error on empty content
        width = max((len(line) for line in content.splitlines()), default=0)
        border = "-" * (width + 2 * self._border_width)
        return f"{border}\n{content}\n{border}"


class ScrollDecorator(Decorator):
    """Adds scrolling capability: clips to a viewport and tracks scroll position."""

    def __init__(self, component: TextComponent, viewport_height: int = 20):
        super().__init__(component)
        self._scroll_position = 0
        self._viewport_height = viewport_height

    def draw(self) -> str:
        full = super().draw()
        lines = full.splitlines()
        visible = lines[self._scroll_position:self._scroll_position + self._viewport_height]
        return "\n".join(visible)

    def scroll_to(self, position: int) -> None:
        """Decorator-specific operation: only callable when the client knows the type."""
        self._scroll_position = position


class LoggingDecorator(Decorator):
    """Logs every draw call with a timestamp."""

    def draw(self) -> str:
        result = super().draw()
        print(f"[{datetime.datetime.now().isoformat()}] draw() called")
        return result
```
**Decorator-specific operations:** `ScrollDecorator` adds `scroll_to()`, which is not on the Component interface. Clients that need this operation must hold a `ScrollDecorator` reference directly, not a `TextComponent` reference. This is a deliberate trade-off: the transparency guarantee (clients treat everything as `TextComponent`) breaks only when a client explicitly needs decorator-specific behavior. In that case, keep a direct reference to the decorator alongside the component reference in the window object.
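One way to keep both references is sketched below. The `Window` class and its `page_down` method are hypothetical, and the component classes are redefined in reduced form so the block runs on its own.

```python
from abc import ABC, abstractmethod


class TextComponent(ABC):
    @abstractmethod
    def draw(self) -> str: ...


class TextView(TextComponent):
    def __init__(self, content: str) -> None:
        self._content = content

    def draw(self) -> str:
        return self._content


class ScrollDecorator(TextComponent):
    def __init__(self, component: TextComponent, viewport_height: int) -> None:
        self._component = component
        self._viewport_height = viewport_height
        self._scroll_position = 0

    def draw(self) -> str:
        lines = self._component.draw().splitlines()
        start = self._scroll_position
        return "\n".join(lines[start:start + self._viewport_height])

    def scroll_to(self, position: int) -> None:
        self._scroll_position = position


class Window:
    """Hypothetical host: draws through the interface, scrolls through the typed reference."""

    def __init__(self, contents: TextComponent, scroller: ScrollDecorator) -> None:
        self._contents = contents   # outermost component, used only for drawing
        self._scroller = scroller   # member of the same chain, kept for scroll_to()

    def page_down(self, position: int) -> str:
        self._scroller.scroll_to(position)
        return self._contents.draw()


scroller = ScrollDecorator(TextView("a\nb\nc\nd"), viewport_height=2)
window = Window(contents=scroller, scroller=scroller)
```

In a longer chain, `contents` would be the outermost decorator while `scroller` still points at the `ScrollDecorator` somewhere inside it.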
---
### Step 5: Determine Composition Order and Wire the Chain at Runtime
**ACTION:** Decide the order in which decorators wrap the component, then instantiate the chain. Composition order determines which decorator's behavior is "outermost" and therefore executes first on the way in and last on the way out.
**WHY — the Lexi Border/Scroller case study:** The GoF "Embellishing the User Interface" case study (pages 51-54) demonstrates this directly. Lexi needs a text editing area with both a border and scroll bars. Two composition orders are possible:
```
Order A (Border outside Scroller):
border → scroller → composition (text content)
```
When `Draw()` is called on `border`, it tells `scroller` to draw, which renders the clipped scrollable view of the text, and then `border` draws around the scrolled viewport. The **border is fixed relative to the window** — it does not scroll.
```
Order B (Scroller outside Border):
scroller → border → composition (text content)
```
Now when `scroller` renders its viewport, it is clipping a bordered composition. The **border scrolls with the text** — when the user scrolls down, the border moves off-screen. This may or may not be desirable, but it is a different visual behavior.
The general rule: **the outermost decorator's behavior is the last thing applied to the output.** Wire the chain from innermost (closest to the real component) to outermost (what clients actually interact with):
```python
# Assumes `content` (a str) and a host `window` object with set_contents() already exist.

# Lexi-style: fixed border around a scrollable text view (Order A)
text_view = TextView(content)
scrollable = ScrollDecorator(text_view, viewport_height=20)
bordered_scrollable = BorderDecorator(scrollable, border_width=1)
# Clients interact with bordered_scrollable, which is a TextComponent
window.set_contents(bordered_scrollable)

# Alternative: scrollable border (Order B), where the border scrolls with the text
bordered = BorderDecorator(text_view, border_width=1)
scrollable_bordered = ScrollDecorator(bordered, viewport_height=20)
window.set_contents(scrollable_bordered)
```
**Decision guide for composition order:**
| Question | Implication |
|----------|-------------|
| Does decorator A need to see the output of decorator B? | A must be outer (A wraps B) |
| Is decorator A a gateway/guard that should run before B? | A must be outer |
| Should decorator A's additions be inside B's scope? | A must be inner |
| Does one decorator's state affect the other's behavior? | Explicit contract needed; document the order requirement |
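The first two rows of the guide can be verified with a tracing sketch. `Handler` and `Tracing` are hypothetical classes that only record the order in which each layer is entered and exited.

```python
class Handler:
    """Innermost component; records that it ran."""

    def run(self, trace: list[str]) -> None:
        trace.append("core")


class Tracing(Handler):
    """Hypothetical decorator that records when it is entered and exited."""

    def __init__(self, inner: Handler, label: str) -> None:
        self._inner = inner
        self._label = label

    def run(self, trace: list[str]) -> None:
        trace.append(f"{self._label}:in")    # pre-processing
        self._inner.run(trace)
        trace.append(f"{self._label}:out")   # post-processing


trace: list[str] = []
Tracing(Tracing(Handler(), "B"), "A").run(trace)  # A is outer, B is inner
print(trace)
# ['A:in', 'B:in', 'core', 'B:out', 'A:out']
```

The outer decorator `A` runs its pre-processing before `B` and its post-processing after `B`, which is why guards and output consumers belong on the outside.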
---
### Step 6: Validate and Address Implementation Concerns
**ACTION:** Verify the implementation satisfies the transparency requirement, check for the "lots of little objects" problem, and mark tasks complete.
**WHY:** The Decorator pattern's primary liability — documented by the GoF — is that it produces systems of many small, similar-looking objects. Each decorator layer is a separate object; a chain of five decorators wrapping one component is six objects that all respond to the same interface. This makes the system highly customizable but harder to debug (you cannot inspect the class name of a reference and know what you are dealing with) and potentially harder for teammates to understand. Acknowledging this upfront means it will not be a surprise during maintenance.
**Validation checklist:**
- [ ] The Component interface does not store data — data representation lives in ConcreteComponent and ConcreteDecorators only
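- [ ] Every ConcreteDecorator calls `super().operation()` (or equivalent) on each path that should reach the component; an accidental omission silently breaks the chain. Deliberate short-circuits (a rejected auth check, a cache hit) are the exception and should be commented as such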
- [ ] Clients that only use the Component interface have not been given a concrete Decorator type reference (they should hold a Component reference)
- [ ] Object identity: if any code uses `isinstance` checks or identity comparisons (`obj is component`), it must be updated — a decorated component is *not* identical to the original
- [ ] The composition order is documented wherever the chain is wired together (not left implicit)
- [ ] If the abstract Decorator class was omitted (single-responsibility case), confirm this choice in a comment so future developers understand there is no base class to extend
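The object-identity item in the checklist can be demonstrated with a short sketch. `Component`, `Plain`, and `Wrapper` are hypothetical stand-ins for a Component interface, a ConcreteComponent, and a decorator.

```python
class Component:
    def value(self) -> int:
        return 1


class Plain(Component):
    pass


class Wrapper(Component):
    def __init__(self, inner: Component) -> None:
        self._inner = inner

    def value(self) -> int:
        return self._inner.value()


original = Plain()
decorated = Wrapper(original)

same_object = decorated is original                 # False: the wrapper is a new object
still_component = isinstance(decorated, Component)  # True: interface checks keep working
still_plain = isinstance(decorated, Plain)          # False: concrete-type checks break
```

Code that compares identities or tests for the concrete type therefore needs review when decorators are introduced; checks against the Component interface are unaffected.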
**Mark all TodoWrite tasks complete.**
---
## Examples
### Example 1: Lexi Embellishment — Border and Scroller (GoF Case Study)
**Scenario:** Lexi, a WYSIWYG document editor, displays a text editing area. The design requires both a border (to demarcate the page) and scroll bars (to navigate long documents). The embellishments must be removable at runtime and must not affect other UI components in the application.
**Trigger:** "We need to add a border and scrolling to the text editing area. Other parts of the UI should not know about these embellishments."
**Key design decisions from the GoF:**
- `MonoGlyph` (abstract Decorator base) stores a single child `_component` and forwards all `Draw()` calls to it by default. This is exactly the abstract Decorator base from Step 3.
- `Border` subclasses `MonoGlyph`, overrides `Draw()` to call `MonoGlyph::Draw()` first (forward to child), then call `DrawBorder()` — post-processing.
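- `Scroller` also subclasses `MonoGlyph` and clips its child to a scrollable viewport. The matching decorator-specific operation is `ScrollTo()` on `ScrollDecorator` in the GoF Decorator chapter, rendered as `scroll_to()` in Step 4.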
```
Composition chain (border fixed, scroller inside border):
Border → Scroller → Composition (text content + layout glyphs)
border->Draw() calls:
MonoGlyph::Draw() (forwards to scroller)
scroller->Draw() (clips to viewport, calls composition->Draw())
DrawBorder() (draws the border around what scroller rendered)
```
**Reversing the order** (scroller outside border) makes the border scroll with the text — a different UX behavior from the same two classes, zero code changes, only the instantiation order changes.
**Output:** A `Border*` reference that is also a `Glyph*`. The window treats it as a plain glyph; the embellishments are invisible to the rest of the system.
---
### Example 2: HTTP Middleware Stack (Web Backend)
**Scenario:** A web API handler needs logging, authentication checking, and response caching, in varying combinations per endpoint. Using inheritance would require `LoggedAuthenticatedCachedHandler`, `LoggedHandler`, `AuthenticatedHandler`, etc.
**Trigger:** "Different API endpoints need different combinations of logging, auth, and caching. I can't keep creating subclasses for every combination."
```python
from abc import ABC, abstractmethod


class RequestHandler(ABC):
    @abstractmethod
    def handle(self, request: dict) -> dict: ...


class UserProfileHandler(RequestHandler):
    def handle(self, request: dict) -> dict:
        return {"user": "Alice", "email": "alice@example.com"}


class HandlerDecorator(RequestHandler):
    def __init__(self, handler: RequestHandler):
        self._handler = handler

    def handle(self, request: dict) -> dict:
        return self._handler.handle(request)


class AuthDecorator(HandlerDecorator):
    def handle(self, request: dict) -> dict:
        # Pre-processing: gate before forwarding (deliberate short-circuit on failure)
        if not request.get("auth_token"):
            return {"error": "Unauthorized", "status": 401}
        return super().handle(request)


class LoggingDecorator(HandlerDecorator):
    def handle(self, request: dict) -> dict:
        print(f"Request: {request.get('path')}")
        result = super().handle(request)
        print(f"Response status: {result.get('status', 200)}")
        return result


class CachingDecorator(HandlerDecorator):
    def __init__(self, handler: RequestHandler):
        super().__init__(handler)
        self._cache: dict = {}

    def handle(self, request: dict) -> dict:
        key = str(request)
        if key in self._cache:
            # Cache hit: deliberate short-circuit, the wrapped handler is not called
            return self._cache[key]
        result = super().handle(request)
        self._cache[key] = result
        return result


# Composition: Auth gate → Cache → real handler, with logging on the outside
handler = LoggingDecorator(
    AuthDecorator(
        CachingDecorator(
            UserProfileHandler()
        )
    )
)
```
**Composition order matters:** `AuthDecorator` is inside `LoggingDecorator`, so failed auth attempts are still logged. Moving `AuthDecorator` outside `LoggingDecorator` would suppress logs for unauthenticated requests. Neither is wrong — the choice reflects product intent.
**Output:** Any combination of behaviors can be assembled at configuration time. Adding rate-limiting is a new `RateLimitDecorator` class — no changes to existing handlers or decorators.
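As a hedged illustration of the rate-limiting extension mentioned above, the sketch below adds a hypothetical `RateLimitDecorator`. The `max_calls` parameter, the per-instance counter, the 429 response shape, and the `EchoHandler` stand-in are all assumptions made for the example, not part of the original design. The base classes are repeated so the block runs on its own.

```python
from abc import ABC, abstractmethod


class RequestHandler(ABC):
    @abstractmethod
    def handle(self, request: dict) -> dict: ...


class HandlerDecorator(RequestHandler):
    def __init__(self, handler: RequestHandler):
        self._handler = handler

    def handle(self, request: dict) -> dict:
        return self._handler.handle(request)


class EchoHandler(RequestHandler):
    """Hypothetical stand-in for a real endpoint handler."""

    def handle(self, request: dict) -> dict:
        return {"status": 200, "path": request.get("path")}


class RateLimitDecorator(HandlerDecorator):
    """Hypothetical decorator: allows at most `max_calls` requests per instance."""

    def __init__(self, handler: RequestHandler, max_calls: int):
        super().__init__(handler)
        self._max_calls = max_calls
        self._calls = 0

    def handle(self, request: dict) -> dict:
        if self._calls >= self._max_calls:
            # Gate before forwarding, the same shape as the auth check.
            return {"error": "Too Many Requests", "status": 429}
        self._calls += 1
        return super().handle(request)


limited = RateLimitDecorator(EchoHandler(), max_calls=2)
responses = [limited.handle({"path": "/profile"}) for _ in range(3)]
statuses = [r["status"] for r in responses]  # [200, 200, 429]
```

No existing handler or decorator is edited; the new class only needs the shared `handle` interface. A production limiter would likely track time windows and client identity, which this sketch leaves out.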
---
### Example 3: I/O Stream Decoration (GoF Known Uses)
**Scenario:** A file streaming system needs to support optional compression (run-length encoding, Lempel-Ziv, etc.) and optional 7-bit ASCII conversion for transmission over legacy channels. Any combination of these should work with any stream type (`FileStream`, `MemoryStream`).
**Trigger:** "We need to compress and/or ASCII-convert stream data. Different streams need different combinations and we don't want to create CompressedFileStream, ASCIIFileStream, CompressedASCIIFileStream..."
```
Component: Stream (abstract — PutInt, PutString, HandleBufferFull)
ConcreteComp: FileStream, MemoryStream
Decorator base: StreamDecorator (forwards HandleBufferFull by default)
ConcreteDecA: CompressingStream (overrides HandleBufferFull — compresses buffer)
ConcreteDecB: ASCII7Stream (overrides HandleBufferFull — converts to 7-bit ASCII)
```
Wiring a compressed, ASCII-encoded file stream:
```cpp
// C++ (from GoF page 174)
Stream* aStream = new CompressingStream(
new ASCII7Stream(
new FileStream("aFileName")
)
);
aStream->PutInt(12);
aStream->PutString("aString");
```
`CompressingStream` wraps `ASCII7Stream`, which wraps `FileStream`. Each write enters at the outermost decorator and travels inward: `CompressingStream` compresses the buffer, `ASCII7Stream` converts the compressed binary data to 7-bit ASCII, and `FileStream` writes the result to disk. Compress-then-encode is the order that suits a 7-bit channel, because the bytes that reach the file are ASCII-safe. Reversing the wrappers would ASCII-encode first and compress second, producing binary output that defeats the purpose of the ASCII conversion.
**Output:** Two decorator classes cover four combinations (plain, compressed, ASCII, compressed+ASCII) across any number of stream types, with no subclass proliferation.
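A small Python analogue of the wiring above can make the processing order visible. This is a sketch under stated assumptions: the single `put_bytes` operation is a simplified, hypothetical interface, `zlib` stands in for the compression scheme, and `base64` stands in for the 7-bit ASCII conversion (the original ET++ encoding is not specified here).

```python
import base64
import zlib


class Stream:
    """Simplified Component with one write operation (assumption)."""

    def put_bytes(self, data: bytes) -> None:
        raise NotImplementedError


class MemoryStream(Stream):
    def __init__(self) -> None:
        self.buffer = b""

    def put_bytes(self, data: bytes) -> None:
        self.buffer += data


class StreamDecorator(Stream):
    def __init__(self, inner: Stream) -> None:
        self._inner = inner

    def put_bytes(self, data: bytes) -> None:
        self._inner.put_bytes(data)  # pure forwarding


class CompressingStream(StreamDecorator):
    def put_bytes(self, data: bytes) -> None:
        super().put_bytes(zlib.compress(data))


class ASCII7Stream(StreamDecorator):
    def put_bytes(self, data: bytes) -> None:
        # base64 output uses only 7-bit ASCII characters.
        super().put_bytes(base64.b64encode(data))


sink = MemoryStream()
stream = CompressingStream(ASCII7Stream(sink))
stream.put_bytes(b"aString" * 20)

# The outermost layer ran first: the sink holds ASCII text that decodes
# to compressed bytes, which decompress to the original payload.
restored = zlib.decompress(base64.b64decode(sink.buffer))
```

Undoing the layers in the opposite order (decode, then decompress) recovers the payload, which confirms that the outer decorator transformed the data before the inner one did.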
---
## Decorator vs. Inheritance — When to Use Which
| Dimension | Decorator | Inheritance |
|-----------|-----------|-------------|
| When behaviors are added | Runtime (dynamic) | Compile-time (static) |
| Combinations | 2 decorators = 4 combinations (mixed freely) | 2 behaviors = 3 subclasses minimum |
| Withdrawing a responsibility | Remove from chain | Impossible — class is fixed |
| Object identity | Decorated object ≠ original (identity breaks) | Is-a relationship preserved |
| Debugging | Chains of small objects — harder to inspect | Single class — straightforward |
| Adding same behavior twice | Wrap with the same decorator twice | Inheriting from a class twice is an error |
**Choose Decorator** when responsibilities are optional, additive, and need to vary per-object at runtime.
**Choose inheritance** when the additional behavior is always present for all instances of a class and there is no combinatorial variation.
---
## Key Principles
- **Interface conformance over convenience** — a decorator that does not fully implement the Component interface breaks the transparency guarantee. Every operation on the Component interface must be implemented, even if the implementation is pure delegation. Partial forwarding means some clients will get wrong behavior silently.
- **Keep the Component base lightweight** — the Component class's job is to define the interface, not store data. If data representation is placed on the Component, every decorator instance in a long chain carries that data as dead weight. Data belongs on ConcreteComponent and ConcreteDecorators only.
- **Forward before or after based on semantics, not convention** — whether a ConcreteDecorator calls `super().operation()` before or after its added behavior is a semantic decision: "does my addition depend on what the component did, or does the component's output depend on what I do first?" Make this explicit in a comment.
- **Document composition order wherever the chain is assembled** — composition order is behavior-changing and invisible at the call site. A reader seeing `BorderDecorator(ScrollDecorator(view))` must understand the layering effect. A short comment stating "border is fixed; scroller is inside" prevents order from being accidentally reversed during refactoring.
- **Object identity breaks — acknowledge it** — clients that use `==`, `is`, or `isinstance` against the original component will fail with decorated versions. If identity checks are present in the codebase, keep a direct reference to the ConcreteComponent alongside the decorated chain rather than expecting clients to unwrap it.
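The identity point in the last principle is easy to check directly. The sketch below uses two hypothetical duck-typed classes (no shared base class, which is an assumption made to keep it short) and shows each check failing on the decorated object.

```python
class TextView:
    def draw(self) -> str:
        return "text"


class BorderDecorator:
    def __init__(self, component) -> None:
        self._component = component

    def draw(self) -> str:
        return f"[{self._component.draw()}]"


text_view = TextView()                  # keep this direct reference
decorated = BorderDecorator(text_view)

same_object = decorated is text_view            # False
equal = decorated == text_view                  # False: default == compares identity
is_text_view = isinstance(decorated, TextView)  # False: the wrapper is a different class
```

Code that needs the original object should use `text_view`; code that only draws can use `decorated` without knowing the difference.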
---
## Reference Files
| File | Contents |
|------|----------|
| `references/decorator-implementation-guide.md` | Full participants catalog, consequences analysis, language-specific implementation notes, transparent enclosure reasoning process, comparison with Strategy (skin vs. guts), and omitting the abstract Decorator class |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-structural-pattern-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/decorator-implementation-guide.md
# Decorator Pattern — Implementation Guide
Deep reference for `decorator-pattern-implementor`. Read this file when you need:
- The full participants catalog with responsibility descriptions
- The 5-phase transparent enclosure reasoning process
- Consequences analysis (benefits and liabilities)
- Guidance on omitting the abstract Decorator class
- Decorator vs. Strategy (skin vs. guts) distinction
- Language-specific implementation notes
---
## Participants
| Role | GoF Name | Lexi Name | Responsibility |
|------|----------|-----------|---------------|
| Component | `Component` | `Glyph` / `VisualComponent` | Defines the interface for objects that can have responsibilities added dynamically. Kept lightweight — no data representation. |
| ConcreteComponent | `ConcreteComponent` | `TextView`, `Composition` | The real object to which additional responsibilities are attached. Implements Component. Holds data. |
| Decorator (abstract) | `Decorator` | `MonoGlyph` / `Decorator` | Implements Component interface. Holds a reference to a Component. Default behavior: forward all operations to the wrapped component. |
| ConcreteDecorator | `ConcreteDecoratorA/B` | `Border`, `Scroller`, `BorderDecorator`, `ScrollDecorator` | Subclasses Decorator. Overrides one or more operations to add behavior before or after forwarding. May add state (e.g., `_border_width`, `_scroll_position`). |
---
## The 5-Phase Transparent Enclosure Reasoning Process
This is the mental sequence for deciding whether and how to apply Decorator. Work through each phase before writing any code.
### Phase 1: Rule Out Inheritance
Ask: "Can I achieve what I need with a subclass?"
Inheritance fails when:
- The number of combinations grows exponentially (N independent behaviors → up to 2^N − 1 subclasses, one per non-empty combination)
- The responsibility needs to be added at runtime, not at class definition time
- The responsibility needs to be *withdrawn* at runtime (you cannot un-inherit)
- You do not own or cannot modify the class hierarchy
- You need to add the *same* responsibility twice (e.g., double border) — you cannot inherit from a class twice
If inheritance is adequate (single static addition, always present), stop here — use it. Decorator adds structural complexity that inheritance avoids.
### Phase 2: Choose the Composition Direction
When you decide to use composition (object containing another object), two directions are possible:
**Direction A: Embellishment contains the component** (Decorator direction)
- The border object holds a reference to the text view.
- The border "knows" about the component; the component does not know about the border.
- Border-drawing code stays entirely inside the Border class.
- Component classes remain untouched.
**Direction B: Component contains the embellishment**
- The text view holds a reference to a border object.
- The component must be modified to be aware of the embellishment.
- Adding a new embellishment type requires modifying the component class.
Direction A is the Decorator pattern. Direction B is not — it requires modifying the component, coupling it to the embellishment. Always prefer Direction A when the goal is to keep the component class unaware of and unmodified by its embellishments.
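The two directions can be compared side by side in a short sketch. All class names here are hypothetical, and the string-based `draw` is a stand-in for real rendering.

```python
from typing import Optional


class TextView:
    """Direction A component: knows nothing about borders."""

    def draw(self) -> str:
        return "text"


class Border:
    """Direction A: the embellishment holds the component."""

    def __init__(self, component) -> None:
        self._component = component

    def draw(self) -> str:
        return f"|{self._component.draw()}|"


class BorderStyle:
    def frame(self, content: str) -> str:
        return f"|{content}|"


class BorderAwareTextView:
    """Direction B: the component holds the embellishment.

    Every new kind of embellishment means editing this class.
    """

    def __init__(self, border: Optional[BorderStyle] = None) -> None:
        self._border = border

    def draw(self) -> str:
        content = "text"
        return self._border.frame(content) if self._border else content


direction_a = Border(TextView()).draw()
direction_b = BorderAwareTextView(BorderStyle()).draw()
```

Both produce the same output, but only Direction A leaves the component class untouched.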
### Phase 3: Establish Interface Conformance
Decide: what interface must both the real component and all decorators share?
This is the "compatible interfaces" half of transparent enclosure. The reason for requiring interface compatibility is client transparency: if a `Border` has a `Draw(Window*)` method and `TextView` also has a `Draw(Window*)` method, and both are `Glyph` subclasses, then clients that call `glyph->Draw(w)` work identically whether `glyph` is a plain `TextView` or a `Border` wrapping a `Scroller` wrapping a `TextView`. The chain is invisible.
Check: Does the Component interface cover *all* operations clients will call? If a client calls an operation that exists only on the ConcreteComponent (not the Component interface), the decorator chain breaks transparency for that operation.
### Phase 4: Define the Abstract Decorator Base
Create the `Decorator` class that implements the Component interface and holds the `_component` reference. Implement every interface operation as a pure forwarding call.
This class exists for two reasons:
1. It makes the chain structurally sound — a Decorator can wrap another Decorator because both implement Component.
2. It provides the default forwarding behavior — ConcreteDecorators only override operations they augment, not the entire interface.
**When to omit this base class:** If you are adding exactly one responsibility and will never add a second ConcreteDecorator, the abstract Decorator base adds a class without adding value. In that case, implement the forwarding and the added behavior directly in a single class. The GoF explicitly acknowledge this: "There's no need to define an abstract Decorator class when you only need to add one responsibility." The optimization merges the Decorator's forwarding responsibility into the ConcreteDecorator.
### Phase 5: Layer at Runtime
Instantiation order determines the decoration chain. The innermost object is the real component; each wrapping object is one decoration layer. The outermost reference is what clients hold and interact with.
```
client → [outer decorator] → [middle decorator] → [inner decorator] → [component]
          ↑
          client's entry point
```
At the construction site, document the order. A future developer changing `BorderDecorator(ScrollDecorator(view))` to `ScrollDecorator(BorderDecorator(view))` will get subtly different behavior — the comment prevents this from being an invisible refactoring accident.
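A string-based sketch shows how swapping the two layers changes the result. The `width` parameter and the idea of modelling a viewport as a character slice are assumptions made for illustration.

```python
class View:
    def __init__(self, text: str) -> None:
        self.text = text

    def draw(self) -> str:
        return self.text


class Decorator(View):
    def __init__(self, component: View) -> None:
        self._component = component

    def draw(self) -> str:
        return self._component.draw()


class BorderDecorator(Decorator):
    def draw(self) -> str:
        return f"|{super().draw()}|"


class ScrollDecorator(Decorator):
    def __init__(self, component: View, width: int) -> None:
        super().__init__(component)
        self._width = width

    def draw(self) -> str:
        # Model the viewport as the first `width` characters.
        return super().draw()[: self._width]


view = View("hello world")

# Border is fixed; scroller is inside.
fixed_border = BorderDecorator(ScrollDecorator(view, width=5)).draw()

# Scroller is outside; the border is clipped along with the text.
clipped_border = ScrollDecorator(BorderDecorator(view), width=5).draw()
```

Here `fixed_border` is `"|hello|"` while `clipped_border` is `"|hell"`: the same two classes, with different behavior decided only by construction order.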
---
## Consequences Analysis
### Benefits
**1. More flexibility than static inheritance**
Responsibilities can be added and removed at runtime simply by attaching and detaching decorators. With inheritance, the set of responsibilities is fixed at compile time. This means a decorator can be added for a debugging session and removed in production without code changes — just configuration changes.
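Inheritance requires a new class for each combination. Decorators can be mixed and matched freely: N decorator classes cover all 2^N combinations of N optional behaviors, where inheritance would need a separate subclass for every non-empty combination (2^N − 1).
You can also add a responsibility *twice*. Attaching two `BorderDecorator` instances gives a double border. Inheriting from `Border` twice is error-prone at best, and many languages reject it outright (C++ forbids naming the same direct base class twice; Python raises `TypeError` for a duplicate base class).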
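The double-border case can be sketched in a few lines. The classes are hypothetical, and the second half shows how Python itself reacts to listing the same base class twice: it fails with a `TypeError` when the class is created.

```python
class Glyph:
    def draw(self) -> str:
        return "text"


class BorderDecorator(Glyph):
    def __init__(self, component: Glyph) -> None:
        self._component = component

    def draw(self) -> str:
        return f"[{self._component.draw()}]"


# Wrapping twice applies the responsibility twice.
double = BorderDecorator(BorderDecorator(Glyph())).draw()

# Equivalent to `class DoubleBorder(BorderDecorator, BorderDecorator): ...`
try:
    type("DoubleBorder", (BorderDecorator, BorderDecorator), {})
    duplicate_base_error = None
except TypeError as exc:
    duplicate_base_error = str(exc)
```

`double` is `"[[text]]"`, and `duplicate_base_error` holds the interpreter's message for the rejected class.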
**2. Avoids feature-laden base classes**
Without Decorator, the pressure is to add all foreseeable features into a single monolithic class. Decorator allows a simple, focused base class with functionality composed incrementally. Clients pay only for what they use. New decorators can be defined independently of the components they decorate, even for unforeseen extensions.
### Liabilities
**3. Decorator and component are not identical**
A decorated component is not the same object as the undecorated component. From an object identity standpoint, `decorated_view is text_view` is `False`. Code that relies on object identity (caching by reference, set membership checks) or on the concrete type (`isinstance` guards against the ConcreteComponent class) breaks when decorators are introduced. If identity matters, keep a direct reference to the ConcreteComponent alongside the decorated chain.
**4. Lots of little objects**
A design that uses Decorator heavily produces systems composed of many small objects that look similar (they all implement the same interface). They differ only in how they are connected, not in their class names or visible data. These systems are highly customizable and easy to extend, but can be hard to debug and understand. Inspecting a reference at runtime tells you only the outermost type — you cannot easily see the full chain without traversing it. Document the chain structure explicitly in the code where it is assembled.
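One way to soften the debugging problem is a small helper that walks the chain and names each layer. The sketch below is hypothetical and assumes every decorator stores its wrapped object under the attribute `_component`; a codebase with a different convention would need a different attribute name.

```python
class TextView:
    def draw(self) -> str:
        return "text"


class Decorator:
    def __init__(self, component) -> None:
        self._component = component

    def draw(self) -> str:
        return self._component.draw()


class Border(Decorator):
    pass


class Scroller(Decorator):
    pass


def describe_chain(obj, attr: str = "_component") -> str:
    """Return the class names from the outermost layer to the component."""
    names = []
    while obj is not None:
        names.append(type(obj).__name__)
        obj = getattr(obj, attr, None)  # None once the real component is reached
    return " -> ".join(names)


chain = describe_chain(Border(Scroller(TextView())))
```

`chain` is `"Border -> Scroller -> TextView"`, which is the information a debugger view of the outermost reference does not show at a glance.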
---
## Decorator vs. Strategy: Skin vs. Guts
The GoF describe the key distinction as: "A decorator changes an object's skin; a strategy changes its guts."
| Aspect | Decorator | Strategy |
|--------|-----------|----------|
| What changes | Outer behavior (adds to interface, wraps calls) | Inner algorithm (replaces how the object computes) |
| Component awareness | Component does not know about decorators | Component knows about strategy objects; references and maintains them |
| Depth of change | Surface — before/after operation forwarding | Core — the computation itself is replaced |
| Component weight | Works best with lightweight Component class | Viable even when Component is heavyweight |
| Interface requirement | Decorator must conform to Component interface | Strategy can have its own specialized interface |
| Analogy | Wrapping a gift — the gift doesn't change | Swapping the engine in a car — the car has to be designed to allow it |
**When to prefer Strategy:** When the Component class is intrinsically heavyweight (contains significant data and behavior), making the Decorator pattern too costly (every decorator carries the base weight). In that case, design the component to delegate behavior to a separate strategy object, and replace the strategy to extend functionality. The component must be modified to support strategies, but the modification is contained.
**When to prefer Decorator:** When you cannot or should not modify the component, when responsibilities are additive (not replacements), and when the Component interface is stable and lightweight.
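The skin-versus-guts distinction can be sketched with one component that supports both. The `Report` class, the formatter functions, and `UppercaseDecorator` are hypothetical names chosen for the example.

```python
from typing import Callable, List


def plain(lines: List[str]) -> str:
    return "\n".join(lines)


def bulleted(lines: List[str]) -> str:
    return "\n".join(f"* {line}" for line in lines)


class Report:
    """Strategy: the component holds the formatter and calls it itself."""

    def __init__(self, lines: List[str], formatter: Callable[[List[str]], str]) -> None:
        self._lines = lines
        self._formatter = formatter  # the replaceable "guts"

    def render(self) -> str:
        return self._formatter(self._lines)


class UppercaseDecorator:
    """Decorator: wraps render() from outside; Report is unaware of it."""

    def __init__(self, component) -> None:
        self._component = component

    def render(self) -> str:
        return self._component.render().upper()


report = Report(["a", "b"], formatter=bulleted)
inner_change = report.render()                      # strategy changed the computation
outer_change = UppercaseDecorator(report).render()  # decorator changed the surface
```

`Report` had to be written with a `formatter` slot to accept a strategy, whereas `UppercaseDecorator` needed no cooperation from `Report` at all.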
---
## Omitting the Abstract Decorator Class — Full Analysis
The abstract Decorator base class is a structural convenience, not a logical requirement. It can be omitted when:
1. **Only one ConcreteDecorator is needed** — there is nothing to share between multiple ConcreteDecorators, so the abstraction layer adds zero value.
2. **Working with an existing class hierarchy** — you are decorating a class you did not design, and adding an intermediate abstract class to the hierarchy would require changes you cannot make (sealed class, third-party library, etc.).
3. **The forwarding logic is trivial** — a one-method interface with a single forwarding line is not worth abstracting.
In these cases, the ConcreteDecorator directly holds the `_component` reference and implements both the forwarding and the decoration:
```python
# Single-responsibility case: omit abstract Decorator base.
# `TextComponent` is the Component interface (draw, resize) defined elsewhere.
class LoggingWrapper(TextComponent):
    def __init__(self, component: TextComponent):
        self._component = component  # forwarding is merged into ConcreteDecorator

    def draw(self) -> str:
        result = self._component.draw()
        print(f"draw() → {len(result)} chars")
        return result

    def resize(self, factor: float) -> None:
        self._component.resize(factor)  # pure forwarding for non-decorated operations
```
The trade-off: if a second ConcreteDecorator is needed later, the forwarding boilerplate will be duplicated until the abstract base is extracted. Use a comment to signal the intent: `# Single decorator — abstract base omitted; extract if a second decorator is added`.
---
## Language-Specific Notes
### Python
- Use `ABC` and `@abstractmethod` for the Component interface; a subclass that leaves an abstract method unimplemented then fails with `TypeError` as soon as it is instantiated, rather than at the first call to the missing method.
- `super().method()` is the preferred forwarding call in a ConcreteDecorator. It goes to the abstract Decorator base, which then calls `self._component.method()`. Calling `self._component.method()` directly still reaches the next object in the chain, but it skips any shared logic in the abstract Decorator base and duplicates the forwarding code in every subclass.
- Python's duck typing means the Component interface does not need to be a formal ABC — decorators and components just need to have the same method names. Prefer explicit ABCs in production code for tooling support (type checking, IDE autocomplete).
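A short sketch shows what the ABC check catches and when. `TextComponent` and `IncompleteDecorator` are hypothetical names; the point is that defining the incomplete class succeeds and the error appears at instantiation.

```python
from abc import ABC, abstractmethod


class TextComponent(ABC):
    @abstractmethod
    def draw(self) -> str: ...

    @abstractmethod
    def resize(self, factor: float) -> None: ...


class IncompleteDecorator(TextComponent):
    """Forgets resize(): the class statement is accepted, instantiation is not."""

    def __init__(self, component: TextComponent) -> None:
        self._component = component

    def draw(self) -> str:
        return self._component.draw()


try:
    IncompleteDecorator(None)
    error = None
except TypeError as exc:
    error = str(exc)  # names the missing abstract method
```

A duck-typed decorator with the same omission would only fail later, when some client finally calls `resize`.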
### Java / C#
- Use an `interface` for the Component (not an abstract class) to avoid the single-inheritance constraint.
- The abstract Decorator base class implements the interface and holds the `component` field.
- ConcreteDecorators `extend` the abstract Decorator, not the interface directly.
### C++ (GoF style)
- `VisualComponent` is an abstract base class with virtual methods.
- `Decorator` subclasses `VisualComponent`, holds `VisualComponent* _component`.
- `Decorator::Draw()` implements `{ _component->Draw(); }` — pure forwarding.
- ConcreteDecorators override `Draw()` and call `Decorator::Draw()` explicitly rather than `_component->Draw()`: in the GoF code `_component` is private to `Decorator`, and this keeps the forwarding in one place.
### JavaScript / TypeScript
- In TypeScript, define the Component as an `interface`.
- The abstract Decorator class `implements` the interface and holds a `protected component` field.
- Use `protected` rather than `private` for `component` in the abstract Decorator so ConcreteDecorators can access it directly if needed (though using `super.method()` is preferred).
---
## Known Uses (from GoF)
- **InterViews, ET++ (UI toolkits):** Graphical embellishments on widgets — borders, scroll bars, drop shadows. The `DebuggingGlyph` in InterViews prints debugging information before and after forwarding layout requests, useful for tracing layout behavior.
- **ET++ streams:** `StreamDecorator` for compression (`CompressingStream`) and ASCII encoding (`ASCII7Stream`) — see Example 3 in the skill.
- **MacApp 3.0 / Bedrock:** Use a list of "adorner" objects attached to views — a variant of Decorator where the component maintains a list of decorations rather than each decorator wrapping the next. Necessary because the View class is heavyweight; full Decorator would be too expensive.
- **Java I/O (later than the GoF book):** `InputStream` → `BufferedInputStream`, `GZIPInputStream`, `DataInputStream`. This is the canonical real-world Decorator chain.
- **Python WSGI middleware (later than the GoF book):** Each middleware layer wraps the next application callable, which is structurally identical to Decorator.
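The WSGI case can be sketched without a server. The application callable follows the standard `(environ, start_response)` signature; `HeaderMiddleware`, the `X-Decorated` header, and `fake_start_response` are hypothetical names used only for this illustration.

```python
def app(environ, start_response):
    """The wrapped component: a minimal WSGI application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


class HeaderMiddleware:
    """Hypothetical middleware: adds one response header, then forwards."""

    def __init__(self, wrapped_app) -> None:
        self._app = wrapped_app

    def __call__(self, environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            headers = list(headers) + [("X-Decorated", "yes")]
            return start_response(status, headers, exc_info)

        return self._app(environ, patched_start_response)


captured = {}


def fake_start_response(status, headers, exc_info=None):
    # Stand-in for the server side of the protocol.
    captured["status"] = status
    captured["headers"] = headers


body = b"".join(HeaderMiddleware(app)({}, fake_start_response))
```

The middleware and the application share one callable interface, so the server cannot tell whether it holds the bare application or a stack of wrappers around it.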
Choose the right creational design pattern (Abstract Factory, Builder, Factory Method, Prototype, or Singleton) for an object creation problem. Use when some...
---
name: creational-pattern-selector
description: |
  Choose the right creational design pattern (Abstract Factory, Builder, Factory Method, Prototype, or Singleton) for an object creation problem. Use when someone asks "which factory pattern should I use?", "should I use Abstract Factory or Factory Method?", "how do I create objects without specifying the concrete class?", "I need a factory but don't know which one", or "how do I make object creation flexible?" Also use when you see hard-coded `new ConcreteClass()` calls scattered throughout code, when product families must be swapped at runtime, when object construction is too complex for a single constructor, or when objects should be created by cloning a prototype. Compares patterns using two parameterization strategies — subclassing vs object composition — with trade-offs and an evolution path from Factory Method to more flexible alternatives.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/design-patterns-gof/skills/creational-pattern-selector
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
  - design-pattern-selector
source-books:
  - id: design-patterns-gof
    title: "Design Patterns: Elements of Reusable Object-Oriented Software"
    authors: ["Erich Gamma", "Richard Helm", "Ralph Johnson", "John Vlissides"]
    chapters: [3]
tags: [design-patterns, object-oriented, gof, creational, abstract-factory, builder, factory-method, prototype, singleton]
execution:
  tier: 1
  mode: full
  inputs:
    - type: none
      description: "Description of an object creation problem — concrete class references, construction complexity, or any creation-related design pain"
  tools-required: [TodoWrite]
  tools-optional: [Read, Grep]
  mcps-required: []
  environment: "Any agent environment. If a codebase exists, scanning it provides additional context but is not required."
---
# Creational Pattern Selector
## When to Use
You have an object creation problem that needs a structural solution. This skill applies when:
- Code hard-codes `new ConcreteClass()` calls in places where the class type should be variable
- You need to create families of related objects that must be used together consistently
- Object construction is complex — multi-step, multi-part, or produces different representations
- You want to create objects by copying a known configuration rather than constructing from scratch
- A class must have exactly one instance with globally accessible state
- You've been told to "use a creational pattern" but need help picking which one
Before starting, verify you have a creation problem — not a structural or behavioral one. If you're not sure whether it's a creation problem at all, invoke the `design-pattern-selector` skill first, which will classify the problem and route here if appropriate.
## Context & Input Gathering
### Required Context
- **What is being created:** What type of object? Is it a product, a complex structure, a family of related objects, or a resource that should be shared?
- **What is inflexible now:** Where are concrete class names hard-coded? What happens when a new variant is needed?
- **Who calls the creation code:** Is it a framework, an application, a client class?
### Useful Context (gather from codebase if available)
- Search for `new ConcreteClass()` calls scattered across multiple files — indicates Factory Method or Abstract Factory
- Search for telescoping constructors (many parameters or many overloads) — indicates Builder
- Look for `clone()` or copy-constructor patterns — indicates Prototype already in use or needed
- Look for global variables or static accessor patterns managing a shared object — indicates Singleton
---
## Process
### Step 1: Frame the Creation Problem
**ACTION:** Use `TodoWrite` to track steps, then articulate what aspect of creation is inflexible.
**WHY:** Creational patterns solve five distinct problems. The right pattern becomes obvious once you identify which of the five matches your situation. Without this framing, patterns get conflated (e.g., using Builder when Factory Method would suffice).
```
TodoWrite:
- [ ] Step 1: Frame the creation problem
- [ ] Step 2: Identify the parameterization strategy
- [ ] Step 3: Check the five applicability checklists
- [ ] Step 4: Consider the evolution path
- [ ] Step 5: Produce scored recommendation with trade-offs
```
Capture these elements:
| Element | Question |
|---------|----------|
| **Product type** | Single object, family of related objects, or complex composite object? |
| **Variation axis** | What needs to vary — the concrete class, the construction steps, or the state? |
| **Flexibility timing** | Must creation be flexible at compile-time or at run-time? |
| **Uniqueness constraint** | Must the system allow only one instance? |
---
### Step 2: Identify the Parameterization Strategy
**ACTION:** Determine which of the two fundamental strategies applies to this problem.
**WHY:** The GoF Discussion of Creational Patterns identifies two recurring ways to make a system independent of how objects are created. Understanding which strategy you're using eliminates half the candidate patterns immediately and reveals the underlying design commitment you're making.
**Strategy 1 — Subclass-based (class parameterization):**
The class that creates objects is subclassed to change what gets created. The creator class calls a virtual factory method; subclasses override it to return different product types.
- **Pattern:** Factory Method
- **Commitment:** Adding a new product variant means adding a new subclass of the creator
- **Risk:** Subclass proliferation — changing the product can cascade to changing the creator's class hierarchy
- **When appropriate:** You're building a framework where the instantiation hook is the only customization needed; creation logic is confined to one method
**Strategy 2 — Composition-based (object parameterization):**
An external object is responsible for knowing which concrete classes to create. That object is passed into the creator as a parameter or configured at startup.
- **Patterns:** Abstract Factory, Builder, Prototype
- **Commitment:** Adding a new product variant means creating a new factory/builder/prototype object — no subclassing of the creator required
- **Advantage:** More flexible — the creator is entirely unaware of concrete product classes
- **When appropriate:** You need run-time flexibility, multiple product families, or want to decouple creation from the creator's class hierarchy entirely
**Output of this step:** Identify which strategy fits. If the flexibility is needed at run-time or affects multiple product types, lean toward composition-based (Strategy 2). If a single virtual method suffices, consider Factory Method (Strategy 1).
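The two strategies can be contrasted in a minimal sketch. All class names are hypothetical, and passing a class object as the factory is one simple way to do object parameterization in Python, not the only one.

```python
class Document:
    kind = "generic"


class SpreadsheetDocument(Document):
    kind = "spreadsheet"


# Strategy 1: subclass-based. The creator is subclassed to change the product.
class Application:
    def create_document(self) -> Document:  # the factory method
        return Document()

    def new_document(self) -> str:
        return f"opened {self.create_document().kind}"


class SpreadsheetApplication(Application):
    def create_document(self) -> Document:
        return SpreadsheetDocument()


# Strategy 2: composition-based. The creator is handed an object that creates.
class ConfigurableApplication:
    def __init__(self, document_factory) -> None:
        self._document_factory = document_factory

    def new_document(self) -> str:
        return f"opened {self._document_factory().kind}"


by_subclass = SpreadsheetApplication().new_document()
by_composition = ConfigurableApplication(SpreadsheetDocument).new_document()
```

Both calls yield the same result. The difference is where the variation lives: in a new `Application` subclass for Strategy 1, or in the argument passed at construction for Strategy 2.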
---
### Step 3: Check the Five Applicability Checklists
**ACTION:** Run the problem against each pattern's applicability criteria. Mark which conditions are satisfied.
**WHY:** Each creational pattern has a precise set of conditions that distinguish it from the others. Checking all five prevents the common mistake of picking the most familiar pattern rather than the most fitting one. A pattern with 3 of 3 matching conditions beats one with 2 of 4.
Detailed applicability criteria are in `references/creational-comparison.md`. Summary checklists:
**Abstract Factory** — use when ALL of these hold:
- [ ] The system creates families of related or dependent objects
- [ ] The system must be configured with one of multiple product families
- [ ] You need to enforce that products from one family are used together (compatibility constraint)
- [ ] You want to reveal only product interfaces, not implementations
*Example trigger:* UI toolkit supporting multiple look-and-feel standards — Motif or Presentation Manager. Each look-and-feel is a product family (windows, scrollbars, buttons). Swapping the factory swaps the entire family consistently.
**Builder** — use when ALL of these hold:
- [ ] Object construction is complex: multi-step, multi-part assembly
- [ ] The same construction process should produce different representations
- [ ] The construction algorithm must be independent of the parts and how they're assembled
*Example trigger:* A document converter that reads one format (RTF) and can produce multiple output formats (ASCII text, TeX, interactive widget). The parsing algorithm (Director) is stable; only the converter object (Builder) changes.
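A compact sketch of the converter example: one director function drives two builders. The token format, the method names, and the escaping rule in `TeXConverter` are simplifying assumptions, not a real RTF parser.

```python
class TextConverter:
    """Builder interface (hypothetical, simplified)."""

    def __init__(self) -> None:
        self._parts = []

    def add_text(self, text: str) -> None:
        raise NotImplementedError

    def end_paragraph(self) -> None:
        raise NotImplementedError

    def result(self) -> str:
        return "".join(self._parts)


class ASCIIConverter(TextConverter):
    def add_text(self, text: str) -> None:
        self._parts.append(text)

    def end_paragraph(self) -> None:
        self._parts.append("\n")


class TeXConverter(TextConverter):
    def add_text(self, text: str) -> None:
        self._parts.append(text.replace("%", "\\%"))  # escape TeX comment char

    def end_paragraph(self) -> None:
        self._parts.append("\n\n")


def parse(tokens, builder: TextConverter) -> str:
    """Director: one stable traversal that works with any builder."""
    for kind, value in tokens:
        if kind == "text":
            builder.add_text(value)
        elif kind == "par":
            builder.end_paragraph()
    return builder.result()


tokens = [("text", "50% done"), ("par", None), ("text", "end")]
ascii_out = parse(tokens, ASCIIConverter())
tex_out = parse(tokens, TeXConverter())
```

`parse` never changes when a new output format is added; only a new `TextConverter` subclass is written.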
**Factory Method** — use when ANY of these hold:
- [ ] A class cannot anticipate the class of objects it must create
- [ ] A class wants its subclasses to specify the objects it creates
- [ ] Classes delegate creation responsibility to helper subclasses and you want to localize the knowledge of which helper is used
*Example trigger:* An application framework that manages Document objects, but the framework cannot know which Document subclass the application will use. The framework calls a virtual `CreateDocument()` factory method; application subclasses override it.
**Prototype** — use when the first condition holds AND at least one of the remaining three:
- [ ] The system must be independent of how its products are created AND
- [ ] Classes to instantiate are specified at run-time (e.g., dynamic loading) OR
- [ ] You want to avoid building a parallel creator class hierarchy for each product type OR
- [ ] Instances can have only a few different combinations of state — cloning configured instances is easier than manual construction each time
*Example trigger:* A graphical editor where each tool creates graphic objects. Instead of subclassing the tool for each graphic type (subclass explosion), each tool is parameterized with a prototype it clones when the user draws.
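The graphical editor trigger can be sketched with `copy.deepcopy` as the cloning mechanism. `Graphic`, `GraphicTool`, and the `position` attribute are hypothetical names for illustration.

```python
import copy


class Graphic:
    def __init__(self, shape: str, color: str) -> None:
        self.shape = shape
        self.color = color
        self.position = (0, 0)

    def clone(self) -> "Graphic":
        return copy.deepcopy(self)


class GraphicTool:
    """One tool class, parameterized by the prototype it clones."""

    def __init__(self, prototype: Graphic) -> None:
        self._prototype = prototype

    def draw_at(self, x: int, y: int) -> Graphic:
        graphic = self._prototype.clone()
        graphic.position = (x, y)  # only the clone is modified
        return graphic


prototype = Graphic("circle", "red")
red_circle_tool = GraphicTool(prototype)
first = red_circle_tool.draw_at(1, 2)
second = red_circle_tool.draw_at(3, 4)
```

A blue square tool is another `GraphicTool` instance with a different prototype, so no tool subclass is needed per graphic type.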
**Singleton** — use when BOTH of these hold:
- [ ] There must be exactly one instance of the class, accessible to all clients
- [ ] The sole instance may need to be extensible by subclassing, and clients must be able to use the extended instance without modifying their code
*Example trigger:* A maze game where all game objects need access to the Maze — without passing the maze as a parameter everywhere. Singleton ensures one maze exists and provides a well-known access point.
---
### Step 4: Consider the Evolution Path
**ACTION:** Assess where the system is now and where it needs to go. Map that against the natural evolution path.
**WHY:** The GoF authors state explicitly that designs often start with Factory Method and evolve toward Abstract Factory, Builder, or Prototype as flexibility needs grow. Starting with the most flexible pattern when Factory Method would suffice is over-engineering. But starting with Factory Method when Abstract Factory is needed leads to a painful refactor. Knowing the trajectory helps avoid both mistakes.
**Evolution path:**
```
Factory Method
|
| → When you need multiple product families → Abstract Factory
|
| → When construction is complex/multi-step → Builder
|
| → When subclass proliferation is the pain → Prototype
|
| → When you need one shared instance → Singleton (orthogonal — often combined with the above)
```
**Trajectory questions:**
1. Will the number of product variants grow? If yes, how — more types of the same product (stay with Factory Method), or multiple coordinated families (move to Abstract Factory)?
2. Will construction complexity grow? If construction is currently simple but expected to gain optional parts, configure-then-build patterns, or multiple output representations → Builder
3. Is subclass proliferation already a problem, or is it a future risk? Prototype is the aggressive solution; Factory Method may be enough early on
4. Singleton is often combined with, not instead of, the other patterns — e.g., a Singleton AbstractFactory manages the product family for the whole application
**Maze game comparison (from the GoF):**
The `MazeGame::CreateMaze()` function is the baseline: it hard-codes `new Room`, `new Wall`, `new Door`. To support enchanted mazes (EnchantedRoom, DoorNeedingSpell), each pattern applies differently:
| Pattern | How CreateMaze changes |
|---------|----------------------|
| Factory Method | MazeGame calls virtual `MakeRoom()`, `MakeWall()`, `MakeDoor()`. EnchantedMazeGame overrides them. |
| Abstract Factory | CreateMaze is passed a MazeFactory parameter. EnchantedMazeFactory returns enchanted parts. |
| Builder | CreateMaze is passed a MazeBuilder object. Builder constructs the maze incrementally. |
| Prototype | CreateMaze is passed prototypical Room, Wall, Door instances it clones. |
| Singleton | MazeFactory is a Singleton — one factory per application, globally accessible. |
---
### Step 5: Produce Recommendation
**ACTION:** Deliver a structured recommendation in the format below, based on the steps above.
**WHY:** A clear recommendation with explicit trade-offs enables the developer to commit confidently or challenge the reasoning. Surfacing the runner-up pattern and why it was ruled out prevents second-guessing after implementation starts.
**Recommendation format:**
```
## Creational Pattern Recommendation: [Pattern Name]
**Parameterization strategy:** [Subclass-based / Composition-based]
**Why this pattern:** [1–2 sentences: connect the pattern's applicability conditions to the specific problem]
**What it decouples:** [What concrete class references are eliminated from the caller]
**What you gain:** [Key benefit — flexibility, consistency, construction control, etc.]
**Key trade-off:** [Cost — added classes, complexity, or constraint]
**Evolution note:** [When to migrate to a more flexible pattern, if applicable]
**Alternative considered:** [Runner-up] — [Why it was not chosen]
**Compatible with:** [Other patterns that combine well, e.g., Singleton + Abstract Factory]
```
---
## Examples
### Example 1: Cross-Platform UI Toolkit
**Scenario:** A UI toolkit must support two look-and-feel standards (Motif and a custom dark theme). Both have windows, buttons, and scrollbars. Clients should be able to switch the entire look-and-feel without knowing which concrete widget classes they're using.
**Trigger:** "We need to support multiple themes and the theme selection should be swappable at startup without touching client code."
**Process:**
- Step 2: Composition-based — the factory object knows which concrete widget classes to create; clients are parameterized with the factory
- Step 3: Abstract Factory matches all four conditions — related family (window + button + scrollbar), multiple families (Motif + dark), compatibility constraint (all widgets from same family), hide implementations
- Step 4: Evolution — if more themes are expected, Abstract Factory scales cleanly (add a new ConcreteFactory). Factory Method would require subclassing the client for each theme combination.
**Output:**
```
## Creational Pattern Recommendation: Abstract Factory
**Parameterization strategy:** Composition-based
**Why this pattern:** There are multiple coordinated product families (widget sets per look-and-feel),
and clients must use widgets from one consistent family. Abstract Factory encapsulates this
family constraint as a factory object passed to clients.
**What it decouples:** All concrete widget class names from client code. Clients
only reference abstract Widget interfaces and the WidgetFactory interface.
**What you gain:** Swapping the entire look-and-feel is a single factory substitution.
The constraint that Motif widgets go with Motif scrollbars is enforced automatically.
**Key trade-off:** Adding a new widget type (e.g., a tooltip) requires changing the
AbstractFactory interface and all ConcreteFactory implementations.
**Evolution note:** If themes are runtime-configurable (loaded from disk, user preference),
combine with Prototype — register prototypical widget instances per theme.
**Alternative considered:** Factory Method — would require subclassing the client application
for each theme, which does not scale when themes are selected at run-time.
```
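The recommendation above can be sketched in C++ as follows. This is a minimal illustration, not the toolkit's actual design: the class names (`WidgetFactory`, `MotifWidgetFactory`, `DarkWidgetFactory`, `BuildToolbar`) are hypothetical, and windows are omitted to keep the sketch short.

```cpp
#include <memory>
#include <string>

// Abstract products (hypothetical names, for illustration only).
struct Button {
    virtual ~Button() = default;
    virtual std::string Describe() const = 0;
};
struct ScrollBar {
    virtual ~ScrollBar() = default;
    virtual std::string Describe() const = 0;
};

// One concrete product per family.
struct MotifButton : Button {
    std::string Describe() const override { return "MotifButton"; }
};
struct MotifScrollBar : ScrollBar {
    std::string Describe() const override { return "MotifScrollBar"; }
};
struct DarkButton : Button {
    std::string Describe() const override { return "DarkButton"; }
};
struct DarkScrollBar : ScrollBar {
    std::string Describe() const override { return "DarkScrollBar"; }
};

// Abstract factory: one creation method per product in the family.
struct WidgetFactory {
    virtual ~WidgetFactory() = default;
    virtual std::unique_ptr<Button> MakeButton() const = 0;
    virtual std::unique_ptr<ScrollBar> MakeScrollBar() const = 0;
};

// Each concrete factory returns widgets from exactly one family,
// which is what enforces the consistency constraint.
struct MotifWidgetFactory : WidgetFactory {
    std::unique_ptr<Button> MakeButton() const override {
        return std::make_unique<MotifButton>();
    }
    std::unique_ptr<ScrollBar> MakeScrollBar() const override {
        return std::make_unique<MotifScrollBar>();
    }
};
struct DarkWidgetFactory : WidgetFactory {
    std::unique_ptr<Button> MakeButton() const override {
        return std::make_unique<DarkButton>();
    }
    std::unique_ptr<ScrollBar> MakeScrollBar() const override {
        return std::make_unique<DarkScrollBar>();
    }
};

// Client code names no concrete widget class; the theme is chosen by
// whichever factory the caller passes in.
std::string BuildToolbar(const WidgetFactory& factory) {
    auto button = factory.MakeButton();
    auto bar = factory.MakeScrollBar();
    return button->Describe() + "+" + bar->Describe();
}
```

Switching themes means passing a different factory to `BuildToolbar`. Adding a tooltip, by contrast, would touch `WidgetFactory` and both concrete factories, which is the trade-off named in the recommendation.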
---
### Example 2: Document Converter with Multiple Output Formats
**Scenario:** An application reads one input format (Markdown) and must produce different output representations: plain text, HTML, and a rich text widget for in-app editing. The parsing algorithm is stable; only the output format changes.
**Trigger:** "The same parsing logic needs to produce completely different output formats. Adding a new format shouldn't require touching the parser."
**Process:**
- Step 2: Composition-based — a converter object (Builder) is passed to the parser (Director)
- Step 3: Builder matches — complex multi-part output, same construction process (parsing), multiple representations (plain text vs HTML vs widget)
- Step 4: Evolution path — if the number of formats grows, each is a new ConcreteBuilder, zero changes to the parser
**Output:**
```
## Creational Pattern Recommendation: Builder
**Parameterization strategy:** Composition-based
**Why this pattern:** The parsing algorithm (Director) is stable and orthogonal to
how the output format is assembled. Builder separates the algorithm from the
representation by providing the parser with a converter interface (Builder) that
each output format implements independently.
**What it decouples:** The parser from all output format classes. The parser calls
BuildParagraph(), BuildHeading() — it never references HTMLConverter or TextConverter.
**What you gain:** New output format = new ConcreteBuilder class. Zero changes to parser.
Formats can have internal state (e.g., tracking open tags in HTML) hidden from the parser.
**Key trade-off:** Builder requires a Director–Builder collaboration; it adds more
structure than directly parameterizing with a conversion function. Justified when
construction is genuinely multi-step with internal state.
**Evolution note:** If the parser needs to support multiple input formats too,
Abstract Factory could manage the pairing of reader + builder per format combination.
**Alternative considered:** Abstract Factory — suited for families of related objects,
not for a single complex object assembled step-by-step from sequential operations.
```
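A small C++ sketch of this recommendation, under stated assumptions: `Converter`, `Parse`, and `ToHtml` are hypothetical names, and the "parser" is a deliberately tiny stand-in that treats lines starting with `# ` as headings and everything else as paragraphs. Only the `BuildParagraph()`/`BuildHeading()` method names and the `HTMLConverter`/`TextConverter` class names come from the text above.

```cpp
#include <string>
#include <vector>

// Builder interface that the parser (Director) talks to.
class Converter {
public:
    virtual ~Converter() = default;
    virtual void BuildHeading(const std::string& text) = 0;
    virtual void BuildParagraph(const std::string& text) = 0;
};

// ConcreteBuilder: keeps its output as internal state hidden from the parser.
class HTMLConverter : public Converter {
public:
    void BuildHeading(const std::string& text) override {
        out_ += "<h1>" + text + "</h1>";
    }
    void BuildParagraph(const std::string& text) override {
        out_ += "<p>" + text + "</p>";
    }
    const std::string& GetResult() const { return out_; }
private:
    std::string out_;
};

// A second ConcreteBuilder: same calls, different representation.
class TextConverter : public Converter {
public:
    void BuildHeading(const std::string& text) override { out_ += text + "\n"; }
    void BuildParagraph(const std::string& text) override { out_ += text + "\n"; }
    const std::string& GetResult() const { return out_; }
private:
    std::string out_;
};

// Director: simplified stand-in for a Markdown parser. It references only
// the Converter interface, never a concrete output format.
void Parse(const std::vector<std::string>& lines, Converter& converter) {
    for (const std::string& line : lines) {
        if (line.rfind("# ", 0) == 0) {  // line starts with "# "
            converter.BuildHeading(line.substr(2));
        } else {
            converter.BuildParagraph(line);
        }
    }
}

// Client: picks the builder, runs the director, retrieves the product.
std::string ToHtml(const std::vector<std::string>& lines) {
    HTMLConverter html;
    Parse(lines, html);
    return html.GetResult();
}
```

Note that the client, not the parser, retrieves the result from the concrete builder; that keeps `Parse` independent of what each format's product looks like.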
---
### Example 3: Game Maze with Variant Room Types
**Scenario:** A maze game's `CreateMaze()` method directly instantiates `Room`, `Wall`, and `Door` with `new`. The team wants to introduce an enchanted maze variant with `EnchantedRoom` and `DoorNeedingSpell` without duplicating the construction logic.
**Trigger:** "We want to reuse the maze creation algorithm for different maze variants, but it's hard-coded to specific classes."
**Process:**
- Step 2: Depends on flexibility needed. If one override point suffices (just override a few virtual methods), Factory Method is enough. If the entire set of maze component types must vary together, Abstract Factory is cleaner.
- Step 3: Factory Method matches — a class (MazeGame) cannot anticipate which Room subclass to create; subclasses should specify it. Abstract Factory also matches if the component types must be coordinated as a family.
- Step 4: Start with Factory Method — it only requires adding virtual `MakeRoom()`, `MakeWall()`, `MakeDoor()` to MazeGame and overriding them in EnchantedMazeGame. Evolve to Abstract Factory only if maze variants proliferate and need to be swappable at runtime.
**Output:**
```
## Creational Pattern Recommendation: Factory Method (evolve to Abstract Factory if needed)
**Parameterization strategy:** Subclass-based (Factory Method) — composition-based
(Abstract Factory) when variants must be runtime-swappable
**Why this pattern:** MazeGame knows when to create rooms and doors but should not
decide which concrete type. Factory Method adds virtual creation methods; EnchantedMazeGame
overrides them. The construction algorithm in CreateMaze() is unchanged.
**What it decouples:** MazeGame from the concrete component classes Room, Wall, Door.
CreateMaze() calls MakeRoom() instead of new Room().
**What you gain:** Adding a new maze variant is one new subclass of MazeGame with
three method overrides. No duplication of the construction algorithm.
**Key trade-off:** Every new maze variant requires a new MazeGame subclass. If variants
must be selected at runtime (not compile-time), this becomes awkward — that signals
the need to migrate to Abstract Factory.
**Evolution note:** When maze types need to be chosen at startup or loaded from config,
replace the subclass approach with an AbstractMazeFactory passed to CreateMaze(). The
factory object becomes a Singleton if only one maze type is active per application run.
**Alternative considered:** Abstract Factory — more flexible but requires an extra
factory object hierarchy from the start; over-engineered if compile-time subclassing is sufficient.
```
---
## Key Principles
- **Start with Factory Method, evolve as needed.** Designs typically start with Factory Method (simplest) and evolve toward Abstract Factory, Builder, or Prototype as flexibility needs emerge. Don't over-engineer from day one. WHY: The GoF authors document this evolution path explicitly — premature flexibility adds complexity without value.
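- **Two strategies, one question.** The fundamental choice is: parameterize by subclassing (Factory Method) or by object composition (Abstract Factory/Builder/Prototype)? If you need runtime swapping, you need composition. WHY: This single fork narrows the field immediately: subclassing leaves only Factory Method, while composition rules it out and leaves Abstract Factory, Builder, and Prototype (Singleton is orthogonal to the choice).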
- **Factory Method cascading is the key warning sign.** If creating a new product variant requires a new creator subclass, AND that creator is itself created by a factory method, the cascade has begun. This signals the need to evolve to Abstract Factory or Prototype. WHY: Cascade subclassing is the primary cost of Factory Method.
## Reference Files
| File | Contents |
|------|----------|
| `references/creational-comparison.md` | Full applicability, consequences, and trade-off table for all 5 creational patterns; maze game comparison; GraphicTool example |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-design-pattern-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/creational-comparison.md
# Creational Pattern Comparison Reference
> Source: Design Patterns: Elements of Reusable Object-Oriented Software (Gamma, Helm, Johnson, Vlissides), Chapter 3 — Creational Patterns + Discussion of Creational Patterns (pp. 83–132)
This file provides the detailed per-pattern applicability, consequences, and trade-off data used by the `creational-pattern-selector` skill.
---
## Overview: The Two Parameterization Strategies
Creational patterns all solve the same root problem: **hard-coding the concrete class names of objects that a system creates makes it inflexible**. When the class to instantiate is named explicitly (e.g., `new Room()`), changing it requires editing the code that does the creation — often a cascading change.
The GoF identifies two fundamental approaches to removing these concrete class references:
### Strategy 1: Subclass-Based (Class Parameterization)
Subclass the creator class and override a virtual factory method. The creator calls the virtual method to create products; subclasses decide the concrete type.
- **Only pattern:** Factory Method
- **Mechanism:** Inheritance. The type of product changes by changing the subclass of the creator.
- **Main drawback:** Creating a new product subclass often requires creating a new creator subclass. Subclass changes can cascade — if the creator is itself created by a factory method, you must also override its creator.
- **Best when:** The customization is simple (one virtual method), the variation is known at compile-time, and subclass proliferation is not a concern.
### Strategy 2: Composition-Based (Object Parameterization)
Define a separate "factory object" responsible for knowing which concrete classes to create. Pass this object into the creator as a parameter. The creator delegates instantiation entirely to the factory object.
- **Patterns:** Abstract Factory, Builder, Prototype
- **Mechanism:** Object composition. The type of product changes by passing a different factory/builder/prototype object — no subclassing of the creator needed.
- **Main advantage:** The creator is fully decoupled from concrete product classes. Run-time flexibility is inherent.
- **Best when:** Products must vary at run-time, multiple product families must be managed, or subclassing the creator is undesirable.
**Singleton** is orthogonal to both strategies: it governs instance count, not instantiation mechanism. It is commonly combined with Abstract Factory (to ensure one factory per application).
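The two strategies can be contrasted in a few lines of C++. This is an illustrative sketch only; every name in it (`Product`, `Creator`, `FancyCreator`, `ProductFactory`, `ComposedCreator`) is hypothetical.

```cpp
#include <memory>
#include <string>

struct Product {
    virtual ~Product() = default;
    virtual std::string Name() const = 0;
};
struct PlainProduct : Product {
    std::string Name() const override { return "plain"; }
};
struct FancyProduct : Product {
    std::string Name() const override { return "fancy"; }
};

// Strategy 1: subclass-based. Changing the product means subclassing
// the creator and overriding the factory method.
class Creator {
public:
    virtual ~Creator() = default;
    std::string Run() const { return MakeProduct()->Name(); }
protected:
    virtual std::unique_ptr<Product> MakeProduct() const {
        return std::make_unique<PlainProduct>();
    }
};
class FancyCreator : public Creator {
protected:
    std::unique_ptr<Product> MakeProduct() const override {
        return std::make_unique<FancyProduct>();
    }
};

// Strategy 2: composition-based. The creator stays a single class and is
// handed a factory object that knows the concrete type.
struct ProductFactory {
    virtual ~ProductFactory() = default;
    virtual std::unique_ptr<Product> Make() const = 0;
};
struct PlainFactory : ProductFactory {
    std::unique_ptr<Product> Make() const override {
        return std::make_unique<PlainProduct>();
    }
};
struct FancyFactory : ProductFactory {
    std::unique_ptr<Product> Make() const override {
        return std::make_unique<FancyProduct>();
    }
};
class ComposedCreator {
public:
    std::string Run(const ProductFactory& factory) const {
        return factory.Make()->Name();
    }
};
```

With the first strategy the product type is fixed by which creator class is compiled in; with the second, the same `ComposedCreator` object can produce different products depending on the factory it receives at run-time.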
---
## Pattern 1: Factory Method
**Intent:** Define an interface for creating an object, but let subclasses decide which class to instantiate. Factory Method lets a class defer instantiation to subclasses.
**Also known as:** Virtual Constructor
### Applicability
Use Factory Method when:
1. A class cannot anticipate the class of objects it must create — it knows *when* to create but not *what* to create
2. A class wants its subclasses to specify the objects it creates
3. Classes delegate responsibility to one of several helper subclasses, and you want to localize the knowledge of which helper subclass is the delegate
### Consequences
**Benefits:**
- Eliminates the need to bind application-specific classes into your code — code deals only with the Product interface
- Provides hooks for subclasses: creating objects in a separate factory method is always more flexible than creating them directly
- Connects parallel class hierarchies: factory methods are often called by a Creator that also has a parallel Product hierarchy (e.g., Figure and Manipulator)
**Liabilities:**
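- Clients may need to subclass the Creator class just to create a particular ConcreteProduct, which gives them another point of evolution to manage
**Implementation note:** Factory methods come in two major varieties:
1. Abstract Creator: declares the factory method with no default implementation, forcing subclasses to define it
2. Concrete Creator with a default: provides a default implementation, so subclassing is optional and used only when a different product is needed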
### Key Trade-offs
| | |
|--|--|
| **Flexibility** | Compile-time (via subclassing). Run-time flexibility requires additional mechanism. |
| **New variant cost** | One new Creator subclass + one new Product subclass |
| **Creator coupling** | Creator is decoupled from ConcreteProduct but coupled to its own class hierarchy |
| **Complexity introduced** | Low — only a new virtual method and override |
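The parallel-hierarchy benefit (Figure and Manipulator) and the abstract-creator variety can be sketched together. This is a simplified illustration: the class names follow the Figure/Manipulator example mentioned above, while `Kind()` and `ManipulatorFor()` are hypothetical helpers added for the demonstration.

```cpp
#include <memory>
#include <string>

// Product hierarchy: a manipulator handles interaction for one figure type.
struct Manipulator {
    virtual ~Manipulator() = default;
    virtual std::string Kind() const = 0;
};
struct LineManipulator : Manipulator {
    std::string Kind() const override { return "LineManipulator"; }
};
struct TextManipulator : Manipulator {
    std::string Kind() const override { return "TextManipulator"; }
};

// Creator hierarchy, abstract-creator variety: the factory method is pure
// virtual, so every Figure subclass must say which manipulator it pairs with.
struct Figure {
    virtual ~Figure() = default;
    virtual std::unique_ptr<Manipulator> CreateManipulator() const = 0;
};
struct LineFigure : Figure {
    std::unique_ptr<Manipulator> CreateManipulator() const override {
        return std::make_unique<LineManipulator>();
    }
};
struct TextFigure : Figure {
    std::unique_ptr<Manipulator> CreateManipulator() const override {
        return std::make_unique<TextManipulator>();
    }
};

// Client code sees only the abstract types; the pairing between the two
// hierarchies lives in the factory method overrides.
std::string ManipulatorFor(const Figure& figure) {
    return figure.CreateManipulator()->Kind();
}
```

Giving `Figure::CreateManipulator()` a default body instead of `= 0` would turn this into the concrete-creator variety, where subclasses override only when the default product is wrong for them.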
---
## Pattern 2: Abstract Factory
**Intent:** Provide an interface for creating families of related or dependent objects without specifying their concrete classes.
**Also known as:** Kit
### Applicability
Use Abstract Factory when:
1. A system should be independent of how its products are created, composed, and represented
2. A system should be configured with one of multiple families of products
3. A family of related product objects is designed to be used together, and you need to enforce this constraint
4. You want to provide a class library of products, and you want to reveal just their interfaces, not their implementations
### Consequences
**Benefits:**
1. **Isolates concrete classes** — clients manipulate instances through abstract interfaces; product class names never appear in client code
2. **Makes exchanging product families easy** — the entire family changes when the ConcreteFactory changes, since only one ConcreteFactory is used per application
3. **Promotes consistency among products** — products from one family are always used together; the factory enforces this automatically
**Liabilities:**
4. **Supporting new kinds of products is difficult** — adding a new product type (e.g., a new widget) requires changing the AbstractFactory interface and *all* ConcreteFactory subclasses
### Implementation Notes
- Typically implemented as a Singleton — only one ConcreteFactory per application run
- Factories can be parameterized (accept a product type identifier) to avoid the interface-change problem, at the cost of type safety
- Prototype-based factories can be used to avoid the parallel ConcreteFactory hierarchy: register prototypical product instances in a single factory
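The parameterized-factory note can be illustrated with a short sketch. The names here (`Widget`, `WidgetKind`, `ParameterizedFactory`) are hypothetical, and an enum is only one possible choice of product identifier.

```cpp
#include <memory>
#include <string>

struct Widget {
    virtual ~Widget() = default;
    virtual std::string Name() const = 0;
};
struct Button : Widget {
    std::string Name() const override { return "Button"; }
};
struct Tooltip : Widget {
    std::string Name() const override { return "Tooltip"; }
};

// Hypothetical product identifier.
enum class WidgetKind { kButton, kTooltip };

class ParameterizedFactory {
public:
    virtual ~ParameterizedFactory() = default;

    // A single Make() replaces MakeButton(), MakeTooltip(), and so on.
    // Adding a kind no longer changes the method set, but every product
    // comes back as the common base type, so callers lose static type
    // information and must downcast to reach subclass-specific operations.
    virtual std::unique_ptr<Widget> Make(WidgetKind kind) const {
        switch (kind) {
            case WidgetKind::kButton:
                return std::make_unique<Button>();
            case WidgetKind::kTooltip:
                return std::make_unique<Tooltip>();
        }
        return nullptr;  // unrecognized identifier
    }
};
```

The `nullptr` fallback shows the cost in type safety: an unsupported identifier becomes a run-time condition the caller has to check, where a missing `MakeTooltip()` method would have been a compile-time error.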
### Key Trade-offs
| | |
|--|--|
| **Flexibility** | Run-time — swap the factory object to swap the entire product family |
| **New family cost** | One new ConcreteFactory class implementing all creation methods |
| **New product type cost** | High — requires changing AbstractFactory interface and all ConcreteFactories |
| **Complexity introduced** | Medium-high — factory object hierarchy parallel to product hierarchy |
---
## Pattern 3: Builder
**Intent:** Separate the construction of a complex object from its representation so that the same construction process can create different representations.
### Applicability
Use Builder when:
1. The algorithm for creating a complex object should be independent of the parts that make up the object and how they're assembled
2. The construction process must allow different representations for the object that's constructed
### Structure
- **Director:** Constructs the object using the Builder interface. Calls BuildPart() methods in sequence. Owns the construction algorithm.
- **Builder:** Abstract interface for creating parts of the Product object
- **ConcreteBuilder:** Implements the Builder interface; constructs and assembles parts; provides a GetResult() interface
- **Product:** The complex object under construction. ConcreteBuilder builds its internal representation.
Collaboration: Client creates a Director and configures it with a ConcreteBuilder. Director calls BuildPart() methods. Client retrieves the product from the ConcreteBuilder.
### Consequences
1. **Vary product's internal representation** — to change representation, define a new ConcreteBuilder; the Director is unchanged
2. **Isolate construction from representation** — modularity: the Builder encapsulates all code for creating and assembling the product; the Director knows nothing about the product's structure
3. **Finer control over the construction process** — unlike creational patterns that construct products in one shot, Builder builds step by step under the Director's control
### Key Trade-offs
| | |
|--|--|
| **Flexibility** | Run-time — configure Director with different ConcreteBuilders |
| **New representation cost** | One new ConcreteBuilder class |
| **When not to use** | When product construction is simple (single method call) — adds unnecessary structure |
| **Distinguishing from Abstract Factory** | Builder constructs complex objects step-by-step; Abstract Factory creates families of objects in one step. Builder returns the product at the end; Abstract Factory returns it immediately. |
| **Complexity introduced** | Medium — Director + Builder interface + ConcreteBuilder(s) + Product |
---
## Pattern 4: Prototype
**Intent:** Specify the kinds of objects to create using a prototypical instance, and create new objects by copying this prototype.
### Applicability
Use Prototype when a system should be independent of how its products are created, composed, and represented, **and** when at least one of the following holds:
1. The classes to instantiate are specified at run-time (e.g., dynamic loading)
2. To avoid building a class hierarchy of factories that parallels the class hierarchy of products
3. Instances of a class can have only a few different combinations of state — it's more convenient to install prototypes and clone them than to instantiate the class manually with the appropriate state each time
### Consequences
**Shared benefits with Abstract Factory and Builder:**
- Hides the concrete product classes from the client
- Lets a client work with application-specific classes without modification
**Prototype-specific benefits:**
1. **Add and remove products at run-time** — register/unregister prototypes dynamically; more flexible than other patterns for run-time variation
2. **Specify new objects by varying values** — define new kinds of objects by varying an existing object's state, without defining new classes
3. **Specify new objects by varying structure** — prototype complex composite structures (e.g., circuit diagrams) so they can be cloned and reused
4. **Reduce subclassing** — Factory Method produces a parallel Creator hierarchy for each Product; Prototype eliminates the need for Creator subclasses entirely
5. **Configure application with classes dynamically** — dynamically loaded classes can register prototype instances; application can request them from the prototype manager without knowing the class name
**Liabilities:**
- Each subclass of Prototype must implement a `Clone` operation — can be difficult for existing classes, and for classes with circular references or objects that do not support copying
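A prototype manager with run-time registration might look like the sketch below. It is an assumption-laden illustration: `Shape`, `Circle`, and `PrototypeRegistry` are hypothetical names, and `std::unique_ptr` ownership is a modern C++ choice, not something the source prescribes.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Shape {
    virtual ~Shape() = default;
    virtual std::unique_ptr<Shape> Clone() const = 0;
    virtual std::string Describe() const = 0;
};

struct Circle : Shape {
    explicit Circle(int r) : radius(r) {}
    // Clone copies the current state, so differently configured instances
    // act as different "kinds" of product without new classes.
    std::unique_ptr<Shape> Clone() const override {
        return std::make_unique<Circle>(*this);
    }
    std::string Describe() const override {
        return "circle r=" + std::to_string(radius);
    }
    int radius;
};

// Hypothetical prototype manager: prototypes are added and removed by name
// at run-time.
class PrototypeRegistry {
public:
    void Register(const std::string& name, std::unique_ptr<Shape> proto) {
        protos_[name] = std::move(proto);
    }
    void Unregister(const std::string& name) { protos_.erase(name); }

    // Returns a fresh copy of the named prototype, or nullptr if unknown.
    std::unique_ptr<Shape> Create(const std::string& name) const {
        auto it = protos_.find(name);
        if (it == protos_.end()) return nullptr;
        return it->second->Clone();
    }
private:
    std::map<std::string, std::unique_ptr<Shape>> protos_;
};
```

Registering `Circle(1)` as `"small"` and `Circle(10)` as `"large"` yields two product variants from one class, which is the "vary values instead of defining classes" benefit. `Circle` is trivially copyable here; classes that own other objects or contain circular references need a hand-written `Clone`, which is the liability noted above.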
### Key Trade-offs
| | |
|--|--|
| **Flexibility** | Run-time — swap prototype instances; can add/remove at run-time |
| **New variant cost** | Register a new prototype instance — no new class required |
| **Main constraint** | All product types must implement Clone, which usually has to be a deep copy for products with complex structure |
| **Reduces subclassing** | Eliminates parallel Creator hierarchy needed by Factory Method |
| **Complexity introduced** | Low–medium — prototype registry + Clone implementation on all products |
---
## Pattern 5: Singleton
**Intent:** Ensure a class only has one instance, and provide a global point of access to it.
### Applicability
Use Singleton when:
1. There must be exactly one instance of a class, and it must be accessible to clients from a well-known access point
2. The sole instance should be extensible by subclassing, and clients should be able to use an extended instance without modifying their code
### Consequences
1. **Controlled access to sole instance** — the Singleton class encapsulates the instance; it can have strict control over how and when clients access it
2. **Reduced name space** — improvement over global variables: avoids polluting the name space with global variables for sole instances
3. **Permits refinement of operations and representation** — the Singleton class may be subclassed; the application can be configured with an instance of an extended class
4. **Permits a variable number of instances** — the same mechanism can control any fixed number of instances, not just one
5. **More flexible than class operations** — class-level (static) methods cannot be virtual and cannot be overridden polymorphically; Singleton avoids this
### Key Trade-offs
| | |
|--|--|
| **Solves** | Instance count control + global access without global variable |
| **Common combination** | Abstract Factory as a Singleton — one factory object per application |
| **Common combination** | Singleton Facade — one subsystem access point per application |
| **Caution** | Overuse of Singleton creates hidden global state; makes unit testing harder |
| **Complexity introduced** | Low — one static accessor method + lazy initialization |
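One common way to get the "static accessor plus lazy initialization" shape in C++11 and later is a function-local static, sketched below. `Config` and its theme field are hypothetical; this is an alternative to the pointer-based accessor, not the form the GoF text uses.

```cpp
#include <string>

class Config {
public:
    // The function-local static is constructed on the first call to
    // Instance(). Since C++11 that initialization is guaranteed to be
    // thread-safe.
    static Config& Instance() {
        static Config instance;
        return instance;
    }

    // Copying would create a second instance, so it is disabled.
    Config(const Config&) = delete;
    Config& operator=(const Config&) = delete;

    void SetTheme(const std::string& theme) { theme_ = theme; }
    const std::string& Theme() const { return theme_; }

private:
    Config() = default;  // only Instance() can construct
    std::string theme_ = "motif";
};
```

The caution in the table applies directly: any code that calls `Config::Instance().SetTheme(...)` changes state seen by every other caller, so tests that touch the singleton can affect each other unless they reset it.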
---
## Side-by-Side Comparison
| Dimension | Abstract Factory | Builder | Factory Method | Prototype | Singleton |
|-----------|-----------------|---------|----------------|-----------|-----------|
| **Problem solved** | Family of related products | Complex object step-by-step | Single product; defer type to subclass | Clone configured instances | Single instance globally accessible |
| **Parameterization** | Composition | Composition | Subclassing | Composition | N/A |
| **Run-time flexibility** | High — swap factory | High — swap builder | Low — compile-time | High — swap prototype | N/A (fixed instance) |
| **Product family support** | Yes (core strength) | No | No | No | No |
| **Construction control** | Single-step | Step-by-step | Single-step | Copy | N/A |
| **New variant cost** | New ConcreteFactory | New ConcreteBuilder | New Creator + Product subclass | Register new prototype | N/A |
| **New product type cost** | High (change all factories) | Low (add builder method) | Low | Low (add Clone) | N/A |
| **Subclass requirement** | Factory subclasses | Builder subclasses | Creator subclasses | Product Clone impl | Optional |
| **Complexity** | Medium-high | Medium | Low | Low-medium | Low |
---
## Maze Game: All Five Patterns Applied
The GoF use `MazeGame::CreateMaze()` as the canonical comparison example. The baseline code hard-codes `new Maze`, `new Room(1)`, `new Room(2)`, `new Door(r1, r2)`. The goal: support enchanted maze variants with `EnchantedRoom` and `DoorNeedingSpell`.
### Baseline (inflexible)
```cpp
Maze* MazeGame::CreateMaze() {
    Maze* aMaze = new Maze;
    Room* r1 = new Room(1);
    Room* r2 = new Room(2);
    Door* theDoor = new Door(r1, r2);
    // ... connect rooms
    return aMaze;
}
```
Problem: The maze layout logic is sound, but any change to the component types requires editing or rewriting this entire method.
### Factory Method Solution
Add virtual creation methods to MazeGame. Subclasses override them.
```cpp
// MazeGame adds:
virtual Room* MakeRoom(int n) { return new Room(n); }
virtual Door* MakeDoor(Room* r1, Room* r2) { return new Door(r1, r2); }
// CreateMaze calls:
Room* r1 = MakeRoom(1);
Room* r2 = MakeRoom(2);
Door* theDoor = MakeDoor(r1, r2);
// EnchantedMazeGame overrides:
Room* MakeRoom(int n) override { return new EnchantedRoom(n, CastSpell()); }
Door* MakeDoor(Room* r1, Room* r2) override { return new DoorNeedingSpell(r1, r2); }
```
**Trade-off:** Every maze variant = one new MazeGame subclass. Clean for compile-time variants; awkward for run-time selection.
### Abstract Factory Solution
Pass a factory object into CreateMaze. The factory creates all components.
```cpp
Maze* MazeGame::CreateMaze(MazeFactory& factory) {
    Maze* aMaze = factory.MakeMaze();
    Room* r1 = factory.MakeRoom(1);
    Room* r2 = factory.MakeRoom(2);
    Door* theDoor = factory.MakeDoor(r1, r2);
    // ...
    return aMaze;
}
// Call with a named factory; a temporary cannot bind to MazeFactory&:
//   EnchantedMazeFactory factory;
//   game.CreateMaze(factory);
```
**Trade-off:** CreateMaze is fully decoupled from concrete classes. Adding a new component type requires adding to the MazeFactory interface and all concrete factories.
### Builder Solution
Pass a MazeBuilder object that tracks the maze being built. CreateMaze calls the builder incrementally.
```cpp
Maze* MazeGame::CreateMaze(MazeBuilder& builder) {
    builder.BuildMaze();
    builder.BuildRoom(1);
    builder.BuildRoom(2);
    builder.BuildDoor(1, 2);
    return builder.GetMaze();
}
```
**Trade-off:** Builder knows both how to create and how to represent. Multiple builders (StandardMazeBuilder, CountingMazeBuilder) can interpret the same construction calls differently. Best when the construction algorithm is valuable independent of the product type.
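To show how a second builder can interpret the same calls differently, here is a simplified sketch of a counting builder. It is an approximation under stated assumptions: the `MazeBuilder` base below omits `GetMaze()` so the block stays self-contained, and `DescribeMaze`, `Rooms()`, and `Doors()` are hypothetical names.

```cpp
// Builder interface with empty default operations, so a concrete builder
// overrides only the steps it cares about. GetMaze() is omitted here
// because this sketch defines no Maze class.
class MazeBuilder {
public:
    virtual ~MazeBuilder() = default;
    virtual void BuildMaze() {}
    virtual void BuildRoom(int /*n*/) {}
    virtual void BuildDoor(int /*from*/, int /*to*/) {}
};

// Builds no maze at all; it only counts the components requested.
class CountingMazeBuilder : public MazeBuilder {
public:
    void BuildRoom(int) override { ++rooms_; }
    void BuildDoor(int, int) override { ++doors_; }
    int Rooms() const { return rooms_; }
    int Doors() const { return doors_; }
private:
    int rooms_ = 0;
    int doors_ = 0;
};

// Same call sequence as CreateMaze above, minus the GetMaze() result.
void DescribeMaze(MazeBuilder& builder) {
    builder.BuildMaze();
    builder.BuildRoom(1);
    builder.BuildRoom(2);
    builder.BuildDoor(1, 2);
}
```

Passing a `CountingMazeBuilder` to `DescribeMaze` leaves it reporting two rooms and one door. The construction algorithm did not change; only the builder's interpretation of each step did.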
### Prototype Solution
Parameterize CreateMaze with prototypical component instances it will clone.
```cpp
Maze* MazeGame::CreateMaze(MazePrototypeFactory& factory) {
    // factory holds prototypical Room*, Door* instances
    Maze* aMaze = factory.MakeMaze();
    Room* r1 = factory.MakeRoom(1);  // internally: _protoRoom->Clone()
    Room* r2 = factory.MakeRoom(2);
    Door* theDoor = factory.MakeDoor(r1, r2);
    // ...
    return aMaze;
}
```
**Trade-off:** No new ConcreteFactory classes needed for each variant. Swap prototype instances to change component types. Requires Clone() on all product classes.
### Singleton Solution
In the GoF treatment, the MazeFactory is the Singleton. One factory serves the whole game, and code that builds parts of the maze reaches it through a well-known access point instead of receiving it as a parameter everywhere.
```cpp
MazeFactory* MazeFactory::_instance = nullptr;
MazeFactory* MazeFactory::Instance() {
    if (_instance == nullptr) _instance = new MazeFactory;
    return _instance;
}
```
**Trade-off:** Orthogonal to the other patterns: Singleton controls access to the factory, not how maze components are built. It is typically combined with Abstract Factory so that a single factory manages the product family for the whole application run.
---
## GraphicTool: Factory Method vs Abstract Factory vs Prototype
The GoF Discussion section uses the GraphicTool drawing editor as a second comparison. A GraphicTool needs to create Graphic objects (circles, lines, etc.). There are three ways to parameterize it:
| Approach | Structure | Trade-off |
|----------|-----------|-----------|
| **Factory Method** | A GraphicTool subclass for each Graphic subclass. Each subclass redefines a NewGraphic() operation. | Lots of GraphicTool subclasses that differ only in what they instantiate — subclass proliferation. Easy to start with. |
| **Abstract Factory** | A class hierarchy of GraphicsFactories — one for each Graphic subclass (CircleFactory, LineFactory). GraphicTool is parameterized with a factory. | Requires an equally large GraphicsFactory hierarchy. Preferable only if a GraphicsFactory hierarchy already exists for another reason. |
| **Prototype** | Each Graphic subclass implements Clone(). Each GraphicTool instance is parameterized with a prototype it clones when invoked. | Reduces class count significantly — no creator hierarchy needed. Best for this use case. Clone must be implemented on all Graphic types. |
**GoF conclusion:** Prototype is best for the drawing editor because it only requires implementing Clone() on each Graphics class, reduces the number of classes, and Clone() is useful beyond instantiation (e.g., a Duplicate menu operation). Factory Method is easiest to start with, and Abstract Factory is preferable only when a factory hierarchy already exists or is needed elsewhere.
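The Prototype row of the comparison can be sketched as follows. This is a minimal illustration with assumed details: `Name()` is a hypothetical helper, and `std::unique_ptr` ownership is a modern C++ choice, not part of the GoF description.

```cpp
#include <memory>
#include <string>
#include <utility>

struct Graphic {
    virtual ~Graphic() = default;
    virtual std::unique_ptr<Graphic> Clone() const = 0;
    virtual std::string Name() const = 0;
};
struct Circle : Graphic {
    std::unique_ptr<Graphic> Clone() const override {
        return std::make_unique<Circle>(*this);
    }
    std::string Name() const override { return "Circle"; }
};
struct Line : Graphic {
    std::unique_ptr<Graphic> Clone() const override {
        return std::make_unique<Line>(*this);
    }
    std::string Name() const override { return "Line"; }
};

// A single GraphicTool class serves every Graphic type. Each tool instance
// is parameterized with the prototype it clones, so no GraphicTool
// subclasses and no factory hierarchy are needed.
class GraphicTool {
public:
    explicit GraphicTool(std::unique_ptr<Graphic> prototype)
        : prototype_(std::move(prototype)) {}

    std::unique_ptr<Graphic> NewGraphic() const { return prototype_->Clone(); }

private:
    std::unique_ptr<Graphic> prototype_;
};
```

A circle tool and a line tool are then two instances of the same class, `GraphicTool(std::make_unique<Circle>())` and `GraphicTool(std::make_unique<Line>())`, which is the reduction in class count the conclusion refers to.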
**General evolution guidance (GoF):**
> Designs that use Abstract Factory, Prototype, or Builder are more flexible than those that use Factory Method, but they're also more complex. Often, designs start out using Factory Method and evolve toward the other creational patterns as the designer discovers where more flexibility is needed.
Implement the Composite pattern to compose objects into tree structures representing part-whole hierarchies, letting clients treat individual objects and com...
---
name: composite-pattern-implementor
description: |
Implement the Composite pattern to compose objects into tree structures representing part-whole hierarchies, letting clients treat individual objects and compositions uniformly. Use when building file systems, UI component trees, organization charts, document structures, or any recursive containment hierarchy. Addresses 9 implementation concerns including the safety-vs-transparency trade-off for child management, parent references, component sharing, caching, and child ordering.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/design-patterns-gof/skills/composite-pattern-implementor
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- structural-pattern-selector
source-books:
- id: design-patterns-gof
title: "Design Patterns: Elements of Reusable Object-Oriented Software"
authors: ["Erich Gamma", "Richard Helm", "Ralph Johnson", "John Vlissides"]
chapters: [4]
pages: [156-165, 43-48]
tags: [design-patterns, structural, gof, composite, tree, hierarchy, recursion, refactoring]
execution:
tier: 2
mode: full
inputs:
- type: code
description: "Existing code that distinguishes between primitive and container objects with separate handling paths, or a description of a part-whole hierarchy to model"
tools-required: [TodoWrite, Read]
tools-optional: [Grep, Edit]
mcps-required: []
environment: "Any codebase. Language-agnostic — examples use Python but the pattern applies universally."
---
# Composite Pattern Implementor
## When to Use
You need to model a part-whole hierarchy where clients should be able to treat leaf objects and composite objects the same way — without type-checking or branching on "is this a container or a primitive?"
Apply this skill when:
- Code that operates on a structure must repeatedly ask "is this a single item or a group?" to decide what to do — **the primary code smell**
- You want to represent hierarchies like: file systems (files and directories), UI trees (widgets and panels), org charts (employees and departments), document structures (characters, rows, columns), or bill-of-materials (parts and assemblies)
- A recursive containment relationship exists: containers can hold other containers as well as primitives
- New kinds of components should be addable without changing client traversal code
**Two applicability conditions (from GoF):**
1. You want to represent part-whole hierarchies of objects
2. You want clients to be able to ignore the difference between compositions of objects and individual objects — clients will treat all objects in the composite structure uniformly
Before starting, confirm you are not looking for Decorator (which adds responsibilities to a single object, not a tree), or Iterator (which traverses but does not define the containment structure).
---
## Process
### Step 1: Set Up Tracking and Identify the Hierarchy
**ACTION:** Use `TodoWrite` to track progress, then identify the part-whole structure that needs to be modeled.
**WHY:** The shape of the hierarchy — how many levels deep, whether ordering matters, whether nodes can have multiple parents — governs all subsequent design decisions. Capturing this upfront prevents discovering structural constraints mid-implementation that force revisiting earlier steps.
```
TodoWrite:
- [ ] Step 1: Identify the hierarchy and characterize its nodes
- [ ] Step 2: Design the Component interface
- [ ] Step 3: Implement Leaf classes
- [ ] Step 4: Implement the Composite class
- [ ] Step 5: Make the safety-vs-transparency decision
- [ ] Step 6: Apply implementation concerns (parent refs, caching, ordering, deletion)
- [ ] Step 7: Validate the refactored structure
```
Document the hierarchy:
| Element | What to capture |
|---------|----------------|
| **Primitive nodes** | What are the leaf types? What operations do they perform? |
| **Container nodes** | What are the composite types? Do containers vary in type (e.g., Bus vs. Cabinet)? |
| **Shared operations** | Which operations must both leaves and composites support? |
| **Child ordering** | Does sequence matter (parse trees, rendered rows) or is it a set? |
| **Sharing** | Can a node appear under multiple parents? |
| **Depth** | Is the nesting depth bounded or unbounded? |
If working from a description rather than existing code, sketch the object tree showing two or three levels of nesting before writing any classes.
---
### Step 2: Design the Component Interface
**ACTION:** Define the abstract Component class — the common interface for both Leaf and Composite objects.
**WHY:** The Component interface is what enables uniform treatment. When all objects in the hierarchy share this interface, client code can invoke operations without knowing whether it holds a leaf or a composite. If the interface is too narrow, clients must downcast; if too wide, leaf implementations are burdened with operations that have no meaning for them. The goal is to maximize what can be defined here — every operation moved to Component is an operation clients never need to branch on.
**Design the interface by listing operations in two categories:**
| Category | Operations | Where declared |
|----------|-----------|----------------|
| **Domain operations** | `Draw()`, `NetPrice()`, `Render()`, `Execute()` — operations meaningful to every node | Component — always |
| **Child management** | `Add(Component)`, `Remove(Component)`, `GetChild(int)` — structural operations for composites | Component or Composite — see Step 5 |
**Maximizing the Component interface:** Some operations appear to belong only to Composites (e.g., accessing children). However, if you model a Leaf as a Component that *never has children*, you can define a default `GetChild()` on Component that returns nothing, and Composites override it to return their children. This keeps clients ignorant of the distinction. Do this for read-only structural operations. The add/remove write operations are more controversial — see Step 5.
```python
from abc import ABC, abstractmethod
from typing import Optional


class Component(ABC):
    """Abstract base for all nodes in the part-whole hierarchy."""

    @abstractmethod
    def operation(self) -> str:
        """Domain operation, implemented by every node."""
        ...

    def add(self, component: "Component") -> None:
        """Default: raise. Composite overrides."""
        raise NotImplementedError("Leaf components do not support add()")

    def remove(self, component: "Component") -> None:
        """Default: raise. Composite overrides."""
        raise NotImplementedError("Leaf components do not support remove()")

    def get_child(self, index: int) -> Optional["Component"]:
        """Default: return None. Composite overrides."""
        return None
```
---
### Step 3: Implement Leaf Classes
**ACTION:** Create one Leaf class per primitive node type. Each implements the domain operation(s) and does not override child management.
**WHY:** Leaves are the base cases of the recursive structure — they do actual work. They represent the primitives of the composition: `FloppyDisk`, `Character`, `Employee`, `File`. Keeping leaves simple is the key benefit: a client can hold a `Component` reference and call `operation()` without knowing it is a leaf.
```python
class Leaf(Component):
    """Represents a primitive object with no children."""

    def __init__(self, name: str):
        self._name = name

    def operation(self) -> str:
        return f"Leaf({self._name})"
```
**In the Equipment hierarchy (GoF sample code):**
```python
class FloppyDisk(Component):
    """Leaf: a single hardware component with no subparts."""

    def __init__(self, name: str, wattage: int, price: float):
        self._name = name
        self._wattage = wattage
        self._price = price

    def operation(self) -> str:
        # Required because Component declares operation() as abstract.
        return f"FloppyDisk({self._name})"

    def net_price(self) -> float:
        return self._price  # a leaf's price is its own price; no summation needed

    def power(self) -> int:
        return self._wattage
```
---
### Step 4: Implement the Composite Class
**ACTION:** Create the Composite class that stores and manages child components, and implements domain operations by delegating to children recursively.
**WHY:** The Composite is where the recursive power of the pattern lives. It implements the same interface as Leaf but does so by iterating over children and aggregating their results. This means operations like `net_price()` on a Cabinet automatically sum up all nested components — no client code has to know the tree depth or structure.
```python
class CompositeComponent(Component):
    """A container node that can hold both Leaves and other Composites."""

    def __init__(self, name: str):
        self._name = name
        self._children: list[Component] = []

    def add(self, component: Component) -> None:
        self._children.append(component)

    def remove(self, component: Component) -> None:
        self._children.remove(component)

    def get_child(self, index: int) -> Optional[Component]:
        # Return None for an out-of-range index, matching the Leaf default.
        if 0 <= index < len(self._children):
            return self._children[index]
        return None

    def operation(self) -> str:
        results = [child.operation() for child in self._children]
        return f"Composite({self._name})[{', '.join(results)}]"
```
**Equipment hierarchy — Composite with aggregation (GoF sample code):**
```python
class CompositeEquipment(Component):
    """Base for equipment that contains sub-equipment."""

    def __init__(self, name: str):
        self._name = name
        self._equipment: list[Component] = []

    def add(self, component: Component) -> None:
        self._equipment.append(component)

    def remove(self, component: Component) -> None:
        self._equipment.remove(component)

    def operation(self) -> str:
        # Required because Component declares operation() as abstract.
        return f"{type(self).__name__}({self._name})"

    def net_price(self) -> float:
        """Sum the net prices of all sub-equipment recursively."""
        # WHY: recursive delegation means Cabinet.net_price() automatically
        # traverses the full tree; no client code needs to walk the structure.
        return sum(child.net_price() for child in self._equipment)


class Chassis(CompositeEquipment):
    """A chassis holds drives, buses, and cards."""

    def __init__(self, name: str):
        super().__init__(name)
        self._wattage = 0

    def net_price(self) -> float:
        return super().net_price()  # inherits summation from CompositeEquipment


class Cabinet(CompositeEquipment):
    """A cabinet holds chassis and other equipment."""
    pass


# Assembly:
cabinet = Cabinet("PC Cabinet")
chassis = Chassis("PC Chassis")
cabinet.add(chassis)
bus = CompositeEquipment("MCA Bus")
# FloppyDisk doubles as a generic leaf here; a dedicated Card class would be cleaner.
bus.add(FloppyDisk("Token Ring Card", wattage=4, price=29.99))
chassis.add(bus)
chassis.add(FloppyDisk("3.5in Floppy", wattage=5, price=19.99))
print(cabinet.net_price())  # Recursively sums all sub-equipment prices
```
---
### Step 5: Make the Safety-vs-Transparency Decision
**ACTION:** Decide where to declare child management operations (`add`, `remove`, `get_child`) — in Component (transparency) or only in Composite (safety). This is the central design decision of the Composite pattern.
**WHY:** This trade-off has no universally correct answer — it depends on what kind of errors your system must catch and at what point. Choosing transparency means clients never need to downcast or type-check, which is the core promise of the pattern. Choosing safety means misuse is caught at compile time, but clients that want to use the full Composite interface must check types at runtime. Consciously choose; do not let it default.
**The trade-off:**
| Option | What it gives you | What it costs you |
|--------|-------------------|-------------------|
| **Transparency** — declare `add`/`remove` in Component | Clients treat all components uniformly; no downcasting needed | Leaves inherit operations that are meaningless for them; `add()` on a Leaf must fail at runtime |
| **Safety** — declare `add`/`remove` only in Composite | Misuse caught at compile time (statically typed languages); Leaf interface is clean | Clients must downcast to `Composite` to manage children; uniformity breaks |
**GoF recommendation:** Emphasize transparency — declare child management in Component, implement defaults that fail at runtime. This preserves the pattern's central benefit.
**Implementing safe transparency with `get_composite()` (GoF idiom):**
If you choose safety but still want to avoid unsafe downcasts, add a `get_composite()` method to Component that returns `None` by default, and override it in Composite to return `self`. Clients use this to obtain the Composite interface safely:
```python
class Component(ABC):
    def get_composite(self) -> Optional["CompositeComponent"]:
        """Returns self if this component is a Composite, None otherwise."""
        return None  # Default: not a composite


class CompositeComponent(Component):
    def get_composite(self) -> Optional["CompositeComponent"]:
        return self  # Override: I am a composite


# Safe usage, no blind downcast:
if composite := component.get_composite():
    composite.add(new_leaf)
```
**When `add()` is called on a Leaf under transparency:** Make `add()` fail by raising an exception rather than silently succeeding. Silently ignoring `add()` on a leaf hides a bug (the caller believes the component was added). Raising makes misuse visible immediately.
---
### Step 6: Apply Remaining Implementation Concerns
**ACTION:** Evaluate each of the 9 implementation concerns and apply the relevant ones to your design.
**WHY:** The 9 concerns are not all mandatory — they are a checklist of decisions that frequently arise. Each one you skip is an implicit default that may cause problems later (e.g., not managing parent references will complicate deletion; not addressing ordering will produce incorrect tree traversal). Work through the list deliberately, even if the answer is "not applicable."
See `references/composite-implementation-guide.md` for full details on each concern.
**Checklist:**
| # | Concern | Decision to make |
|---|---------|-----------------|
| 1 | **Explicit parent references** | Does traversal upward (to parent) simplify your operations? If yes, add a `parent` field to Component and maintain it in `add()`/`remove()`. |
| 2 | **Sharing components** | Can a node appear under two parents? If yes, apply Flyweight — externalize mutable state so nodes do not need to know their parent. |
| 3 | **Maximizing the Component interface** | Have you moved as many operations to Component as reasonable (with sensible leaf defaults)? |
| 4 | **Safety vs. transparency** | Decided in Step 5. Confirm the choice is consistently applied. |
| 5 | **Child list storage** | Does every Composite have the same child storage structure? If Composites vary significantly, store the child list in each Composite subclass rather than a shared base. Watch the memory cost of putting a list in every Leaf. |
| 6 | **Child ordering** | Does the sequence of children matter (rendering order, parse tree structure)? If yes, use an ordered list and design `add()` to accept a position. Consider Iterator for traversal. |
| 7 | **Caching** | Are traversals expensive? Cache computed results (e.g., bounding boxes, aggregated prices) in Composite. Invalidate parent caches when a child changes — requires parent references (concern 1). |
| 8 | **Who deletes components** | In languages without garbage collection: Composite is responsible for deleting its children on destruction, except when Leaf objects are immutable and shared (Flyweight). |
| 9 | **Data structure for children** | Default is a list. Use a tree for sorted order, a hash table for fast lookup by key, or per-child named fields when the set of children is fixed and known. |
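As a rough illustration of concern 9, the sketch below keeps children in a dict keyed by name so that lookup does not require scanning a list. The `Node`, `NamedLeaf`, and `NamedComposite` classes and the `find()` method are hypothetical names chosen for this example; they do not come from the GoF text.

```python
from abc import ABC, abstractmethod
from typing import Optional


class Node(ABC):
    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def operation(self) -> str: ...


class NamedLeaf(Node):
    def operation(self) -> str:
        return self.name


class NamedComposite(Node):
    """Composite that stores children in a dict keyed by name (hypothetical)."""

    def __init__(self, name: str):
        super().__init__(name)
        # dicts preserve insertion order, so traversal order stays stable.
        self._children: dict[str, Node] = {}

    def add(self, child: Node) -> None:
        if child.name in self._children:
            raise ValueError(f"duplicate child name: {child.name}")
        self._children[child.name] = child

    def remove(self, child: Node) -> None:
        del self._children[child.name]

    def find(self, name: str) -> Optional[Node]:
        """Look a child up by key instead of scanning a list."""
        return self._children.get(name)

    def operation(self) -> str:
        inner = ", ".join(c.operation() for c in self._children.values())
        return f"{self.name}[{inner}]"


root = NamedComposite("root")
root.add(NamedLeaf("a"))
root.add(NamedLeaf("b"))
```

The trade-off in this sketch is that child names must be unique within a composite; a plain list remains the simpler choice when no keyed lookup is needed.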
**Applying parent references (concern 1):**
```python
class Component(ABC):
    def __init__(self):
        # Subclasses that define their own __init__ must call super().__init__().
        self._parent: Optional["CompositeComponent"] = None

    @property
    def parent(self) -> Optional["CompositeComponent"]:
        return self._parent


class CompositeComponent(Component):
    def add(self, component: Component) -> None:
        self._children.append(component)
        component._parent = self  # maintain invariant: child's parent = this composite

    def remove(self, component: Component) -> None:
        self._children.remove(component)
        component._parent = None  # clear parent when removed
```
---
### Step 7: Validate the Refactored Structure
**ACTION:** Verify the implementation is complete and uniform treatment holds end-to-end.
**WHY:** The refactoring is only successful if client code that operated on leaves or composites separately can now use the Component interface uniformly, without branching on type. If any client still does `isinstance(node, Composite)` for anything other than the explicit `get_composite()` safety idiom, the pattern is incomplete.
**Validation checklist:**
- [ ] Client code uses only the `Component` interface for domain operations — no `isinstance` checks to decide which operation to call
- [ ] `operation()` (or equivalent domain method) produces correct results on a Leaf, a shallow Composite, and a deep nested Composite
- [ ] `add()` / `remove()` on a Leaf raises a clear exception with a meaningful message (transparency), or is not declared on Leaf at all so that misuse is rejected by the compiler or type checker (safety)
- [ ] Adding a new Leaf type requires only a new class implementing Component — no changes to existing Composites or client code
- [ ] Adding a new Composite type requires only a new class extending CompositeEquipment (or equivalent) — no changes to clients
- [ ] If caching was added (concern 7), modifying a child correctly invalidates the parent's cache
- [ ] If parent references were added (concern 1), the invariant holds: every child's `parent` points to its containing Composite
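
Several of these checks can be written as plain assertions. The sketch below is one possible way to do that, assuming the transparency variant with parent references; `Item`, `Group`, and `check_structure()` are hypothetical stand-ins for your own Leaf, Composite, and test function.

```python
from abc import ABC, abstractmethod
from typing import Optional


class Component(ABC):
    def __init__(self) -> None:
        self.parent: Optional["Group"] = None

    @abstractmethod
    def operation(self) -> int: ...

    def add(self, component: "Component") -> None:
        raise NotImplementedError(f"{type(self).__name__} does not support add()")


class Item(Component):
    """Hypothetical leaf: contributes a fixed value."""

    def __init__(self, value: int) -> None:
        super().__init__()
        self._value = value

    def operation(self) -> int:
        return self._value


class Group(Component):
    """Hypothetical composite: sums its children."""

    def __init__(self) -> None:
        super().__init__()
        self._children: list[Component] = []

    def add(self, component: Component) -> None:
        self._children.append(component)
        component.parent = self

    def operation(self) -> int:
        return sum(child.operation() for child in self._children)


def check_structure() -> None:
    leaf, shallow, deep = Item(2), Group(), Group()
    shallow.add(Item(3))
    deep.add(shallow)
    deep.add(leaf)

    # Domain operation works on a leaf, a shallow composite, and a nested one.
    assert leaf.operation() == 2
    assert shallow.operation() == 3
    assert deep.operation() == 5

    # Parent invariant: each child points at its containing composite.
    assert leaf.parent is deep and shallow.parent is deep

    # Misuse on a leaf fails loudly rather than silently succeeding.
    try:
        leaf.add(Item(1))
    except NotImplementedError:
        pass
    else:
        raise AssertionError("Leaf.add() should raise")


check_structure()
```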
---
## Examples
### Example 1: File System Hierarchy
**Scenario:** A file system utility needs to calculate directory sizes. Currently, code special-cases directories vs. files with `isinstance` checks throughout the size calculation logic.
**Trigger:** "Every place we compute size or count files, we have to check whether we're dealing with a file or a directory."
**Before (type-branching code smell):**
```python
def get_size(node):
    if isinstance(node, File):
        return node.size
    elif isinstance(node, Directory):
        return sum(get_size(child) for child in node.children)
```
**After (Composite):**
```python
from abc import ABC, abstractmethod


class FileSystemNode(ABC):
    @abstractmethod
    def size(self) -> int: ...

    def add(self, node: "FileSystemNode") -> None:
        raise NotImplementedError

    def get_children(self) -> list["FileSystemNode"]:
        return []


class File(FileSystemNode):
    def __init__(self, name: str, size_bytes: int):
        self._name = name
        self._size = size_bytes

    def size(self) -> int:
        return self._size


class Directory(FileSystemNode):
    def __init__(self, name: str):
        self._name = name
        self._children: list[FileSystemNode] = []

    def add(self, node: FileSystemNode) -> None:
        self._children.append(node)

    def get_children(self) -> list[FileSystemNode]:
        return self._children

    def size(self) -> int:
        return sum(child.size() for child in self._children)


# Client: no isinstance, no branching.
root = Directory("root")
root.add(File("readme.txt", 1024))
docs = Directory("docs")
docs.add(File("design.pdf", 204800))
root.add(docs)
print(root.size())  # 205824: traverses the entire tree automatically
```
**Output:** Client code calls `node.size()` uniformly. A new `SymbolicLink` type (a Leaf whose `size()` reports the size of its target) requires zero changes to directories or the size computation.
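
To make the extension claim concrete, here is a minimal sketch of such a `SymbolicLink` leaf. The class body and the decision to count a link as its target's size are assumptions made for illustration; a real utility might count only the link entry and would need protection against link cycles.

```python
from abc import ABC, abstractmethod


class FileSystemNode(ABC):
    @abstractmethod
    def size(self) -> int: ...


class File(FileSystemNode):
    def __init__(self, name: str, size_bytes: int):
        self._name = name
        self._size = size_bytes

    def size(self) -> int:
        return self._size


class Directory(FileSystemNode):
    def __init__(self, name: str):
        self._name = name
        self._children: list[FileSystemNode] = []

    def add(self, node: FileSystemNode) -> None:
        self._children.append(node)

    def size(self) -> int:
        return sum(child.size() for child in self._children)


class SymbolicLink(FileSystemNode):
    """Hypothetical leaf: reports the size of the node it points to."""

    def __init__(self, name: str, target: FileSystemNode):
        self._name = name
        self._target = target

    def size(self) -> int:
        # Assumption: a link counts as its target's size.
        return self._target.size()


readme = File("readme.txt", 1024)
root = Directory("root")
root.add(readme)
root.add(SymbolicLink("readme-link", readme))
total = root.size()  # 2048: the file plus the link that resolves to it
```

`Directory` and `File` are unchanged from the example above (minus the child-management defaults, omitted for brevity); only the new leaf class was added.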
---
### Example 2: UI Component Tree (React-style)
**Scenario:** A server-side UI renderer builds HTML from a component tree. Panels hold other components; buttons, labels, and inputs are leaves. The renderer must call `render()` on the root and get complete HTML.
**Trigger:** "We're adding more container types (tabs, accordions) and the render function keeps growing with new `if isinstance(...)` branches."
**Design:**
```python
from abc import ABC, abstractmethod


class UIComponent(ABC):
    @abstractmethod
    def render(self, indent: int = 0) -> str: ...

    def add(self, component: "UIComponent") -> None:
        raise NotImplementedError("Leaf UI components do not support add()")


class Button(UIComponent):
    def __init__(self, label: str):
        self._label = label

    def render(self, indent: int = 0) -> str:
        return " " * indent + f"<button>{self._label}</button>"


class Panel(UIComponent):
    def __init__(self, css_class: str):
        self._class = css_class
        self._children: list[UIComponent] = []

    def add(self, component: UIComponent) -> None:
        self._children.append(component)

    def render(self, indent: int = 0) -> str:
        inner = "\n".join(child.render(indent + 2) for child in self._children)
        return f'{" " * indent}<div class="{self._class}">\n{inner}\n{" " * indent}</div>'


# Assembly and rendering:
form = Panel("login-form")
form.add(Button("Sign In"))
form.add(Button("Cancel"))
page = Panel("page")
page.add(form)
print(page.render())  # Full recursive HTML, no type checks
```
**Output:** Adding a `TabContainer` composite or an `Input` leaf requires no changes to `render()` callers or existing components.
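
A hedged sketch of that extension follows. `Input` and `TabContainer` are hypothetical classes written for this example, and the markup they emit is an arbitrary choice; the point is only that neither one requires edits to `UIComponent` or to code that calls `render()`.

```python
from abc import ABC, abstractmethod
from html import escape


class UIComponent(ABC):
    @abstractmethod
    def render(self, indent: int = 0) -> str: ...

    def add(self, component: "UIComponent") -> None:
        raise NotImplementedError("Leaf UI components do not support add()")


class Input(UIComponent):
    """Hypothetical new leaf."""

    def __init__(self, name: str):
        self._name = name

    def render(self, indent: int = 0) -> str:
        return " " * indent + f'<input name="{escape(self._name)}">'


class TabContainer(UIComponent):
    """Hypothetical new composite: wraps each child in a tab section."""

    def __init__(self) -> None:
        self._children: list[UIComponent] = []

    def add(self, component: UIComponent) -> None:
        self._children.append(component)

    def render(self, indent: int = 0) -> str:
        pad = " " * indent
        lines = [f'{pad}<div class="tabs">']
        for child in self._children:
            lines.append(f'{pad}  <section class="tab">')
            lines.append(child.render(indent + 4))
            lines.append(f"{pad}  </section>")
        lines.append(f"{pad}</div>")
        return "\n".join(lines)


tabs = TabContainer()
tabs.add(Input("email"))
html_out = tabs.render()
```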
---
### Example 3: Lexi Document Structure (GoF Case Study)
**Scenario:** Lexi is a WYSIWYG document editor. Documents are built from graphical elements: characters, images, and lines at the leaf level; rows and columns as composites. Operations like `Draw()`, `Bounds()`, and `Intersects()` must work on any element regardless of complexity — a single character and a multi-column page are treated identically by the rendering engine.
**The Glyph hierarchy (GoF pages 43-48):**
```
Component: Glyph — Draw(Window), Bounds(Rect), Intersects(Point), Insert(Glyph, int), Remove(Glyph), Child(int), Parent()
Leaf: Character — holds a char, draws via w->DrawCharacter(c)
Leaf: Rectangle — draws via w->DrawRect(x0, y0, x1, y1)
Leaf: Image — holds a bitmap, draws to window
Composite: Row — arranges children left-to-right; Draw() iterates children, ensures each is positioned, calls c->Draw(w)
Composite: Column — arranges Row children vertically
```
**Key design decisions made in Lexi:**
1. **Transparency chosen:** `Insert()`, `Remove()`, and `Child()` are declared on `Glyph` (the Component), not just on `Row` (the Composite). Leaves do not override these — clients can always call them without type-checking.
2. **Parent references used:** `Glyph` declares `Parent()`. Every glyph stores a reference to its parent so the formatter can walk up the structure and so `Remove()` can detach a glyph from its container.
3. **Ordered children:** `Row` uses an ordered list for children — the sequence of characters defines the visual order on screen. `Insert(Glyph* g, int i)` inserts at a specific position.
4. **Operation maximization:** `Bounds()` is defined on all Glyphs, not just containers. A `Character`'s bounding box is its character cell; a `Row`'s bounding box is the union of its children's bounding boxes.
**Result:** The formatter, spell-checker, and renderer all hold `Glyph*` references and call `Draw()`, `Bounds()`, or `Intersects()` uniformly — whether the target is a single character or a nested multi-column structure, the code is identical.
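
Decision 4 can be sketched in Python as follows. This is an illustrative translation, not the GoF C++ code: the `Rect` dataclass, the fixed character cell size, and the snake_case method names are assumptions made to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x0: int
    y0: int
    x1: int
    y1: int

    def union(self, other: "Rect") -> "Rect":
        return Rect(min(self.x0, other.x0), min(self.y0, other.y0),
                    max(self.x1, other.x1), max(self.y1, other.y1))


class Glyph:
    def bounds(self) -> Rect:
        raise NotImplementedError


class Character(Glyph):
    def __init__(self, char: str, x: int, y: int, width: int = 8, height: int = 12):
        # A fixed cell size is an assumption; a real editor would ask the font.
        self._char = char
        self._rect = Rect(x, y, x + width, y + height)

    def bounds(self) -> Rect:
        return self._rect  # leaf: its own character cell


class Row(Glyph):
    def __init__(self) -> None:
        self._children: list[Glyph] = []

    def insert(self, glyph: Glyph, index: int) -> None:
        self._children.insert(index, glyph)

    def bounds(self) -> Rect:
        # Composite: the union of the children's bounding boxes.
        if not self._children:
            return Rect(0, 0, 0, 0)
        result = self._children[0].bounds()
        for child in self._children[1:]:
            result = result.union(child.bounds())
        return result


row = Row()
row.insert(Character("h", 0, 0), 0)
row.insert(Character("i", 8, 0), 1)
row_bounds = row.bounds()
```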
---
## Key Principles
- **The pattern's value is uniform treatment** — if clients still branch on leaf-vs-composite for domain operations, the pattern has failed. Every `isinstance` check for operational dispatch is a sign the Component interface needs expansion.
- **Transparency is a deliberate choice, not a default** — declaring child management in Component trades compile-time safety for uniformity. Make this choice explicitly and document it. Do not let it happen because you copied a template.
- **Recursive delegation is the mechanism** — a Composite's `operation()` should almost always iterate over children and aggregate. The power of the pattern is that `cabinet.net_price()` correctly sums a deeply nested tree without the caller knowing the depth or structure.
- **Leaves must still handle child management calls predictably**: under transparency, a Leaf inherits `add()` and `remove()` and must either raise a clear exception or silently do nothing. Silent no-ops hide bugs; prefer raising with a message.
- **"Can make design overly general" is a real risk** — the pattern makes it easy to add any component anywhere in the tree. If you want a Composite to hold only certain types of children, Composite cannot enforce that constraint via the type system — you must add runtime checks. Acknowledge this limitation explicitly; it may mean Composite is the wrong choice when type constraints are essential. Always ask: "Should ANY child be addable to ANY composite, or do structural rules apply?" If rules apply, add validation in `add()` and document which child types each Composite accepts.
- **Use Iterator for traversal, Visitor for operations**: once you have a Composite tree, two companion patterns become relevant. Use **Iterator** when clients need to traverse the tree in different orders (preorder, postorder, breadth-first) without knowing the tree's internal structure. Use **Visitor** when you need to add new operations to the tree (validation, rendering, aggregation) without modifying the Component classes. Used together, Iterator handles the "walk" and Visitor handles the "what to do at each node."
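
The runtime validation mentioned under "Can make design overly general" might look like the sketch below. The `allowed_children` class attribute and the rule that a `Bus` accepts only `Card` children are hypothetical; they illustrate one way to state structural rules, not a prescribed API.

```python
from abc import ABC, abstractmethod


class Equipment(ABC):
    @abstractmethod
    def net_price(self) -> float: ...


class Card(Equipment):
    def __init__(self, price: float):
        self._price = price

    def net_price(self) -> float:
        return self._price


class Drive(Equipment):
    def __init__(self, price: float):
        self._price = price

    def net_price(self) -> float:
        return self._price


class CompositeEquipment(Equipment):
    # Subclasses narrow this tuple to state which child types they accept.
    allowed_children: tuple[type, ...] = (Equipment,)

    def __init__(self) -> None:
        self._children: list[Equipment] = []

    def add(self, child: Equipment) -> None:
        if not isinstance(child, self.allowed_children):
            raise TypeError(
                f"{type(self).__name__} cannot contain {type(child).__name__}"
            )
        self._children.append(child)

    def net_price(self) -> float:
        return sum(c.net_price() for c in self._children)


class Bus(CompositeEquipment):
    """Hypothetical rule: a bus holds cards only."""
    allowed_children = (Card,)


bus = Bus()
bus.add(Card(30.0))
try:
    bus.add(Drive(20.0))
    rejected = False
except TypeError:
    rejected = True
```

Because the check lives in the shared `add()`, each Composite subclass documents its accepted child types in one place.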
---
## References
| File | Contents |
|------|----------|
| `references/composite-implementation-guide.md` | All 9 implementation concerns with full rationale, language-specific notes, related patterns (Chain of Responsibility, Flyweight, Iterator, Visitor, Decorator), and the complete GoF consequences catalog |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-structural-pattern-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/composite-implementation-guide.md
# Composite Pattern — Implementation Guide
> Reference for `composite-pattern-implementor` SKILL.md.
> Source: GoF Design Patterns, pages 156–165 (Composite pattern) and pages 43–48 (Lexi Glyph case study).
---
## The 9 Implementation Concerns
These are the decisions that most frequently arise when implementing the Composite pattern. They are not sequential steps — they are a checklist to work through after the basic structure is in place.
---
### Concern 1: Explicit Parent References
**What it is:** Storing a back-reference from each child component to its parent Composite.
**Why you might want it:**
- Simplifies traversal *upward* in the structure (e.g., "find the nearest ancestor with property X")
- Required for the Chain of Responsibility pattern when used alongside Composite
- Makes deletion cleaner: a component can detach itself by calling `remove()` on its parent, so the caller does not need to know which Composite holds it
**The invariant to maintain:** Every child of a Composite must have that Composite as its parent. The invariant must hold at all times — not just after construction. The safest approach is to change a component's parent *only* in the `add()` and `remove()` operations of the Composite class:
```python
class CompositeComponent(Component):
    def add(self, component: Component) -> None:
        self._children.append(component)
        component._parent = self  # set parent when adding

    def remove(self, component: Component) -> None:
        self._children.remove(component)
        component._parent = None  # clear parent when removing
```
If this logic lives in `CompositeEquipment` (a shared base for all Composites), all Composite subclasses inherit correct parent management automatically.
**When to skip it:** If your traversal is always top-down (root to leaves) and you never need to navigate upward or delete from within, parent references add memory and maintenance cost with no benefit.
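
For the cases where upward navigation is needed, the sketch below shows an ancestor search and self-detachment built on a parent reference. `Node` is a deliberately minimal stand-in for a Component, and `nearest_ancestor()` and `detach()` are hypothetical helper names.

```python
from typing import Callable, Optional


class Node:
    """Minimal stand-in for a Component that stores a parent reference."""

    def __init__(self, name: str, kind: str):
        self.name = name
        self.kind = kind
        self.parent: Optional["Node"] = None
        self.children: list["Node"] = []

    def add(self, child: "Node") -> None:
        self.children.append(child)
        child.parent = self  # the only place a parent is assigned

    def remove(self, child: "Node") -> None:
        self.children.remove(child)
        child.parent = None  # the only place a parent is cleared

    def nearest_ancestor(self, predicate: Callable[["Node"], bool]) -> Optional["Node"]:
        """Walk upward until an ancestor satisfies the predicate."""
        node = self.parent
        while node is not None:
            if predicate(node):
                return node
            node = node.parent
        return None

    def detach(self) -> None:
        """Remove this node from its parent, if it has one."""
        if self.parent is not None:
            self.parent.remove(self)


cabinet = Node("cabinet", "cabinet")
chassis = Node("chassis", "chassis")
disk = Node("disk", "drive")
cabinet.add(chassis)
chassis.add(disk)

found = disk.nearest_ancestor(lambda n: n.kind == "cabinet")
disk.detach()
```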
---
### Concern 2: Sharing Components (Flyweight Integration)
**What it is:** Allowing a single component to appear as a child of more than one parent.
**Why it is difficult:** When a component has only one parent, parent references (Concern 1) work cleanly. When a component has multiple parents, propagating a request upward becomes ambiguous — which parent do you notify?
**The Flyweight solution:** Redesign the component so it holds no parent reference and externalizes mutable state. Each component's state splits into intrinsic state (shared, immutable, stored in the component) and extrinsic state (contextual, passed in at request time). Clients pass context (e.g., current position, current composite) as parameters rather than having the component look it up via `parent`.
```python
# Flyweight approach: no parent reference, context passed explicitly
class SharedGlyph:
    """Intrinsic state only; can be shared across multiple Row composites."""

    def __init__(self, char: str):
        self._char = char

    def draw(self, window, x: int, y: int) -> None:
        # x, y are extrinsic: passed by the Row that owns this position
        window.draw_character(self._char, x, y)
```
**When to use:** When memory reduction from sharing outweighs the complexity of externalizing context. Particularly valuable for large trees with repeated primitives (characters in a text document, identical icons in a UI).
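
A sketch of how sharing could be wired up is shown below. The `GlyphFactory` pool, the `Row` layout logic, and the modelling of the window as a plain list of draw calls are all assumptions made for illustration.

```python
class SharedGlyph:
    """Intrinsic state only: the character it represents."""

    def __init__(self, char: str):
        self._char = char

    def draw(self, window: list, x: int, y: int) -> None:
        # The window is modelled as a list of recorded draw calls in this sketch.
        window.append((self._char, x, y))


class GlyphFactory:
    """Hypothetical factory that hands out one SharedGlyph per character."""

    def __init__(self) -> None:
        self._pool: dict[str, SharedGlyph] = {}

    def get(self, char: str) -> SharedGlyph:
        if char not in self._pool:
            self._pool[char] = SharedGlyph(char)
        return self._pool[char]


class Row:
    """Composite that supplies position (extrinsic state) at draw time."""

    def __init__(self, y: int, cell_width: int = 8):
        self._y = y
        self._cell_width = cell_width
        self._glyphs: list[SharedGlyph] = []

    def add(self, glyph: SharedGlyph) -> None:
        self._glyphs.append(glyph)

    def draw(self, window: list) -> None:
        for i, glyph in enumerate(self._glyphs):
            glyph.draw(window, i * self._cell_width, self._y)


factory = GlyphFactory()
row = Row(y=0)
for ch in "aba":
    row.add(factory.get(ch))

window: list = []
row.draw(window)
```

Here the two occurrences of "a" are the same object; their different positions come entirely from the `Row` at draw time.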
---
### Concern 3: Maximizing the Component Interface
**What it is:** Defining as many operations as possible in the Component class so Leaf and Composite subclasses can be used interchangeably.
**The tension:** The Component interface should be wide enough that clients never need to downcast — but a class should only define operations meaningful to its subclasses. Leaf classes should not be burdened with child management operations they cannot meaningfully implement.
**Resolution strategy:** Ask "can this operation have a reasonable default implementation on a Leaf?" If yes, move it to Component.
| Operation | Leaf default | Composite override |
|-----------|-------------|-------------------|
| `get_child(i)` | Return `None` — leaf has no children | Return `_children[i]` |
| `get_children()` | Return `[]` — empty list | Return `_children` |
| `net_price()` | Return own price | Sum children's prices |
| `power()` | Return own wattage | Sum children's wattage |
| `add(c)` | Raise exception | Append to `_children` |
| `remove(c)` | Raise exception | Remove from `_children` |
**Child management is the exception:** `add()` and `remove()` have no sensible default for leaves. These are handled by the safety-vs-transparency decision (Concern 4).
---
### Concern 4: Declaring the Child Management Operations — Safety vs. Transparency
**This is the central design decision of the Composite pattern.**
#### Option A: Transparency — Declare in Component
Child management operations (`add`, `remove`, `get_child`) are declared in the `Component` abstract class.
**What you gain:** Clients can call these operations on any Component without knowing whether it is a Leaf or a Composite. This is the purest form of the pattern — uniform treatment extends to structural operations.
**What you pay:** Leaf classes inherit operations that have no meaning for them. The implementation must handle misuse at runtime (by raising exceptions or returning a null/empty result). There is no compile-time protection.
**Implementation for transparency:**
```python
class Component(ABC):
    def add(self, component: "Component") -> None:
        raise NotImplementedError(f"{type(self).__name__} does not support add()")

    def remove(self, component: "Component") -> None:
        raise NotImplementedError(f"{type(self).__name__} does not support remove()")

    def get_child(self, index: int) -> Optional["Component"]:
        return None
```
#### Option B: Safety — Declare Only in Composite
Child management is declared and implemented only in the `Composite` class (not on `Component`).
**What you gain:** The type system enforces correct usage. In a statically typed language, calling `add()` on a `Component` reference is a compile error.
**What you pay:** Clients cannot uniformly treat all components. Whenever they need to add or remove children, they must obtain a `Composite` reference, either via downcast or the `get_composite()` idiom.
**The `get_composite()` idiom (GoF page 160):** Avoids unsafe downcasting while retaining type safety.
```python
class Component(ABC):
    def get_composite(self) -> Optional["CompositeComponent"]:
        """Returns self if this is a Composite, None otherwise."""
        return None


class CompositeComponent(Component):
    def get_composite(self) -> Optional["CompositeComponent"]:
        return self  # I am a Composite


# Safe type-checked usage:
if composite := component.get_composite():
    composite.add(new_child)  # Only executes if component is actually a Composite
```
#### GoF recommendation
> "We have emphasized transparency over safety in this pattern."
Transparency is GoF's preferred starting point, but it should still be chosen deliberately. Safety is appropriate when the cost of runtime leaf-manipulation errors is high and compile-time protection is worth the loss of uniformity.
---
### Concern 5: Should Component Implement a Child List?
**The question:** Should the child storage (the list/set/tree of children) live in the `Component` base class, or only in `Composite`?
**Putting it in Component:**
- Simplifies implementation — all classes inherit the storage
- Costs memory on every Leaf — a Leaf has no children but pays the overhead of an empty collection
- Worthwhile only if there are relatively few leaves in the structure
**Putting it in Composite (or a shared CompositeEquipment base):**
- Leaves pay no cost for storage they will never use
- Preferred when the structure contains many leaves
- Allows different Composite subtypes to use different storage structures
**Practical rule:** If Leaf objects are numerous and small (characters in a text document), keep child storage out of Component. If the tree is shallower with few leaves (an org chart, a UI widget tree), the convenience of shared storage in Component is acceptable.
---
### Concern 6: Child Ordering
**When it matters:** Any time the sequence of children affects the semantics of the operation. Examples:
- Rendering order (children drawn left-to-right define layout)
- Parse trees (compound statements have children ordered to reflect program sequence)
- Layered composites (z-order for UI stacking)
**Implementation:** Use an ordered list (`list` in Python, `std::vector` in C++). Design `add()` to accept an optional position parameter:
```python
def add(self, component: Component, position: Optional[int] = None) -> None:
    # None means "append"; a -1 sentinel would clash with Python's negative indexing.
    if position is None:
        self._children.append(component)
    else:
        self._children.insert(position, component)
```
**Iterator integration:** When traversal order matters and is non-trivial, consider applying the Iterator pattern to encapsulate traversal logic. The Composite's `create_iterator()` factory method returns an iterator appropriate for its structure, keeping traversal policy separate from the containment structure.
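
One lightweight way to realize `create_iterator()` in Python is a generator. The sketch below assumes preorder traversal and a `get_children()` accessor with an empty-list default on leaves; other orders would need a different generator body.

```python
from abc import ABC
from typing import Iterator


class Component(ABC):
    def __init__(self, name: str):
        self.name = name

    def get_children(self) -> list["Component"]:
        return []  # leaf default: no children

    def create_iterator(self) -> Iterator["Component"]:
        """Preorder traversal: the node itself, then each subtree in order."""
        yield self
        for child in self.get_children():
            yield from child.create_iterator()


class Leaf(Component):
    pass


class Composite(Component):
    def __init__(self, name: str):
        super().__init__(name)
        self._children: list[Component] = []

    def add(self, component: Component) -> None:
        self._children.append(component)

    def get_children(self) -> list[Component]:
        return self._children


root = Composite("root")
docs = Composite("docs")
docs.add(Leaf("design.pdf"))
root.add(Leaf("readme.txt"))
root.add(docs)

names = [node.name for node in root.create_iterator()]
# ['root', 'readme.txt', 'docs', 'design.pdf']
```

Because the traversal lives in `Component`, clients iterate a leaf and a composite the same way.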
---
### Concern 7: Caching to Improve Performance
**When to use:** When traversals are expensive and results can be validly cached between changes. Examples:
- Bounding box of a group of graphic objects (expensive to recompute if nothing moved)
- Net price of an equipment hierarchy (stable unless components are added or removed)
- Rendered HTML of a UI subtree (cacheable until a child changes)
**Implementation:** Store a cached value in the Composite, invalidated when any child changes.
```python
from typing import Optional
class CachedComposite(Component):
    def __init__(self, name: str):
        self._name = name
        self._parent: Optional["CachedComposite"] = None  # set by the parent's add()
        self._children: list[Component] = []
        self._cached_price: Optional[float] = None  # None means cache is invalid
    def add(self, component: Component) -> None:
        self._children.append(component)
        component._parent = self  # maintain the parent reference (Concern 1)
        self._invalidate_cache()
    def remove(self, component: Component) -> None:
        self._children.remove(component)
        component._parent = None
        self._invalidate_cache()
    def net_price(self) -> float:
        if self._cached_price is None:
            self._cached_price = sum(c.net_price() for c in self._children)
        return self._cached_price
    def _invalidate_cache(self) -> None:
        self._cached_price = None
        if self._parent is not None:
            self._parent._invalidate_cache()  # propagate up; requires parent refs (Concern 1)
```
**Dependency on Concern 1:** Cache invalidation propagating up the tree requires parent references. If you need caching, you almost certainly also need parent references.
---
### Concern 8: Who Should Delete Components?
**Context:** In languages without garbage collection (C++, Rust), memory management must be explicit.
**Default rule:** Make the Composite responsible for deleting its children when it is destroyed. This prevents memory leaks when a Composite is removed from the tree.
**Exception:** When Leaf objects are immutable and shared (Flyweight, Concern 2), the Composite must *not* delete shared leaves — doing so would destroy objects still referenced by other parents.
**In garbage-collected languages (Python, Java, JavaScript):** Explicit deletion does not apply, but the equivalent concern is reachability. If a Composite is the sole owner of its children, removing the Composite from the tree (or clearing its child list) drops the last references to them, which allows the collector to reclaim the leaves.
---
### Concern 9: Best Data Structure for Storing Children
**Context:** The GoF default is a list (linked list in C++, array in practice). The right choice depends on the operations most frequently performed.
| Data structure | Best when | Trade-off |
|---------------|-----------|-----------|
| **List (ordered)** | Sequence matters; children often iterated in order | O(n) search and removal |
| **Array/vector** | Random access by index; children rarely inserted in middle | O(n) insertion at arbitrary position |
| **Hash table** | Children looked up by key (name, ID) | No inherent ordering |
| **Tree (sorted)** | Children must be maintained in sorted order | O(log n) insert/search, more complex |
| **Per-child named fields** | Set of children is fixed and known at design time (e.g., a binary tree node always has `left` and `right`) | Requires each Composite subclass to implement its own management interface |
**When named fields are preferred:** If you know a Composite always has exactly N children with semantic roles (e.g., a binary expression node has `left_operand` and `right_operand`), use named fields instead of a generic collection. This is more expressive, provides type-safe access, and eliminates iteration overhead. The trade-off is that each Composite subclass must define its own management interface. See the Interpreter pattern for an example of this approach.
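A small sketch of the named-field approach, assuming a hypothetical arithmetic expression tree (`Expr`, `Number`, `Add`, and `Multiply` are illustrative names, not taken from the source):

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


class Expr(ABC):
    @abstractmethod
    def evaluate(self) -> float: ...


@dataclass(frozen=True)
class Number(Expr):  # leaf
    value: float

    def evaluate(self) -> float:
        return self.value


@dataclass(frozen=True)
class Add(Expr):  # composite with exactly two children, each with a semantic role
    left_operand: Expr
    right_operand: Expr

    def evaluate(self) -> float:
        return self.left_operand.evaluate() + self.right_operand.evaluate()


@dataclass(frozen=True)
class Multiply(Expr):
    left_operand: Expr
    right_operand: Expr

    def evaluate(self) -> float:
        return self.left_operand.evaluate() * self.right_operand.evaluate()


# (2 + 3) * 4
expr = Multiply(Add(Number(2), Number(3)), Number(4))
result = expr.evaluate()
```

There is no generic `add()` or `remove()` here: the constructor is the management interface, which is the trade-off the paragraph above describes.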
---
## Consequences (Full Catalog)
From GoF pages 158–159:
**1. Defines class hierarchies consisting of primitive objects and composite objects.**
Primitive objects can be composed into more complex objects, which in turn can be composed, and so on recursively. Wherever client code expects a primitive object, it can also take a composite object.
**2. Makes the client simple.**
Clients can treat composite structures and individual objects uniformly. Clients normally don't know (and shouldn't care) whether they're dealing with a leaf or a composite. This simplifies client code, because it avoids having to write tag-and-case-statement-style functions over the classes that define the composition.
**3. Makes it easier to add new kinds of components.**
Newly defined Composite or Leaf subclasses work automatically with existing structures and client code. Clients don't have to be changed for new Component classes.
**4. Can make your design overly general.**
The disadvantage of making it easy to add new components is that it makes it harder to restrict the components of a composite. Sometimes you want a composite to have only certain components. With Composite, you can't rely on the type system to enforce those constraints for you. You'll have to use run-time checks instead.
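A run-time check of this kind might look like the following sketch. `Bus`, `Card`, and `Drive` are hypothetical classes used only to illustrate the restriction.

```python
class Component:
    def __init__(self, name: str) -> None:
        self.name = name


class Card(Component):
    """Hypothetical leaf type that a Bus is allowed to contain."""


class Drive(Component):
    """Hypothetical leaf type that a Bus must reject."""


class Bus(Component):
    """Hypothetical composite restricted to Card children."""

    allowed_child_types: tuple = (Card,)

    def __init__(self, name: str) -> None:
        super().__init__(name)
        self._children: list = []

    def add(self, component: Component) -> None:
        # The signature accepts any Component, so the restriction is enforced at run time.
        if not isinstance(component, self.allowed_child_types):
            raise TypeError(
                f"{type(self).__name__} cannot contain {type(component).__name__}"
            )
        self._children.append(component)


bus = Bus("pci")
bus.add(Card("network"))

try:
    bus.add(Drive("floppy"))
    rejected = False
except TypeError:
    rejected = True
```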
---
## Related Patterns
| Pattern | Relationship to Composite |
|---------|--------------------------|
| **Chain of Responsibility** | The component-parent link is commonly used for Chain of Responsibility. A request propagates up the tree from child to parent. Requires parent references (Concern 1). |
| **Decorator** | Often used with Composite. When Decorators and Composites are used together, they share a common Component class. Decorators must support the Component interface with `Add`, `Remove`, `GetChild`. |
| **Flyweight** | Lets you share components, but they can no longer refer to their parents. Applicable when memory reduction from sharing outweighs the need for parent navigation. |
| **Iterator** | Can be used to traverse composites. Particularly useful when child ordering is significant and traversal policy should be decoupled from the containment structure. |
| **Visitor** | Localizes operations and behavior that would otherwise be distributed across Composite and Leaf classes. Use when you need to add operations to the hierarchy without modifying each class. |
---
## Language-Specific Notes
### Python
- Use `ABC` and `@abstractmethod` for Component
- `Optional["ClassName"]` for forward references and nullable parent/child types
- Generator expressions make recursive `operation()` implementations concise: `sum(c.net_price() for c in self._children)`
- No manual memory management — Concern 8 is handled by the garbage collector
### Java / C#
- Component as an abstract class or interface
- Java: `List<Component>` as child storage; `Optional<Component>` for nullable returns
- C#: `IComponent` interface pattern common; use `IEnumerable<IComponent>` for children
- Safety option is more natural — `instanceof` (Java) / `is` (C#) checks are common and idiomatic
### C++ (GoF original)
- Component as an abstract base class with pure virtual methods
- Child storage in `CompositeEquipment` using `std::list<Equipment*>` (`_equipment` member)
- Manual memory management applies: Composite destructor deletes children (Concern 8)
- `dynamic_cast` available as alternative to `GetComposite()` idiom for safety approach
- `CreateIterator()` factory method in Component returns `NullIterator` by default (leaves), `ListIterator` in Composite
---
name: command-pattern-implementor
description: |
Implement the Command pattern to encapsulate requests as objects, enabling parameterized operations, queuing, logging, and undo/redo. Use when you need to decouple UI elements from operations, implement multi-level undo with command history, support macro recording, queue or schedule requests, or log operations for crash recovery. Includes the complete undo/redo algorithm using a command history list with present-line pointer, MacroCommand for composite operations, and SimpleCommand template for eliminating subclasses.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/design-patterns-gof/skills/command-pattern-implementor
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- behavioral-pattern-selector
source-books:
- id: design-patterns-gof
title: "Design Patterns: Elements of Reusable Object-Oriented Software"
authors: ["Erich Gamma", "Richard Helm", "Ralph Johnson", "John Vlissides"]
chapters: [2, 5]
pages: [64-69, 219-228]
tags: [design-patterns, behavioral, gof, command, undo, redo, macro, history, transaction, invoker, receiver]
execution:
tier: 2
mode: full
inputs:
- type: codebase
description: "A software project needing to encapsulate operations as objects, with or without undo/redo support"
tools-required: [TodoWrite, Read, Write]
tools-optional: [Grep, Bash]
mcps-required: []
environment: "A codebase with at least one invoker (button, menu, toolbar) and one or more receivers (domain objects that perform the actual work)"
---
# Command Pattern Implementor
## When to Use
You have a behavioral problem where one or more of these is true:
- UI elements (menu items, toolbar buttons, keyboard shortcuts) need to trigger operations without knowing the receiver or the operation's implementation
- The application requires multi-level undo and redo beyond a single "last action"
- Operations need to be queued, scheduled, logged to disk, or replayed after a crash
- Multiple UI surfaces (menu, button, keyboard shortcut) must trigger the same operation without code duplication
- Operations should be combinable into macro sequences that execute as a unit
Before starting, confirm:
- Can you identify at least one **invoker** (the object that triggers the request — e.g., a menu item, button, toolbar) and one **receiver** (the object that knows how to perform the operation — e.g., a document, model, service)?
- Do operations need to be reversible? If so, can each operation store enough state to reverse itself, or does it need an external snapshot (Memento)?
If unsure whether Command is the right pattern, invoke `behavioral-pattern-selector` first and confirm Command is recommended.
---
## Process
### Step 1: Set Up Tracking and Identify Participants
**ACTION:** Use `TodoWrite` to track progress. Identify the five Command pattern participants in your specific system.
**WHY:** Command's power comes from the clean separation of five roles. Confusing the invoker with the receiver, or putting business logic in the wrong place, undermines the pattern's decoupling benefit. Naming these explicitly before writing any code prevents the most common Command implementation mistakes.
```
TodoWrite:
- [ ] Step 1: Identify Command participants (command, concrete commands, invoker, receiver, client)
- [ ] Step 2: Design the Command interface and protocol
- [ ] Step 3: Implement concrete commands with Execute (and Unexecute if needed)
- [ ] Step 4: Implement the undo/redo history if required
- [ ] Step 5: Implement MacroCommand if composite operations are needed
- [ ] Step 6: Wire invoker to commands in client code
- [ ] Step 7: Verify with-skill vs without-skill scenarios
```
Capture these five participants:
| Role | GoF Name | In your system | Responsibility |
|------|----------|----------------|---------------|
| **Command** | Command | The abstract interface | Declares Execute() — and Unexecute()/Reversible() if undoable |
| **ConcreteCommand** | ConcreteCommand | One class per operation | Binds receiver + action; stores state for undo |
| **Invoker** | Invoker | e.g., MenuItem, Button, Toolbar | Calls Execute() on its command; knows nothing about receiver or operation |
| **Receiver** | Receiver | e.g., Document, Model, Service | Has the domain knowledge to actually carry out the request |
| **Client** | Client (Application) | e.g., Application, Controller | Creates ConcreteCommand objects, sets their receivers, gives them to invokers |
If the receiver is not identifiable — i.e., the command does everything itself without delegating — that is a valid design choice for simple operations (see the "How intelligent should a command be?" discussion in the implementation guide).
---
### Step 2: Design the Command Interface and Protocol
**ACTION:** Define the Command interface. Choose the right protocol based on whether undo is required.
**WHY:** The interface design is load-bearing — every invoker, every history list, every macro depends on it. Adding Unexecute later as an afterthought usually means retrofitting all concrete commands. Decide at design time whether undo is in scope; if it is, Unexecute and Reversible belong in the base interface from the start.
**Minimal interface (no undo required):**
```
abstract class Command:
    abstract Execute()
```
**Full undo/redo interface:**
```
abstract class Command:
    abstract Execute()
    abstract Unexecute()           # Reverses the effect of Execute()
    virtual Reversible() → bool    # Default: true. Override to return false
                                   # when a command has no effect or is non-reversible
                                   # (e.g., SaveCommand: "undo save" is meaningless)
```
**Reversible() rationale:** Sometimes a command's net effect is nothing — e.g., a FontCommand applied to text that already has that font. Calling Unexecute on such a command would do equally meaningless work. `Reversible()` lets the command signal at runtime whether the undo is meaningful. Commands that return `false` are not pushed onto the history list.
**Note on state storage:** Each ConcreteCommand must store whatever state is needed to reverse itself. This typically means capturing the pre-Execute state of the receiver *before* the operation runs. What to capture:
- The receiver object reference
- The arguments that were passed to the receiver
- Any original values in the receiver that will change as a result
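The full protocol could be sketched in Python as below. This is an illustration under simplifying assumptions: the `Document` receiver is hypothetical and holds a single `font` for the whole document, and method names follow Python conventions (`execute`, `unexecute`, `reversible`) rather than the GoF spellings.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Command(ABC):
    @abstractmethod
    def execute(self) -> None: ...

    @abstractmethod
    def unexecute(self) -> None: ...

    def reversible(self) -> bool:
        # Default: undoable. Subclasses override when undo would be meaningless.
        return True


class Document:
    """Hypothetical receiver: one font for the whole document."""

    def __init__(self, font: str) -> None:
        self.font = font


class FontCommand(Command):
    def __init__(self, document: Document, new_font: str) -> None:
        self._document = document            # the receiver reference
        self._new_font = new_font            # the argument passed to the receiver
        self._previous_font: str | None = None  # original value, captured in execute()

    def reversible(self) -> bool:
        # Runtime predicate: applying the font the document already has is a no-op.
        return self._document.font != self._new_font

    def execute(self) -> None:
        self._previous_font = self._document.font  # capture BEFORE mutating
        self._document.font = self._new_font

    def unexecute(self) -> None:
        if self._previous_font is not None:
            self._document.font = self._previous_font


doc = Document("Times")
cmd = FontCommand(doc, "Helvetica")
was_reversible = cmd.reversible()  # must be asked before execute()
cmd.execute()
font_after_execute = doc.font
cmd.unexecute()
```

In this sketch `reversible()` compares against the receiver's current state, so it only gives a meaningful answer when queried before `execute()`.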
---
### Step 3: Implement Concrete Commands
**ACTION:** For each operation in scope, create a ConcreteCommand class that binds the receiver to a specific action.
**WHY:** The ConcreteCommand is the "binding" — it pairs a receiver object with a method call and the state needed to undo that call. This binding is what lets the invoker stay completely decoupled: the MenuItem (invoker) does not need to know about the Document (receiver) at all. The ConcreteCommand holds that reference privately.
**Pattern for each ConcreteCommand:**
```
class PasteCommand : Command:
    constructor(document: Document):
        this._document = document
    Execute():
        this._document.Paste()
    Unexecute():
        this._document.DeleteLastPaste()   # reverse the paste
```
**Pattern for operations requiring saved state (the common case):**
```
class FontCommand : Command:
    constructor(document: Document, newFont: Font, affectedRange: Range):
        this._document = document
        this._newFont = newFont
        this._affectedRange = affectedRange
        this._previousFont = null   # captured at Execute time
    Execute():
        this._previousFont = this._document.GetFont(this._affectedRange)   # save BEFORE
        this._document.SetFont(this._newFont, this._affectedRange)
    Unexecute():
        this._document.SetFont(this._previousFont, this._affectedRange)   # restore
```
**Decisions to make per command:**
| Decision | When | Why |
|----------|------|-----|
| Save state in constructor or Execute? | State that can vary per invocation (e.g., which objects are selected) should be captured at Execute time, not construction time | A MenuItem is configured once at startup; the selection it acts on changes with every user action |
| Copy before history? | If the same ConcreteCommand object is reused across invocations and its state varies, copy the command before placing it on the history list | Without copying, a shared command object would overwrite its undo state on every Execute |
| Does the command know its receiver implicitly? | When no suitable receiver class exists, or the command is self-contained (e.g., OpenCommand prompts the user, creates a Document, adds it to Application) | Self-contained commands are fine; they trade receiver flexibility for simplicity |
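The "copy before history" decision can be illustrated with a short sketch. The `Editor` receiver and `DeleteCommand` below are hypothetical, and a plain Python list stands in for the history; a shallow copy is assumed to be enough because the per-invocation state consists of immutable values.

```python
import copy


class Editor:
    """Hypothetical receiver: text plus a current selection (start, end)."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.selection = (0, 0)

    def delete_selection(self) -> str:
        start, end = self.selection
        removed = self.text[start:end]
        self.text = self.text[:start] + self.text[end:]
        return removed

    def insert(self, position: int, fragment: str) -> None:
        self.text = self.text[:position] + fragment + self.text[position:]


class DeleteCommand:
    """One instance is wired to a menu item at startup and reused for every click."""

    def __init__(self, editor: Editor) -> None:
        self._editor = editor
        self._position = 0
        self._removed = ""

    def execute(self) -> None:
        # State that varies per invocation is captured here, not in the constructor.
        self._position = self._editor.selection[0]
        self._removed = self._editor.delete_selection()

    def unexecute(self) -> None:
        self._editor.insert(self._position, self._removed)


editor = Editor("hello world")
shared = DeleteCommand(editor)
history = []

for selection in [(0, 6), (0, 2)]:
    editor.selection = selection
    shared.execute()
    # Shallow copy: the receiver is still shared, the undo state is frozen per invocation.
    history.append(copy.copy(shared))

text_after_deletes = editor.text

while history:
    history.pop().unexecute()  # newest first
```

Appending `shared` itself instead of a copy would leave two history entries pointing at one object, and both undos would re-insert the most recently removed fragment.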
For the **SimpleCommand template** (languages that support generic/template types): when a command is simple — not undoable, no arguments, just calls one method on one receiver — a template can eliminate the need for a separate subclass. See `references/command-implementation-guide.md` for the SimpleCommand pattern and usage.
---
### Step 4: Implement the Undo/Redo History
**ACTION:** Implement the command history list with a present-line pointer. Apply the undo and redo algorithms exactly as specified below.
**WHY:** The history list is the mechanism that converts an individual undoable command into multi-level undo. Without the present-line pointer, you cannot distinguish "commands that have been executed" from "commands that have been undone and are available for redo." The pointer is the state machine's cursor: everything to its left is past; everything to its right is future.
**Data structure:**
```
class CommandHistory:
    _commands: List<Command>   # ordered list of executed commands
    _present: int              # index of the most recently executed command
                               # (or -1 if nothing has been executed yet)
```
Conceptually:
```
[ cmd1 ]--[ cmd2 ]--[ cmd3 ]--[ cmd4 ] |
<------------- past ------------->  present
```
**Execute a new command:**
```
ExecuteCommand(cmd: Command):
    IF cmd.Reversible():
        # Discard any redo history to the right of present
        # (a new action invalidates the "future" branch)
        this._commands.truncate_after(this._present)
        cmd.Execute()
        this._commands.append(cmd)
        this._present = this._commands.last_index()
    ELSE:
        cmd.Execute()   # non-reversible commands (e.g., Save) execute but are not tracked
```
**Undo (move backward):**
```
Undo():
    IF this._present >= 0:
        cmd = this._commands[this._present]
        cmd.Unexecute()       # reverse the command
        this._present -= 1    # move present one step to the left
```
**Redo (move forward):**
```
Redo():
    next = this._present + 1
    IF next < this._commands.length:
        cmd = this._commands[next]
        cmd.Execute()         # re-execute the command
        this._present += 1    # move present one step to the right
```
**Visualizing undo then redo:**
```
Initial:
[ c1 ]--[ c2 ]--[ c3 ]--[ c4 ] |
                            present
After Undo:
[ c1 ]--[ c2 ]--[ c3 ]--[ c4 ]
                      |
                   present
(c4.Unexecute() was called; c4 stays in list for potential redo)
After Undo again:
[ c1 ]--[ c2 ]--[ c3 ]--[ c4 ]
              |
           present
After Redo:
[ c1 ]--[ c2 ]--[ c3 ]--[ c4 ]
                      |
                   present
(c3.Execute() was called again; present advances right)
After new command c5 (while c4 was in "future"):
[ c1 ]--[ c2 ]--[ c3 ]--[ c5 ] |
                            present
(c4 is discarded: the new action invalidates the old redo branch)
```
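The execute, undo, and redo algorithms above can be exercised end to end with a runnable sketch. `AppendCommand` is a hypothetical command that appends a label to a shared list, chosen only because its effect is easy to inspect; the demo replays the same sequence as the visualization (four commands, two undos, one redo, then a new command).

```python
from __future__ import annotations


class AppendCommand:
    """Hypothetical command: appends a label to a shared log list."""

    def __init__(self, log: list[str], label: str) -> None:
        self._log = log
        self._label = label

    def reversible(self) -> bool:
        return True

    def execute(self) -> None:
        self._log.append(self._label)

    def unexecute(self) -> None:
        self._log.pop()


class CommandHistory:
    def __init__(self) -> None:
        self._commands: list[AppendCommand] = []
        self._present = -1  # index of the most recently executed command

    def execute_command(self, cmd: AppendCommand) -> None:
        is_reversible = cmd.reversible()  # ask before executing
        cmd.execute()
        if is_reversible:
            del self._commands[self._present + 1:]  # discard the redo branch
            self._commands.append(cmd)
            self._present = len(self._commands) - 1

    def undo(self) -> bool:
        if self._present < 0:
            return False  # nothing to undo
        self._commands[self._present].unexecute()
        self._present -= 1
        return True

    def redo(self) -> bool:
        nxt = self._present + 1
        if nxt >= len(self._commands):
            return False  # nothing to redo
        self._commands[nxt].execute()
        self._present = nxt
        return True


log: list[str] = []
history = CommandHistory()
for label in ["c1", "c2", "c3", "c4"]:
    history.execute_command(AppendCommand(log, label))

history.undo()   # reverses c4
history.undo()   # reverses c3
history.redo()   # re-executes c3
history.execute_command(AppendCommand(log, "c5"))  # c4 is dropped from the redo branch
redo_possible = history.redo()
```

Returning a boolean from `undo()` and `redo()` is a design choice of this sketch; it lets the UI disable the corresponding menu items when nothing is available.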
**Bounding history length:** Set a maximum length if memory or storage is a concern. When the list is full and a new command arrives, remove the oldest command from the left before appending. This sacrifices the earliest undo levels, which is almost always the correct trade-off.
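One detail worth sketching is that removing the oldest entry shifts every index, so the present pointer has to be re-derived afterwards. The `BoundedHistory` class and its `record()` method are hypothetical names, and plain strings stand in for command objects.

```python
class BoundedHistory:
    """Hypothetical history that keeps at most max_length undo levels."""

    def __init__(self, max_length: int) -> None:
        if max_length < 1:
            raise ValueError("max_length must be at least 1")
        self._max_length = max_length
        self.commands: list = []
        self.present = -1  # index of the most recently executed command

    def record(self, cmd) -> None:
        """Record a command that has already been executed."""
        del self.commands[self.present + 1:]    # drop the redo branch
        if len(self.commands) == self._max_length:
            del self.commands[0]                # sacrifice the oldest undo level
        self.commands.append(cmd)
        self.present = len(self.commands) - 1   # re-derive the index after trimming


history = BoundedHistory(max_length=3)
for label in ["c1", "c2", "c3", "c4", "c5"]:
    history.record(label)  # strings stand in for command objects
```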
**Error accumulation warning:** Errors can accumulate when a command is executed, unexecuted, and re-executed repeatedly, and the receiver's state drifts from its original values. If exact fidelity is critical, store a Memento (state snapshot) as part of the command's undo data rather than relying on computed reversal.
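A minimal sketch of the snapshot approach, assuming a hypothetical `Shape` receiver with a floating-point width. The command restores the saved value instead of dividing by the scale factor, so repeated execute/unexecute cycles cannot accumulate rounding error.

```python
class Shape:
    """Hypothetical receiver with a floating-point width."""

    def __init__(self, width: float) -> None:
        self.width = width


class ScaleCommand:
    def __init__(self, shape: Shape, factor: float) -> None:
        self._shape = shape
        self._factor = factor
        self._snapshot = None  # memento: exact prior state, captured in execute()

    def execute(self) -> None:
        self._snapshot = self._shape.width
        self._shape.width *= self._factor

    def unexecute(self) -> None:
        # Restoring the snapshot is exact; computing width / factor might round differently.
        if self._snapshot is not None:
            self._shape.width = self._snapshot


shape = Shape(0.1)
original = shape.width
cmd = ScaleCommand(shape, 3.0)
for _ in range(100):
    cmd.execute()
    cmd.unexecute()
```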
---
### Step 5: Implement MacroCommand (if composite operations needed)
**ACTION:** Implement MacroCommand when a single user action must execute a sequence of commands as a unit.
**WHY:** MacroCommand is Command + Composite. It lets you treat a sequence of operations identically to a single operation from the invoker's perspective. This is what enables macro recording: you collect commands into a MacroCommand as the user performs them, then play back the MacroCommand. Critically, Unexecute must iterate in *reverse* order — because applying the sub-commands' undo in original order would leave the receiver in the wrong state.
```
class MacroCommand : Command:
    _commands: List<Command> = []
    Add(cmd: Command):
        this._commands.append(cmd)
    Remove(cmd: Command):
        this._commands.remove(cmd)
    Execute():
        FOR cmd IN this._commands:            # forward order
            cmd.Execute()
    Unexecute():
        FOR cmd IN REVERSE(this._commands):   # reverse order is critical
            cmd.Unexecute()
```
**MacroCommand has no explicit receiver** — the sub-commands already define their own receivers. The MacroCommand simply orchestrates them.
**Undo order matters:** If the macro executes [A, B, C] in that order, undoing it must call [C.Unexecute(), B.Unexecute(), A.Unexecute()]. Undoing in original order (A, B, C) would try to undo A before undoing C and B, which may leave the receiver in an inconsistent intermediate state.
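The effect of undo order can be seen in a small runnable sketch. `SetValue` is a hypothetical command that writes a key in a dictionary and remembers what was there before; two such commands on the same key depend on each other in exactly the way described above.

```python
_MISSING = object()  # sentinel: the key did not exist before execute()


class SetValue:
    """Hypothetical command: sets state[key], remembering the prior value."""

    def __init__(self, state: dict, key: str, value: int) -> None:
        self._state = state
        self._key = key
        self._value = value
        self._previous = _MISSING

    def execute(self) -> None:
        self._previous = self._state.get(self._key, _MISSING)
        self._state[self._key] = self._value

    def unexecute(self) -> None:
        if self._previous is _MISSING:
            self._state.pop(self._key, None)
        else:
            self._state[self._key] = self._previous


class MacroCommand:
    def __init__(self) -> None:
        self._commands: list = []

    def add(self, cmd) -> None:
        self._commands.append(cmd)

    def execute(self) -> None:
        for cmd in self._commands:            # forward order
            cmd.execute()

    def unexecute(self) -> None:
        for cmd in reversed(self._commands):  # reverse order
            cmd.unexecute()


# Correct: reverse-order undo returns to the empty starting state.
state: dict = {}
macro = MacroCommand()
macro.add(SetValue(state, "x", 1))
macro.add(SetValue(state, "x", 2))
macro.execute()
after_execute = dict(state)
macro.unexecute()

# Incorrect, shown for contrast: undoing in original order leaves a stale value.
wrong_state: dict = {}
commands = [SetValue(wrong_state, "x", 1), SetValue(wrong_state, "x", 2)]
for cmd in commands:
    cmd.execute()
for cmd in commands:
    cmd.unexecute()
```

In the incorrect variant the first command removes `x`, then the second restores the value it saw before it ran, so the state ends with `x` set to 1 even though the starting state had no `x` at all.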
---
### Step 6: Wire Invokers to Commands in Client Code
**ACTION:** In the client (Application or Controller), create ConcreteCommand objects with their receivers and assign them to invokers.
**WHY:** The client is the only place that knows both the invoker and the receiver. This is intentional — the pattern's decoupling only works because this wiring happens once, in one place, at configuration time. The invoker then operates purely through the Command interface.
```
# Client code (Application setup):
document = Document("readme.txt")
history = CommandHistory()
pasteMenuItem = MenuItem()
pasteCommand = PasteCommand(document)   # bind receiver at creation
pasteMenuItem.SetCommand(pasteCommand)
# MenuItem's Clicked() handler (Invoker):
Clicked():
    history.ExecuteCommand(this._command)   # or: this._command.Execute() if no history
```
**Sharing command instances:** A menu item and a toolbar button can share the exact same ConcreteCommand instance — they both call `Execute()` on it. This is what enables "one operation, many surfaces" with zero duplication.
**Context-sensitive commands:** If the receiver changes at runtime (e.g., "paste into whichever document is currently active"), the command can resolve the receiver lazily inside Execute() rather than at construction time. This trades the simplicity of eager binding for runtime flexibility.
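Lazy receiver resolution might be sketched as follows. The `Application.active_document` attribute and the resolver callable passed to `PasteCommand` are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Callable, Optional


class Document:
    """Hypothetical receiver that records pasted text."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.content: list[str] = []

    def paste(self, text: str) -> None:
        self.content.append(text)


class Application:
    def __init__(self) -> None:
        self.active_document: Optional[Document] = None


class PasteCommand:
    def __init__(self, resolve_document: Callable[[], Optional[Document]], text: str) -> None:
        self._resolve_document = resolve_document  # a lookup, not a fixed receiver
        self._text = text

    def execute(self) -> None:
        document = self._resolve_document()  # resolved at call time
        if document is not None:
            document.paste(self._text)


app = Application()
first, second = Document("a.txt"), Document("b.txt")
paste = PasteCommand(lambda: app.active_document, "hello")

app.active_document = first
paste.execute()
app.active_document = second
paste.execute()
```

If such a command also needs to be undoable, it would have to remember which document it resolved during `execute()` so that the reversal targets the same receiver.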
---
### Step 7: Verify Implementation
**ACTION:** Test the implementation against two scenarios: with-skill (using Command pattern) vs. without-skill (baseline).
**WHY:** Functional verification catches the most common implementation bugs — most often: undo restoring to the wrong state (forgot to capture pre-Execute state), redo failing (present-line pointer not advanced after redo), or MacroCommand unexecuting in wrong order.
**Verification checklist:**
- [ ] Invoker has no import or reference to the Receiver class
- [ ] Execute and Unexecute restore exact round-trip state (execute → unexecute → state matches original)
- [ ] Multiple sequential undos each restore one step correctly
- [ ] Redo after undo re-executes the correct command and advances the pointer
- [ ] New command after undo discards the redo branch (old future commands are gone)
- [ ] MacroCommand's Unexecute iterates sub-commands in reverse
- [ ] Non-reversible commands (Reversible() = false) are not placed on the history list
- [ ] Copying behavior is correct: if a command's state varies per invocation, it is copied before history placement
---
## Inputs
- Description of the operations to encapsulate (what actions, which receivers)
- Whether undo/redo is required, and if so, at what depth
- Whether macro recording or composite operations are needed
- Programming language (to apply the SimpleCommand template pattern correctly)
## Outputs
- **Command interface** (abstract class with Execute, and optionally Unexecute and Reversible)
- **ConcreteCommand classes** — one per operation in scope
- **CommandHistory** — present-line pointer implementation for unlimited undo/redo (if applicable)
- **MacroCommand** — composite command with correct forward/reverse Execute/Unexecute (if applicable)
- **Client wiring** — invoker-command-receiver binding in application setup
---
## Key Principles
- **The invoker must not know the receiver** — if the invoker imports, references, or casts to the receiver type, the decoupling is broken. The ConcreteCommand is the only place that knows both. WHY: decoupling is what allows the same Command to be triggered by a menu item, a toolbar button, and a keyboard shortcut without any of those invokers knowing about Document, Model, or Service.
- **Capture undo state at Execute time, not construction time** — a command is often constructed once (at application setup) but invoked many times. The selection, cursor position, or affected range may change between construction and invocation. WHY: capturing state at construction time would save the wrong state, making Unexecute restore to the wrong point.
- **Reversible() is a runtime predicate, not a class-level distinction** — the same class can return true or false based on execution context. WHY: a FontCommand applied to text already in that font has no effect; its Unexecute would do equal meaningless work. Reversible() prevents these no-ops from polluting the history list.
- **New command execution truncates the redo branch** — when the user undoes two steps and then performs a new action, the two undone commands are gone. WHY: the new action creates a divergent history. Preserving the old "future" would require a branching history tree, which is rarely worth the complexity. Most applications implement linear history.
- **MacroCommand unexecute must iterate in reverse**: the sub-commands of a macro may have dependencies (B assumes A has happened; undoing A before B may corrupt state). Reverse-order unexecution guarantees that each command is undone in the context where it was originally executed. WHY: forward-order undo reverses A while the effects of B and C still depend on it, leaving the receiver in intermediate states that were never valid program states.
- **SimpleCommand eliminates subclasses only for simple cases** — the template approach works when a command has no undo requirement and takes no extra arguments. The moment you need undo state or per-invocation arguments, a full ConcreteCommand subclass is required. WHY: trying to stretch SimpleCommand beyond its design envelope produces parameterized callbacks that are harder to read than a dedicated subclass.
---
## Examples
### Example 1: Document Editor with Multi-Level Undo
**Scenario:** A text editor with menus for Cut, Copy, Paste, and Bold. Users expect unlimited undo/redo accessible from Edit > Undo and Edit > Redo, and from Ctrl+Z / Ctrl+Y. The same Cut command should be triggerable from the menu, the toolbar, and the keyboard shortcut.
**Trigger:** "We need multi-level undo for edit operations and the same command needs to work from three different places in the UI."
**Process:**
- Step 1: Invoker = MenuItem + ToolbarButton + KeyboardHandler. Receiver = Document. Client = Application.
- Step 2: Full undo protocol — Command interface includes Execute, Unexecute, Reversible.
- Step 3: CutCommand captures the cut selection and clipboard content at Execute time. BoldCommand captures the affected range and original font at Execute time.
- Step 4: CommandHistory with present-line pointer. Edit > Undo calls `history.Undo()`. Edit > Redo calls `history.Redo()`.
- Step 6: Application creates one `CutCommand(document)`, assigns it to the menu item, toolbar button, and keyboard handler. All three share the same instance.
**Output:**
- `CutCommand.Execute()` saves selected text to clipboard, deletes from document, records deleted range + content
- `CutCommand.Unexecute()` re-inserts recorded content at recorded range, updates clipboard
- History state after three edits: `[CutCmd | BoldCmd | PasteCmd] |` (present at end)
- After two undos: `[CutCmd | BoldCmd | PasteCmd]` (present points at CutCmd)
- After typing a new character: `[CutCmd | TypeCmd]` (BoldCmd and PasteCmd discarded)
---
### Example 2: Database Transaction System
**Scenario:** A database management system needs to support atomic transactions as a sequence of insertions, updates, and deletions. The whole transaction must be undoable as a unit if it fails partway through. Operations need to be logged to disk so they can be replayed after a crash.
**Trigger:** "We need to group database operations into a transaction and undo the whole thing if any step fails."
**Process:**
- Step 1: Invoker = TransactionManager. Receiver = Database. Client = Application.
- Step 2: Full undo interface. Each operation also implements `Store()` / `Load()` for crash-recovery logging.
- Step 3: `InsertCommand` captures the inserted row's primary key for Unexecute (DELETE by key). `UpdateCommand` captures the original row values. `DeleteCommand` captures the full deleted row.
- Step 5: Transaction = MacroCommand. On success, the MacroCommand is placed on history as one unit. On failure midway, `MacroCommand.Unexecute()` reverses only the sub-commands that completed, in reverse order, so the MacroCommand must track how far Execute got.
- Log persistence: after Execute, each sub-command is serialized to a log file. On crash recovery, the log is read and each command's Execute is re-applied.
**Output:**
- Transaction of [InsertCmd, UpdateCmd, DeleteCmd] executes forward. If DeleteCmd fails, the completed sub-commands are unexecuted in reverse order: [UpdateCmd.Unexecute(), InsertCmd.Unexecute()]. DeleteCmd never completed, so it is not reversed.
- History after commit: one MacroCommand on the history list, undoable as a unit
- Crash recovery: replay log by re-calling Execute on each logged command
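The partial-rollback behavior in this example can be sketched as a macro that tracks which sub-commands completed. All class names are hypothetical, a dictionary stands in for the database table, and the failing delete simply raises to simulate an error; logging and crash recovery are left out.

```python
class InsertRow:
    def __init__(self, table: dict, key: int, row: dict) -> None:
        self._table, self._key, self._row = table, key, row

    def execute(self) -> None:
        self._table[self._key] = self._row

    def unexecute(self) -> None:
        del self._table[self._key]  # DELETE by the captured key


class UpdateRow:
    def __init__(self, table: dict, key: int, new_row: dict) -> None:
        self._table, self._key, self._new_row = table, key, new_row
        self._original = None

    def execute(self) -> None:
        self._original = self._table[self._key]  # capture the original row values
        self._table[self._key] = self._new_row

    def unexecute(self) -> None:
        self._table[self._key] = self._original


class FailingDelete:
    """Hypothetical command that always fails, to simulate a mid-transaction error."""

    def __init__(self) -> None:
        self.unexecuted = False

    def execute(self) -> None:
        raise RuntimeError("simulated failure")

    def unexecute(self) -> None:
        self.unexecuted = True


class Transaction:
    def __init__(self, commands: list) -> None:
        self._commands = list(commands)
        self._executed: list = []  # only the sub-commands that completed

    def execute(self) -> None:
        self._executed = []
        try:
            for cmd in self._commands:
                cmd.execute()
                self._executed.append(cmd)
        except Exception:
            self.unexecute()  # roll back what completed, newest first
            raise

    def unexecute(self) -> None:
        while self._executed:
            self._executed.pop().unexecute()


table: dict = {}
failing = FailingDelete()
txn = Transaction([
    InsertRow(table, 1, {"name": "a"}),
    UpdateRow(table, 1, {"name": "b"}),
    failing,
])

try:
    txn.execute()
    committed = True
except RuntimeError:
    committed = False
```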
---
### Example 3: Lexi MenuItem-Command Case Study (from GoF)
**Scenario:** The Lexi document editor (GoF chapter 2) has dozens of user operations accessible via pull-down menus and keyboard shortcuts. Operations include paste, change font, open document, quit, and save. Some are undoable (paste, font change), some are not (quit, save). MenuItems must not be subclassed per operation — one MenuItem class must serve all cases.
**Trigger:** "We have too many menu item subclasses — one per operation. Adding a new operation requires a new subclass."
**Process:**
- Step 1: Invoker = MenuItem. Receivers = Document (paste, font), Application (open, quit). Client = Application (configures menus at startup).
- Step 2: Command interface has Execute. Undo interface has Unexecute + Reversible.
- Step 3:
- `PasteCommand(document)`: Execute calls `document.Paste()`. Undoable — stores clipboard content.
- `FontCommand(document, font, range)`: Execute saves original font, applies new font. Undoable.
- `OpenCommand(application)`: Execute prompts user for filename, creates Document, calls `application.Add(document)`, opens it. Not the typical receiver-delegation pattern — the command does all the work itself because there is no pre-existing receiver.
- `QuitCommand(application)`: Execute checks if document is modified; if so, calls `save.Execute()`, then quits. Reversible() = false — not placed on history.
- Step 6: Application creates all ConcreteCommand instances at startup, assigns each to the appropriate MenuItem. MenuItem.Clicked() calls `history.ExecuteCommand(this._command)`.
**Output:**
- One MenuItem class handles all operations — the operation is entirely in the Command
- Edit menu's Undo item calls `history.Undo()` — which calls `Unexecute()` on the last reversible command
- QuitCommand does not appear in undo history
- Font change on text already in that font: `FontCommand.Reversible()` checks if font actually changed; returns false if not, preventing a no-op undo entry
---
## References
| File | When to read |
|------|-------------|
| `references/command-implementation-guide.md` | For SimpleCommand template pattern, deep state management guidance, MacroCommand subcommand ownership, and language-specific implementation notes |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-behavioral-pattern-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/command-implementation-guide.md
# Command Pattern — Implementation Guide
Deep implementation details for the `command-pattern-implementor` skill.
---
## SimpleCommand Template
For commands that are (1) not undoable and (2) take no extra arguments, a generic/template class can eliminate the need to define a ConcreteCommand subclass for every operation.
**SimpleCommand** is parameterized by the receiver type and stores a pointer-to-member-function (the action). Execute simply calls the stored action on the stored receiver.
**C++ template form (source: GoF p. 225–226):**
```cpp
template <class Receiver>
class SimpleCommand : public Command {
public:
    typedef void (Receiver::* Action)();
    SimpleCommand(Receiver* r, Action a)
        : _receiver(r), _action(a) {}
    virtual void Execute() {
        (_receiver->*_action)();
    }
private:
    Action _action;
    Receiver* _receiver;
};
```
**Usage:**
```cpp
MyClass* receiver = new MyClass;
Command* aCommand = new SimpleCommand<MyClass>(receiver, &MyClass::Action);
aCommand->Execute(); // calls receiver->Action()
```
**In modern languages:**
In Python, JavaScript, or Java, a lambda or method reference serves the same purpose:
```python
# Python: a closure replaces SimpleCommand
def make_command(receiver, action_name):
    def execute():
        getattr(receiver, action_name)()
    return execute

# `document` is any existing object that has a paste() method
paste_cmd = make_command(document, "paste")
paste_cmd()
```
**Limitations of SimpleCommand / closures:**
- No Unexecute — use only when undo is not required
- No argument storage — use only when the operation takes no extra parameters
- No per-invocation state capture — if the operation depends on runtime state (e.g., current selection), a full ConcreteCommand subclass is required
As soon as any of these three conditions changes, upgrade to a full ConcreteCommand class.
---
## How Intelligent Should a Command Be?
GoF names this a design spectrum:
**Thin command (pure binding):**
The command stores only the receiver reference and delegates everything.
```
class PasteCommand:
    Execute(): this._document.Paste()
```
Use when: a suitable receiver class already exists and has the method you need.
**Fat command (self-contained):**
The command implements the operation entirely without delegating to a receiver at all.
```
class OpenCommand:
    Execute():
        name = AskUser()
        if name:
            doc = new Document(name)
            this._application.Add(doc)
            doc.Open()
```
Use when: no suitable receiver exists, or the command must coordinate across multiple objects (like OpenCommand, which creates a Document, adds it to Application, and opens it — three operations on two objects, with no single "receiver" that knows how to do all three).
**Middle ground:**
The command has enough knowledge to find its receiver dynamically (e.g., asks a registry or uses a service locator inside Execute).
Use when: the receiver is not known at construction time and cannot be passed in.
**Rule of thumb:** Start with thin commands (better testability, clearer separation). Move toward fat commands only when no clean receiver abstraction exists.
---
## State Management in Undoable Commands
### What to capture
Three categories of state may need to be saved for Unexecute:
1. **The receiver object** — which object was affected (reference, not a copy)
2. **The arguments** — what was passed to the receiver (e.g., the new font, the new value)
3. **The original values** — the prior state of the receiver that will be overwritten
For most commands, (3) is the critical one. It must be captured *before* Execute modifies the receiver.
```
class MoveCommand:
    constructor(shape, newX, newY):
        this._shape = shape
        this._newX = newX
        this._newY = newY
        this._oldX = null  # captured at Execute time, not here
        this._oldY = null
    Execute():
        this._oldX = this._shape.X  # save before changing
        this._oldY = this._shape.Y
        this._shape.MoveTo(this._newX, this._newY)
    Unexecute():
        this._shape.MoveTo(this._oldX, this._oldY)
```
### When to copy before placing on history
A command must be *copied* before being placed on the history list when:
- The same ConcreteCommand object is reused across multiple Execute calls
- The command's undo state (captured originals) will be overwritten on the next Execute
**Example:** A single `DeleteCommand` object is wired to the Delete key. Each time the user presses Delete, the same object's Execute runs — overwriting `_deletedObjects` with the new selection. Without copying, the history list holds references to the same object, and all entries point to the most recent invocation's state.
```
# Invoker that copies before tracking
ExecuteCommand(cmd):
    cmd.Execute()
    history.append(cmd.Copy())  # copy captures current undo state
```
If the command's state never varies across invocations (stateless commands), copying is unnecessary — a reference is sufficient.
Commands that must be copied before history placement act as Prototypes (GoF Prototype pattern, p. 117).
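A minimal Python sketch of the copy-before-history rule. The names `DeleteCommand` and `Invoker` and the list-based document are illustrative assumptions, not GoF sample code, and `copy.copy` stands in for the `Copy()` operation.

```python
import copy


class DeleteCommand:
    """Hypothetical undoable command, reused for every Delete keypress."""

    def __init__(self, document):
        self._document = document  # receiver: a plain list of items
        self._deleted = None       # undo state, overwritten on each execute

    def execute(self, item):
        self._document.remove(item)
        self._deleted = item       # capture what must be restored

    def unexecute(self):
        self._document.append(self._deleted)


class Invoker:
    def __init__(self):
        self.history = []

    def execute_command(self, cmd, item):
        cmd.execute(item)
        # Shallow copy: the snapshot shares the receiver but keeps its own _deleted.
        self.history.append(copy.copy(cmd))


doc = ["a", "b", "c"]
delete = DeleteCommand(doc)        # one shared command object
invoker = Invoker()
invoker.execute_command(delete, "a")
invoker.execute_command(delete, "b")

# Undo in reverse order; each history entry restores its own item.
while invoker.history:
    invoker.history.pop().unexecute()
```

A shallow copy is enough here because the undo state is rebound on each execute. If a command mutated a shared list of deleted objects in place, the copy would need to duplicate that list as well.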
---
## MacroCommand: Subcommand Ownership and Lifecycle
### Ownership
MacroCommand owns its subcommands. When the MacroCommand is deleted, it is responsible for deleting its subcommands. This prevents memory leaks in languages without garbage collection:
```cpp
MacroCommand::~MacroCommand() {
    // delete all subcommands
    ListIterator<Command*> i(_cmds);
    for (i.First(); !i.IsDone(); i.Next()) {
        delete i.CurrentItem();
    }
}
```
### Add/Remove at runtime
MacroCommand exposes Add and Remove to allow dynamic macro construction (e.g., macro recording while the user performs actions):
```
macro = new MacroCommand()
// user performs a sequence:
macro.Add(new PasteCommand(doc))
macro.Add(new FontCommand(doc, bold, selection))
// then save the macro for replay
```
### Nested MacroCommands
A MacroCommand can contain other MacroCommands as subcommands. The Execute and Unexecute implementations recurse naturally — each subcommand handles its own children.
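The recursion can be sketched in Python as follows. `AppendCommand` is a hypothetical leaf command invented for the illustration, and the sketch assumes that Unexecute walks the subcommands in reverse order, so that later effects are undone first.

```python
class Command:
    def execute(self):
        raise NotImplementedError

    def unexecute(self):
        raise NotImplementedError


class AppendCommand(Command):
    """Hypothetical leaf command: appends one token to a list."""

    def __init__(self, target, token):
        self._target, self._token = target, token

    def execute(self):
        self._target.append(self._token)

    def unexecute(self):
        self._target.pop()


class MacroCommand(Command):
    def __init__(self):
        self._cmds = []

    def add(self, cmd):
        self._cmds.append(cmd)

    def remove(self, cmd):
        self._cmds.remove(cmd)

    def execute(self):
        for cmd in self._cmds:            # children may themselves be macros
            cmd.execute()

    def unexecute(self):
        for cmd in reversed(self._cmds):  # undo the latest effect first
            cmd.unexecute()


out = []
inner = MacroCommand()
inner.add(AppendCommand(out, "b"))
inner.add(AppendCommand(out, "c"))
outer = MacroCommand()
outer.add(AppendCommand(out, "a"))
outer.add(inner)                          # nested macro as a subcommand

outer.execute()
after_execute = list(out)
outer.unexecute()
```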
---
## History List: Bounding and Persistence
### Bounding history length
For applications where memory or storage is constrained, cap the history list at N commands. When the cap is reached and a new command is pushed:
1. Remove the oldest command from the front of the list
2. Append the new command to the back
3. Adjust the present pointer accordingly
This is FIFO eviction: operations older than the most recent N can no longer be undone, but the most recent N remain available.
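One way to sketch a capped history in Python. The class name `BoundedHistory` is hypothetical, strings stand in for command objects, and discarding the redo tail on push is an added assumption rather than something stated above.

```python
class BoundedHistory:
    """Keeps at most `capacity` commands; the oldest is evicted first."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._commands = []
        self._present = 0  # index just past the last executed command

    def push(self, cmd):
        del self._commands[self._present:]   # assumption: a new command drops any redo tail
        self._commands.append(cmd)           # append to the back
        if len(self._commands) > self._capacity:
            self._commands.pop(0)            # remove the oldest from the front
        self._present = len(self._commands)  # adjust the present pointer

    def undo(self):
        """Return the command to unexecute, or None if nothing is left."""
        if self._present == 0:
            return None
        self._present -= 1
        return self._commands[self._present]


history = BoundedHistory(capacity=3)
for name in ("c1", "c2", "c3", "c4", "c5"):
    history.push(name)

undone = [history.undo() for _ in range(4)]
```

With a capacity of 3 and five pushes, only `c5`, `c4`, and `c3` can be undone; the fourth `undo()` returns `None`.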
### Persistent history (crash recovery)
Augment Command with Store and Load operations:
```
abstract class Command:
    abstract Execute()
    abstract Unexecute()
    abstract Store(stream: OutputStream)  # serialize command to log
    abstract Load(stream: InputStream)    # deserialize command from log
```
On each Execute, call `cmd.Store(log)` to append the command to a persistent log file. On application startup after a crash, read the log, reconstruct each command with `Load`, and call `cmd.Execute()` on each in logged order to replay the session.
Commands must be serializable — they need to be reconstructed from the log without the original objects in memory. This typically means storing the receiver's identifier (e.g., object ID) rather than a pointer, and resolving the object from a registry during Load.
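A minimal Python sketch of logging and replay, under stated assumptions: `AppendTextCommand`, the JSON-lines log format, and the dictionary registry are all hypothetical, and an in-memory `io.StringIO` stands in for the log file.

```python
import io
import json


class AppendTextCommand:
    """Hypothetical command: appends text to a document identified by ID."""

    def __init__(self, doc_id, receiver, text):
        self._doc_id, self._receiver, self._text = doc_id, receiver, text

    def execute(self):
        self._receiver.append(self._text)

    def store(self, stream):
        # Write the receiver's ID, not a reference to the object itself.
        record = {"doc_id": self._doc_id, "text": self._text}
        stream.write(json.dumps(record) + "\n")

    @classmethod
    def load(cls, line, registry):
        fields = json.loads(line)
        # Resolve the receiver from the registry while loading.
        return cls(fields["doc_id"], registry[fields["doc_id"]], fields["text"])


# Normal session: execute each command and append it to the log.
log = io.StringIO()
live = {"doc-1": []}
for text in ("hello", "world"):
    cmd = AppendTextCommand("doc-1", live["doc-1"], text)
    cmd.execute()
    cmd.store(log)

# After a simulated crash: start from empty documents and replay the log.
log.seek(0)
recovered = {"doc-1": []}
for line in log:
    AppendTextCommand.load(line, recovered).execute()
```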
---
## Error Accumulation in Undo/Redo
GoF names this "hysteresis" — the accumulation of small errors across repeated execute/unexecute/execute cycles.
**Cause:** Computed reversal may lose precision. Example: a rotation command that rotates 90° clockwise; Unexecute rotates 90° counterclockwise. In floating-point arithmetic, these operations may not perfectly cancel, and the object drifts from its original position after many cycles.
**Mitigation: Memento integration**
When exact fidelity is critical, store a Memento (GoF Memento pattern, p. 283) as part of the command's undo data rather than relying on computed reversal:
```
class TransformCommand:
    Execute():
        this._savedState = this._shape.CreateMemento()  # snapshot before
        this._shape.ApplyTransform(this._transform)
    Unexecute():
        this._shape.RestoreMemento(this._savedState)    # exact restoration
```
The Memento gives the command access to the object's private state for restoration without breaking the object's encapsulation — the Memento is the object's own snapshot, returned to it on restore.
Use Memento-based undo when:
- The operation involves floating-point or lossy computation
- The receiver has complex interdependent state where computing the inverse is error-prone
- Exact round-trip fidelity is required (e.g., pixel-perfect graphics undo)
Use computed reversal (Unexecute) when:
- The operation is discrete and exactly invertible (insert → delete, move by delta → move by -delta)
- Storing full state snapshots is too expensive (large objects or frequent operations)
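The Memento-based variant can be sketched in Python as below. `Shape`, `RotateCommand`, and the tuple-based memento are illustrative assumptions. Because Unexecute restores the saved snapshot instead of rotating back, the position after any number of cycles is identical to the starting position.

```python
import math


class Shape:
    def __init__(self, x, y):
        self._x, self._y = x, y

    def rotate(self, degrees):
        r = math.radians(degrees)
        c, s = math.cos(r), math.sin(r)
        self._x, self._y = self._x * c - self._y * s, self._x * s + self._y * c

    def create_memento(self):
        return (self._x, self._y)  # snapshot that only Shape interprets

    def restore_memento(self, memento):
        self._x, self._y = memento

    def position(self):
        return (self._x, self._y)


class RotateCommand:
    def __init__(self, shape, degrees):
        self._shape, self._degrees = shape, degrees
        self._saved = None

    def execute(self):
        self._saved = self._shape.create_memento()  # snapshot before the change
        self._shape.rotate(self._degrees)

    def unexecute(self):
        self._shape.restore_memento(self._saved)    # exact restoration


shape = Shape(1.0, 0.0)
original = shape.position()
cmd = RotateCommand(shape, 37.0)
for _ in range(1000):
    cmd.execute()
    cmd.unexecute()
```

A computed reversal would call `rotate(-degrees)` in `unexecute` instead, which may leave a small floating-point residue on each cycle.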
---
## Decoupling Sender from Receiver: The Core Trade-off
The Command pattern's primary cost is class proliferation. Each unique combination of (receiver type, operation) nominally requires its own ConcreteCommand subclass.
Mitigation strategies:
| Strategy | When to use | Mechanism |
|----------|-------------|-----------|
| SimpleCommand / lambda | Operation has no undo, no extra args | Template or closure parameterized by receiver and method |
| Reuse one ConcreteCommand for multiple receivers | The operation is identical across receiver types | Use an interface for the receiver; one command works with any implementing class |
| Fat command | No suitable receiver abstraction exists | Command implements the logic itself |
| MacroCommand | Multiple operations form a logical unit | Composite — delegates to sub-commands |
The class proliferation cost is the pattern's honest trade-off. It is worthwhile when invoker-receiver decoupling, undo/redo, or operation composability are genuine requirements. If none of these are needed, a direct method call is simpler.
---
## Related Patterns
| Pattern | Relationship |
|---------|-------------|
| **Composite (p. 163)** | MacroCommand is Command + Composite |
| **Memento (p. 283)** | Stores state the command needs to undo its effect |
| **Prototype (p. 117)** | Commands that must be copied before history placement act as Prototypes |
| **Strategy** | Both encapsulate behavior; Command encapsulates a *request* (with a receiver binding and potential undo state); Strategy encapsulates an *algorithm* (swappable computation, no intrinsic undo) |
| **Chain of Responsibility** | Both decouple sender from receiver; Command uses a known binding (receiver set at creation), Chain resolves the receiver at runtime by walking a linked list |
Implement the Bridge pattern to decouple an abstraction from its implementation so both can vary independently. Use when you want to avoid a permanent bindin...
---
name: bridge-pattern-implementor
description: |
Implement the Bridge pattern to decouple an abstraction from its implementation so both can vary independently. Use when you want to avoid a permanent binding between abstraction and implementation, when both abstractions and implementations should be extensible by subclassing, when implementation changes should not require recompiling clients, or when you need to share an implementation among multiple objects. Includes the evaluate-extremes design process (union vs intersection of functionality) and Abstract Factory integration for runtime implementation selection.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/design-patterns-gof/skills/bridge-pattern-implementor
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
source-books:
- id: design-patterns-gof
title: "Design Patterns: Elements of Reusable Object-Oriented Software"
authors: ["Erich Gamma", "Richard Helm", "Ralph Johnson", "John Vlissides"]
chapters: [2, 4]
pages: [58-64, 146-155]
tags: [design-patterns, bridge, structural-pattern, decoupling, abstraction, implementation, cross-platform]
depends-on: []
execution:
tier: 2
mode: full
inputs:
- type: codebase
description: "A software project where you need to decouple an abstraction from its implementation — or a design scenario described by the user"
tools-required: [Read, Write, TodoWrite]
tools-optional: [Grep, Glob]
mcps-required: []
environment: "Any coding environment. Works with or without an existing codebase."
---
# Bridge Pattern Implementor
## When to Use
You are designing or refactoring a system where an abstraction (e.g., a Window, Shape, or Device) must work across multiple implementations (e.g., platform APIs, rendering engines, storage backends) and both the abstraction and implementation dimensions need to evolve independently.
Specific triggers:
- You notice a class hierarchy "exploding" — adding a new abstraction variant requires N new classes, one per implementation platform
- Client code is bound to a specific implementation at compile time and you need runtime flexibility
- You need to change or swap an underlying implementation without recompiling clients
- You want to share a single implementation object across multiple abstraction instances (e.g., reference counting, shared state)
- You are porting an existing system to multiple platforms and want the abstraction hierarchy to remain platform-neutral
Before starting, verify:
- Is the variation driven by **two independent axes**? (If only one axis varies, plain inheritance suffices.)
- Are the implementations genuinely interchangeable from the abstraction's point of view? (If they require different calling conventions the client must know about, Adapter may be a better fit.)
## Context & Input Gathering
### Required Context (must have — ask if missing)
- **The abstraction to decouple:** What concept does the client use? (e.g., Window, Shape, Queue)
→ Check prompt for: class names, interface names, the concept being modeled
→ If missing, ask: "What is the abstraction — the concept your client code works with?"
- **The implementation axis:** What varies underneath? (e.g., OS platform, rendering backend, storage engine)
→ Check prompt for: mentions of platforms, backends, vendors, environments
→ If missing, ask: "What are the different implementations you need to support?"
### Observable Context (gather from environment)
- **Existing class hierarchy:** Check for signs of the "proliferation" problem
→ Look for: abstract base classes with many concrete subclasses, class names combining two concepts (e.g., `XIconWindow`, `PMIconWindow`, `LinuxFilePrinter`, `WindowsFilePrinter`)
→ If present: this confirms a Bridge is appropriate; use existing names as abstraction/implementor candidates
- **Existing platform coupling:** Scan for platform-specific API calls inside abstraction classes
→ Look for: `#include <windows.h>`, OS-specific function calls, vendor SDK imports inside domain classes
→ If found: the Bridge's job is to push all such calls into the Implementor subclasses
### Default Assumptions
- If no existing codebase: work from the user's verbal description of the two axes
- If language not specified: illustrate with language-neutral pseudocode; note the user's language if evident from context
- If only one implementation variant exists today: still apply Bridge if more are planned, or if compile-time isolation from clients is needed
## Process
Use `TodoWrite` to track all steps before starting.
```
[ ] Step 1: Diagnose the design problem
[ ] Step 2: Apply evaluate-extremes to define the Implementor interface
[ ] Step 3: Define the Abstraction interface
[ ] Step 4: Implement Concrete Implementors
[ ] Step 5: Implement the Abstraction and Refined Abstractions
[ ] Step 6: Wire the Implementor at construction (Abstract Factory or parameter)
[ ] Step 7: Verify independence — extend both hierarchies without touching the other
```
---
### Step 1: Diagnose the Design Problem
**ACTION:** Identify the two independently varying axes and verify you have a Bridge candidate, not an Adapter or Strategy situation.
**WHY:** Bridge is designed upfront to allow independent extension. Adapter is applied retroactively to make incompatible interfaces cooperate. Strategy varies the algorithm, not a structural implementation dependency. Diagnosing early prevents building the wrong pattern. The clearest symptom of a Bridge candidate is "nested generalization": a class hierarchy that grows multiplicatively, because every new platform or implementation type requires one more class per abstraction variant.
Draw or enumerate the current (or planned) hierarchy. If adding a new abstraction variant requires creating N classes (one per implementation), and adding a new implementation variant requires creating M classes (one per abstraction) — you have the proliferation problem Bridge solves.
**IF** the symptom is an N×M class explosion → proceed to Step 2
**IF** only the algorithm varies, not the platform/backend structure → consider Strategy instead
**IF** you're adapting a pre-existing incompatible interface → consider Adapter instead
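The class counts behind this diagnosis are simple arithmetic. The sketch below uses hypothetical helper names and counts only concrete classes, ignoring the shared base classes.

```python
def classes_without_bridge(abstractions: int, implementations: int) -> int:
    # Nested generalization: one class per (abstraction, implementation) pair.
    return abstractions * implementations


def classes_with_bridge(abstractions: int, implementations: int) -> int:
    # Two independent hierarchies: refined abstractions plus concrete implementors.
    return abstractions + implementations


for platforms in (2, 3, 5):
    print(platforms,
          classes_without_bridge(3, platforms),
          classes_with_bridge(3, platforms))
```

With three window types, two platforms give 6 classes versus 5, and five platforms give 15 versus 8.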
---
### Step 2: Apply the Evaluate-Extremes Process to Define the Implementor Interface
**ACTION:** Determine the Implementor (implementation interface) by reasoning about two extreme positions, then choosing a balanced middle.
**WHY:** The Implementor interface is the hardest design decision in Bridge. It must be broad enough to serve all abstraction variants, but not so broad that it becomes incoherent. The evaluate-extremes technique makes this decision systematic rather than arbitrary.
**The two extremes:**
**Extreme 1 — Intersection of functionality:** Define the Implementor interface as only the operations that EVERY implementation platform supports. This produces the most portable interface but is dangerously limiting — the interface is only as capable as the weakest platform. Features that most (but not all) platforms support become inaccessible.
**Extreme 2 — Union of functionality:** Define the Implementor interface as the total set of operations across ALL platforms. This produces the richest interface but it becomes huge, incoherent, and must change whenever any vendor revises their API. Every concrete Implementor must stub out capabilities it doesn't have.
**The balanced middle (what to actually do):**
1. List all operations your abstractions need to perform (from the client's perspective).
2. For each operation, determine whether ALL target implementations can support it (even via different primitives).
3. Group into "core primitives" — low-level operations that every implementation can provide in some form, even if via different mechanisms.
4. The Implementor interface declares only these primitives. The Abstraction composes them into higher-level operations.
**Example (Lexi Window System):** The Window class needs `DrawLine`, `DrawRect`, `DrawText`, `Raise`, `Lower`. Instead of the Implementor interface mirroring these exactly, it exposes lower-level device primitives like `DeviceRect`, `DeviceText`, `DeviceBitmap`. The Abstraction (`Window`) translates its higher-level calls into combinations of these primitives. This keeps the Implementor interface small and platform-expressible, while the Abstraction interface remains application-friendly.
**IF** you find a device primitive that some platforms lack → check whether it can be emulated using other primitives. If yes, keep it. If no, either remove it from the Implementor interface or accept it as a known limitation in that ConcreteImplementor.
See [references/bridge-implementation-guide.md](references/bridge-implementation-guide.md) for the complete evaluate-extremes worksheet.
---
### Step 3: Define the Abstraction Interface
**ACTION:** Define the Abstraction class — the interface the client sees. It holds a reference (pointer/field) to an Implementor. Its public operations are high-level and application-oriented.
**WHY:** The Abstraction is the client's view of the world. Its interface should express domain concepts, not implementation primitives. The key structural rule: the Abstraction holds a reference to an Implementor object, and delegates to it — it does NOT inherit from it. This composition is what makes the two hierarchies independent.
Key design rules:
- The Abstraction's public operations speak the domain's language (e.g., `DrawRect(point1, point2)`, not `DeviceRect(x0, y0, x1, y1)`)
- The Abstraction translates its operations into calls to `_imp->DeviceXxx(...)` primitives
- The Implementor reference is protected (or package-private) — concrete Abstraction subclasses may need to call it
```
// Pseudocode structure
class Abstraction {
protected:
    Implementor* _imp;  // The bridge reference
public:
    // High-level domain operations: these are what clients call
    virtual void HighLevelOperation() {
        // Composed from low-level implementor primitives
        _imp->PrimitiveA();
        _imp->PrimitiveB();
    }
}
```
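The same structure as a runnable Python sketch. The `device_line` primitive and `RecordingWindowImp` are hypothetical, chosen so that one domain-level call visibly expands into several primitive calls; they are not taken from the Lexi design.

```python
from abc import ABC, abstractmethod


class WindowImp(ABC):
    """Implementor: declares low-level device primitives only."""

    @abstractmethod
    def device_line(self, x0, y0, x1, y1):
        """Draw one line segment in device coordinates."""


class RecordingWindowImp(WindowImp):
    """Hypothetical ConcreteImplementor that records calls instead of drawing."""

    def __init__(self):
        self.calls = []

    def device_line(self, x0, y0, x1, y1):
        self.calls.append((x0, y0, x1, y1))


class Window:
    """Abstraction: holds an implementor and exposes domain-level operations."""

    def __init__(self, imp: WindowImp):
        self._imp = imp  # the bridge reference: composition, not inheritance

    def draw_rect(self, p1, p2):
        (x0, y0), (x1, y1) = p1, p2
        # One high-level operation composed from four primitive calls.
        self._imp.device_line(x0, y0, x1, y0)
        self._imp.device_line(x1, y0, x1, y1)
        self._imp.device_line(x1, y1, x0, y1)
        self._imp.device_line(x0, y1, x0, y0)


imp = RecordingWindowImp()
Window(imp).draw_rect((0, 0), (2, 1))
print(imp.calls)
```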
---
### Step 4: Implement the Concrete Implementors
**ACTION:** For each target platform or backend, create a ConcreteImplementor class that inherits from the Implementor interface and translates each primitive into that platform's actual API calls.
**WHY:** The Concrete Implementors are where all platform-specific code lives. Once these classes exist, no platform-specific code should appear anywhere else in the system. The Abstraction and its subclasses remain entirely platform-neutral. Changing a platform means changing only its ConcreteImplementor.
The implementations of the same operation will often look radically different across platforms — and that is expected. The point is that Window's `DrawRect` doesn't need to know or care about these differences.
**Illustrative contrast — drawing a rectangle:**
The X Window System has a direct rectangle primitive:
```cpp
// XWindowImp::DeviceRect — X Window System implementation
void XWindowImp::DeviceRect(Coord x0, Coord y0, Coord x1, Coord y1) {
    int x = round(min(x0, x1));
    int y = round(min(y0, y1));
    int w = round(abs(x0 - x1));
    int h = round(abs(y0 - y1));
    XDrawRectangle(_dpy, _winid, _gc, x, y, w, h);
}
```
IBM's Presentation Manager has no rectangle primitive — it uses a general path/polygon API:
```cpp
// PMWindowImp::DeviceRect — IBM Presentation Manager implementation
void PMWindowImp::DeviceRect(Coord x0, Coord y0, Coord x1, Coord y1) {
    Coord left = min(x0, x1);   Coord right = max(x0, x1);
    Coord bottom = min(y0, y1); Coord top = max(y0, y1);
    PPOINTL point[4];
    point[0] = {left, top};     point[1] = {right, top};
    point[2] = {right, bottom}; point[3] = {left, bottom};
    if (GpiBeginPath(_hps, 1L) == false
        || GpiSetCurrentPosition(_hps, &point[3]) == false
        || GpiPolyLine(_hps, 4L, point) == GPI_ERROR
        || GpiEndPath(_hps) == false) {
        // report error
    } else {
        GpiStrokePath(_hps, 1L, 0L);
    }
}
```
Both satisfy the same `DeviceRect` interface. Window's `DrawRect` calls `_imp->DeviceRect(...)` and is completely shielded from this difference.
---
### Step 5: Implement the Abstraction and Refined Abstractions
**ACTION:** Implement the base Abstraction's operations in terms of Implementor primitives. Then create Refined Abstraction subclasses for each logical variation of the abstraction.
**WHY:** Refined Abstractions capture semantic variations of the concept (e.g., `ApplicationWindow`, `IconWindow`, `TransientWindow`) without duplicating any platform-specific code. Each refined abstraction overrides or extends the base abstraction, relying on the same inherited `_imp` reference. This is what makes the N×M explosion disappear — you write N abstraction classes and M implementor classes, not N×M combinations.
```
// Base Abstraction: delegates to implementor
void Window::DrawRect(const Point& p1, const Point& p2) {
    _imp->DeviceRect(p1.X(), p1.Y(), p2.X(), p2.Y());
}
// Refined Abstraction: adds application-level behavior, still platform-neutral
class IconWindow : public Window {
    void DrawContents() {
        WindowImp* imp = GetWindowImp();
        if (imp) imp->DeviceBitmap(_bitmapName, 0.0, 0.0);
    }
};
```
---
### Step 6: Wire the Implementor at Construction
**ACTION:** Decide how and when the Abstraction receives its Implementor. Choose one of three approaches based on your constraints.
**WHY:** The Abstraction must acquire a concrete Implementor to function, but Bridge's goal is that the Abstraction should NOT be hardwired to a specific Implementor class. The wiring strategy determines how clean this decoupling is at runtime.
**Three wiring strategies:**
**Option A — Constructor parameter (simplest):** Pass the ConcreteImplementor in as a constructor argument. Straightforward, but the caller must know which Implementor to pass.
```python
window = ApplicationWindow(XWindowImp()) # caller decides
```
**Option B — Factory method (runtime selection):** The Abstraction's constructor calls a factory to obtain the right Implementor. The Abstraction stays decoupled from any specific ConcreteImplementor class.
```cpp
// Window's constructor uses a factory — knows nothing about XWindowImp or PMWindowImp
Window::Window() {
    _imp = WindowSystemFactory::Instance()->CreateWindowImp();
}
```
**Option C — Abstract Factory (recommended for platform families):** Introduce an Abstract Factory (e.g., `WindowSystemFactory`) whose sole job is to encapsulate all platform-specific object creation. The factory knows what platform is in use; it returns the right Implementor. This is the cleanest approach when the implementation axis is a family of related objects (window system, color system, font system all need to be platform-consistent).
```cpp
class WindowSystemFactory {
public:
    virtual WindowImp* CreateWindowImp() = 0;
    virtual ColorImp* CreateColorImp() = 0;
    virtual FontImp* CreateFontImp() = 0;
};
class XWindowSystemFactory : public WindowSystemFactory {
    WindowImp* CreateWindowImp() { return new XWindowImp(); }
    // ...
};
```
The factory is configured once at application startup (as a Singleton). From then on, every Window just calls `factory->CreateWindowImp()` and gets the right one for the current platform.
**IF** implementations are a family of related objects → use Abstract Factory (Option C)
**IF** the Abstraction is genuinely agnostic to which Implementor it gets → use factory method (Option B)
**IF** tests or simple configurations need direct control → use constructor parameter (Option A)
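The wiring options can coexist. The Python sketch below combines Option A and Option C: an injected implementor wins, otherwise the configured factory supplies one. The `configure` function, the `_FACTORIES` registry, and the platform keys are hypothetical names introduced for the illustration.

```python
class WindowImp:
    """Implementor base (primitives omitted for brevity)."""


class XWindowImp(WindowImp):
    pass


class PMWindowImp(WindowImp):
    pass


class WindowSystemFactory:
    def create_window_imp(self) -> WindowImp:
        raise NotImplementedError


class XWindowSystemFactory(WindowSystemFactory):
    def create_window_imp(self):
        return XWindowImp()


class PMWindowSystemFactory(WindowSystemFactory):
    def create_window_imp(self):
        return PMWindowImp()


# Hypothetical registry and module-level factory, chosen once at startup.
_FACTORIES = {"x11": XWindowSystemFactory, "pm": PMWindowSystemFactory}
_factory = None


def configure(platform: str) -> None:
    global _factory
    _factory = _FACTORIES[platform]()


class Window:
    def __init__(self, imp=None):
        # Option A: the caller injects an implementor (handy in tests).
        # Option C: otherwise ask the configured factory for one.
        self._imp = imp if imp is not None else _factory.create_window_imp()


configure("pm")
default_window = Window()
injected_window = Window(XWindowImp())
```

`Window` never names a concrete implementor class; only the factories and the caller that injects one do.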
---
### Step 7: Verify Independence — Extend Both Hierarchies Without Touching the Other
**ACTION:** As a verification exercise, add one new ConcreteImplementor and one new Refined Abstraction, and confirm neither change affects the other hierarchy.
**WHY:** This is the core promise of Bridge. If your implementation is correct, adding `MacWindowImp` requires only a new class inheriting from `WindowImp` — no Window subclass changes. Adding `PaletteWindow` requires only a new class inheriting from `Window` — no WindowImp subclass changes. If you find yourself modifying the other hierarchy to accommodate a new variant, the bridge is leaky — most likely because the Implementor interface doesn't match what some Abstraction variants actually need.
Checklist for a correct Bridge:
- [ ] Can a new ConcreteImplementor be added without touching any Abstraction class?
- [ ] Can a new Refined Abstraction be added without touching any Implementor class?
- [ ] Does the client only see the Abstraction interface — never a ConcreteImplementor directly?
- [ ] Is all platform-specific code confined to ConcreteImplementor classes?
- [ ] Does the Abstraction forward to the Implementor via delegation (not inheritance)?
---
## Inputs
- Description of the concept to model (the abstraction axis)
- Description of the implementation variants (the implementation axis)
- Optionally: existing class hierarchy showing the proliferation problem
## Outputs
- **Implementor interface** — pure abstract class/interface with platform primitives
- **ConcreteImplementor classes** — one per platform/backend (all platform code lives here)
- **Abstraction base class** — holds `_imp` reference, provides high-level operations
- **Refined Abstraction subclasses** — domain variations, fully platform-neutral
- **Factory wiring** — mechanism to select the right ConcreteImplementor at runtime
## Key Principles
- **Composition over inheritance for implementation** — the Abstraction holds a reference to an Implementor; it does NOT inherit from it. This single structural choice is what enables independent variation. Inheritance would permanently bind the abstraction to one implementation.
- **Implementor interface ≠ Abstraction interface** — the two interfaces serve different audiences. The Abstraction interface speaks to application developers (domain concepts). The Implementor interface speaks to platform implementors (device primitives). They can and should be different in style and granularity.
- **Evaluate extremes before committing** — union is too broad (unstable, incoherent), intersection is too narrow (loses platform capabilities). The right Implementor interface is a set of primitives that every platform can satisfy in some form, even if via different mechanisms.
- **All platform code in ConcreteImplementors** — if you find a platform API call anywhere outside a ConcreteImplementor class, the bridge is incomplete. The whole point is to isolate platform variation so the abstraction hierarchy never needs to change when platforms change.
- **Bridge is designed upfront; Adapter is applied retroactively** — Bridge is a proactive design decision made when building a system that must span multiple implementations. If you're adapting a pre-existing incompatible interface, Adapter is more appropriate.
- **One Implementor is still valid** — if today there is only one implementation, Bridge still has value: it eliminates compile-time dependencies from clients. Changing the Implementor class no longer requires recompiling anything that uses the Abstraction. This is the "Cheshire Cat" idiom.
## Examples
### Example 1: Cross-Platform Document Editor Window System (Lexi)
**Scenario:** A WYSIWYG document editor must run on X Window System and IBM Presentation Manager without duplicating window logic. The team faces a 3×2 class explosion (ApplicationWindow, IconWindow, TransientWindow × XWindow, PMWindow = 6 classes, growing to 3×N for each new platform).
**Trigger:** "I need my document editor to run on X and PM without duplicating the window logic."
**Process:**
- Step 1: Diagnosed N×M explosion — 3 window types × 2 platforms = 6 classes today, 3×N tomorrow.
- Step 2: Applied evaluate-extremes. Intersection misses PM-specific path capabilities. Union exposes X-specific display handles everywhere. Balanced: `WindowImp` with device primitives (`DeviceRect`, `DeviceText`, `DeviceBitmap`).
- Step 3: `Window` base class provides `DrawRect()`, `DrawText()` to application code.
- Step 4: X uses `XDrawRectangle` directly (1 call). PM uses `GpiBeginPath`/`GpiSetCurrentPosition`/`GpiPolyLine`/`GpiEndPath`/`GpiStrokePath` (5 calls), a dramatically different implementation of the same operation.
- Step 6: Wired via `WindowSystemFactory` (Abstract Factory). `Window` constructor calls `factory->CreateWindowImp()`.
- Step 7: Verified — adding `MacWindowImp` required zero changes to Window, IconWindow, or ApplicationWindow.
**Output:** `Window`/`WindowImp` bridge — 5 classes (3 abstractions + 2 implementors) vs 6, scaling to 3+N instead of 3×N.
---
### Example 2: Multi-Backend Notification Service
**Scenario:** A notification system must support email, SMS, and push today, with Slack and webhooks planned. Notification types (Alert, Digest, Reminder) format content differently but must work across all delivery backends.
**Trigger:** "We want to add Slack and webhooks later without rewriting the notification logic."
**Process:**
- Step 1: Two axes identified — notification type (Alert, Digest, Reminder) × delivery backend (Email, SMS, Push). N×M = 9 classes today, growing.
- Step 2: Evaluate-extremes. Intersection: `send(to, subject, body)` — too weak for rich push payloads. Union: expose APNS tokens everywhere — leaks implementation details. Balanced: `Deliver(recipient, title, body, metadata)`.
- Steps 3-5: `Notification` abstraction with `Alert`, `Digest`, `Reminder` as refined abstractions. `EmailDelivery`, `SMSDelivery`, `PushDelivery` as ConcreteImplementors.
- Step 7: Verified — adding `SlackDelivery` requires zero changes to Alert/Digest/Reminder.
**Output:** Notification Bridge with 3 abstractions + 3 implementors (6 classes instead of 9), extensible on both axes independently.
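A compact Python sketch of this scenario, with two notification types and two original backends to keep it short. The class bodies, message formats, and the `send` method are assumptions made for the illustration; only the class names and the `deliver(recipient, title, body, metadata)` shape come from the example.

```python
class Delivery:
    """Implementor: one balanced primitive shared by every backend."""

    def deliver(self, recipient, title, body, metadata):
        raise NotImplementedError


class EmailDelivery(Delivery):
    def deliver(self, recipient, title, body, metadata):
        return f"email to {recipient}: [{title}] {body}"


class SMSDelivery(Delivery):
    def deliver(self, recipient, title, body, metadata):
        return f"sms to {recipient}: {title}"


class Notification:
    """Abstraction: formats content, then delegates delivery."""

    def __init__(self, delivery: Delivery):
        self._delivery = delivery


class Alert(Notification):
    def send(self, recipient, text):
        return self._delivery.deliver(
            recipient, "ALERT", text.upper(), {"priority": "high"})


class Digest(Notification):
    def send(self, recipient, items):
        return self._delivery.deliver(recipient, "Digest", "; ".join(items), {})


# Added later: a new backend, with no change to Alert or Digest.
class SlackDelivery(Delivery):
    def deliver(self, recipient, title, body, metadata):
        return f"slack to {recipient}: *{title}* {body}"


slack_alert = Alert(SlackDelivery()).send("#ops", "disk full")
email_digest = Digest(EmailDelivery()).send("a@example.com", ["x", "y"])
```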
---
### Example 3: Degenerate Bridge — Single Implementation with Compile Isolation
**Scenario:** A system uses a proprietary graphics library. The team wants to be able to swap libraries without breaking client code or forcing recompiles, even though there's only one implementation today.
**Trigger:** "We want to change to a different graphics library without breaking client code."
**Process:**
- Step 1: Only one implementation exists. N×M = N×1 — no explosion yet, but dependency is system-wide.
- Steps 2-5: Created thin `GraphicsImp` interface mirroring only the operations the Abstraction uses. `ProprietaryGraphicsImp` implements it. Client code depends only on the Abstraction, which in turn depends only on `GraphicsImp`.
- Step 7: When the library changes, only `ProprietaryGraphicsImp` is rewritten — clients need not recompile.
**Output:** Compile-time isolation achieved. Future library migration becomes a contained, low-risk change. This is the "Cheshire Cat" idiom.
## References
- For the evaluate-extremes worksheet and Implementor interface design process in detail, see [references/bridge-implementation-guide.md](references/bridge-implementation-guide.md)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills), Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/bridge-implementation-guide.md
# Bridge Pattern Implementation Guide
Companion reference for the `bridge-pattern-implementor` skill. Contains the evaluate-extremes worksheet, structural participants reference, implementation decision matrix, and the complete Lexi case study.
---
## 1. Evaluate-Extremes Worksheet
Use this worksheet when designing the Implementor interface — the hardest decision in applying Bridge.
### Step 1: List all operations the Abstraction needs (from the client's perspective)
Write down every operation clients will call on your Abstraction. This is the client's vocabulary, not the platform's.
Example:
```
Window clients need:
- DrawLine(p1, p2)
- DrawRect(p1, p2)
- DrawPolygon(points, n)
- DrawText(text, position)
- Raise()
- Lower()
- Iconify()
- DrawContents()
```
### Step 2: Enumerate your target implementations
List every platform, backend, or implementation variant you must support (including near-future ones you can anticipate).
Example:
```
Platforms: X Window System, IBM Presentation Manager, macOS Quartz
```
### Step 3: Determine the intersection
For each Abstraction operation, mark which implementations support it natively:
| Operation | X Window | PM | macOS |
|-------------------|:---:|:---:|:---:|
| Draw rectangle | Yes (XDrawRectangle) | No (path only) | Yes (CGContextAddRect) |
| Draw line | Yes | Yes | Yes |
| Draw text | Yes | Yes | Yes |
| Raise window | Yes | Yes | Yes |
Operations where ALL platforms say Yes = the intersection.
**Problem with pure intersection:** PM cannot draw a rectangle directly. If you design only around the intersection, you lose XDrawRectangle's directness and force X to emulate rectangle drawing via a polygon — that's a capability regression.
### Step 4: Determine the union
List every operation any platform exposes, even platform-specific ones.
**Problem with pure union:** PM path operations (`GpiBeginPath`, `GpiSetCurrentPosition`, `GpiPolyLine`, `GpiEndPath`, `GpiStrokePath`) are opaque to X and macOS developers. Including them in the interface creates a leaky abstraction that forces X developers to understand PM idioms. The interface also becomes fragile: any PM SDK change breaks it.
### Step 5: Define balanced primitives (the actual output)
Design primitives that:
1. Every platform can satisfy in SOME form (even if mechanisms differ)
2. Express device-level operations, not domain-level operations
3. Enable the Abstraction to compose its higher-level calls
For windows:
| Implementor Primitive | How X satisfies it | How PM satisfies it |
|-----------------------|-------------------|---------------------|
| `DeviceRect(x0,y0,x1,y1)` | `XDrawRectangle` | `GpiBeginPath` + `GpiPolyLine` + `GpiStrokePath` |
| `DeviceLine(x0,y0,x1,y1)` | `XDrawLine` | `GpiLine` |
| `DeviceText(s, x, y)` | `XDrawString` | `GpiCharStringAt` |
| `DeviceBitmap(name, x, y)` | `XCopyPlane` | `GpiBitBlt` |
| `ImpTop()` | `XRaiseWindow` | `WinSetWindowPos(HWND_TOP)` |
| `ImpBottom()` | `XLowerWindow` | `WinSetWindowPos(HWND_BOTTOM)` |
Key insight: the PM implementation of `DeviceRect` uses a path — that's fine. The Implementor interface doesn't care HOW it's done, only that the outcome (a rectangle drawn) is achieved.
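Point 3 of the design criteria (the Abstraction composes higher-level calls from primitives) can be sketched as below. This is an illustration under stated assumptions: `Coord` is taken to be `double`, `Point` is a plain struct rather than the class with `X()`/`Y()` accessors used elsewhere in this guide, and `CountingWindowImp` is a hypothetical test implementor that counts calls instead of drawing.

```cpp
#include <cstddef>
#include <vector>

typedef double Coord;          // assumption: the guide does not define Coord

struct Point { Coord x, y; };  // simplified stand-in for the guide's Point

// Implementor exposing a single balanced primitive from the table above.
class WindowImp {
public:
    virtual ~WindowImp() {}
    virtual void DeviceLine(Coord x0, Coord y0, Coord x1, Coord y1) = 0;
};

// Hypothetical implementor: counts line segments instead of drawing them.
class CountingWindowImp : public WindowImp {
public:
    CountingWindowImp() : lines(0) {}
    void DeviceLine(Coord, Coord, Coord, Coord) override { ++lines; }
    int lines;
};

class Window {
public:
    explicit Window(WindowImp* imp) : _imp(imp) {}
    void DrawLine(const Point& a, const Point& b) {
        _imp->DeviceLine(a.x, a.y, b.x, b.y);
    }
    // A closed polygon is composed from DeviceLine: one segment per vertex,
    // with the last vertex connected back to the first.
    void DrawPolygon(const std::vector<Point>& pts) {
        for (std::size_t i = 0; i < pts.size(); ++i) {
            DrawLine(pts[i], pts[(i + 1) % pts.size()]);
        }
    }
private:
    WindowImp* _imp;  // not owned
};
```

Because `DrawPolygon` lives in the Abstraction, no platform has to provide a polygon primitive; each only needs `DeviceLine`.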
---
## 2. Structural Participants Reference
### Abstraction
- Defines the interface clients use
- Holds a reference `_imp` to an Implementor object
- Implements its operations by delegating to `_imp`
- Does NOT contain any platform-specific code
```cpp
class Window {
protected:
WindowImp* _imp;
WindowImp* GetWindowImp(); // may use factory to lazily initialize
public:
void DrawRect(const Point& p1, const Point& p2) {
_imp->DeviceRect(p1.X(), p1.Y(), p2.X(), p2.Y());
}
virtual void DrawContents() = 0;
// ...
};
```
### Refined Abstraction
- Extends Abstraction with additional operations or specializations
- Still delegates to `_imp` — adds no platform-specific code
- Examples: `ApplicationWindow`, `IconWindow`, `TransientWindow`
```cpp
class IconWindow : public Window {
public:
void DrawContents() {
WindowImp* imp = GetWindowImp();
if (imp) imp->DeviceBitmap(_bitmapName, 0.0, 0.0);
}
private:
const char* _bitmapName;
};
```
### Implementor
- Defines the primitive interface for platform-specific operations
- Does NOT mirror the Abstraction's interface — it's lower-level
- Operations are primitives that ConcreteImplementors translate to platform calls
```cpp
class WindowImp {
public:
virtual void ImpTop() = 0;
virtual void ImpBottom() = 0;
virtual void ImpSetExtent(const Point&) = 0;
virtual void ImpSetOrigin(const Point&) = 0;
virtual void DeviceRect(Coord, Coord, Coord, Coord) = 0;
virtual void DeviceText(const char*, Coord, Coord) = 0;
virtual void DeviceBitmap(const char*, Coord, Coord) = 0;
// ...
protected:
WindowImp();
};
```
### Concrete Implementor
- Inherits from Implementor
- Translates each primitive into the target platform's actual API calls
- All platform-specific code lives here and only here
See Section 4 below for the full X vs PM contrast.
---
## 3. Wiring Decision Matrix
How should the Abstraction get its Implementor? Choose based on your context.
| Wiring Strategy | How it works | When to use | Trade-off |
|-----------------|-------------|-------------|-----------|
| **Constructor parameter** | Caller passes a ConcreteImplementor directly | Tests, simple configurations, DI containers | Caller must know which Implementor to create |
| **Factory method in Abstraction** | Abstraction's constructor calls a virtual factory method to create its own Implementor | Single Implementor with possible subclass variation | Tight but contained; no external factory needed |
| **Abstract Factory (recommended for families)** | An external factory (e.g., `WindowSystemFactory`) encapsulates all platform-specific object creation | Multiple related implementation objects must be consistent (same platform family) | Requires factory infrastructure; best long-term |
| **Lazy initialization (default + switch)** | Abstraction starts with a default Implementor, switches based on runtime conditions | Adaptive algorithms (e.g., small collection = linked list, large = hash table) | Most flexible; adds complexity |
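The last row (default + switch) is the least obvious of the four, so here is a rough sketch of it. All names (`IntSet`, `SetImp`, `ListSetImp`, `HashSetImp`) and the threshold value are hypothetical; the point is only that the Abstraction replaces its own Implementor at runtime without clients noticing.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Implementor: storage primitives only.
class SetImp {
public:
    virtual ~SetImp() {}
    virtual void Insert(int v) = 0;
    virtual bool Contains(int v) const = 0;
    virtual std::size_t Size() const = 0;
    virtual std::vector<int> Items() const = 0;
    virtual std::string Name() const = 0;
};

// Default Implementor: linear scan, cheap for small collections.
class ListSetImp : public SetImp {
public:
    void Insert(int v) override { if (!Contains(v)) _items.push_back(v); }
    bool Contains(int v) const override {
        return std::find(_items.begin(), _items.end(), v) != _items.end();
    }
    std::size_t Size() const override { return _items.size(); }
    std::vector<int> Items() const override { return _items; }
    std::string Name() const override { return "list"; }
private:
    std::vector<int> _items;
};

// Alternative Implementor: hash table, better for large collections.
class HashSetImp : public SetImp {
public:
    void Insert(int v) override { _items.insert(v); }
    bool Contains(int v) const override { return _items.count(v) != 0; }
    std::size_t Size() const override { return _items.size(); }
    std::vector<int> Items() const override {
        return std::vector<int>(_items.begin(), _items.end());
    }
    std::string Name() const override { return "hash"; }
private:
    std::unordered_set<int> _items;
};

// Abstraction: starts with the default Implementor, switches once it grows.
class IntSet {
public:
    explicit IntSet(std::size_t threshold = 8)
        : _imp(new ListSetImp), _threshold(threshold), _usingHash(false) {}
    void Insert(int v) {
        _imp->Insert(v);
        if (!_usingHash && _imp->Size() > _threshold) {
            std::unique_ptr<SetImp> bigger(new HashSetImp);
            for (int item : _imp->Items()) bigger->Insert(item);
            _imp = std::move(bigger);  // old Implementor is destroyed here
            _usingHash = true;
        }
    }
    bool Contains(int v) const { return _imp->Contains(v); }
    std::string ImpName() const { return _imp->Name(); }
private:
    std::unique_ptr<SetImp> _imp;
    std::size_t _threshold;
    bool _usingHash;
};
```

The cost of this flexibility is the migration logic inside `Insert`, which is the "adds complexity" trade-off named in the table.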
### Abstract Factory pattern for Bridge wiring
When your Bridge operates in a world where multiple implementation objects must all come from the same platform family (window system + color system + font system all must be PM or all X), the Abstract Factory pattern cleanly encapsulates this constraint.
```cpp
class WindowSystemFactory {
public:
    virtual ~WindowSystemFactory() {}
    // Singleton accessor; returns the factory configured at startup
    static WindowSystemFactory* Instance();
    virtual WindowImp* CreateWindowImp() = 0;
    virtual ColorImp* CreateColorImp() = 0;
    virtual FontImp* CreateFontImp() = 0;
};
class PMWindowSystemFactory : public WindowSystemFactory {
public:  // overrides must stay public so callers can reach them
    WindowImp* CreateWindowImp() { return new PMWindowImp(); }
    ColorImp* CreateColorImp() { return new PMColorImp(); }
    FontImp* CreateFontImp() { return new PMFontImp(); }
};
class XWindowSystemFactory : public WindowSystemFactory {
public:
    WindowImp* CreateWindowImp() { return new XWindowImp(); }
    // ...
};
// Window constructor: knows nothing about X or PM
Window::Window() {
    _imp = WindowSystemFactory::Instance()->CreateWindowImp();
}
```
The factory is configured once (typically as a Singleton) at application startup. The rest of the system never mentions `XWindowImp` or `PMWindowImp` directly.
---
## 4. Lexi Case Study — Window / WindowImp
### The problem
The Lexi document editor needs a Window abstraction that application code uses for drawing and window management. Lexi must run on multiple window systems: X Window System, IBM Presentation Manager, and macOS.
**Naive approach:** Create `XApplicationWindow`, `PMApplicationWindow`, `XIconWindow`, `PMIconWindow`, etc. Result: N window kinds × M platforms = N×M classes, and every new platform requires adding a subclass for every window kind.
**Bridge approach:** Separate the hierarchy.
- `Window` hierarchy: `ApplicationWindow`, `IconWindow`, `TransientWindow`, `DialogWindow` — captures kinds of windows
- `WindowImp` hierarchy: `XWindowImp`, `PMWindowImp`, `MacWindowImp` — captures platform implementations
### The evaluate-extremes analysis for Lexi's Window
**Intersection approach rejected:** Limiting `WindowImp` to only what all platforms share would omit PM-specific path capabilities. Worse, it would give the X implementation nothing to lean on for rectangle drawing — X HAS a rectangle primitive (`XDrawRectangle`) that would go unused.
**Union approach rejected:** Exposing `GpiBeginPath`, `GpiSetCurrentPosition`, `GpiPolyLine`, `GpiEndPath`, `GpiStrokePath` in the interface forces every X implementor to understand PM path semantics. Every PM SDK change would break the interface.
**Balanced primitives chosen:** `DeviceRect`, `DeviceLine`, `DeviceText`, `DeviceBitmap`, `ImpTop`, `ImpBottom`, `ImpSetOrigin`, `ImpSetExtent`. These are the minimal set that:
- Every window system can provide (even via emulation)
- Compose into all the operations Window needs
- Do not expose any platform-specific idioms
### Contrasting implementations: DeviceRect
**X Window System** — has a native rectangle-drawing primitive:
```cpp
void XWindowImp::DeviceRect(Coord x0, Coord y0, Coord x1, Coord y1) {
    int x = round(min(x0, x1));
    int y = round(min(y0, y1));
    int w = round(abs(x0 - x1));
    int h = round(abs(y0 - y1));
    // XDrawRectangle takes the upper-left corner + width + height;
    // in X's y-down coordinates that corner has the smallest x and y
    XDrawRectangle(_dpy, _winid, _gc, x, y, w, h);
}
```
Note: XDrawRectangle defines a rectangle by its upper-left corner (the smallest x and y, since y grows downward in X) plus width and height, so DeviceRect must compute those from two arbitrary corner coordinates. Simple math, direct API.
**IBM Presentation Manager** — has no rectangle primitive; uses a general multi-segment path API:
```cpp
void PMWindowImp::DeviceRect(Coord x0, Coord y0, Coord x1, Coord y1) {
Coord left = min(x0, x1); Coord right = max(x0, x1);
Coord bottom = min(y0, y1); Coord top = max(y0, y1);
POINTL point[4];  // array of points (PPOINTL is the pointer-to-POINTL type)
point[0] = {left, top}; // top-left
point[1] = {right, top}; // top-right
point[2] = {right, bottom}; // bottom-right
point[3] = {left, bottom}; // bottom-left
if (GpiBeginPath(_hps, 1L) == false
|| GpiSetCurrentPosition(_hps, &point[3]) == false
|| GpiPolyLine(_hps, 4L, point) == GPI_ERROR
|| GpiEndPath(_hps) == false) {
// report error
} else {
GpiStrokePath(_hps, 1L, 0L);
}
}
```
Note: PM defines a rectangle by specifying 4 vertices of a polygon path, then stroking it. Completely different idiom — but both satisfy the same `DeviceRect(x0, y0, x1, y1)` signature.
### How the Abstraction uses these primitives
```cpp
// Window::DrawRect is completely platform-neutral
void Window::DrawRect(const Point& p1, const Point& p2) {
_imp->DeviceRect(p1.X(), p1.Y(), p2.X(), p2.Y());
}
// IconWindow uses another primitive
void IconWindow::DrawContents() {
WindowImp* imp = GetWindowImp();
if (imp != 0) {
imp->DeviceBitmap(_bitmapName, 0.0, 0.0);
}
}
```
`Window::DrawRect` is the same code for X, PM, and Mac. The platform difference is entirely hidden inside the ConcreteImplementors.
### Factory wiring in Lexi
```cpp
// Window's lazy GetWindowImp — uses Abstract Factory
WindowImp* Window::GetWindowImp() {
if (_imp == 0) {
_imp = WindowSystemFactory::Instance()->CreateWindowImp();
}
return _imp;
}
// Configured once at startup: windowSystemFactory = new XWindowSystemFactory()
// All subsequent Window instances automatically get XWindowImp
```
`WindowSystemFactory::Instance()` is a Singleton. It is set to the appropriate concrete factory at application launch based on the detected platform. No Window subclass ever mentions `XWindowImp` or `PMWindowImp`.
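The guide does not show how `Instance()` gets configured, so the following is one possible sketch, not the Lexi implementation. `Configure()`, `Slot()`, and the `Platform()` probe are hypothetical names added for illustration; the factory method uses the `CreateWindowImp` name from the factory interface in Section 3.

```cpp
#include <cassert>
#include <string>

class WindowImp {
public:
    virtual ~WindowImp() {}
    virtual std::string Platform() const = 0;  // hypothetical, for demonstration
};
class XWindowImp : public WindowImp {
public:
    std::string Platform() const override { return "X"; }
};
class PMWindowImp : public WindowImp {
public:
    std::string Platform() const override { return "PM"; }
};

class WindowSystemFactory {
public:
    virtual ~WindowSystemFactory() {}
    virtual WindowImp* CreateWindowImp() = 0;
    // Install the concrete factory once at startup (caller keeps ownership).
    static void Configure(WindowSystemFactory* factory) { Slot() = factory; }
    static WindowSystemFactory* Instance() {
        assert(Slot() != 0 && "Configure() must be called before Instance()");
        return Slot();
    }
private:
    static WindowSystemFactory*& Slot() {
        static WindowSystemFactory* instance = 0;
        return instance;
    }
};

class XWindowSystemFactory : public WindowSystemFactory {
public:
    WindowImp* CreateWindowImp() override { return new XWindowImp; }
};
class PMWindowSystemFactory : public WindowSystemFactory {
public:
    WindowImp* CreateWindowImp() override { return new PMWindowImp; }
};

// Abstraction: never names a concrete platform class.
class Window {
public:
    Window() : _imp(0) {}
    ~Window() { delete _imp; }
    Window(const Window&) = delete;
    Window& operator=(const Window&) = delete;
    std::string Platform() { return GetWindowImp()->Platform(); }
protected:
    WindowImp* GetWindowImp() {
        if (_imp == 0) _imp = WindowSystemFactory::Instance()->CreateWindowImp();
        return _imp;
    }
private:
    WindowImp* _imp;  // created lazily, owned by this Window
};
```

At startup the application would do something like `WindowSystemFactory::Configure(&xFactory);` after detecting the platform; every `Window` created afterwards picks up the matching implementor on first use.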
---
## 5. Implementation Considerations Checklist
### One Implementor (degenerate Bridge)
When only one implementation exists, you can still create the Implementor interface with a single ConcreteImplementor. Benefit: changing the implementation class does not require recompiling any Abstraction class or its clients. This is the "Cheshire Cat" idiom in C++: the implementation class is defined in a private header that is not shipped to clients, which hides it entirely.
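A common modern C++ rendering of this idea is the pimpl form sketched below, shown in a single listing with comments marking where the header/source split would fall. This is a non-virtual variant of the idiom, offered as an assumption about how one might apply it; `Graphics`, `Impl`, and `DrawCircle` are hypothetical names.

```cpp
#include <memory>
#include <string>

// ---- graphics.h (the only file clients include) ----
class Graphics {
public:
    Graphics();
    ~Graphics();  // defined in the .cpp, where Impl is a complete type
    std::string DrawCircle(int radius);
private:
    class Impl;                    // declared here, defined only in the .cpp
    std::unique_ptr<Impl> _impl;
};

// ---- graphics.cpp (private; can change without touching graphics.h) ----
class Graphics::Impl {
public:
    std::string DrawCircle(int radius) {
        // A real Impl would call the proprietary library here.
        return "circle r=" + std::to_string(radius);
    }
};

Graphics::Graphics() : _impl(new Impl) {}
Graphics::~Graphics() {}
std::string Graphics::DrawCircle(int radius) { return _impl->DrawCircle(radius); }
```

Since `graphics.h` never mentions the library's types, replacing the library means rewriting `Graphics::Impl` and relinking; client translation units are untouched.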
### Sharing Implementors (reference counting)
When multiple Abstraction objects share one Implementor (e.g., Coplien's Handle/Body idiom for shared string representations), the Implementor needs a reference count. The Abstraction's assignment operator manages Ref/Unref:
```cpp
Handle& Handle::operator=(const Handle& other) {
other._body->Ref();
_body->Unref();
if (_body->RefCount() == 0) delete _body;
_body = other._body;
return *this;
}
```
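For context, the assignment operator above could sit inside a complete Handle/Body pair like the sketch below. The `Body` contents, the `Release()` helper, and the `Shares()` accessor are illustrative assumptions; only the Ref/Unref ordering comes from the snippet above.

```cpp
#include <string>

// Body: the shared Implementor, carrying its own reference count.
class Body {
public:
    explicit Body(const std::string& text) : _text(text), _count(0) {}
    void Ref() { ++_count; }
    void Unref() { --_count; }
    int RefCount() const { return _count; }
    const std::string& Text() const { return _text; }
private:
    std::string _text;
    int _count;
};

// Handle: the Abstraction; several Handles may point at one Body.
class Handle {
public:
    explicit Handle(const std::string& text) : _body(new Body(text)) { _body->Ref(); }
    Handle(const Handle& other) : _body(other._body) { _body->Ref(); }
    ~Handle() { Release(); }
    Handle& operator=(const Handle& other) {
        other._body->Ref();  // Ref first, so self-assignment never hits zero
        Release();
        _body = other._body;
        return *this;
    }
    const std::string& Text() const { return _body->Text(); }
    int Shares() const { return _body->RefCount(); }
private:
    // Drop this Handle's claim and free the Body if it was the last one.
    void Release() {
        _body->Unref();
        if (_body->RefCount() == 0) delete _body;
    }
    Body* _body;
};
```

Incrementing the right-hand side's count before releasing the left-hand side is what makes `a = a` safe: the count goes up to n+1 and back to n without ever reaching zero.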
### Multiple inheritance — why it does NOT work
In C++, you might try to combine Abstraction and ConcreteImplementor via multiple inheritance (publicly from Abstraction, privately from ConcreteImplementor). This permanently binds one implementation to one abstraction via static inheritance. You cannot switch implementations at runtime. This is NOT a Bridge — it is the original problem Bridge solves.
---
## 6. Related Patterns
| Pattern | Relationship to Bridge |
|---------|----------------------|
| **Abstract Factory** | Natural partner for wiring. An Abstract Factory creates and configures a specific Bridge by returning the right ConcreteImplementor for the current context. |
| **Adapter** | Adapter makes unrelated classes work together — applied retroactively. Bridge is designed upfront to separate abstraction and implementation — applied proactively. Structurally similar, intent is different. |
| **Strategy** | Strategy varies an algorithm. Bridge varies a structural implementation. Strategy's context doesn't usually extend the strategy interface; Bridge's Abstraction has a full subclass hierarchy. |
Choose the right behavioral design pattern from the 11 GoF behavioral patterns. Use when someone asks "how do I make this algorithm swappable?", "should I us...
---
name: behavioral-pattern-selector
description: |
Choose the right behavioral design pattern from the 11 GoF behavioral patterns. Use when someone asks "how do I make this algorithm swappable?", "should I use Strategy or State?", "how do I decouple senders from receivers?", "what's the difference between Observer and Mediator?", "how do I add undo/redo?", "how do I traverse a collection without exposing its internals?", or "how do I add operations to classes I can't modify?" Also use when you see switch statements selecting between algorithms, tightly coupled event handlers, objects that change behavior based on state, or request-handling logic that should be distributed across a chain. Analyzes via two taxonomies — what aspect to encapsulate as an object (algorithm, state, protocol, traversal) and how to decouple senders from receivers (Command, Observer, Mediator, Chain of Responsibility) — with explicit trade-offs for each choice.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/design-patterns-gof/skills/behavioral-pattern-selector
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- design-pattern-selector
source-books:
- id: design-patterns-gof
title: "Design Patterns: Elements of Reusable Object-Oriented Software"
authors: ["Erich Gamma", "Richard Helm", "Ralph Johnson", "John Vlissides"]
chapters: [5]
tags: [design-patterns, behavioral, gof, strategy, state, observer, command, chain-of-responsibility, mediator, template-method, iterator, visitor, interpreter, memento]
execution:
tier: 1
mode: full
inputs:
- type: none
description: "A description of the behavioral problem — how objects communicate, how algorithms vary, or how responsibilities should be distributed"
tools-required: [TodoWrite]
tools-optional: [Read, Grep]
mcps-required: []
environment: "Any agent environment. A codebase is helpful but not required."
---
# Behavioral Pattern Selector
## When to Use
You have identified that a problem is behavioral — it concerns how objects communicate, distribute responsibilities, or vary their actions — and need to pick the right one of the 11 GoF behavioral patterns. This skill is appropriate when:
- You have algorithms that vary across contexts and clients should swap them independently
- An object must change its behavior radically when its internal state changes
- You need to issue requests without knowing the receiver or the operation in advance
- Multiple objects are communicating in complex, entangled ways and need decoupling
- You need to traverse or operate on a collection without exposing its internals
- You want to add operations to a stable class hierarchy without modifying it
- You need to save and restore an object's internal state without breaking encapsulation
- You want to define a grammar for simple sentences and interpret them
- You are deciding among behavioral patterns and want an explicit trade-off comparison
Before starting, confirm:
- Is the problem genuinely behavioral (communication/algorithm), not creational (object construction) or structural (interface composition)? If unsure, invoke `design-pattern-selector` first for purpose classification.
- Do you have at least a rough description of what the system needs to do differently — even informally?
---
## Context & Input Gathering
### Required Context (must have before proceeding)
- **Behavioral problem description:** What interaction, communication, or algorithmic variability challenge are you solving? Even a rough description works — "objects notify each other in a tangled web" or "I have switch statements picking between algorithms."
- **What varies or needs decoupling:** Identify whether it's an algorithm, state-dependent behavior, communication protocol, or request handling that needs flexibility.
### Observable Context (gather from environment)
- **Existing codebase:** If available, scan for behavioral smell signals — switch/if-else chains on type, tightly coupled event handlers, state-dependent conditionals, hard-coded request routing.
- **Related patterns already in use:** Check if creational or structural patterns are already applied — they constrain which behavioral patterns compose well.
### Sufficiency
```
SUFFICIENT: problem description + what varies identified
PROCEED WITH DEFAULTS: rough problem description only (skill will help classify)
MUST ASK: no behavioral problem articulated at all
```
## Process
### Step 1: Set Up Tracking and Characterize the Problem
**ACTION:** Use `TodoWrite` to track progress, then capture the behavioral problem in structured form.
**WHY:** Behavioral patterns share surface similarities — Strategy, State, and Template Method all vary an algorithm, but for fundamentally different reasons. Without explicitly articulating *what* varies and *why*, you risk picking a pattern that fits the shape of the problem but not the intent. Forcing structured framing surfaces the distinction.
```
TodoWrite:
- [ ] Step 1: Characterize the behavioral problem
- [ ] Step 2: Apply Taxonomy 1 — What aspect to encapsulate
- [ ] Step 3: Apply Taxonomy 2 — Sender-receiver decoupling (if applicable)
- [ ] Step 4: Resolve ambiguous pairs with differentiating questions
- [ ] Step 5: Produce recommendation with trade-offs
```
Capture these elements from the user's description:
| Element | Question to answer |
|---------|-------------------|
| **What varies** | What behavior or aspect is expected to change across contexts or time? |
| **Who decides** | Does the client choose, or does the object itself decide based on state? |
| **Coupling concern** | Must the sender of a request be decoupled from who handles it? |
| **Stability** | Is the object structure stable but operations on it need to grow? |
| **Lifecycle** | Does the varying thing need to exist independently (be passed around, stored, undone)? |
If more than one element is unclear, ask one targeted question and wait for a response before continuing.
---
### Step 2: Apply Taxonomy 1 — What Aspect to Encapsulate
**ACTION:** Match the problem against the encapsulation taxonomy. Identify which aspect of the program is expected to change and therefore needs its own object.
**WHY:** The defining theme of behavioral patterns is *encapsulating variation*. Each pattern names the thing that changes and lifts it into a dedicated object. If you can identify what that thing is, you can identify the pattern — because the pattern is literally named for it. Skipping this step leads to pattern matching by feel rather than by structure.
**Encapsulation taxonomy:**
| Encapsulated aspect | Pattern | The object holds... |
|---------------------|---------|---------------------|
| An algorithm | **Strategy** | The interchangeable computation — clients choose which strategy to install |
| State-dependent behavior | **State** | Behavior for one state of a context — object self-transitions at runtime |
| Inter-object protocol | **Mediator** | The communication rules between a group of colleague objects |
| Traversal method | **Iterator** | The logic for stepping through an aggregate's elements without exposing structure |
| A request (invocation) | **Command** | One operation — parameterized, deferrable, undoable, loggable |
| A state snapshot | **Memento** | A private checkpoint of an object's internals, opaque to all but the originator |
| Operations on a structure | **Visitor** | A family of operations across an object structure — defined outside the classes |
**Four patterns that do not fit this taxonomy** (they work at the class or communication level):
| Pattern | What it does instead |
|---------|---------------------|
| **Template Method** | Uses inheritance, not encapsulation in a separate object — a superclass skeleton calls abstract steps that subclasses fill in |
| **Observer** | Encapsulates a dependency relationship, not a discrete aspect — subjects and observers cooperate to maintain a constraint |
| **Chain of Responsibility** | Distributes handling across an open-ended chain — no single encapsulating object |
| **Interpreter** | Represents a grammar as a class hierarchy — each grammar rule is a class, not a standalone encapsulated object |
**Output:** Identify the encapsulated aspect and the candidate pattern(s) it points to. If the problem description mentions both an algorithm and state-driven behavior, or both communication and a protocol object, flag both candidates and proceed to Step 4.
---
### Step 3: Apply Taxonomy 2 — Sender-Receiver Decoupling (If Applicable)
**ACTION:** If the problem involves decoupling who sends a request from who handles it, evaluate the four decoupling patterns against each other.
**WHY:** Command, Observer, Mediator, and Chain of Responsibility all decouple senders from receivers, but with very different structures and trade-offs. Choosing the wrong one produces either the right decoupling for the wrong granularity (Observer when you need Command) or centralized logic that makes the system harder to understand (Mediator when Observer's distributed model is sufficient). This taxonomy makes the choice explicit.
**Apply only if** the problem includes any of these:
- "The sender should not know who handles the request"
- "Multiple objects need to react to a change"
- "Objects are too tightly coupled to each other"
- "Requests should be queued, logged, or undone"
**Decoupling comparison:**
| Pattern | Coupling model | Cardinality | Receiver identity | Key trade-off |
|---------|---------------|-------------|-------------------|---------------|
| **Command** | Object binding — command holds reference to receiver | 1 sender : 1 receiver (per command) | Known at command-creation time; hidden from invoker | Invoker stays decoupled; a subclass is nominally required per sender-receiver pair |
| **Observer** | Interface-based — subject notifies any attached observer | 1 subject : many observers, dynamic | Not known to subject; observers self-register | Promotes loose coupling and fine-grained reuse; communication flow is harder to trace |
| **Mediator** | Centralized — all colleagues route through mediator | Many senders : many receivers via one hub | Hidden from colleagues; mediator knows all | Simplifies colleague interaction; mediator itself can become a complexity bottleneck |
| **Chain of Responsibility** | Chain forwarding — request walks a linked list of handlers | 1 sender : implicit receiver (first willing handler) | Not known to sender; determined at runtime by chain order | Very loose coupling; **warning: request may go unhandled if no handler claims it** |
**Decision rules:**
- Need **undo/redo**, **queuing**, or **logging of requests**? → Command (the request is a first-class object)
- One object changes and **an unknown number of others must react**? → Observer (broadcast, data dependency)
- **Many objects interact in complex ways** and the interaction logic should live in one place? → Mediator (centralize the protocol)
- **Multiple candidate handlers exist** and the right one is determined at runtime, but any one is sufficient? → Chain of Responsibility (but add a fallback handler to prevent silent drops)
**Output:** If this taxonomy applies, add the appropriate decoupling pattern as a candidate or as a confirmation of the candidate from Taxonomy 1.
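The fallback-handler advice for Chain of Responsibility can be sketched as follows. The handler names and the integer `severity` request are hypothetical; the sketch only shows how a catch-all link at the end of the chain turns a silent drop into an explicit outcome.

```cpp
#include <string>

class Handler {
public:
    explicit Handler(Handler* next = 0) : _next(next) {}
    virtual ~Handler() {}
    // Returns the name of whichever handler claimed the request.
    std::string Handle(int severity) {
        if (CanHandle(severity)) return Name();
        if (_next) return _next->Handle(severity);
        return "unhandled";  // the silent-drop case the warning describes
    }
protected:
    virtual bool CanHandle(int severity) const = 0;
    virtual std::string Name() const = 0;
private:
    Handler* _next;  // not owned
};

class TeamLead : public Handler {
public:
    explicit TeamLead(Handler* next = 0) : Handler(next) {}
protected:
    bool CanHandle(int severity) const override { return severity <= 1; }
    std::string Name() const override { return "team-lead"; }
};

class Manager : public Handler {
public:
    explicit Manager(Handler* next = 0) : Handler(next) {}
protected:
    bool CanHandle(int severity) const override { return severity <= 3; }
    std::string Name() const override { return "manager"; }
};

// Fallback: claims everything, so it must be the last link in the chain.
class Fallback : public Handler {
public:
    Fallback() : Handler(0) {}
protected:
    bool CanHandle(int) const override { return true; }
    std::string Name() const override { return "fallback"; }
};
```

With `Fallback fb; Manager m(&fb); TeamLead t(&m);` every request sent to `t` gets an answer, whereas a chain without the fallback returns `"unhandled"` for anything above the last handler's limit.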
---
### Step 4: Resolve Ambiguous Pattern Pairs
**ACTION:** If two or more candidates remain after Steps 2–3, apply the differentiating questions for the relevant pair.
**WHY:** Several behavioral patterns share structural similarities. Strategy and State have identical class diagrams. Template Method and Strategy both vary algorithms. Without asking the right differentiating question, you will correctly identify the problem category but pick the wrong pattern.
**Pair: Strategy vs. State**
Both encapsulate behavior that can be swapped at runtime via a context object. The difference is *who initiates the swap and why*:
| Question | Strategy → if... | State → if... |
|----------|-----------------|---------------|
| Who changes the behavior? | The **client** installs a strategy explicitly | The **context object itself** transitions to a new state based on internal logic |
| Do the concrete variants know about each other? | No — strategies are independent | Yes — state objects often initiate transitions to sibling states |
| Is the variation about selecting an algorithm? | Yes — same interface, different computation | No — about representing a lifecycle that drives behavior |
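The "state objects initiate transitions" row is the clearest structural tell, so here is a small sketch of it. The turnstile domain and every name in it are hypothetical; note that clients only call `Coin()` and `Push()`, and never choose a state themselves.

```cpp
#include <string>

class Turnstile;

class TurnstileState {
public:
    virtual ~TurnstileState() {}
    virtual void Coin(Turnstile& t) = 0;
    virtual void Push(Turnstile& t) = 0;
    virtual std::string Name() const = 0;
};

// Context: forwards events to the current state object.
class Turnstile {
public:
    Turnstile();
    void Coin() { _state->Coin(*this); }
    void Push() { _state->Push(*this); }
    std::string StateName() const { return _state->Name(); }
    void TransitionTo(TurnstileState* s) { _state = s; }  // called by states
private:
    TurnstileState* _state;  // points at a shared static state object
};

class Locked : public TurnstileState {
public:
    static Locked* Instance() { static Locked s; return &s; }
    void Coin(Turnstile& t) override;        // defined below, needs Unlocked
    void Push(Turnstile&) override {}        // stays locked
    std::string Name() const override { return "locked"; }
};

class Unlocked : public TurnstileState {
public:
    static Unlocked* Instance() { static Unlocked s; return &s; }
    void Coin(Turnstile&) override {}        // already unlocked
    void Push(Turnstile& t) override { t.TransitionTo(Locked::Instance()); }
    std::string Name() const override { return "unlocked"; }
};

// Each state knows its sibling: this is what Strategy objects never do.
void Locked::Coin(Turnstile& t) { t.TransitionTo(Unlocked::Instance()); }

Turnstile::Turnstile() : _state(Locked::Instance()) {}
```

If `TransitionTo` were instead called by client code to pick a behavior, the same class diagram would be a Strategy.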
**Pair: Template Method vs. Strategy**
Both vary an algorithm, but via opposite mechanisms:
| Question | Template Method → if... | Strategy → if... |
|----------|------------------------|-----------------|
| How is variation expressed? | **Inheritance** — a subclass overrides abstract steps | **Composition** — a context object holds a replaceable strategy |
| Can the algorithm be swapped at runtime? | No — fixed at class definition time | Yes — a different strategy object can be installed at any time |
| Do you control the superclass? | Yes — you define the skeleton | Doesn't matter — client provides the algorithm |
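The inheritance-versus-composition contrast in this table can be seen side by side in the sketch below, which solves the same toy problem both ways. All class names are hypothetical and the "algorithm" is deliberately trivial.

```cpp
#include <string>

// Template Method: variation by inheritance, fixed once per class.
class Report {
public:
    virtual ~Report() {}
    // The skeleton: subclasses cannot reorder these steps.
    std::string Render(const std::string& body) const {
        return Open() + body + Close();
    }
protected:
    virtual std::string Open() const = 0;
    virtual std::string Close() const = 0;
};

class HtmlReport : public Report {
protected:
    std::string Open() const override { return "<p>"; }
    std::string Close() const override { return "</p>"; }
};

// Strategy: variation by composition, replaceable per object at runtime.
class Wrapper {
public:
    virtual ~Wrapper() {}
    virtual std::string Wrap(const std::string& body) const = 0;
};

class HtmlWrapper : public Wrapper {
public:
    std::string Wrap(const std::string& body) const override {
        return "<p>" + body + "</p>";
    }
};

class MarkdownWrapper : public Wrapper {
public:
    std::string Wrap(const std::string& body) const override {
        return "**" + body + "**";
    }
};

class Document {
public:
    explicit Document(const Wrapper* w) : _wrapper(w) {}
    void SetWrapper(const Wrapper* w) { _wrapper = w; }  // runtime swap
    std::string Render(const std::string& body) const { return _wrapper->Wrap(body); }
private:
    const Wrapper* _wrapper;  // not owned
};
```

An `HtmlReport` is HTML for its whole lifetime, while a `Document` can switch from `HtmlWrapper` to `MarkdownWrapper` with one `SetWrapper` call.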
**Pair: Observer vs. Mediator**
Both decouple objects from direct references, but in opposite directions:
| Question | Observer → if... | Mediator → if... |
|----------|-----------------|-----------------|
| Where does communication logic live? | **Distributed** — observers and subjects cooperate to maintain the constraint | **Centralized** — the mediator owns all routing logic |
| What is the coupling concern? | One object changes; **unknown others must react** | Many objects interact; **the interactions themselves are complex** |
| What is more important? | Reusability of subjects and observers | Understandability of the overall interaction flow |
**Pair: Command vs. Chain of Responsibility**
Both issue a request without the sender knowing the receiver, but via different mechanisms:
| Question | Command → if... | Chain of Responsibility → if... |
|----------|-----------------|---------------------------------|
| Is there a known receiver at request-creation time? | Yes — command holds a reference to its receiver | No — the receiver is determined by walking the chain |
| Does the request need to be stored, undone, or replayed? | Yes — Command is a first-class object | No — the request is transient |
| Is exactly one handler expected? | Yes | Not necessarily — could be zero (silent drop) or one |
**Output:** A single confirmed pattern, or at most two if the problem genuinely requires both (e.g., Command + Chain of Responsibility is a valid composition — the chain uses Command objects as the request representation).
---
### Step 5: Produce Recommendation with Trade-offs
**ACTION:** Deliver the recommendation in the format below. Include the taxonomy reasoning that led to it, the key trade-off, and the runner-up pattern with why it was not chosen.
**WHY:** A recommendation without reasoning is a guess. The user needs to understand the taxonomy path that produced the choice — both to validate it and to apply the pattern correctly. The trade-off section prevents misapplication in edge cases.
**Recommendation format:**
```
## Pattern Recommendation: [Pattern Name]
**Category:** Behavioral — [Class / Object] scope
**Encapsulation taxonomy:** [What aspect is being encapsulated as an object]
**Decoupling taxonomy:** [If applicable — which sender-receiver model applies]
**Why this pattern:** [1–2 sentences connecting the pattern's intent to the specific problem]
**What it enables:** [What becomes possible that wasn't before]
**Key trade-off:** [What complexity or constraint is introduced]
**Warning (if applicable):** [Chain of Responsibility: silent drop risk; Mediator: bottleneck risk; etc.]
**Alternative considered:** [Runner-up pattern] — [Why it was not chosen]
**Patterns to combine with:** [Any complementary patterns]
```
**Common combinations:**
- **Chain of Responsibility + Command** — chain uses Command objects so requests can be logged or replayed; Template Method defines handler dispatch logic
- **Observer + Mediator** — Observer for subject-observer notification; Mediator to coordinate the subjects themselves
- **Strategy + Template Method** — Template Method defines the algorithm skeleton; Strategy fills in variable steps via composition (evolution path when inheritance becomes limiting)
- **Iterator + Visitor** — Iterator traverses the aggregate; Visitor applies an operation to each element
- **Command + Memento** — Command executes operations; Memento provides state snapshots to support undo
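
As an illustration of the last combination, the sketch below pairs a Command with a Memento to get undo. It is a minimal example under assumed names (`Editor`, `AppendCommand`, `History`), and it snapshots the whole text on every command, which a real editor would likely avoid.

```cpp
#include <string>
#include <vector>

// Originator, with an opaque snapshot type (the Memento).
class Editor {
public:
    class Memento {
        friend class Editor;  // only Editor can create or read a snapshot
        explicit Memento(const std::string& s) : _text(s) {}
        std::string _text;
    };
    void Append(const std::string& s) { _text += s; }
    Memento Save() const { return Memento(_text); }
    void Restore(const Memento& m) { _text = m._text; }
    const std::string& Text() const { return _text; }
private:
    std::string _text;
};

// Command: one operation, plus the snapshot needed to reverse it.
class AppendCommand {
public:
    AppendCommand(Editor& e, const std::string& s)
        : _editor(&e), _suffix(s), _before(e.Save()) {}
    void Execute() {
        _before = _editor->Save();  // capture state just before the change
        _editor->Append(_suffix);
    }
    void Undo() { _editor->Restore(_before); }
private:
    Editor* _editor;  // not owned
    std::string _suffix;
    Editor::Memento _before;
};

// Invoker: keeps executed commands so they can be undone in reverse order.
class History {
public:
    void Run(const AppendCommand& c) {
        _done.push_back(c);
        _done.back().Execute();
    }
    void UndoLast() {
        if (_done.empty()) return;
        _done.back().Undo();
        _done.pop_back();
    }
private:
    std::vector<AppendCommand> _done;
};
```

The command never inspects the snapshot; it only hands it back to the `Editor`, which is what keeps the originator's encapsulation intact.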
---
## Examples
### Example 1: Order Processing with Multiple Tax Algorithms
**Scenario:** An e-commerce system computes tax differently depending on region (US federal, EU VAT, Canadian GST). The tax logic is embedded in the `Order` class as a large `if/else` block. A new region is coming and the team does not want to keep modifying `Order`.
**Trigger:** "We have different tax rules per region and the order class keeps getting bigger every time we add one."
**Process:**
- Step 2 (Taxonomy 1): What varies is the *algorithm* for tax computation. Clients (or configuration) choose which algorithm to apply. → Strategy
- Step 3 (Taxonomy 2): No sender-receiver decoupling concern — Order calculates tax directly.
- Step 4: Strategy vs. State? The variation is client-selected (region at order creation), not self-transitioning behavior. Confirmed: Strategy.
**Output:**
```
## Pattern Recommendation: Strategy
**Category:** Behavioral — Object scope
**Encapsulation taxonomy:** Algorithm (tax computation rule) is extracted into a TaxStrategy object
**Why this pattern:** Strategy replaces the if/else dispatch with a TaxStrategy
interface. Each region (USTaxStrategy, EUVATStrategy, CAGSTStrategy) is a
ConcreteStrategy. Order holds a reference and delegates to it.
**What it enables:** New regions are new TaxStrategy subclasses — Order is never
modified. Strategies can be tested independently.
**Key trade-off:** Clients must know which strategy to instantiate; this knowledge
must come from somewhere (configuration, factory). If strategy selection is itself
complex, pair with Abstract Factory.
**Alternative considered:** Template Method — would require subclassing Order for
each region (OrderForUS, OrderForEU), which tightly couples the algorithm variation
to the order lifecycle. Strategy via composition is cleaner.
**Patterns to combine with:** Abstract Factory (to instantiate the correct strategy
based on region configuration)
```
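The recommendation above can be sketched in Python as follows. This is a minimal illustration, not the system's actual code: the class names follow the example, while the `compute` and `total` methods and the tax rates are placeholder assumptions.

```python
from abc import ABC, abstractmethod
from decimal import Decimal


class TaxStrategy(ABC):
    """Strategy interface: one tax computation rule per region."""

    @abstractmethod
    def compute(self, subtotal: Decimal) -> Decimal:
        ...


class USTaxStrategy(TaxStrategy):
    def compute(self, subtotal: Decimal) -> Decimal:
        return subtotal * Decimal("0.07")  # placeholder rate, not a real tax rule


class EUVATStrategy(TaxStrategy):
    def compute(self, subtotal: Decimal) -> Decimal:
        return subtotal * Decimal("0.20")  # placeholder rate, not a real tax rule


class Order:
    """Context: holds a strategy and delegates to it; never branches on region."""

    def __init__(self, subtotal: Decimal, tax_strategy: TaxStrategy) -> None:
        self.subtotal = subtotal
        self.tax_strategy = tax_strategy

    def total(self) -> Decimal:
        return self.subtotal + self.tax_strategy.compute(self.subtotal)
```

Under these assumptions, supporting a new region means adding one more `TaxStrategy` subclass; `Order` stays unchanged.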
---
### Example 2: UI Form with Enable/Disable Logic Across States
**Scenario:** A wizard dialog has three states: empty (all fields disabled except the first), partial (some fields enabled, Submit disabled), and complete (Submit enabled). Currently, every button-click handler contains the same conditional: "if state == X, enable Y, disable Z." The logic is duplicated across 12 event handlers.
**Trigger:** "Every event handler has to know what state the dialog is in. Adding a new state means touching all 12 handlers."
**Process:**
- Step 2 (Taxonomy 1): What varies is *behavior per state* — not a client-selected algorithm, but self-determined transitions. The dialog itself decides when to move from partial to complete. → State
- Step 4: Strategy vs. State? The dialog transitions itself (the "partial" state knows when to transition to "complete" based on field validation). No external client selects the behavior. Confirmed: State.
**Output:**
```
## Pattern Recommendation: State
**Category:** Behavioral — Object scope
**Encapsulation taxonomy:** State-dependent behavior — each wizard state (EmptyState,
PartialState, CompleteState) encapsulates its own enable/disable logic
**Why this pattern:** Each state object handles the event and applies the correct
enable/disable rules for that state. The wizard context delegates all event handling
to its current state object. Transitions (EmptyState → PartialState) are initiated by
the state objects themselves when conditions are met.
**What it enables:** Adding a new state is a new class — existing handlers are not
touched. The 12 duplicated conditionals collapse into the state objects that own them.
**Key trade-off:** State objects may need access to the context to initiate transitions,
introducing some coupling. The number of classes grows by one per state.
**Alternative considered:** Strategy — the behavior here is not client-selected; the
dialog self-transitions. Strategy would require the external caller to manage state
changes, which belongs inside the dialog lifecycle.
```
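A minimal Python sketch of this recommendation is shown below. The state class names come from the example; the `Wizard` context, its `filled` and `total` counters, and the method names are hypothetical simplifications of real field validation.

```python
class WizardState:
    """Base state: each subclass owns the rules for one wizard state."""

    submit_enabled = False

    def on_fields_changed(self, wizard: "Wizard") -> None:
        raise NotImplementedError


class EmptyState(WizardState):
    def on_fields_changed(self, wizard: "Wizard") -> None:
        if wizard.filled > 0:
            wizard.transition_to(PartialState())


class PartialState(WizardState):
    def on_fields_changed(self, wizard: "Wizard") -> None:
        if wizard.filled == wizard.total:
            wizard.transition_to(CompleteState())
        elif wizard.filled == 0:
            wizard.transition_to(EmptyState())


class CompleteState(WizardState):
    submit_enabled = True

    def on_fields_changed(self, wizard: "Wizard") -> None:
        if wizard.filled < wizard.total:
            wizard.transition_to(PartialState())


class Wizard:
    """Context: delegates every event to its current state object."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.filled = 0
        self.state: WizardState = EmptyState()

    def transition_to(self, state: WizardState) -> None:
        self.state = state
        # Let the new state re-check, so a jump across two states settles correctly.
        self.state.on_fields_changed(self)

    def set_filled(self, count: int) -> None:
        self.filled = count
        self.state.on_fields_changed(self)

    @property
    def submit_enabled(self) -> bool:
        return self.state.submit_enabled
```

The transitions live in the state classes, so no caller of `set_filled` needs to know which state the wizard is in.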
---
### Example 3: Undo/Redo and Recovery Log for Spreadsheet Operations
**Scenario:** A spreadsheet application must support unlimited undo/redo for cell edits, formula applications, and formatting changes. Operations may take several seconds. The team also wants a log of all executed operations for crash recovery.
**Trigger:** "We need full undo/redo and also want to replay operations from a log after a crash."
**Process:**
- Step 2 (Taxonomy 1): The *request itself* (each operation) needs to be a first-class object — stored, reversed, and replayed. → Command
- Step 3 (Taxonomy 2): Sender-receiver decoupling: the UI (invoker) must not know the cell/formula/formatting receiver details. Command is the binding object. 1:1 per operation. Confirmed: Command.
- Step 4: Command vs. Chain? The receiver is known at command-creation time (the specific cell being edited). No chain needed. Command confirmed.
**Output:**
```
## Pattern Recommendation: Command
**Category:** Behavioral — Object scope
**Encapsulation taxonomy:** Request (each operation) is encapsulated as a Command
object with Execute() and Unexecute() operations
**Decoupling taxonomy:** 1 invoker (toolbar/keyboard) : 1 receiver (cell/formula/formatter)
per command — invoker stays decoupled from receiver details
**Why this pattern:** Each operation (EditCellCommand, ApplyFormulaCommand,
FormatRangeCommand) is an object. Execute() applies it; Unexecute() reverses it.
The history list holds executed commands. Undo traverses it backward; redo forward.
Crash recovery replays the log by re-executing stored commands.
**What it enables:** Unlimited undo/redo with no special logic in the UI. Crash
recovery by replaying the command log. Macro recording by composing commands into
MacroCommands.
**Key trade-off:** Each unique operation requires a Command subclass. If operations
are highly varied (many receiver types), this produces many classes. Closures or
lambdas can replace subclasses in languages that support first-class functions.
**Alternative considered:** Memento — captures state snapshots for undo but stores
full state copies, which is expensive for large spreadsheets. Command stores only the
delta (what changed), which is far more efficient.
**Patterns to combine with:** Memento (for operations whose reversal cannot be
computed — store a snapshot as part of the command's undo data)
```
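The history-list mechanism described above could look roughly like this in Python. It is a sketch under simplifying assumptions: the spreadsheet is a plain `dict`, `History` is a hypothetical name for the invoker's history list, and persistence of the log to disk is left out.

```python
_MISSING = object()  # sentinel: the cell had no value before the edit


class EditCellCommand:
    """Command: sets one cell and remembers how to reverse the change."""

    def __init__(self, sheet: dict, cell: str, new_value) -> None:
        self.sheet = sheet
        self.cell = cell
        self.new_value = new_value
        self.old_value = _MISSING

    def execute(self) -> None:
        self.old_value = self.sheet.get(self.cell, _MISSING)
        self.sheet[self.cell] = self.new_value

    def unexecute(self) -> None:
        if self.old_value is _MISSING:
            del self.sheet[self.cell]
        else:
            self.sheet[self.cell] = self.old_value


class History:
    """Hypothetical history list: undo walks backward, redo walks forward."""

    def __init__(self) -> None:
        self._undo = []
        self._redo = []

    def run(self, command) -> None:
        command.execute()
        self._undo.append(command)
        self._redo.clear()  # a new operation invalidates the redo branch

    def undo(self) -> None:
        if self._undo:
            command = self._undo.pop()
            command.unexecute()
            self._redo.append(command)

    def redo(self) -> None:
        if self._redo:
            command = self._redo.pop()
            command.execute()
            self._undo.append(command)
```

Note that each command stores only the previous value of the cell it touched, which is the "delta" trade-off named in the recommendation.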
---
## Key Principles
- **Two taxonomies, not one.** Behavioral patterns can be classified by what they encapsulate (algorithm, state, protocol, traversal) AND by how they decouple senders from receivers. Use both lenses — one narrows the candidate set, the other confirms the choice. WHY: Patterns that look similar under one taxonomy (Strategy vs State) are clearly different under the other.
- **Strategy and State have identical structure but opposite intents.** Strategy varies the algorithm while context stays the same. State varies the context's behavior when its internal state changes. If the "which object to delegate to" decision is made by the client, it's Strategy. If it's made by the context itself (state transitions), it's State. WHY: This is the single most confused pair in the behavioral catalog.
- **Observer distributes, Mediator centralizes.** Both decouple interacting objects. Observer promotes finer-grained, more reusable classes but makes communication flow harder to trace. Mediator makes flow explicit but can become a monolith. WHY: Choosing wrong creates either untraceable event spaghetti (Observer misuse) or a god-object mediator.
- **Chain of Responsibility has no delivery guarantee.** A request can fall off the end of the chain unhandled, silently. Always include a catch-all handler or an assertion that the chain is complete. WHY: Silent request drops are the hardest bugs to diagnose.
## Reference Files
| File | Contents |
|------|----------|
| `references/behavioral-comparison.md` | All 11 patterns: applicability summary, encapsulation type, decoupling role, key trade-offs, and confusable pairs |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-design-pattern-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/behavioral-comparison.md
# Behavioral Pattern Comparison Reference
> Source: *Design Patterns: Elements of Reusable Object-Oriented Software* (Gamma, Helm, Johnson, Vlissides), Chapter 5
> Used by: `behavioral-pattern-selector` — Steps 2, 3, 4
---
## Quick-Select Table
| Pattern | Scope | Use when... | Encapsulates | Decoupling role |
|---------|-------|-------------|--------------|-----------------|
| Chain of Responsibility | Object | More than one object may handle a request; handler not known a priori | — | Sender from receiver via forwarding chain |
| Command | Object | Parameterize, queue, log, or undo operations | Request (invocation) | Invoker from receiver via command object |
| Interpreter | Class | Sentences in a simple grammar need to be interpreted | Grammar rule (as a class) | — |
| Iterator | Object | Access aggregate contents without exposing internal structure | Traversal method | — |
| Mediator | Object | Complex many:many object interactions need to be simplified | Inter-object protocol | Colleagues from each other via central hub |
| Memento | Object | Object state must be saved and restored without breaking encapsulation | State snapshot | — |
| Observer | Object | One object changes and an unknown number of dependents must react | — | Subject from observers via interface subscription |
| State | Object | Object behavior must change radically when internal state changes | State-dependent behavior | — |
| Strategy | Object | A family of algorithms must be interchangeable | Algorithm | — |
| Template Method | Class | Invariant algorithm steps should be fixed; variable steps deferred to subclasses | — (uses inheritance) | — |
| Visitor | Object | Operations on a stable object structure must grow without modifying the classes | Operations on a structure | — |
---
## Full Applicability — Per Pattern
### Chain of Responsibility
**Use when:**
- More than one object may handle a request, and the handler is not known a priori — it should be ascertained automatically.
- You want to issue a request to one of several objects without specifying the receiver explicitly.
- The set of objects that can handle a request should be specified dynamically.
**Encapsulation type:** None — the pattern works with an existing set of objects; it does not extract a variation into a standalone encapsulating object.
**Decoupling role:** Sender from receiver. The sender submits a request to the head of the chain; any handler in the chain may fulfill it. The sender has no reference to any specific receiver.
**Key trade-off:** The chain can be changed and extended easily at runtime. However, **there is no guarantee a request is handled** — if no handler claims it, the request falls off the end of the chain silently. Always include a fallback handler (a "catch-all" at the chain's tail) in production systems.
**Confusable with:** Command — both decouple sender from receiver, but Command binds a specific receiver at creation time, whereas Chain of Responsibility leaves the receiver implicit and determined at runtime by chain order.
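The fallback advice can be illustrated with a short Python sketch. All class names here (`Handler`, `HelpHandler`, `CatchAllHandler`) are hypothetical; the base class raises when the chain is exhausted, so an incomplete chain fails loudly instead of dropping the request.

```python
from typing import Optional


class Handler:
    """Base handler: handle the request or forward it to the successor."""

    def __init__(self, successor: Optional["Handler"] = None) -> None:
        self.successor = successor

    def can_handle(self, request: str) -> bool:
        return False

    def process(self, request: str) -> str:
        raise NotImplementedError

    def handle(self, request: str) -> str:
        if self.can_handle(request):
            return self.process(request)
        if self.successor is None:
            # Assertion-style guard: never let a request fall off the end silently.
            raise LookupError(f"unhandled request: {request!r}")
        return self.successor.handle(request)


class HelpHandler(Handler):
    def can_handle(self, request: str) -> bool:
        return request == "help"

    def process(self, request: str) -> str:
        return "showing help"


class CatchAllHandler(Handler):
    """Tail of the chain: accepts anything the earlier handlers declined."""

    def can_handle(self, request: str) -> bool:
        return True

    def process(self, request: str) -> str:
        return f"no specific handler for {request!r}"
```

A chain built as `HelpHandler(CatchAllHandler())` always produces a result, while `HelpHandler()` alone raises on any request it does not recognize.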
---
### Command
**Use when:**
- You want to parameterize objects by an action to perform (an object-oriented replacement for callbacks).
- You need to specify, queue, and execute requests at different times — a command object can outlive the original request.
- You need to support undo: `Execute()` performs the operation; `Unexecute()` reverses it. Unlimited undo/redo is achieved by maintaining a history list and traversing it.
- You need to support crash recovery via logging: commands are stored to disk and re-executed on restart.
- You want to structure a system around high-level transactions built on primitive operations.
**Encapsulation type:** Request (invocation). The command object holds the receiver reference, the operation to invoke, and any arguments. The invoker never sees the receiver.
**Decoupling role:** 1 invoker (sender) : 1 receiver per command. The Command object is the explicit binding between them.
**Key trade-off:** Each unique sender-receiver binding nominally requires a Command subclass. In languages with first-class functions, closures or lambdas can replace subclasses. MacroCommand composes multiple commands — this composability is a significant advantage for scripting and transaction support.
**Confusable with:**
- Chain of Responsibility — Command knows its receiver; Chain does not.
- Memento — both are "magic tokens" passed around as arguments; Command encapsulates a request (executable), Memento encapsulates state (not executable, opaque to all but the originator).
---
### Interpreter
**Use when:**
- There is a language (or notation) to interpret, and its statements can be represented as abstract syntax trees.
- The grammar is simple — for complex grammars, the class hierarchy becomes large and unmanageable; use a parser generator instead.
- Efficiency is not a critical concern — the most efficient interpreters usually compile to another form rather than interpreting parse trees directly.
**Encapsulation type:** Grammar rule. Each rule in the grammar becomes a class in the hierarchy. Every class implements an `Interpret()` operation, and a sentence is evaluated by calling it recursively over the abstract syntax tree.
**Decoupling role:** None specific to sender/receiver decoupling.
**Key trade-off:** Easy to extend the grammar (add a new rule = add a new class). Hard to maintain when the grammar grows large. For production languages, use parser generators (yacc, ANTLR) that can interpret without building full abstract syntax trees.
**Confusable with:** Composite — Interpreter commonly uses Composite to structure the abstract syntax tree. Non-terminal expression classes are Composite nodes; terminal expression classes are leaves.
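As a rough illustration, a tiny arithmetic grammar can be interpreted like this in Python. The grammar and class names (`Number`, `Variable`, `Add`, `Multiply`) are assumptions chosen for the sketch; parsing text into the tree is out of scope.

```python
from abc import ABC, abstractmethod


class Expression(ABC):
    """One class per grammar rule; each knows how to interpret itself."""

    @abstractmethod
    def interpret(self, context: dict) -> int:
        ...


class Number(Expression):  # terminal expression (leaf)
    def __init__(self, value: int) -> None:
        self.value = value

    def interpret(self, context: dict) -> int:
        return self.value


class Variable(Expression):  # terminal expression (leaf)
    def __init__(self, name: str) -> None:
        self.name = name

    def interpret(self, context: dict) -> int:
        return context[self.name]


class Add(Expression):  # non-terminal expression (composite node)
    def __init__(self, left: Expression, right: Expression) -> None:
        self.left = left
        self.right = right

    def interpret(self, context: dict) -> int:
        return self.left.interpret(context) + self.right.interpret(context)


class Multiply(Expression):  # non-terminal expression (composite node)
    def __init__(self, left: Expression, right: Expression) -> None:
        self.left = left
        self.right = right

    def interpret(self, context: dict) -> int:
        return self.left.interpret(context) * self.right.interpret(context)
```

Extending the grammar with, say, subtraction would mean adding one more `Expression` subclass, which is the extension cost described in the trade-off above.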
---
### Iterator
**Use when:**
- You need to access an aggregate object's contents without exposing its internal representation.
- You need to support multiple simultaneous traversals of the same aggregate.
- You want to provide a uniform interface for traversing different aggregate structures (polymorphic iteration).
**Encapsulation type:** Traversal method. The iterator object holds the traversal state (current position) and the traversal algorithm, separate from the aggregate that holds the elements.
**Decoupling role:** Client from aggregate internal structure. The client calls `First()`, `Next()`, `IsDone()`, `CurrentItem()` without knowing whether the aggregate is a list, a skip list, a tree, or any other structure.
**Key trade-off:** Separating traversal from the aggregate enables multiple simultaneous traversals of the same collection without interference. The aggregate must provide a factory method (`CreateIterator()`) to produce the appropriate iterator — this is a Factory Method application. Internal iterators (where the aggregate controls traversal) are simpler to use but less flexible than external iterators (where the client controls traversal).
**Confusable with:** Visitor — Visitor applies an operation to each element during traversal; Iterator handles the traversal itself. They compose: an Iterator traverses while a Visitor performs the operation on each item visited.
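In Python the iterator protocol plays the role of the GoF interface: `__iter__` on the aggregate corresponds to `CreateIterator()`, and `__next__` combines `Next()`, `IsDone()`, and `CurrentItem()`. The sketch below uses a hypothetical `Playlist` aggregate to show two independent traversals.

```python
class PlaylistIterator:
    """External iterator: holds the traversal position outside the aggregate."""

    def __init__(self, tracks: list) -> None:
        self._tracks = tracks
        self._position = 0

    def __iter__(self) -> "PlaylistIterator":
        return self

    def __next__(self) -> str:
        if self._position >= len(self._tracks):
            raise StopIteration
        track = self._tracks[self._position]
        self._position += 1
        return track


class Playlist:
    """Aggregate: hides its storage and hands out fresh iterators."""

    def __init__(self, tracks) -> None:
        self._tracks = list(tracks)

    def __iter__(self) -> PlaylistIterator:
        # Plays the role of CreateIterator(): each call yields a new traversal.
        return PlaylistIterator(self._tracks)
```

Because each call to `iter(playlist)` returns its own `PlaylistIterator`, advancing one traversal does not disturb another.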
---
### Mediator
**Use when:**
- A set of objects communicate in well-defined but complex ways. The resulting interdependencies are unstructured and difficult to understand.
- Reusing an object is difficult because it refers to and communicates with many other objects.
- A behavior distributed between several classes should be customizable without a lot of subclassing.
**Encapsulation type:** Inter-object protocol. The mediator object owns all the routing logic between colleague objects. Colleagues refer only to the mediator, never to each other directly.
**Decoupling role:** Many colleagues from each other via a central hub. The mediator centralizes rather than distributes communication. This is the opposite architectural direction from Observer.
**Key trade-off:**
- *Advantage over Observer:* Communication flow is easier to understand and change — all logic lives in one place.
- *Disadvantage vs. Observer:* The mediator itself can become a monolithic bottleneck — a "god object" that knows too much. Observer produces more reusable subjects and observers (finer-grained classes) but harder-to-trace communication flow.
- Mediator can reduce the need for subclassing by centralizing behavior that would otherwise be distributed across subclasses.
**Confusable with:** Observer — see "Observer vs. Mediator" section below.
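A minimal Python sketch of the hub arrangement follows. The `DialogMediator`, `Checkbox`, and `Button` classes and the `"toggled"` event name are hypothetical; the point is only that colleagues hold a reference to the mediator and none to each other.

```python
class Checkbox:
    """Colleague: reports changes to the mediator, knows no other widget."""

    def __init__(self, mediator: "DialogMediator") -> None:
        self.mediator = mediator
        self.checked = False

    def toggle(self) -> None:
        self.checked = not self.checked
        self.mediator.notify(self, "toggled")


class Button:
    """Colleague: its enabled flag is set by the mediator only."""

    def __init__(self, mediator: "DialogMediator") -> None:
        self.mediator = mediator
        self.enabled = False


class DialogMediator:
    """Mediator: owns all routing logic between the colleagues."""

    def __init__(self) -> None:
        self.checkbox = Checkbox(self)
        self.button = Button(self)

    def notify(self, sender: object, event: str) -> None:
        if sender is self.checkbox and event == "toggled":
            self.button.enabled = self.checkbox.checked
```

Every new interaction rule lands in `notify`, which keeps the flow easy to read and is also how a mediator grows into the "god object" described in the trade-off.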
---
### Memento
**Use when:**
- A snapshot of (some portion of) an object's state must be saved so it can be restored later, **and**
- A direct interface to obtaining that state would expose implementation details and break the object's encapsulation.
**Encapsulation type:** State snapshot. The memento holds the originator's internal state at a point in time. Only the originator can write to or read from the memento — it is opaque to all other objects (the caretaker holds it but never inspects it).
**Decoupling role:** None specific to sender/receiver. Memento supports undo mechanisms and checkpointing without coupling the undo mechanism to the originator's internal structure.
**Key trade-off:** If the originator must copy large amounts of state frequently, mementos can be expensive in memory. Incremental mementos (storing only the diff) can mitigate this but add complexity. The narrow interface (caretaker cannot inspect) preserves encapsulation but prevents caretakers from managing memento storage efficiently.
**Confusable with:** Command — both are "tokens" passed as arguments. Command is executable (it has a polymorphic `Execute()` operation); Memento's interface is so narrow it presents no polymorphic operations to its clients.
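The originator, memento, and caretaker roles can be sketched in Python as below. The `Editor` originator is a hypothetical example. Python cannot enforce the narrow interface, so the leading underscore on `_state` is only a convention that caretakers are expected to respect.

```python
import copy


class Memento:
    """Opaque snapshot: only the originator is meant to read _state."""

    def __init__(self, state: list) -> None:
        self._state = state


class Editor:
    """Originator: creates mementos and restores itself from them."""

    def __init__(self) -> None:
        self._lines = []

    def type_line(self, text: str) -> None:
        self._lines.append(text)

    def text(self) -> str:
        return "\n".join(self._lines)

    def save(self) -> Memento:
        # Copy so that later edits cannot alter the snapshot.
        return Memento(copy.deepcopy(self._lines))

    def restore(self, memento: Memento) -> None:
        # Copy again so the same memento can be restored more than once.
        self._lines = copy.deepcopy(memento._state)
```

The full copy taken in `save()` is the memory cost named in the trade-off; an incremental memento would store only the difference.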
---
### Observer
**Use when:**
- An abstraction has two aspects, one dependent on the other. Encapsulating them in separate objects lets you vary and reuse them independently.
- A change to one object requires changing others, and you do not know how many objects need to change.
- An object should be able to notify other objects without making assumptions about who those objects are — without tight coupling.
**Encapsulation type:** None in the "extract a variation" sense — Observer encapsulates a dependency relationship. The Subject and Observer interfaces together maintain the constraint, distributed across both sides.
**Decoupling role:** 1 subject (sender) : many observers (receivers), dynamic membership. Observers self-register and self-deregister. The subject keeps a list of its observers but knows them only through the abstract Observer interface, never their concrete classes.
**Key trade-off:**
- Promotes loose coupling and finer-grained classes (more reusable subjects and observers) — *advantage over Mediator*.
- Communication flow is hard to trace — when observers and subjects are connected indirectly, the "why did this change?" question requires following indirect chains — *disadvantage vs. Mediator*.
- Update cascades are possible: if an observer is also a subject, a notification can trigger a chain of updates that is difficult to debug.
**Confusable with:** Mediator — see "Observer vs. Mediator" section below.
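A small Python sketch of the subscription mechanism is given below. `Thermometer` and `RecordingObserver` are hypothetical names; the subject reaches its observers only through their `update()` method.

```python
class Subject:
    """Keeps a dynamic list of observers and notifies them on change."""

    def __init__(self) -> None:
        self._observers = []

    def attach(self, observer) -> None:
        self._observers.append(observer)

    def detach(self, observer) -> None:
        self._observers.remove(observer)

    def notify(self) -> None:
        # Iterate over a copy so observers may detach during notification.
        for observer in list(self._observers):
            observer.update(self)


class Thermometer(Subject):
    def __init__(self) -> None:
        super().__init__()
        self.celsius = 0.0

    def set_celsius(self, value: float) -> None:
        self.celsius = value
        self.notify()


class RecordingObserver:
    """Observer: reacts to any subject that exposes a celsius attribute."""

    def __init__(self) -> None:
        self.readings = []

    def update(self, subject) -> None:
        self.readings.append(subject.celsius)
```

If `RecordingObserver` were itself a subject that notified others from `update()`, the cascade risk described above would appear.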
---
### State
**Use when:**
- An object's behavior depends on its state, and it must change its behavior at runtime depending on that state.
- Operations have large multipart conditional statements that depend on the object's state (typically represented by enumerated constants). The State pattern puts each branch of the conditional into a separate class.
**Encapsulation type:** State-dependent behavior. Each `ConcreteState` class encapsulates all behavior appropriate to one state of the context. The context delegates requests to its current state object.
**Decoupling role:** None specific to sender/receiver. The context decouples its behavior from its state by delegating to state objects rather than inspecting `if state == X` inline.
**Key trade-off:** State transitions can be defined in either the `ConcreteState` classes or the `Context` class. Putting transitions in state classes is more flexible but introduces dependencies between state classes (they know their successors). Putting transitions in `Context` keeps state classes independent but centralizes transition logic. The number of classes grows by one per state, which is the appropriate trade-off when states have meaningfully different behavior.
**Confusable with:** Strategy — see "State vs. Strategy" section below.
---
### Strategy
**Use when:**
- Many related classes differ only in their behavior. Strategy provides a way to configure a class with one of many behaviors.
- You need different variants of an algorithm — for example, algorithms with different space/time trade-offs.
- An algorithm uses data that clients should not know about. Use Strategy to avoid exposing complex, algorithm-specific data structures.
- A class defines many behaviors that appear as multiple conditional statements in its operations. Move related conditional branches into their own Strategy class.
**Encapsulation type:** Algorithm. The strategy object holds one interchangeable computation. Context holds a reference to a Strategy interface and delegates the algorithm to whichever concrete strategy is installed.
**Decoupling role:** None specific to sender/receiver. Strategy decouples algorithm selection from algorithm use — the context does not know which algorithm is running.
**Key trade-off:** Clients must know which strategy to select. If strategy selection is complex, pair with a factory. All strategies share the same interface — if different strategies need very different data from the context, the Strategy interface may become bloated, or strategies may receive data they do not use.
**Confusable with:**
- State — same class diagram, different intent. See section below.
- Template Method — same goal (varying an algorithm), opposite mechanism. See section below.
---
### Template Method
**Use when:**
- You want to implement the invariant parts of an algorithm once and leave it to subclasses to implement the variable behavior.
- Common behavior among subclasses should be factored and localized in a common class to avoid code duplication.
- You want to control which points subclasses can extend. A template method that calls "hook" operations permits extension only at those points.
**Encapsulation type:** None in the "separate object" sense — Template Method uses inheritance. The abstract class owns the algorithm skeleton (the template method); concrete subclasses fill in the abstract steps.
**Decoupling role:** None specific to sender/receiver.
**Key trade-off:** Template Method is the fundamental technique for code reuse in class libraries. The inversion of control ("don't call us, we'll call you") is its defining characteristic — the parent class calls the subclass's overrides, not the other way around. The cost is that variation is fixed at compile time (subclass selection) and cannot be changed at runtime.
**Confusable with:** Strategy — see section below.
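The skeleton-plus-hooks structure can be sketched in Python with an abstract base class. `ReportGenerator` and `CsvReport` are hypothetical names; `generate()` is the template method, `header()` and `format_row()` are the required steps, and `footer()` is an optional hook.

```python
from abc import ABC, abstractmethod


class ReportGenerator(ABC):
    """Abstract class: owns the invariant algorithm skeleton."""

    def generate(self, rows) -> str:
        # Template method: the order of steps is fixed here, not in subclasses.
        lines = [self.header()]
        lines.extend(self.format_row(row) for row in rows)
        footer = self.footer()
        if footer:
            lines.append(footer)
        return "\n".join(lines)

    @abstractmethod
    def header(self) -> str:
        ...

    @abstractmethod
    def format_row(self, row) -> str:
        ...

    def footer(self) -> str:
        # Hook operation: subclasses may override it, but do not have to.
        return ""


class CsvReport(ReportGenerator):
    def header(self) -> str:
        return "name,qty"

    def format_row(self, row) -> str:
        return f"{row[0]},{row[1]}"
```

The parent calls the subclass overrides, never the reverse, which is the inversion of control described in the trade-off.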
---
### Visitor
**Use when:**
- An object structure contains many classes of objects with differing interfaces, and you want to perform operations on them that depend on their concrete classes.
- Many distinct and unrelated operations need to be performed on objects in an object structure, and you want to avoid "polluting" their classes with these operations. Visitor groups related operations in one class.
- The classes defining the object structure rarely change, but you often want to define new operations over the structure. (If the object structure changes frequently, it is usually better to define operations in those classes instead of using Visitor.)
**Encapsulation type:** Operations on a structure. The visitor object holds a family of operations — one per element type in the object structure — and defines them outside those classes.
**Decoupling role:** None specific to sender/receiver.
**Key trade-off:** Adding a new operation is easy — add a new Visitor class. Adding a new element type is hard — requires updating every Visitor class. The pattern is appropriate when the object structure is stable but the set of operations over it is expected to grow. Visitor breaks encapsulation: element classes must expose enough state for visitors to do their work, which may expose internals.
**Confusable with:** Iterator — Iterator traverses; Visitor operates. They compose cleanly.
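The double-dispatch mechanism behind Visitor can be sketched in Python as follows. The shape classes and the two visitors are hypothetical; note that each element must expose `radius`, `width`, or `height` for the visitors to work, which is the encapsulation cost mentioned above.

```python
import math


class Circle:
    def __init__(self, radius: float) -> None:
        self.radius = radius

    def accept(self, visitor):
        return visitor.visit_circle(self)


class Rectangle:
    def __init__(self, width: float, height: float) -> None:
        self.width = width
        self.height = height

    def accept(self, visitor):
        return visitor.visit_rectangle(self)


class AreaVisitor:
    """One operation over the structure, defined outside the element classes."""

    def visit_circle(self, circle: Circle) -> float:
        return math.pi * circle.radius ** 2

    def visit_rectangle(self, rectangle: Rectangle) -> float:
        return rectangle.width * rectangle.height


class PerimeterVisitor:
    """A second operation added without touching Circle or Rectangle."""

    def visit_circle(self, circle: Circle) -> float:
        return 2 * math.pi * circle.radius

    def visit_rectangle(self, rectangle: Rectangle) -> float:
        return 2 * (rectangle.width + rectangle.height)
```

Adding `PerimeterVisitor` required no change to the shapes, whereas adding a `Triangle` element would require a new `visit_triangle` method in every visitor.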
---
## Key Differentiating Pairs
### Observer vs. Mediator
The GoF explicitly frames these as competing patterns. The choice is a trade-off between two architectural values:
| | Observer | Mediator |
|---|----------|----------|
| Communication model | **Distributed** — subjects and observers maintain the constraint together | **Centralized** — mediator owns all routing logic |
| Reusability | Higher — subjects and observers are fine-grained, independently reusable | Lower — mediator is application-specific |
| Understandability | Lower — indirect connections are hard to trace | Higher — all communication logic in one place |
| Typical use | Data dependency: one thing changes, many unknown others must react | Interaction complexity: many things communicate in structured but tangled ways |
Both can be present in the same system: Observer for state-change notification between subjects and their dependents; Mediator to coordinate the subjects themselves.
---
### State vs. Strategy
Same class diagram. Completely different intent.
| | Strategy | State |
|---|----------|-------|
| Who installs the variant? | The **client** (externally, explicitly) | The **context itself** (internally, based on its own logic) |
| Do concrete variants know about each other? | No — strategies are independent | Often yes — state objects initiate transitions to sibling states |
| Question being answered | "Which algorithm should run?" | "What behavior is appropriate for the current lifecycle stage?" |
| Runtime changeability | Yes — by client replacing the strategy object | Yes — by the context or state object transitioning to a new state |
If the behavior changes because someone outside the object chose a configuration → Strategy.
If the behavior changes because the object itself reached a new lifecycle milestone → State.
---
### Template Method vs. Strategy
Both vary an algorithm. The mechanism is opposite.
| | Template Method | Strategy |
|---|----------------|----------|
| Mechanism | **Inheritance** — abstract class skeleton, subclass fills steps | **Composition** — context holds a replaceable strategy object |
| Runtime flexibility | None — algorithm variant is fixed at class definition | Full — strategy can be swapped at runtime |
| Coupling | Subclass is tightly coupled to the superclass skeleton | Context and strategy are loosely coupled via interface |
| Migration path | Often the starting point (simpler) | Often the refactoring target when Template Method becomes too inflexible |
A system commonly begins with Template Method (simpler, no extra objects) and refactors to Strategy when the variation needs to be swapped at runtime or composed dynamically.
---
### Command vs. Chain of Responsibility
Both decouple sender from receiver, but the receiver's identity is handled differently.
| | Command | Chain of Responsibility |
|---|---------|------------------------|
| Receiver identity | **Known** at command-creation time (held by the command object) | **Unknown** — determined at runtime by chain order |
| Request lifecycle | First-class object — storable, undoable, loggable | Transient — request is forwarded until handled |
| Guarantee of handling | Yes — command always has a receiver | No — request may go unhandled if chain is exhausted |
| Composition | Commands compose into MacroCommands | Chain can use Commands as request objects |
Use Command when the request needs a lifecycle beyond the invocation call. Use Chain when the right handler is context-dependent and should be discovered at runtime — but always add a fallback handler.
---
## Pattern Interaction Map
Behavioral patterns that frequently combine:
| Combination | Why |
|-------------|-----|
| Chain of Responsibility + Command | Chain uses Command objects as request representations; request is then loggable and replayable |
| Chain of Responsibility + Template Method | Template Method defines each handler's dispatch logic (check → handle or forward) |
| Command + Memento | Command performs operations; Memento provides state snapshot for operations whose reversal cannot be computed from the operation alone |
| Iterator + Visitor | Iterator traverses the aggregate; Visitor applies an operation to each element visited |
| Observer + Mediator | Observer for state-change notification; Mediator to coordinate subjects that interact with each other |
| Strategy + Template Method | Template Method provides the skeleton; Strategy fills in variable steps (evolution path from inheritance to composition) |
| Interpreter + Composite | Non-terminal expression classes are Composite nodes; Interpreter uses the composite structure to interpret sentences |
| Interpreter + Iterator | Iterator traverses the abstract syntax tree that Interpreter evaluates |
Implement the Abstract Factory pattern to create families of related objects without specifying their concrete classes. Use when a system must be independent...
---
name: abstract-factory-implementor
description: |
Implement the Abstract Factory pattern to create families of related objects without specifying their concrete classes. Use when a system must be independent of how its products are created, when it must work with multiple product families, when products are designed to be used together and you need to enforce that constraint, or when providing a class library revealing only interfaces. Covers factory-as-singleton, prototype-based factories, and extensible factory interfaces.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/design-patterns-gof/skills/abstract-factory-implementor
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- creational-pattern-selector
source-books:
- id: design-patterns-gof
title: "Design Patterns: Elements of Reusable Object-Oriented Software"
authors: ["Erich Gamma", "Richard Helm", "Ralph Johnson", "John Vlissides"]
chapters: [3]
pages: [54-58, 88-96]
tags: [design-patterns, creational, gof, abstract-factory, product-families, portability, refactoring]
execution:
tier: 2
mode: full
inputs:
- type: code
description: "Existing code with hard-coded concrete class instantiation spread across a client, or a design description of a system that must support multiple product families"
tools-required: [TodoWrite, Read]
tools-optional: [Grep, Edit]
mcps-required: []
environment: "Any codebase. Language-agnostic — examples use C++ and Python but the pattern applies universally."
---
# Abstract Factory Implementor
## When to Use
You are creating families of related objects and the client must stay independent of which concrete family it uses. This skill applies under four conditions. Each is sufficient on its own; any one justifies the pattern:
1. **Product-creation independence** — a system should not depend on how its products are created, composed, or represented. Clients should work against abstract interfaces, not constructor calls.
2. **Multiple product families** — the system must be configured with one of several families of products (e.g., Motif widgets vs. Presentation Manager widgets vs. Mac widgets), and the choice of family should be a single point of configuration, not scattered throughout the codebase.
3. **Enforced co-usage** — products within a family are designed to work together, and you need a structural guarantee that clients never mix products from different families (a MotifScrollBar with a PMButton, for instance, is a violation the pattern makes structurally impossible).
4. **Interface-only library** — you are providing a class library and want to reveal only interfaces, not implementations. Clients code against the abstract factory and abstract product classes; the concrete implementations are deployment-time choices.
Before starting, confirm this is not a simpler case:
- If only **one** product type varies, prefer Factory Method — Abstract Factory is for *families* of products that must vary together.
- If the products are not related (no co-usage constraint), the coordination overhead of Abstract Factory is not justified.
---
## Process
### Step 1: Set Up Tracking and Map the Product Families
**ACTION:** Use `TodoWrite` to track progress, then identify all product types and all families.
**WHY:** The Abstract Factory interface is fixed at design time based on the product types you identify here. Missing a product type now means changing the AbstractFactory interface later, which requires updating every ConcreteFactory — the most expensive structural change possible. A complete product-family map made upfront prevents this.
```
TodoWrite:
- [ ] Step 1: Map product types and product families
- [ ] Step 2: Define AbstractProduct interfaces
- [ ] Step 3: Define the AbstractFactory interface
- [ ] Step 4: Implement ConcreteFactory classes
- [ ] Step 5: Wire the client to use the factory
- [ ] Step 6: Handle singleton, prototype, or extensibility concerns
- [ ] Step 7: Validate and document known trade-offs
```
**Applicability gate — confirm at least one of these four GoF conditions holds before proceeding:**
| # | Condition | Check |
|---|-----------|-------|
| 1 | System should be independent of how its products are created/composed/represented | Does client code currently name concrete product classes? |
| 2 | System must be configured with one of multiple product families | Are there 2+ families that could be swapped? |
| 3 | Products within a family are designed to be used together — co-usage must be enforced | Would mixing products from different families cause bugs? |
| 4 | You want to reveal only product interfaces, not implementations | Should concrete classes be hidden from library consumers? |
**IF none hold → this is not an Abstract Factory problem.** Consider Factory Method (single product) or Builder (complex construction).
**Alternative to consider — Prototype-based factory:** If many product families exist or families differ only slightly, a single factory initialized with prototype instances (cloning instead of subclassing) eliminates the ConcreteFactory subclass explosion. Evaluate this option in Step 6 — but be aware of it now so you don't commit to a subclass-per-family design prematurely.
Produce a product matrix before writing any code:
| Product type | Family A | Family B | Family C |
|--------------|----------|----------|----------|
| ScrollBar | MotifScrollBar | PMScrollBar | MacScrollBar |
| Button | MotifButton | PMButton | MacButton |
| Menu | MotifMenu | PMMenu | MacMenu |
The columns are your ConcreteFactories. The rows are your AbstractProduct interfaces. Count the rows — that is the number of factory methods on your AbstractFactory interface.
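One way to keep the matrix honest is to write it down as data before any class exists. The following is a minimal sketch, assuming the widget names from the table above; `PRODUCT_MATRIX`, `families`, and `missing_cells` are hypothetical names used only for illustration.

```python
# Product matrix as data: rows are product types, columns are families.
# The names mirror the table above; nothing here is a real library API.
PRODUCT_MATRIX = {
    "ScrollBar": {"Motif": "MotifScrollBar", "PM": "PMScrollBar", "Mac": "MacScrollBar"},
    "Button": {"Motif": "MotifButton", "PM": "PMButton", "Mac": "MacButton"},
    "Menu": {"Motif": "MotifMenu", "PM": "PMMenu", "Mac": "MacMenu"},
}


def families(matrix):
    """Return every family (column) mentioned anywhere in the matrix."""
    return sorted({family for row in matrix.values() for family in row})


def missing_cells(matrix):
    """List (product_type, family) pairs that have no concrete product yet."""
    all_families = families(matrix)
    return [
        (product_type, family)
        for product_type, row in matrix.items()
        for family in all_families
        if family not in row
    ]


# One creation method per row, one ConcreteFactory per column.
factory_method_count = len(PRODUCT_MATRIX)
concrete_factory_count = len(families(PRODUCT_MATRIX))
```

An empty result from `missing_cells` suggests the matrix is complete enough to start defining interfaces; any reported pair is a product that some family has not yet accounted for.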
---
### Step 2: Define AbstractProduct Interfaces
**ACTION:** For each product type (each row in the matrix), declare an abstract interface or base class.
**WHY:** The AbstractProduct interfaces are the contracts the client codes against. They are what make the client independent of concrete classes. Every method the client needs to call on a product must appear on the AbstractProduct interface — if a client has to downcast to access concrete behavior, the interface is incomplete and the isolation benefit disappears.
For each product type:
1. Name it for the domain concept, not the family (e.g., `ScrollBar`, not `MotifScrollBar`)
2. Declare only the operations all families must support
3. Keep it minimal — only what the client actually calls
See the reference file for full code examples in C++, Python, and Smalltalk.
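As a rough illustration of the three rules above, a Python version might use `abc.ABC`. The method names (`scroll_to`, `press`) and the `render_toolbar` client are assumptions made for this sketch, not part of the source material.

```python
from abc import ABC, abstractmethod


class ScrollBar(ABC):
    """AbstractProduct named for the domain concept, not the family."""

    @abstractmethod
    def scroll_to(self, position: int) -> str:
        """Move the thumb to `position` and return a description of the draw."""


class Button(ABC):
    @abstractmethod
    def press(self) -> str:
        """Handle a click and return a description of the visual feedback."""


class MotifScrollBar(ScrollBar):
    def scroll_to(self, position: int) -> str:
        return f"Motif scroll bar at {position}"


class MotifButton(Button):
    def press(self) -> str:
        return "Motif button pressed"


def render_toolbar(scroll_bar: ScrollBar, button: Button) -> list[str]:
    """Client code: touches only the abstract interfaces, never downcasts."""
    return [scroll_bar.scroll_to(10), button.press()]
```

Because `ScrollBar` and `Button` declare abstract methods, Python refuses to instantiate them directly, which keeps the client from accidentally depending on a half-defined product.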
---
### Step 3: Define the AbstractFactory Interface
**ACTION:** Declare an abstract factory class with one creation method per product type.
**WHY:** The AbstractFactory interface encodes the complete set of products the system knows about. Its method signatures return AbstractProduct types, never concrete types — this is what prevents concrete class names from appearing in client code. The factory interface is the "catalog" of the product family; each ConcreteFactory is one edition of that catalog.
```cpp
class GUIFactory {
public:
    virtual ScrollBar* CreateScrollBar() = 0;
    virtual Button* CreateButton() = 0;
    virtual Menu* CreateMenu() = 0;
    virtual ~GUIFactory() = default;
};
```
**Naming convention:** Prefix with `Create` (C++/Java) or `create_` (Python). This makes factory methods instantly recognizable in code review and distinguishes them from ordinary method calls.
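For comparison, the same catalog could be sketched in Python with the `create_` prefix. The product method names here are placeholders chosen for the example, and `CATALOG` is a hypothetical helper that simply lists what every ConcreteFactory will have to implement.

```python
from abc import ABC, abstractmethod


class ScrollBar(ABC):
    @abstractmethod
    def scroll_to(self, position: int) -> None: ...


class Button(ABC):
    @abstractmethod
    def press(self) -> None: ...


class Menu(ABC):
    @abstractmethod
    def open(self) -> None: ...


class GUIFactory(ABC):
    """One `create_` method per row of the product matrix."""

    @abstractmethod
    def create_scroll_bar(self) -> ScrollBar: ...

    @abstractmethod
    def create_button(self) -> Button: ...

    @abstractmethod
    def create_menu(self) -> Menu: ...


# The abstract method set is the "catalog" every ConcreteFactory must fill in.
CATALOG = sorted(GUIFactory.__abstractmethods__)
```

Every return annotation names an abstract product, so a reader of the interface never learns which family is in use.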
---
### Step 4: Implement ConcreteFactory Classes
**ACTION:** For each product family (each column in the matrix), create a ConcreteFactory subclass that implements every creation method.
**WHY:** Each ConcreteFactory encapsulates complete knowledge of one product family. A client that holds a `MotifFactory` reference is guaranteed to receive only Motif products — the factory enforces the co-usage constraint structurally, not through documentation or convention. This is the core value: consistency across a product family is automatic, not policed.
```cpp
class MotifFactory : public GUIFactory {
public:
    ScrollBar* CreateScrollBar() override { return new MotifScrollBar; }
    Button* CreateButton() override { return new MotifButton; }
    Menu* CreateMenu() override { return new MotifMenu; }
};

class PMFactory : public GUIFactory {
public:
    ScrollBar* CreateScrollBar() override { return new PMScrollBar; }
    Button* CreateButton() override { return new PMButton; }
    Menu* CreateMenu() override { return new PMMenu; }
};
```
For each ConcreteFactory, implement the corresponding ConcreteProduct classes:
```cpp
class MotifScrollBar : public ScrollBar {
public:
    void ScrollTo(int position) override {
        // Motif-specific scroll rendering
    }
};

class PMScrollBar : public ScrollBar {
public:
    void ScrollTo(int position) override {
        // Presentation Manager scroll rendering
    }
};
```
---
### Step 5: Wire the Client to Use the Factory
**ACTION:** Refactor the client to receive an AbstractFactory reference and call only factory methods and AbstractProduct interfaces — no `new ConcreteProduct(...)` calls anywhere in client code.
**WHY:** The client has to commit to using the factory interface consistently. A single stray `new MotifScrollBar` in the client code breaks the isolation guarantee — the concrete class name is back in client code and the family-switching benefit is lost. This step is not complete until `grep` finds zero concrete class names imported or instantiated in client code.
**Before (hard-coded, platform-specific):**
```cpp
// Every widget creation exposes a concrete class name
ScrollBar* sb = new MotifScrollBar;
Button* btn = new MotifButton;
Menu* menu = new MotifMenu;
```
**After (factory-mediated, platform-independent):**
```cpp
void Application::BuildInterface(GUIFactory* factory) {
    // No concrete class names: the family is entirely encapsulated in the factory
    ScrollBar* sb = factory->CreateScrollBar();
    Button* btn = factory->CreateButton();
    Menu* menu = factory->CreateMenu();
    // ...
}
```
**Factory initialization** — this must happen once, before the client first needs a product, at the program's configuration point:
```cpp
// Option A: family known at compile time
// GUIFactory* guiFactory = new MotifFactory;

// Option B: family selected at run time (environment or config)
GUIFactory* guiFactory = nullptr;
const char* styleName = getenv("LOOK_AND_FEEL");  // null when the variable is unset
if (styleName && strcmp(styleName, "Motif") == 0) {
    guiFactory = new MotifFactory;
} else if (styleName && strcmp(styleName, "Presentation_Manager") == 0) {
    guiFactory = new PMFactory;
} else {
    guiFactory = new DefaultGUIFactory;
}
// Pass guiFactory to all subsystems; they never see a concrete class name
```
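In Python the same single configuration point is often written as a lookup table rather than an `if` chain. This is a sketch under stated assumptions: the `FACTORIES` registry and `select_factory` function are hypothetical, and products are reduced to strings to keep the example short.

```python
import os
from abc import ABC, abstractmethod


class GUIFactory(ABC):
    @abstractmethod
    def create_button(self) -> str: ...


class MotifFactory(GUIFactory):
    def create_button(self) -> str:
        return "MotifButton"


class PMFactory(GUIFactory):
    def create_button(self) -> str:
        return "PMButton"


class DefaultGUIFactory(GUIFactory):
    def create_button(self) -> str:
        return "DefaultButton"


# Hypothetical registry: the only place concrete factory names are listed.
FACTORIES = {
    "Motif": MotifFactory,
    "Presentation_Manager": PMFactory,
}


def select_factory(environ=os.environ) -> GUIFactory:
    """Pick the family once, falling back when LOOK_AND_FEEL is unset or unknown."""
    style_name = environ.get("LOOK_AND_FEEL")  # None when the variable is missing
    factory_class = FACTORIES.get(style_name, DefaultGUIFactory)
    return factory_class()
```

Passing `environ` as a parameter is a design choice made here so the selection logic can be exercised with a plain dictionary instead of the real process environment.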
---
### Step 6: Handle Singleton, Prototype, or Extensibility Concerns
**ACTION:** Apply one or more of the three standard implementation refinements based on the system's constraints.
**WHY:** The basic structure in Steps 1-5 covers the standard case. The three refinements address recurring complications. Choosing the wrong refinement (or skipping one when it is needed) leads to: unnecessary multi-instantiation of factories (singleton concern), an explosion of ConcreteFactory subclasses when families multiply (prototype concern), or a brittle interface that breaks when new product types are added (extensibility concern).
**Refinement A — Factories as Singletons**
An application typically needs only one ConcreteFactory instance per product family, and only one family active at a time. If different parts of the application construct their own factories, nothing stops them from choosing different families, which is precisely the co-usage violation the pattern prevents.
Apply the Singleton pattern to each ConcreteFactory class. The factory's construction is then controlled and the single instance is shared across all clients.
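In Python, one lightweight way to approximate this is a cached accessor function. The sketch below assumes a stand-in `MotifFactory` with a placeholder creation method; `motif_factory` is a hypothetical accessor name.

```python
from functools import lru_cache


class MotifFactory:
    """Stand-in ConcreteFactory; the creation method is a placeholder."""

    def create_button(self) -> str:
        return "MotifButton"


@lru_cache(maxsize=None)
def motif_factory() -> MotifFactory:
    """Return the one shared MotifFactory, creating it on first call."""
    return MotifFactory()
```

Note that this only shares the instance among callers of `motif_factory()`; it does not stop code from calling `MotifFactory()` directly, so it is a convention backed by a helper rather than a hard guarantee.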
**Refinement B — Prototype-based Factories (when families are many and similar)**
The factory-method approach requires a new ConcreteFactory subclass for each product family. When many product families exist (or families differ only slightly), this produces a long list of near-identical factory subclasses, one per family.
The alternative: a single concrete factory initialized with prototype instances. The factory's `Make` method clones the appropriate prototype rather than calling a subclass-specific factory method:
```smalltalk
"Smalltalk prototype-based factory"
make: partName
    ^ (partCatalog at: partName) copy

addPart: partTemplate named: partName
    partCatalog at: partName put: partTemplate
```
Creating an EnchantedMazeFactory requires no new class — only a different factory initialization:
```smalltalk
createMazeFactory
    ^ (MazeFactory new
        addPart: Wall named: #wall;
        addPart: EnchantedRoom named: #room;
        addPart: DoorNeedingSpell named: #door;
        yourself)
```
Prefer the prototype approach when: (a) many product families exist, (b) families differ only in which concrete products they use, or (c) you need to create new families at runtime by composing existing prototypes.
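A rough Python equivalent of the prototype-based factory can be built on `copy.deepcopy`. The class names follow the maze example; `PrototypeMazeFactory`, `add_part`, and `make` are names assumed for this sketch.

```python
import copy


class Wall:
    kind = "wall"


class BombedWall(Wall):
    kind = "bombed wall"


class Room:
    kind = "room"

    def __init__(self, number: int = 0):
        self.number = number


class EnchantedRoom(Room):
    kind = "enchanted room"


class PrototypeMazeFactory:
    """Single factory class; a family is defined by the prototypes it holds."""

    def __init__(self):
        self._part_catalog = {}

    def add_part(self, template, name: str) -> "PrototypeMazeFactory":
        self._part_catalog[name] = template
        return self  # allow chained registration

    def make(self, name: str):
        # Clone so callers never share or mutate the registered prototype.
        return copy.deepcopy(self._part_catalog[name])


enchanted = PrototypeMazeFactory().add_part(Wall(), "wall").add_part(EnchantedRoom(), "room")
bombed = PrototypeMazeFactory().add_part(BombedWall(), "wall").add_part(Room(), "room")
```

Both families come from the same factory class; only the registered prototypes differ, which is what removes the need for a subclass per family.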
**Refinement C — Extensible Factory Interface**
The standard AbstractFactory interface is closed: adding a new product type requires changing the interface and all ConcreteFactories. A more flexible (but less type-safe) alternative is a single parameterized `Make` method:
```cpp
// Type-safe but closed
virtual ScrollBar* CreateScrollBar() = 0;
virtual Button* CreateButton() = 0;
// Flexible but loses static type safety
virtual AbstractProduct* Make(const string& productKind) = 0;
```
The parameterized version leaves the factory interface unchanged when new product types are added (only the set of accepted `productKind` values grows), but products are returned as a common abstract type, so clients must cast (or rely on dynamic typing). This trade-off is the classic **extensibility vs. type safety** tension.
Use the extensible form when: the product type set is known to grow, you are in a dynamically typed language (where the casting cost disappears), or the flexibility benefit outweighs the safety cost. Stick with the standard form when: the product set is stable, static type checking matters, and you want the compiler to catch misuse.
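In a dynamically typed language the parameterized form can be sketched as a registry of callables. Everything below is illustrative: `ParameterizedFactory`, `register`, and the product kinds are assumed names, not an API from the source.

```python
class Widget:
    """Common base that every product must share in the parameterized form."""


class ScrollBar(Widget):
    def scroll_to(self, position: int) -> int:
        return position


class Button(Widget):
    def press(self) -> str:
        return "pressed"


class ParameterizedFactory:
    """One `make` operation; product kinds are data, not method signatures."""

    def __init__(self):
        self._makers = {}

    def register(self, product_kind: str, maker) -> None:
        self._makers[product_kind] = maker

    def make(self, product_kind: str) -> Widget:
        maker = self._makers.get(product_kind)
        if maker is None:
            # A misspelled kind surfaces at run time, not at compile time.
            raise ValueError(f"unknown product kind: {product_kind!r}")
        return maker()


factory = ParameterizedFactory()
factory.register("scroll_bar", ScrollBar)
factory.register("button", Button)
```

Adding a product kind is a single `register` call with no interface change, and the price is visible in `make`: an unknown kind is only detected when the call actually happens.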
---
### Step 7: Validate and Document Known Trade-offs
**ACTION:** Verify the implementation satisfies all invariants, then explicitly document the extensibility liability.
**WHY:** Two of the four consequences of Abstract Factory are benefits (isolation, easy exchange), one is a benefit-with-a-condition (consistency, which only holds if you never mix factories), and one is a genuine liability (supporting new product kinds is hard). Documenting the liability upfront prevents the team from discovering it painfully later when someone tries to add a new product type.
**Validation checklist:**
- [ ] No concrete product class names appear in client code (verify with grep)
- [ ] Every product type in the matrix has an AbstractProduct interface
- [ ] Every ConcreteFactory implements every AbstractProduct creation method
- [ ] Factory is initialized once, at application startup or configuration boundary
- [ ] If Singleton refinement was applied — factory construction is controlled
- [ ] If Prototype refinement was applied — prototype registry is populated before first use
- [ ] If Extensibility refinement was applied — all clients handle the downcast safely
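The first checklist item can be automated with a small script instead of an ad hoc `grep`. This is a minimal sketch, assuming you maintain a list of concrete product names yourself; `CONCRETE_NAMES` and `find_leaks` are hypothetical.

```python
import re

# Hypothetical list: every ConcreteProduct name from your product matrix.
CONCRETE_NAMES = ["MotifScrollBar", "MotifButton", "PMScrollBar", "PMButton"]


def find_leaks(client_source: str, concrete_names=CONCRETE_NAMES):
    """Return (line_number, name) for each concrete class named in client code."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, concrete_names)) + r")\b")
    leaks = []
    for line_number, line in enumerate(client_source.splitlines(), start=1):
        for match in pattern.finditer(line):
            leaks.append((line_number, match.group(1)))
    return leaks


clean_client = "sb = factory.create_scroll_bar()\nbtn = factory.create_button()\n"
leaky_client = "sb = factory.create_scroll_bar()\nbtn = MotifButton()\n"
```

The check is purely textual, so it also flags names that appear in comments or strings; treat a hit as something to inspect rather than as proof of a violation.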
**Document the trade-off:**
| Consequence | Effect |
|-------------|--------|
| Isolates concrete classes | Client code contains zero concrete class names; all coupling runs through abstract interfaces |
| Easy family exchange | Changing the entire product family requires changing exactly one factory instantiation site |
| Enforces product consistency | Using objects from different families simultaneously is structurally impossible — the factory guarantees the constraint |
| Hard to add new product types | Adding a new product type requires changing AbstractFactory and every ConcreteFactory — this cascades across the entire hierarchy |
---
## Inputs
- Existing code with scattered `new ConcreteProduct(...)` calls that need to be family-switched, OR
- A design description identifying: (a) product types, (b) product families, (c) the client that consumes products
## Outputs
- `abstract_factory.py` (or `.h`/`.cpp`, `.java`, etc.) — AbstractFactory + AbstractProduct interfaces
- `concrete_factories.py` — one ConcreteFactory per family, with ConcreteProduct implementations
- Updated client code — factory-mediated, with all concrete class names removed
---
## Key Principles
- **The interface is the commitment** — every product type your AbstractFactory declares is a permanent entry in the contract. Adding product types later is costly. Invest in completeness at design time rather than patching the interface repeatedly.
- **One factory initialization point** — the concrete factory should be chosen exactly once, at the application's configuration boundary. Allowing factory selection to spread through the codebase recreates the family-mixing problem the pattern solves.
- **Consistency is structural, not conventional** — the value of Abstract Factory over a naming convention ("always use Motif widgets together") is that the structure makes family mixing impossible. If clients can bypass the factory, the guarantee evaporates.
- **Prototype over subclass when families multiply** — factory-method-based AbstractFactory creates one ConcreteFactory subclass per product family. When families are many or differ only slightly, the prototype approach eliminates the subclass explosion without sacrificing the interface guarantee.
- **The extensibility liability is real** — Abstract Factory trades extensibility for consistency. The pattern is appropriate when the product type set is stable. If new product types are expected frequently, the parameterized `Make` interface or a different pattern (Prototype, Builder) may be more appropriate.
---
## Examples
### Example 1: Lexi GUIFactory — Look-and-Feel Portability
**Scenario:** Lexi, a document editor, must run on Motif, Presentation Manager, and Mac. Every widget creation call currently names a concrete class, making it impossible to switch look-and-feel at runtime or port to a new platform without auditing the entire codebase.
**Trigger:** "We need to support a new look-and-feel standard, but widgets are instantiated directly throughout the application."
**Product matrix:**
| Product type | MotifFactory | PMFactory | MacFactory |
|---|---|---|---|
| ScrollBar | MotifScrollBar | PMScrollBar | MacScrollBar |
| Button | MotifButton | PMButton | MacButton |
| Menu | MotifMenu | PMMenu | MacMenu |
**AbstractFactory interface:**
```cpp
class GUIFactory {
public:
    virtual ScrollBar* CreateScrollBar() = 0;
    virtual Button* CreateButton() = 0;
    virtual Menu* CreateMenu() = 0;
    virtual ~GUIFactory() = default;  // allows deletion through a base pointer
};
```
**Runtime family selection at startup:**
```cpp
GUIFactory* guiFactory;
const char* styleName = getenv("LOOK_AND_FEEL");  // null when the variable is unset
if (styleName && strcmp(styleName, "Motif") == 0)
    guiFactory = new MotifFactory;
else if (styleName && strcmp(styleName, "Presentation_Manager") == 0)
    guiFactory = new PMFactory;
else
    guiFactory = new DefaultGUIFactory;
// guiFactory passed to all Application subsystems
// No subsystem ever imports or names MotifScrollBar, PMButton, etc.
```
**Output:** Adding Mac support requires a `MacFactory` + three `MacProduct` classes. Zero changes to application logic, layout code, or event handling. A `MotifMenu` can never appear beside a `PMButton` — the factory makes this impossible.
---
### Example 2: MazeFactory — Swappable Maze Components
**Scenario:** A maze game hard-codes `Room`, `Wall`, and `Door` in its `CreateMaze` function. Adding enchanted mazes or bombed mazes requires forking `CreateMaze` for each variant — a maintenance problem.
**Trigger:** "Every new maze type requires duplicating the maze-building logic with different class names."
**AbstractFactory (also acts as ConcreteFactory for the default case):**
```cpp
class MazeFactory {
public:
    virtual Maze* MakeMaze() const { return new Maze; }
    virtual Wall* MakeWall() const { return new Wall; }
    virtual Room* MakeRoom(int n) const { return new Room(n); }
    virtual Door* MakeDoor(Room* r1, Room* r2) const { return new Door(r1, r2); }
};
```
**Client uses only factory calls — no concrete class names:**
```cpp
Maze* MazeGame::CreateMaze(MazeFactory& factory) {
    Maze* aMaze = factory.MakeMaze();
    Room* r1 = factory.MakeRoom(1);
    Room* r2 = factory.MakeRoom(2);
    Door* aDoor = factory.MakeDoor(r1, r2);
    aMaze->AddRoom(r1);
    aMaze->AddRoom(r2);
    r1->SetSide(North, factory.MakeWall());
    r1->SetSide(East, aDoor);
    r1->SetSide(South, factory.MakeWall());
    r1->SetSide(West, factory.MakeWall());
    // ...
    return aMaze;
}
```
**ConcreteFactories override only the methods that differ:**
```cpp
class EnchantedMazeFactory : public MazeFactory {
public:
    Room* MakeRoom(int n) const override {
        return new EnchantedRoom(n, CastSpell());
    }
    Door* MakeDoor(Room* r1, Room* r2) const override {
        return new DoorNeedingSpell(r1, r2);
    }
protected:
    Spell* CastSpell() const;
};

class BombedMazeFactory : public MazeFactory {
public:
    Wall* MakeWall() const override { return new BombedWall; }
    Room* MakeRoom(int n) const override { return new RoomWithABomb(n); }
};
```
**Usage — single factory swap changes the entire maze variant:**
```cpp
MazeGame game;
BombedMazeFactory factory;
game.CreateMaze(factory); // builds a maze with bombs — zero changes to CreateMaze
```
**Output:** `CreateMaze` is written once. Supporting three maze variants requires three factory classes — none of which touch the maze-building algorithm. The Smalltalk prototype-based version creates `EnchantedMazeFactory` without a new class, by initializing a `MazeFactory` with different prototype instances for `#room` and `#door`.
---
### Example 3: Database Abstraction Layer
**Scenario:** A backend application runs tests against SQLite and production against PostgreSQL. Connection, query builder, and transaction objects all vary by database. Currently the database type is conditionally checked in dozens of places.
**Trigger:** "We have `if db == 'postgres'` scattered everywhere — adding a MySQL test environment would require touching every one."
**Product matrix:**
| Product type | PostgresFactory | SQLiteFactory |
|---|---|---|
| Connection | PostgresConnection | SQLiteConnection |
| QueryBuilder | PostgresQueryBuilder | SQLiteQueryBuilder |
| Transaction | PostgresTransaction | SQLiteTransaction |
**AbstractFactory + client:**
```python
from __future__ import annotations  # product types are defined elsewhere

from abc import ABC, abstractmethod


class DatabaseFactory(ABC):
    @abstractmethod
    def create_connection(self) -> Connection: ...

    @abstractmethod
    def create_query_builder(self) -> QueryBuilder: ...

    @abstractmethod
    def create_transaction(self) -> Transaction: ...


# Client receives the factory; it never imports Postgres or SQLite classes
class Repository:
    def __init__(self, factory: DatabaseFactory):
        self._conn = factory.create_connection()
        self._qb = factory.create_query_builder()

    def find_by_id(self, user_id: int):
        query = self._qb.select("users").where("id", user_id).build()
        return self._conn.execute(query)
```
**Test environment wiring:**
```python
import os

# tests/conftest.py
factory = SQLiteFactory(":memory:")

# production entry point
factory = PostgresFactory(os.environ["DATABASE_URL"])
repo = Repository(factory)
```
**Output:** Adding MySQL support is a `MySQLFactory` class and three concrete products. Zero changes to `Repository`, zero changes to test setup beyond adding a new factory option.
---
## Reference Files
| File | Contents |
|------|----------|
| `references/abstract-factory-implementation-guide.md` | Full participants catalog, consequences analysis, prototype-based factory deep dive, extensible factory trade-off analysis, known uses (InterViews, ET++) |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-creational-pattern-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/abstract-factory-implementation-guide.md
# Abstract Factory — Implementation Reference Guide
> Source: Design Patterns: Elements of Reusable Object-Oriented Software (GoF), Chapter 3, pages 87-96. Lexi case study: pages 54-58.
---
## Participants
### AbstractFactory (e.g., `GUIFactory`, `MazeFactory`)
Declares an interface for operations that create each abstract product object. One creation method per product type. Returns AbstractProduct types, never concrete types.
### ConcreteFactory (e.g., `MotifFactory`, `PMFactory`, `EnchantedMazeFactory`)
Implements the creation operations to produce concrete product objects. Exactly one ConcreteFactory per product family. ConcreteFactory classes are the *only* place concrete product class names appear; they are hidden from the client entirely.
### AbstractProduct (e.g., `ScrollBar`, `Button`, `Room`, `Wall`)
Declares an interface for a type of product object. One AbstractProduct per product type. The client codes against this interface exclusively.
### ConcreteProduct (e.g., `MotifScrollBar`, `PMButton`, `EnchantedRoom`)
Defines a product object to be created by the corresponding ConcreteFactory. Implements the AbstractProduct interface.
### Client
Uses only interfaces declared by AbstractFactory and AbstractProduct classes. No concrete class names. No `new ConcreteProduct(...)` calls. Receives the factory reference at initialization — typically via constructor injection or a configuration step at startup.
---
## Consequences — Full Analysis
### Benefit 1: It Isolates Concrete Classes
The factory encapsulates the responsibility and process of creating product objects. It isolates clients from implementation classes. Clients manipulate instances through their abstract interfaces. Concrete product class names are isolated inside the ConcreteFactory — they do not appear in client code at all.
**Practical effect:** You can audit isolation with `grep`. If a file imports or references `MotifScrollBar`, `PostgresConnection`, or any concrete product class, the isolation guarantee is broken.
### Benefit 2: It Makes Exchanging Product Families Easy
A ConcreteFactory appears only once in an application — at its instantiation site. Changing the entire product family requires changing this single line. Because a factory creates a complete product family, the whole family changes atomically.
**Practical effect:** In the Lexi example, switching from Motif to Presentation Manager widgets requires changing `new MotifFactory` to `new PMFactory` at startup. All downstream widget creation automatically uses PM products.
### Benefit 3: It Promotes Consistency Among Products
When product objects in a family are designed to work together, it is important that the application use objects from only one family at a time. The AbstractFactory makes this easy to enforce — a client holding a `MotifFactory` reference can only receive Motif products. Cross-family mixing (a Motif scroll bar with a PM button) is structurally prevented.
**Practical effect:** The consistency guarantee holds *only if* clients obtain all products through the factory. If a client directly instantiates even one concrete product, the guarantee is void for that product.
### Liability: Supporting New Kinds of Products Is Difficult
The AbstractFactory interface fixes the set of products that can be created. Adding a new product type (e.g., adding a `Slider` to `GUIFactory`) requires:
1. Changing the AbstractFactory interface
2. Changing every ConcreteFactory subclass
3. Changing every client that uses the factory (if they need the new product)
This cascades across the entire hierarchy. The pattern is appropriate when the product type set is **stable**. If new product types are expected frequently, evaluate the extensible factory approach (see below) or a different pattern.
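The cascade is easy to observe in Python, where an abstract base class refuses to instantiate subclasses that miss any abstract method. The sketch below is illustrative only: `GUIFactoryV1` and `GUIFactoryV2` stand for the interface before and after a hypothetical `Slider` product is added, and products are reduced to strings.

```python
from abc import ABC, abstractmethod


class GUIFactoryV1(ABC):
    @abstractmethod
    def create_button(self) -> str: ...


class GUIFactoryV2(ABC):
    """Same interface after a hypothetical `Slider` product type is added."""

    @abstractmethod
    def create_button(self) -> str: ...

    @abstractmethod
    def create_slider(self) -> str: ...


def make_motif_factory(base):
    """Build a MotifFactory against `base` that only knows about buttons."""

    class MotifFactory(base):
        def create_button(self) -> str:
            return "MotifButton"

    return MotifFactory


def can_instantiate(factory_class) -> bool:
    try:
        factory_class()
        return True
    except TypeError:  # raised by ABC when abstract methods remain unimplemented
        return False
```

The unchanged `MotifFactory` body works against the first interface and stops working against the second, and the same would be true of every other ConcreteFactory in the system.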
---
## Implementation Concern 1: Factories as Singletons
An application typically needs only one instance of a ConcreteFactory per product family. Multiple instances would allow different parts of the application to hold different factories and silently produce products from different families — the consistency benefit disappears.
**Standard approach:** Apply the Singleton pattern to each ConcreteFactory. The factory's constructor is controlled and the instance is shared.
```cpp
// C++ Singleton factory
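// Not thread-safe as written; `inline static` requires C++17
class MotifFactory : public GUIFactory {
public:
    static MotifFactory* Instance() {
        if (!instance_) instance_ = new MotifFactory;
        return instance_;
    }
    // CreateScrollBar(), CreateButton(), CreateMenu() overrides omitted
private:
    MotifFactory() = default;
    inline static MotifFactory* instance_ = nullptr;
};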
```
**Alternative:** If Singleton feels heavy, pass the factory reference via constructor injection and ensure only one instantiation site exists in the application bootstrap. The injection approach is more testable.
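The testability claim can be illustrated with a test double passed through the constructor. This is a sketch with assumed names: `RecordingFactory`, `Application`, and `bootstrap` are hypothetical and stand in for whatever your composition root looks like.

```python
class RecordingFactory:
    """Test double: records which products the client asked for."""

    def __init__(self):
        self.calls = []

    def create_scroll_bar(self):
        self.calls.append("scroll_bar")
        return "fake scroll bar"

    def create_button(self):
        self.calls.append("button")
        return "fake button"


class Application:
    """Client receives its factory instead of reaching for a global."""

    def __init__(self, factory):
        self._factory = factory

    def build_interface(self):
        return [self._factory.create_scroll_bar(), self._factory.create_button()]


def bootstrap(factory):
    """Hypothetical composition root: the single place a factory is chosen."""
    return Application(factory)
```

With a Singleton accessor baked into `Application`, swapping in `RecordingFactory` would require patching global state; with injection it is an ordinary argument.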
---
## Implementation Concern 2: Creating the Products
AbstractFactory only **declares** an interface for creating products. ConcreteFactory subclasses **define** how products are created. Two main approaches:
### Factory Method per Product (Standard)
Each ConcreteFactory overrides a virtual factory method for each product type. Simple, type-safe, and idiomatic in statically typed languages.
```cpp
class MotifFactory : public GUIFactory {
public:
    ScrollBar* CreateScrollBar() override { return new MotifScrollBar; }
    Button* CreateButton() override { return new MotifButton; }
};
```
**Drawback:** Requires one ConcreteFactory subclass per product family. When many families exist (or families differ only slightly), the number of subclasses grows linearly with family count.
### Prototype-Based Factory
The concrete factory stores a prototypical instance of each product. The `Make` method retrieves and clones the appropriate prototype. A single factory class serves all product families — differentiated only by which prototypes it is initialized with.
**Smalltalk implementation:**
```smalltalk
"Factory stores prototypes in a dictionary keyed by part name"
make: partName
    ^ (partCatalog at: partName) copy

addPart: partTemplate named: partName
    partCatalog at: partName put: partTemplate

"Adding prototypes to the factory"
aFactory addPart: aPrototype named: #ACMEWidget
```
**Creating an EnchantedMazeFactory without subclassing:**
```smalltalk
createMazeFactory
    ^ (MazeFactory new
        addPart: Wall named: #wall;
        addPart: EnchantedRoom named: #room;
        addPart: DoorNeedingSpell named: #door;
        yourself)
```
Compare to a BombedMazeFactory — identical structure, different prototypes:
```smalltalk
createBombedMazeFactory
    ^ (MazeFactory new
        addPart: BombedWall named: #wall;
        addPart: RoomWithABomb named: #room;
        addPart: Door named: #door;
        yourself)
```
**Class-based variant (Smalltalk/Objective-C):** In languages where classes are first-class objects, `partCatalog` stores classes rather than prototypes. The `make:` method sends `new` to the class instead of `copy` to a prototype:
```smalltalk
make: partName
    ^ (partCatalog at: partName) new
```
This is structurally identical to the prototype approach but leverages the language's class system. It takes advantage of language characteristics and is not portable to C++ or Java without additional machinery.
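Python also treats classes as first-class objects, so the class-based variant translates directly. The following sketch uses assumed names (`ClassCatalogFactory`, `make`) and the maze classes from the example.

```python
class Wall:
    pass


class Room:
    pass


class EnchantedRoom(Room):
    pass


class ClassCatalogFactory:
    """Stores classes rather than prototype instances."""

    def __init__(self, **part_classes):
        self._part_catalog = dict(part_classes)

    def make(self, part_name: str, *args, **kwargs):
        # Calling the stored class plays the role of sending `new` in Smalltalk.
        return self._part_catalog[part_name](*args, **kwargs)


plain = ClassCatalogFactory(wall=Wall, room=Room)
enchanted = ClassCatalogFactory(wall=Wall, room=EnchantedRoom)
```

Compared with cloning prototypes, this form constructs a fresh object each time, so any per-instance setup has to live in the class constructor rather than in a preconfigured template.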
**When to prefer prototype over factory-method:**
- Many product families exist (avoids N ConcreteFactory subclasses)
- Families differ only in which concrete products they use
- New families need to be created at runtime by composing existing products
- Language treats classes as first-class objects
---
## Implementation Concern 3: Defining Extensible Factories
AbstractFactory normally defines a different operation for each product type. The product types are encoded in the operation signatures. Adding a new product type requires changing the AbstractFactory interface and all classes that depend on it.
**The extensible alternative:** A single parameterized `Make` operation that accepts an identifier for the kind of object to create.
```cpp
// Standard — type-safe but closed
class GUIFactory {
public:
    virtual ScrollBar* CreateScrollBar() = 0;
    virtual Button* CreateButton() = 0;
};

// Extensible: flexible but loses static type safety
class GUIFactory {
public:
    virtual AbstractWidget* Make(const string& widgetKind) = 0;
};
```
**Use the parameterized form when:**
- New product types are expected to be added (not just new families)
- You are in a dynamically typed language (Python, Ruby, Smalltalk) — the type-safety cost is zero
- All products can reasonably share a common abstract base class
**Avoid the parameterized form when:**
- Static type checking is important (C++, Java, C#)
- Products are heterogeneous in interface (a ScrollBar and a Menu have nothing in common)
- You want the compiler to catch "I forgot to handle this product type"
**The inherent problem with the parameterized form:** All products are returned via the same abstract return type. The client receives an `AbstractWidget*` or `object` and must know which concrete type to cast or coerce to. If the client performs subclass-specific operations on a product, it needs a downcast, and that downcast can fail. This is the classic **extensibility vs. type safety** trade-off: a highly flexible interface with less static safety.
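A checked downcast makes the failure mode explicit. The sketch below is one possible shape, with assumed names: `make` stands in for the parameterized factory operation and `as_scroll_bar` is a hypothetical helper.

```python
class Widget:
    pass


class ScrollBar(Widget):
    def scroll_to(self, position: int) -> int:
        return position


class Menu(Widget):
    def open(self) -> str:
        return "menu open"


def make(widget_kind: str) -> Widget:
    """Parameterized creation: every product comes back typed as Widget."""
    return {"scroll_bar": ScrollBar, "menu": Menu}[widget_kind]()


def as_scroll_bar(widget: Widget) -> ScrollBar:
    """Checked downcast: fail loudly instead of on a later attribute lookup."""
    if not isinstance(widget, ScrollBar):
        raise TypeError(f"expected ScrollBar, got {type(widget).__name__}")
    return widget
```

The check moves the error to the point where the wrong kind was requested, which is usually easier to debug than an attribute error raised later in unrelated code.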
---
## Lexi GUIFactory Case Study — Full Detail
Lexi is a WYSIWYG document editor that must run on Motif, Presentation Manager (PM), and Macintosh. Its user interface is built from widgets — scroll bars, buttons, menus — that must conform to the platform's look-and-feel standard.
**The problem:** Widget creation calls naming concrete classes (`new MotifScrollBar`) scattered across the application code make it impossible to:
- Switch look-and-feel at runtime
- Port to a new platform without auditing the entire codebase
- Test the UI against different look-and-feel standards
**The solution structure:**
```
GUIFactory (AbstractFactory)
    CreateScrollBar() → ScrollBar
    CreateButton() → Button
    CreateMenu() → Menu

MotifFactory : GUIFactory
    CreateScrollBar() → MotifScrollBar
    CreateButton() → MotifButton
    CreateMenu() → MotifMenu

PMFactory : GUIFactory
    CreateScrollBar() → PMScrollBar
    CreateButton() → PMButton
    CreateMenu() → PMMenu

MacFactory : GUIFactory
    CreateScrollBar() → MacScrollBar
    CreateButton() → MacButton
    CreateMenu() → MacMenu
```
Product hierarchies (Figure 2.10 in GoF):
```
ScrollBar (AbstractProduct)
    MotifScrollBar
    PMScrollBar
    MacScrollBar

Button (AbstractProduct)
    MotifButton
    PMButton
    MacButton

Menu (AbstractProduct)
    MotifMenu
    PMMenu
    MacMenu
```
**Factory initialization (pages 57-58):**
```cpp
// Option A: look and feel known at compile time
// GUIFactory* guiFactory = new MotifFactory;

// Option B: look and feel selected at run time via environment variable
GUIFactory* guiFactory;
const char* styleName = getenv("LOOK_AND_FEEL");  // null when the variable is unset
if (styleName && strcmp(styleName, "Motif") == 0) {
    guiFactory = new MotifFactory;
} else if (styleName && strcmp(styleName, "Presentation_Manager") == 0) {
    guiFactory = new PMFactory;
} else {
    guiFactory = new DefaultGUIFactory;
}
```
**Key insight from the Lexi case:** `guiFactory` could be a global, a static member of a well-known class, or a local variable if the entire UI is created within one function. The important constraint is: initialize `guiFactory` *before* it is first used to create widgets, and *after* the desired look and feel is known. The pattern does not mandate a specific storage mechanism — it mandates a specific initialization discipline.
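The initialization discipline can be sketched as follows. This is an illustrative Python analogue, not Lexi's code: the module-level `_gui_factory`, the `init_gui_factory` and `gui_factory` helpers, and the `create_scroll_bar` return strings are all assumptions made for the example.

```python
import os


class GUIFactory:
    """Default factory; subclasses stand for look-and-feel families."""

    name = "default"

    def create_scroll_bar(self) -> str:
        return f"{self.name} scroll bar"


class MotifFactory(GUIFactory):
    name = "Motif"


class PMFactory(GUIFactory):
    name = "Presentation_Manager"


_FACTORIES = {"Motif": MotifFactory, "Presentation_Manager": PMFactory}
_gui_factory = None  # set exactly once, after the look and feel is known


def init_gui_factory(environ=os.environ) -> GUIFactory:
    """Select the factory from LOOK_AND_FEEL, falling back to the default."""
    global _gui_factory
    style = environ.get("LOOK_AND_FEEL", "")
    _gui_factory = _FACTORIES.get(style, GUIFactory)()
    return _gui_factory


def gui_factory() -> GUIFactory:
    """Accessor that enforces 'initialize before first use'."""
    if _gui_factory is None:
        raise RuntimeError("call init_gui_factory() before creating widgets")
    return _gui_factory
```

Routing every widget creation through `gui_factory()` turns a violated ordering into an immediate, explicit error instead of a crash on an uninitialized pointer.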
---
## MazeFactory Case Study — Full Code
The MazeFactory example from GoF demonstrates Abstract Factory on a smaller, concrete domain.
**AbstractFactory (also serves as default ConcreteFactory via non-abstract base):**
```cpp
class MazeFactory {
public:
    MazeFactory();
    virtual Maze* MakeMaze() const { return new Maze; }
    virtual Wall* MakeWall() const { return new Wall; }
    virtual Room* MakeRoom(int n) const { return new Room(n); }
    virtual Door* MakeDoor(Room* r1, Room* r2) const {
        return new Door(r1, r2);
    }
};
```
**Note:** `MazeFactory` is not abstract — it acts as both the AbstractFactory *and* the ConcreteFactory for the default (plain) maze. This is a common simplification for small applications. It means creating a new maze variant requires only overriding the methods that differ — inherited defaults handle the rest.
**Client — factory-mediated creation:**
```cpp
Maze* MazeGame::CreateMaze(MazeFactory& factory) {
    Maze* aMaze = factory.MakeMaze();
    Room* r1 = factory.MakeRoom(1);
    Room* r2 = factory.MakeRoom(2);
    Door* aDoor = factory.MakeDoor(r1, r2);

    aMaze->AddRoom(r1);
    aMaze->AddRoom(r2);

    r1->SetSide(North, factory.MakeWall());
    r1->SetSide(East, aDoor);
    r1->SetSide(South, factory.MakeWall());
    r1->SetSide(West, factory.MakeWall());

    r2->SetSide(North, factory.MakeWall());
    r2->SetSide(East, factory.MakeWall());
    r2->SetSide(South, factory.MakeWall());
    r2->SetSide(West, aDoor);

    return aMaze;
}
```
**EnchantedMazeFactory — overrides room and door creation:**
```cpp
class EnchantedMazeFactory : public MazeFactory {
public:
    EnchantedMazeFactory();
    Room* MakeRoom(int n) const override {
        return new EnchantedRoom(n, CastSpell());
    }
    Door* MakeDoor(Room* r1, Room* r2) const override {
        return new DoorNeedingSpell(r1, r2);
    }
protected:
    Spell* CastSpell() const;
};
```
**BombedMazeFactory — overrides wall and room creation:**
```cpp
class BombedMazeFactory : public MazeFactory {
public:
    Wall* MakeWall() const override { return new BombedWall; }
    Room* MakeRoom(int n) const override { return new RoomWithABomb(n); }
};
```
**Usage:**
```cpp
MazeGame game;
BombedMazeFactory factory;
game.CreateMaze(factory); // all rooms are RoomWithABomb, all walls are BombedWall
// Alternatively
EnchantedMazeFactory enchantedFactory;
game.CreateMaze(enchantedFactory); // all rooms are EnchantedRoom, doors need spells
```
**Smalltalk prototype-based equivalent:**
The Smalltalk version uses a single `MazeFactory` class with a `partCatalog` dictionary, eliminating the `EnchantedMazeFactory` and `BombedMazeFactory` subclasses:
```smalltalk
"createMaze equivalent in Smalltalk"
createMaze: aFactory
| room1 room2 aDoor |
room1 := (aFactory make: #room) number: 1.
room2 := (aFactory make: #room) number: 2.
aDoor := (aFactory make: #door) from: room1 to: room2.
room1 atSide: #north put: (aFactory make: #wall).
room1 atSide: #east put: aDoor.
room1 atSide: #south put: (aFactory make: #wall).
room1 atSide: #west put: (aFactory make: #wall).
room2 atSide: #north put: (aFactory make: #wall).
room2 atSide: #east put: (aFactory make: #wall).
room2 atSide: #south put: (aFactory make: #wall).
room2 atSide: #west put: aDoor.
^ Maze new addRoom: room1; addRoom: room2; yourself
"Default MazeFactory — classes as prototypes"
createMazeFactory
^ (MazeFactory new
addPart: Wall named: #wall;
addPart: Room named: #room;
addPart: Door named: #door;
yourself)
"EnchantedMazeFactory — no new class needed"
createEnchantedMazeFactory
^ (MazeFactory new
addPart: Wall named: #wall;
addPart: EnchantedRoom named: #room;
addPart: DoorNeedingSpell named: #door;
yourself)
```
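For readers less familiar with Smalltalk, the part-catalog idea can be sketched in Python, where classes are also first-class objects. This is an approximation under stated assumptions: the method names `add_part` and `make` mirror the Smalltalk selectors, and constructor arguments are passed through `make` rather than set afterwards as the Smalltalk version does.

```python
class Wall:
    pass


class Room:
    def __init__(self, number):
        self.number = number


class EnchantedRoom(Room):
    pass


class Door:
    def __init__(self, room1, room2):
        self.rooms = (room1, room2)


class DoorNeedingSpell(Door):
    pass


class MazeFactory:
    """Single factory class configured with a catalog of part classes."""

    def __init__(self):
        self._part_catalog = {}

    def add_part(self, part_class, name):
        self._part_catalog[name] = part_class
        return self  # allow chained configuration

    def make(self, name, *args):
        # Instantiate whichever class is registered under this name.
        return self._part_catalog[name](*args)


def create_enchanted_maze_factory():
    # A new product family is a new catalog, not a new factory subclass.
    return (MazeFactory()
            .add_part(Wall, "wall")
            .add_part(EnchantedRoom, "room")
            .add_part(DoorNeedingSpell, "door"))
```

The trade-off is the one noted for parameterized factories: a misspelled part name is only detected at run time, as a `KeyError`.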
---
## Known Uses
**InterViews** uses the "Kit" suffix to denote AbstractFactory classes. `WidgetKit` and `DialogKit` are abstract factories for generating look-and-feel-specific user interface objects. `LayoutKit` generates different composition objects depending on document orientation (portrait or landscape).
**ET++** uses Abstract Factory to achieve portability across window systems (X Windows and SunView). The `WindowSystem` abstract base class defines the interface for creating objects representing window system resources (`MakeWindow`, `MakeFont`, `MakeColor`). Concrete subclasses implement the interfaces for a specific window system. At runtime, ET++ creates an instance of a concrete `WindowSystem` subclass.
---
## Related Patterns
- **Factory Method** — AbstractFactory classes are often implemented with factory methods. The ConcreteFactory is a collection of factory methods, one per product type.
- **Prototype** — AbstractFactory classes can also be implemented using the Prototype pattern. The factory stores prototype instances and clones them to create new products. Eliminates the need for a ConcreteFactory subclass per family.
- **Singleton** — A ConcreteFactory is often a Singleton. An application typically needs only one instance per product family.
Select and implement a layered security testing strategy for a codebase: design unit tests for security properties (boundary conditions, negative inputs, acc...
---
name: security-testing-strategy
description: |
Select and implement a layered security testing strategy for a codebase: design unit tests for security properties (boundary conditions, negative inputs, access control invariants), set up integration testing with non-production seed data (avoiding the production data copy anti-pattern), choose and configure dynamic analysis sanitizers (AddressSanitizer for memory corruption, ThreadSanitizer for race conditions, MemorySanitizer for uninitialized reads — with their performance cost tradeoff accounted for in CI/CD scheduling), plan fuzz testing (write effective fuzz drivers using libFuzzer/AFL, apply dictionary inputs for structured formats like HTTP/SQL/JSON, build a seed corpus, integrate continuous fuzzing via ClusterFuzz or OSS-Fuzz), and integrate static analysis at the right depth for each development stage (linters in the IDE commit loop, abstract interpretation nightly, formal methods for safety-critical paths). Use when creating a security testing plan for a new service, setting up fuzz testing for a parser or protocol implementation, integrating static analysis into a CI/CD pipeline, adding sanitizer-enhanced nightly builds, or auditing coverage gaps found during secure-code-review. Produces a security testing strategy document with tool selection rationale, CI/CD integration plan, and coverage priorities derived from code review findings.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/building-secure-and-reliable-systems/skills/security-testing-strategy
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- secure-code-review
source-books:
- id: building-secure-and-reliable-systems
title: "Building Secure and Reliable Systems"
authors: ["Heather Adkins", "Betsy Beyer", "Paul Blankinship", "Piotr Lewandowski", "Ana Oprea", "Adam Stubblefield"]
chapters: [13]
tags:
- security
- testing
- fuzzing
- static-analysis
- dynamic-analysis
- ci-cd
execution:
tier: 3
mode: full
inputs:
- type: document
description: "Secure code review findings report (output of secure-code-review) — drives testing priorities and tool selection"
- type: codebase
description: "Source code directory or architecture description of the system under test — determines applicable sanitizers, fuzz targets, and static analysis tools"
- type: document
description: "Existing CI/CD pipeline description (optional) — determines where to integrate each testing layer"
tools-required: [Read, Grep]
tools-optional: [Bash, Write]
mcps-required: []
environment: "Run inside the repository root or against an architecture description. Grep identifies candidate fuzz targets (parsers, protocol handlers, codec implementations) and existing test infrastructure. Write produces the strategy document."
discovery:
goal: "Produce a security testing strategy document: layered tool selection (unit → integration → dynamic analysis → fuzzing → static analysis), CI/CD integration plan, seed corpus and fuzz driver guidance, and coverage priorities derived from code review findings"
tasks:
- "Map secure-code-review findings to testing gaps — each finding class has a corresponding testing layer that can prevent future regressions"
- "Identify unit testing priorities: security boundary conditions, negative inputs, access control invariants, capacity overflow scenarios"
- "Audit integration test environment setup for the production data copy anti-pattern"
- "Select and configure sanitizers (ASan/UBSan/TSan/MSan/LSan) matched to language and bug classes; schedule in CI/CD accounting for performance cost"
- "Identify fuzz targets: parsers, decoders, protocol handlers, compression implementations"
- "Plan fuzz driver design: avoid nondeterminism, slow I/O, intentional crashes, and integrity checks that block fuzzer progress"
- "Select seed corpus sources and apply dictionary-based input for structured formats"
- "Select static analysis depth per development stage: linters at commit, abstract interpretation nightly or at code review, formal methods for safety-critical paths"
- "Integrate ClusterFuzz or OSS-Fuzz for continuous fuzzing if the project is open source or has sustained infrastructure"
audience:
roles: ["security-engineer", "software-engineer", "site-reliability-engineer", "tech-lead"]
experience: "intermediate — assumes familiarity with CI/CD pipelines, basic compiler toolchains, and at least one unit testing framework"
triggers:
- "Security testing plan needed for a new service or module"
- "Set up fuzz testing for a parser, decoder, or protocol implementation"
- "Integrate static analysis into a CI/CD pipeline"
- "Add sanitizer-enhanced nightly builds to catch memory corruption"
- "Coverage gaps identified during secure-code-review need a testing plan to prevent regressions"
- "Preparing a service for security review before launch"
not_for:
- "Threat modeling or adversary profiling — use adversary-profiling-and-threat-modeling"
- "Reviewing existing code for vulnerabilities — use secure-code-review"
- "Deployment pipeline security — use secure-deployment-pipeline"
- "Designing access control policy — use least-privilege-access-design"
---
## When to Use
Use this skill when you need to select and integrate the right testing layers to make a codebase resilient against the inputs it encounters — including inputs an attacker controls.
Invoke it for:
- New services or modules that lack a security testing plan
- Systems that parse external input (files, network protocols, API requests) and have no fuzz testing in place
- Codebases where secure-code-review found vulnerability classes (injection, memory corruption, race conditions) that need automated test coverage to prevent regression
- CI/CD pipelines that run only unit tests — missing dynamic analysis, sanitizers, and static analysis
- Open source projects eligible for OSS-Fuzz integration
Start by reading the secure-code-review findings report. The anti-patterns found there drive testing priorities: SQL injection findings mean unit tests for parameterized query enforcement; memory corruption patterns mean AddressSanitizer integration; concurrent access patterns mean ThreadSanitizer.
Do not use this skill for threat modeling, access control design, or reviewing existing code for vulnerabilities — those are separate skills.
---
## Context and Input Gathering
Before building the strategy, establish:
1. **Language and runtime**: What language(s) is the codebase in? (Determines available sanitizers and static analysis tools. ASan/TSan/MSan are C/C++ compiler-instrumentation tools; Go has go-fuzz and race detector; Java has Error Prone.)
2. **Input surface**: Where does external input enter the system — HTTP request bodies, file uploads, network packets, deserialized data? These are the fuzz targets.
3. **Existing test infrastructure**: What unit and integration test frameworks are already in use? What CI/CD system? (Determines integration points.)
4. **Secure-code-review findings**: Which vulnerability classes were identified? These determine which sanitizers and test priorities are highest value.
5. **Safety-criticality level**: Is this avionics, medical device, or cryptographic infrastructure? (Determines whether formal methods are warranted.)
6. **Infrastructure budget**: Can the project sustain continuous fuzzing VMs (ClusterFuzz), or is periodic fuzzing more appropriate?
---
## Process
### Step 1 — Map Code Review Findings to Testing Gaps
**WHY**: Testing priorities should be driven by known vulnerability classes, not coverage metrics alone. Each class of finding from secure-code-review has a testing layer that can detect it automatically and prevent future regressions. Without this mapping, teams add tests indiscriminately and miss the highest-risk paths.
For each finding class from the secure-code-review report, map it to a testing layer:
| Finding Class | Primary Testing Layer | Secondary Layer |
|---|---|---|
| SQL injection / parameterized query gaps | Unit tests verifying parameterized query enforcement | Integration tests with a QA database instance |
| XSS / unescaped HTML rendering | Unit tests with malicious input payloads (`<script>`, `javascript:`) | Fuzzing of template rendering paths |
| Memory corruption (buffer overflows, use-after-free) | AddressSanitizer + fuzzing of affected parsers | Nightly sanitizer-enhanced CI builds |
| Race conditions | ThreadSanitizer on concurrent code paths | Integration tests with parallelism stress |
| Uninitialized memory reads | MemorySanitizer | AddressSanitizer (catches many cases) |
| Integer overflow / undefined behavior | UndefinedBehaviorSanitizer | Fuzz testing with edge-case inputs |
| Injection into parsers / decoders | Fuzz testing with a seed corpus + dictionary | ASan + fuzz driver |
| Auth bypass from deep nesting | Unit tests explicitly testing each nested branch | Mutation testing |
Produce a prioritized list of testing gaps to address, ordered by severity of the original finding.
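A minimal sketch of this mapping step, assuming findings arrive as simple records. The `Finding` fields, the numeric severity scale, and the class keys are hypothetical; only a few rows of the table above are encoded.

```python
from dataclasses import dataclass

# Subset of the table above: finding class -> (primary layer, secondary layer)
TESTING_LAYERS = {
    "sql_injection": ("unit tests for parameterized queries",
                      "integration tests with a QA database"),
    "memory_corruption": ("AddressSanitizer + fuzzing",
                          "nightly sanitizer builds"),
    "race_condition": ("ThreadSanitizer", "parallelism stress tests"),
    "uninitialized_read": ("MemorySanitizer", "AddressSanitizer"),
}


@dataclass(frozen=True)
class Finding:
    finding_class: str
    severity: int  # hypothetical scale: higher means more severe
    location: str


def prioritize(findings):
    """Return (location, primary, secondary) tuples, most severe first."""
    gaps = []
    for finding in sorted(findings, key=lambda f: f.severity, reverse=True):
        # Unknown classes are flagged for manual triage instead of dropped.
        primary, secondary = TESTING_LAYERS.get(
            finding.finding_class, ("manual triage", "manual triage"))
        gaps.append((finding.location, primary, secondary))
    return gaps
```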
### Step 2 — Design Unit Tests for Security Properties
**WHY**: Unit tests are the fastest feedback loop — they run locally and in every CI commit. For security, they must go beyond happy-path validation to exercise the boundary conditions, malformed inputs, and access control invariants that are most likely to fail under adversarial use. Tests written alongside security reviews capture the cases the engineer checked manually and prevent those cases from regressing as the codebase evolves.
Design unit tests for these security-relevant categories:
**Boundary and overflow conditions**: For any function that accepts a size, quota, or numeric parameter, test values near the limits of the variable's type. A quota management system, for example, should be tested with requests involving negative byte values, values that would overflow the integer type used to represent them, and transfers that bring the total quota to exactly the limit.
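As an illustration of these boundary cases, the sketch below pairs a hypothetical quota check with the inputs worth testing. `request_quota`, `QuotaError`, and the 64-bit ceiling are assumptions for the example, not part of any real quota system.

```python
INT64_MAX = 2**63 - 1  # assumed storage type for byte counts


class QuotaError(ValueError):
    pass


def request_quota(current_bytes, requested_bytes, limit_bytes):
    """Hypothetical quota check that returns the new total or raises."""
    if requested_bytes < 0:
        raise QuotaError("negative request")
    total = current_bytes + requested_bytes
    if total > INT64_MAX:
        raise QuotaError("total does not fit in a 64-bit signed integer")
    if total > limit_bytes:
        raise QuotaError("over quota")
    return total


def rejected(current_bytes, requested_bytes, limit_bytes):
    """Helper for tests: True when the request is refused."""
    try:
        request_quota(current_bytes, requested_bytes, limit_bytes)
    except QuotaError:
        return True
    return False


# Boundary cases: (current, requested, limit, expect_rejected)
BOUNDARY_CASES = [
    (0, -1, 100, True),                  # negative byte value
    (INT64_MAX, 1, INT64_MAX, True),     # would overflow the integer type
    (60, 40, 100, False),                # lands exactly on the limit
    (60, 41, 100, True),                 # one byte past the limit
]
```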
**Malicious and malformed inputs**: For every entry point that receives user-controlled data, write tests with inputs that represent known attack patterns:
- SQL injection strings: `' OR username='admin`
- XSS payloads: `<script>alert(1)</script>`, `javascript:void(0)`
- Path traversal: `../../etc/passwd`
- Null bytes, extremely long strings, binary data
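A sketch of how such inputs might be exercised against a path-handling entry point. `resolve_upload_path` and the `/srv/uploads` base directory are hypothetical, and the check is purely lexical (it does not resolve symlinks), so treat it as an illustration of the test shape rather than a complete defense.

```python
import posixpath

BASE_DIR = "/srv/uploads"  # hypothetical upload root


def resolve_upload_path(base_dir, user_path):
    """Join user_path onto base_dir and refuse results outside base_dir."""
    if "\x00" in user_path:
        raise ValueError("null byte in path")
    candidate = posixpath.normpath(posixpath.join(base_dir, user_path))
    if candidate != base_dir and not candidate.startswith(base_dir + "/"):
        raise ValueError("path escapes the upload directory")
    return candidate


def is_rejected(user_path):
    try:
        resolve_upload_path(BASE_DIR, user_path)
    except ValueError:
        return True
    return False


ATTACK_INPUTS = [
    "../../etc/passwd",   # relative traversal
    "/etc/passwd",        # absolute path replaces the base on join
    "ok/../../secret",    # traversal hidden behind a valid prefix
    "name\x00.txt",       # embedded null byte
]
```

In a test suite, each entry of `ATTACK_INPUTS` becomes one parameterized case asserting rejection, alongside at least one benign path asserting acceptance.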
**Access control invariants**: For systems with permission models, write parameterized tests that exercise the Cartesian product of roles and actions:
```python
# Parameterized test: run the same assertion across multiple (role, action) pairs
# to verify access control without duplicating boilerplate.
# Assumes pytest; `service` stands for the system under test and is not defined here.
import pytest

@pytest.mark.parametrize("role, action, expected_allowed", [
    ("billing_admin", "request_quota", True),    # should succeed
    ("standard_user", "request_quota", False),   # should be denied
    ("billing_admin", "delete_cluster", True),
    ("standard_user", "delete_cluster", False),
])
def test_quota_access_control(role, action, expected_allowed):
    assert service.can_perform(role, action) == expected_allowed
```
**Error message safety**: Verify that error responses do not leak internal state, stack traces, or system information that would assist an attacker.
**When to write these tests**: Write them immediately after writing the security-relevant code, so the cases the engineer checked manually are captured. In organizations with code review, a peer reviewer should verify that the tests are robust — checking that new tests can't trivially pass if the security check is removed (mutation testing criterion).
### Step 3 — Audit Integration Test Environment Setup
**WHY**: Integration tests exercise real dependencies — databases, network services, external APIs — and therefore expose real data to anyone who can run the tests. Copying production databases into test environments is a named anti-pattern: production data contains sensitive personal information that will be accessible to all test runners and visible in test logs, violating the principle of least privilege and creating a data breach risk from the test infrastructure itself. Seeded non-production data also makes tests hermetic — the environment starts from a known clean state, which eliminates a major source of flakiness.
**The anti-pattern to identify and eliminate**:
```
# ANTI-PATTERN: production database mirrored to test environment
test_db = copy_production_database(prod_db) # exposes PII to test runners
integration_test(test_db)
```
**The correct pattern**:
```
# CORRECT: seed the test environment with non-sensitive synthetic data
test_db = create_empty_database()
seed_with_synthetic_data(test_db, [
{"team": "imaginary_team_1", "quota_bytes": 1024},
{"team": "imaginary_team_2", "quota_bytes": 512},
])
integration_test(test_db)
```
Audit each integration test setup and look for:
- Direct connections to production databases or services
- Data import scripts that pull from production backups
- Environment variables that point to production credentials
For each such case, replace with synthetic seed data. The seed data should be:
- Representative of the real-world scenarios the tests exercise (enough variety to exercise security boundary conditions)
- Free of any real personal data, credentials, or business-sensitive values
- Checked into the repository alongside the tests, so the environment can always be rebuilt to a known state
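One way to realize the seeded environment is an in-memory SQLite database rebuilt for every test run. This is a sketch under assumptions: the `quotas` table and its columns are invented for the example, and a real service would seed whatever datastore it actually uses.

```python
import sqlite3

# Synthetic seed rows, kept in the repository next to the tests.
SEED_ROWS = [
    ("imaginary_team_1", 1024),
    ("imaginary_team_2", 512),
]


def create_seeded_test_db():
    """Build a fresh database from a known state; no production data involved."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE quotas (team TEXT PRIMARY KEY, quota_bytes INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO quotas (team, quota_bytes) VALUES (?, ?)", SEED_ROWS
    )
    conn.commit()
    return conn
```

Because every test gets its own connection from `create_seeded_test_db()`, tests cannot leak state into each other, which is the hermeticity benefit noted above.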
### Step 4 — Select and Configure Sanitizers
**WHY**: Memory corruption bugs — buffer overflows, use-after-free, uninitialized reads, race conditions — are the root cause of the most severe security vulnerabilities (including Heartbleed). They are invisible to code review and to unit tests that don't trigger the exact input sequence that causes the corruption. Sanitizers work by instrumenting the compiled binary with additional instructions that detect these conditions at runtime, with precise diagnostic output pointing to the exact line of code responsible. They are the most effective automated tool for finding this class of bug before it reaches production.
**The Google Sanitizers suite** (LLVM/GCC-backed, C/C++ primary but some support other languages):
| Sanitizer | Flag | What It Detects | Use Case |
|---|---|---|---|
| AddressSanitizer (ASan) | `-fsanitize=address` | Out-of-bounds accesses, use-after-free, heap/stack/global buffer overflows | All C/C++ codebases |
| UndefinedBehaviorSanitizer (UBSan) | `-fsanitize=undefined` | Signed integer overflow, null pointer dereference, misaligned access | Combined with ASan |
| ThreadSanitizer (TSan) | `-fsanitize=thread` | Race conditions in concurrent code | Multi-threaded services |
| MemorySanitizer (MSan) | `-fsanitize=memory` | Reads from uninitialized memory | Security-sensitive data handling |
| LeakSanitizer (LSan) | `-fsanitize=leak` | Memory leaks and resource leaks | Long-running services |
**Performance cost — the core tradeoff**:
Sanitizer-instrumented binaries run several times slower than native binaries. ASan typically adds about a 2x slowdown and MSan about 3x; TSan is costlier, typically 5x to 15x. This makes it impractical to run sanitizer builds on every commit in the same pipeline as fast unit tests. The established practice is:
- **Fast unit and integration tests**: Run on every commit, no sanitizers — native build speed
- **Sanitizer-enhanced pipelines**: Run nightly (or on a separate slower CI tier), with ASan + UBSan as the baseline combination
- **Fuzz testing**: Always pair with ASan — the fuzzer generates inputs, ASan detects the memory corruption they trigger
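The scheduling above can be captured as data that a CI script consults. The pipeline names below are hypothetical, and the separate `nightly_threads` tier is an assumption added because Clang does not allow `-fsanitize=address`, `-fsanitize=thread`, and `-fsanitize=memory` to be combined in one build.

```python
SANITIZER_FLAGS = {
    "asan": "-fsanitize=address",
    "ubsan": "-fsanitize=undefined",
    "tsan": "-fsanitize=thread",
    "msan": "-fsanitize=memory",
}

# Hypothetical pipeline tiers mapped to the sanitizers they enable.
PIPELINE_SANITIZERS = {
    "per_commit": [],                 # native speed, no instrumentation
    "nightly": ["asan", "ubsan"],     # baseline combination
    "nightly_threads": ["tsan"],      # separate build for race detection
    "fuzzing": ["asan"],              # fuzzers are paired with ASan
}

# Pairs that cannot share one instrumented binary.
INCOMPATIBLE = [{"asan", "tsan"}, {"asan", "msan"}, {"tsan", "msan"}]


def compiler_flags(pipeline):
    """Return the extra compiler flags for a pipeline tier."""
    names = PIPELINE_SANITIZERS[pipeline]
    for pair in INCOMPATIBLE:
        if pair <= set(names):
            raise ValueError(f"cannot combine {sorted(pair)} in one build")
    flags = [SANITIZER_FLAGS[name] for name in names]
    if flags:
        flags.append("-fno-omit-frame-pointer")  # readable stack traces
    return flags
```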
**CI/CD integration**:
```bash
# .bazelrc: define an ASan build configuration
build:asan --copt -fsanitize=address
build:asan --linkopt -fsanitize=address
build:asan --copt -fno-omit-frame-pointer
build:asan --copt -O1 --copt -g

# Shell: run nightly in CI with Clang as the compiler
$ CC=clang CXX=clang++ bazel build --config=asan //...
$ CC=clang CXX=clang++ bazel test --config=asan //...
```
For Go, use the `-race` flag for race condition detection (the Go race detector is built on the TSan runtime). For Python, use Valgrind for C extensions, or memory profilers for Python-native code.
**ASan example output** (shows precisely where the violation occurred and where the memory was originally allocated and freed):
```
ERROR: AddressSanitizer: heap-use-after-free on address 0x...
READ of size 1 at ...
#0 in main use-after-free.c:5:10
SUMMARY: AddressSanitizer: heap-use-after-free use-after-free.c:5:10 in main
```
When a sanitizer flags a "known safe" function (for example, a signed integer overflow that is intentionally discarded and has no security consequence), annotate it explicitly: `__attribute__((no_sanitize("undefined")))`. Add annotations only after careful review — they suppress real detections and can create false negatives. Document the reason in a comment.
### Step 5 — Plan Fuzz Testing
**WHY**: Fuzz testing finds bugs that no human would think to test for: unexpected input combinations that trigger crashes, memory corruption, or denial-of-service conditions. File parsers, network protocol handlers, compression codecs, and audio/video decoders are the highest-value fuzz targets because they process complex, adversary-controlled inputs that can reach many code paths. Heartbleed (CVE-2014-0160), the OpenSSL vulnerability that let attackers read server memory and so leak secrets such as TLS private keys and session cookies, was reproducible through fuzzing with the right driver and ASan enabled.
**How fuzz engines work**:
A fuzz engine (fuzzer) generates large numbers of candidate inputs and passes them through a fuzz driver to the fuzz target (the code under test). Modern fuzz engines use compiler instrumentation to observe which code paths each input exercises. When an input triggers a new code path, the engine preserves it in a seed corpus and uses it to generate future mutations. This coverage-guided approach is far more effective than "dumb fuzzing" (random bytes from a random number generator), which is unlikely to pass input validation and reach the interesting code.
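The coverage-guided loop can be illustrated with a toy fuzzer. Everything here is a simplification: `target` reports its own coverage as a set of labels (real engines get this from compiler instrumentation), and the mutation strategy is a single random byte replacement.

```python
import random


def target(data: bytes) -> set:
    """Toy fuzz target that reports which branches an input reached."""
    covered = {"entry"}
    if len(data) > 0 and data[0] == ord("F"):
        covered.add("F")
        if len(data) > 1 and data[1] == ord("U"):
            covered.add("FU")
            if len(data) > 2 and data[2] == ord("Z"):
                covered.add("FUZ")  # the deep path where a bug might live
    return covered


def fuzz(seed_input: bytes, iterations: int, rng_seed: int = 0):
    """Mutate corpus entries and keep only inputs that reach new coverage.

    seed_input must be non-empty, since a byte position is chosen at random.
    """
    rng = random.Random(rng_seed)  # fixed seed keeps runs reproducible
    corpus = [seed_input]
    coverage = set(target(seed_input))
    for _ in range(iterations):
        candidate = bytearray(rng.choice(corpus))
        candidate[rng.randrange(len(candidate))] = rng.randrange(256)
        covered = target(bytes(candidate))
        if not covered <= coverage:  # reached at least one new branch
            coverage |= covered
            corpus.append(bytes(candidate))
    return corpus, coverage
```

Each retained input becomes a stepping stone toward the next nested check. A dumb fuzzer would need to guess all three bytes at once, while this loop only ever needs to guess one byte beyond what the corpus already covers.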
**Identify fuzz targets**: Search the codebase for:
```bash
# Grep for common fuzz target signatures (run from the repository root)
grep -rnE 'ReadJpeg|ParsePng|DecodeAudio|ParseHttp|ReadXml|ParseJson' .
grep -rnE 'Unmarshal|Deserialize|Decode|ParseProto' .
grep -rnE 'compress|decompress|inflate|deflate' .
```
Each function that accepts a byte sequence from an external source and interprets it as a structured format is a candidate fuzz target.
**Writing effective fuzz drivers** (libFuzzer/AFL compatible):
The libFuzzer entry point is a single function:
```cpp
#include <cstddef>
#include <cstdint>

// MyLibrary::ParseInput stands in for the code under test.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    // Pass data to the fuzz target. Return 0 always.
    MyLibrary::ParseInput(data, size);
    return 0;
}
```
Avoid these patterns in fuzz drivers — they prevent the fuzzer from running quickly and reproducing crashes:
- **Nondeterministic behavior**: Random number generators, timestamp-dependent logic, thread scheduling dependencies. The fuzzer cannot reproduce crashes if the same input produces different execution paths.
- **Slow operations**: Console logging, disk I/O, network calls. Use a memory-based filesystem or a "fuzzer-friendly" build flag (`-DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION`) to disable slow paths.
- **Intentional crashes**: The fuzzer cannot distinguish an intentional crash from a bug.
- **Integrity checks (CRCs, message digests)**: A fuzzer is unlikely to generate a valid checksum. Skip integrity checks in fuzzer builds, or the fuzzer will never reach the parsing code past the checksum verification.
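The integrity-check pitfall is easy to see in a small sketch. The frame layout (a 4-byte big-endian CRC32 followed by the payload) and the `fuzzing_build` switch are assumptions made for illustration; in C or C++ the switch would normally be a compile-time macro.

```python
import struct
import zlib


def parse_frame(frame: bytes, fuzzing_build: bool = False) -> bytes:
    """Validate and unwrap a frame: 4-byte big-endian CRC32, then payload."""
    if len(frame) < 4:
        raise ValueError("frame too short")
    (expected_crc,) = struct.unpack(">I", frame[:4])
    payload = frame[4:]
    # A mutated input almost never carries a matching checksum, so fuzzer
    # builds skip this check to let inputs reach the code behind it.
    if not fuzzing_build and zlib.crc32(payload) != expected_crc:
        raise ValueError("checksum mismatch")
    return payload  # real parsing of the payload would continue here


def make_frame(payload: bytes) -> bytes:
    """Helper for tests: wrap a payload with a valid checksum."""
    return struct.pack(">I", zlib.crc32(payload)) + payload
```

The production default stays strict; only the fuzzing build passes `fuzzing_build=True`, so the relaxation never ships.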
**Dictionary-based input for structured formats**:
When the fuzz target processes a well-specified format — HTTP, SQL, JSON, XML, a binary protocol — provide a dictionary of keywords from the format's grammar. Without a dictionary, the fuzzer generates inputs that are rejected at the token-recognition stage, never reaching the parsing logic where bugs live:
```
# HTTP dictionary example (http.dict)
# Non-printable bytes are written as \xNN escapes (CR LF is \x0d\x0a)
"GET "
"POST "
"HTTP/1.1\x0d\x0a"
"Content-Length: "
"Transfer-Encoding: chunked"
```
Fuzz engines like Peach Fuzzer can also accept a formal grammar definition of the input format, generating structurally valid inputs that violate specific field relationships — more targeted than a keyword dictionary.
**Seed corpus**: Start with real-world representative inputs. For an image library, use a small JPEG file. For a JSON parser, use a representative JSON document. A single real-world seed can increase code coverage by 10% or more compared to starting from an empty corpus, because the fuzzer begins exploring mutations from valid inputs rather than building structure from scratch. Sources for seed corpora:
- Existing test suite sample files
- OSS-Fuzz and The Fuzzing Project published corpora for common formats
- Hand-curated inputs covering edge cases identified in the code review
**Input splitting for multi-mode targets**: If the fuzz target's behavior depends on a configuration mode, split the input: use the first N bytes to select the mode, and the remaining bytes as the payload. This allows a single driver to explore all modes without the fuzzer having to independently discover the mode-selection encoding.
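A minimal sketch of input splitting, with N fixed at one byte. The mode names are hypothetical; the modulo keeps every possible first byte mapped to a valid mode so no fuzzer input is wasted.

```python
MODES = ("strict", "lenient", "legacy")  # hypothetical configuration modes


def split_fuzz_input(data: bytes):
    """Use the first byte to pick a mode and the rest as the payload."""
    if not data:
        return MODES[0], b""
    mode = MODES[data[0] % len(MODES)]
    return mode, data[1:]
```

A driver would call `split_fuzz_input(data)` first and then invoke the target with the selected mode and the remaining payload.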
**Comparing implementations**: When migrating from library A to library B, a fuzz driver can pass the same input to both and report any output difference as a crash. This differential fuzzing approach finds subtle behavioral changes that standard tests miss.
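Differential fuzzing can be sketched with two small integer parsers standing in for library A and library B. Both functions are invented for the example; the point is that the driver treats any disagreement as a failure.

```python
def parse_old(data: bytes):
    """Stand-in for library A: accepts ASCII digits only."""
    if not data or any(b < 0x30 or b > 0x39 for b in data):
        return None
    value = 0
    for b in data:
        value = value * 10 + (b - 0x30)
    return value


def parse_new(data: bytes):
    """Stand-in for library B: delegates to int(), which is more permissive."""
    try:
        return int(data)
    except ValueError:
        return None


def differential_check(data: bytes) -> None:
    """Fuzz driver body: fail loudly when the implementations disagree."""
    old, new = parse_old(data), parse_new(data)
    if old != new:
        raise AssertionError(f"mismatch for {data!r}: old={old!r} new={new!r}")
```

Here `differential_check(b" 12")` fails because `int()` tolerates surrounding whitespace while the older parser rejects it, the kind of subtle behavioral change a migration can introduce.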
### Step 6 — Set Up Continuous Fuzzing
**WHY**: Fuzzing cannot run to completion — fuzz engines generate inputs indefinitely. A one-time local fuzzing session finds some bugs but misses regressions introduced by later changes. Continuous fuzzing integrates the fuzz engine with the CI/CD build pipeline, so fuzzers automatically run against the latest code, accumulate a growing corpus of interesting inputs, and file bugs when crashes are detected — without manual intervention. OSS-Fuzz found over 1,000 bugs within 5 months of its December 2016 launch and has since found tens of thousands more.
**ClusterFuzz** (open source, Google):
- Manages pools of virtual machines running fuzzing tasks
- Corpus management, crash deduplication, lifecycle management
- Web interface for fuzzer metrics (executions per second, code coverage, crash rate)
- Supports libFuzzer and AFL
- Does not build fuzzers — expects the project to push fuzz driver binaries via a continuous build pipeline
**OSS-Fuzz**:
- Combines ClusterFuzz infrastructure with distributed Google Cloud execution
- Open to open source projects — integration requires writing fuzz drivers and a Dockerfile
- Reports vulnerabilities directly to developers within hours of detecting them
- Projects already integrated benefit from both Google's internal fuzzing tools and external fuzzing tools reviewing the same code
**Integration checklist**:
- [ ] Fuzz driver builds cleanly with ASan + libFuzzer flags
- [ ] Seed corpus committed to repository or uploaded to cloud storage bucket
- [ ] Build pipeline pushes fuzzer binary to ClusterFuzz/OSS-Fuzz bucket on every main branch commit
- [ ] Crash alerts route to the team's bug tracker
- [ ] Saved crash samples are checked into the test suite as unit test cases after fixes, to prevent regression
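The last checklist item can be sketched as a replay helper that runs every saved crash sample through the target. The `parse_input` target, its length-prefixed format, and the `crash-000N` file names are hypothetical.

```python
import pathlib
import tempfile


def parse_input(data: bytes) -> int:
    """Hypothetical target: first byte is a length, the rest is the payload."""
    if not data:
        raise ValueError("empty input")
    length = data[0]
    if length > len(data) - 1:
        raise ValueError("declared length exceeds payload")
    return sum(data[1:1 + length])


def replay_crash_samples(directory, target):
    """Run each saved sample; report anything other than a clean rejection."""
    failures = []
    for path in sorted(pathlib.Path(directory).iterdir()):
        if not path.is_file():
            continue
        try:
            target(path.read_bytes())
        except ValueError:
            pass  # rejecting malformed input cleanly is the expected outcome
        except Exception as exc:
            failures.append((path.name, repr(exc)))
    return failures


def demo(target=parse_input):
    """Write two sample files to a temp directory and replay them."""
    with tempfile.TemporaryDirectory() as tmp:
        pathlib.Path(tmp, "crash-0001").write_bytes(b"\xff\x01")
        pathlib.Path(tmp, "crash-0002").write_bytes(b"")
        return replay_crash_samples(tmp, target)
```

A unit test then asserts that the returned list is empty, so a regression that reintroduces a crash shows up in the fast per-commit suite.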
### Step 7 — Integrate Static Analysis at the Right Depth
**WHY**: Static analysis can detect potential bugs before code is executed — ideally before it is checked in. But different static analysis techniques operate at different depths and costs, and the right tool depends on where in the development cycle it runs. Running a slow, deep analyzer on every keystroke creates friction and developer fatigue; running only a fast linter nightly misses the window when the developer still understands the code they just wrote.
**The depth-vs-cost spectrum**:
| Technique | Depth | Cost | Integration Point | Examples |
|---|---|---|---|---|
| Linters / AST pattern matching | Shallow: syntactic patterns, usage rules | Very fast (~compile time) | IDE, pre-commit hook, every commit | Clang-Tidy (C/C++), Error Prone (Java), `go vet` (Go), Pylint (Python) |
| Abstract interpretation (deep static analysis) | Semantic: control flow, data flow, cross-function | Slow (minutes to hours) | Nightly, code review on changed code only | Frama-C (C), Infer (Java, C, C++, Objective-C), AbsInt |
| Formal methods | Proof: mathematically rigorous property verification | Very high up-front cost | Safety-critical paths only; pre-deployment release analysis | TLA+, Coq, specialized domain tools |
Statically verifying arbitrary programs is undecidable: by Rice's theorem, no algorithm can decide for every program whether it satisfies a given nontrivial behavioral property. All static analyzers therefore make tradeoffs between false positives (warnings about code that is actually safe) and false negatives (missing real bugs). Tool selection and integration point determine which tradeoff is acceptable.
**Linters at the commit loop**:
Linters perform syntactic analysis and AST pattern matching. They run fast (roughly compile time), catch common bug patterns and style violations, and integrate well into IDEs and code review systems. Target a false-positive rate of 10% or lower — developers learn to ignore higher-noise tools.
Key linters by language:
- **C/C++**: Clang-Tidy (custom checks, auto-fix capability), Error Prone-style checks via compiler warnings
- **Java**: Error Prone (type-safe bug patterns, auto-fix; 162 authors had submitted 733 checks as of 2018)
- **Go**: `go vet` (suspicious constructs), staticcheck
- **Python**: Pylint, Bandit (security-focused)
Google's Tricorder platform runs ~146 analyzers covering 30+ languages during code review, surfacing results inline in the review diff. The platform measures user-perceived false-positive rate by tracking "Not useful" clicks and disables individual checks that exceed 10% noise. Adopt a similar feedback loop: track which findings developers dismiss and tune the ruleset accordingly.
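Such a feedback loop might be computed as below. This is a sketch, not Tricorder's implementation: the feedback record shape and the `min_reports` cutoff (which avoids judging a check on a handful of clicks) are assumptions.

```python
from collections import Counter


def noisy_checks(feedback, threshold=0.10, min_reports=20):
    """Return checks whose "not useful" rate exceeds the threshold.

    feedback is an iterable of (check_name, was_useful) pairs, one per
    finding that a developer rated.
    """
    shown = Counter()
    not_useful = Counter()
    for check, was_useful in feedback:
        shown[check] += 1
        if not was_useful:
            not_useful[check] += 1
    return sorted(
        check for check in shown
        if shown[check] >= min_reports
        and not_useful[check] / shown[check] > threshold
    )
```

Checks returned by `noisy_checks` are candidates for tuning or disabling; the rest stay enabled.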
**Abstract interpretation nightly (deep static analysis)**:
Abstract interpretation tools reason about program semantics — data flow, control flow, heap shape — often across function call boundaries. They can find buffer overflows, null pointer dereferences, division by zero, and dangling pointers that linters miss. The cost is analysis time measured in minutes to hours for large codebases.
Integration pattern:
- Run nightly against the full codebase (not on every commit)
- At code review: run in differential mode — analyze only the changed code, reusing previously computed analysis facts for unchanged code
- Route results to the code author during review, not as blocking CI failures (the analysis takes too long to block a commit)
Tools:
- **Frama-C**: C programs (buffer overflows, segmentation faults, dangling pointers, division by zero)
- **Infer**: Java, C, C++, and Objective-C (memory and pointer bugs, dangling pointers)
- **App Security Improvement (ASI)** program: Google Play Store; performs interprocedural analysis on every Android app upload and had led to over 1 million app fixes as of early 2019
**Formal methods for safety-critical paths**:
Formal methods require specifying system properties in a mathematically rigorous way and then verifying them. The upfront cost is high — engineers must learn specification languages and the approach requires a priori description of system requirements. Reserve formal methods for:
- Safety-critical code where a defect has physical consequences (flight control, medical devices)
- Cryptographic protocol implementations (continuous formal verification of TLS protocols)
- Hardware design (standard practice via EDA vendor tools)
**Developer workflow integration**:
The key principle from Google's experience: surface findings at the moment the developer is most able to act on them — when they are writing or reviewing the code, not days later. A developer who gets a null-pointer warning during a code review can fix it immediately. The same warning in a weekly report competes with other priorities.
Practical integration:
1. Add linters to pre-commit hooks and IDE plugins (zero friction for developers who already run the linter)
2. Surface linter results inline in code review diffs (Tricorder pattern — present findings alongside the changed lines)
3. Run deep analysis (Infer, Frama-C) nightly and route findings to the code author or owning team, not as commit blockers
4. For security-critical paths identified in threat modeling, evaluate formal methods on a case-by-case basis
### Step 8 — Produce the Security Testing Strategy Document
**WHY**: A strategy document makes the testing plan explicit, reviewable, and executable by the team. It prevents the common failure mode where each engineer adds tests independently without a shared understanding of which vulnerability classes need coverage or how the testing layers interact.
Produce a document with the following sections:
```
## Security Testing Strategy: [service/codebase name]
**Date**: [date]
**Based on**: secure-code-review findings for [scope], [date]
### Coverage Priorities (from code review findings)
[Priority 1]: [Finding class] → [Testing layer] → [Specific tool/test design]
[Priority 2]: ...
### Unit Testing Plan
- Security boundary tests: [list specific functions/modules and input classes to test]
- Access control invariants: [parameterized test design for role/action matrix]
- Malformed input tests: [entry points and attack payloads to exercise]
### Integration Testing Environment
- Seed data plan: [what synthetic data to create, who creates it]
- Production data locations to remediate: [list any current production data copies]
### Dynamic Analysis (Sanitizers)
- Language: [language]
- Baseline sanitizers: [ASan + UBSan / -race / Valgrind — per language]
- CI/CD integration: [nightly build config, runtime estimate]
- Additional sanitizers for specific findings: [TSan if race conditions found, MSan if uninitialized reads found]
### Fuzz Testing Plan
- Fuzz targets identified: [list: function name, input format, codebase location]
- Fuzz engine: [libFuzzer / AFL / Honggfuzz — or combination]
- Seed corpus sources: [existing test files, OSS-Fuzz corpora, hand-curated]
- Dictionary files needed: [list formats requiring dictionary input]
- Continuous fuzzing: [ClusterFuzz / OSS-Fuzz / periodic local — with rationale]
### Static Analysis Plan
- Commit/PR: [linter(s) and configuration]
- Nightly: [deep analysis tool if applicable]
- Safety-critical paths: [formal methods evaluation — yes/no with rationale]
### Integration Timeline
[Week 1]: [Immediate actions — highest-priority unit tests, linter setup]
[Week 2-3]: [Sanitizer nightly builds, integration test remediation]
[Month 2]: [Fuzz drivers written and integrated, continuous fuzzing started]
[Ongoing]: [OSS-Fuzz / ClusterFuzz monitoring, static analysis tuning]
```
---
## Key Principles
**Testing strategies have different cost-benefit profiles — use all layers.** Unit tests provide fast feedback but cannot find memory corruption or unexpected input combinations. Sanitizers catch memory bugs that unit tests miss. Fuzzing finds edge cases no human would consider. Static analysis catches bugs before execution. Each layer is complementary, not a substitute for the others.
**Fuzz testing is most effective when paired with sanitizers.** A fuzzer that generates inputs without ASan enabled can only detect bugs that cause the program to crash or exit. With ASan, the fuzzer detects invalid memory accesses that would otherwise be silent — allowing it to find vulnerabilities like Heartbleed that don't immediately crash the process.
**Never copy production data to test environments.** Production databases contain sensitive personal information that becomes accessible to all test runners and visible in test logs. Seed test environments with synthetic, non-sensitive data. This also eliminates a major source of test flakiness — the environment always starts from a known clean state.
**Performance cost of sanitizers requires scheduling discipline.** Sanitizer-instrumented binaries run noticeably slower than native builds: roughly 2x for ASan, around 5-15x for TSan, and more for tools like Valgrind. Run sanitizer pipelines nightly (not on every commit) to catch bugs without blocking developer velocity. Bugs found nightly are still caught far earlier than bugs found in production.
**Code review findings drive testing priorities.** Security testing should not be designed in isolation. Anti-patterns identified during code review — memory corruption patterns, injection risks, concurrency issues — directly determine which sanitizers to enable, which fuzz targets to prioritize, and which unit tests to write first.
**Static analysis should surface findings when developers can act on them.** Findings surfaced during code review — when the engineer understands the code they just wrote — have high fix rates. Findings in a weekly report compete with other priorities. Integrate linters into the commit loop and code review diff; run deeper analysis nightly and route results to the author.
**Fuzzer crashes become regression tests.** When a fuzzer finds a crash, save the crashing input and add it as a unit test case after fixing the bug. This prevents the same vulnerability from being reintroduced in future refactoring.
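A minimal sketch of the crash-to-regression-test pattern follows. The parser (`parse_record`), its length-prefixed format, and the corpus file name are hypothetical stand-ins; in a real project the replay function would call the actual code under test and read inputs from a checked-in crash corpus directory.

```python
import pathlib
import tempfile


def parse_record(data: bytes) -> dict:
    """Hypothetical parser under test: 1-byte length prefix, then payload."""
    if not data:
        raise ValueError("empty input")
    length = data[0]
    payload = data[1:1 + length]
    if len(payload) != length:
        # The fix for the fuzzer-found bug: reject instead of over-reading.
        raise ValueError("truncated payload")
    return {"length": length, "payload": payload}


def replay_crash_corpus(corpus_dir: pathlib.Path) -> list:
    """Feed every saved crash input to the parser; return names that still fail.

    A clean rejection (ValueError) counts as a pass. Any other exception
    means the bug has been reintroduced.
    """
    failures = []
    for path in sorted(corpus_dir.iterdir()):
        try:
            parse_record(path.read_bytes())
        except ValueError:
            pass  # malformed input rejected cleanly
        except Exception:
            failures.append(path.name)
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        corpus = pathlib.Path(tmp)
        # Stand-in for an input the fuzzer saved: claims 255 bytes, carries 2.
        (corpus / "crash-truncated-length").write_bytes(b"\xff\x01\x02")
        assert replay_crash_corpus(corpus) == []
```

Keeping the crashing inputs as files rather than inlining them in test code means the same directory can also be handed back to the fuzzer as seed corpus.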
---
## Examples
### Example 1 — Fuzz Driver for a JPEG Decoder (C++, libFuzzer + ASan)
**Scenario**: A service processes JPEG images uploaded by users. The decoder library handles arbitrary user input. The secure-code-review found no injection issues, but the library is written in C++ and processes complex binary input — a strong candidate for fuzzing.
**Build configuration** (Bazel `.bazelrc`):
```
build:asan --copt -fsanitize=address --linkopt -fsanitize=address
build:asan --copt -fno-omit-frame-pointer --copt -O1 --copt -g -c dbg
```
**Fuzz driver** (libFuzzer entry point compatible with Honggfuzz and AFL):
```cpp
#include <cstddef>
#include <cstdint>
#include "jpeg_data_decoder.h"
#include "jpeg_data_reader.h"
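// libFuzzer calls this once per generated input.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t sz) {
  knusperli::JPEGData jpg;
  // Parse only the header; ASan reports any invalid memory access.
  knusperli::ReadJpeg(data, sz, knusperli::JPEG_READ_HEADER, &jpg);
  return 0;  // Non-zero return values are reserved by libFuzzer.
}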
```
**Running the fuzzer** (5-minute session, empty starting corpus):
```bash
$ mkdir synthetic_corpus
$ ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-6.0/bin/llvm-symbolizer \
bazel-bin/fuzzer \
-max_total_time=300 -print_final_stats synthetic_corpus/
```
**Improving coverage**: Add a real JPEG file to the seed corpus. A single representative seed can improve code coverage by 10%+ because the fuzzer begins mutating valid inputs rather than building JPEG structure from scratch.
**Input splitting for multi-mode targets**: If the decoder has multiple read modes (header-only, tables, full), use the first 64 bytes of input to select the mode rather than hardcoding one mode per driver — a single driver then exercises all code paths.
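The splitting logic can be sketched independently of any fuzz engine. The mode names and the use of CRC32 as the selector hash are assumptions for illustration; a C++ driver would do the same thing inside `LLVMFuzzerTestOneInput` before calling the decoder.

```python
import zlib

HEADER_SIZE = 64  # bytes reserved for mode selection, per the text above
READ_MODES = ("header", "tables", "all")  # hypothetical mode names


def split_fuzz_input(data: bytes):
    """Split raw fuzzer bytes into (read_mode, payload).

    Returns None when the input is too short to carry both a selector
    and a payload, so the driver can skip it.
    """
    if len(data) <= HEADER_SIZE:
        return None
    selector, payload = data[:HEADER_SIZE], data[HEADER_SIZE:]
    # crc32 is stable across runs; Python's built-in hash() of bytes is
    # randomized per process, which would make crashes non-reproducible.
    mode = READ_MODES[zlib.crc32(selector) % len(READ_MODES)]
    return mode, payload


if __name__ == "__main__":
    fake_input = bytes(range(64)) + b"\xff\xd8\xff\xe0"  # selector + JPEG-like bytes
    print(split_fuzz_input(fake_input))
```

The selector must be a deterministic function of the input bytes: a crash is only useful if replaying the saved input takes the same path through the driver.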
### Example 2 — Sanitizer Nightly Build Integration
**Scenario**: A Go gRPC service processes concurrent requests. ThreadSanitizer was not previously enabled. The secure-code-review identified shared state accessed by multiple goroutines.
**CI configuration** (GitHub Actions nightly):
```yaml
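name: nightly-sanitizer
on:
  schedule:
    - cron: '0 2 * * *'  # 02:00 UTC every day
jobs:
  race-detector:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: stable
      - name: Build and test with race detector
        run: go test -race ./...
      - name: Run integration tests with race detector
        run: go test -race -tags=integration ./...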
```
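**What the race detector catches that tests miss**: A race condition where two goroutines concurrently read and write a shared token store without synchronization. The race is timing-dependent, so ordinary unit tests pass nearly every time. Go's race detector, which is built on the ThreadSanitizer runtime, instruments memory accesses and reports the unsynchronized pair whenever both accesses execute during a run, even if that particular interleaving produced the correct result. It only sees code paths the tests actually exercise, so coverage of concurrent paths still matters.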
### Example 3 — Static Analysis Integration at Code Review
**Scenario**: A Java service performs string comparisons incorrectly. Error Prone is integrated into the Bazel build.
**Error Prone detection** (compile-time, no runtime overhead):
```java
// Buggy code — Error Prone catches this at build time:
public boolean foo() {
  return getString() == "foo"; // reference equality, not value equality
}
```
**Error Prone output during build** (surfaces in code review diff):
```
ERROR: String comparison using reference equality instead of value equality
  return getString() == "foo";
                     ^
[StringEquality] see http://errorprone.info/bugpattern/StringEquality
Suggested fix: return getString().equals("foo");
```
The developer sees the finding inline in their code review and can apply the one-click fix. This is faster and more reliable than a weekly static analysis report.
---
## References
- [sanitizer-configuration-guide.md](../../references/sanitizer-configuration-guide.md) — ASan, TSan, MSan, UBSan, LSan: compiler flags, Bazel/CMake integration, shadow memory mechanics, performance overhead tables, "known safe" annotation governance
- [fuzz-driver-patterns.md](../../references/fuzz-driver-patterns.md) — libFuzzer/AFL/Honggfuzz entry points, seed corpus curation, dictionary format, FuzzedDataProvider for input splitting, fuzzer-friendly build flags, differential fuzzing pattern
- [continuous-fuzzing-setup.md](../../references/continuous-fuzzing-setup.md) — ClusterFuzz and OSS-Fuzz integration: Dockerfile requirements, build pipeline configuration, corpus management, crash deduplication, bug tracker routing
- [static-analysis-depth-spectrum.md](../../references/static-analysis-depth-spectrum.md) — Full depth-vs-cost spectrum: linters (Clang-Tidy, Error Prone, GoVet, Pylint) → abstract interpretation (Frama-C, Infer, ASI) → formal methods. Developer workflow integration patterns from Google Tricorder.
Cross-references:
- `secure-code-review` — findings from code review drive the testing priorities in Step 1; anti-patterns found there determine which sanitizers and fuzz targets are highest value
- `secure-deployment-pipeline` — testing strategy outputs (sanitizer builds, fuzz driver binaries) integrate with the deployment pipeline
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Building Secure and Reliable Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-secure-code-review`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
Review a system design for security and reliability tradeoffs before implementation begins. Use when: evaluating an architecture proposal or design document...
---
name: security-reliability-design-review
description: |
Review a system design for security and reliability tradeoffs before implementation begins. Use when: evaluating an architecture proposal or design document and need to identify where security and reliability requirements conflict with feature or cost requirements; auditing a proposed design to determine whether security and reliability are designed in from the start or likely to require expensive retrofitting later; deciding whether to build payment processing or sensitive data handling in-house versus delegating to a third-party provider; assessing whether a microservices framework or platform incorporates security and reliability by construction rather than by convention; or producing a design review report for a security review, production readiness review, or architecture decision record. Applies the emergent property test (security and reliability cannot be bolted on — they must arise from the whole design), the initial-versus-sustained-velocity model (early neglect accelerates to later slowdown), and the Google design document evaluation checklist covering scalability, redundancy, dependency, data integrity, SLA, and security/privacy considerations. Produces: a design review report with identified tensions between feature, security, and reliability requirements; tradeoff recommendations; and prioritized security/reliability improvements. Distinct from threat modeling (which focuses on adversary scenarios) and code review (which audits existing implementation). Applicable at any stage where design decisions are still open.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/building-secure-and-reliable-systems/skills/security-reliability-design-review
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on: []
source-books:
- id: building-secure-and-reliable-systems
title: "Building Secure and Reliable Systems"
authors: ["Heather Adkins", "Betsy Beyer", "Paul Blankinship", "Piotr Lewandowski", "Ana Oprea", "Adam Stubblefield"]
chapters: [4]
tags:
- security
- reliability
- design-review
- tradeoffs
- requirements-analysis
- nonfunctional-requirements
- emergent-properties
- build-vs-buy
- microservices
- velocity
execution:
tier: 2
mode: full
inputs:
- type: document
description: "System design document, architecture proposal, design doc draft, or structured description of the system being reviewed — including its primary feature requirements, data it will handle, and any known constraints"
- type: context
description: "Optional: existing security policies, SLA commitments, compliance requirements (PCI-DSS, GDPR, HIPAA), or third-party integration plans that constrain the design"
tools-required: [Read, Write]
tools-optional: [Grep]
mcps-required: []
environment: "Runs in conversation or project context. Works from a design document or description provided by the user. Produces a structured design review report."
discovery:
goal: "Produce a design review report: identified requirement tensions, emergent property assessment, tradeoff recommendations, Google design doc checklist responses, and prioritized security/reliability improvements"
tasks:
- "Classify requirements into feature (functional) and nonfunctional (security, reliability) categories"
- "Apply the emergent property test to assess whether security and reliability are designed in or bolted on"
- "Run the Google design document evaluation checklist"
- "Analyze build-vs-buy tradeoffs for any sensitive data handling or third-party dependencies"
- "Assess whether platform or framework choices provide security-by-construction"
- "Apply the initial-vs-sustained-velocity model to evaluate deferral risk"
- "Produce a prioritized improvement list with tradeoff recommendations"
audience:
roles: ["software-engineer", "architect", "tech-lead", "site-reliability-engineer", "security-engineer", "engineering-manager"]
experience: "intermediate — assumes familiarity with software design but not necessarily with formal security or reliability engineering"
triggers:
- "User has a design document or architecture proposal and wants a security and reliability review"
- "User is deciding whether to handle sensitive data in-house or delegate to a third party"
- "User wants to assess whether their design has security or reliability requirements baked in or bolted on"
- "User is preparing a design doc for a production readiness review or security review"
- "User wants to understand the long-term velocity implications of deferring security and reliability work"
- "User is choosing a microservices framework or platform and wants to know if it provides security-by-construction"
not_for:
- "Identifying specific vulnerabilities in existing code — use a secure code review process"
- "Mapping attacker profiles and threat scenarios — use adversary profiling and threat modeling"
- "Reviewing code in a pull request — use a code review checklist"
- "Planning incident response — use an incident response skill"
---
# Security and Reliability Design Review
## When to Use
You are helping a user evaluate, write, or improve a system design — a document or proposal that describes what a system will do, how it will be structured, and what constraints it will operate under. The design review is happening before implementation is locked in.
This skill examines the design from the perspective of security and reliability as emergent properties: properties that cannot be added to a finished design but must be considered throughout the design process, because they arise from the totality of architectural decisions — component decomposition, communication patterns, data handling, dependency structure, monitoring, and testing practices.
The output is a structured design review report, not a list of security warnings. The goal is to help the user understand where feature requirements, security requirements, and reliability requirements are in tension, and to recommend tradeoffs that satisfy all three at reasonable cost.
---
## Context & Input Gathering
### Required Context (must have — ask if missing)
**1. What is the system's primary purpose and what are its critical feature requirements?**
Why: Critical feature requirements are the subset of feature requirements without which the system has no viable product. They set the baseline against which security and reliability tradeoffs must be weighed — you cannot recommend eliminating a critical feature requirement to satisfy a security goal. Distinguishing critical requirements from other feature requirements also reveals where tradeoffs are genuinely available versus where they are not.
- Check prompt or environment for: design document, user stories, use cases, product requirements document, README describing what the system does
- If missing, ask: "What does this system do and what are the features that absolutely must work for the product to be viable? For example: 'Users must be able to purchase items and complete checkout — the payment flow is a critical requirement.'"
**2. What data does the system handle and who has access to it?**
Why: The security requirements for a system are primarily determined by the sensitivity of the data it handles and who can access it. A system that processes payment card data is in scope for PCI-DSS compliance and faces regulatory obligations. A system holding personally identifiable information faces GDPR or equivalent requirements. A system with sensitive user communications attracts adversaries with different motivations and capabilities than a system holding only public data. The data profile sets the floor for security investment.
- Check prompt or environment for: data classification, privacy policy, compliance requirements, what user data is stored, whether payment or health data is involved
- If missing, ask: "What types of data will this system store or process? For example: user accounts with PII, payment card data, health records, financial transactions, or only non-sensitive application data?"
**3. What are the system's availability and reliability commitments?**
Why: Reliability requirements determine what the system must do when things go wrong — how it handles component failures, dependency outages, data loss, and surges in load. These commitments directly shape architectural decisions about redundancy, fallback mechanisms, and data integrity. A system with no stated SLA has made an implicit reliability decision that will constrain its options when it fails in production.
- Check prompt or environment for: service level objectives (SLOs), uptime requirements, SLA commitments, business impact of downtime
- If missing, ask: "What are the availability requirements for this system? For example: 99.9% uptime, sub-100ms p99 latency, or 'downtime is acceptable during off-hours.' What is the business impact if the system is unavailable for an hour? A day?"
### Optional Context (enriches the review)
- **Third-party dependencies:** Any external services, APIs, or libraries the system will integrate with — each is a reliability and security dependency.
- **Team size and operational model:** Smaller teams face tighter constraints on the security and reliability complexity they can maintain. A two-person team cannot staff a 24/7 on-call rotation.
- **Phase or stage:** Early-stage design vs. pre-launch review vs. post-incident redesign changes the emphasis of the review.
- **Regulatory or compliance requirements:** PCI-DSS, GDPR, HIPAA, SOC 2 each impose specific security and reliability obligations that must appear in the design.
---
## Core Principles
### The Emergent Property Test
Security and reliability are not modules or features — they are emergent properties of the entire system design. There is no `--enable_high_reliability_mode` flag. No single component in a system's source code "implements" security or reliability.
Reliability emerges from: how the service decomposes into components; how component availability relates to dependency availability; what communication mechanisms are used (RPCs, message queues, event buses); how load balancing, load shedding, and routing are implemented; how testing is integrated into the development and deployment workflow; and what monitoring and alerting infrastructure exists to detect anomalies.
Security posture emerges from: how the system decomposes into subcomponents and the trust relationships between them; the implementation languages, platforms, and frameworks used; how security design reviews, security testing, and code review are integrated into the workflow; and the forms of security monitoring, audit logging, and anomaly detection available to responders.
**Test to apply:** Can a well-specified module or flag be added to the design after the fact to satisfy the security or reliability requirement? If yes — it is a feature requirement, not an emergent property requirement. If no — it is an emergent property requirement that must be addressed in the design now, not retrofitted later. Apply this test to each identified security and reliability concern to determine whether it can be deferred or must be resolved in the current design.
### The Cost of Retrofitting
Design choices related to security and reliability are often similar in nature to foundational architectural choices like whether to use a relational or NoSQL database, or whether to use a monolithic or microservices architecture. Retrofitting these choices onto an existing system often requires significant design changes, major refactoring, or even partial rewrites, and can become very expensive and time-consuming.
Furthermore, such changes might have to be made under time pressure in response to a security or reliability incident — and making invasive late-stage changes to a deployed system in a hurry comes with significant risk of introducing additional flaws. The review should identify which requirements, if deferred, will impose the highest retrofitting cost.
### Initial Velocity Versus Sustained Velocity
Teams commonly justify ignoring security and reliability as early and primary design drivers in the name of "velocity" — the concern that addressing these requirements will slow down initial development. The correct distinction is between initial velocity and sustained velocity.
Choosing not to account for security, reliability, and maintainability early in the project cycle may indeed increase initial velocity. However, it also usually slows the project down significantly later. The late-stage cost of retrofitting a design to accommodate emergent property requirements can be very substantial. Furthermore, making invasive late-stage changes to address security and reliability risks can itself introduce additional security and reliability risks.
Evaluate each deferred security or reliability requirement against this model: What is the realistic late-stage cost of addressing this after the system is deployed? Is there a plausible remediation path that does not require significant redesign under time pressure?
---
## Process
### Step 1 — Classify Requirements
Identify and classify all stated and implied requirements in the design document.
**Why:** Security and reliability requirements have fundamentally different characteristics from feature requirements. Feature requirements map directly to specific code — a user story for profile editing maps to a handler, structured types, and a UI. Security and reliability requirements are emergent — there is no specific module in the codebase that "implements" reliability. Conflating the two categories leads to underinvestment in security and reliability because they appear to be addressed when they are not.
Produce a requirements table with three columns:
| Requirement | Category | Critical? |
|---|---|---|
| Users can purchase items | Feature | Yes — critical |
| Support team can view order history | Feature | No |
| All PII encrypted at rest | Security (nonfunctional) | Yes |
| Service available 99.9% per month | Reliability (nonfunctional) | Yes |
| Checkout completes in <2s at p99 | Reliability (nonfunctional) | No |
**Categories:**
- **Feature / functional:** Describe specific behaviors a user can observe. Express as use cases, user stories, or user journeys. Identify which are critical (failure means no viable product).
- **Security (nonfunctional):** Define the exclusive circumstances under which sensitive data may be accessed, what protections are required for that data, and what compliance standards apply.
- **Reliability (nonfunctional):** Define SLOs for uptime, latency, and error rates; how the system responds to dependency failures; and data integrity and recovery requirements.
- **Development efficiency:** How implementation language, frameworks, and build processes affect the team's ability to iterate.
- **Deployment velocity:** How long from code complete to user-visible feature.
Surface any implied requirements that are not stated — particularly nonfunctional ones that are typically assumed (e.g., "we won't lose user data" is a reliability requirement even if unstated).
**Output of this step:** A requirements classification table with all stated and implied requirements, their category, and whether they are critical.
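One way to keep the classification consistent across reviews is to record each requirement as structured data and render the table from it. The sketch below is illustrative only: the `Requirement` type, the `implied` flag, and `render_table` are hypothetical names, and the category set simply mirrors the list above.

```python
from dataclasses import dataclass

CATEGORIES = {
    "Feature",
    "Security (nonfunctional)",
    "Reliability (nonfunctional)",
    "Development efficiency",
    "Deployment velocity",
}


@dataclass(frozen=True)
class Requirement:
    text: str
    category: str
    critical: bool
    implied: bool = False  # True when surfaced by the reviewer, not stated in the doc


def render_table(requirements):
    """Render requirements as a Markdown table, rejecting unknown categories."""
    rows = ["| Requirement | Category | Critical? |", "|---|---|---|"]
    for r in requirements:
        if r.category not in CATEGORIES:
            raise ValueError(f"unknown category: {r.category}")
        label = r.text + (" (implied)" if r.implied else "")
        rows.append(f"| {label} | {r.category} | {'Yes' if r.critical else 'No'} |")
    return "\n".join(rows)


if __name__ == "__main__":
    print(render_table([
        Requirement("Users can purchase items", "Feature", True),
        Requirement("No user data is lost", "Reliability (nonfunctional)", True, implied=True),
    ]))
```

Marking implied requirements explicitly makes it easy to confirm them with the design's author rather than silently treating them as agreed.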
---
### Step 2 — Apply the Google Design Document Checklist
Run the Google design document evaluation questions against the design. Each question should be answered with the design's current response and an assessment of whether the response is adequate.
**Why:** Google uses this design document template to guide new feature design and to collect feedback from stakeholders before starting an engineering project. The reliability and security sections remind teams to think about the implications of their project and kick off production readiness or security review processes early — sometimes multiple quarters before engineers officially start thinking about the launch stage. These questions surface the requirements that teams most commonly neglect until they become incidents.
#### Scalability
- How does the system scale? Consider both data size increase and traffic increase.
- What are the current hardware or infrastructure constraints? Adding more resources might take much longer than expected, or might be too expensive for the project. What initial resources will be needed? Plan for high utilization, but be aware that using more resources than needed will block expansion.
#### Redundancy and Reliability
- How will the system handle local data loss and transient errors (e.g., temporary outages), and how does each affect the system?
- Which systems or components require data backup? How is the data backed up? How is it restored? What happens between the time data is lost and the time it is restored?
- In the case of a partial loss, can the system keep serving? Can only missing portions of backups be restored to the serving data store?
#### Dependency Considerations
- What happens if dependencies on other services are unavailable for a period of time?
- Which services must be running for the application to start? (Do not forget subtle dependencies like DNS resolution or time synchronization.)
- Are there dependency cycles — systems that block on each other in ways that prevent either from starting independently?
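If the design lists startup dependencies explicitly, the cycle question can be answered mechanically. The following is a sketch using a depth-first search; the service names are made up, and the input format (a dict mapping each service to the services it needs at startup) is an assumption.

```python
def find_dependency_cycle(deps):
    """Return one cycle as a list of service names, or None if there is none.

    `deps` maps each service to the services that must be running before
    it can start.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(service):
        state[service] = GRAY
        path.append(service)
        for dep in deps.get(service, ()):
            if state.get(dep, WHITE) == GRAY:
                # dep is already on the current path: we found a cycle.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[service] = BLACK
        return None

    for service in deps:
        if state.get(service, WHITE) == WHITE:
            cycle = visit(service)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    startup_deps = {
        "checkout": ["auth", "dns"],
        "auth": ["config"],
        "config": ["auth"],  # config needs auth to fetch secrets
        "dns": [],
    }
    print(find_dependency_cycle(startup_deps))  # ['auth', 'config', 'auth']
```

Include the subtle dependencies (DNS, time synchronization, configuration and secret stores) in the map; cycles through those are the ones most often missed in a design document.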
#### Data Integrity
- How will data corruption or loss in data stores be detected?
- What sources of data loss are covered? (User error, application bugs, storage platform bugs, site/replica disasters.)
- How long will it take to notice each type of loss? What is the recovery plan for each?
#### SLA Requirements
- What mechanisms are in place for auditing and monitoring the service level guarantees of the application?
- How can the stated level of reliability be guaranteed?
#### Security and Privacy
- What are the potential attacks relevant to this design, and what is the worst-case impact of each attack? What countermeasures are in place to prevent or mitigate each?
- What are the known vulnerabilities or potentially insecure dependencies?
- If the application has no security or privacy considerations, explicitly state so and explain why.
For each question, record: the design's current answer (or "not addressed"), and a gap assessment (Adequate / Needs attention / Not addressed).
**Output of this step:** A completed checklist table with current design response and gap assessment for each question.
---
### Step 3 — Identify Requirement Tensions
Map the tensions between feature requirements and nonfunctional requirements identified in Steps 1 and 2.
**Why:** Because security and reliability requirements are emergent properties, they interact with feature requirements and with each other in non-obvious ways. The classic example from payment processing illustrates this: attempting to mitigate a security risk (by outsourcing payment handling) introduces a reliability risk (third-party service dependency); attempting to mitigate that reliability risk (by adding a message queue fallback) reintroduces the security risk (sensitive data on disk). Tensions compound. Identifying them explicitly prevents the team from solving one tension while inadvertently creating another.
For each identified tension, produce a tension entry:
```
Tension: [short name]
Feature requirement affected: [the user story or functional requirement at risk]
Security/reliability concern: [the nonfunctional requirement in conflict]
How they conflict: [one paragraph describing the interaction]
Current design response: [how the design currently resolves or ignores this tension]
Assessment: [Resolved / Partially resolved / Unresolved]
```
Common tension patterns to look for:
- **Availability vs. data minimization:** Reliability mechanisms like message queues and local caching require data to be stored locally or in fallback systems, which may conflict with requirements to not hold sensitive data at rest.
- **Third-party dependency for data minimization vs. reliability:** Delegating sensitive data handling to a third party reduces in-house security obligation but adds a dependency failure mode. Mitigating that reliability risk often reintroduces the security concern.
- **Developer velocity vs. security controls:** Frameworks and languages that enforce security properties add constraints on what developers can do — the tradeoff is reduced developer freedom for reduced vulnerability surface.
- **Complexity vs. testability:** Resilience mechanisms (fallback paths, queues, retries) introduce code paths exercised only in failure scenarios. These paths are less tested and more likely to harbor bugs.
- **Cost of redundancy vs. availability commitment:** Meeting a 99.99% SLA may require geographic redundancy and hot standby systems — costs that must be weighed against the SLA commitment's actual business value.
**Output of this step:** A tensions table with one entry per identified tension, including the current design response and resolution status.
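To make the cost-of-redundancy tension concrete, it can help to translate an availability target into a downtime budget before weighing it against business value. The sketch below is a minimal illustration, assuming a 365-day measurement period and an SLA defined as a simple uptime fraction; the function name `allowed_downtime_minutes` is hypothetical and not part of any referenced template.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)


# Compare a few common targets over an assumed 365-day year.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes/year")
```

Under these assumptions, a 99.99% target leaves about 52.6 minutes of downtime per year, which is the figure to set against the cost of geographic redundancy and hot standby systems.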
---
### Step 4 — Evaluate Build-vs-Buy for Sensitive Data Handling
For any component that handles sensitive data (PII, payment card data, health records, credentials), evaluate whether to build the handling in-house or delegate to a third-party service provider.
**Why:** Often, the best way to mitigate security concerns about sensitive data is to not hold that data in the first place. Third-party service providers can be arranged so that sensitive data never passes through your systems, or at least so that your systems do not persistently store it. This is a design-level tradeoff with significant downstream security, reliability, and cost implications — and it must be decided at design time, not retrofitted later.
For each sensitive data component, assess the following:
**Benefits of third-party delegation:**
- Your systems no longer hold the sensitive data, reducing the risk that a vulnerability in your systems results in a data compromise.
- Compliance obligations (e.g., PCI-DSS scope for payment card data) may be significantly simplified.
- You do not need to build and maintain infrastructure to protect the data at rest in your systems.
- Many third-party providers offer countermeasures against fraud and risk assessment services that would be expensive to build in-house.
**Costs and risks of third-party delegation:**
- The provider will charge fees. Beyond a certain transaction volume, in-house processing may be more cost-effective.
- Engineering cost of integrating with and maintaining the vendor API — including tracking API changes on the vendor's schedule.
- Your user story "user can complete purchase" now fails if the payment provider's service is down or unreachable. This is an additional dependency failure mode outside your control.
- You are entrusting sensitive customer data to a third-party vendor. You must carefully evaluate the vendor's security stance during selection and on an ongoing basis.
- Integrating with a vendor-supplied library introduces the risk that a vulnerability in that library, or its transitive dependencies, results in a vulnerability in your systems. Prefer vendors who expose their API via open protocols (REST+JSON, XML, SOAP, gRPC) rather than requiring a proprietary linked library.
**Reliability mitigation risks:**
Mitigating the reliability risk of a third-party dependency (e.g., adding a message queue to buffer transactions during outages) can reintroduce the security risk you were trying to eliminate (e.g., sensitive payment data written to persistent local storage). Subsystems exercised only in rare failure scenarios are less tested and more likely to harbor hidden bugs — including security vulnerabilities that are never triggered in normal operation.
Produce a recommendation: delegate, build in-house, or hybrid, with rationale.
**Output of this step:** A build-vs-buy assessment for each sensitive data component, with a recommendation and the key risks of each option.
---
### Step 5 — Evaluate Framework and Platform Security-by-Construction
Assess whether the implementation framework or platform chosen for the system provides security and reliability properties by construction — meaning the framework actively prevents developers from introducing common vulnerabilities or reliability failures, rather than relying on developer discipline.
**Why:** The Google microservices framework example illustrates the win-win scenario: a framework adopted primarily because it simplifies application development and automates common chores can simultaneously provide out-of-the-box security and reliability enforcement. When security and reliability best practices are woven into the fabric of the framework — not bolted on at the end — the framework takes responsibility for handling many common concerns. This makes security and production readiness reviews much more efficient: if continuous builds are green, the application is not affected by the common security concerns already addressed by the framework. Bug fixes in the centrally maintained framework propagate automatically to all applications when they are rebuilt and deployed.
Evaluate the proposed framework or platform against the following criteria:
**Static conformance checks:** Does the framework enforce that values passed between components meet type constraints at compile time? (Example: immutable types passed between concurrent contexts prevent concurrency bugs by construction — a conformance check that eliminates a class of bugs rather than relying on developer awareness.)
**Isolation enforcement:** Does the framework enforce isolation constraints between components so that a change in one component cannot cause a bug in another? This is a reliability property: well-defined and understandable interfaces between components reduce both vulnerability to bugs and attack surface.
**Security vulnerability class elimination:** Does the framework actively prevent common vulnerability classes (cross-site scripting, SQL injection, cross-site request forgery) by construction rather than by guideline? A framework that provides `SafeHtml` and `SafeUrl` types, enforced by the compiler so that they cannot be bypassed by mistake, eliminates those vulnerability classes structurally, not by convention.
**Automated operational setup:** Does the framework automatically configure monitoring for operational metrics, health checks, and SLA compliance? Manual configuration of these operational concerns is a reliability risk because it is inconsistently applied across services.
**Error budget integration:** Does the deployment automation incorporate reliability gates (e.g., stopping releases when the error budget is consumed) so that SLA adherence is a structural property of the deployment pipeline rather than a manual process?
If the current design relies on developer discipline rather than framework enforcement for security or reliability properties, flag this as a design risk and recommend frameworks or patterns that provide the property by construction.
**Output of this step:** A framework/platform assessment against the five criteria, with a recommendation on whether the current choice provides adequate security-by-construction coverage.
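The idea behind a `SafeHtml`-style type can be sketched in a few lines. This is only an illustration of the shape, not the API of any real framework: the class, `from_untrusted`, and `render_greeting` are hypothetical names. Python also cannot enforce the constraint at compile time, and the plain constructor remains callable, so a real security-by-construction framework would rely on a statically checked language or on tooling that restricts who may construct the type.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class SafeHtml:
    """Wrapper marking a string as safe to emit into an HTML context."""

    value: str

    @staticmethod
    def from_untrusted(text: str) -> "SafeHtml":
        # Escaping happens at the sanctioned construction point for untrusted input.
        return SafeHtml(html.escape(text, quote=True))


def render_greeting(name: SafeHtml) -> str:
    # Refuse raw strings at runtime; a static type checker could flag them earlier.
    if not isinstance(name, SafeHtml):
        raise TypeError("render_greeting requires SafeHtml, not a raw string")
    return f"<p>Hello, {name.value}!</p>"


print(render_greeting(SafeHtml.from_untrusted("<script>alert(1)</script>")))
```

The design point is that the rendering function accepts only the wrapper type, so forgetting to escape becomes a type error rather than a silent vulnerability.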
---
### Step 6 — Apply the Initial-vs-Sustained-Velocity Assessment
Assess the velocity implications of deferring each unresolved security or reliability requirement identified in previous steps.
**Why:** Teams are tempted to defer security and reliability work in the name of "moving fast." This is a coherent trade if initial velocity is the primary goal and the late-stage cost is understood and accepted. The problem is that the late-stage cost is routinely underestimated and the retrofitting work routinely happens under the worst possible conditions — under time pressure, in response to a security or reliability incident, in a mature and complex codebase with many dependencies. Making invasive late-stage changes under those conditions introduces new risks. The goal of this step is to make the deferred cost explicit and help the team decide deliberately, not by default.
For each unresolved requirement from the tensions analysis, assess:
1. **Retrofitting complexity:** Can this requirement be addressed by adding a new component, or does it require redesigning existing architectural boundaries? (Example: adding a logging interceptor to an RPC framework is additive — it does not require changing existing handlers. Redesigning from a monolith to microservices to achieve isolation is a structural change affecting everything.)
2. **Incident risk:** What is the probability and impact of a security or reliability incident caused by this gap before it is addressed? Deferring a requirement that has a high probability of causing a production incident within 12 months is a different decision from deferring a requirement that affects a low-traffic, low-sensitivity code path.
3. **Remediation pressure:** If this gap causes an incident, what is the likely time-pressure under which the fix must be made? Incident-driven remediation under time pressure is a significant source of new security and reliability bugs.
4. **Early investment cost:** What is the actual engineering cost of addressing this requirement now, at design time? Investments in security and reliability are typically modest when made early in a product's lifecycle and require only incremental ongoing effort to maintain. The same investment made after launch typically requires a lot of work all at once.
**Output of this step:** A velocity impact table with each deferred requirement, its retrofitting complexity (Low/Medium/High), incident risk level (Low/Medium/High), and a recommended resolution timing (address now / address before first production load / can defer to later milestone).
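One way to keep the velocity impact table consistent across reviewers is to derive the timing from the two ratings with an explicit rule. The rule below is an assumption made for illustration (the worse of the two ratings decides the timing); it is not prescribed by the source, and the requirement names are hypothetical.

```python
LEVELS = {"Low": 0, "Medium": 1, "High": 2}


def recommend_timing(retrofit_complexity: str, incident_risk: str) -> str:
    """Map two Low/Medium/High ratings to a resolution timing (illustrative rule)."""
    worst = max(LEVELS[retrofit_complexity], LEVELS[incident_risk])
    if worst == 2:
        return "address now"
    if worst == 1:
        return "address before first production load"
    return "can defer to later milestone"


# Hypothetical deferred requirements: (name, retrofitting complexity, incident risk).
deferred = [
    ("Audit logging for admin actions", "Low", "Medium"),
    ("Service isolation boundaries", "High", "Medium"),
    ("Rate limiting on internal debug endpoint", "Low", "Low"),
]

for name, complexity, risk in deferred:
    print(f"{name}: {recommend_timing(complexity, risk)}")
```

A team may reasonably choose a different rule, for example weighting incident risk more heavily; the value lies in writing the rule down so that deferrals are decided deliberately.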
---
### Step 7 — Produce the Design Review Report
Compile the outputs of Steps 1–6 into a structured design review report.
**Why:** A design review is only useful if it is documented, shareable, and actionable. The report becomes the shared reference for design decisions, security review filings, production readiness reviews, and architecture decision records. Without a written artifact, the review has no lasting impact.
**Report structure:**
```
1. Executive Summary
- System overview (one paragraph)
- Overall assessment: Adequate / Needs attention / Significant gaps
- Top 3 prioritized recommendations
2. Requirements Classification
- Table from Step 1
3. Google Design Document Checklist
- Completed checklist from Step 2, with gap column
4. Identified Requirement Tensions
- Tension entries from Step 3
5. Build-vs-Buy Recommendations
- Assessments from Step 4 (omit if no sensitive data handling)
6. Framework/Platform Assessment
- Evaluation from Step 5
7. Initial-vs-Sustained-Velocity Assessment
- Table from Step 6
8. Prioritized Improvement List
- Ranked list of recommended changes
- Each entry: recommendation, rationale, timing, estimated effort
9. Open Questions
- Design decisions not yet made that affect this review
- Information that would change the assessment if available
```
**Output of this step:** A completed design review report. Write it to the project directory or present it as a structured response, depending on the user's preference.
---
## Key Principles
**1. Security and reliability cannot be bolted on.**
They are emergent properties of the entire system design — including component decomposition, communication mechanisms, data handling, trust relationships, testing practices, and operational workflow. Apply the emergent property test to every security and reliability concern: if there is no specific module or flag that implements it, it must be addressed at the design level, not deferred to implementation.
**2. Tradeoffs compound.**
The payment processing example illustrates a common pattern: solving one security or reliability tension creates another. A security mitigation (not holding sensitive data) introduces a reliability risk (third-party dependency). Mitigating that reliability risk (local message queue) reintroduces the security risk (sensitive data on persistent storage). Identifying tensions explicitly prevents teams from solving one without noticing they have created another.
**3. Initial velocity is not the same as sustained velocity.**
Deferring security and reliability work typically accelerates the start but decelerates the project as a whole. The late-stage cost of retrofitting emergent property requirements can be very substantial, and remediation work done under incident pressure is itself a source of new security and reliability failures. Make the deferred cost explicit before treating it as acceptable.
**4. Security-by-construction beats security-by-convention.**
A framework that makes common vulnerability classes structurally impossible (via type constraints enforced by the compiler) is more reliable than one that relies on developer awareness and code review. When choosing between frameworks or design patterns, prefer options that make the secure behavior the default and the insecure behavior a compile-time or structural error.
**5. Security and reliability reviews must happen at design time.**
Google's design document template kicks off production readiness and security review processes early, sometimes multiple quarters before engineers start thinking about the launch stage. Late-stage security and reliability reviews that discover fundamental design problems are costly and disruptive. This review is most valuable when design decisions are still open.
---
## Examples
### Example 1 — Payment processing: build-vs-buy analysis
**Context:** An e-commerce service needs to accept credit card payments. The product team's critical requirement is "users can complete checkout." The team is considering whether to handle payment card processing in-house or delegate to a third-party payment service API.
**Requirement tension identified:**
- Delegating to a third-party service eliminates the need to hold PCI-DSS-regulated card data, reducing compliance scope and data breach risk.
- However, the third-party service becomes a critical dependency. If the payment service is unavailable, "users can complete checkout" fails — a critical requirement is now dependent on an SLA outside the team's control.
- Adding a message queue to buffer transactions during outages reintroduces storage of payment data (on disk or in memory), which partly re-creates the PCI-DSS scope the delegation was meant to eliminate.
- Using volatile-memory-only storage for the queue avoids the PCI concern but means transactions are lost during host restarts — a data integrity failure.
**Recommendation:** Delegate to a third-party payment service; do not add a local queue fallback. Accept the reliability dependency on the payment provider and mitigate by selecting a provider with a strong SLA and evaluating a secondary provider for failover. Avoid the queue fallback because the security/compliance risk it reintroduces outweighs the reliability benefit at typical transaction volumes.
**Emergent property test result:** The PCI-DSS compliance requirement cannot be satisfied by adding a module after the design is implemented — it requires that the data never enters the system in the first place. This is an emergent property requirement that must be resolved at design time.
---
### Example 2 — Microservices framework: security-by-construction assessment
**Context:** A team is building a new microservices-based web application and choosing between a minimal framework and a full-featured internal framework with conformance checks.
**Framework/platform assessment:**
- The minimal framework provides no conformance checks. Security properties (input validation, output encoding, authentication, authorization) must be implemented correctly in each service by each developer. These properties are verified by code review and security guidelines — by convention, not by construction.
- The full-featured framework enforces conformance checks: all values passed between concurrent contexts must be immutable types (eliminating concurrency bugs by construction); isolation constraints between components prevent cross-component bug propagation; web application support handles common vulnerability classes (cross-site scripting, cross-site request forgery) by construction, not by guideline.
**Sustained velocity implication:** Applications built on the security-by-construction framework are verified by green continuous integration builds — if the build passes framework conformance checks, the application is not affected by the security concerns the framework addresses. Security and production readiness reviews are much more efficient. Bug fixes and security patches in the framework propagate automatically to all applications at next build and deploy.
**Recommendation:** Choose the full-featured framework. The up-front learning cost is offset by the structural elimination of common vulnerability classes and the efficiency gain in security reviews over the system's lifetime. This is a design-time decision; migrating later requires significant refactoring of every service.
---
## References
See `references/` for:
- `google-design-doc-checklist.md` — Full text of the reliability and security sections of the Google design document template, with guidance on how to interpret each question
- `requirement-classification-guide.md` — Worked examples of feature vs. nonfunctional requirement classification, including implied nonfunctional requirements teams commonly miss
- `tension-catalog.md` — Common security-vs-reliability tension patterns with worked analysis from the payment processing and microservices case studies
- `build-vs-buy-assessment-template.md` — Blank template for third-party delegation tradeoff analysis, including vendor security evaluation criteria
- `velocity-impact-assessment.md` — Guidance on assessing retrofitting complexity and incident risk for deferred requirements
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Building Secure and Reliable Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
Use when you need to recover from a security incident, build an incident recovery plan, execute post-breach remediation, rotate credentials after a breach, s...
---
name: security-incident-recovery
description: Use when you need to recover from a security incident, build an incident recovery plan, execute post-breach remediation, rotate credentials after a breach, scope attacker impact across systems, build a recovery checklist, decide when to eject an attacker vs. continue observing, or run a post-incident postmortem with short-term and long-term action items.
tags: [security, incident-recovery, post-incident, credential-rotation, postmortem]
depends_on: [security-incident-command]
---
# Security Incident Recovery
Guides post-incident recovery from security breaches using a 4-phase process: appoint a Remediation Lead in parallel with investigation, scope the full blast radius, build a structured recovery checklist, and execute remediation — including isolation, clean rebuilds, credential rotation, and postmortem. Incorporates a decision framework for eject-vs-mitigate-vs-rebuild choices, four key adversarial-thinking questions, and technical debt tracking for short-term bypasses. Covers three worked scenarios ranging from opportunistic cloud compromise to targeted APT with financial backdoor. Output: recovery checklist, credential rotation plan, and postmortem with prioritized action items.
**Prerequisite:** The Incident Management at Google (IMAG) process must be running with an Incident Commander (IC) and an Operations Lead (OL) already staffed (`security-incident-command`). This skill picks up from the point at which enough is known about the attack to begin parallel recovery planning.
## When to Use
- Executing recovery after a confirmed security breach, intrusion, or account compromise
- Scoping which systems, accounts, and data are affected and need remediation
- Building the recovery checklist before executing cleanup
- Deciding whether to eject the attacker now or continue observing their activity
- Rotating credentials, secrets, and cryptographic keys following a breach
- Conducting a postmortem that distinguishes overnight tactical fixes from strategic long-term changes
- Recovering from attacks where the attacker may still be present and watching the response
## Process
### Phase 1 — Appoint a Remediation Lead (parallel track)
As soon as the incident is confirmed and investigation begins, the IC and Operations Lead should appoint a **Remediation Lead (RL)**. Do not wait until investigation is complete.
**Why parallel tracks:** Investigation is time-consuming and detail-heavy; recovery planning requires a different skill set (system knowledge, operations expertise) than forensic investigation. Separate teams feed information to each other bidirectionally rather than handing off sequentially — this prevents the investigation team from being exhausted by the time recovery begins, and it allows recovery planning to stay current with newly discovered attacker activity.
**Who the RL assembles:** The people who build and run the affected systems every day — SREs, developers, system administrators, helpdesk staff, and the security specialists who manage routine processes like code audits and configuration reviews. Recovery engineers do not need to be security professionals; they need deep system knowledge.
**Information management during recovery:**
Store all recovery documentation (raw incident trails, scratch notes, recovery checklists, new operational documentation) in systems inaccessible to the attacker. Use air-gapped computers or services clearly outside the attacker's potential scope. Start with notecards and add an independent service provider once you are confident that none of the recovery team's machines are compromised.
If the attacker has access to your email or instant messaging system, they can read your recovery plans in real time, observe your mitigations, and take pre-emptive action. In a large-scale phishing scenario where employee accounts are compromised, set up a completely new communication system — for example, inexpensive Chromebooks on a fresh network — before coordinating recovery steps.
Assign dedicated note-takers or documentation specialists. An audit trail of what happened during recovery helps fix mistakes and satisfies postmortem requirements.
### Phase 2 — Scope the Recovery
Before building the recovery checklist, establish the complete picture of what is affected and what the attacker can do.
**Required inputs from the investigation team:**
1. **Complete affected system list:** Every system, network, and dataset directly or indirectly impacted. If a configuration distribution system is compromised, every system that received configurations from it is in scope. Track indirect impact chains.
2. **Attacker tactics, techniques, and procedures (TTPs):** Understanding how the attacker operates lets you identify related resources that may be impacted and anticipate their next moves. Build a list of attacker behaviors and possible defenses — this becomes operational documentation for why specific new defenses are introduced.
3. **Variant analysis:** Identify whether other systems are susceptible to variations of the same attack. If a buffer overflow was exploited on one web server and 20 others run the same flawed software, all 20 are in scope even if not yet compromised. Check for related vulnerability classes in the same infrastructure or the same vulnerability in different software using the same shared libraries.
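Tracking indirect impact chains, as described in the first input above, amounts to a graph traversal over "who receives data or configuration from whom". The sketch below assumes you can export such a dependency map from your inventory; the system names and the `affected_systems` helper are hypothetical.

```python
from collections import deque


def affected_systems(downstream: dict[str, list[str]], compromised: set[str]) -> set[str]:
    """Return every system reachable from the compromised set via downstream edges."""
    in_scope = set(compromised)
    queue = deque(compromised)
    while queue:
        system = queue.popleft()
        for dependent in downstream.get(system, []):
            if dependent not in in_scope:
                in_scope.add(dependent)
                queue.append(dependent)
    return in_scope


# Hypothetical inventory: a config distributor pushes to two web servers,
# and one of those web servers populates a cache.
downstream = {
    "config-distributor": ["web-1", "web-2"],
    "web-1": ["cache-1"],
}
print(sorted(affected_systems(downstream, {"config-distributor"})))
```

The result is only as complete as the dependency map, so treat it as a starting list for the investigation team to confirm rather than a final scope.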
**The four key recovery questions — ask these before proceeding:**
**1. How will your attacker respond to the recovery effort?**
There is a human on the other side of the attack who is reacting to your incident response. Your recovery plan must include severing all attacker connections to your resources and ensuring they cannot return. A mistake here leads the attacker to take additional actions you may not have anticipated or have visibility into. An attacker who is still active when you take six systems offline can see your actions and may destroy the remaining infrastructure.
**2. What does complete remediation require?**
Understand the full scope of what the attacker has done — not just what is known. Plan for the possibility that the attacker is still present and watching your actions. Inform the investigation team of your recovery plans; they must be confident your plans will stop the attack before you execute.
**3. What variants of the attack exist?**
Address the known compromised system, but also consider whether variants of the vulnerability exist elsewhere in the infrastructure and how you will mitigate those for the future.
**4. Will your recovery reintroduce attack vectors?**
Recovery methods that restore systems to a known good state using system images, source code, or configurations may reintroduce the vulnerability that enabled the attack. Update golden images and delete compromised snapshots before systems come back online. Determine the attack timeline — when did the attacker make modifications, and how far back must you revert to reach a genuinely uncompromised state?
**On the timing of ejection — the consensus principle:**
Security incident responders generally agree that you should wait until you have a full understanding of the attack before ejecting the attacker. This prevents the attacker from observing your mitigations and helps you respond defensively.
Apply this carefully: if your attacker is actively doing something dangerous — exfiltrating sensitive data, destroying systems, stealing money — you may choose to act before you have a complete picture. Ejecting early while the attacker still has unknown footholds means entering a chess game. Prepare accordingly: know the steps you must take to reach checkmate before you make your opening move.
### Phase 3 — Build the Recovery Checklist
The recovery checklist is the single authoritative document for executing remediation. Build it before you start any cleanup work.
**Why a written checklist is mandatory:** Recovery from a complex incident is a choreographed operation with many individuals whose actions affect each other. A well-documented checklist lets the recovery team identify parallelizable work, coordinate across teams, and give the IC confidence that all completed steps have actually been verified. It also provides an audit trail.
**Checklist structure (Figure 18-1 template):**
```
INSTRUCTIONS:
- Highlight GREEN when COMPLETED
- Highlight ORANGE for TO-DO items
- Highlight RED if task cannot be completed (must be replaced with equivalent action)
- Highlight BLUE if task will not be completed because it is operationally risky (adjustments needed)
Incident Commander:
- Give "hands off keyboards" briefing
- Provide overview of tasks and constraints
- Notify everyone of the exact time remediation begins
In order of operations, the Remediation Lead initiates:
1. [Team name]: Description of task
a. Detailed commands or steps
2. [Team name]: Description of task
a. Detailed commands or steps
...
```
**What each checklist item must include:** the specific team responsible, detailed commands or exact steps (not just descriptions), the order of execution, rollback procedures if the plan fails, and any cleanup steps that follow remediation. Thorough documentation means every team member has unambiguous, agreed-upon guidance without needing to interpret intent under pressure.
**Parallelization:** Structure the checklist to identify which tasks can run concurrently. For complex incidents with multiple affected systems, create separate checklists per workstream (email team, source code team, SSL keys team, customer data team) that a single RL coordinates.
**Technical debt tracking for short-term mitigations:**
Some recovery steps will bypass normal procedures to eject the attacker quickly — for example, manually adding firewall deny rules outside your version control system, or disabling automatic configuration pushes. Accept this technical debt deliberately, but set explicit remediation reminders.
For each bypass, ask:
- How quickly can we replace this short-term mitigation?
- Is the team that owns this technical debt committed to paying it off through later improvements?
- Will this mitigation affect system uptime or exceed error budgets?
- How can future engineers recognize this as temporary (comments in code, descriptive commit messages, tagging)?
- How can a future engineer with no domain expertise prove the mitigation is no longer needed and remove it safely?
- What happens if this short-term mitigation is left in place for months instead of weeks?
Tag technical debt visibly so it cannot be forgotten.
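A lightweight register of temporary mitigations is one way to make this debt visible. The sketch below is a minimal illustration: the `TemporaryMitigation` fields, team names, and dates are all hypothetical, and in practice the register would live in whatever tracking system the team already uses.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TemporaryMitigation:
    description: str
    owner: str             # team committed to paying off the debt
    applied_on: date
    review_by: date        # explicit remediation reminder
    removal_criteria: str  # how someone without domain expertise can verify removal is safe


def overdue(mitigations: list[TemporaryMitigation], today: date) -> list[TemporaryMitigation]:
    """Return mitigations whose review date has passed."""
    return [m for m in mitigations if today > m.review_by]


# Hypothetical entries recorded during an incident.
register = [
    TemporaryMitigation(
        description="Manual firewall deny rule added outside version control",
        owner="network-team",
        applied_on=date(2024, 3, 1),
        review_by=date(2024, 3, 15),
        removal_criteria="Equivalent rule merged into the managed firewall config",
    ),
    TemporaryMitigation(
        description="Automatic configuration pushes disabled",
        owner="release-team",
        applied_on=date(2024, 3, 1),
        review_by=date(2024, 6, 1),
        removal_criteria="Config distribution system rebuilt from known-good source",
    ),
]

for item in overdue(register, today=date(2024, 4, 1)):
    print(f"OVERDUE ({item.owner}): {item.description}")
```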
### Phase 4 — Execute the Recovery
#### Isolation and Quarantine
Isolate compromised assets before or during remediation to prevent the attacker from using remaining access to undermine recovery.
**Quarantine methods:**
- Disable switch ports or networking at the host level to isolate individual machines
- Use network segmentation to isolate entire networks of compromised machines
- Move malicious binaries to quarantine folders with restricted permissions (antivirus-style quarantine)
- For mission-critical systems that cannot be taken fully offline: isolate on a separate network with restricted traffic rules limiting what it can send and receive
**BeyondCorp / zero-trust architecture advantage:** Zero-trust environments that validate asset security posture before granting access to corporate services — like Google's BeyondCorp architecture — provide a natural isolation boundary during recovery. Because access is controlled via proxy infrastructure with strong credentials that can be revoked at any time, isolating a compromised asset does not require network-level reconfiguration. Lateral movement (attacker moving between machines on the same network) is also suppressed by default because there is little implicit trust between assets.
**Leaving compromised assets online is technical debt.** If a mission-critical database cannot be rebuilt immediately, isolate it and mark it clearly — use highly visible physical stickers on devices, maintain a list of MAC addresses of quarantined systems and monitor for those addresses on the network. Include explicit remediation of all quarantined assets in the recovery checklist and postmortem. Someone new to the organization or the incident must not accidentally unquarantine a compromised resource.
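Monitoring for quarantined MAC addresses can be as simple as intersecting the quarantine list with addresses observed on the network. The sketch below assumes the observed addresses are already available from some source such as DHCP logs or switch tables; how they are collected is outside its scope, and both function names are hypothetical.

```python
def normalize_mac(mac: str) -> str:
    """Normalize common MAC notations to lowercase, colon-separated form."""
    digits = mac.replace("-", "").replace(":", "").replace(".", "").lower()
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def quarantined_seen(quarantine_list: list[str], observed: list[str]) -> list[str]:
    """Return quarantined MAC addresses that appear among the observed addresses."""
    quarantined = {normalize_mac(m) for m in quarantine_list}
    return sorted({normalize_mac(m) for m in observed} & quarantined)


# Hypothetical data: one quarantined device shows up on the network again.
quarantine_list = ["AA-BB-CC-00-11-22", "aa:bb:cc:00:11:33"]
observed = ["aa:bb:cc:00:11:22", "de:ad:be:ef:00:01"]
print(quarantined_seen(quarantine_list, observed))
```

Normalizing both sides first matters because inventories and network tools often record the same address in different notations.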
#### Clean Rebuilds from Known-Good Sources
For systems where the attacker may have installed malware, modified software, or tampered with configurations, reinstalling from scratch with known-good images is typically the best solution. Deleting malware in place risks missing additional malware types the attacker installed.
**Eject vs. mitigate vs. rebuild decision:**
- If the affected systems are mission-critical and difficult to rebuild: deleting malware and patching is tempting but risky if the attacker installed multiple malware types
- If you have reliable and secure design principles in place (hardware-backed boot verification, cryptographic chain of trust, automated release systems): rebuilding is straightforward — it is simply a power cycle or container replacement
- In cloud environments: replace compromised instances with clean images from known-good sources
**Checksum verification:** Obtain known-good copies of source code, binaries, images, and configurations from original sources (open source providers, commercial software vendors, uncompromised version control systems). Perform checksum comparisons against known good states. If your good copies are hosted on compromised infrastructure, determine exactly when the attacker began tampering before using them.
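A checksum comparison against a known-good digest can be sketched with the Python standard library. This example assumes the expected SHA-256 digest was obtained from a trusted, uncompromised source; the helper names and the demo file are hypothetical.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_known_good(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with the expected digest from a trusted source."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Demo with a throwaway file standing in for a downloaded artifact.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "artifact.bin"
    artifact.write_bytes(b"known-good contents")
    expected = hashlib.sha256(b"known-good contents").hexdigest()
    print(matches_known_good(artifact, expected))
```

The comparison is only meaningful if the expected digest itself comes from infrastructure the attacker could not have modified.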
**Attack timeline analysis for golden images:** Review your golden images, delete compromised snapshots, and update images before systems come back online. If restoring from backups, determine whether the backup predates the attacker's first modification. Logs may identify an inflection point where the attack began, which gives you a cutoff: only backups taken before that point are candidates for restoration.
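Selecting backups against the attack timeline can be sketched as a timestamp filter. This is a minimal example, assuming backups are available as `(name, taken_at)` pairs and that the investigation has supplied an earliest-known attacker activity time; the safety margin parameter and all sample values are hypothetical.

```python
from datetime import datetime, timedelta


def backups_before_compromise(backups, earliest_attacker_activity,
                              safety_margin=timedelta(0)):
    """Return backups taken before the cutoff, newest first.

    The safety margin pushes the cutoff earlier to allow for attacker
    activity that predates the first log evidence.
    """
    cutoff = earliest_attacker_activity - safety_margin
    candidates = [b for b in backups if b[1] < cutoff]
    return sorted(candidates, key=lambda b: b[1], reverse=True)


# Hypothetical backup catalogue and investigation finding.
backups = [
    ("nightly-03-01", datetime(2024, 3, 1, 2, 0)),
    ("nightly-03-05", datetime(2024, 3, 5, 2, 0)),
    ("nightly-03-09", datetime(2024, 3, 9, 2, 0)),
]
earliest = datetime(2024, 3, 6, 14, 30)
candidates = backups_before_compromise(backups, earliest, timedelta(days=1))
candidate_names = [name for name, _ in candidates]
```

A backup that passes this filter is only a candidate; it still needs checksum verification before use.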
**Mark compromised code explicitly:** During the rebuild, clearly mark known compromised code, libraries, and binaries. Create tests that prevent inadvertent reuse of compromised code by recovery team members working in parallel.
#### Credential and Secret Rotation
Credential rotation is one of the most critical and most frequently incomplete recovery actions.
**What to rotate — comprehensive scope:**
- User account passwords (all accounts with access to affected systems; if attacker had wide access or access is unclear, rotate all users)
- Service and application account credentials
- SSH keys (including the `authorized_keys` file)
- Network device credentials
- Out-of-band management system credentials
- SSL/TLS keys (if frontend web infrastructure was accessible to the attacker — failure to rotate enables man-in-the-middle attacks)
- Encryption keys for data at rest (rotate keys and re-encrypt if the key lived on a compromised server)
- API keys for cloud services (attackers who accessed source code or local config files may have found these; err on the side of caution and rotate them even where you cannot prove access occurred)
- SaaS application credentials
**Execution order matters:** Prioritize administrator credentials, then known compromised accounts, then accounts that grant access to sensitive resources. In large organizations where thousands of users must change passwords simultaneously, single sign-on (SSO) services can simplify coordinated resets — but introducing SSO during an incident is time-consuming and may not be viable short-term.
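The rotation order above can be expressed as a sort key. The sketch below assumes each credential has been tagged during scoping with three boolean flags; the `Credential` type, its field names, and the sample entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credential:
    name: str
    is_admin: bool = False
    known_compromised: bool = False
    guards_sensitive_data: bool = False


def rotation_priority(cred: Credential) -> int:
    """Lower number means rotate earlier."""
    if cred.is_admin:
        return 0
    if cred.known_compromised:
        return 1
    if cred.guards_sensitive_data:
        return 2
    return 3


def rotation_order(creds):
    """Sort by priority, then by name for a stable, reviewable order."""
    return sorted(creds, key=lambda c: (rotation_priority(c), c.name))


# Hypothetical inventory produced during recovery scoping.
inventory = [
    Credential("build-bot-api-key"),
    Credential("alice", known_compromised=True),
    Credential("db-backup-svc", guards_sensitive_data=True),
    Credential("root-ldap", is_admin=True),
]
ordered_names = [c.name for c in rotation_order(inventory)]
```

Everything in the inventory still gets rotated; the ordering only decides what the team does first.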
**Pass-the-hash risk:** Attackers can obtain and replay hashed credentials (NTLM or LanMan hashes) without needing cleartext passwords. If your environment is vulnerable, include credential hash rotation and review of authentication protocols in the checklist.
**Pattern-based password prediction:** If your environment stores password history (LDAP, Active Directory), an attacker with access to hashed passwords can infer patterns (password2019, password2020...) and predict the next password in a sequence. Consult security specialists about whether forced password resets are sufficient or whether two-factor authentication must be enabled simultaneously to prevent the attacker from leveraging their historical knowledge.
**Two-factor authentication as a quick win:** Adding two-factor authentication — especially for less-protected entry points like email — is a fast short-term mitigation. It also provides a detection benefit: if an attacker attempts to log in with stolen credentials after rotation, the attempt appears as an account login failure, alerting the investigation team to continued attacker activity.
### Phase 5 — After the Recovery: Postmortem
Begin post-recovery actions as soon as recovery is complete. Waiting lets incident details fade and postpones necessary medium- and long-term fixes.
**Postmortem structure:**
Every postmortem must contain a list of action items that address the underlying problems found during the incident. A strong postmortem:
- Covers the technology issues the attacker exploited
- Recognizes opportunities for improved incident handling
- Documents the time frames and efforts associated with each action item
- Explicitly separates short-term tactical mitigations from long-term strategic changes
**Short-term overnight items** (low-hanging fruit — quick to implement, small in scope):
- Patch known vulnerabilities exploited in the attack
- Add two-factor authentication to entry points that lacked it
- Set up a vulnerability discovery program
- Enable two-factor authentication for additional services
- Lower the time to apply patches
**Long-term strategic items** (fundamental changes to how systems and processes work):
- Deploy backbone encryption
- Start a dedicated security team
- Implement BeyondCorp / zero-trust network architecture
- Alter operating system choices to enable hardware-verified boot
- Establish ongoing vulnerability management programs
- Pursue widespread FIDO-based security key adoption for authentication
**Security-focused postmortem questions:**
- What were the main contributing factors? Are there variants and similar issues elsewhere?
- What testing or auditing processes should have detected these factors earlier?
- Was this incident detected by an expected technology control (e.g., intrusion detection system)? If not, how do you improve detection?
- How quickly was the incident detected and responded to? Is this within an acceptable range?
- Was important data sufficiently protected to deter attacker access? What new controls should be introduced?
- Was the recovery team able to use versioning, deployment, testing, backup, and restoration tools effectively?
- What normal procedures (testing, deployment, release) did the team bypass during recovery? What remediation must happen now?
- Were any infrastructure changes made as temporary mitigations that now need refactoring?
- What bugs were filed during the incident and recovery that still need to be addressed?
**Action item format:** Lay out action items with explicit owners, sorted by short-term vs. long-term. The period of routine steady state after an incident is simultaneously the window before the next incident — use it to identify new threats and prepare.
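One way to keep action items in this shape is to reject any item without an owner and group the rest by time horizon. This is a minimal sketch; the `ActionItem` fields, horizon labels, and sample items are assumptions for illustration.

```python
from dataclasses import dataclass

HORIZONS = ("short-term", "long-term")


@dataclass
class ActionItem:
    title: str
    owner: str
    horizon: str  # one of HORIZONS


def group_action_items(items):
    """Group items by horizon, refusing items that lack an explicit owner."""
    grouped = {h: [] for h in HORIZONS}
    for item in items:
        if not item.owner.strip():
            raise ValueError(f"action item has no owner: {item.title}")
        if item.horizon not in grouped:
            raise ValueError(f"unknown horizon {item.horizon!r}: {item.title}")
        grouped[item.horizon].append(item)
    return grouped


# Hypothetical postmortem entries.
items = [
    ActionItem("Deploy backbone encryption", "network-team", "long-term"),
    ActionItem("Patch exploited vulnerability", "web-team", "short-term"),
    ActionItem("Add 2FA to email", "it-helpdesk", "short-term"),
]
grouped = group_action_items(items)
short_titles = [i.title for i in grouped["short-term"]]
long_titles = [i.title for i in grouped["long-term"]]
```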
## Worked Examples
### Example A: Compromised Cloud Instances (Opportunistic)
**Scenario:** A web-based software package running on cloud VMs has a common vulnerability. An opportunistic attacker discovers it with scanning tools and takes over the VMs, using them to launch new attacks.
**Recovery approach:**
The RL uses investigation notes to build a checklist with four ordered workstreams:
1. **Security specialists:** Set up vulnerability scanning — includes scanning other software on the cloud instances to identify additional necessary fixes
2. **Web developer team:** Patch vulnerable software — includes specific source location, which versions to deploy, and deployment steps
3. **SRE team:** Deploy new clean cloud instances in parallel with compromised ones — test and validate before routing traffic
4. **SRE team:** Disable compromised cloud instances — includes specific shutdown steps and archive or deletion to prevent reuse
**Postmortem outcome:** Identifies need for a formal vulnerability management program to proactively discover and patch known issues before attackers find them.
### Example B: Large-Scale Phishing Attack (70% Employee Compromise)
**Scenario:** Over seven days, an attacker launches a password phishing campaign against an organization without two-factor authentication. 70% of employees fall for it. The attacker reads email and leaks sensitive information to the press. The attacker has stated they will take more action in the coming days.
**Complexities:** The investigation team determines the attacker has not yet tried to access systems other than email — but many employees share passwords between email and VPN/cloud services, meaning the attacker could pivot. Ejecting from email quickly is urgent, but uncoordinated password resets that tip off the attacker before all entry points are locked could trigger immediate escalation.
**Recovery checklist — four ordered workstreams:**
1. **System administrators / SREs:** Implement two-factor authentication infrastructure for email (specific enrollment steps for all employees)
2. **Email SREs:** Lock all email accounts (specific steps to lock until reset and 2FA enrollment — consider locking all accounts simultaneously rather than a rolling window, since the attacker is active)
3. **IT staff (Helpdesk):** Enact two-factor authentication and password resets (help users change email, VPN, and cloud-based passwords; may involve an all-hands meeting)
4. **System administrators / SREs:** Adopt two-factor authentication for VPN and cloud systems
**Postmortem outcome:** Two-factor authentication for all critical communication and infrastructure systems; SSO solution; employee phishing awareness education; engage an IT security specialist for ongoing best-practice guidance.
### Example C: Targeted APT — Financial Backdoor, SSL Theft, Executive Monitoring
**Scenario:** An attacker has had undetected access for over a month. They inserted a backdoor into source code siphoning $0.02 from every transaction, stole SSL keys protecting customer web communications, and monitored executive email. The investigation team does not know how the attack began or how the attacker maintains access. The attacker can modify and deploy source code to production, access SSL key infrastructure, and read executive email including incident response coordination.
**Complexities:** This attack requires full parallelization across five independent recovery workstreams:
- **Email team:** Sever attacker's email access — reset passwords, introduce two-factor authentication (the attacker is actively reading your recovery plans)
- **Source code team:** Determine when the attacker modified source code, identify all affected binaries deployed, sanitize affected source files, redeploy clean code. Confirm the attacker has not modified version control history and cannot make additional changes.
- **SSL keys team:** Deploy new SSL keys to web serving infrastructure. Determine how the attacker gained access to key storage and prevent future unauthorized access.
- **Customer data team:** Determine what customer information the attacker accessed. Perform customer password changes or session resets as needed. Coordinate with customer support, legal, and related staff.
- **Reconciliation team:** Address the $0.02-per-transaction siphoning. Determine whether database records must be amended for long-term financial recording. Coordinate with finance and legal.
**Key lesson:** Complex incidents require less-than-ideal choices. Severing email access may mean turning off the email system entirely, locking all accounts, or bringing up a parallel email system — each option alerts the attacker, impacts the organization differently, and may not address the initial access vector. The recovery and investigation teams must work together closely to find a path that fits the situation.
## Key Principles
**Parallelization is essential, not optional.** Recovery and investigation must run as separate parallel tracks from the start. Investigation fatigue makes sequential handoff unreliable, and delayed recovery planning allows attacker footholds to persist.
**The attacker is watching.** Your recovery plan must account for the possibility that the attacker can see your actions — through compromised communication systems, compromised investigation machines, or retained system access. Use out-of-band, attacker-inaccessible communication channels for all recovery coordination.
**Ejection timing requires consensus.** Wait for full understanding before ejecting — but act early if the attacker is doing active damage. These two principles are in tension; resolve the tension by ensuring the investigation team concurs that your recovery plan will actually stop the attack before you execute it.
**Technical debt must be tracked explicitly.** Short-term bypasses are sometimes necessary. Accept them deliberately, tag them visibly, and set explicit remediation reminders — or they become permanent vulnerabilities.
**Zero-trust architecture simplifies recovery.** Environments that validate asset posture before granting access (BeyondCorp pattern) provide natural quarantine boundaries and suppressed lateral movement, reducing the blast radius of a compromise and simplifying isolation during recovery.
**Credential rotation scope is always larger than it looks.** Include SSL keys, API keys, service accounts, and infrastructure credentials — not just user passwords. A missed credential is a door left unlocked.
**Postmortem action items must be separated by time horizon.** Overnight tactical fixes and multi-year strategic changes belong in the same postmortem but must be clearly delineated with explicit owners, or the strategic work never gets done.
## References
- *Building Secure and Reliable Systems* (Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, and Adam Stubblefield; Google/O'Reilly, 2020)
- Chapter 18 "Recovery and Aftermath" (pp. 417–442)
- "Recovery Logistics" (pp. 418–420): parallel tracks, Remediation Lead appointment, information management, air-gapped documentation
- "Recovery Timeline" (pp. 420–421): when to start recovery planning, not before some plan exists
- "Planning the Recovery / Scoping the Recovery" (pp. 421–422): complete affected system list, TTP analysis, postmortem doc
- "Recovery Considerations" (pp. 423–427): four key questions, ejection timing consensus, infrastructure compromise, variant analysis, attack vector reintroduction, mitigation options and technical debt framework
- "Recovery Checklists" (pp. 428–429): Figure 18-1 template structure
- "Initiating the Recovery" (pp. 429–435): isolation/quarantine, BeyondCorp sidebar, system rebuilds, data sanitization, credential and secret rotation, SSO consideration
- "After the Recovery / Postmortem" (pp. 435–437): short-term vs. long-term action items, security-focused postmortem questions
- "Examples" (pp. 437–442): compromised cloud instances (Figure 18-2), large-scale phishing (Figure 18-3), targeted APT with parallel workstreams
- Depends on: `security-incident-command` (IMAG process running, IC/OL/RL roles staffed)
- Related: `recovery-mechanism-design` (design-time recovery architecture), `incident-response-team-setup` (IR team structure and postmortem process)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills), *Building Secure and Reliable Systems* by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, and Adam Stubblefield.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
Command and manage an active security incident from declaration through remediation handoff using the incident management framework (Google's IMAG, derived f...
---
name: security-incident-command
description: |
Command and manage an active security incident from declaration through remediation handoff using the incident management framework (Google's IMAG, derived from ICS). Use when: you have a confirmed or suspected security incident and need to take command; someone says "we have a security incident" or "we may have been compromised"; you need to stand up an incident command structure with staffing roles; you are running forensic investigation and need to coordinate parallel tracks; an incident has grown large enough to require shift rotation and formal handovers; or you need to decide when investigation is complete enough to move to ejection and remediation. Distinct from incident response team setup (which designs the team and IR capability before incidents) — this skill executes the live response. Applies the seven-step incident command process: declare, staff, establish operational security, run forensic investigation loop, scale with rotation, apply the lead-rate decline signal to decide ejection timing, and hand off with a structured brief. Produces: incident state document, forensic timeline, communication plan, and remediation handoff package.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/building-secure-and-reliable-systems/skills/security-incident-command
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- incident-response-team-setup
source-books:
- id: building-secure-and-reliable-systems
title: "Building Secure and Reliable Systems"
authors: ["Heather Adkins", "Betsy Beyer", "Paul Blankinship", "Piotr Lewandowski", "Ana Oprea", "Adam Stubblefield"]
chapters: [17]
tags:
- security
- incident-response
- crisis-management
- incident-command
- forensics
execution:
tier: 1
mode: hybrid
inputs:
- type: context
description: "Alert, detection signal, or report that triggered escalation — describes what was observed, when, and on which systems"
- type: context
description: "Organization's IR team charter and severity/priority model from incident-response-team-setup"
tools-required: [Read, Write]
tools-optional: [Bash, Grep]
mcps-required: []
environment: "Runs during a live incident. IC executes Steps 1–3 immediately; Steps 4–7 are iterative throughout incident lifecycle."
discovery:
goal: "Manage a security incident from initial detection through remediation handoff: declare, staff, secure the investigation environment, complete the forensic investigation loop, scale the response sustainably, apply the lead-rate decline signal to decide ejection, and execute a structured handover"
tasks:
- "Issue a formal incident declaration using the exact IMAG formula"
- "Staff the four core roles: IC, OL, CL, RL"
- "Establish operational security: new channels, clean machines, domain obfuscation"
- "Run forensic investigation loop across forensics, reversing, and hunting sub-tracks"
- "Monitor lead-rate decline signal to determine when investigation is complete enough to eject"
- "Implement shift rotation and formal handover briefs to sustain multi-day response"
- "Produce remediation handoff: incident state document, forensic timeline, communication plan"
audience:
- "Incident commander (IC) running an active security incident"
- "Security engineers and operations leads on IR teams"
- "Engineering managers who may be pulled into IC role during a crisis"
---
# Security Incident Command
Guides managing a security incident from declaration through remediation handoff using the incident management framework (Google's IMAG, derived from ICS). The seven-step process transforms a chaotic escalation into a controlled, parallelized investigation with clear role assignments, operational security discipline, and a principled stopping rule.
**Prerequisite:** Assumes your team has completed `incident-response-team-setup` — specifically that IC/OL/CL/RL roles are defined, severity and priority models exist, and a communications plan is in place.
## When to Use
- A detection alert, customer report, or engineer escalation suggests a targeted compromise or active attacker
- Someone in the organization says "we have a security incident" or "we think we've been breached"
- An escalation has passed triage and requires organized, multi-person incident management
- A running incident needs handover to a fresh team due to fatigue
- You are the first responder and need to decide whether to declare and take command
- A critical vulnerability (e.g., Shellshock-class severity) requires incident-level coordination even without confirmed exploitation
**Scope boundary:** This skill governs the command and investigation phases. Remediation execution and postmortem writing are downstream; the RL begins drafting the cleanup plan in parallel during investigation, and the postmortem begins once the incident closes.
## Process
### Step 1 — Don't panic. Triage first.
Before declaring, take five minutes. Breathe. Resist the instinct to immediately lock accounts or take systems offline — those actions alert an attacker who may still be present, destroying your visibility.
**Why this matters:** Unlike a reliability incident (where the system won't resist being fixed), in a security compromise the attacker may be paying close attention to your actions. Premature response actions can cause an attacker to go quiet, eliminating your visibility into the full scope of their foothold, or trigger destructive exit behavior. Postmortems rarely show that responding five minutes sooner made the difference — additional planning up front adds more value than immediate action.
**Triage assessment:** Estimate the potential severity by asking:
- What data is accessible from the potentially compromised system, and what is its criticality?
- What trust relationships does this system have with other systems (lateral movement risk)?
- Are there compensating controls the attacker would also need to defeat to extend their reach?
- Does this appear to be opportunistic commodity malware, or a targeted attack crafted for your organization?
**Ransomware triage example — same threat, three very different responses:**
| Organization | Security posture | Response required |
|---|---|---|
| Org 1 | Cryptographically signed software allowlist, mature detection | Alert fires; single engineer handles it via standard process. No crisis response needed. |
| Org 2 | Automated wipe-and-replace of compromised cloud demo instances | Automated mitigation contains the risk. Minimal human response needed. |
| Org 3 | Fewer layered defenses, limited visibility | Ransomware can spread network-wide. Significant forensic and coordination effort required. Crisis response is warranted. |
Same ransomware attack, three different required response levels. Your layered defenses and process maturity determine whether this is a crisis that needs incident command or a playbook-driven standard process.
If the escalation is a complex and potentially damaging problem — targeted compromise, active attacker, or a critical vulnerability with imminent exploitation risk — proceed to Step 2.
### Step 2 — Declare the incident and take command
The first step in IMAG is taking *command*. Issue a formal declaration using this exact formula:
> **"Our team is declaring an incident involving X, and I am the incident commander."**
Replace X with a concise description of the known or suspected issue. This declaration may seem unnecessary, but it is essential: it aligns everyone's expectations, signals that normal processes can be bypassed, and notifies executives that teams may need to reprioritize. Incidents are complex, high-tension situations that happen fast — being explicit about who is in command removes all ambiguity.
**The IC explicitly accepts the assignment.** The person designated as IC must verbally confirm they are taking command. This eliminates the dangerous ambiguity of an assumed command structure under pressure.
**Why declaring matters before you know everything:** You do not need certainty to declare. Use a worst-case assumption at triage time. If the security team's suspicions are correct and an engineer's account is compromised, you should declare immediately even if the full scope is unknown. Declaration is the mechanism that assembles the people and removes the organizational friction needed to investigate.
### Step 3 — Staff the four core roles
Under IMAG, four roles must be explicitly assigned. One person can hold multiple roles on small teams, but every role must be assigned — never assumed.
| Role | Responsibility |
|------|---------------|
| **Incident Commander (IC)** | Strategic direction. Keeps the response on track. Does NOT perform hands-on technical work — if you find yourself checking logs, step back. |
| **Operations Lead (OL)** | Tactical counterpart to the IC. Directs technical investigation. All forensics and engineering staff report to the OL. |
| **Communications Lead (CL)** | Owns all internal and external communications. Prepares briefs for executives, legal, regulators, customers. Keeps responders from making conflicting public statements. |
| **Remediation Lead (RL)** | Begins studying confirmed compromise areas as the forensics team discovers them. Builds the cleanup plan in parallel so it is ready when ejection is decided. |
**Additional leads depending on incident scope:**
- **Management liaison:** Authorizes high-impact decisions (shutting down revenue-generating services, revoking engineer credentials) that the IC cannot make unilaterally.
- **Legal lead:** Guides actions requiring legal permission (examining employee browser history, assessing regulatory disclosure obligations under GDPR or similar frameworks).
**The IC's ongoing job is control, not investigation.** The IC operates a while-loop against the incident:
```
While incident is ongoing:
    1. Check in with each lead: status, new info, roadblocks, personnel needs, fatigue level
    2. Update status documents and dashboards
    3. Ensure executives, legal, and PR are appropriately informed
    4. Identify tasks that can run in parallel
    5. Assess whether enough information exists for remediation decisions
    6. Ask: is the incident over?
Loop.
```
If the IC is performing hands-on tasks, the incident is unmanaged. Step back and assign the task to the OL.
### Step 4 — Establish operational security
Before investigation begins, lock down the communication and investigation environment. Operational security means keeping your response activity secret from the attacker — and from tools that might inadvertently expose it.
**Why operational security cannot wait:** An attacker who detects that you have discovered their presence has two options: go quiet (hiding remaining footholds) or destroy as much of your environment as possible on the way out. Either outcome is bad. Every action your team takes should be evaluated against the question: what would a shrewd attacker conclude if they observed this action?
**Operational security setup checklist:**
- [ ] **New communication channels:** Stand up fresh chat rooms and document stores on infrastructure the attacker cannot access. If the email server or primary chat system may be compromised, do not use it to coordinate the response. Build a new temporary cloud environment with accounts not associated with your organization; distribute clean machines (e.g., Chromebooks) for responders to use.
- [ ] **Clean machines only:** Do not use workstations that may be compromised for investigation work. Log collection and analysis must happen from trusted endpoints.
- [ ] **No direct login to compromised servers:** Logging into a compromised machine with administrative credentials hands those credentials to the attacker. Use pre-deployed remote forensic agents or key-based access methods to collect artifacts without revealing credentials.
- [ ] **Domain obfuscation in all written artifacts:** Write attacker-controlled domains as `example[dot]com` rather than `example.com`. Many collaboration tools (chat clients, spreadsheets, document editors) automatically fetch URLs they detect: a chat message containing `example.com/malware_c2.gif` may silently contact the attacker's command-and-control server, while `example[dot]com/malware_c2.gif` will not. Make this a habit in all documents, even outside incidents.
- [ ] **No port scans or domain lookups against attacker infrastructure:** These actions appear in the attacker's logs as unusual and alert them to your investigation. Do not interact with attacker command-and-control systems.
- [ ] **No premature account lockouts or password resets:** Locking a compromised account before you understand the full scope signals your discovery and may cause the attacker to destroy evidence or pivot to undetected footholds.
- [ ] **No taking systems offline before scoping is complete:** Taking systems offline prematurely destroys forensic evidence and reveals your timeline to the attacker.
- [ ] **Brief every team member explicitly on confidentiality rules:** People do not know information is confidential unless you tell them. Each responder should review and acknowledge the operational security guidelines before starting work.
**The one exception — imminent risk:** If a system is so critical and the vulnerability so easily exploited (e.g., Shellshock-class) that vital data, systems, or lives are at immediate risk, the cost of making your discovery obvious is outweighed by stopping ongoing harm. This decision belongs to executives with IC advisory input, not the IC alone.
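The domain obfuscation habit from the checklist can be partly automated before text is pasted into shared tools. The sketch below is a rough helper, assuming a simple regular expression is good enough to spot domain-like tokens; the function name and pattern are hypothetical, and it will also rewrite harmless dotted names such as file names.

```python
import re

# Matches dotted tokens that end in an alphabetic label of 2+ letters.
_DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def defang(text: str) -> str:
    """Replace the dots in domain-like tokens with [dot]."""
    return _DOMAIN_RE.sub(lambda m: m.group(0).replace(".", "[dot]"), text)


note = defang("Beacon to c2.example.com every 60s.")
```

The helper is idempotent, so running it twice over the same notes is safe. It does not handle raw IP addresses, which would need a separate rule.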
### Step 5 — Run the forensic investigation loop
Digital forensics means backtracking through each phase of the attack to reconstruct the attacker's steps. The goal is to identify all affected parts of the organization and understand the full attack path.
**Forensic investigation methods:**
- **Forensic imaging:** Create read-only copies (with checksums) of storage from compromised systems before touching them. Preserves evidence for potential legal proceedings.
- **Memory imaging:** Capture system memory to recover process trees, running executables, and passwords to encrypted files that attackers may have used.
- **File carving:** Recover deleted files from disk — many operating systems only unlink filenames rather than zeroing disk contents. Attackers who delete logs may leave recoverable artifacts.
- **Log analysis:** Reconstruct event sequences from system and network logs to determine who talked to what and when.
- **Malware analysis:** Analyze attacker tools to determine capabilities, communication patterns, and indicators of compromise (IOCs) that the hunting sub-track can search for across the environment.
**The forensic timeline** is the central artifact of the investigation: a chronologically ordered list of events proving correlation and causation of attacker activity. Build it continuously throughout the investigation.
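A forensic timeline can be kept as a merged, time-ordered list of events from every evidence source. This is a minimal sketch, assuming all timestamps have already been normalized to one time zone; the `TimelineEvent` fields and the sample events are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    when: datetime      # normalize to UTC before merging sources
    source: str         # which log or tool produced the evidence
    system: str
    description: str


def build_timeline(*event_sources):
    """Merge events from all sources into one chronological list."""
    merged = [event for source in event_sources for event in source]
    return sorted(merged, key=lambda e: (e.when, e.source))


# Hypothetical evidence from three separate sources.
edr_events = [
    TimelineEvent(datetime(2024, 3, 6, 14, 30), "edr", "workstation-17",
                  "unsigned binary executed"),
]
file_share_events = [
    TimelineEvent(datetime(2024, 3, 6, 15, 5), "smb-log", "file-share-2",
                  "bulk read from workstation-17 account"),
]
mail_events = [
    TimelineEvent(datetime(2024, 3, 6, 14, 2), "mail-gw", "workstation-17",
                  "phishing attachment delivered"),
]
timeline = build_timeline(edr_events, file_share_events, mail_events)
timeline_sources = [e.source for e in timeline]
```

Rebuilding the list each time new evidence arrives keeps the ordering consistent as the investigation grows.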
**Shard the investigation into three parallel sub-tracks** when you have sufficient staff:
```
Forensics group → finds artifacts on known-compromised systems
↓
Reversing group → analyzes attacker tools, extracts IOCs
↓
Hunting group → searches all systems for IOC fingerprints
↓
(feeds new leads back to Forensics group)
```
The OL maintains the tight feedback loop between these three groups. Each lead discovered by the forensics or hunting groups becomes a *pivot point*: a new question that spawns a fresh forensic investigation. For example, once the team discovers an attack progressed from a workstation to a file share, the file share becomes the subject of a new forensic investigation.
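The pivot-point loop behaves like a worklist: each examined system may add new systems to the queue. The sketch below models that with a breadth-first walk, assuming a `find_leads` callback that stands in for the human forensic work on one system; all names and the sample findings are hypothetical.

```python
from collections import deque


def investigate_all(initial_systems, find_leads):
    """Examine systems breadth-first, following each newly discovered lead.

    find_leads(system) returns the systems that the evidence on
    `system` points to. Each system is examined at most once.
    """
    queue = deque()
    seen = set()
    for system in initial_systems:
        if system not in seen:
            seen.add(system)
            queue.append(system)
    examined = []
    while queue:
        system = queue.popleft()
        examined.append(system)
        for lead in find_leads(system):
            if lead not in seen:
                seen.add(lead)
                queue.append(lead)
    return examined


# Hypothetical findings: the workstation leads to a file share, and so on.
FINDINGS = {
    "workstation-17": ["file-share-2"],
    "file-share-2": ["build-server", "workstation-17"],
    "build-server": [],
}
examined = investigate_all(["workstation-17"], lambda s: FINDINGS.get(s, []))
```

The `seen` set is what stops the team from re-opening an investigation on a system that a later lead points back to.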
**Parallelizing beyond forensics:** While investigation continues, assign available staff to tasks that can run ahead:
- If you may need to share forensic findings with law enforcement or external firms, assign someone to prepare a redacted and shareable indicators list now — your raw investigation notes will not be shareable later.
- Assign the RL to begin studying confirmed compromise areas and drafting the cleanup plan as findings accumulate.
- Assign someone to begin the postmortem structure if staff allows.
### Step 6 — Apply the lead-rate decline signal to decide ejection timing
The most difficult decision in a security incident is knowing when to stop investigating and start remediating. You will never know everything. The question is whether you know enough to successfully remove the attacker and protect the data they are after.
**The lead-rate decline signal:** Monitor the rate at which new investigative leads are being discovered. When the interval between new discoveries noticeably increases — when new leads are slowing the way popcorn slows down as a bag finishes popping — the picture is likely complete enough to act. You have probably learned all you need to remove the attacker and protect your data.
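The slowing of new leads can be approximated by comparing recent gaps between discoveries with earlier ones. This is a rough heuristic sketch, not a decision rule: the window size and slowdown factor are arbitrary assumptions, and the result should only inform the IC's judgment.

```python
from datetime import datetime
from statistics import mean


def lead_rate_is_declining(lead_times, recent=3, slowdown_factor=2.0):
    """Return True when recent gaps between leads are much longer than earlier ones.

    `recent` is how many of the latest gaps to compare; `slowdown_factor`
    is how many times longer they must be on average. Both are assumptions.
    """
    times = sorted(lead_times)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    if len(gaps) < 2 * recent:
        return False  # too few leads to compare two windows
    earlier, latest = gaps[:-recent], gaps[-recent:]
    return mean(latest) >= slowdown_factor * mean(earlier)


# Hypothetical discovery times: hourly at first, then spreading out.
day = datetime(2024, 3, 7)
slowing = [day.replace(hour=h) for h in (0, 1, 2, 3, 4, 8, 14, 22)]
steady = [day.replace(hour=h) for h in range(8)]
slowing_signal = lead_rate_is_declining(slowing)
steady_signal = lead_rate_is_declining(steady)
```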
Do not wait for certainty. The decision is: have we learned enough to successfully execute remediation? Prior to deciding, the IC should confirm:
- [ ] Does the RL have a complete cleanup plan covering all confirmed compromise areas?
- [ ] Has legal reviewed and approved proceeding to remediation?
- [ ] Do executives understand and approve the remediation plan?
- [ ] Has the CL prepared external communication drafts for any stakeholder notifications that may be required?
### Step 7 — Scale with rotation and hand off with a structured brief
No human can sustain peak performance through a long incident without fatigue degrading their judgment. Beyond 12 hours of continuous work, responders will start making more mistakes than progress — the law of diminishing returns applies to humans. Limit all shifts (including IC shifts) to no more than 12 continuous hours.
**Follow-the-sun rotation for multi-day incidents:** Split the response team into regional groups so fresh responders cycle onto the incident continuously. A team with Americas, Asia-Pacific, and European staff can staff the response 24 hours a day without anyone working unsustainable hours.
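A rotation like this can be drafted mechanically. The following is a minimal sketch, assuming three regional teams and 8-hour shifts; the function name, the region order, and the shift length are hypothetical, and only the 12-hour cap comes from the text.

```python
from datetime import datetime, timedelta

MAX_SHIFT_HOURS = 12  # shift cap stated in the text


def build_rotation(start, total_hours, regions, shift_hours=8):
    """Return (region, shift_start, shift_end) tuples covering the incident window."""
    if not regions:
        raise ValueError("at least one regional team is required")
    if not 0 < shift_hours <= MAX_SHIFT_HOURS:
        raise ValueError(f"shift_hours must be positive and at most {MAX_SHIFT_HOURS}")
    if len(regions) < 2 and total_hours > MAX_SHIFT_HOURS:
        raise ValueError("a single team cannot cover this window within the shift cap")
    end = start + timedelta(hours=total_hours)
    schedule = []
    cursor = start
    index = 0
    while cursor < end:
        shift_end = min(cursor + timedelta(hours=shift_hours), end)
        # Cycle through the regional teams so consecutive shifts go to different teams.
        schedule.append((regions[index % len(regions)], cursor, shift_end))
        cursor = shift_end
        index += 1
    return schedule


rotation = build_rotation(
    datetime(2024, 1, 1, 0, 0),
    total_hours=48,
    regions=["Asia-Pacific", "Europe", "Americas"],
)
for region, begin, finish in rotation:
    print(f"{region:<13} {begin:%a %H:%M} -> {finish:%a %H:%M}")
```

The sketch ignores time zones and local working hours; a real schedule would align each shift start with the region's daytime and leave room for the handover meeting.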
**Handover preparation — the outgoing IC should:**
1. Update all tracking documentation, evidence notes, and written records before the handover meeting.
2. Ask themselves: **"If I weren't handing this investigation over, what would I work on for the next 12 hours?"** The incoming IC must have an answer to this question before the handover meeting ends.
3. Run the handover meeting in this sequence:
   1. Delegate a note-taker (preferably from the incoming team).
   2. Summarize current incident state and investigation direction.
   3. Outline the tasks the outgoing IC would have worked on over the next 12 hours.
   4. Open discussion — all attendees.
   5. Incoming IC outlines their planned tasks for the next 12 hours.
   6. Incoming IC establishes the time(s) of the next sync meeting(s).
Each lead (OL, CL, RL) should deliver a corresponding handover to their incoming counterpart before the IC-level handover is complete.
**Maintaining morale during long incidents:**
The IC is responsible for the team's emotional state, not just technical progress. Incidents are stressful. Some engineers rise to the challenge; others find the combination of intensity and ambiguity deeply draining. Watch for:
- **Fatigue:** People who don't recognize their own need for rest still need to take breaks. Leaders may need to intervene.
- **Burnout signals:** Increasing cynicism or "this is unwinnable" statements from critical engineers. Address these directly and frankly: explain expectations, and offer a relief period if staffing allows.
- **Lead by example:** If you (as IC or OL) visibly take breaks and eat, the team will believe it is actually encouraged.
## Outputs
| Artifact | Description |
|----------|-------------|
| **Incident state document** | Current status, known compromise scope, assigned roles, open leads, decisions made |
| **Forensic timeline** | Chronologically ordered list of confirmed attacker actions with evidence citations |
| **Communication plan** | Stakeholder list, what each group knows, draft external notifications |
| **Remediation handoff package** | RL's cleanup plan covering all confirmed compromise areas, approved by legal and executives |
## Anti-Patterns
**Rushing:** The extra few seconds gained by acting before planning are eclipsed by the consequences of acting without a plan. Security response requires investigation before remediation.
**Semantic misunderstandings:** "We can turn the service back on when the attack is mitigated" means different things to different teams. To the product team: delete the malware and the system is safe. To the security team: the system is not safe until the attacker can no longer persist in or return to the environment. Always be explicit and overcommunicate. The responsibility for ensuring the intended message was received belongs to the communicator, not the listener.
**Hedging:** "We're pretty sure we found all the malware" is a weak answer. The IC needs actionable certainty about specific sub-systems: "We are confident about servers, NAS, email, and file shares; we are not confident about hosted systems because log visibility there is limited." Replace qualifiers with bounded scope statements.
**Overstuffed meetings:** If most attendees are listening rather than contributing, the invite list is too large. Incident leads attend syncs; everyone else works tasks and gets updates from their lead.
**Operational security failures:** Communicating about the incident over potentially compromised channels; logging into compromised servers with privileged credentials; downloading attacker malware from an investigation workstation; performing port scans or domain lookups against attacker infrastructure. Each of these actions may appear in attacker logs and reveal your investigation.
**Hero mode:** Having a small team work extra-long hours in "hero mode" is unsustainable and produces lower-quality work. Use it very sparingly, and never beyond 12 hours per shift.
**IC doing technical work:** If the IC is checking logs or performing forensics, no one is managing the incident. Step back.
## Key Principles
**Investigate before remediating.** Unlike a reliability outage (where the system won't resist being fixed), an attacker may be watching your actions. Premature remediation destroys evidence and may trigger destructive exit behavior.
**Declare explicitly.** Saying "this is an incident and I am the IC" is not redundant — it removes ambiguity under high stress when normal processes are bypassed.
**Operational security is a prerequisite, not an afterthought.** Set up clean channels and machines before investigation begins. Once a secret is lost, it is hard to regain.
**The IC's job is control, not investigation.** If the IC is doing hands-on work, the incident is unmanaged.
**The lead-rate decline signal is the ejection signal.** You will never know everything. Know enough to remove the attacker and protect the data they are after — then act.
**Shifts cap at 12 hours.** Fatigue produces more mistakes than progress. Rotation is not a luxury; it is a reliability control applied to humans.
## References
- *Building Secure and Reliable Systems* (Adkins, Beyer, Blankinship, Lewandowski, Oprea, Stubblefield; Google/O'Reilly, 2020)
- Chapter 17 "Crisis Management" (pp. 387–415):
  - "Is It a Crisis or Not? / Triaging the Incident" (pp. 388–390): three-org ransomware triage comparison
  - "Taking Command of Your Incident / Beginning Your Response" (pp. 391–393): IMAG declaration formula
  - "Establishing Your Incident Team" (pp. 393–394): IC/OL/CL/RL roles
  - "Operational Security" (pp. 394–397): OpSec setup, common mistakes, domain obfuscation
  - "The Investigative Process / Sharding the Investigation" (pp. 398–400): forensics, reversing, hunting sub-tracks; pivot points; forensic timeline
  - "Keeping Control of the Incident / Parallelizing the Incident" (pp. 401–402): IC while-loop, OODA loop, RL parallelization
  - "Handovers" (pp. 402–405): 12-hour shift cap, follow-the-sun rotation, handover meeting agenda, "12-hour question"
  - "Morale" (pp. 405–406): fatigue, burnout, lead-by-example
  - "Communications / Misunderstandings / Hedging / Meetings" (pp. 406–409): semantic misunderstanding example; hedging anti-pattern; kickoff and sync meeting agendas
  - "Putting It All Together" (pp. 410–414): end-to-end worked example (cloud service account compromise)
- Depends on: `incident-response-team-setup` (roles, severity/priority models, communications plan, playbooks assumed to exist before incident)
- Feeds into: remediation execution (Chapter 18), postmortem writing
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Building Secure and Reliable Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-incident-response-team-setup`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
Plan and execute a security change rollout across a service or fleet: classify the change into a time horizon (short / medium / long-term), triage affected s...
---
name: security-change-rollout-planning
description: |
Plan and execute a security change rollout across a service or fleet: classify the change into a time horizon (short / medium / long-term), triage affected systems by risk tier, select the appropriate rollout strategy with canarying and staged deployment, define communication strategy (internal and external), set rollback and success criteria, and produce a written rollout plan. Use when you need to respond to a zero-day vulnerability, roll out a security posture improvement, or drive an ecosystem or regulatory compliance change. Handles timeline disruption scenarios: accelerate when an exploit goes public, slow down when patch instability is detected, delay when embargo, external dependency, or limited blast radius dictates caution. Produces a rollout plan with timeline, per-tier risk triage, communication strategy, and explicit rollback criteria. Examples covered: Shellshock emergency patch, hardware security key (FIDO/WebAuthn) company-wide deployment, and Chrome HTTPS migration.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/building-secure-and-reliable-systems/skills/security-change-rollout-planning
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on: []
source-books:
- id: building-secure-and-reliable-systems
title: "Building Secure and Reliable Systems"
authors: ["Heather Adkins", "Betsy Beyer", "Paul Blankinship", "Piotr Lewandowski", "Ana Oprea", "Adam Stubblefield"]
chapters: [7]
tags:
- security
- change-management
- rollout-planning
- patch-management
- vulnerability-response
- zero-day
- incident-response
- compliance
- sre
- reliability
execution:
tier: 1
mode: hybrid
inputs:
- type: document
description: "Description of the security change: what is changing, why, which systems or users are affected, and any known deadline or embargo constraints"
- type: codebase
description: "Optional: infrastructure config, dependency manifests, or service inventory if available — used to identify affected systems and their tier"
tools-required: [Read, Write]
tools-optional: [Grep, Bash]
mcps-required: []
environment: "Can run from any project directory. Asks structured questions if system context is not provided. Produces a written rollout plan document."
discovery:
goal: "Produce a security change rollout plan: horizon classification, per-tier risk triage, phased timeline with success criteria, communication strategy, and rollback criteria"
tasks:
- "Classify the change into its time horizon: short-term (zero-day, hours–days), medium-term (posture improvement, weeks–months), or long-term (ecosystem/regulatory, months–years)"
- "Triage affected systems by risk tier: identify which systems are highest risk and must be patched first"
- "Design the rollout phases with canarying and staged deployment"
- "Define per-phase success criteria that must be met before proceeding"
- "Draft internal and external communication strategy and timeline"
- "Define rollback criteria and rollback mechanism for each phase"
- "Identify 'when plans change' scenarios and produce contingency branches"
audience:
roles: ["security-engineer", "site-reliability-engineer", "software-engineer", "platform-engineer", "tech-lead", "incident-commander"]
experience: "intermediate — assumes familiarity with release engineering basics and production system concepts"
triggers:
- "A new vulnerability (zero-day or CVE) has been disclosed and patch readiness must be assessed"
- "Planning a security posture improvement (e.g., new authentication mechanism, encryption upgrade)"
- "Responding to a regulatory or ecosystem change with a compliance deadline"
- "An ongoing rollout needs to be accelerated because an exploit became publicly available"
- "An in-flight security patch is causing instability and needs to be slowed or rolled back"
- "Deciding how to handle a vulnerability that is currently under embargo"
not_for:
- "Designing the security change itself (what to fix) — use adversary-profiling-and-threat-modeling for that"
- "Active incident response once an exploit is in progress — use incident response runbooks for that"
- "Compliance gap analysis or regulatory interpretation"
---
## When to Use
Use this skill when you have a security change to roll out and need a plan — not just a patch, but a structured rollout with appropriate speed, staged deployment, communication, and contingency handling.
The core decision the skill drives is: **how fast, and how?** Security changes impose a fundamental tradeoff: speed reduces exposure time but increases the risk of rolling out a broken patch that causes downtime or data loss. The right answer depends on whether the change is short-term (a zero-day requiring emergency response in hours), medium-term (a posture improvement that can proceed in weeks), or long-term (an ecosystem migration spanning months to years).
Invoke this skill when:
- A CVE or zero-day has been disclosed and you need to decide how quickly to patch which systems
- You are rolling out a new security control (new authentication method, encryption upgrade, policy enforcement) and need a phased plan
- A regulatory or compliance deadline requires a multiyear migration across systems you do not fully control
- An ongoing rollout is being disrupted — by an exploit going public, a patch causing errors, or an embargo being broken early
Do not use this skill to design what the fix should be (that is threat modeling) or to manage an active exploit in real time (that is incident response).
---
## Context and Input Gathering
Before producing the rollout plan, gather the following:
1. **Change description**: What is changing? Is it a patch to a library or binary, a configuration change, a new mandatory control, or a protocol migration?
2. **Triggering event**: Is this reactive (zero-day, active exploit, regulatory deadline) or proactive (internal security posture goal)?
3. **Embargo status**: Is the vulnerability currently under embargo? Is there a public announcement date? Has the embargo been broken?
4. **Affected systems**: Which services, hosts, binaries, or user populations are affected? Do you have an inventory, or must one be built?
5. **System diversity**: Are affected systems standardized (same OS, same distribution, same dependency version) or heterogeneous (different versions, inherited infrastructure, nonstandard configurations)?
6. **Team dependencies**: Does the rollout require other teams to act? Are you dependent on a vendor releasing a patch, or on client systems being updated before or after your server?
7. **Patch availability**: Is a patch or remediation available? Has it been verified against the actual vulnerability? Does it address all known variants?
8. **Existing release infrastructure**: Do you have automated testing, canarying, staged rollout tooling? Are your dependencies up to date and frequently rebuilt? Do you use containers or microservices that make patching independent?
9. **Success metrics**: How will you know the rollout succeeded? Is there instrumentation to detect unpatched systems? Can you measure whether the vulnerability is still present?
If a system inventory is not available, ask the engineer to describe: the three highest-risk systems that must be patched first, the number of systems in scope, and whether nonstandard or inherited infrastructure exists.
---
## Process
### Step 1 — Classify the Change Into Its Time Horizon
**WHY**: The time horizon determines everything that follows — the planning depth, rollout speed, communication strategy, and success criteria. A plan designed for a week-long rollout is wrong for a same-day emergency. A plan designed for a same-day emergency applied to a years-long ecosystem migration wastes time and breaks systems unnecessarily.
Classify the change using this taxonomy:
| Horizon | Trigger type | Target timeline | Primary risk |
|---|---|---|---|
| **Short-term** | Zero-day vulnerability, active exploit, critical CVE | Hours to days | Moving too slowly; exploit window grows |
| **Medium-term** | Security posture improvement, new control, proactive hardening | Weeks to months | Insufficient stakeholder buy-in; long tail of unpatched systems |
| **Long-term** | Ecosystem change, regulatory requirement, protocol migration | Months to years | Loss of momentum; leadership support eroding; documentation rot |
Classification inputs:
- **Severity**: Is the vulnerability being actively exploited in the wild? Is it critical (remote code execution, key disclosure) or informational?
- **Deadline**: Is there a compliance date, an embargo lift date, or a scheduled public announcement?
- **Dependent parties**: Does rollout require waiting for a vendor, for clients to patch first, or for other teams to implement prerequisites?
- **Sensitivity**: Is this a mandatory enforcement or a gradual opt-in? Can it roll out team by team?
Zero-day response almost always classifies as short-term. Before prioritizing same-day zero-day response, confirm that the "top hits" — the highest-impact vulnerabilities from recent years — are already patched. A zero-day response against a system that is unpatched for known critical vulnerabilities is the wrong priority order.
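The taxonomy and classification inputs above can be approximated as a small decision function. This is a sketch under stated assumptions: the `ChangeProfile` fields, the rule ordering, and the 180-day cutoff between medium-term and long-term are all hypothetical choices made for illustration, not rules from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeProfile:
    actively_exploited: bool             # exploit observed in the wild
    critical_severity: bool              # e.g. remote code execution, key disclosure
    days_until_deadline: Optional[int]   # compliance or embargo-lift date, if any
    depends_on_external_parties: bool    # vendors, clients, or the wider ecosystem


def classify_horizon(change: ChangeProfile, long_term_days: int = 180) -> str:
    """Map a change onto short-term, medium-term, or long-term (assumed rules)."""
    if change.actively_exploited or change.critical_severity:
        return "short-term"
    # Assumption: external dependencies with no deadline, or a distant one,
    # indicate an ecosystem-style migration.
    if change.depends_on_external_parties and (
        change.days_until_deadline is None
        or change.days_until_deadline > long_term_days
    ):
        return "long-term"
    return "medium-term"


zero_day = ChangeProfile(True, True, None, False)
new_auth_control = ChangeProfile(False, False, 30, False)
protocol_migration = ChangeProfile(False, False, 730, True)
for change in (zero_day, new_auth_control, protocol_migration):
    print(classify_horizon(change))
```

A function like this is mainly useful for forcing the inputs to be written down; borderline cases still need a human decision.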
### Step 2 — Triage Affected Systems by Risk Tier
**WHY**: Not all systems carry the same risk from a given vulnerability. Treating a low-risk internal analytics server the same as a publicly exposed TLS endpoint wastes capacity and may slow the critical path. Triage directs limited remediation capacity to the highest-leverage systems first.
For each system or system group in scope, assess:
- **Exposure**: Is this system directly accessible from the internet, from partners, or only internally? Does it handle the specific protocol, library, or configuration that is vulnerable?
- **Exploitability in context**: Even if the vulnerability is critical in general, can it be triggered against this system? (A bash vulnerability doesn't affect a Go binary that never invokes bash.)
- **Blast radius**: What is the impact if this system is compromised? Does it hold customer data, encryption keys, authentication credentials, or production secrets?
- **Patchability**: Is the system running a standard distribution that already has a patched package? Does it require manual intervention, custom build, or vendor coordination?
Assign each system to a tier:
| Tier | Definition | Rollout approach |
|---|---|---|
| **Tier 1 — Critical** | Externally exposed, high blast radius, actively exploitable | Patch first; use accelerated rollout; accept faster-than-normal validation |
| **Tier 2 — High** | Internally exposed or high blast radius but not directly reachable | Patch in next wave; standard validation |
| **Tier 3 — Standard** | Low blast radius, standardized infrastructure | Patch via normal automated rollout |
| **Tier 4 — Nonstandard** | Inherited, heterogeneous, or vendor-dependent systems | Manual intervention; send per-team notifications with explicit action items |
Shellshock pattern: at Google, Tier 1 production servers (standard distribution, easy automated rollout) were patched first and fastest. Tier 2 workstations were easy to patch quickly but required different tooling. Tier 4 nonstandard servers required manual notifications with explicit follow-up actions per team.
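The tier table can be applied to an inventory with a few boolean attributes per system. The sketch below is one possible reading of the table: the `SystemProfile` fields and the precedence of the checks (nonstandard systems go to Tier 4 regardless of exposure, because they need manual handling) are assumptions, and the example hosts are invented.

```python
from dataclasses import dataclass


@dataclass
class SystemProfile:
    name: str
    externally_exposed: bool      # reachable from the internet or partners
    exploitable_in_context: bool  # the vulnerable component is actually reachable here
    high_blast_radius: bool       # holds customer data, keys, credentials, or secrets
    standardized: bool            # standard distribution with automated patching


def assign_tier(system: SystemProfile) -> int:
    """Return the risk tier (1 to 4) for a system, using assumed precedence rules."""
    if not system.standardized:
        return 4  # nonstandard: manual intervention and per-team notification
    if (system.externally_exposed and system.high_blast_radius
            and system.exploitable_in_context):
        return 1
    if system.high_blast_radius and system.exploitable_in_context:
        return 2
    return 3


inventory = [
    SystemProfile("tls-frontend", True, True, True, True),
    SystemProfile("secrets-store", False, True, True, True),
    SystemProfile("analytics-batch", False, False, False, True),
    SystemProfile("legacy-appliance", True, True, True, False),
]
tiers = {s.name: assign_tier(s) for s in inventory}
for name, tier in sorted(tiers.items(), key=lambda item: item[1]):
    print(f"Tier {tier}: {name}")
```

An exposed, high-impact nonstandard system would land in Tier 4 under these rules; some teams may prefer to track urgency and patch method as two separate fields instead.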
### Step 3 — Design the Rollout Phases
**WHY**: Rolling a change out all at once maximizes exposure risk if the patch is broken or has unexpected interaction effects. A gradual rollout — with instrumentation for canarying — lets you detect problems before they affect your entire infrastructure. This principle holds even on accelerated timelines.
For each change, define rollout phases with explicit gates between them:
**Standard phased rollout structure**:
1. **Canary phase**: Apply the change to a small, monitored subset (e.g., 1–5% of systems or a single team). Run long enough to detect error rate changes, performance regressions, or unexpected behavior. For a zero-day, this phase may be compressed to minutes rather than days.
2. **Staged expansion**: Expand to progressively larger population segments, with monitoring confirmation between each stage. Prefer expansion by logical group (team, system tier, geographic region) rather than random percentage — this makes rollback bounded.
3. **Full rollout**: Apply to remaining systems. For Tier 4 nonstandard systems, issue per-team action items and track completion separately from the automated rollout.
**Per-phase definition**:
- Scope: which systems or users are included
- Duration: minimum time at this phase before advancing
- Success criteria: what must be true before advancing (error rate, patch confirmation, monitoring signal)
- Rollback trigger: what would cause an immediate rollback at this phase
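The per-phase definition maps naturally onto a small record type. The following sketch assumes a hypothetical `Phase` structure and `plan_phases` helper, with placeholder group names and a placeholder 60-minute minimum duration; it expands by logical group, as recommended above, rather than by random percentage.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phase:
    name: str
    scope: List[str]            # systems or groups included in this phase
    min_duration_minutes: int   # minimum time at this phase before advancing
    success_criteria: str
    rollback_trigger: str


def plan_phases(canary, expansion_groups, remaining, min_duration_minutes=60):
    """Build a canary phase, one phase per logical group, then a full rollout phase."""
    criteria = "error rate within threshold and patch confirmed on all systems in scope"
    trigger = "error rate above threshold or on-call page attributable to the change"
    phases = [Phase("canary", [canary], min_duration_minutes, criteria, trigger)]
    for i, group in enumerate(expansion_groups, start=1):
        phases.append(
            Phase(f"expansion-{i}", [group], min_duration_minutes, criteria, trigger)
        )
    phases.append(
        Phase("full-rollout", list(remaining), min_duration_minutes, criteria, trigger)
    )
    return phases


plan = plan_phases(
    canary="team-payments-canary",
    expansion_groups=["region-eu", "region-us"],
    remaining=["region-apac", "internal-tools"],
)
for phase in plan:
    print(phase.name, phase.scope, f"{phase.min_duration_minutes} min")
```

For a zero-day, the same structure applies with `min_duration_minutes` compressed to whatever validation window the team is willing to accept.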
**Change design properties that make every phase safer** (implement these before rollout, not during):
- **Incremental**: Keep the change as small and standalone as possible. Do not bundle security fixes with unrelated refactoring.
- **Documented**: Record the "how" and "why," the systems and teams in scope, lessons from any proof of concept, the rationale for key decisions, and points of contact.
- **Tested**: Cover the change with unit tests and, where possible, integration tests. Complete a peer review.
- **Isolated**: Use feature flags to isolate the change. The system should exhibit no behavior change with the flag off.
- **Qualified**: Use your normal binary release process — proceed through qualification stages before reaching production traffic.
- **Staged**: Instrument for canarying. You should be able to observe differences in behavior before and after your change.
### Step 4 — Set Per-Phase Success Criteria and Rollback Criteria
**WHY**: Advancing through phases without explicit criteria leads to either excessive caution (delayed rollout, extended exposure) or insufficient caution (rolling a broken patch to the full fleet). Pre-defined criteria remove the need for judgment calls under pressure.
**Success criteria** (must be satisfied before advancing to next phase):
- Error rate: less than X% increase in errors, latency, or failure rate after the change
- Patch confirmation: instrumentation or scanning confirms affected systems have the updated version
- Monitoring signal: no unexpected alerts or on-call pages directly attributable to the change
- User feedback: for user-facing changes, no significant increase in support tickets or failure reports
**Rollback criteria** (trigger immediate rollback):
- Error rate exceeds threshold (define the number: e.g., 5% increase in 5xx responses)
- On-call pages attributable to the change
- Patch introduces a regression that is worse than the original vulnerability's blast radius
- Critical dependency fails after the patch is applied
**Rollback mechanism**: Define how rollback works before the rollout begins. For binary patches: roll back to the previous version via the same release pipeline. For configuration changes: revert the config and push. For user enrollment changes: define the off-ramp (e.g., allow users to use the legacy path temporarily). If rollback requires significant manual effort, flag this as a risk before the rollout begins.
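Pre-defined criteria can be encoded so that the advance, hold, or roll back decision is mechanical. This sketch makes several assumptions: the `PhaseMetrics` fields are hypothetical, the 5% threshold is interpreted as an absolute increase in error rate (five percentage points), and the minimum duration and patch-confirmation thresholds are placeholders to be replaced with the numbers in your own plan.

```python
from dataclasses import dataclass


@dataclass
class PhaseMetrics:
    baseline_error_rate: float  # fraction of failing requests before the change
    current_error_rate: float   # fraction of failing requests after the change
    patched_fraction: float     # share of in-scope systems confirmed patched
    attributable_pages: int     # on-call pages traced to the change
    minutes_in_phase: int


def gate_decision(m, min_minutes=60, max_error_increase=0.05, min_patched=0.99):
    """Return 'rollback', 'hold', or 'advance' for the current phase."""
    error_increase = m.current_error_rate - m.baseline_error_rate
    # Rollback criteria are checked first so they always take precedence.
    if error_increase > max_error_increase or m.attributable_pages > 0:
        return "rollback"
    # Success criteria: enough soak time and the patch confirmed in place.
    if m.minutes_in_phase < min_minutes or m.patched_fraction < min_patched:
        return "hold"
    return "advance"


healthy = PhaseMetrics(0.010, 0.012, 1.0, 0, 90)
too_early = PhaseMetrics(0.010, 0.012, 1.0, 0, 15)
broken = PhaseMetrics(0.010, 0.100, 1.0, 0, 90)
for metrics in (healthy, too_early, broken):
    print(gate_decision(metrics))
```

Checking rollback conditions before success conditions means a phase that has soaked long enough but is paging on-call still rolls back rather than advancing.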
### Step 5 — Plan Internal and External Communication
**WHY**: A technically sound rollout can fail due to communication gaps. Internal teams that are not informed cannot arrange coverage and may escalate incorrectly. Externally, users and partners who are surprised by behavior changes lose trust. Early and clear communication also prevents PR problems from becoming the bottleneck during a live vulnerability response.
**Internal communication**:
- Announce the change and its timeline to all affected teams before the rollout begins, not during
- For Tier 4 nonstandard systems: send per-team notifications with explicit action items, deadlines, and escalation contacts — do not rely on them to discover the requirement from a general announcement
- For embargoed vulnerabilities: communicate internally under NDA or internal secrecy protocols as permitted; plan internal communications so they are ready to send the moment the embargo lifts
- Provide a single dashboard or tracking sheet that shows patch progress by system tier — this allows leadership reporting and team self-service progress checking
- Establish a feedback loop: define how teams report completion and how issues are escalated
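The per-tier tracking view mentioned above can start as a simple aggregation over the system inventory. This is a minimal sketch that assumes the inventory is available as `(name, tier, patched)` tuples; the function name and the sample hosts are hypothetical.

```python
from collections import defaultdict


def patch_progress_by_tier(systems):
    """Summarize patch progress per tier.

    systems: iterable of (name, tier, patched) tuples.
    Returns {tier: (patched_count, total_count, patched_fraction)}.
    """
    totals = defaultdict(lambda: [0, 0])  # tier -> [patched, total]
    for _name, tier, patched in systems:
        totals[tier][1] += 1
        if patched:
            totals[tier][0] += 1
    return {
        tier: (done, total, done / total)
        for tier, (done, total) in sorted(totals.items())
    }


inventory = [
    ("web-1", 1, True),
    ("web-2", 1, True),
    ("auth-1", 2, True),
    ("auth-2", 2, False),
    ("legacy-box", 4, False),
]
progress = patch_progress_by_tier(inventory)
for tier, (done, total, fraction) in progress.items():
    print(f"Tier {tier}: {done}/{total} patched ({fraction:.0%})")
```

Keeping Tier 4 as its own row makes the manually tracked long tail visible instead of letting it hide inside an overall percentage.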
**External communication**:
- Prepare external communications (to customers, partners, and the public) before the rollout begins, not during — PR approval cycles are slow and incompatible with emergency response timelines
- For zero-day responses: draft the customer-facing message, the partner notification, and the public statement ahead of time. Be ready to send at embargo lift.
- For medium-term changes affecting users: give ample advance notice. Use all available outreach channels (email, developer mailing lists, help forums, partner relationships). Overcommunication maximizes reach.
- For long-term ecosystem changes: tailor outreach regionally and by stakeholder group. Data-driven targeting (identify which regions are lagging, which stakeholder types need different messaging) improves coverage.
- Tie the change to business value whenever possible. Security-only messaging fails to motivate organizations that are not already security-first. Link the change to business benefits (performance, new features, compliance relief, liability reduction).
### Step 6 — Produce the Rollout Plan Document
**WHY**: A written plan enables handoff continuity (individuals leave or rotate), provides a reference during the rollout to avoid judgment calls under pressure, and serves as the audit trail for leadership reporting and post-mortems.
The rollout plan document should include:
```
Security Change Rollout Plan
=============================
Change: [name / CVE / description]
Horizon: [short / medium / long-term]
Author(s): [points of contact]
Status: [draft / approved / in-progress / complete]
## Summary
[1–3 sentences: what is changing, why, and the target completion date]
## Affected Systems
[Table: System/tier | Risk tier | Patch method | Owner | Estimated effort]
## Phase Plan
[For each phase:]
Phase N — [name]
Scope: [systems or user population]
Timeline: [start date / duration]
Success criteria: [specific measurable conditions]
Rollback trigger: [specific conditions]
Rollback mechanism: [how to execute rollback]
## Communication Plan
Internal: [who is notified, when, via what channel]
External: [who is notified, when, via what channel]
Embargo handling: [if applicable]
## Contingency Branches
Accelerate trigger: [e.g., exploit goes public]
→ Accelerate action: [reorder tiers, prioritize Tier 1 only, accept reduced validation window]
Slow-down trigger: [e.g., patch causing errors above threshold]
→ Slow-down action: [pause expansion, rollback canary, debug and reissue]
Delay trigger: [e.g., vendor patch not available, external dependency not ready]
→ Delay action: [apply mitigations, set new target date, notify stakeholders]
## Success Metrics
[How will you confirm the rollout is complete and the vulnerability is mitigated?]
```
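The contingency branches in the template can also be kept as a lookup table, so that when a trigger fires the prepared action is retrieved rather than improvised. The event keys below are hypothetical labels; the actions are taken from the template above.

```python
# Hypothetical event keys mapped to (branch, prepared action) from the template.
CONTINGENCY_BRANCHES = {
    "exploit_public": (
        "accelerate",
        "reorder tiers, prioritize Tier 1 only, accept reduced validation window",
    ),
    "errors_above_threshold": (
        "slow down",
        "pause expansion, roll back canary, debug and reissue",
    ),
    "vendor_patch_unavailable": (
        "delay",
        "apply mitigations, set new target date, notify stakeholders",
    ),
}


def select_branch(event):
    """Return (branch, action) for a trigger event, defaulting to the current plan."""
    default = ("continue", "proceed with the current phase plan")
    return CONTINGENCY_BRANCHES.get(event, default)


for event in ("exploit_public", "errors_above_threshold", "routine_checkin"):
    branch, action = select_branch(event)
    print(f"{event}: {branch} -> {action}")
```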
---
## Key Principles
**Classify before planning.** The time horizon is not an output of planning — it is the input. Get the horizon wrong and every downstream decision (speed, validation, communication) is calibrated incorrectly.
**Patch the top hits first.** Before prioritizing same-day zero-day response, confirm that known high-impact vulnerabilities from recent years are already remediated. Zero-day response against a system with unpatched known criticals is the wrong priority order.
**Gradual rollout applies even on emergency timelines.** Even for a zero-day, roll out in phases with canarying. The risk of deploying a broken patch — causing downtime across the fleet — can exceed the risk of the vulnerability itself during the patching window. On an accelerated timeline, "gradual" means hours rather than days, not "skip canarying."
**Standardization is leverage.** Rollout speed scales with how standardized your system fleet is. A single OS distribution and package format means a verified patch can be automated immediately. Heterogeneous infrastructure (different distributions, different versions, inherited systems) requires manual intervention that does not scale. Invest in standardization before the emergency.
**When plans change, do not panic — but do not be surprised.** Embed contingency branches in the plan from the start. Define in advance: what triggers acceleration, what triggers a slowdown, what triggers a full delay. When the trigger fires, execute the prepared branch rather than improvising.
**Overcommunicate externally, especially for long-term changes.** Organizations resistant to change (including security changes) respond to business-aligned messaging more than security messaging alone. Tie the change to business benefits. Use every available outreach channel. Expect a long tail of non-compliant systems and plan for it explicitly.
**Architecture as rollout enabler.** The ease of rolling out any security change is determined by architectural decisions made long before the change. Frequent builds and rebuilds ensure the latest patched dependencies are always available. Containers and immutable images allow patch-as-code-rollout. Microservices scope the blast radius of each change and enable independent rollout per service. Automated testing gives confidence to push faster. Build these before the emergency — they are not alternatives to a rollout plan, they are what makes any rollout plan executable.
---
## Examples
### Example 1 — Short-Term: Shellshock Zero-Day (Google, September 2014)
**Scenario**: On the morning of September 24, 2014, Google Security learned of a publicly disclosed, remotely exploitable vulnerability in `bash` that trivially allowed code execution on affected systems. Exploits appeared in the wild the same day.
**Horizon classification**: Short-term. Actively exploited the same day. Emergency response timeline.
**Triage**:
- Tier 1 (low risk to patch, automated rollout): Majority of production servers. Patched fastest, with reduced validation window once servers passed sufficient validation and testing.
- Tier 2 (higher risk, easy to patch quickly): Workstations. Patched quickly with different tooling.
- Tier 4 (high risk, nonstandard): Small number of nonstandard servers and inherited infrastructure. Required manual intervention. Per-team notifications sent with explicit action items to enable tracking progress at scale.
**Key execution decisions**:
- Parallel workstream: software was developed to detect vulnerable systems within the network perimeter, then integrated into standard security monitoring as a permanent capability.
- Communication: internal communication to all potentially affected teams; external communication to partners and customers prepared as early in the response as possible to avoid PR bottlenecks.
**Lessons**:
- Standardize software distribution to the greatest extent possible. Non-standard distributions made patching harder and slower.
- Use public distribution standards. Patches in standard format can be applied immediately without rework.
- Ensure your automated rollout mechanism can accelerate for emergency changes without skipping functional validation entirely.
- Have monitoring to track rollout progress and identify unpatched systems in real time.
- Prepare external communications before the response, not during it.
### Example 2 — Medium-Term: Hardware Security Keys (FIDO/WebAuthn) Company-Wide Deployment (Google, 2013–2015)
**Scenario**: One-time passwords (OTPs) were susceptible to phishing interception. Starting in 2011, Google evaluated stronger two-factor authentication methods. By 2013, hardware security keys (FIDO/WebAuthn) were selected and rollout began. Full OTP deprecation was completed in 2015.
**Horizon classification**: Medium-term. Proactive security posture improvement. Gradual, multi-year rollout.
**Rollout strategy**:
- Initially opt-in (easiest use case first): users who enrolled voluntarily reduced total authentication time — keys were simpler than OTPs, removing the need to type a code.
- Self-service enrollment: users registered via a web-based enrollment portal. Physical distribution at office IT helpdesks for first adopters and those needing assistance. Keys trusted on first use (TOFU).
- Application migration: systems using centrally generated OTPs were migrated application by application, tracked by monitoring which applications were still generating OTP requests. Long-tail applications were handled by working directly with vendors to add support.
- Enforcement: in 2015, users received reminders when they used OTPs instead of security keys, then eventually OTP access was blocked.
- Exception handling: for cases where security keys could not yet be used (mobile device setup), a web-based OTP generator was created requiring the user to authenticate with their security key first — a degraded but auditable fallback.
**Lessons**:
- Ensure the solution works for all users, including those with accessibility requirements.
- Make the change easier to adopt than the prior method — a self-service change with lower friction succeeds faster than a mandated change with friction.
- Make the feedback loop on noncompliance fast (minutes to hours, not days).
- Track progress and identify the long tail of use cases from the start, not after completion.
- Rotate encryption keys and other secrets regularly as routine practice, so that emergency key rotation (e.g., after Heartbleed) is not a Herculean effort.
### Example 3 — Long-Term: HTTPS Migration (Chrome / Web Ecosystem, 2017–2019+)
**Scenario**: HTTPS provides confidentiality and integrity guarantees critical to the web. The Chrome team drove adoption across the entire web ecosystem through a combination of incentives, warnings, and coordination with standards bodies, certificate authorities, and other browser vendors.
**Horizon classification**: Long-term. External ecosystem change. Required new systems to be built. Target timeline: years.
**Rollout strategy**:
- Data-driven targeting: gathered data on HTTPS usage worldwide, surveyed end users on browser UI perception, measured site behavior to identify features that could be restricted to HTTPS, used case studies to understand developer concerns.
- Graduated warnings: Chrome warnings for insecure pages rolled out gradually over years, with user behavior telemetry monitored to confirm no unexpected negative effects (e.g., retention drop, engagement decline).
- Overcommunication: every available outreach channel used before each change — blogs, developer mailing lists, press, help forums, partner relationships. Regional tailoring (e.g., additional focus on Japan when adoption was lagging due to slow uptake by top sites).
- Business incentive linkage: enabled HTTPS-only features (Service Workers, push notifications, background sync) to give security-resistant organizations a business reason to migrate, not just a security reason.
- Industry consensus building: coordinated with other browser vendors and certificate authorities so HTTPS became an industry-wide trend, not a Chrome-specific imposition.
**Result**: The share of Chrome users' browsing time spent on HTTPS sites rose from approximately 70% (Windows) and 37% (Android) to over 90% on both platforms.
**Lessons**:
- Understand the ecosystem before committing to a strategy. Research stakeholder groups quantitatively and qualitatively before designing the rollout.
- Overcommunicate to maximize reach. Use every available channel.
- Tie security changes to business incentives. Security-only messaging fails to move organizations that are not already security-driven.
- Build industry consensus where possible. A change that is seen as an industry-wide trend is far easier to drive than one seen as a single vendor's imposition.
- Plan for 80–90% adoption as a meaningful success if 100% is not required — achieving near-complete adoption of a long-tail change is itself a significant risk reduction.
---
## References
- [three-horizon-classification.md](references/three-horizon-classification.md) — Full taxonomy: short / medium / long-term change triggers, timelines, planning depth, and success criteria
- [system-triage-matrix.md](references/system-triage-matrix.md) — Risk tier classification (Tier 1–4): exposure, exploitability, blast radius, patchability definitions
- [rollout-phase-template.md](references/rollout-phase-template.md) — Per-phase template: scope, duration, success criteria, rollback trigger, rollback mechanism
- [communication-strategy-guide.md](references/communication-strategy-guide.md) — Internal and external communication playbook for each time horizon; embargo handling guidance
- [when-plans-change.md](references/when-plans-change.md) — Contingency branches: accelerate, slow down, and delay triggers with corresponding tactics
- [architecture-as-enabler.md](references/architecture-as-enabler.md) — How frequent builds, containers, microservices, and automated testing reduce rollout friction for all future security changes
Cross-references:
- `adversary-profiling-and-threat-modeling` — determine what you are defending against before designing the change
- `resilience-and-blast-radius-design` — design system architecture to reduce the blast radius of any given change or compromise
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Building Secure and Reliable Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
---
name: secure-deployment-pipeline
description: |
Secure a software deployment pipeline against supply chain attacks from benign insiders (mistakes), malicious insiders, and external attackers: map pipeline threats to mitigations using the three-adversary threat model, generate binary provenance requirements for each build stage, define provenance-based deployment policies with choke-point enforcement, design verifiable build architecture (trusted build service, rebuild service, or hybrid), and produce a staged hardening roadmap with breakglass controls. Use when assessing supply chain security for a CI/CD pipeline, implementing binary provenance to trace artifact origins, designing deployment policies that verify what is deployed rather than who initiated deployment, hardening build infrastructure against insider threats, or establishing breakglass procedures that remain auditable. Requires secure-code-review as a prerequisite control (code review is the first mitigation layer against malicious or accidental code changes before they enter the pipeline). Produces a deployment pipeline security assessment with threat-mitigation mapping, provenance schema, policy rules, and a phased hardening plan.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/building-secure-and-reliable-systems/skills/secure-deployment-pipeline
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- secure-code-review
source-books:
- id: building-secure-and-reliable-systems
title: "Building Secure and Reliable Systems"
authors: ["Heather Adkins", "Betsy Beyer", "Paul Blankinship", "Piotr Lewandowski", "Ana Oprea", "Adam Stubblefield"]
chapters: [14]
tags:
- security
- deployment
- supply-chain
- ci-cd
- binary-provenance
execution:
tier: 3
mode: full
inputs:
- type: pipeline_description
description: "Description or diagram of current CI/CD pipeline: source control system, build tooling, test infrastructure, deployment mechanism, and any existing signing or policy controls"
- type: threat_context
description: "Organizational context: size, insider risk tolerance, regulatory requirements, and which of the three adversary types are in scope"
outputs:
- type: security_assessment
description: "Threat-mitigation mapping using Tables 14-1/14-2, binary provenance schema, deployment policy rules, verifiable build architecture recommendation, and phased hardening roadmap"
---
# Secure Deployment Pipeline
## When to Use
Use this skill when:
- You need to verify that the code running in production is the code you assume it is
- You are designing or auditing a CI/CD pipeline for insider threat resistance
- You want to implement binary provenance so artifact origins are traceable after security incidents
- You are moving from signature-based to provenance-based deployment policies
- You need to establish breakglass procedures that are auditable and rare enough to distinguish from attacks
**Prerequisite:** The `secure-code-review` skill must already be applied. Code review is the foundational mitigation — it is the multi-party authorization control that increases the cost of introducing malicious changes before they reach the pipeline. All pipeline controls assume reviewed code enters the source control system.
---
## Step 1: Define Your Adversary Scope
Before mapping threats to mitigations, identify which adversary types apply to your organization. The threat model has three types:
| Adversary | Definition | Example |
|-----------|-----------|---------|
| **Benign insider** | Engineer who makes mistakes | Accidentally builds from locally modified code with unreviewed changes |
| **Malicious insider** | Insider who tries to exceed authorized access | Deploys a modified binary that exfiltrates customer credentials |
| **External attacker** | Compromises machine or account of one or more insiders | Uses a compromised engineer account to push a backdoored build script |
**Why this matters:** Mitigations have different effectiveness profiles against each adversary type. Code review deters malicious insiders but does not protect against collusion or external attackers who compromise multiple accounts simultaneously. Automation eliminates benign-insider build mistakes but introduces its own attack surface if not locked down. Knowing your adversary scope lets you prioritize controls and document known limitations honestly.
**Output:** A written adversary scope statement: which adversary types are in scope, which are out of scope and why, and any known limitations.
---
## Step 2: Map Your Pipeline Threats (Table 14-1)
Walk through each threat category and record: the threat, your current mitigation, and the limitation of that mitigation.
**Threat-mitigation mapping (Table 14-1 — basic best practices):**
| Threat | Mitigation | Known Limitation |
|--------|-----------|-----------------|
| Engineer accidentally introduces a vulnerability | Code review + automated testing | Does not catch all vulnerabilities |
| Malicious insider submits a change with a backdoor | Code review (increases cost; adversary must craft change to pass review) | Does not protect against collusion or external attackers compromising multiple accounts |
| Engineer accidentally builds from locally modified unreviewed code | Automated CI/CD system always pulls from the correct source repository | None if CI/CD is properly locked down |
| Engineer deploys a harmful configuration (e.g., debug features enabled in production) | Treat configuration as code: require configuration changes to be checked in, reviewed, and tested like any code change | Not all configuration can be treated as code |
| Malicious insider deploys a modified binary that exfiltrates data | Production environment requires proof that the CI/CD system built the binary; CI/CD pulls only from the correct source repository | Adversary may exploit breakglass procedures; mitigated by logging and auditing |
| Malicious insider modifies cloud bucket ACLs to exfiltrate data | Treat resource ACLs as configuration; restrict changes to the deployment process only | Does not protect against collusion |
| Malicious insider steals the integrity signing key | Store signing key in a key management system accessible only to the CI/CD system; require key rotation | Build steps that run inside the CI/CD system may still reach the key; see Step 6 (Table 14-2) for build-specific hardening |
For each row, fill in your current state. Threats with no mitigation or with significant limitations are your highest-priority gaps.
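
One way to keep this gap analysis reviewable is to record each row as structured data and let a script surface the gaps. The sketch below is illustrative only: the `ThreatRow` type, its field names, and the ordering rule (unmitigated threats first, then mitigated threats with a known limitation) are assumptions, not part of Table 14-1.

```python
from dataclasses import dataclass


@dataclass
class ThreatRow:
    """One row of the threat-mitigation table (field names are hypothetical)."""
    threat: str
    current_mitigation: str = ""   # empty string means nothing is in place yet
    known_limitation: str = ""     # empty string means no known limitation


def find_gaps(rows):
    """Return rows that need attention, unmitigated threats first."""
    unmitigated = [r for r in rows if not r.current_mitigation.strip()]
    limited = [r for r in rows
               if r.current_mitigation.strip() and r.known_limitation.strip()]
    return unmitigated + limited


rows = [
    ThreatRow("Engineer builds from locally modified code",
              current_mitigation="CI/CD always pulls from the source repository"),
    ThreatRow("Insider deploys a modified binary",
              current_mitigation="Production requires proof of CI/CD build",
              known_limitation="Breakglass can be abused"),
    ThreatRow("Insider steals the signing key"),
]

for row in find_gaps(rows):
    status = "NO MITIGATION" if not row.current_mitigation else "LIMITED"
    print(f"[{status}] {row.threat}")
```

Rows with a mitigation and no recorded limitation are omitted from the output, which keeps the list focused on the highest-priority gaps.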
**Why configuration-as-code is non-negotiable:** Engineers understand they should not build a production binary from a locally modified source copy. But many do not apply the same discipline to configuration. Treating configuration as code — version-controlled, peer-reviewed, tested before deployment — reuses all existing supply chain controls at no additional architectural cost.
---
## Step 3: Verify Artifacts, Not Just People
The key principle: deployment environments should verify **what** is being deployed, not only **who** initiated the deployment. An actor can make a mistake or may be intentionally deploying a malicious change regardless of authorization level.
**Implementation:**
- Deployment environments must require proof that each automated step of the deployment process occurred
- Humans must not be able to bypass automation unless some other mitigating control checks that bypass action (logging, alerting, post-deployment audit)
- Place deployment decisions at proper **choke points** — points through which all deployment requests must flow. In Kubernetes, the master node is the choke point; configure worker nodes to accept requests only from the master. Add an Admission Controller webhook or Binary Authorization for policy enforcement.
**Two choke point architectures:**
1. **Direct enforcement:** The choke point itself evaluates the policy (e.g., Kubernetes Admission Controller)
2. **Proxy enforcement:** A proxy in front of the choke point evaluates the policy — but the admission point must be configured to accept access only via the proxy, or adversaries bypass it directly
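
As a rough illustration of direct enforcement, the sketch below shows the decision logic a Kubernetes validating admission webhook might run. It only checks that a provenance annotation is present; full verification is covered in Step 5. The annotation key `example.com/provenance` is hypothetical, and the HTTP server, TLS setup, and webhook registration are omitted.

```python
PROVENANCE_ANNOTATION = "example.com/provenance"  # hypothetical annotation key


def review(admission_review: dict) -> dict:
    """Build an AdmissionReview response that denies objects lacking provenance."""
    request = admission_review["request"]
    # "object" can be absent or null for some operations, so default to {}.
    metadata = (request.get("object") or {}).get("metadata", {})
    annotations = metadata.get("annotations") or {}

    allowed = PROVENANCE_ANNOTATION in annotations
    response = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        response["status"] = {
            "message": f"missing annotation {PROVENANCE_ANNOTATION}; "
                       "deploy an artifact built by the CI/CD system"
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Because the function denies by default when the annotation is missing, an object that bypasses the CI/CD system cannot pass this check silently.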
---
## Step 4: Design Binary Provenance for Each Build Stage
Every build stage must produce **binary provenance**: a cryptographically authenticated record of exactly how the artifact was built — inputs, transformation, and the entity that performed the build.
**Why provenance is required beyond auditing:** Without provenance, you cannot know what source code a deployed binary came from. During a security incident, reverse engineering a binary is prohibitively expensive; inspecting version-controlled source is fast. Provenance is also the prerequisite for provenance-based deployment policies (Step 5).
**Binary provenance schema — required fields:**
| Field | Required | Description |
|-------|----------|-------------|
| **Authenticity** | Yes | Cryptographic signature covering all other fields. Verifies which system produced the provenance and establishes integrity. Usually a signing key accessible only to the CI/CD system. |
| **Outputs** | Yes | The output artifacts this provenance applies to, each identified by a cryptographic hash of the artifact content (for example, SHA-256). |
| **Inputs: Sources** | Recommended | The top-level input artifacts — e.g., "Git commit `270f...ce6d` from `https://github.com/org/repo`" or "file `foo.tar.gz` with SHA-256 `78c5...6649`". Each input should have both an identifier (URI) and a version (cryptographic hash). |
| **Inputs: Dependencies** | Recommended | All other artifacts needed for the build — libraries, build tools, compilers — that are not fully specified in sources. Each can affect build integrity. |
| **Command** | Recommended | The command used to initiate the build, structured for automated analysis. Example: `{"bazel": {"command": "build", "target": "//main:hello-world"}}` |
| **Environment** | Optional | Architecture details, environment variables, and any other information needed to reproduce the build. |
| **Input metadata** | Optional | Metadata about inputs that downstream systems will find useful (e.g., source commit timestamp used by a policy evaluation system). |
| **Debug info** | Optional | Information useful for debugging but not required for security (e.g., machine on which the build ran). |
| **Versioning** | Recommended | Build timestamp and provenance format version number, so that old builds can be invalidated and the format can change without exposing verifiers to rollback attacks. |
**Attack surface awareness:** Any input not checked by the build system (and therefore not implied by the signature) and not included in the sources (and therefore not peer reviewed) is a vector. If users can specify arbitrary compiler flags, the verifier must validate those flags explicitly.
**Propagation guidance:** Strongly prefer propagating provenance inline with the artifact (e.g., as a Kubernetes annotation passed to the Admission Controller webhook) rather than storing in a separate database keyed by artifact hash. Database-keyed lookup causes ambiguity when the same artifact hash appears in multiple provenance records, produces unactionable error messages, and introduces latency that may exceed deployment SLOs.
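
To make the schema concrete, here is a minimal sketch of assembling and signing a provenance record. It is not a definitive format: the JSON field names are assumptions, the build timestamp is omitted for brevity, and an HMAC with a demo secret stands in for the asymmetric signature a real CI/CD system would produce with a key held in a key management system.

```python
import hashlib
import hmac
import json

# Assumption: a shared demo secret. A real CI/CD system would keep an
# asymmetric signing key in a key management system instead.
DEMO_KEY = b"demo-only-key"


def build_provenance(artifact: bytes, source_uri: str, source_commit: str,
                     command: dict) -> dict:
    """Assemble the provenance payload for one output artifact."""
    return {
        "format_version": 1,  # Versioning field
        "outputs": [{"sha256": hashlib.sha256(artifact).hexdigest()}],
        "sources": [{"uri": source_uri, "commit": source_commit}],
        "command": command,
    }


def sign(payload: dict, key: bytes = DEMO_KEY) -> dict:
    """Wrap the payload with a MAC computed over its canonical JSON form."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    tag = hmac.new(key, canonical, hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": tag}


envelope = sign(build_provenance(
    artifact=b"hello-world binary contents",
    source_uri="https://github.com/org/repo",
    source_commit="270f...ce6d",  # elided in the schema example, kept as-is
    command={"bazel": {"command": "build", "target": "//main:hello-world"}},
))
print(json.dumps(envelope, indent=2))
```

Serializing with sorted keys and fixed separators gives the signer and the verifier the same byte string to authenticate, so the signature covers every field in the payload.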
---
## Step 5: Define Provenance-Based Deployment Policies
A deployment policy describes the intended properties of each deployment environment. The deployment environment matches this policy against the binary provenance of each artifact at deployment time.
**Why provenance-based policies outperform pure code signing:**
- Reduces implicit assumptions throughout the supply chain, making each step's contract explicit and auditable
- Clarifies the responsibility of each pipeline stage, reducing misconfiguration
- Allows a single signing key per build step (not per deployment environment), because the provenance record carries the environment-specific information
**Example policy rules to implement (tailor to your threat model):**
1. Source code was submitted to version control and peer reviewed
2. Source code came from an approved repository (specific URI and build target)
3. Build was performed through the official CI/CD pipeline (not a local workstation)
4. Tests passed (the test result is an artifact in the provenance chain)
5. The binary type is explicitly allowed for this environment (no "test" binaries in production)
6. The code version or build is sufficiently recent (blocks rollback to versions with known vulnerabilities)
7. Code passed a security vulnerability scan within the last N days
**Three-step policy implementation:**
1. **Verify provenance authenticity** — cryptographically validate the signature to prevent tampering or forgery
2. **Verify provenance applies to the artifact** — compare the cryptographic hash of the artifact to the hash in the provenance payload, preventing "good provenance applied to bad artifact" attacks
3. **Verify provenance meets all policy rules** — evaluate each rule in the deployment policy against provenance field values
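
The three steps above can be sketched as a single verification function. This is an illustration under stated assumptions, not a reference implementation: the envelope layout, the `source_uri` field, and the example rule are hypothetical, and an HMAC with a demo key stands in for real signature verification.

```python
import hashlib
import hmac
import json

DEMO_KEY = b"demo-only-key"  # assumption: stands in for the CI/CD signing key


def _mac(payload: dict, key: bytes) -> str:
    """Compute the MAC over the canonical JSON form of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, canonical, hashlib.sha256).hexdigest()


def verify_deployment(artifact: bytes, envelope: dict, rules, key=DEMO_KEY):
    """Return (allowed, reason) after running the three checks in order."""
    payload = envelope["payload"]

    # 1. Authenticity: the signature must match the payload.
    if not hmac.compare_digest(_mac(payload, key), envelope["signature"]):
        return False, "provenance signature is invalid"

    # 2. Applicability: the provenance must describe this exact artifact.
    digest = hashlib.sha256(artifact).hexdigest()
    if digest not in [o["sha256"] for o in payload["outputs"]]:
        return False, "artifact hash is not listed in provenance outputs"

    # 3. Policy: every rule must accept the provenance payload.
    for name, rule in rules:
        if not rule(payload):
            return False, f"policy rule failed: {name}"
    return True, "ok"


# Hypothetical rule: the source must come from an approved repository.
RULES = [("approved repository",
          lambda p: p["source_uri"] == "https://github.com/org/repo")]

artifact = b"binary"
payload = {"outputs": [{"sha256": hashlib.sha256(artifact).hexdigest()}],
           "source_uri": "https://github.com/org/repo"}
envelope = {"payload": payload, "signature": _mac(payload, DEMO_KEY)}

print(verify_deployment(artifact, envelope, RULES))      # accepted
print(verify_deployment(b"tampered", envelope, RULES))   # rejected at step 2
```

The order matters: policy rules read provenance fields, so they are only meaningful once the provenance is known to be authentic and to describe the artifact being deployed.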
**Design rules:**
- Make provenance unambiguous: one artifact, one provenance record per build
- Make policies unambiguous: one policy applies to any given deployment. If two policies could apply, resolve it as a meta-policy (require both to pass) rather than leaving it ambiguous
- Provide actionable error messages: when a deployment is rejected, the message must explain what went wrong and how to fix it (e.g., "source URI was X, policy requires Y; rebuild from Y or update the policy to allow X")
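
A small sketch of the last two design rules follows. It applies every policy that matches an environment (the meta-policy: all must pass) and returns error messages that say how to fix the rejection. The `Policy` type and its fields are hypothetical, and rejecting a deployment when no policy matches is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    environment: str          # deployment environment the policy covers
    allowed_source_uri: str   # the only source repository accepted


def check_source(policy, provenance):
    """Return None on success, or an error that says what to do next."""
    actual = provenance["source_uri"]
    if actual == policy.allowed_source_uri:
        return None
    return (f"source URI was {actual}, policy requires "
            f"{policy.allowed_source_uri}; rebuild from "
            f"{policy.allowed_source_uri} or update the policy to allow {actual}")


def evaluate(policies, environment, provenance):
    """Apply every matching policy; all must pass (the meta-policy)."""
    matching = [p for p in policies if p.environment == environment]
    if not matching:
        # Assumption: fail closed when no policy covers the environment.
        return [f"no policy is defined for environment {environment!r}; "
                "add one before deploying"]
    errors = []
    for policy in matching:
        error = check_source(policy, provenance)
        if error:
            errors.append(error)
    return errors
```

An empty list means the deployment may proceed; any other result can be shown to the engineer verbatim.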
---
## Step 6: Address Advanced Threats (Table 14-2)
Four threats remain unaddressed by basic best practices alone. Apply these advanced mitigations for high-security or large organizations:
**Threat-mitigation mapping (Table 14-2 — advanced):**
| Threat | Advanced Mitigation |
|--------|-------------------|
| Engineer deploys an old version of code with a known vulnerability | Deployment policy requires the code to have undergone a security vulnerability scan within the last N days |
| CI system misconfigured to allow builds from arbitrary source repositories; malicious adversary builds from a repo containing malicious code | CI system generates binary provenance recording which source repository it pulled from; production deployment policy requires provenance proving the artifact originated from an approved repository |
| Malicious adversary uploads a custom build script that exfiltrates the signing key, then signs and deploys a malicious binary | Verifiable build system uses privilege separation: the component that runs custom build scripts (the worker) has no access to the signing key. Only the trusted orchestrator process holds the key. |
| Malicious adversary tricks the CD system into using a backdoored compiler or build tool | Hermetic builds require developers to explicitly specify the compiler and build tool in the source code (fully pinned, peer-reviewed like any other code change). The orchestrator fetches these tools; the worker has no ability to substitute them. |
**Verifiable build architecture — three options:**
| Architecture | How it works | Best for |
|-------------|-------------|---------|
| **Trusted build service** | A central service the verifier trusts signs provenance with a key only it holds. Build once; no reproducibility required. | Most organizations; used by Google internally |
| **Self-performed rebuild** | The verifier reproduces the build itself and checks for bit-for-bit identical output. Requires full reproducibility. | Rarely practical: builds take minutes or longer, while deployment decisions must complete in milliseconds |
| **Rebuilding service** | A quorum of independent rebuilders attest to provenance. Hybrid of the two above. | Open source projects (Debian model) where central authority is infeasible |
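
For the quorum-based option, the core check can be as small as counting how many rebuilders reproduced the artifact being deployed. The sketch below assumes attestations have already been fetched and their signatures verified. The function name and the mapping from rebuilder name to digest are hypothetical.

```python
def has_quorum(artifact_digest: str, attestations: dict, quorum: int) -> bool:
    """Return True when at least `quorum` rebuilders reproduced this digest.

    `attestations` maps a rebuilder name to the digest that rebuilder produced.
    """
    agreeing = [name for name, digest in attestations.items()
                if digest == artifact_digest]
    return len(agreeing) >= quorum


attestations = {"rebuilder-a": "abc123", "rebuilder-b": "abc123",
                "rebuilder-c": "ffff00"}
print(has_quorum("abc123", attestations, quorum=2))
print(has_quorum("ffff00", attestations, quorum=2))
```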
**Hermetic build requirements (prerequisite for the advanced mitigations above):**
- All inputs to the build (sources, compilers, libraries) are fully specified up front, outside the build process, with unambiguous versions (cryptographic hashes or fully resolved version numbers)
- The orchestrator fetches all inputs; the worker executes build steps in isolation with no network access to arbitrary sources
- Benefits: enables CVE scanning against the full dependency graph, guarantees third-party import integrity, enables cherry-picking (patch + rebuild without extraneous behavior changes from a different compiler version)
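
A pre-build check for the first requirement might look like the sketch below. It is stricter than the text requires: it accepts only full SHA-256 pins and treats version numbers as unpinned, which is an assumption of this example. The manifest layout is hypothetical.

```python
import re

SHA256_HEX = re.compile(r"^[0-9a-f]{64}$")


def unpinned_inputs(manifest: list) -> list:
    """Return names of build inputs that lack a full SHA-256 pin."""
    return [item["name"] for item in manifest
            if not SHA256_HEX.match(str(item.get("sha256") or ""))]


manifest = [
    {"name": "compiler", "sha256": "a" * 64},   # placeholder digest
    {"name": "libfoo", "version": "latest"},    # floating version, not hermetic
]
print(unpinned_inputs(manifest))
```

Running a check like this in the orchestrator, before any worker starts, keeps the decision about inputs outside the build process itself.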
---
## Step 7: Design Breakglass Controls
In emergencies (e.g., an outage requiring an immediate configuration change that would take too long through the normal pipeline), engineers may need to bypass the deployment policy.
**Breakglass design requirements:**
- **Every breakglass deployment must raise an alarm immediately** — not after the fact, but at the moment of use
- **Every breakglass deployment must be audited quickly** — the audit must occur while the context is fresh
- **Breakglass events must be rare** — if there are too many breakglass events, it becomes impossible to differentiate malicious activity from legitimate emergency use. The rarity threshold is your primary security property: if breakglass is routine, it provides no security
- **Log sufficient information** — logs must capture enough state to reconstruct the full policy decision after the fact, including the state of any external systems the policy depended on at decision time
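
The requirements above can be sketched as a small monitor that logs each breakglass event with its policy state, alerts at the moment of use, and raises a second alert when usage stops being rare. The class name, the sliding-window threshold, and the `alert` callable are assumptions for illustration.

```python
import json
import time
from collections import deque


class BreakglassMonitor:
    """Record breakglass deployments, alert on each, and track their rarity."""

    def __init__(self, alert, max_per_window: int, window_seconds: float):
        self.alert = alert                      # callable that pages a human
        self.max_per_window = max_per_window    # assumed rarity threshold
        self.window_seconds = window_seconds
        self.events = deque()

    def record(self, user, artifact_sha256, reason, policy_state, now=None):
        """Log one breakglass deployment and alert immediately."""
        now = time.time() if now is None else now
        event = {"time": now, "user": user, "artifact_sha256": artifact_sha256,
                 "reason": reason, "policy_state": policy_state}
        self.events.append(event)
        # Drop events that fell out of the sliding window.
        while self.events and self.events[0]["time"] <= now - self.window_seconds:
            self.events.popleft()
        self.alert("BREAKGLASS " + json.dumps(event, sort_keys=True))
        if len(self.events) > self.max_per_window:
            self.alert("breakglass is no longer rare: "
                       f"{len(self.events)} uses in the current window")
        return event
```

Capturing `policy_state` at decision time is what lets an auditor reconstruct why the normal policy would have rejected the deployment.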
**Post-deployment verification (required even with enforcement):**
Even when deployment policies are enforced at the choke point, run post-deployment verification because:
- Policies can change; existing deployments must be re-evaluated against new policy versions
- The enforcement decision service may have been unavailable, causing a "fail open" deployment
- Breakglass may have been used
- Dry-run mode surfaces policy violations without blocking deployment (useful when first rolling out enforcement)
- Forensic investigators need the deployment record after incidents
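
A minimal sketch of post-deployment verification with a dry-run switch follows. The deployment record fields, the `max_scan_age` rule, and the action labels are hypothetical. A real verifier would read running deployments and their provenance from the deployment environment.

```python
def max_scan_age(days):
    """Build a policy that rejects deployments whose last scan is too old."""
    def policy(deployment):
        age = deployment["scan_age_days"]
        if age > days:
            return [f"last vulnerability scan was {age} days ago; "
                    f"policy allows {days}"]
        return []
    return policy


def audit_deployments(deployments, policy, enforce=False):
    """Re-check running deployments against the current policy.

    In dry-run mode (enforce=False) violations are reported but not acted on.
    """
    report = []
    for deployment in deployments:
        violations = policy(deployment)
        if violations:
            action = "flag-for-rollback" if enforce else "report-only"
            report.append({"name": deployment["name"],
                           "violations": violations,
                           "action": action})
    return report


running = [{"name": "frontend", "scan_age_days": 3},
           {"name": "billing", "scan_age_days": 45}]
print(audit_deployments(running, max_scan_age(30)))
```

Running the same audit after each policy change covers the first reason in the list above: deployments admitted under an older policy are re-evaluated against the new one.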
---
## Step 8: Staged Hardening Roadmap ("Take It One Step at a Time")
Implementing the full supply chain security model requires many changes. Attempting all of them simultaneously risks disrupting engineering productivity and causing outages. Secure one aspect of the supply chain at a time.
**Recommended sequence:**
| Phase | Focus | Key Actions |
|-------|-------|------------|
| **Phase 1: Foundational** | Eliminate human-in-the-loop build steps | Automate all build, test, and deploy steps; script everything so humans and automation execute identical steps; establish code review as mandatory (depends on `secure-code-review`) |
| **Phase 2: Configuration control** | Extend supply chain discipline to configuration | Treat all configuration as code: check in, review, and test before deployment; restrict direct configuration changes to the deployment process only |
| **Phase 3: Artifact verification** | Verify what is deployed, not just who deploys | Add binary provenance to CI/CD output; configure deployment choke points to require provenance; implement basic deployment policies |
| **Phase 4: Advanced hardening** | Address insider and sophisticated external threats | Implement privilege separation in build system (orchestrator + isolated worker); require hermetic builds; add provenance-based policies covering source repository, scan recency, and version recency |
| **Phase 5: Breakglass and monitoring** | Close the auditability gap | Implement breakglass with immediate alerting; enable post-deployment verification; run dry-run mode before enforcing new policies |
**Lock down the automated system itself:** For each path where an administrator can make a change without review — configuring the CI/CD pipeline directly, using SSH to run commands on the build machine — implement a mitigation that requires peer review for that path. This is the hardest step and the one most commonly skipped.
---
## Output Format
Produce a deployment pipeline security assessment with these sections:
1. **Adversary scope** — which of the three adversary types are in scope and why
2. **Threat-mitigation gap table** — current mitigations and limitations for each threat in Tables 14-1 and 14-2, with gaps highlighted
3. **Binary provenance schema** — which fields are required for your pipeline stages, with specific values for the Outputs and Sources fields
4. **Deployment policy rules** — ordered list of policy rules for each deployment environment, with the provenance fields each rule inspects
5. **Verifiable build architecture recommendation** — which of the three architectures fits your organization, with rationale
6. **Phased hardening roadmap** — which phase you are currently in, what changes Phase N+1 requires, and acceptance criteria for each phase
7. **Breakglass policy** — who can invoke breakglass, what is logged, how quickly audits must occur, and what the expected rarity threshold is
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Building Secure and Reliable Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-secure-code-review`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
---
name: secure-code-review
description: |
Review code for security vulnerabilities and reliability anti-patterns: scan for SQL injection risks (raw string concatenation into queries), XSS exposure (untyped HTML construction), authorization bypass from multilevel nesting, primitive type obsession, YAGNI-inflated attack surface, and missing framework enforcement for authentication/authorization/rate-limiting. Use when conducting a security code review, auditing a codebase for injection vulnerabilities, checking whether auth logic could be bypassed by nesting errors, evaluating whether RPC backends use hardened interceptor frameworks, or assessing whether type-safety patterns (TrustedSqlString, SafeHtml, SafeUrl) are applied to user-controlled inputs. Produces a categorized security findings report with severity, anti-pattern class, affected locations, and fix recommendations grounded in hardened-by-construction design.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/building-secure-and-reliable-systems/skills/secure-code-review
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on: []
source-books:
- id: building-secure-and-reliable-systems
title: "Building Secure and Reliable Systems"
authors: ["Heather Adkins", "Betsy Beyer", "Paul Blankinship", "Piotr Lewandowski", "Ana Oprea", "Adam Stubblefield"]
chapters: [12]
tags:
- security
- code-review
- secure-coding
- type-safety
- injection-prevention
- xss-prevention
- sql-injection
- authorization
- anti-patterns
- framework-design
- reliability
execution:
tier: 2
mode: full
inputs:
- type: codebase
description: "Source code directory, individual files, or a pull request diff to review for security and reliability anti-patterns"
- type: document
description: "Architecture description or code context if no codebase is directly accessible — falls back to structured review of a described design"
tools-required: [Grep, Read]
tools-optional: [Bash, Write]
mcps-required: []
environment: "Run inside the repository root. Grep searches the working tree for anti-patterns; Read examines flagged files for confirmation and context."
discovery:
goal: "Produce a categorized security findings report: each finding includes the anti-pattern class, affected file(s) and line(s), severity, and a concrete fix recommendation"
tasks:
- "Scan for SQL injection: raw string concatenation into query functions vs. parameterized queries or TrustedSqlString-style typed builders"
- "Scan for XSS exposure: unescaped user input in HTML rendering, string-typed URLs, absence of SafeHtml/SafeUrl type wrappers"
- "Scan for authorization bypass from multilevel nesting: auth/error checks nested more than 2 levels deep"
- "Scan for primitive type obsession: string/int parameters where distinct types would prevent misuse"
- "Scan for RPC backend structure: direct handler logic vs. interceptor-based framework for auth, logging, rate-limiting"
- "Scan for YAGNI complexity: unused parameters, abstract base classes with single concrete implementors, dead code paths"
- "Synthesize findings into a report with severity tiers and fix recommendations"
audience:
roles: ["security-engineer", "software-engineer", "site-reliability-engineer", "tech-lead", "code-reviewer"]
experience: "intermediate — assumes familiarity with web security concepts and a modern typed programming language"
triggers:
- "Security code review requested for a service or module"
- "Auditing a codebase for injection or XSS vulnerabilities"
- "Checking whether RPC backends enforce auth/logging via a framework or inline ad hoc logic"
- "Reviewing a pull request that touches database query construction or HTML rendering"
- "Assessing whether authorization logic could be bypassed due to nesting structure"
- "Evaluating a new service before launch for common OWASP vulnerability classes"
not_for:
- "Threat modeling or adversary profiling — use adversary-profiling-and-threat-modeling"
- "Access control policy and least-privilege design — use least-privilege-access-design"
- "Cryptographic primitive selection"
- "Infrastructure or network security configuration"
---
## When to Use
Use this skill when you need to systematically review source code for the vulnerability classes that account for the majority of security incidents in production systems: injection attacks, XSS, authorization bypass, and type-safety failures.
Invoke it for:
- Any codebase that constructs database queries, renders user input in HTML, or makes authorization decisions inline rather than via a shared framework
- Pull requests touching query construction, template rendering, or RPC handler registration
- Services that implement authentication, authorization, or rate-limiting directly in each handler rather than in a shared interceptor layer
- Assessing technical debt that may carry latent security risk (excessive nesting, primitive types used for security-sensitive data)
Do not invoke it for full threat modeling, infrastructure security review, or cryptographic design — those are separate concerns covered by other skills.
---
## Context and Input Gathering
Before beginning the review, establish:
1. **Language and framework**: What language is the codebase in? Is there a shared RPC framework, ORM, or template system in use? (This determines which scanning patterns are relevant.)
2. **Entry points**: Where does user-controlled input enter the system? (HTTP request parameters, RPC request fields, file uploads, headers.)
3. **Database access pattern**: Does the codebase use an ORM, raw query strings, prepared statements, or a typed query builder?
4. **HTML rendering**: Is HTML rendered server-side? Which template engine? Does it auto-escape by default?
5. **Authorization model**: Is authorization enforced at a framework/interceptor layer or inline in each handler?
6. **Prior review history**: Are there known exemptions, legacy `unsafe` packages, or known TODO/FIXME annotations around security logic?
If no codebase is accessible, conduct a structured interview covering the above before proceeding.
---
## Process
### Step 1 — Scan for SQL Injection Vulnerabilities
**WHY**: Injection, including SQL injection, ranked first in the OWASP Top 10 in 2013 and 2017 and third in 2021. String concatenation into query functions lets attackers inject arbitrary SQL commands. The only reliable fix is to make mixing user input and SQL structurally impossible, either through parameterized queries enforced by the framework or through a typed query builder (the TrustedSqlString pattern) that the compiler enforces.
Search for raw string concatenation into database query calls:
```
Grep pattern: db\.query\(.*\+|db\.exec\(.*\+|execute\(.*\+|query\(["'].*\+
Grep pattern: f"SELECT|f"INSERT|f"UPDATE|f"DELETE (Python f-strings with SQL keywords)
Grep pattern: "SELECT.*" \+|"INSERT.*" \+|"UPDATE.*" \+ (Java/Go string concat)
```
For each match, Read the flagged file and determine:
- **Confirmed injection risk**: User-controlled input (request parameter, external input) is concatenated directly into the query string. Mark as **Critical**.
- **Developer-controlled concatenation only**: String literals are concatenated but no external input reaches the query. Mark as **Low** but flag for refactoring toward a typed builder.
- **Parameterized / prepared statement**: Input is bound via `?`, `$1`, or named parameters. **Pass** — note the pattern for consistency.
**Fix recommendation for Critical findings**: Replace string concatenation with parameterized queries or a typed query builder. The TrustedSqlString pattern makes injection structurally impossible by construction:
```go
// BEFORE (vulnerable): string concatenation
db.query("UPDATE users SET pw_hash = '" + request["pw_hash"]
+ "' WHERE reset_token = '" + request.params["reset_token"] + "'")
// AFTER (safe): parameterized query
q := db.createQuery("UPDATE users SET pw_hash = @hash WHERE reset_token = @token")
q.setParameter("hash", request.params["pw_hash"])
q.setParameter("token", request.params["reset_token"])
db.exec(q)
// BEST (hardened by construction): TrustedSqlString typed builder
// AppendLiteral only accepts string literal arguments (enforced by the compiler).
// User-supplied strings cannot be passed; the type system prevents it.
q.AppendLiteral("UPDATE users SET pw_hash = ")
q.AppendParam(request.params["pw_hash"]) // param binding, not literal
```
If an `unsafequery` or equivalent escape hatch is found, confirm it routes through a security review workflow. Log its presence as a finding requiring manual sign-off.
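The difference between concatenation and parameter binding can be reproduced with Python's standard `sqlite3` module. This is a minimal sketch, not the chapter's code: the `users` table, the helper names, and the sample rows are assumptions made for illustration.

```python
import sqlite3

def reset_password_vulnerable(conn, pw_hash, reset_token):
    # String concatenation: attacker-controlled text becomes part of the SQL.
    sql = ("UPDATE users SET pw_hash = '" + pw_hash
           + "' WHERE reset_token = '" + reset_token + "'")
    return conn.execute(sql).rowcount

def reset_password_safe(conn, pw_hash, reset_token):
    # Placeholders: the driver binds the values as data, never as SQL text.
    sql = "UPDATE users SET pw_hash = ? WHERE reset_token = ?"
    return conn.execute(sql, (pw_hash, reset_token)).rowcount

def make_db():
    # Hypothetical schema and rows, used only for this demonstration.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (username TEXT, pw_hash TEXT, reset_token TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [("admin", "h-admin", "tok-admin"), ("bob", "h-bob", "tok-bob")])
    return conn

payload = "' OR username='admin"
vulnerable_rows = reset_password_vulnerable(make_db(), "evil", payload)
safe_rows = reset_password_safe(make_db(), "evil", payload)
```

With the concatenated query the payload rewrites the `WHERE` clause and one row (the admin's) is updated; with bound parameters the payload is compared as an ordinary string and no row matches.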
### Step 2 — Scan for XSS Vulnerabilities
**WHY**: XSS vulnerabilities occur when user-controlled input is rendered into HTML without sanitization. Because HTML has dozens of contexts (element content, attribute values, URL attributes, JavaScript event handlers) each requiring different escaping semantics, manual escaping is unreliable at scale. The correct fix is a type system where `SafeHtml`, `SafeUrl`, and `TrustedResourceUrl` types are produced only by trusted constructors — making it structurally impossible to pass an unsanitized string where a safe value is required.
Search for patterns where string values reach HTML output:
```
Grep pattern: innerHTML\s*=|document\.write\(|\.html\(.*req\.|\.html\(.*param
Grep pattern: render\(.*request\.|template.*\+.*request\.|%s.*request\.
Grep pattern: href\s*=\s*["']?\s*\+|src\s*=\s*["']?\s*\+ (URL attribute injection)
Grep pattern: SafeHtml|SafeUrl|TrustedResourceUrl (confirm presence in rendering paths)
```
For each HTML rendering site, Read the surrounding code and determine:
- **Unescaped user input in element content**: String from request inserted directly into HTML. Mark as **Critical**.
- **Untyped URL in href/src attribute**: String variable used as a URL — allows `javascript:` scheme injection. Mark as **High**.
- **Auto-escaping template in use**: Template engine escapes all substitutions by default and the context is correctly classified. **Pass**.
- **SafeHtml/SafeUrl type used**: Constructor-enforced safe type. **Pass** — verify the constructor is from the trusted codebase, not user-supplied.
**Fix recommendation for Critical findings**:
```
// BEFORE (vulnerable): string interpolation into HTML
<div>userAddress</div> // attacker sets userAddress to <script>...</script>
// AFTER (safe): use a SafeHtml type produced by a trusted constructor
// The type system rejects strings where SafeHtml is expected.
// For URLs, use SafeUrl or TrustedResourceUrl — they cannot be constructed
// from arbitrary strings at runtime:
templateRenderer.setMapData(
ImmutableMap.of("url", TrustedResourceUrl.fromConstant("/script.js"))
).render()
// Passing a plain string variable for "url" is a compile-time error.
```
Flag any template that receives HTML fragments from external (untrusted) sources — these require a sanitizer that parses the HTML, verifies contracts per element, and removes non-conforming content before rendering.
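The safe-type idea can be approximated in Python, with the caveat that Python enforces it at runtime rather than at compile time. The sketch below is an assumption-laden illustration: the `SafeHtml` class, its `_token` guard, and `render_div` are hypothetical names, not the API of any real safe-HTML library.

```python
import html

class SafeHtml:
    """Wrapper that is only meant to be created through the constructors below."""
    _token = object()  # hypothetical guard that discourages direct construction

    def __init__(self, value, token=None):
        if token is not SafeHtml._token:
            raise TypeError("use SafeHtml.escape() or SafeHtml.from_constant()")
        self._value = value

    @classmethod
    def escape(cls, untrusted):
        # Untrusted text is escaped for the element-content context.
        return cls(html.escape(untrusted, quote=True), cls._token)

    @classmethod
    def from_constant(cls, literal):
        # Intended for developer-written literals only; Python cannot
        # enforce "literal" at compile time, so this relies on convention.
        return cls(literal, cls._token)

    def __str__(self):
        return self._value

def render_div(content):
    # The rendering layer refuses plain strings outright.
    if not isinstance(content, SafeHtml):
        raise TypeError("render_div requires SafeHtml, got " + type(content).__name__)
    return "<div>" + str(content) + "</div>"

rendered = render_div(SafeHtml.escape("<script>alert(1)</script>"))
```

Calling `render_div` with a plain `str` raises `TypeError`, which is the runtime analogue of the compile-time rejection described above.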
### Step 3 — Scan for Authorization Bypass from Multilevel Nesting
**WHY**: Authorization checks and error handling embedded inside deeply nested conditional blocks are a named reliability and security anti-pattern. When checks occur at nesting level 3 or deeper, the logical relationship between conditions is difficult to follow, and error cases (like "unauthorized") can be silently swapped, bypassed, or reached by an unintended path. Unit tests rarely exercise all combinations of nested conditions.
**Detection criterion**: Flag any code block where an authorization check, permission assertion, or security-relevant error condition appears more than 2 levels of nesting deep (3+ levels of indentation beyond the function body).
Search for deep nesting around auth-related terms:
```
Grep pattern (multiline mode): \b(if|else|try|catch|for|while)\b.*\n.*\b(if|else|try|catch|for|while)\b.*\n.*(auth|permission|unauthorized|forbidden|access_denied)
```
Because regex cannot reliably count indentation levels, use Grep to locate function bodies containing both deep nesting keywords and security-relevant terms, then Read the flagged files to manually assess nesting depth.
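For Python sources, the manual depth assessment can be partly automated with the standard `ast` module, which knows the real block structure. The following is a rough sketch under stated assumptions: the term list mirrors the grep pattern, `deep_security_lines` and `SAMPLE` are hypothetical names, and the substring match is a heuristic (for example, `auth` also matches `author`).

```python
import ast
import re

SECURITY_TERMS = re.compile(
    r"auth|permission|unauthorized|forbidden|access_denied", re.IGNORECASE)
NESTING_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With)

def deep_security_lines(source, max_depth=2):
    """Return (line, depth) for statements that mention a security term
    while enclosed in more than max_depth nested blocks."""
    lines = source.splitlines()
    findings = []

    def visit(node, depth):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.stmt):
                text = lines[child.lineno - 1]
                if depth > max_depth and SECURITY_TERMS.search(text):
                    findings.append((child.lineno, depth))
            # Entering an if/for/while/try/with block adds one nesting level.
            next_depth = depth + 1 if isinstance(child, NESTING_NODES) else depth
            visit(child, next_depth)

    visit(ast.parse(source), 0)
    return findings

SAMPLE = '''
def handler(rpc, response):
    if rpc.ok():
        if response.rows():
            if response.enc() == "utf-8":
                return response.rows()
            else:
                raise AuthError("unauthorized")
    raise RpcError("failed")
'''

findings = deep_security_lines(SAMPLE)
```

Here the `raise AuthError(...)` statement on line 8 of `SAMPLE` sits inside three nested `if` blocks, so it is reported with depth 3. Flagged lines still need a manual read to confirm the finding.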
**Concrete example from the chapter**: the two versions below are meant to be logically equivalent, but the nested version swaps two error cases, a bug that is easy to miss until each check sits directly next to its own error:
```python
# BEFORE (nested; bug: 'unauthorized' and 'wrong encoding' are SWAPPED)
response = stub.Call(rpc, request)
if rpc.status.ok():
    if response.GetAuthorizedUser():
        if response.GetEnc() == 'utf-8':
            if response.GetRows():
                vals = [ParseRow(r) for r in response.GetRows()]
                avg = sum(vals) / len(vals)
                return avg, vals
            else:
                raise ValueError('no rows')
        else:
            raise AuthError('unauthorized')  # BUG: this is the encoding branch
    else:
        raise ValueError('wrong encoding')  # BUG: this is the authorization branch
else:
    raise RpcError(rpc.ErrorText())

# AFTER (early exit; each check is followed immediately by its own error)
response = stub.Call(rpc, request)
if not rpc.status.ok():
    raise RpcError(rpc.ErrorText())
if not response.GetAuthorizedUser():
    raise AuthError('unauthorized')
if response.GetEnc() != 'utf-8':
    raise ValueError('wrong encoding')
if not response.GetRows():
    raise ValueError('no rows')
vals = [ParseRow(r) for r in response.GetRows()]
avg = sum(vals) / len(vals)
return avg, vals
```
**Fix recommendation**: Refactor using the early-exit (guard clause) pattern — handle each error condition as soon as it can be detected, at the top level of the function. Authorization checks should execute first, before any other logic, and should fail fast with a clear error.
Mark findings as **High** when the mishandled case is an auth/permission check; **Medium** when it is a generic error path.
### Step 4 — Scan for Primitive Type Obsession
**WHY**: Using raw strings or integers for parameters that carry security-sensitive semantic meaning — user IDs, group names, token values, URLs — leads to misuse that the compiler cannot catch. Arguments can be passed in wrong order, values can be truncated or implicitly converted, and security invariants cannot be enforced without runtime checks scattered throughout the codebase. Strong types wrap the primitive and encode the invariant in the type system, enforced at compile time.
Search for functions with multiple adjacent string or int parameters in security-sensitive contexts:
```
Grep pattern: func.*\(.*string,\s*string|def.*\(.*str,\s*str (adjacent string params)
Grep pattern: AddUser.*string.*string|AddRole.*string.*string (identity/permission functions)
Grep pattern: query\(.*string\)|execute\(.*string\) (string-typed query params)
```
For each match, Read the function signature and callers to assess:
- **Security-relevant ambiguity**: Two or more string parameters where transposing them could produce a security failure (e.g., `addUserToGroup(username, groupName)` — which order?). Mark as **Medium**.
- **URL or HTML typed as string**: A `string` parameter used as a URL or HTML fragment that could accept a `javascript:` URI or raw HTML. Mark as **High**.
- **Non-security ambiguity**: Ambiguous parameter ordering but no security consequence. Mark as **Low** / code quality.
**Fix recommendation**:
```
// BEFORE: ambiguous, compiler cannot catch argument transposition
AddUserToGroup(string, string) // which is user, which is group?
// AFTER: strong types make the call self-documenting and compiler-enforced
Add(User("alice"), Group("root-users"))
// User and Group are distinct types wrapping string — passing them in
// wrong order is a compile-time error.
// For URLs: use SafeUrl or TrustedResourceUrl instead of string
// For HTML content: use SafeHtml instead of string
```
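A Python rendering of the same idea is sketched below. Python has no compile-time enforcement, so this version combines type hints (for a static checker) with a runtime `isinstance` guard; the `add_user_to_group` function and the `memberships` dictionary are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass
from typing import Dict, Set

@dataclass(frozen=True)
class User:
    name: str

@dataclass(frozen=True)
class Group:
    name: str

def add_user_to_group(user: User, group: Group,
                      memberships: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    # Runtime guard: a transposed or plain-string argument fails loudly
    # instead of silently granting the wrong membership.
    if not isinstance(user, User) or not isinstance(group, Group):
        raise TypeError("expected (User, Group)")
    memberships.setdefault(group.name, set()).add(user.name)
    return memberships

memberships = add_user_to_group(User("alice"), Group("root-users"), {})
```

Swapping the arguments, as in `add_user_to_group(Group("root-users"), User("alice"), {})`, raises `TypeError` at runtime and would also be reported by a static type checker.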
### Step 5 — Evaluate RPC Backend Framework Usage
**WHY**: The most common security and reliability failures in RPC backends — skipped authentication, inconsistent authorization, missing rate limiting, incomplete logging — occur when each handler reimplements these cross-cutting concerns independently. A framework with predefined interceptors (before/after hooks around each RPC) enforces that every call passes through logging, authentication, authorization, and rate limiting in order, regardless of which developer wrote the handler. Centralized framework logic can be fixed once and applied everywhere; ad hoc per-handler logic must be audited and fixed in every handler.
Search for RPC handler implementations:
```
Grep pattern: func.*Handler\(|def.*handler\(|@app\.route|RegisterHandler
Grep pattern: func.*Before\(.*Request|func.*After\(.*Response (interceptor pattern — positive signal)
```
For each handler file, Read and assess:
**Interceptor pattern present (positive)**:
```go
// Authorization interceptor: framework-enforced
type authzInterceptor struct {
    allowedRoles map[string]bool
}

func (ai *authzInterceptor) Before(ctx context.Context, req *Request) (context.Context, error) {
    callInfo, err := FromContext(ctx) // callInfo populated by the framework
    if err != nil { return ctx, err }
    if ai.allowedRoles[callInfo.User] { return ctx, nil }
    return ctx, fmt.Errorf("Unauthorized request from %q", callInfo.User)
}

func (*authzInterceptor) After(ctx context.Context, resp *Response) error {
    return nil // nothing left to do after the RPC is handled
}
```
Mark as **Pass** — the framework handles auth before RPC logic executes. Authorization interceptors that return an error prevent the RPC from executing, but still call the `After` stage of each already-invoked interceptor, ensuring logging completes.
**Ad hoc inline auth (anti-pattern)**:
```go
// Each handler manually checks credentials, with no guaranteed ordering
func HandleRequest(req *Request) (*Response, error) {
    if req.Token == "" { return nil, ErrUnauthorized }
    // ... business logic with no guaranteed logging ...
}
```
Mark as **High** — auth can be skipped or ordered differently across handlers, logging is not guaranteed, and rate limiting is missing. Recommend migrating to an interceptor-based framework where logging, authentication, authorization, and rate limiting are defined once and applied to every RPC call.
The interceptor execution order that should be enforced: **Logging → Authentication → Authorization → Rate Limiting → RPC Logic**. On error at any stage, the `after` stages of all previously executed interceptors still run (ensuring logs are always written).
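The ordering guarantee can be sketched in a few lines of Python. This is an illustrative model rather than the framework described in the chapter: `run_chain`, the two interceptor classes, and the dictionary-based context are all hypothetical, and the sketch assumes that the `after` stage also runs for the interceptor whose `before` stage failed.

```python
class LoggingInterceptor:
    def before(self, ctx, request):
        ctx["log"].append("request from " + ctx["user"])

    def after(self, ctx, response):
        status = "error" if response["error"] else "ok"
        ctx["log"].append("response " + status)

class AuthzInterceptor:
    def __init__(self, allowed_users):
        self.allowed_users = allowed_users

    def before(self, ctx, request):
        if ctx["user"] not in self.allowed_users:
            raise PermissionError("unauthorized request from " + ctx["user"])

    def after(self, ctx, response):
        pass  # nothing left to do after the RPC is handled

def run_chain(interceptors, handler, ctx, request):
    """Run before stages in order, the handler only if all pass, then the
    after stages of every interceptor that was invoked, in reverse order."""
    invoked = []
    error = None
    for interceptor in interceptors:
        invoked.append(interceptor)
        try:
            interceptor.before(ctx, request)
        except Exception as exc:
            error = exc
            break
    response = {"error": error, "body": None}
    if error is None:
        response["body"] = handler(ctx, request)
    for interceptor in reversed(invoked):
        interceptor.after(ctx, response)
    return response

def handle(ctx, request):
    ctx["log"].append("handler ran")
    return "payload for " + request

chain = [LoggingInterceptor(), AuthzInterceptor({"alice"})]
denied_ctx = {"user": "mallory", "log": []}
denied = run_chain(chain, handle, denied_ctx, "report")
allowed_ctx = {"user": "alice", "log": []}
allowed = run_chain(chain, handle, allowed_ctx, "report")
```

For the denied call the handler never runs, yet the logging interceptor still records both the request and the error response, which is the property the review should confirm in a real framework.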
### Step 6 — Scan for YAGNI Complexity
**WHY**: Speculative code written "just in case it might be needed" adds surface area that must be secured, tested, and maintained. Unused parameters and abstract interfaces with a single concrete implementor increase the complexity of reasoning about control flow — which is where security bugs hide. Simpler code has fewer paths that can be exploited.
Search for indicators of YAGNI complexity:
```
Grep pattern: virtual.*=\s*0|@abstractmethod|\babstract\s+class (abstract methods and classes: check for a single implementor)
Grep pattern: TODO|FIXME|HACK (debt that may be load-bearing for security logic)
Grep pattern: bool.*=\s*false\b|bool.*=\s*true\b (boolean flags with suspicious defaults)
```
For each abstract interface or virtual method found, Read the file and search for all implementors. If exactly one concrete implementation exists, flag as **Low** — evaluate whether the abstraction is warranted or whether it adds unnecessary complexity.
For boolean parameters in security-sensitive functions (e.g., `hibernate = false` always passed by callers), Read all call sites. If all callers pass the same value, the parameter is dead weight — the code must handle a case that never occurs, and that handling code introduces reliability risk (error paths never exercised by tests).
**Fix recommendation**: Delete the unused path. Apply the principle of incremental development: implement the abstraction when a second concrete implementation actually exists.
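For Python code, the single-implementor check can be approximated with the standard `ast` module instead of reading every file by hand. The sketch below is a heuristic under stated assumptions: `single_implementor_bases` and `SAMPLE` are hypothetical names, only the bare `@abstractmethod` decorator form is recognized, and only subclasses within the same source text are counted.

```python
import ast
from collections import defaultdict

def single_implementor_bases(source):
    """Map each class that declares an @abstractmethod to its only
    subclass, when exactly one subclass exists in the source."""
    subclasses = defaultdict(list)
    abstract = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        for base in node.bases:
            if isinstance(base, ast.Name):
                subclasses[base.id].append(node.name)
        for item in node.body:
            if isinstance(item, ast.FunctionDef):
                names = [d.id for d in item.decorator_list
                         if isinstance(d, ast.Name)]
                if "abstractmethod" in names:
                    abstract.add(node.name)
    return {base: kids[0] for base, kids in subclasses.items()
            if base in abstract and len(kids) == 1}

SAMPLE = '''
from abc import ABC, abstractmethod

class Storage(ABC):
    @abstractmethod
    def put(self, key, value): ...

class DiskStorage(Storage):
    def put(self, key, value):
        pass

class Codec(ABC):
    @abstractmethod
    def encode(self, data): ...

class JsonCodec(Codec):
    def encode(self, data):
        return data

class XmlCodec(Codec):
    def encode(self, data):
        return data
'''

candidates = single_implementor_bases(SAMPLE)
```

In this sample `Storage` is flagged because `DiskStorage` is its only implementor, while `Codec` has two implementors and is left alone. Each candidate is a prompt for a Low finding, not an automatic verdict.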
### Step 7 — Synthesize Findings into a Report
**WHY**: Raw findings from individual scans need to be prioritized by severity and organized by fix effort so the team can address the most critical issues first without being overwhelmed by the full list.
Produce a report with the following structure:
```
## Security Review Findings: [codebase / PR / module name]
**Date**: [date]
**Reviewer**: [agent or reviewer name]
**Scope**: [files/directories reviewed]
### Critical
[Finding ID] | [Anti-Pattern Class] | [File:Line] | [Description]
Fix: [concrete recommendation]
### High
[same structure]
### Medium
[same structure]
### Low / Code Quality
[same structure]
### Passes (positive patterns noted)
[Interceptor framework present in X, parameterized queries used throughout Y, SafeHtml used in template layer]
### Recommended Next Steps
1. [Highest-priority fix — typically Critical SQL injection or XSS]
2. [Framework migration if ad hoc auth found]
3. [Type hardening for security-sensitive parameters]
```
Severity assignments:
- **Critical**: User-controlled input directly concatenated into SQL or HTML rendering
- **High**: Untyped URLs, inline auth with no framework enforcement, auth check nested >2 levels
- **Medium**: Primitive type obsession in security-sensitive signatures, medium-depth nesting in error paths
- **Low**: YAGNI complexity, developer-controlled SQL concatenation, minor code quality
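The tiering step can be sketched as a small Python helper that groups findings by severity and emits them in the report's row format. The `Finding` fields and `render_report` are hypothetical names chosen to match the template above; the sample findings are invented for the illustration.

```python
from dataclasses import dataclass

SEVERITY_ORDER = ["Critical", "High", "Medium", "Low"]

@dataclass
class Finding:
    finding_id: str
    pattern: str
    location: str
    description: str
    severity: str
    fix: str

def render_report(findings):
    """Emit findings grouped by severity tier, most severe tier first."""
    lines = []
    for severity in SEVERITY_ORDER:
        tier = [f for f in findings if f.severity == severity]
        if not tier:
            continue  # omit empty tiers
        lines.append("### " + severity)
        for f in tier:
            lines.append(" | ".join(
                [f.finding_id, f.pattern, f.location, f.description]))
            lines.append("Fix: " + f.fix)
    return "\n".join(lines)

report = render_report([
    Finding("F-2", "YAGNI", "storage.py:10", "single implementor",
            "Low", "inline the abstraction"),
    Finding("F-1", "SQL injection", "reset.py:42", "concatenated query",
            "Critical", "use a parameterized query"),
])
```

Regardless of the order in which scans produced them, the Critical finding is rendered before the Low one.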
---
## Key Principles
**Make security invariants impossible to violate by construction.** Guidelines and code review cannot scale. A type system that prevents mixing user input with SQL query strings — or prevents passing a string where `SafeHtml` is required — removes an entire vulnerability class regardless of developer attention. Design APIs so the insecure path is the hard path.
**Security cannot be retrofitted.** Tacking on security after launch is painful and often requires changing fundamental assumptions. The patterns in this skill (interceptors, typed wrappers, early-exit auth) must be applied during implementation, not after the fact.
**Centralize cross-cutting security logic.** Authentication, authorization, logging, and rate limiting belong in a shared framework or interceptor layer — not reimplemented in each handler. When domain experts fix a vulnerability in a centralized framework, it is fixed everywhere simultaneously.
**Nesting is complexity is risk.** Deep conditional nesting makes control flow hard to reason about. Security checks that appear in nested branches are frequently swapped, bypassed, or untested. The early-exit pattern — check preconditions at the top of the function and fail fast — is both simpler and more secure.
**YAGNI applies to attack surface.** Every abstract interface, unused parameter, and speculative code path is surface area that must be secured and maintained. Avoid implementing features that are not currently needed. Add complexity when a second concrete use case actually arrives.
**Frameworks improve developer productivity and security simultaneously.** Developers using a framework only need to specify what is specific to their service (which authorization policy, which data to log). The framework handles the rest. This frees developer attention for business logic and reduces the chance of security-relevant omissions.
---
## Examples
### Example 1 — RPC Backend: Interceptor Framework vs. Ad Hoc Auth
**Scenario**: Reviewing a Go RPC service with 12 handlers. Some handlers check credentials manually; others omit the check entirely.
**Finding**: Ad hoc inline credential check — High severity. No guaranteed ordering. Logging occurs only in some handlers.
**Positive pattern to look for** (interceptor framework):
```go
// Framework interceptor chain: Logging → Auth → Authorization → Rate Limiting → RPC Logic
// Each stage runs Before (pre-RPC) and After (post-RPC, always runs even on error).

// Authorization interceptor:
type authzInterceptor struct{ allowedRoles map[string]bool }

func (ai *authzInterceptor) Before(ctx context.Context, req *Request) (context.Context, error) {
    callInfo, err := FromContext(ctx) // populated by framework from verified credentials
    if err != nil { return ctx, err }
    if ai.allowedRoles[callInfo.User] { return ctx, nil }
    return ctx, fmt.Errorf("Unauthorized request from %q", callInfo.User)
}

// Logging interceptor (logs every request before, and every failure after):
func (*logInterceptor) Before(ctx context.Context, req *Request) (context.Context, error) {
    callInfo, _ := FromContext(ctx)
    logger.Log(ctx, &pb.LogRequest{
        Timestamp: time.Now().Unix(), User: callInfo.User, Request: req.Payload,
    }, WithAttemptCount(3))
    return ctx, nil
}

func (*logInterceptor) After(ctx context.Context, resp *Response) error {
    if resp.Err == nil { return nil }
    logger.LogError(ctx, &pb.LogErrorRequest{
        Timestamp: time.Now().Unix(), Error: resp.Err.Error(),
    }, WithAttemptCount(3))
    return nil
}
```
**Recommendation**: Migrate all handlers to register with the interceptor framework. Remove inline credential checks. The framework guarantees every call is logged before and after, authenticated, authorized against the allowed-roles policy, and rate-limited — regardless of which developer wrote the handler.
### Example 2 — SQL Injection: TrustedSqlString Pattern
**Scenario**: Reviewing a password reset endpoint. The query is constructed by string concatenation.
**Finding**: Critical SQL injection — user-controlled `reset_token` concatenated directly into the query string.
```python
# Vulnerable (real code pattern from the chapter):
db.query("UPDATE users SET pw_hash = '" + request["pw_hash"]
+ "' WHERE reset_token = '" + request.params["reset_token"] + "'")
# An attacker sets reset_token = "' OR username='admin" to reset admin's password.
```
**Fix**: Parameterized query eliminates injection. Typed builder (TrustedSqlString) makes it impossible by construction — `AppendLiteral` only accepts compile-time string constants; user input cannot be passed:
```go
// TrustedSqlString typed builder (Go sketch):
type Query struct{ sql strings.Builder }
type stringLiteral string // unexported: code outside the package cannot name this type
func (q *Query) AppendLiteral(literal stringLiteral) { q.sql.WriteString(string(literal)) }
// q.AppendLiteral("foo") -- compiles (an untyped string constant converts implicitly)
// q.AppendLiteral(foo)   -- does NOT compile when foo is a variable of type string
```
If a legacy `unsafequery` escape hatch is found in the codebase, gate all new uses behind a security review approval workflow (visibility allow-listing). Each new use must be individually reviewed; the review load is manageable precisely because the typed API prevents most cases.
### Example 3 — XSS: SafeHtml and SafeUrl Type Enforcement
**Scenario**: A web application renders user-provided profile data into HTML templates.
**Finding**: Critical severity (per Step 2, unescaped user input in element content): `userAddress` is a string interpolated directly into an HTML element:
```html
<!-- Vulnerable: attacker sets userAddress to <script>exfiltrate_user_data();</script> -->
<div>userAddress</div>
```
**Fix**: The rendering layer must require `SafeHtml` for element content and `SafeUrl` / `TrustedResourceUrl` for URL attributes. The types are immutable wrappers, and only their constructors, which make up the small trusted codebase, can produce them:
```java
// Template substitution with SafeHtml enforcement:
// Passing a plain String where SafeHtml is expected is a compile-time error.
templateRenderer.setMapData(
ImmutableMap.of("address", SafeHtml.fromConstant("...")) // only from trusted constructors
).render();
// For URL attributes — TrustedResourceUrl rejects javascript: scheme at construction:
templateRenderer.setMapData(
ImmutableMap.of("url", TrustedResourceUrl.fromConstant("/script.js"))
).render();
// templateRenderer.setMapData(ImmutableMap.of("url", some_string_variable)) -- compile error
```
If the application receives HTML fragments from an external (untrusted) source, use a sanitizer that parses the HTML, verifies type contracts per element, and removes elements or attributes that do not meet their contract — rather than rendering the fragment directly.
---
## References
- [security-anti-pattern-catalog.md](../../references/security-anti-pattern-catalog.md) — Full catalog of named anti-patterns from Chapter 12: multilevel nesting, primitive type obsession, YAGNI smells, ad hoc framework reimplementation
- [owasp-top10-framework-mitigations.md](../../references/owasp-top10-framework-mitigations.md) — OWASP Top 10 vulnerability classes mapped to framework-level hardening measures (Table 12-1)
- [trusted-sql-string-pattern.md](../../references/trusted-sql-string-pattern.md) — TrustedSqlString implementation guide: Go (package-private type alias), Java (@CompileTimeConstant), C++ (template constructor); incremental rollout strategy; unsafequery escape hatch governance
- [safe-html-safe-url-pattern.md](../../references/safe-html-safe-url-pattern.md) — SafeHtml and SafeUrl type system: constructor trust model, Closure Templates integration, HTML sanitizer usage, TrustedResourceUrl for script/stylesheet URLs
- [rpc-interceptor-framework.md](../../references/rpc-interceptor-framework.md) — Interceptor framework design for RPC backends: context object, before/after execution model, dependency registration, exponential backoff, cascading failure prevention
Cross-references:
- `adversary-profiling-and-threat-modeling` — identify which threat actors make injection and XSS vulnerabilities most consequential
- `least-privilege-access-design` — authorization framework design and access control policy complement the inline auth review in Step 5
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Building Secure and Reliable Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
Design or audit a system's resilience posture using a layered framework — defense in depth, controlled degradation (load shedding vs. throttling), blast radi...
---
name: resilience-and-blast-radius-design
description: Design or audit a system's resilience posture using a layered framework — defense in depth, controlled degradation (load shedding vs. throttling), blast radius compartmentalization (role/location/time), failure domains, 3-tier component reliability hierarchy, and continuous validation. Use this skill when designing a new system for failure, reviewing an existing architecture for single points of failure, limiting blast radius of a potential attack or outage, deciding fail-open vs. fail-closed behavior for a component, or building an incident-response-ready compartmentalization strategy.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/building-secure-and-reliable-systems/skills/resilience-and-blast-radius-design
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
source-books:
- id: building-secure-and-reliable-systems
title: "Building Secure and Reliable Systems"
authors: ["Heather Adkins", "Betsy Beyer", "Paul Blankinship", "Piotr Lewandowski", "Ana Oprea", "Adam Stubblefield"]
chapters: [8]
tags: [security, reliability, resilience, blast-radius, defense-in-depth, fault-tolerance, compartmentalization, load-shedding, failure-domains]
depends-on: []
execution:
tier: 2
mode: full
inputs:
- type: context
description: "System architecture description, component list, or incident post-mortem. May also be a design proposal or existing runbook."
tools-required: [Read, Write]
tools-optional: []
mcps-required: []
environment: "Works from a system description, architecture diagram, or design document. Output: resilience assessment report with failure domain map, component tier assignments, compartmentalization recommendations, and prioritized implementation roadmap."
discovery:
goal: "Produce a resilience assessment that identifies single points of failure, assigns component reliability tiers, maps blast radius exposure, recommends compartmentalization boundaries, and delivers a cost-ordered implementation roadmap."
tasks:
- "Audit defense-in-depth coverage across all attack stages"
- "Identify degradation control strategy (load shedding vs. throttling) and fail-open/closed posture for each critical component"
- "Map compartmentalization axes: role, location, and time separation"
- "Assign each component to a reliability tier: high-capacity, high-availability, or low-dependency"
- "Define failure domains and evaluate redundancy strategy"
- "Design or review continuous validation coverage"
- "Produce a prioritized implementation roadmap using the cost-ordered sequence"
audience: "engineers, SREs, security engineers, and architects at intermediate-to-advanced level"
when_to_use: "When designing a new system for resilience, reviewing an existing architecture for weaknesses, limiting blast radius, deciding fail-open vs. fail-closed policy, or preparing for incident response"
environment: "System architecture or design document. Knowledge of critical user-facing features and their dependencies is needed for degradation planning."
quality: placeholder
---
# Resilience and Blast Radius Design
## When to Use
Apply this skill when:
- Designing a new system and wanting to build resilience in from the start
- Reviewing an existing architecture for single points of failure, over-broad blast radius, or missing degradation controls
- Deciding whether a component should fail open (availability) or fail closed (security)
- Planning compartmentalization to limit the impact of a security breach or operational outage
- Preparing an incident response strategy that requires quarantining a subsystem without taking everything offline
- Evaluating whether alternative/backup components are likely to work when actually needed
**The core pattern:** resilience is not a single mechanism — it is five cooperating layers: (1) defense in depth across attack stages, (2) controlled degradation under load, (3) blast radius compartmentalization, (4) failure domains with component-tier redundancy, and (5) continuous validation that the other four still work.
Before starting, confirm you have:
- A description of the system's architecture or the component under review
- An understanding of which features are critical to users (needed for degradation planning)
- A sense of the organization's risk tolerance and operational capacity (needed for tier selection)
---
## Context and Input Gathering
### Required Context
- **System description:** Architecture diagram, design document, or prose description of the system's components and their dependencies.
- **Critical features list:** Which capabilities must stay available even under severe degradation? Which can be throttled or disabled?
- **Threat model (if available):** What types of attackers or failure modes are most likely? This shapes compartmentalization priorities.
### Observable Context
If a system description is provided, scan for:
- Components that are dependencies of many other components — single points of failure
- Components with no alternative implementation — missing redundancy tiers
- Features that require the same network, credentials, or configuration as security controls — fail-open risk
- Backup or alternative components that are never exercised in production — zombie backup risk
- Compartment boundaries with no log-based validation: compartmentalization that cannot be verified
### Default Assumptions
- If fail-open vs. fail-closed is unspecified: ask — this is a security-vs-reliability tradeoff that requires a deliberate organizational decision
- If no degradation strategy exists: default to identifying the top 3 most expensive operations and the top 3 most critical operations as starting points
- If no component tiers are defined: assign all components to high-capacity tier initially, then identify candidates for high-availability and low-dependency promotion
- If no compartmentalization exists: start with role separation (cheapest) before location or time separation
### Sufficiency Check
You have enough to proceed when:
1. You can enumerate the system's major components and their dependencies
2. You know which features are non-negotiable under load
3. You know whether the primary risk is availability (prefer fail-open) or integrity/security (prefer fail-closed)
---
## Process
### Step 1 — Audit Defense in Depth Coverage
Defense in depth protects systems by establishing multiple independent defense layers, so that a failure in any single layer does not result in a complete compromise.
Map the system against the four stages of an attack or failure:
| Stage | What defenders must address |
|---|---|
| Threat modeling and vulnerability discovery | Limit information exposed to attackers; detect reconnaissance signals |
| Deployment / delivery of an exploit | Validate and inspect anything entering the system boundary (supply chain, third-party dependencies, user-supplied content) |
| Execution | Sandbox, isolate, and limit blast radius so execution cannot escape to adjacent systems |
| Compromise | Ensure detection and response are possible; limit how long an attacker can maintain access |
**For each stage, ask:** What defensive measures exist? If this layer fails, what catches the next stage?
**Why:** Each layer buys time and increases attacker cost. Layers should be complementary — each should anticipate the likely failure modes of the layer before it. A single perimeter with no inner defenses is brittle; inner defenses should assume the outer perimeter has been breached.
**Supply chain dependency note:** Arbitrary third-party dependencies entering a trusted execution environment (e.g., untrusted code running alongside internal services) require an independent sandbox layer. If the sandbox fails, a second containment layer (such as syscall filtering) should catch what the sandbox missed.
**Output for this step:** A per-stage defense inventory with gaps marked.
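One way to record the per-stage inventory is a small data structure that reports uncovered stages. This is a minimal sketch, not a prescribed format: the stage keys, the `DefenseInventory` class, and the example measures are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Short keys for the four stages in the table above (names are illustrative).
STAGES = ("discovery", "delivery", "execution", "compromise")


@dataclass
class DefenseInventory:
    # Maps each stage to the defensive measures that cover it.
    measures: dict[str, list[str]] = field(
        default_factory=lambda: {stage: [] for stage in STAGES}
    )

    def add(self, stage: str, measure: str) -> None:
        if stage not in self.measures:
            raise ValueError(f"unknown stage: {stage}")
        self.measures[stage].append(measure)

    def gaps(self) -> list[str]:
        """Stages with no defensive measure at all."""
        return [s for s in STAGES if not self.measures[s]]

    def single_layer(self) -> list[str]:
        """Stages covered by exactly one measure, so nothing backs it up."""
        return [s for s in STAGES if len(self.measures[s]) == 1]


inventory = DefenseInventory()
inventory.add("delivery", "dependency signature verification")
inventory.add("execution", "sandbox")
inventory.add("execution", "syscall filtering")
inventory.add("compromise", "audit logging")

print(inventory.gaps())          # stages to mark as gaps
print(inventory.single_layer())  # stages with no second layer
```

Listing single-layer stages separately from empty ones reflects the question asked for each stage: if this layer fails, what catches the next stage?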
---
### Step 2 — Design Controlled Degradation
When a system exceeds capacity or loses components, unplanned degradation typically causes a cascading failure. Controlled degradation replaces chaotic collapse with a pre-planned sequence of graceful reductions in service.
#### 2a — Classify Features by Criticality
Rank features by their value and their cost:
- **Non-negotiable under load:** Keep these active regardless of resource pressure (security controls, core user data access)
- **Throttleable:** High-value but not critical; delay or slow rather than drop
- **Sheddable:** Low-value or high-cost; return an error rather than serve
**Security and reliability tradeoff:** Security-critical operations should generally fail closed, not open. If a system cannot verify its integrity (e.g., ACLs fail to load), the safer default is "deny all," not "allow all." Determine the organization's minimum nonnegotiable security posture before designing degradation thresholds.
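The "deny all" default can be sketched as an authorization check that treats any failure to load ACLs as a denial. The `make_authorizer` helper, the `Acl` shape, and the service names below are assumptions for illustration, not part of any specific system.

```python
from typing import Callable, Optional

# Hypothetical ACL shape: resource name -> principals allowed to access it.
Acl = dict[str, set[str]]


def make_authorizer(load_acl: Callable[[], Optional[Acl]]) -> Callable[[str, str], bool]:
    """Build an authorization check that fails closed."""

    def is_allowed(principal: str, resource: str) -> bool:
        try:
            acl = load_acl()
        except Exception:
            return False  # ACL source failed: deny all rather than allow all
        if acl is None:
            return False  # ACLs missing: integrity cannot be verified
        return principal in acl.get(resource, set())

    return is_allowed


def broken_loader() -> Acl:
    raise IOError("ACL store unreachable")


is_allowed = make_authorizer(lambda: {"payments-db": {"billing-service"}})
deny_all = make_authorizer(broken_loader)
```

With a working loader, only listed principals pass. With `broken_loader`, every request is denied, including ones that would normally be allowed.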
#### 2b — Choose: Load Shedding or Throttling
**Load shedding** (returning errors for excess requests):
- Goal: stabilize a component at maximum load, prevent crash
- When to use: when a component is approaching capacity and any excess request should be rejected rather than risk crashing the whole server
- How it works: assign request priority and cost; shed low-priority or high-cost requests first; security-critical functions get the highest priority, so they are shed last
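The priority-and-cost rule can be sketched as a simple admission function. This is an illustrative simplification that assumes request costs are known up front and that capacity is a single number. The `Request` fields and the example requests are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    priority: int  # higher = more important; security-critical gets the highest
    cost: float    # estimated resource units (assumed to be known in advance)


def admit(requests: list[Request], capacity: float) -> tuple[list[Request], list[Request]]:
    """Admit highest priority first and, within a priority, cheapest first."""
    ordered = sorted(requests, key=lambda r: (-r.priority, r.cost))
    admitted: list[Request] = []
    shed: list[Request] = []
    used = 0.0
    for req in ordered:
        if used + req.cost <= capacity:
            admitted.append(req)
            used += req.cost
        else:
            shed.append(req)  # the caller would return an error for these
    return admitted, shed


requests = [
    Request("report-export", priority=1, cost=5.0),
    Request("acl-check", priority=3, cost=1.0),
    Request("profile-read", priority=2, cost=2.0),
    Request("search", priority=1, cost=2.0),
]
admitted, shed = admit(requests, capacity=5.0)
```

Here the security-critical `acl-check` is admitted first, and the expensive low-priority `report-export` is the request that gets shed.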
**Throttling** (delaying responses to slow future request rate):
- Goal: reduce the rate at which clients send follow-up requests
- When to use: when clients will retry on failure and retries would compound the load problem
- How it works: introduce delay before responding so clients naturally slow their request cadence
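A delay-based throttle can be sketched as a function from current utilization to a response delay. The linear ramp, the 0.8 threshold, and the 2-second ceiling are assumptions chosen for illustration; the source does not prescribe a specific curve.

```python
def throttle_delay(
    utilization: float,
    threshold: float = 0.8,    # assumed: start delaying above 80% utilization
    max_delay_s: float = 2.0,  # assumed: delay applied at 100% utilization
) -> float:
    """Seconds to wait before responding, rising linearly above the threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    utilization = min(max(utilization, 0.0), 1.0)  # clamp to [0, 1]
    if utilization <= threshold:
        return 0.0
    return max_delay_s * (utilization - threshold) / (1.0 - threshold)


for load in (0.5, 0.9, 1.0):
    print(load, throttle_delay(load))
```

A server would sleep for the returned duration before replying, so clients that wait for a response before sending the next request slow down on their own.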
**Automation controls:** Avoid allowing automation to make unsupervised large-magnitude changes (e.g., a single policy change affecting all servers). Implement a change budget: when automation exhausts its budget, a human must approve the next action. Prevent automated degradation controls from disabling emergency access paths (breakglass mechanisms, SSH fallback routes).
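A change budget can be sketched as a counter that automation must draw from before acting. The unit of the budget (servers affected) and the `ChangeBudget` class are assumptions for illustration; in this sketch a human-approved change does not consume budget.

```python
class ChangeBudget:
    """Caps how many servers automation may change without human approval."""

    def __init__(self, max_servers: int) -> None:
        self.remaining = max_servers

    def request(self, servers_affected: int, human_approved: bool = False) -> bool:
        """Return True if the change may proceed."""
        if human_approved:
            return True  # a human has taken responsibility for this change
        if servers_affected <= self.remaining:
            self.remaining -= servers_affected
            return True
        return False  # budget exhausted: escalate to a human


budget = ChangeBudget(max_servers=10)
first = budget.request(6)                          # within budget
second = budget.request(6)                         # exceeds the remaining 4
approved = budget.request(6, human_approved=True)  # human override
```

A real deployment would also need to replenish the budget per time window and to exempt emergency access paths from automated changes, which this sketch leaves out.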
**Output for this step:** Feature criticality ranking, selected degradation mechanism per component, and automation budget policy.
---
### Step 3 — Map Compartmentalization (Blast Radius Control)
Compartmentalization divides the system into isolated units so that a breach or failure in one compartment does not jeopardize the others. Use three axes, applied in order of increasing implementation cost:
#### 3a — Role Separation (lowest cost)
- Run distinct services as distinct identities (service accounts / roles)
- A compromise of one role's credentials cannot be used to impersonate another role's services
- Even services developed and operated by the same team should use different roles if they access different classes of data
**Why:** Credential compromise is the most common attack escalation path. Role separation limits how far a single stolen credential can propagate.
#### 3b — Location Separation (medium cost)
- Run service instances in distinct physical or logical locations (datacenters, cloud regions, network segments)
- Ensure no service has a critical dependency on a backend that is homed in only one location
- Use location-bound cryptographic certificates: compromise of one location's keys should not expose other locations' data
**Physical location as a natural boundary:** Natural disasters, fiber cuts, and physical-presence attacks (tailgating, hardware implants) are all location-confined. Design so that a localized impact stays within its region while the multiregional system continues operating.
**Align physical and logical boundaries:** Segment networks on both network-level risk (internet-facing vs. internal) and physical risk (datacenter vs. office). Distribute secrets, credentials, and encryption keys per-location rather than sharing a single global secret across all servers.
#### 3c — Time Separation (ongoing discipline)
- Rotate credentials and encryption keys on a regular schedule and expire old ones
- Rotation forces an attacker who has stolen a credential to repeatedly re-acquire it, creating more opportunities for detection
- Regularly rotating keys also validates that rotation works — uncovering services that were never designed to handle rotation before an emergency requires it
**Key rotation validation targets:**
- *Key rotation latency:* How long does a full rotation cycle take? This is your incident response clock.
- *Verified loss of access:* Confirm old keys are fully revoked, not just replaced.
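Both validation targets can be measured in one routine. The sketch below uses a hypothetical in-memory `KeyRing` as a stand-in for a real key management service and takes the clock as a parameter so the measurement can be tested deterministically.

```python
from typing import Callable


class KeyRing:
    """Hypothetical in-memory key store; a real system would call its KMS."""

    def __init__(self) -> None:
        self.active: set[str] = set()

    def issue(self, key_id: str) -> None:
        self.active.add(key_id)

    def revoke(self, key_id: str) -> None:
        self.active.discard(key_id)

    def accepts(self, key_id: str) -> bool:
        return key_id in self.active


def validate_rotation(
    ring: KeyRing, old_key: str, new_key: str, clock: Callable[[], float]
) -> dict:
    """Rotate a key, then report rotation latency and verified loss of access."""
    start = clock()
    ring.issue(new_key)
    ring.revoke(old_key)
    latency = clock() - start
    return {
        "rotation_latency_s": latency,
        "new_key_accepted": ring.accepts(new_key),
        # Check the old key explicitly rather than assuming revocation worked.
        "old_key_rejected": not ring.accepts(old_key),
    }


ring = KeyRing()
ring.issue("key-v1")
ticks = iter([100.0, 130.0])  # fake clock: rotation takes 30 seconds
report = validate_rotation(ring, "key-v1", "key-v2", clock=lambda: next(ticks))
```

In production the clock would be `time.monotonic`, and the old-key check would be an actual access attempt against each dependent service.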
#### 3d — Compartment Granularity Tradeoff
Finer compartments (per-RPC-method) give tighter control but create N×M complexity in access policies. Coarser compartments (per-server) are easier to manage but offer less precision. The right granularity balances security benefit against operational overhead — consult incident management and operations teams when setting boundaries.
**Imperfect compartments still add value:** Any delay imposed on an attacker trying to escape a compartment is time for the incident response team to detect and react.
**Output for this step:** A compartmentalization map with role, location, and time boundaries identified. Flag any compartment violations that should be validated by log analysis.
---
### Step 4 — Define Failure Domains and Assign Component Tiers
Compartmentalization limits blast radius from attacks and operational failures. Failure domains extend this further by partitioning the system into independent functional copies, so that complete failure of one partition does not bring down the entire system.
#### 4a — Design Failure Domains
A failure domain is an independent partition that:
- Looks like the whole system to its clients
- Can take over for other partitions during an outage (at reduced capacity)
- Has its own data copy, isolated from other domains' data
**Practical minimum:** Even splitting into two failure domains provides significant value — it enables A/B regression isolation, limits blast radius of a configuration change to one domain, and enables geographic disaster isolation. Use one domain as a canary; enforce a policy that prohibits simultaneous updates to both domains.
**Data isolation within failure domains:**
- Restrict how new data enters a failure domain (validate before accepting)
- Preserve last-known-good configuration to disk so the domain can operate if its configuration API is unavailable
- Rate-limit global configuration changes to prevent a single bad push from affecting all domains simultaneously
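The first two points, validating before accepting and preserving last-known-good configuration on disk, can be sketched together. The `load_config` function, the `limit` field, and the cache file name are hypothetical. Rate-limiting of global pushes is not covered here.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_config(
    fetch_remote: Callable[[], dict],
    validate: Callable[[dict], bool],
    cache_path: Path,
) -> dict:
    """Return a validated config, falling back to the last-known-good copy."""
    try:
        candidate = fetch_remote()
        if validate(candidate):
            # Persist only configs that passed validation.
            cache_path.write_text(json.dumps(candidate))
            return candidate
    except Exception:
        pass  # config API unavailable: fall through to the cached copy
    # Raises FileNotFoundError if no good config was ever cached.
    return json.loads(cache_path.read_text())


def is_valid(cfg: dict) -> bool:
    return isinstance(cfg.get("limit"), int) and cfg["limit"] > 0


def unavailable() -> dict:
    raise ConnectionError("config API down")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "last_known_good.json"
    first = load_config(lambda: {"limit": 10}, is_valid, cache)   # accepted, cached
    second = load_config(unavailable, is_valid, cache)            # API down
    third = load_config(lambda: {"limit": -1}, is_valid, cache)   # rejected push
```

All three calls return the same good configuration: the domain keeps operating through both an API outage and a bad push.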
#### 4b — Assign Component Reliability Tiers
Three tiers, each suited to different resilience requirements:
**Tier 1 — High-Capacity (standard production)**
- The main fleet serving normal user load
- Focus: capacity planning, rollout procedures, feature development
- Where to invest first — these components carry the most user impact
**Tier 2 — High-Availability (reduced-dependency copy)**
- A copy of a critical component configured with fewer dependencies and slower update cadence
- Uses locally cached data rather than remote databases; runs older (more stable) code versions
- Goal: provably lower probability of outage than the high-capacity version
- Operational cost: additional resource allocation proportional to fleet size
**Tier 3 — Low-Dependency (minimal fallback)**
- An alternative implementation with minimal dependencies, designed to survive infrastructure failures that would take down Tier 1 and Tier 2
- Supports only the most critical subset of features
- Dependencies must themselves be low-dependency; the alternative component and its critical dependency must not share a failure domain with the primary
**Anti-pattern: backup drift.** A backup component that comes to be treated as part of normal operation is already saturated when an actual outage hits (overreliance drift). A backup component that is never exercised in normal operation fails unexpectedly when needed (zombie backup). Mitigation: continuously route a small fraction of real traffic through high-availability and low-dependency components.
**Output for this step:** Component tier assignments table. Flag any component with no Tier 2 or Tier 3 fallback that has no acceptable outage window.
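The flagging rule in this output can be sketched as a filter over the tier assignments table. The `Component` fields and the example rows are hypothetical; an acceptable outage window of zero minutes stands for "no acceptable outage window".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    has_tier2: bool             # high-availability copy exists
    has_tier3: bool             # low-dependency fallback exists
    acceptable_outage_min: int  # 0 means no outage is acceptable


def missing_fallback(components: list[Component]) -> list[str]:
    """Components that tolerate no outage yet have no Tier 2 or Tier 3 fallback."""
    return [
        c.name
        for c in components
        if c.acceptable_outage_min == 0 and not (c.has_tier2 or c.has_tier3)
    ]


table = [
    Component("auth", has_tier2=True, has_tier3=True, acceptable_outage_min=0),
    Component("config-api", has_tier2=False, has_tier3=False, acceptable_outage_min=0),
    Component("reporting", has_tier2=False, has_tier3=False, acceptable_outage_min=240),
]
flagged = missing_fallback(table)
```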
---
### Step 5 — Design Continuous Validation
Resilience mechanisms degrade silently over time. New features introduce dependencies that violate compartment boundaries. Backup components accumulate drift. Validation is the compounding interest on all other resilience investments: without it, the mechanisms built in Steps 1–4 decay.
**Validation is distinct from chaos engineering:** Chaos engineering is exploratory. Validation confirms that specific, known-good properties still hold under controlled conditions.
**Core validation strategy:**
1. Discover new failures (from incident logs, bug reports, fuzzing, expert judgment)
2. Implement validators for each discovered failure mode
3. Execute all validators repeatedly on a schedule
4. Retire validators when the relevant feature or behavior no longer exists
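The implement, execute, and retire steps can be sketched as a small registry. The `ValidatorRegistry` class and the validator names are hypothetical, and scheduling is left to whatever job runner is already in use.

```python
from typing import Callable


class ValidatorRegistry:
    """Holds named checks; each returns True while its property still holds."""

    def __init__(self) -> None:
        self._validators: dict[str, Callable[[], bool]] = {}

    def register(self, name: str, check: Callable[[], bool]) -> None:
        self._validators[name] = check

    def retire(self, name: str) -> None:
        # Remove validators whose feature or behavior no longer exists.
        self._validators.pop(name, None)

    def run_all(self) -> dict[str, bool]:
        results: dict[str, bool] = {}
        for name, check in self._validators.items():
            try:
                results[name] = bool(check())
            except Exception:
                results[name] = False  # a crashing validator counts as a failure
        return results


registry = ValidatorRegistry()
registry.register("old-key-rejected", lambda: True)
registry.register("backup-serves-traffic", lambda: False)
registry.register("legacy-endpoint-closed", lambda: True)
registry.retire("legacy-endpoint-closed")

results = registry.run_all()
failing = [name for name, ok in results.items() if not ok]
```

Calling `run_all` from a scheduled job covers step 3. Anything in `failing` is a resilience property that has silently degraded.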
**Validation techniques by component tier:**
| Situation | Technique |
|---|---|
| High-capacity and high-availability components | Request mirroring: send duplicate traffic to both, alert on unexpected response differences |
| Low-dependency components | Route on-call engineers through low-dependency paths as part of their normal duties — this validates readiness and builds familiarity |
| Compartment boundary enforcement | Log analysis: any operation that crosses a role, location, or time boundary should fail; unexpected successes should alert |
| Load shedding and throttling behavior | Inject simulated load or artificial behavior changes into servers; observe propagation through clients and backends |
| Key rotation | Measure rotation latency and confirm verified loss of access for old keys on a regular schedule |
| Resource release under oversubscription | Periodically exercise the actual resource rebalancing process — not a simulation — to confirm the SLO holds |
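The log-analysis row can be sketched as a scan for cross-role operations that succeeded without being on an allowlist. The log entry shape, the role names, and the allowlist are assumptions for illustration. Real logs would need parsing into this form first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessLogEntry:
    caller_role: str
    target_role: str
    succeeded: bool


def boundary_violations(
    entries: list[AccessLogEntry], allowed: set[tuple[str, str]]
) -> list[AccessLogEntry]:
    """Cross-role operations that succeeded but are not explicitly allowed."""
    return [
        e
        for e in entries
        if e.succeeded
        and e.caller_role != e.target_role
        and (e.caller_role, e.target_role) not in allowed
    ]


allowed = {("frontend", "profile-service")}
log = [
    AccessLogEntry("frontend", "profile-service", succeeded=True),   # expected
    AccessLogEntry("frontend", "billing-service", succeeded=False),  # correctly denied
    AccessLogEntry("batch-job", "billing-service", succeeded=True),  # should alert
]
alerts = boundary_violations(log, allowed)
```

Denied crossings are the expected outcome, so only unexpected successes are returned for alerting.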
**Human factors in validation:** Recovery actions necessarily involve humans. Validate recovery instructions for readability. Validate that on-call engineers can actually execute emergency procedures — including using low-dependency systems — under realistic conditions.
**Output for this step:** Validation plan with method, trigger, and frequency per resilience property. Flag any resilience mechanism that has no associated validator.
---
### Step 6 — Produce the Resilience Assessment and Implementation Roadmap
Synthesize findings from Steps 1–5 into a structured report.
**Report sections:**
1. **Defense-in-depth gap inventory** — per-stage coverage and identified gaps
2. **Feature criticality ranking and degradation policy** — which features shed first, which stay active, fail-open vs. fail-closed decisions
3. **Compartmentalization map** — role, location, and time boundaries; compartment violations
4. **Failure domain map** — current domains, tier assignments per component, data isolation status
5. **Continuous validation coverage** — validators in place, gaps, and recommended additions
6. **Prioritized implementation roadmap** (see Key Principles below)
---
## Key Principles
### Prioritized Implementation Order (by cost-effectiveness)
When starting from scratch or limited in capacity, apply resilience improvements in this order:
1. **Failure domains and blast radius controls** — lowest ongoing cost due to their static nature; significant improvement in containment; make future bad designs harder to introduce
2. **High-availability component copies** — next most cost-effective; cheap to deploy, easy to abandon if not needed; provide a concrete measure of how much availability improvement is available
3. **Load shedding and throttling** — justified when organizational scale or risk aversion makes automation cost-effective; critically important for defending against denial-of-service attacks
4. **Denial-of-service defenses** — evaluate effectiveness of degradation controls against DoS scenarios specifically
5. **Low-dependency components** — most expensive; before investing, calculate how long it would take to bring up all dependencies of the business-critical service; compare that recovery time against the cost of the alternative
When budget or time runs out, streamline the efficiency of already-deployed mechanisms before adding new ones.
### Fail-Open vs. Fail-Closed Decision Framework
Resolve this per component using the organization's minimum nonnegotiable security posture:
- **Fail open** (availability priority): Serve what you can even if configuration is incomplete. Suitable when service interruption causes more harm than degraded security posture. Never acceptable for security-critical operations.
- **Fail closed** (security priority): Lock down fully when integrity cannot be verified. Suitable for ACL enforcement, authentication, and any component whose failure could allow unauthorized access.
**Resolution:** Define the minimum security posture first. Then design reliability controls to meet that posture's requirements — for example, tag security-oriented RPC traffic with high quality-of-service priority so it survives load shedding.
### Why Compartment Redundancy Beats Global Defense
A single strong perimeter, if breached, exposes the entire system. Compartments ensure that a breach or failure is contained. Incident management teams can take proportional actions — quarantining one compartment — rather than taking the whole system offline. Compartments also naturally create the boundaries needed for failure domain isolation and backup component separation.
### The Backup Drift Anti-Pattern
Two failure modes destroy backup component value:
- **Overreliance drift:** The backup gets treated as part of normal operation. During an actual outage, it is already saturated.
- **Zombie backup:** The backup is never touched. When needed, it fails because it was never maintained or tested.
Mitigation: route a small, constant fraction of real traffic through alternative components. This both validates them continuously and prevents them from becoming a surprise source of denial of service.
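Routing a constant fraction of traffic can be sketched with a stable hash of the request ID, so the same request always takes the same path. The 1% default, the `route` function, and the `"primary"`/`"backup"` labels are assumptions for illustration.

```python
import hashlib


def route(request_id: str, backup_fraction: float = 0.01) -> str:
    """Deterministically send roughly backup_fraction of requests to the backup."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "backup" if bucket < backup_fraction else "primary"


ids = [f"req-{i}" for i in range(10_000)]
backup_count = sum(route(i) == "backup" for i in ids)
print(f"{backup_count} of {len(ids)} requests routed to the backup")
```

Hashing rather than random sampling keeps routing reproducible, which makes it easier to trace a given request when the backup misbehaves.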
---
## Examples
### Example 1 — Fail-Open vs. Fail-Closed: Two-Factor Authentication
A service uses two-factor authentication (2FA) in the login flow. 2FA infrastructure becomes unavailable.
- **Fail-open decision:** Allow logins without 2FA. Users can still access their accounts. Acceptable during early 2FA adoption or for low-sensitivity accounts.
- **Fail-closed decision:** Disable all logins when 2FA is unavailable. A bank with custody of financial accounts would typically choose this — unauthorized access is a worse outcome than a service outage.
**Applying the framework:** Determine the minimum security posture first. The 2FA service itself should be protected against load shedding (high QoS priority) to reduce the frequency of this decision.
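The two decisions differ only in what happens when the second factor cannot be checked. In this minimal sketch, `FailPolicy`, `SecondFactorUnavailable`, and `login` are hypothetical names, and password verification is assumed to have happened already.

```python
from enum import Enum
from typing import Callable


class FailPolicy(Enum):
    OPEN = "open"      # availability priority
    CLOSED = "closed"  # security priority


class SecondFactorUnavailable(Exception):
    """Raised when the 2FA infrastructure cannot be reached."""


def login(
    password_ok: bool,
    verify_second_factor: Callable[[], bool],
    policy: FailPolicy,
) -> bool:
    if not password_ok:
        return False
    try:
        return verify_second_factor()
    except SecondFactorUnavailable:
        # The policy applies only to an outage, never to a wrong code.
        return policy is FailPolicy.OPEN


def outage() -> bool:
    raise SecondFactorUnavailable("2FA backend down")


open_result = login(True, outage, FailPolicy.OPEN)
closed_result = login(True, outage, FailPolicy.CLOSED)
wrong_code = login(True, lambda: False, FailPolicy.OPEN)
```

A wrong second factor is rejected under both policies. Only the outage case is governed by the fail-open or fail-closed choice.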
### Example 2 — Multi-Layer Defense in Depth: Multi-Tenant Compute Platform
A platform runs untrusted user code alongside trusted internal services.
Defense layers (each designed to catch what the previous layer misses):
1. API surface reduction — remove or replace built-in APIs that allow dangerous operations
2. Runtime compilation to constrained bytecode — eliminates large classes of memory corruption and control-flow attacks
3. System call filtering — catches exploitable conditions that bypass the bytecode layer; violations terminate the runtime immediately and alert
Over five years, this layered approach caught exploitable conditions that each individual layer would have missed. No single layer was assumed sufficient; each anticipated the failure of the previous one.
### Example 3 — Failure Domain with Canary Deployment
A configuration management system splits into two failure domains. Policy: no simultaneous updates to both domains.
- Domain A receives a new configuration push first (canary)
- Monitoring compares behavior between Domain A and Domain B
- If Domain A shows unexpected errors, the push is halted before Domain B is affected
- Blast radius of a bad configuration change: one failure domain, not the whole system
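The sequence above can be sketched as a staged push that updates one domain at a time and halts on the first unhealthy result. The `staged_push` function, the domain names, and the health check are hypothetical. Rollback of the canary domain is not shown.

```python
from typing import Callable


def staged_push(
    domains: list[str],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Push to one domain at a time; stop before the next if one is unhealthy."""
    updated: list[str] = []
    for domain in domains:
        apply(domain)
        updated.append(domain)
        if not healthy(domain):
            break  # halt: later domains keep the old configuration
    return updated


pushed: list[str] = []
updated = staged_push(
    ["domain-a", "domain-b"],
    apply=pushed.append,
    healthy=lambda d: d != "domain-a",  # simulate errors in the canary domain
)
```

Because the loop never updates two domains in the same step, the no-simultaneous-updates policy is enforced by construction.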
### Example 4 — Zombie Backup Prevention: On-Call Low-Dependency Validation
On-call engineers are routed through low-dependency system paths as part of their normal on-call rotation. Benefits:
- Continuously validates that low-dependency components are operational
- Engineers build familiarity with emergency paths before they need them under stress
- Reduces risk of misconfiguration surprise during an actual incident
---
## References
- Chapter 8: Design for Resilience, *Building Secure and Reliable Systems* (pp. 143–181)
- Chapter 9: Design for Recovery (recovery after resilience fails)
- Chapter 10: Denial-of-Service attacks (DoS defense evaluation, item 4 of the prioritized implementation order)
- Chapter 13: Testing for Reliability and Security (fuzzing, failure injection)
- Chapter 15: Measuring System Properties (choosing what to validate)
- *Site Reliability Engineering* book, Chapters 20 (lame-duck mode), 21 (throttling), 22 (load shedding)
- See `references/component-tier-definitions.md` for detailed tier configuration parameters
- See `references/compartmentalization-matrix.md` for a role/location/time decision matrix template
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Building Secure and Reliable Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)