@clawhub-aipoch-ai-772015cadb
Fix garbled text in PDF/SVG vector graphics caused by font encoding issues, making files editable in AI tools. Supports batch processing and JSON export for...
---
name: vector-text-fixer
description: Fix garbled text in PDF/SVG vector graphics caused by font encoding issues, making files editable in AI tools. Supports batch processing and JSON export for manual correction.
license: MIT
skill-author: AIPOCH
---
# Vector Text Fixer
Fixes garbled text in PDF/SVG vector graphics caused by font embedding problems, encoding errors, or missing font substitution. Outputs repaired files or editable JSON for AI tool import.
## Quick Check
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
python scripts/main.py --input document.pdf --output fixed.pdf
python scripts/main.py --input diagram.svg --output fixed.svg
```
## When to Use
- Fix garbled/box characters in PDF files caused by font embedding issues
- Repair SVG text encoding errors before editing in Illustrator or Inkscape
- Batch-process a folder of PDF/SVG files with garbled text
- Export a text map JSON for manual correction in AI editors
## Workflow
1. Confirm input file path (PDF or SVG) or batch folder, and desired output path.
2. Validate that the request involves PDF/SVG garbled text repair; stop early if not.
3. Run `scripts/main.py --input <file> --output <file>` or `--batch <folder>`.
4. Return a structured result separating repaired blocks, skipped blocks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the Fallback Template below.
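Step 4 can be sketched as a small partition over the per-block dictionaries. This is a minimal sketch, not the script's own logic: the function name `partition_blocks`, the meaning given to "skipped" (blocks that were not flagged as garbled), and the 0.3 cut-off (borrowed from the Stress-Case Output Checklist) are assumptions. The keys `is_garbled`, `confidence`, and `suggested_fix` mirror the `TextBlock` fields in `scripts/main.py`.

```python
from typing import Any, Dict, List

REVIEW_THRESHOLD = 0.3  # assumed cut-off, taken from the checklist's "confidence < 0.3"
MANUAL_MARKER = "[Manual input required"  # prefix the script uses for placeholder fixes


def partition_blocks(blocks: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Split block dicts into repaired, skipped, and unresolved groups."""
    groups: Dict[str, List[Dict[str, Any]]] = {
        "repaired": [],
        "skipped": [],
        "unresolved": [],
    }
    for block in blocks:
        fix = block.get("suggested_fix", "")
        if not block.get("is_garbled"):
            groups["skipped"].append(block)      # readable text, nothing to repair
        elif (
            block.get("confidence", 0.0) < REVIEW_THRESHOLD
            or not fix
            or fix.startswith(MANUAL_MARKER)
        ):
            groups["unresolved"].append(block)   # needs manual review
        else:
            groups["repaired"].append(block)     # has a usable suggested fix
    return groups
```

Reporting the three lists separately keeps low-confidence blocks from being presented as successful repairs.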
## Fallback Template
If `scripts/main.py` fails or required fields are missing, respond with:
```
FALLBACK REPORT
───────────────────────────────────────
Objective : <repair goal>
Inputs Available : <file path or batch folder provided>
Missing Inputs : <list exactly what is missing>
Note: --input requires a valid PDF or SVG file path, not a text string.
For batch mode use --batch <folder_path> instead.
Partial Result : <any blocks repaired safely>
Blocked Steps : <what could not be completed and why>
Next Steps : <minimum info needed to complete>
───────────────────────────────────────
```
## Stress-Case Output Checklist
For complex multi-constraint requests, always include these sections explicitly:
- **Assumptions**: repair level default (standard), encoding auto-detected
- **Constraints**: encrypted PDFs require password unlock first; scanned PDFs need OCR first
- **Risks**: severely damaged files may not be fully repairable; rare fonts may not map correctly
- **Unresolved Items**: blocks with confidence < 0.3 flagged for manual review
## Supported Scenarios
**PDF Garbled Text:**
- Box/question mark issues from font embedding problems
- Garbled text from encoding conversion errors
- Missing font substitution characters
- Multi-language mixed encoding issues
**SVG Garbled Text:**
- Text entity encoding errors
- Special character escaping issues
- Invalid font reference display abnormalities
- XML encoding declaration errors
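As an illustration of the "text entity encoding errors" case, double-escaped entities such as `&amp;eacute;` can often be recovered by unescaping until the text stops changing. The sketch below uses only the standard library's `html.unescape`; the helper name `unescape_entities` and the pass limit are assumptions, and the packaged script may handle this differently.

```python
import html


def unescape_entities(text: str, max_passes: int = 3) -> str:
    """Repeatedly decode HTML/XML entities until the text is stable."""
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break  # nothing left to decode
        text = decoded
    return text


print(unescape_entities("Caf&amp;eacute;"))  # double-escaped input
```

Text that is meant to show a literal entity (for example documentation about `&amp;`) would be over-decoded by this approach, so the result is best treated as a suggestion for review.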
## CLI Usage
```bash
# Fix single PDF
python scripts/main.py --input document.pdf --output fixed.pdf
# Fix single SVG
python scripts/main.py --input diagram.svg --output fixed.svg
# Batch process folder
python scripts/main.py --batch ./input_folder --output ./output_folder
# Interactive repair
python scripts/main.py --input doc.pdf --interactive
# Export editable JSON
python scripts/main.py --input doc.pdf --export-json editable.json
# Specify repair level
python scripts/main.py --input doc.pdf --output fixed.pdf --repair-level aggressive
```
## Parameters
| Parameter | Required | Description | Default |
|---|---|---|---|
| `--input` | Yes* | Input PDF or SVG file path | — |
| `--batch` | Yes* | Batch input folder path | — |
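| `--output` | Batch only | Output file or folder path; required with `--batch`, optional with `--input` | — |
| `--repair-level` | No | `minimal` / `standard` / `aggressive` | `standard` |
| `--interactive` | No | Enable interactive repair mode (the flag is parsed, but the packaged `scripts/main.py` does not yet act on it) | False |
| `--export-json` | No | Export editable JSON to the given path; requires `--input` | — |
| `--encoding` | No | Source file encoding (default: auto-detect); not defined by the packaged `scripts/main.py`, which rejects the flag | auto |
*Exactly one of `--input` or `--batch` is required; the two are mutually exclusive.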
## Repair Levels
- **Minimal**: Only obvious errors (replacement characters, null bytes); maximum original integrity
- **Standard**: Common encoding issues + smart font replacement; balanced repair rate and accuracy
- **Aggressive**: Full text re-encoding + OCR-assisted recognition; for severely garbled documents
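The re-encoding idea behind the aggressive level can be illustrated with the most common mojibake case: UTF-8 bytes that were decoded as Latin-1. The sketch below reverses that by encoding back to Latin-1 and decoding as UTF-8. The helper name `undo_latin1_mojibake` is an assumption, and the function covers only this one failure mode, not OCR or font remapping.

```python
from typing import Optional


def undo_latin1_mojibake(text: str) -> Optional[str]:
    """Reverse 'UTF-8 read as Latin-1' garbling; return None if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes are
        # not valid UTF-8, so this particular repair does not fit.
        return None


garbled = "Café".encode("utf-8").decode("latin-1")  # simulate the damage
print(garbled, "->", undo_latin1_mojibake(garbled))
```

Returning `None` instead of a best guess keeps the caller from treating an inapplicable repair as a fix.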
## Output Format (JSON Export)
```json
{
"file_type": "pdf",
"pages": [{
"page_num": 1,
"text_blocks": [{
"id": "tb_001",
"bbox": [100, 200, 300, 220],
"original_text": "?????",
"detected_encoding": "UTF-8",
"confidence": 0.3,
"suggested_fix": "Sample Text"
}]
}],
"repair_summary": {
"total_blocks": 15,
"fixed_blocks": 12,
"skipped_blocks": 3
}
}
```
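A consumer of this export might start by listing the blocks that still need manual review. The following is a minimal sketch that walks the layout documented above; the function name `low_confidence_ids` and the default threshold of 0.3 are assumptions. Note that the packaged `scripts/main.py` nests its data under `repair_data`, so the keys may need adapting for files it writes.

```python
import json
from typing import List


def low_confidence_ids(export_text: str, threshold: float = 0.3) -> List[str]:
    """Return ids of text blocks whose confidence is below the threshold."""
    data = json.loads(export_text)
    flagged: List[str] = []
    for page in data.get("pages", []):
        for block in page.get("text_blocks", []):
            if block.get("confidence", 1.0) < threshold:
                flagged.append(block["id"])
    return flagged
```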
## Input Validation
This skill accepts: PDF (.pdf) or SVG (.svg) file paths, or a folder path for batch processing, where the files contain garbled or unreadable text caused by font/encoding issues.
If the request does not involve PDF/SVG garbled text repair — for example, asking to convert file formats, edit PDF content directly, perform OCR on scanned images, or process non-vector files — do not proceed. Instead respond:
> "`vector-text-fixer` is designed to fix garbled text in PDF/SVG vector graphics caused by font encoding issues. Your request appears to be outside this scope. Please provide a valid PDF or SVG file path, or use a more appropriate tool."
## Error Handling
- If `--input` receives a text string instead of a file path, report the error and request a valid file path.
- If the file is encrypted, report that password unlock is required before processing.
- If the task goes outside documented scope, stop instead of guessing.
- If `scripts/main.py` fails, use the Fallback Template above.
- Do not fabricate repaired text content or execution outcomes.
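The first rule above (a text string passed where a file path is expected) can be checked before the script is run. This is a sketch under stated assumptions: the helper name `classify_input` and its return values are hypothetical and are not part of `scripts/main.py`.

```python
from pathlib import Path
from typing import Optional

SUPPORTED_SUFFIXES = {".pdf", ".svg"}


def classify_input(path_str: str) -> Optional[str]:
    """Return 'pdf', 'svg', or 'batch' for usable inputs, else None."""
    path = Path(path_str)
    if path.is_dir():
        return "batch"  # a folder is a candidate for --batch
    if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
        return path.suffix.lower().lstrip(".")
    return None  # missing file, wrong type, or free text instead of a path
```

A `None` result is the cue to ask for a valid file path rather than running the script.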
## Output Requirements
Every final response must include:
1. **Objective** — file(s) repaired and repair level used
2. **Inputs Received** — file path, repair level, encoding settings
3. **Assumptions** — defaults applied (repair level, encoding detection)
4. **Result** — output file path, blocks fixed vs skipped
5. **Risks and Limits** — confidence thresholds, manual review blocks
6. **Next Checks** — review low-confidence blocks manually before use
## Limitations
- Encrypted PDFs require password unlock before processing
- Severely damaged vector files may not be fully repairable
- Some rare fonts may not map correctly
- Scanned PDFs require OCR recognition first
## Dependencies
```
pdfplumber >= 0.10.0
PyMuPDF >= 1.23.0
cairosvg >= 2.7.0
beautifulsoup4 >= 4.12.0
fonttools >= 4.40.0
chardet >= 5.0.0
Pillow >= 10.0.0
```
FILE:requirements.txt
beautifulsoup4
lxml
PyMuPDF
FILE:scripts/main.py
#!/usr/bin/env python3
"""
Vector Text Fixer - Fix garbled text in PDF/SVG vector graphics
"""
import argparse
import json
import os
import re
import sys
from pathlib import Path
from typing import Dict, List, Optional, Tuple, Any
from dataclasses import dataclass, asdict
@dataclass
class TextBlock:
"""Text block data structure"""
id: str
bbox: List[float] # [x0, y0, x1, y1]
original_text: str
page_num: int = 1
    font_info: Optional[Dict[str, Any]] = None
confidence: float = 1.0
suggested_fix: str = ""
is_garbled: bool = False
class GarbledTextDetector:
"""Garbled text detector"""
# Common garbled/replacement characters
REPLACEMENT_CHARS = {
'\ufffd', # replacement character
'\u25a1', # white square
'\u25a0', # black square
'\u25af', # white rectangle
'\u2588', # full block
'\ufffe', # non-character
'\uffff', # non-character
'?', # question mark substitute
}
# Common garbled patterns
GARBLED_PATTERNS = [
r'[\u0000-\u0008\u000b-\u000c\u000e-\u001f]', # Control characters
r'[\ufffd\u25a1\u25a0\u25af\u2588\ufffe\uffff]', # Replacement characters
r'[?]{2,}', # Consecutive replacement characters
r'(?:\\x[0-9a-fA-F]{2}){2,}', # Escape sequences
r'[\x80-\x9f]', # C1 control characters
]
def __init__(self):
self.compiled_patterns = [re.compile(p) for p in self.GARBLED_PATTERNS]
def is_garbled(self, text: str) -> Tuple[bool, float]:
"""
Detect whether text is garbled
Returns: (is_garbled, garbled_confidence 0-1)
"""
if not text or not isinstance(text, str):
return False, 1.0
text_len = len(text)
if text_len == 0:
return False, 1.0
garbled_score = 0.0
# 1. Check replacement characters
replacement_count = sum(1 for c in text if c in self.REPLACEMENT_CHARS)
garbled_score += (replacement_count / text_len) * 0.5
# 2. Check garbled patterns
for pattern in self.compiled_patterns:
matches = pattern.findall(text)
if matches:
garbled_score += len(matches) / text_len * 0.3
# 3. Check abnormal character distribution
if self._has_abnormal_distribution(text):
garbled_score += 0.2
# 4. Check for mixed encoding signs
if self._has_encoding_mixed(text):
garbled_score += 0.15
is_garbled = garbled_score > 0.15 or replacement_count > 0
confidence = max(0.0, 1.0 - min(garbled_score, 1.0))
return is_garbled, confidence
def _has_abnormal_distribution(self, text: str) -> bool:
"""Check whether character distribution is abnormal"""
if len(text) < 3:
return False
# Count ratio of non-printable characters
unprintable = sum(1 for c in text if ord(c) < 32 and c not in '\t\n\r')
ratio = unprintable / len(text)
return ratio > 0.3
def _has_encoding_mixed(self, text: str) -> bool:
"""Detect whether mixed encoding signs are present"""
# Detect signs of UTF-8 multi-byte characters being incorrectly parsed
        # e.g. "é" shows up as "Ã©" (UTF-8 bytes parsed as Latin-1)
mixed_patterns = [
r'Ã[\xa0-\xbf]', # UTF-8 misread as Latin-1
r'Â[\x80-\xbf]',
r'â',
r'ã',
]
for pattern in mixed_patterns:
if re.search(pattern, text):
return True
return False
class PDFFixer:
"""PDF text fixer"""
def __init__(self, detector: GarbledTextDetector):
self.detector = detector
self.text_blocks: List[TextBlock] = []
def fix(self, input_path: str, output_path: str,
repair_level: str = "standard") -> Dict[str, Any]:
"""
Fix garbled text in a PDF file
"""
try:
import fitz # PyMuPDF
except ImportError:
return {
"success": False,
"error": "PyMuPDF (fitz) is required. Install: pip install PyMuPDF"
}
try:
doc = fitz.open(input_path)
except Exception as e:
return {"success": False, "error": f"Cannot open PDF: {str(e)}"}
self.text_blocks = []
repair_count = 0
for page_num in range(len(doc)):
page = doc[page_num]
blocks = self._extract_text_blocks(page, page_num + 1)
for block in blocks:
is_garbled, confidence = self.detector.is_garbled(block.original_text)
if is_garbled:
block.is_garbled = True
block.confidence = confidence
block.suggested_fix = self._suggest_fix(
block.original_text,
repair_level
)
repair_count += 1
self.text_blocks.append(block)
# Generate repair report
result = {
"success": True,
"file_type": "pdf",
"pages": len(doc),
"total_blocks": len(self.text_blocks),
"garbled_blocks": repair_count,
"text_blocks": [asdict(b) for b in self.text_blocks],
"output_path": output_path
}
doc.close()
return result
def _extract_text_blocks(self, page, page_num: int) -> List[TextBlock]:
"""Extract text blocks from a PDF page"""
blocks = []
try:
import fitz
# Get text blocks on the page
text_dict = page.get_text("dict")
block_id = 0
for block in text_dict.get("blocks", []):
if "lines" in block: # Text block
for line in block["lines"]:
for span in line.get("spans", []):
text = span.get("text", "")
if text.strip():
bbox = span.get("bbox", [0, 0, 0, 0])
font_info = {
"font": span.get("font", "Unknown"),
"size": span.get("size", 0),
"flags": span.get("flags", 0),
"color": span.get("color", 0)
}
tb = TextBlock(
id=f"p{page_num}_b{block_id}",
bbox=list(bbox),
original_text=text,
page_num=page_num,
font_info=font_info
)
blocks.append(tb)
block_id += 1
except Exception as e:
print(f"Warning: Error extracting text: {e}")
return blocks
def _suggest_fix(self, garbled_text: str, repair_level: str) -> str:
"""Suggest repair text based on garbled content"""
# More complex repair logic can be implemented here
# Currently returns a placeholder prompting the user to input manually
if repair_level == "minimal":
# Minimal repair: only remove replacement characters
return garbled_text.replace('\ufffd', '').strip()
elif repair_level == "aggressive":
# Deep repair: attempt to decode common encoding errors
return self._try_decode_fixes(garbled_text)
else: # standard
# Standard repair: mark for user confirmation
if all(c in GarbledTextDetector.REPLACEMENT_CHARS for c in garbled_text):
return f"[Manual input required - original: {len(garbled_text)} garbled characters]"
else:
return garbled_text.replace('\ufffd', '[?]').strip()
def _try_decode_fixes(self, text: str) -> str:
"""Attempt multiple encoding repairs"""
# Common encoding error pattern fixes
fixes = []
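        # UTF-8 parsed as Latin-1
        try:
            fixed = text.encode('latin-1').decode('utf-8')
            fixes.append(fixed)
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Text is not representable in Latin-1 or is not valid UTF-8
            pass
        # GBK/GB2312 issue
        try:
            fixed = text.encode('latin-1').decode('gbk', errors='ignore')
            fixes.append(fixed)
        except (UnicodeEncodeError, UnicodeDecodeError):
            pass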
# Return the first reasonable fix
for fix in fixes:
if not self.detector.is_garbled(fix)[0]:
return fix
        return "[Manual input required]"
class SVGFixer:
"""SVG text fixer"""
def __init__(self, detector: GarbledTextDetector):
self.detector = detector
self.text_elements: List[TextBlock] = []
def fix(self, input_path: str, output_path: str,
repair_level: str = "standard") -> Dict[str, Any]:
"""
Fix garbled text in an SVG file
"""
try:
from bs4 import BeautifulSoup
except ImportError:
return {
"success": False,
"error": "BeautifulSoup4 is required. Install: pip install beautifulsoup4"
}
try:
with open(input_path, 'r', encoding='utf-8', errors='ignore') as f:
content = f.read()
except Exception as e:
return {"success": False, "error": f"Cannot read SVG: {str(e)}"}
soup = BeautifulSoup(content, 'xml')
# Extract all text elements
self.text_elements = []
repair_count = 0
text_tags = soup.find_all(['text', 'tspan', 'textPath'])
for idx, tag in enumerate(text_tags):
text_content = tag.get_text()
if not text_content.strip():
continue
is_garbled, confidence = self.detector.is_garbled(text_content)
# Get position and font information
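            # SVG coordinates may be lists ("10 20") or carry units ("10px");
            # keep only the leading number so float() below cannot fail
            x_match = re.match(r'\s*[-+]?\d*\.?\d+', tag.get('x', '0'))
            y_match = re.match(r'\s*[-+]?\d*\.?\d+', tag.get('y', '0'))
            x = x_match.group(0) if x_match else '0'
            y = y_match.group(0) if y_match else '0'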
font_family = tag.get('font-family', 'default')
font_size = tag.get('font-size', '12')
tb = TextBlock(
id=f"text_{idx}",
bbox=[float(x) if x else 0, float(y) if y else 0, 0, 0],
original_text=text_content,
font_info={
"font_family": font_family,
"font_size": font_size
},
page_num=1
)
if is_garbled:
tb.is_garbled = True
tb.confidence = confidence
tb.suggested_fix = self._suggest_fix(text_content, repair_level)
repair_count += 1
self.text_elements.append(tb)
# Get SVG basic information
svg_tag = soup.find('svg')
svg_info = {
"width": svg_tag.get('width', 'unknown') if svg_tag else 'unknown',
"height": svg_tag.get('height', 'unknown') if svg_tag else 'unknown',
"viewBox": svg_tag.get('viewBox', '') if svg_tag else ''
}
result = {
"success": True,
"file_type": "svg",
"svg_info": svg_info,
"total_elements": len(self.text_elements),
"garbled_elements": repair_count,
"text_elements": [asdict(t) for t in self.text_elements],
"output_path": output_path
}
return result
def _suggest_fix(self, garbled_text: str, repair_level: str) -> str:
"""Suggest SVG text repair"""
if repair_level == "minimal":
return garbled_text.replace('\ufffd', '').strip()
elif repair_level == "aggressive":
return self._try_xml_entity_fix(garbled_text)
else:
if '\ufffd' in garbled_text:
return f"[Manual input required - original: {len(garbled_text)} garbled characters]"
return garbled_text
def _try_xml_entity_fix(self, text: str) -> str:
"""Attempt to fix XML entity encoding issues"""
import html
# Decode HTML entities
decoded = html.unescape(text)
if not self.detector.is_garbled(decoded)[0]:
return decoded
        return "[Manual input required]"
class VectorTextFixer:
"""Vector text fixer main class"""
def __init__(self):
self.detector = GarbledTextDetector()
self.pdf_fixer = PDFFixer(self.detector)
self.svg_fixer = SVGFixer(self.detector)
def fix_file(self, input_path: str, output_path: str,
repair_level: str = "standard") -> Dict[str, Any]:
"""
Automatically select repair method based on file type
"""
input_path = Path(input_path)
if not input_path.exists():
return {"success": False, "error": f"File not found: {input_path}"}
suffix = input_path.suffix.lower()
if suffix == '.pdf':
return self.pdf_fixer.fix(str(input_path), output_path, repair_level)
elif suffix == '.svg':
return self.svg_fixer.fix(str(input_path), output_path, repair_level)
else:
return {"success": False, "error": f"Unsupported file format: {suffix}"}
def batch_fix(self, input_folder: str, output_folder: str,
repair_level: str = "standard") -> List[Dict[str, Any]]:
"""
Batch repair PDF/SVG files in a folder
"""
input_folder = Path(input_folder)
output_folder = Path(output_folder)
output_folder.mkdir(parents=True, exist_ok=True)
results = []
for file_path in input_folder.iterdir():
if file_path.suffix.lower() in ['.pdf', '.svg']:
output_path = output_folder / f"fixed_{file_path.name}"
result = self.fix_file(str(file_path), str(output_path), repair_level)
results.append(result)
return results
def export_editable_json(self, input_path: str, output_path: str) -> Dict[str, Any]:
"""
Export editable JSON format for manual repair in AI tools
"""
result = self.fix_file(input_path, "", repair_level="standard")
if not result.get("success"):
return result
# Add editable markers
editable_data = {
"file_info": {
"original_path": input_path,
"exported_at": self._get_timestamp(),
"tool": "Vector Text Fixer v1.0.0"
},
"repair_data": result
}
# Add user-editable fields
if result["file_type"] == "pdf":
for block in editable_data["repair_data"]["text_blocks"]:
block["user_editable"] = block.get("suggested_fix", "")
elif result["file_type"] == "svg":
for elem in editable_data["repair_data"]["text_elements"]:
elem["user_editable"] = elem.get("suggested_fix", "")
# Save JSON
try:
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(editable_data, f, ensure_ascii=False, indent=2)
editable_data["export_success"] = True
except Exception as e:
editable_data["export_success"] = False
editable_data["export_error"] = str(e)
return editable_data
def _get_timestamp(self) -> str:
"""Get current timestamp"""
from datetime import datetime
return datetime.now().isoformat()
def main():
"""Command-line entry point"""
parser = argparse.ArgumentParser(
description='Vector Text Fixer - Fix garbled text in PDF/SVG vector graphics'
)
# Input options
input_group = parser.add_mutually_exclusive_group(required=True)
input_group.add_argument('--input', '-i', help='Input file path (PDF or SVG)')
input_group.add_argument('--batch', '-b', help='Batch process input folder')
# Output options
parser.add_argument('--output', '-o', help='Output file/folder path')
parser.add_argument('--export-json', '-j', help='Export editable JSON format')
# Repair options
parser.add_argument('--repair-level', '-r',
choices=['minimal', 'standard', 'aggressive'],
default='standard',
help='Repair level (default: standard)')
parser.add_argument('--interactive', action='store_true',
help='Enable interactive repair mode')
args = parser.parse_args()
# Create fixer instance
fixer = VectorTextFixer()
# Export JSON mode
if args.export_json:
if not args.input:
print("Error: --export-json requires --input to be specified")
sys.exit(1)
result = fixer.export_editable_json(args.input, args.export_json)
if result.get("export_success"):
print(f"✓ JSON export successful: {args.export_json}")
print(f" File type: {result['repair_data'].get('file_type')}")
            # Fall back to the SVG keys only when the PDF keys are absent,
            # so a legitimate count of 0 is not printed as None
            repair_data = result['repair_data']
            print(f" Detected text blocks: {repair_data.get('total_blocks', repair_data.get('total_elements', 0))}")
            print(f" Garbled text blocks: {repair_data.get('garbled_blocks', repair_data.get('garbled_elements', 0))}")
else:
print(f"✗ Export failed: {result.get('export_error', 'Unknown error')}")
sys.exit(1)
return
# Batch processing mode
if args.batch:
if not args.output:
print("Error: Batch processing requires --output to be specified")
sys.exit(1)
print(f"Starting batch processing: {args.batch}")
results = fixer.batch_fix(args.batch, args.output, args.repair_level)
success_count = sum(1 for r in results if r.get("success"))
total_count = len(results)
print(f"\nProcessing complete: {success_count}/{total_count} files succeeded")
for r in results:
if r.get("success"):
garbled = r.get("garbled_blocks") or r.get("garbled_elements", 0)
print(f" ✓ {r.get('output_path')} (garbled: {garbled})")
else:
print(f" ✗ {r.get('error', 'Unknown error')}")
return
# Single file processing mode
if args.input:
print(f"Processing file: {args.input}")
result = fixer.fix_file(args.input, args.output or "", args.repair_level)
if result.get("success"):
print(f"✓ Analysis complete")
print(f" File type: {result.get('file_type')}")
if result.get('file_type') == 'pdf':
print(f" Pages: {result.get('pages')}")
print(f" Text blocks: {result.get('total_blocks')}")
print(f" Garbled blocks: {result.get('garbled_blocks')}")
else:
print(f" Text elements: {result.get('total_elements')}")
print(f" Garbled elements: {result.get('garbled_elements')}")
# Show garbled text details
blocks = result.get('text_blocks') or result.get('text_elements', [])
garbled_blocks = [b for b in blocks if b.get('is_garbled')]
if garbled_blocks:
print(f"\nDetected garbled text:")
for i, block in enumerate(garbled_blocks[:5], 1):
orig = block.get('original_text', '')[:50]
sugg = block.get('suggested_fix', '')[:50]
print(f" {i}. ID: {block.get('id')}")
print(f" Original: {orig}")
print(f" Suggested: {sugg}")
print(f" Confidence: {block.get('confidence', 0):.2f}")
if len(garbled_blocks) > 5:
print(f" ... and {len(garbled_blocks) - 5} more garbled text items")
if args.output:
print(f"\nOutput path: {args.output}")
print("Tip: Use --export-json to export editable format for manual repair")
else:
print(f"✗ Processing failed: {result.get('error', 'Unknown error')}")
sys.exit(1)
if __name__ == "__main__":
main()
Integrate REVEL, CADD, PolyPhen scores to predict variant pathogenicity.
---
name: variant-pathogenicity-predictor
description: Integrate REVEL, CADD, PolyPhen scores to predict variant pathogenicity.
license: MIT
skill-author: AIPOCH
---
# Variant Pathogenicity Predictor
Integrate REVEL, CADD, PolyPhen and other scores to predict variant pathogenicity.
## When to Use
- Use this skill when the task is to integrate REVEL, CADD, and PolyPhen scores to predict variant pathogenicity.
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
- Scope-focused workflow aligned to: Integrate REVEL, CADD, PolyPhen scores to predict variant pathogenicity.
- Packaged executable path(s): `scripts/main.py`.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
See `## Prerequisites` below for related details.
- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `Third-party packages`: `not explicitly version-pinned in this skill package`. Add pinned versions if this skill needs stricter environment control.
## Example Usage
See `## Usage` below for related details.
```bash
cd "20260318/scientific-skills/Data Analytics/variant-pathogenicity-predictor"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Usage
```text
python scripts/main.py --demo
python scripts/main.py --revel 0.85 --cadd 25 --polyphen 0.92
```
## Parameters
- `--revel`: REVEL score (float; defaults to 0.5 when omitted)
- `--cadd`: CADD score (float; defaults to 15 when omitted)
- `--polyphen`: PolyPhen score (float; defaults to 0.5 when omitted)
- `--demo`: Run a built-in demo variant (REVEL 0.85, CADD 25, PolyPhen 0.92)
## Integrated Scores
- REVEL (Rare Exome Variant Ensemble Learner)
- CADD (Combined Annotation Dependent Depletion)
- PolyPhen-2 (Polymorphism Phenotyping)
- SIFT (Sorting Intolerant From Tolerant)
- MutationTaster
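Of the scores listed, the packaged `scripts/main.py` combines three: REVEL, CADD, and PolyPhen. The sketch below restates that weighting and the classification thresholds as standalone functions so the arithmetic can be checked by hand. The function names are assumptions; the weights (0.4 / 0.3 / 0.3), the CADD cap at 30, and the cut-offs are copied from the script.

```python
def composite_score(revel: float, cadd: float, polyphen: float) -> float:
    """Weighted composite: CADD is scaled to 0-1 by dividing by 30 and capping at 1."""
    cadd_norm = min(cadd / 30, 1.0)
    return revel * 0.4 + cadd_norm * 0.3 + polyphen * 0.3


def classify(score: float) -> str:
    """Map a composite score to the five labels used by the script."""
    if score >= 0.9:
        return "Pathogenic"
    if score >= 0.7:
        return "Likely Pathogenic"
    if score >= 0.3:
        return "Uncertain Significance"
    if score >= 0.1:
        return "Likely Benign"
    return "Benign"


# Demo variant from the script: 0.85*0.4 + (25/30)*0.3 + 0.92*0.3 = 0.34 + 0.25 + 0.276
score = composite_score(revel=0.85, cadd=25, polyphen=0.92)
print(f"{score:.3f} -> {classify(score)}")
```

The thresholds are heuristic cut-offs on a weighted average; the script does not apply ACMG evidence criteria.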
## Output
- Pathogenicity classification
- ACMG guideline interpretation
- Individual score breakdown
- Confidence assessment
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
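The path-related items in this checklist (no `../` traversal, output restricted to the workspace) can be verified with a resolve-and-compare check. This is a minimal sketch using only `pathlib`; the helper name `is_inside_workspace` is hypothetical and is not part of the packaged script.

```python
from pathlib import Path


def is_inside_workspace(candidate: str, workspace: str) -> bool:
    """True if candidate, resolved relative to workspace, stays inside it."""
    root = Path(workspace).resolve()
    # Joining with an absolute candidate discards root, so absolute paths
    # outside the workspace are rejected by the same comparison.
    target = (root / candidate).resolve()
    return target == root or root in target.parents
```

Resolving before comparing matters: a plain string prefix check would accept `workspace/../secret`.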
## Prerequisites
No additional Python packages required.
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
- Performance optimization
- Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `variant-pathogenicity-predictor` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `variant-pathogenicity-predictor` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:scripts/main.py
#!/usr/bin/env python3
"""
Variant Pathogenicity Predictor
Integrate REVEL, CADD, PolyPhen scores for variant classification.
"""
import argparse
import json
class VariantPathogenicityPredictor:
"""Predict variant pathogenicity."""
def predict(self, variant_data):
"""Predict pathogenicity from multiple scores."""
scores = {
"REVEL": variant_data.get("REVEL", 0.5),
"CADD": variant_data.get("CADD", 15),
"PolyPhen": variant_data.get("PolyPhen", 0.5)
}
# Normalize CADD (higher = more damaging)
cadd_norm = min(scores["CADD"] / 30, 1.0)
# Composite score
composite = (scores["REVEL"] * 0.4 +
cadd_norm * 0.3 +
scores["PolyPhen"] * 0.3)
# Classification
if composite >= 0.9:
classification = "Pathogenic"
elif composite >= 0.7:
classification = "Likely Pathogenic"
elif composite >= 0.3:
classification = "Uncertain Significance"
elif composite >= 0.1:
classification = "Likely Benign"
else:
classification = "Benign"
return {
"composite_score": composite,
"classification": classification,
"individual_scores": scores
}
def main():
parser = argparse.ArgumentParser(description="Variant Pathogenicity Predictor")
parser.add_argument("--revel", type=float, help="REVEL score")
parser.add_argument("--cadd", type=float, help="CADD score")
parser.add_argument("--polyphen", type=float, help="PolyPhen score")
parser.add_argument("--demo", action="store_true", help="Run demo")
args = parser.parse_args()
predictor = VariantPathogenicityPredictor()
if args.demo:
variant = {"REVEL": 0.85, "CADD": 25, "PolyPhen": 0.92}
else:
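        # Use "is not None" so an explicit score of 0.0 is not replaced by the default
        variant = {
            "REVEL": args.revel if args.revel is not None else 0.5,
            "CADD": args.cadd if args.cadd is not None else 15,
            "PolyPhen": args.polyphen if args.polyphen is not None else 0.5
        }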
result = predictor.predict(variant)
print(f"\n{'='*60}")
print("VARIANT PATHOGENICITY PREDICTION")
print(f"{'='*60}\n")
print(f"Composite Score: {result['composite_score']:.3f}")
print(f"Classification: {result['classification']}")
print("\nIndividual Scores:")
for tool, score in result['individual_scores'].items():
print(f" {tool}: {score}")
print(f"\n{'='*60}\n")
if __name__ == "__main__":
main()
Score the novelty of biological targets through literature mining and.
---
name: target-novelty-scorer
description: Score the novelty of biological targets through literature mining and.
license: MIT
skill-author: AIPOCH
---
# Target Novelty Scorer
ID: 177
## When to Use
- Use this skill when the task needs Score the novelty of biological targets through literature mining and.
- Use this skill for evidence insight tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
See `## Features` above for related details.
- Scope-focused workflow aligned to: Score the novelty of biological targets through literature mining and.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
- Python 3.9+
- requests
- pandas
- biopython (Entrez API)
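- numpy (the only third-party import in the bundled `scripts/main.py`, which runs on simulated search data)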
## Example Usage
See `## Usage` below for related details.
```bash
cd "20260318/scientific-skills/Evidence Insight/target-novelty-scorer"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Description
Score the novelty of biological targets based on literature mining. By analyzing literature in academic databases such as PubMed and PubMed Central, assess the research popularity, uniqueness, and innovation potential of target molecules in the research field.
## Features
- 🔬 **Literature Retrieval**: Automatically retrieve literature related to targets from PubMed and other databases
- 📊 **Novelty Scoring**: Calculate target novelty score based on multi-dimensional indicators (0-100)
- 📈 **Trend Analysis**: Analyze temporal trends in target research
- 🧬 **Cross-validation**: Verify current research status of targets by combining multiple databases
- 📝 **Report Generation**: Generate detailed novelty analysis reports
## Scoring Criteria
1. **Research Heat (0-25 points)**: Number of related publications and citations in recent years
2. **Uniqueness (0-25 points)**: Distinction from known popular targets
3. **Research Depth (0-20 points)**: Progress of preclinical/clinical research
4. **Collaboration Network (0-15 points)**: Diversity of research institutions/teams
5. **Temporal Trend (0-15 points)**: Research growth trends in recent years
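
The five ranges above add up to 100 points, so a breakdown can be sanity-checked before it is reported. The following is a minimal sketch, assuming the breakdown is a plain dict keyed like the `breakdown` object in the JSON output; `MAX_POINTS` and `check_breakdown` are hypothetical names and are not part of `scripts/main.py`.

```python
# Hypothetical helper: per-dimension ceilings copied from the Scoring Criteria list.
MAX_POINTS = {
    "research_heat": 25,
    "uniqueness": 25,
    "research_depth": 20,
    "collaboration": 15,
    "trend": 15,
}


def check_breakdown(breakdown):
    """Return the total score after verifying every dimension is within range."""
    for name, ceiling in MAX_POINTS.items():
        value = breakdown[name]  # raises KeyError if a dimension is missing
        if not 0 <= value <= ceiling:
            raise ValueError(f"{name}={value} is outside 0-{ceiling}")
    return round(sum(breakdown[name] for name in MAX_POINTS), 1)


# Values taken from the documented JSON example for PD-L1.
example = {
    "research_heat": 18.5,
    "uniqueness": 20.0,
    "research_depth": 15.2,
    "collaboration": 12.0,
    "trend": 6.8,
}
print(sum(MAX_POINTS.values()))
print(check_breakdown(example))
```

A check like this catches a dimension that was scaled to the wrong range before the total is presented to the user.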
## Usage
### Basic Usage
```text
cd /Users/z04030865/.openclaw/workspace/skills/target-novelty-scorer
python scripts/main.py --target "PD-L1"
```
### Advanced Options
```text
python scripts/main.py \
  --target "BRCA1" \
  --db pubmed \
  --years 10 \
  --output report.json \
  --format json
```
### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--target` | string | required | Target molecule name or gene symbol |
| `--db` | string | pubmed | Data source (pubmed, pmc, all) |
| `--years` | int | 5 | Number of most recent years to analyze |
| `--output` | string | stdout | Output file path |
| `--format` | string | text | Output format (text, json, csv) |
| `--verbose` | flag | false | Verbose output |
## Output Format
### JSON Output
```json
{
  "target": "PD-L1",
  "novelty_score": 72.5,
  "confidence": 0.85,
  "breakdown": {
    "research_heat": 18.5,
    "uniqueness": 20.0,
    "research_depth": 15.2,
    "collaboration": 12.0,
    "trend": 6.8
  },
  "metadata": {
    "total_papers": 15234,
    "recent_papers": 3421,
    "clinical_trials": 89,
    "analysis_date": "2026-02-06"
  },
  "interpretation": "This target has moderate novelty, with moderate research heat in recent years..."
}
```
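
A downstream consumer will usually want to load this report and map the score to a novelty band. The sketch below assumes the report is available as a JSON string and reuses the band thresholds from `generate_interpretation` in `scripts/main.py`; the helper names `novelty_level` and `summarize` are hypothetical.

```python
import json

# Band thresholds mirror generate_interpretation() in scripts/main.py.
LEVELS = [
    (80, "Extremely High Novelty"),
    (65, "High Novelty"),
    (50, "Medium Novelty"),
    (35, "Low Novelty"),
]


def novelty_level(score):
    """Map a 0-100 novelty score to its band label."""
    for threshold, label in LEVELS:
        if score >= threshold:
            return label
    return "Very Low Novelty"


def summarize(report_text):
    """Parse a JSON report and return (target, score, level, confidence)."""
    report = json.loads(report_text)
    score = float(report["novelty_score"])
    return report["target"], score, novelty_level(score), report["confidence"]


sample = '{"target": "PD-L1", "novelty_score": 72.5, "confidence": 0.85}'
print(summarize(sample))
```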
## API Requirements
- NCBI API Key (for PubMed retrieval)
- Optional: Europe PMC API
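
The bundled script simulates its search results, so the API key is not yet consumed anywhere. As a rough sketch of how a live lookup could be addressed, the snippet below only builds an NCBI E-utilities ESearch URL and performs no network call. The function name `build_esearch_url` is hypothetical, and the choice of query parameters is an assumption about how a real integration might ask for a publication count.

```python
from urllib.parse import urlencode

# Base URL as stored by PubMedSearcher in scripts/main.py.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(target, first_year, last_year, api_key=None):
    """Build an ESearch URL that asks PubMed for a publication count."""
    params = {
        "db": "pubmed",
        "term": target,
        "datetype": "pdat",          # filter on publication date
        "mindate": str(first_year),
        "maxdate": str(last_year),
        "rettype": "count",          # return only the hit count
        "retmode": "json",
    }
    if api_key:
        params["api_key"] = api_key  # optional NCBI API key
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


print(build_esearch_url("PD-L1", 2021, 2025))
```

Keeping URL construction separate from the HTTP request makes it possible to test the query logic without network access or a key.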
## Installation
```text
pip install -r requirements.txt
```
## License
MIT License - Part of OpenClaw Bioinformatics Skills Collection
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python scripts with tools | High |
| Network Access | External API calls | High |
| File System Access | Read/write data | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Data handled securely | Medium |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] API requests use HTTPS only
- [ ] Input validated against allowed patterns
- [ ] API timeout and retry mechanisms implemented
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no internal paths exposed)
- [ ] Dependencies audited
- [ ] No exposure of internal service architecture
## Prerequisites
```text
# Python dependencies
pip install -r requirements.txt
```
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
  - Performance optimization
  - Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `target-novelty-scorer` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `target-novelty-scorer` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
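
The validation rule above can be sketched as a pre-flight check that runs before the script is invoked. This is an illustrative sketch only: `validate_request` is a hypothetical helper, and the allowed values are copied from the Parameters table.

```python
ALLOWED_DBS = {"pubmed", "pmc", "all"}
ALLOWED_FORMATS = {"text", "json", "csv"}
OUT_OF_SCOPE_MESSAGE = (
    "`target-novelty-scorer` only handles its documented workflow. "
    "Please provide the missing required inputs or switch to a more suitable skill."
)


def validate_request(target=None, db="pubmed", years=5, fmt="text"):
    """Return a list of problems; an empty list means the request can proceed."""
    problems = []
    if not target or not str(target).strip():
        problems.append("--target is required")
    if db not in ALLOWED_DBS:
        problems.append(f"--db must be one of {sorted(ALLOWED_DBS)}")
    if not isinstance(years, int) or years < 1:
        problems.append("--years must be a positive integer")
    if fmt not in ALLOWED_FORMATS:
        problems.append(f"--format must be one of {sorted(ALLOWED_FORMATS)}")
    return problems


problems = validate_request(target="", years=0)
if problems:
    print(OUT_OF_SCOPE_MESSAGE)
    for item in problems:
        print(f"- {item}")
```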
## References
- [references/audit-reference.md](references/audit-reference.md) - Supported scope, audit commands, and fallback boundaries
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:references/audit-reference.md
# Audit Reference
## Scope
- Skill: `target-novelty-scorer`
- Core purpose: Score the novelty of biological targets through literature mining.
- Use only within the documented workflow and category boundary defined in `SKILL.md`
## Supported Audit Paths
- `python -m py_compile scripts/main.py`
- `python scripts/main.py --help`
## Fallback Boundary
If required inputs are incomplete, the skill should still return:
- the missing required inputs
- the steps that can still be completed safely
- assumptions that need confirmation before execution
- the next checks before accepting the final deliverable
FILE:requirements.txt
numpy
FILE:scripts/main.py
#!/usr/bin/env python3
"""
Target Novelty Scorer
Target novelty scoring tool based on literature mining

Usage:
    python main.py --target "PD-L1"
    python main.py --target "BRCA1" --years 10 --format json
"""
import argparse
import json
import sys
import zlib
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Dict, Optional

import numpy as np


@dataclass
class NoveltyScore:
    """Novelty score data structure"""
    target: str
    novelty_score: float
    confidence: float
    breakdown: Dict[str, float]
    metadata: Dict
    interpretation: str


class PubMedSearcher:
    """PubMed literature searcher (simulated implementation)"""

    def __init__(self, api_key: Optional[str] = None):
        self.api_key = api_key
        self.base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

    def search(self, query: str, years: int = 5) -> Dict:
        """
        Search PubMed literature
        Actual implementation should call NCBI E-utilities API
        Here using simulated data for demonstration
        """
        if years < 1:
            raise ValueError("years must be at least 1")
        # Simulated search results
        current_year = datetime.now().year
        # Generate simulated data based on target name (actual implementation should call real API).
        # zlib.crc32 yields the same seed on every run; the built-in hash() of a str is
        # randomized per process, which would make the simulated counts non-reproducible.
        np.random.seed(zlib.crc32(query.encode("utf-8")))
        total_papers = int(np.random.randint(100, 50000))
        recent_papers = int(total_papers * np.random.uniform(0.1, 0.5))
        clinical_trials = int(np.random.randint(0, min(200, total_papers // 100)))
        # Generate year distribution. Flooring per_year at 1 keeps the randint
        # bounds valid when years is large relative to recent_papers.
        per_year = max(1, recent_papers // years)
        year_distribution = {
            str(year): int(np.random.randint(per_year // 2, per_year * 2))
            for year in range(current_year - years, current_year + 1)
        }
        return {
            "query": query,
            "total_count": total_papers,
            "recent_count": recent_papers,
            "clinical_trials": clinical_trials,
            "year_distribution": year_distribution,
            "search_year": current_year,
            "years_analyzed": years
        }

class NoveltyScorer:
    """Target novelty scorer"""

    def __init__(self):
        self.searcher = PubMedSearcher()
        # Scoring weight configuration
        self.weights = {
            "research_heat": 0.25,   # Research heat 25%
            "uniqueness": 0.25,      # Uniqueness 25%
            "research_depth": 0.20,  # Research depth 20%
            "collaboration": 0.15,   # Collaboration network 15%
            "trend": 0.15            # Time trend 15%
        }

    def _score_research_heat(self, data: Dict) -> float:
        """
        Score: Research heat (0-25 points)
        Based on recent literature count and citations
        """
        total = data.get("total_count", 0)
        recent = data.get("recent_count", 0)
        # Literature quantity score (0-15 points)
        if total < 500:
            quantity_score = 12 + (total / 500) * 3  # Scarce: high score
        elif total < 5000:
            quantity_score = 8 + (5000 - total) / 4500 * 4
        elif total < 20000:
            quantity_score = 4 + (20000 - total) / 15000 * 4
        else:
            quantity_score = max(0, 4 - (total - 20000) / 50000)
        # Recent activity score (0-10 points)
        recent_ratio = recent / total if total > 0 else 0
        recency_score = min(10, recent_ratio * 20)
        return min(25, quantity_score + recency_score)

    def _score_uniqueness(self, data: Dict, target: str) -> float:
        """
        Score: Uniqueness (0-25 points)
        Distinctiveness from known hot targets
        """
        total = data.get("total_count", 0)
        # Hot target list (examples)
        hot_targets = ["p53", "KRAS", "EGFR", "HER2", "PD-L1", "PD-1", "VEGF"]
        # If itself is a hot target, uniqueness is lower
        if target.upper() in [t.upper() for t in hot_targets]:
            base_score = 5
        else:
            # Uniqueness based on literature count
            if total < 1000:
                base_score = 20 + min(5, (1000 - total) / 200)
            elif total < 5000:
                base_score = 15 + (5000 - total) / 4000 * 5
            elif total < 15000:
                base_score = 8 + (15000 - total) / 10000 * 7
            else:
                base_score = max(0, 8 - (total - 15000) / 20000 * 8)
        return min(25, base_score)

    def _score_research_depth(self, data: Dict) -> float:
        """
        Score: Research depth (0-20 points)
        Progress level of preclinical/clinical research
        """
        clinical_trials = data.get("clinical_trials", 0)
        total = data.get("total_count", 0)
        # Clinical trial count score
        if clinical_trials == 0:
            clinical_score = 8  # Early target, unknown potential
        elif clinical_trials < 10:
            clinical_score = 12 + clinical_trials * 0.5
        elif clinical_trials < 50:
            clinical_score = 16 + (clinical_trials - 10) * 0.1
        else:
            clinical_score = max(10, 20 - (clinical_trials - 50) * 0.05)
        # Basic research depth (based on total literature count)
        if total < 1000:
            basic_score = 2
        elif total < 5000:
            basic_score = 3
        else:
            basic_score = 4
        return min(20, clinical_score + basic_score)

    def _score_collaboration(self, data: Dict) -> float:
        """
        Score: Collaboration network (0-15 points)
        Diversity of research institutions/teams distribution
        """
        total = data.get("total_count", 0)
        # Estimated collaboration diversity based on literature count
        if total < 100:
            diversity_score = 5
        elif total < 1000:
            diversity_score = 8 + (total - 100) / 900 * 4
        elif total < 5000:
            diversity_score = 12 + (total - 1000) / 4000 * 3
        else:
            diversity_score = 15
        return min(15, diversity_score)

    def _score_trend(self, data: Dict) -> float:
        """
        Score: Time trend (0-15 points)
        Recent research growth trend
        """
        year_dist = data.get("year_distribution", {})
        if not year_dist or len(year_dist) < 2:
            return 7.5  # Neutral score
        # Calculate growth trend
        years = sorted(year_dist.keys())
        values = [year_dist[y] for y in years]
        # Compare the mean of the earlier half with the mean of the later half
        early_avg = np.mean(values[:len(values)//2])
        recent_avg = np.mean(values[len(values)//2:])
        if early_avg > 0:
            growth_rate = (recent_avg - early_avg) / early_avg
        else:
            growth_rate = 0
        # Convert to 0-15 points
        if growth_rate > 1.0:  # Growth over 100%
            trend_score = 15
        elif growth_rate > 0.5:
            trend_score = 12 + (growth_rate - 0.5) * 6
        elif growth_rate > 0.2:
            trend_score = 9 + (growth_rate - 0.2) * 10
        elif growth_rate > 0:
            trend_score = 6 + growth_rate * 15
        elif growth_rate > -0.2:
            # Slope 15 (not 30) keeps the score continuous with the branch
            # below, which starts at 3 points for a growth rate of -0.2.
            trend_score = 6 + growth_rate * 15
        else:
            trend_score = max(0, 3 + (growth_rate + 0.2) * 15)
        return min(15, max(0, trend_score))

    def calculate_confidence(self, data: Dict) -> float:
        """Calculate confidence (0-1)"""
        total = data.get("total_count", 0)
        # More literature, higher confidence
        if total < 50:
            return 0.4
        elif total < 200:
            return 0.6
        elif total < 1000:
            return 0.75
        elif total < 5000:
            return 0.85
        else:
            return 0.90

    def generate_interpretation(self, score: float, data: Dict) -> str:
        """Generate score interpretation"""
        total = data.get("total_count", 0)
        clinical = data.get("clinical_trials", 0)
        if score >= 80:
            level = "Extremely High Novelty"
            desc = "This target has limited research but unique value, representing a highly promising innovative direction."
        elif score >= 65:
            level = "High Novelty"
            desc = "This target has some research foundation but still has significant exploration space, recommended for focused attention."
        elif score >= 50:
            level = "Medium Novelty"
            desc = "This target has moderate research heat, requiring further evaluation of its differentiation advantages."
        elif score >= 35:
            level = "Low Novelty"
            desc = "This target already has substantial research, with intense competition, requiring breakthroughs in specific niche areas."
        else:
            level = "Very Low Novelty"
            desc = "This target is a mature research direction with limited innovation space, requiring careful evaluation of input-output ratio."
        details = f"Total literature: {total}, Clinical trials: {clinical}."
        return f"[{level}] {desc} {details}"

    def score(self, target: str, years: int = 5) -> NoveltyScore:
        """
        Calculate target novelty score

        Args:
            target: Target name or gene symbol
            years: Analysis year range

        Returns:
            NoveltyScore object
        """
        # Search literature data
        search_result = self.searcher.search(target, years)
        # Calculate scores for each dimension
        breakdown = {
            "research_heat": self._score_research_heat(search_result),
            "uniqueness": self._score_uniqueness(search_result, target),
            "research_depth": self._score_research_depth(search_result),
            "collaboration": self._score_collaboration(search_result),
            "trend": self._score_trend(search_result)
        }
        # Each dimension is already expressed in points whose maximum equals its
        # weight (25 + 25 + 20 + 15 + 15 = 100), so the total is the plain sum.
        # Multiplying by the weights again would cap the score at 84.
        novelty_score = round(float(sum(breakdown.values())), 1)
        # Calculate confidence
        confidence = self.calculate_confidence(search_result)
        # Build metadata
        metadata = {
            "total_papers": search_result["total_count"],
            "recent_papers": search_result["recent_count"],
            "clinical_trials": search_result["clinical_trials"],
            "analysis_date": datetime.now().strftime("%Y-%m-%d"),
            "years_analyzed": years
        }
        # Generate interpretation
        interpretation = self.generate_interpretation(novelty_score, search_result)
        return NoveltyScore(
            target=target,
            novelty_score=novelty_score,
            confidence=round(confidence, 2),
            # Report raw points per dimension so the breakdown sums to the total
            breakdown={k: round(float(v), 1) for k, v in breakdown.items()},
            metadata=metadata,
            interpretation=interpretation
        )


def format_text_output(result: NoveltyScore) -> str:
    """Format text output"""
    lines = [
        "=" * 60,
        f"Target Novelty Score Report: {result.target}",
        "=" * 60,
        "",
        f"Overall Score: {result.novelty_score}/100",
        f"Confidence: {result.confidence * 100:.0f}%",
        "",
        "Dimension Scores:",
        f" - Research Heat: {result.breakdown['research_heat']:.1f}/25",
        f" - Uniqueness: {result.breakdown['uniqueness']:.1f}/25",
        f" - Research Depth: {result.breakdown['research_depth']:.1f}/20",
        f" - Collaboration Network: {result.breakdown['collaboration']:.1f}/15",
        f" - Time Trend: {result.breakdown['trend']:.1f}/15",
        "",
        "Statistics:",
        f" - Total Literature: {result.metadata['total_papers']:,}",
        f" - Recent Literature: {result.metadata['recent_papers']:,}",
        f" - Clinical Trials: {result.metadata['clinical_trials']}",
        f" - Analysis Date: {result.metadata['analysis_date']}",
        "",
        f"Interpretation: {result.interpretation}",
        "",
        "=" * 60
    ]
    return "\n".join(lines)


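def format_csv_output(result: NoveltyScore) -> str:
    """Format CSV output (header row followed by one data row)"""
    # Imported locally so this function is self-contained
    import csv
    import io

    headers = [
        "target", "novelty_score", "confidence",
        "research_heat", "uniqueness", "research_depth",
        "collaboration", "trend", "total_papers",
        "recent_papers", "clinical_trials", "interpretation"
    ]
    values = [
        result.target,
        result.novelty_score,
        result.confidence,
        result.breakdown['research_heat'],
        result.breakdown['uniqueness'],
        result.breakdown['research_depth'],
        result.breakdown['collaboration'],
        result.breakdown['trend'],
        result.metadata['total_papers'],
        result.metadata['recent_papers'],
        result.metadata['clinical_trials'],
        result.interpretation
    ]
    # csv.writer quotes fields containing commas or quotes, which the
    # previous manual join did not; the header row was also never emitted.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(headers)
    writer.writerow(values)
    return buffer.getvalue().rstrip("\n")

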
def main():
    """Main function"""
    parser = argparse.ArgumentParser(
        description="Target Novelty Scoring Tool - Based on Literature Mining",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python main.py --target "PD-L1"
  python main.py --target "BRCA1" --years 10 --format json
  python main.py --target "EGFR" --output report.json
        """
    )
    parser.add_argument(
        "--target", "-t",
        required=True,
        help="Target name or gene symbol (e.g.: PD-L1, BRCA1)"
    )
    parser.add_argument(
        "--db", "-d",
        choices=["pubmed", "pmc", "all"],
        default="pubmed",
        help="Data source (default: pubmed)"
    )
    parser.add_argument(
        "--years", "-y",
        type=int,
        default=5,
        help="Analysis year range (default: 5)"
    )
    parser.add_argument(
        "--output", "-o",
        help="Output file path (default: stdout)"
    )
    parser.add_argument(
        "--format", "-f",
        choices=["text", "json", "csv"],
        default="text",
        help="Output format (default: text)"
    )
    parser.add_argument(
        "--verbose", "-v",
        action="store_true",
        help="Verbose output mode"
    )
    args = parser.parse_args()
    # Execute scoring
    try:
        scorer = NoveltyScorer()
        result = scorer.score(args.target, args.years)
        # Format output
        if args.format == "json":
            output = json.dumps(asdict(result), ensure_ascii=False, indent=2)
        elif args.format == "csv":
            output = format_csv_output(result)
        else:
            output = format_text_output(result)
        # Output results
        if args.output:
            with open(args.output, "w", encoding="utf-8") as f:
                f.write(output)
            print(f"Report saved to: {args.output}")
        else:
            print(output)
        # Verbose mode
        if args.verbose and args.format == "text":
            print(f"\nRaw Data: {json.dumps(result.metadata, ensure_ascii=False)}")
        return 0
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return 1


if __name__ == "__main__":
    sys.exit(main())
Polishes response letters by transforming defensive or harsh language.
---
name: response-tone-polisher
description: Polishes response letters by transforming defensive or harsh language.
license: MIT
skill-author: AIPOCH
---
# Response Tone Polisher
Polishes response letters to peer reviewers by softening harsh or defensive language while preserving the author's position and scientific integrity.
## When to Use
- Use this skill when the task is to polish response letters by softening defensive or harsh language.
- Use this skill for academic writing tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
- **Tone Analysis**: Identifies defensive, confrontational, or overly direct language
- **Polite Transformation**: Converts harsh statements into courteous academic prose
- **Position Preservation**: Maintains the author's scientific stance while improving delivery
- **Context Awareness**: Adapts based on response type (acceptance, partial acceptance, respectful decline)
- **Academic Expression Library**: Built-in collection of polished academic phrasings
## Dependencies
See `## Prerequisites` below for related details.
- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `dataclasses`: `unspecified`. Declared in `requirements.txt`; part of the Python standard library since 3.7.
- `enum`: `unspecified`. Declared in `requirements.txt`; part of the Python standard library.
## Example Usage
```bash
cd "20260318/scientific-skills/Academic Writing/response-tone-polisher"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Overview` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```
## Overview
This skill analyzes author draft responses to reviewer comments and transforms confrontational or defensive phrasing into professional, diplomatic academic language. It helps researchers maintain positive relationships with reviewers while standing firm on scientifically justified positions.
## Usage Examples
### Basic Usage
```
Input:
Reviewer: The sample size is too small for meaningful conclusions.
Draft Response: I disagree. Our sample size is standard in this field.
Output:
We appreciate the reviewer's concern regarding sample size. While we acknowledge
that larger samples provide greater statistical power, our sample size is consistent
with established conventions in this field and meets the requirements for adequate
power analysis (as detailed in the Methods section).
```
### Defensive Language Transformation
| Original (Defensive) | Polished (Professional) |
|---------------------|------------------------|
| "I will not change this." | "We have carefully considered this suggestion and respectfully maintain our original approach because..." |
| "The reviewer is wrong." | "We respectfully offer a different interpretation..." |
| "This is unnecessary." | "We appreciate this suggestion; however, we believe the current presentation adequately addresses this point." |
| "We already explained this." | "We have expanded our explanation to enhance clarity (Page X, Lines Y-Z)." |
| "That's not our fault." | "We acknowledge this limitation and have added appropriate caveats to the Discussion." |
## Input Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `reviewer_comment` | str | Yes | The reviewer's original comment or criticism |
| `draft_response` | str | Yes | Author's initial draft response (may contain harsh/defensive language) |
| `response_type` | str | No | One of: `accept`, `partial`, `decline` (default: auto-detect) |
| `polish_level` | str | No | `light`, `moderate`, `heavy` (default: moderate) |
| `preserve_meaning` | bool | No | Ensure scientific position is preserved (default: true) |
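
The table can be mirrored by a small request object that rejects invalid values up front. This is a sketch under stated assumptions: `PolishRequest` is a hypothetical class, not something confirmed to exist in `scripts/main.py`, and `None` is used here to stand for the auto-detect default of `response_type`.

```python
from dataclasses import dataclass
from typing import Optional

RESPONSE_TYPES = {"accept", "partial", "decline"}
POLISH_LEVELS = {"light", "moderate", "heavy"}


@dataclass
class PolishRequest:
    """Hypothetical container for the parameters in the table above."""

    reviewer_comment: str
    draft_response: str
    response_type: Optional[str] = None  # None stands for auto-detect
    polish_level: str = "moderate"
    preserve_meaning: bool = True

    def __post_init__(self):
        if not self.reviewer_comment.strip():
            raise ValueError("reviewer_comment is required")
        if not self.draft_response.strip():
            raise ValueError("draft_response is required")
        if self.response_type is not None and self.response_type not in RESPONSE_TYPES:
            raise ValueError(f"response_type must be one of {sorted(RESPONSE_TYPES)}")
        if self.polish_level not in POLISH_LEVELS:
            raise ValueError(f"polish_level must be one of {sorted(POLISH_LEVELS)}")


request = PolishRequest(
    reviewer_comment="The sample size is too small for meaningful conclusions.",
    draft_response="I disagree. Our sample size is standard in this field.",
)
print(request.polish_level, request.response_type)
```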
## Output Format
```json
{
  "polished_response": "string",
  "original_tone_score": "float (0-1, higher = more defensive)",
  "improvements": [
    {
      "original_phrase": "string",
      "polished_phrase": "string",
      "issue_type": "string"
    }
  ],
  "suggestions": ["string"],
  "politeness_score": "float (0-1)"
}
```
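
Before a result is returned to the user, it can be checked against the structure above. The following is an illustrative sketch; `validate_result` is a hypothetical helper, and the sample values in `good` are invented purely for demonstration.

```python
REQUIRED_KEYS = {
    "polished_response",
    "original_tone_score",
    "improvements",
    "suggestions",
    "politeness_score",
}
IMPROVEMENT_KEYS = {"original_phrase", "polished_phrase", "issue_type"}


def validate_result(result):
    """Return a list of schema problems; empty means the result matches the format."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - result.keys())]
    for key in ("original_tone_score", "politeness_score"):
        value = result.get(key)
        if isinstance(value, (int, float)) and not 0.0 <= value <= 1.0:
            problems.append(f"{key} must be between 0 and 1")
    for index, item in enumerate(result.get("improvements", [])):
        absent = IMPROVEMENT_KEYS - item.keys()
        if absent:
            problems.append(f"improvements[{index}] lacks {sorted(absent)}")
    return problems


# Invented sample values, for demonstration only.
good = {
    "polished_response": "We respectfully offer a different interpretation.",
    "original_tone_score": 0.7,
    "improvements": [
        {
            "original_phrase": "I disagree.",
            "polished_phrase": "We respectfully offer a different interpretation.",
            "issue_type": "confrontational",
        }
    ],
    "suggestions": [],
    "politeness_score": 0.9,
}
print(validate_result(good))
```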
## Tone Patterns Detected
The skill identifies and transforms:
### 1. Direct Refusals
- "No" / "We won't" → "We respectfully decline to..."
- "We can't" → "We are unable to..."
### 2. Defensive Statements
- "But we already..." → "We have now clarified..."
- "This is not correct" → "We respectfully note that..."
### 3. Blame Shifting
- "The reviewer misunderstood" → "We apologize for the lack of clarity; we have revised..."
- "This is standard" → "This approach aligns with established conventions..."
### 4. Emotional Language
- "Unfortunately" (overused) → [removed or softened]
- "Obviously" → [removed]
- "Clearly" → [removed or context-dependent]
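
A few of the patterns above can be expressed as plain regular-expression rewrites. This is a minimal sketch that covers only a handful of the listed phrases and is not how `scripts/main.py` is known to work; `REWRITES`, `FILLERS`, and `soften` are hypothetical names, and removing "Clearly" unconditionally is a simplification of the context-dependent rule.

```python
import re

# Phrase-level rewrites taken from the pattern lists above.
REWRITES = [
    (r"\bWe won't\b", "We respectfully decline to"),
    (r"\bWe can't\b", "We are unable to"),
]
# Filler adverbs that are dropped, with an optional trailing comma.
FILLERS = re.compile(r"\b(?:Obviously|Clearly),?\s+", flags=re.IGNORECASE)


def soften(text):
    """Apply the phrase rewrites, drop filler adverbs, and fix capitalization."""
    for pattern, replacement in REWRITES:
        text = re.sub(pattern, replacement, text)
    text = FILLERS.sub("", text)
    # Restore the capital letter where a removed filler started a sentence.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda match: match.group(1) + match.group(2).upper(),
        text,
    )


print(soften("We can't rerun the experiment. Obviously, the design is fixed."))
```

Pure string rewriting cannot judge whether the scientific position survives, so the output still needs the human review described under Limitations.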
## Polite Academic Expressions
### Acknowledging Reviewers
- "We thank the reviewer for this insightful observation."
- "We appreciate the reviewer's careful attention to this detail."
- "We are grateful for this constructive feedback."
- "This is an excellent point."
### Expressing Disagreement Diplomatically
- "We respectfully offer an alternative interpretation..."
- "Upon careful reconsideration, we believe..."
- "While we appreciate this perspective, we note that..."
- "We respectfully maintain our position that..."
### Explaining Limitations
- "We acknowledge this limitation and have addressed it by..."
- "This constraint reflects the trade-off between..."
- "We have added appropriate caveats regarding this limitation."
### Describing Changes
- "We have revised the manuscript to clarify..."
- "We have expanded the relevant section to include..."
- "We have incorporated this suggestion by..."
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Command Line Usage
```text
# Interactive mode
python scripts/main.py --interactive
# File-based
python scripts/main.py \
  --reviewer-comment "comment.txt" \
  --draft-response "draft.txt" \
  --output "polished.txt"
# Direct input
python scripts/main.py \
  --reviewer "The data is insufficient." \
  --draft "You are wrong. We have enough data." \
  --polish-level heavy
```
## Python API
```python
from scripts.main import TonePolisher

polisher = TonePolisher()
result = polisher.polish(
    reviewer_comment="The methodology is flawed.",
    draft_response="No it's not. We did it right.",
    response_type="decline",
    polish_level="moderate"
)
print(result["polished_response"])
```
## References
- `references/polite_expressions.json` - Curated library of academic polite expressions
- `references/tone_patterns.md` - Common defensive patterns and their transformations
- `references/examples/` - Before/after polishing examples
## Limitations
- Does not verify scientific accuracy of responses
- Requires human review for complex nuanced disagreements
- May over-soften; authors should verify position is still clear
- Best for English-language responses
## Quality Checklist
After polishing, verify:
- [ ] Original scientific position is preserved
- [ ] No confrontational language remains
- [ ] Professional tone throughout
- [ ] Clear acknowledgment of reviewer's effort
- [ ] Specific changes are still referenced
- [ ] Response directly addresses the comment
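
The "no confrontational language remains" item lends itself to an automated first pass. The sketch below assumes the phrase list is the `confrontational` pattern set from `references/polite_expressions.json`, copied inline here; `remaining_confrontational` is a hypothetical helper, and a clean result does not replace the manual checks above.

```python
import re

# Copied from the "confrontational" patterns in references/polite_expressions.json.
CONFRONTATIONAL = [
    "I disagree",
    "We disagree",
    "This is wrong",
    "You are incorrect",
    "The reviewer is mistaken",
]


def remaining_confrontational(text):
    """Return the confrontational phrases still present in text (case-insensitive)."""
    found = []
    for phrase in CONFRONTATIONAL:
        if re.search(r"\b" + re.escape(phrase) + r"\b", text, flags=re.IGNORECASE):
            found.append(phrase)
    return found


draft = "I disagree. Our sample size is standard in this field."
polished = "We appreciate the reviewer's concern regarding sample size."
print(remaining_confrontational(draft))
print(remaining_confrontational(polished))
```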
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
## Prerequisites
```text
# Python dependencies
pip install -r requirements.txt
```
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
  - Performance optimization
  - Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `response-tone-polisher` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `response-tone-polisher` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
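The stop condition above can be sketched as a small pre-check. This is an illustration only: it assumes the two required inputs are the reviewer comment and the draft response (the two inputs `scripts/main.py` asks for), and the names `REQUIRED_FIELDS`, `missing_inputs`, and `validate_request` are hypothetical.

```python
from typing import Dict, List, Optional

# Assumption: these two fields are the critical inputs for the workflow
REQUIRED_FIELDS = ("reviewer_comment", "draft_response")

OUT_OF_SCOPE_MESSAGE = (
    "`response-tone-polisher` only handles its documented workflow. "
    "Please provide the missing required inputs or switch to a more suitable skill."
)


def missing_inputs(request: Dict[str, Optional[str]]) -> List[str]:
    """List required fields that are absent, None, or whitespace-only."""
    return [name for name in REQUIRED_FIELDS if not (request.get(name) or "").strip()]


def validate_request(request: Dict[str, Optional[str]]) -> Optional[str]:
    """Return a refusal message when inputs are missing, otherwise None."""
    missing = missing_inputs(request)
    if missing:
        return f"Missing required inputs: {', '.join(missing)}. {OUT_OF_SCOPE_MESSAGE}"
    return None
```

Naming the missing fields in the message keeps the follow-up request limited to the minimum additional information.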
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
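One way to keep the seven-part structure consistent is to render it from a mapping. The sketch below is hypothetical: `render_response` is not part of the skill, and the "Not provided" filler for empty sections is an assumption rather than documented behavior.

```python
from typing import Dict

# Section titles copied from the fixed structure above
SECTIONS = (
    "Objective",
    "Inputs Received",
    "Assumptions",
    "Workflow",
    "Deliverable",
    "Risks and Limits",
    "Next Checks",
)


def render_response(content: Dict[str, str]) -> str:
    """Render all seven sections in order, marking empty ones explicitly."""
    lines = []
    for number, title in enumerate(SECTIONS, start=1):
        body = content.get(title, "").strip() or "Not provided"  # assumed filler
        lines.append(f"{number}. {title}: {body}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_response({"Objective": "Polish a decline response"}))
```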
FILE:references/polite_expressions.json
{
"defensive_patterns": {
"direct_refusal": {
"patterns": [
"I will not change this",
"We will not modify",
"We cannot do this",
"No, we won't",
"This is not possible"
],
"replacements": [
"We have carefully considered this suggestion and respectfully maintain our original approach",
"Upon thorough evaluation, we believe our current approach is appropriate",
"We respectfully decline to make this change because"
]
},
"confrontational": {
"patterns": [
"I disagree",
"We disagree",
"This is wrong",
"You are incorrect",
"The reviewer is mistaken"
],
"replacements": [
"We respectfully offer a different interpretation",
"We note that an alternative view may be warranted",
"Upon careful reconsideration, we maintain that"
]
},
"defensive": {
"patterns": [
"But we already explained",
"We already stated this",
"This was mentioned before",
"We previously noted"
],
"replacements": [
"We have now expanded our explanation to enhance clarity",
"We have revised this section to provide additional detail",
"To improve clarity, we have added"
]
},
"blame_shifting": {
"patterns": [
"The reviewer misunderstood",
"This was misinterpreted",
"You didn't understand",
"This is not our fault"
],
"replacements": [
"We apologize for the lack of clarity; we have revised",
"We recognize that our explanation was insufficient; we have clarified",
"We acknowledge this limitation and have addressed it"
]
},
"dismissive": {
"patterns": [
"This is unnecessary",
"This is not needed",
"Obviously",
"Clearly",
"It is evident that"
],
"replacements": [
"While we appreciate this suggestion, we believe the current presentation is appropriate",
"We have considered this and maintain that",
"We note that"
]
}
},
"polite_acknowledgments": {
"standard": [
"We thank the reviewer for this insightful observation.",
"We appreciate the reviewer's careful attention to this detail.",
"We are grateful for this constructive feedback.",
"We value this thoughtful comment.",
"Thank you for raising this important point."
],
"major_comment": [
"We are grateful for this significant observation.",
"This is an excellent and important point.",
"We deeply appreciate this critical feedback."
],
"minor_comment": [
"Thank you for this helpful suggestion.",
"We appreciate this refinement.",
"This is a valuable correction."
]
},
"respectful_disagreement": {
"opening": [
"We respectfully offer an alternative interpretation.",
"Upon careful reconsideration, we believe that",
"While we appreciate this perspective, we respectfully note that",
"We respectfully maintain our position that",
"We believe an alternative approach may be warranted because"
],
"rationale": [
"our data strongly support the current interpretation",
"the established methodology in this field supports our approach",
"the specific constraints of our study necessitate this approach",
"additional analysis (described below) confirms our original conclusion"
],
"closing": [
"We hope this explanation adequately addresses the reviewer's concern.",
"We have added further discussion of this point to the manuscript.",
"We believe this approach best serves the scientific integrity of the study."
]
},
"accepting_changes": {
"simple": [
"We have revised the manuscript accordingly.",
"This has been addressed in the revised version.",
"We have incorporated this suggestion."
],
"detailed": [
"We have expanded the {section} section to include {detail} (Page {page}, Lines {lines}).",
"We have added {content} as suggested (Page {page}).",
"The {section} has been revised to clarify {point} (Lines {lines})."
]
},
"explaining_limitations": {
"sample_size": [
"We acknowledge that our sample size is limited; however, it is consistent with similar studies in this field.",
"While we recognize the constraint of our sample size, our power analysis indicates adequate statistical power.",
"We have added a discussion of this limitation to the manuscript."
],
"methodological": [
"This methodological choice reflects the trade-off between {factor1} and {factor2}.",
"We selected this approach because {rationale}, while acknowledging that {alternative} is also valid.",
"We have added appropriate caveats regarding this methodological choice."
],
"general": [
"We acknowledge this limitation and have addressed it by {action}.",
"We recognize this constraint and have noted it in the Discussion section.",
"This limitation is now explicitly discussed in the revised manuscript."
]
},
"clarification_phrases": [
"To enhance clarity, we have",
"We have revised this section to more clearly state that",
"The text now explicitly states",
"We have clarified this point by",
"For improved clarity, we have added"
],
"location_indicators": [
"(Page {page}, Lines {start}-{end})",
"(Page {page})",
"(Lines {start}-{end})",
"(Figure {number})",
"(Table {number}, Page {page})"
]
}
FILE:references/tone_patterns.md
# Tone Transformation Patterns
This document catalogs common defensive or harsh language patterns in peer review responses and their polished academic alternatives.
## Pattern Categories
### 1. Direct Refusals
When you must decline a reviewer's suggestion, avoid blunt refusals.
| ❌ Defensive | ✅ Polished |
|-------------|------------|
| "I will not change this." | "We have carefully considered this suggestion and respectfully maintain our original approach because..." |
| "We can't do this." | "We are unable to make this change due to [reason]. However, we have..." |
| "This is not possible." | "This modification is not feasible within the scope of our study because..." |
| "No, we won't do that." | "We respectfully decline to make this change because [rationale]. Instead, we have..." |
| "I refuse to change this." | "Upon careful evaluation, we believe the current formulation best serves the manuscript because..." |
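The rows above can be applied mechanically with regular-expression substitution. The sketch below is an illustration under assumptions: the two regexes are written for this example and are not the `TONE_PATTERNS` used by `scripts/main.py`, `soften_refusals` is a hypothetical name, and the trailing "because..." from the table is left out because the author has to supply the reason.

```python
import re

# (pattern, replacement) pairs based on two rows of the table above
DIRECT_REFUSALS = [
    (
        re.compile(r"\bI will not change this\b", re.IGNORECASE),
        "We have carefully considered this suggestion and respectfully "
        "maintain our original approach",
    ),
    (
        re.compile(r"\bNo, we won't do that\b", re.IGNORECASE),
        "We respectfully decline to make this change",
    ),
]


def soften_refusals(text: str) -> str:
    """Replace each blunt refusal with its polished counterpart."""
    for pattern, replacement in DIRECT_REFUSALS:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(soften_refusals("No, we won't do that."))
```

A substitution like this only swaps the opening phrase; the rationale that follows still needs to be written by hand.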
### 2. Expressing Disagreement
Disagreement is acceptable; rudeness is not.
| ❌ Defensive | ✅ Polished |
|-------------|------------|
| "You're wrong." | "We respectfully offer a different interpretation..." |
| "The reviewer is incorrect." | "We note that an alternative view may be warranted..." |
| "This is not true." | "We respectfully suggest that the evidence may support an alternative conclusion..." |
| "I disagree completely." | "Upon careful reconsideration, we maintain that..." |
| "That's not right." | "We believe the data are best interpreted as..." |
| "The reviewer misunderstood." | "We apologize for the lack of clarity; we have revised the text to clarify that..." |
### 3. Defensive Responses
Don't defend by pointing out what you already did.
| ❌ Defensive | ✅ Polished |
|-------------|------------|
| "But we already explained this." | "We have now expanded our explanation in Section X to provide additional detail..." |
| "We stated this clearly in the original." | "We have revised this section to enhance clarity regarding..." |
| "This was mentioned on page X." | "To make this more prominent, we have added additional emphasis to the discussion on page X..." |
| "We already discussed this limitation." | "We have expanded our discussion of this limitation to better address the reviewer's concern..." |
| "This point was already made." | "We have strengthened the presentation of this point by..." |
### 4. Dismissive Language
Never dismiss a reviewer's comment, even if it seems trivial.
| ❌ Defensive | ✅ Polished |
|-------------|------------|
| "This is unnecessary." | "While we appreciate this suggestion, we believe the current presentation adequately addresses this point because..." |
| "This doesn't matter." | "We have considered this point and believe the current approach is appropriate because..." |
| "This is obvious." | "We have ensured this point is clearly stated in the revised manuscript..." |
| "This is standard practice." | "This approach aligns with established conventions in the field..." |
| "Everyone knows this." | "We have added a brief explanation of this concept for clarity..." |
### 5. Blame Shifting
Take responsibility for any misunderstanding.
| ❌ Defensive | ✅ Polished |
|-------------|------------|
| "The reviewer didn't understand." | "We recognize that our explanation was insufficient; we have revised the text to clarify..." |
| "This was clear in the original." | "We have enhanced the clarity of this section by..." |
| "The figure caption was clear." | "We have revised the figure caption to provide additional clarity..." |
| "This is not our fault." | "We acknowledge this limitation and have added appropriate caveats to the Discussion..." |
| "The reviewer missed this." | "We have made this point more prominent in the revised manuscript..." |
### 6. Emotional Language
Keep emotions out of academic writing.
| ❌ Defensive | ✅ Polished |
|-------------|------------|
| "Unfortunately, we cannot..." | "We are unable to..." |
| "We are disappointed that..." | [Remove entirely] |
| "This is frustrating because..." | "This presents a challenge because..." |
| "We strongly believe..." | "We maintain that..." |
| "We feel that..." | "We believe that..." |
### 7. Hedges and Qualifiers
Academic writing uses measured, precise language.
| ❌ Too Strong | ✅ Measured |
|--------------|------------|
| "This proves that..." | "These results suggest that..." |
| "This clearly shows..." | "These findings indicate that..." |
| "Obviously..." | [Remove or provide evidence] |
| "Undoubtedly..." | "The evidence suggests..." |
| "It is certain that..." | "It is likely that..." |
## Response Templates by Scenario
### Accepting a Suggestion
**Structure:**
1. Acknowledge the value of the suggestion
2. Describe the specific change made
3. Provide location reference
**Template:**
```
We thank the reviewer for this valuable suggestion. We have revised the
[section] to [describe change]. Specifically, we have [specific action]
(Page X, Lines Y-Z).
```
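The bracketed slots in this template can be filled programmatically. As a hedged sketch, the code below combines the opening acknowledgment from the template with a curly-brace sentence of the kind stored under `accepting_changes.detailed` in `references/polite_expressions.json`; the function name `render_acceptance` and the sample values are hypothetical.

```python
ACCEPT_TEMPLATE = (
    "We thank the reviewer for this valuable suggestion. "
    "We have expanded the {section} section to include {detail} "
    "(Page {page}, Lines {lines})."
)


def render_acceptance(section: str, detail: str, page: int, lines: str) -> str:
    """Fill the acceptance template with the specific change and its location."""
    return ACCEPT_TEMPLATE.format(section=section, detail=detail, page=page, lines=lines)


if __name__ == "__main__":
    # Sample values are invented for illustration
    print(render_acceptance("Methods", "the power analysis", 7, "112-118"))
```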
### Partial Acceptance
**Structure:**
1. Acknowledge the suggestion
2. Explain what was implemented
3. Explain what was not and why
4. Provide rationale
**Template:**
```
We appreciate the reviewer's suggestion regarding [topic]. We have adopted
part of this recommendation by [action taken]. However, we have maintained
our original approach to [aspect] because [brief rationale]. We have added
a discussion of this rationale to the manuscript (Page X).
```
### Respectful Decline
**Structure:**
1. Thank the reviewer
2. State your position respectfully
3. Provide clear rationale
4. Offer a compromise if possible
**Template:**
```
We thank the reviewer for this suggestion. Upon careful consideration, we
respectfully maintain our original [approach/formulation/interpretation]
because [clear rationale supported by evidence]. We hope this explanation
adequately addresses the reviewer's concern. To enhance clarity, we have
added additional discussion of this point (Page X, Lines Y-Z).
```
## Key Phrases Library
### Opening Acknowledgments
- "We thank the reviewer for this insightful observation."
- "We appreciate the reviewer's careful attention to this detail."
- "We are grateful for this constructive feedback."
- "This is an excellent point."
- "We value this thoughtful comment."
### Expressing Alternative Views
- "We respectfully offer an alternative interpretation..."
- "Upon careful reconsideration, we believe..."
- "While we appreciate this perspective, we respectfully note that..."
- "We respectfully maintain our position that..."
- "An alternative view might be that..."
### Describing Limitations
- "We acknowledge this limitation and have addressed it by..."
- "This constraint reflects the trade-off between..."
- "We have added appropriate caveats regarding this limitation..."
- "We recognize this constraint and have noted it in the Discussion..."
### Describing Changes
- "We have revised the manuscript to clarify..."
- "We have expanded the relevant section to include..."
- "We have incorporated this suggestion by..."
- "The text has been updated to reflect..."
- "We have added the following clarification..."
### Closing Statements
- "We hope this explanation adequately addresses the reviewer's concern."
- "We believe these revisions have substantially improved the manuscript."
- "We are grateful for the opportunity to strengthen our work."
## Common Pitfalls
### 1. Over-Apologizing
Avoid excessive apologies. One "thank you" is sufficient.
❌ "We sincerely apologize for this error and are deeply sorry..."
✅ "We thank the reviewer for identifying this error. We have corrected..."
### 2. Over-Explaining
Keep rationales concise. Reviewers don't need your entire thought process.
❌ "We spent many weeks considering this and consulted with several colleagues and reviewed the literature extensively and..."
✅ "Based on [specific reference/standard practice], we maintain that..."
### 3. Being Too Vague
Be specific about what you changed.
❌ "We have addressed this concern."
✅ "We have added a paragraph discussing [specific topic] (Page X, Lines Y-Z)."
### 4. Promising Too Much
Don't promise changes you won't make.
❌ "We will conduct additional experiments..." (if you won't)
✅ "We respectfully note that additional experiments are beyond the scope..."
## Quick Reference: Severity Levels
### Light Polish (Defensiveness 0.3-0.5)
- Remove obvious emotional language
- Add basic acknowledgments
- Fix direct refusals
### Moderate Polish (Defensiveness 0.5-0.7)
- Transform confrontational language
- Add rationale for disagreements
- Improve overall flow
### Heavy Polish (Defensiveness 0.7+)
- Complete restructuring of response
- Add comprehensive acknowledgments
- Multiple revision passes
- Suggest alternative compromises
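The severity bands can be tied to a numeric score with a short helper. In the sketch below, `defensiveness_score` mirrors the formula used by `analyze_tone` in `scripts/main.py`, while `polish_level_for` is hypothetical: the script takes the level from `--polish-level` rather than deriving it. Two assumptions are made where the bands are silent: a score on a shared boundary (0.5 or 0.7) goes to the higher level, and a score below 0.3 needs no polishing.

```python
from typing import Optional


def defensiveness_score(total_severity: float, word_count: int) -> float:
    """Summed pattern severity, normalized by length and capped at 1.0."""
    if word_count <= 0:
        return 0.0
    return min(1.0, total_severity / (word_count / 20 + 1))


def polish_level_for(score: float) -> Optional[str]:
    """Map a defensiveness score to a polish level (boundary handling assumed)."""
    if score >= 0.7:
        return "heavy"
    if score >= 0.5:
        return "moderate"
    if score >= 0.3:
        return "light"
    return None  # assumption: below 0.3 no polishing is needed


if __name__ == "__main__":
    score = defensiveness_score(total_severity=0.9, word_count=20)
    print(round(score, 2), polish_level_for(score))
```

Because the divisor grows with word count, a single harsh phrase weighs more in a short reply than in a long one.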
FILE:requirements.txt
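# No third-party packages are required: scripts/main.py uses only the
# Python standard library (dataclasses and enum ship with Python 3.7+).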
FILE:scripts/main.py
#!/usr/bin/env python3
"""
Response Tone Polisher

Polishes response letters to peer reviewers by transforming defensive or harsh
language into professional, courteous academic prose.

Usage:
    python main.py --interactive
    python main.py --reviewer-comment "comment.txt" --draft-response "draft.txt"
    python main.py --reviewer "The data is insufficient" --draft "We disagree"

Author: AI Assistant
Version: 1.0.0
"""

import argparse
import json
import re
import sys
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Dict, Tuple
import os


class ResponseType(Enum):
    """Types of response to reviewer comments."""
    ACCEPT = "accept"
    PARTIAL = "partial"
    DECLINE = "decline"


class PolishLevel(Enum):
    """Levels of tone polishing."""
    LIGHT = "light"
    MODERATE = "moderate"
    HEAVY = "heavy"


@dataclass
class ToneIssue:
    """Represents a tone issue found in text."""
    original_phrase: str
    issue_type: str
    severity: float  # 0-1, higher = more problematic
    position: Tuple[int, int]  # start, end positions


@dataclass
class PolishResult:
    """Result of tone polishing."""
    polished_response: str
    original_tone_score: float
    politeness_score: float
    improvements: List[Dict]
    suggestions: List[str]

class TonePolisher:
    """
    Main class for polishing response letter tone.

    Transforms defensive or harsh language into professional,
    courteous academic prose while preserving scientific positions.
    """

    # Defensive/harsh patterns and their polished alternatives
    TONE_PATTERNS = {
        # Direct refusals
        r"\b[Ii] will not (?:change|modify|correct|revise|update)\b": {
            "replacement": "We have carefully considered this suggestion and respectfully maintain our original {noun}",
            "issue_type": "direct_refusal",
            "severity": 0.9
        },
        r"\b[Ww]e will not (?:change|modify|correct|revise|update)\b": {
            "replacement": "We have carefully considered this suggestion and respectfully maintain our original {noun}",
            "issue_type": "direct_refusal",
            "severity": 0.9
        },
        r"\b[Nn]o,? (?:we )?(?:will not|won't|cannot|can't)\b": {
            "replacement": "We respectfully decline to {verb} because",
            "issue_type": "direct_refusal",
            "severity": 0.85
        },
        r"\b[Ii] (?:disagree|don't agree)\b": {
            "replacement": "We respectfully offer a different interpretation",
            "issue_type": "confrontational",
            "severity": 0.7
        },
        r"\b[Ww]e (?:disagree|don't agree)\b": {
            "replacement": "We respectfully offer a different interpretation",
            "issue_type": "confrontational",
            "severity": 0.7
        },
        # Defensive statements
        r"\b[Tt]he reviewer is (?:wrong|incorrect|mistaken)\b": {
            "replacement": "We respectfully note that",
            "issue_type": "defensive",
            "severity": 0.95
        },
        r"\b[Tt]his is (?:wrong|incorrect|not correct)\b": {
            "replacement": "We respectfully suggest an alternative view",
            "issue_type": "defensive",
            "severity": 0.8
        },
        r"\b[Ww]e (?:already|previously) (?:stated|mentioned|explained|noted)\b": {
            "replacement": "We have now expanded our explanation to enhance clarity",
            "issue_type": "defensive",
            "severity": 0.75
        },
        r"\b[Bb]ut we (?:already|did)\b": {
            "replacement": "We have now clarified",
            "issue_type": "defensive",
            "severity": 0.7
        },
        # Blame shifting
        r"\b[Tt]he reviewer (?:misunderstood|misinterpreted|did not understand)\b": {
            "replacement": "We apologize for the lack of clarity; we have revised",
            "issue_type": "blame_shifting",
            "severity": 0.9
        },
        r"\b[Tt]his was not our (?:fault|responsibility|error)\b": {
            "replacement": "We acknowledge this limitation and have added appropriate caveats",
            "issue_type": "blame_shifting",
            "severity": 0.85
        },
        # Dismissive language
        r"\b[Tt]his is (?:unnecessary|not needed|redundant)\b": {
            "replacement": "While we appreciate this suggestion, we believe the current presentation adequately addresses this point",
            "issue_type": "dismissive",
            "severity": 0.75
        },
        r"\b[Tt]his is (?:obvious|clear|evident)\b": {
            "replacement": "This point is addressed",
            "issue_type": "dismissive",
            "severity": 0.6
        },
        r"\b[Ii]t is (?:obvious|clear|evident) that\b": {
            "replacement": "We note that",
            "issue_type": "dismissive",
            "severity": 0.65
        },
        # Overly casual/direct
        r"\b[Ii]t's not our (?:fault|problem)\b": {
            "replacement": "This reflects an inherent limitation that we have now addressed",
            "issue_type": "unprofessional",
            "severity": 0.9
        },
        r"\b[Nn]ot our (?:fault|responsibility)\b": {
            "replacement": "We acknowledge this constraint",
            "issue_type": "unprofessional",
            "severity": 0.85
        },
        # Emotional/overly apologetic
        r"\b[Uu]nfortunately,? we (?:cannot|can't|are unable to)\b": {
            "replacement": "We are unable to",
            "issue_type": "emotional",
            "severity": 0.4
        },
        r"\b[Ww]e (?:cannot|can't) (?:possibly|really)\b": {
            "replacement": "We are unable to",
            "issue_type": "emotional",
            "severity": 0.5
        },
    }

    # Polite academic expression library
    POLITE_EXPRESSIONS = {
        "acknowledgments": [
            "We thank the reviewer for this insightful observation.",
            "We appreciate the reviewer's careful attention to this detail.",
            "We are grateful for this constructive feedback.",
            "This is an excellent point.",
            "We value this thoughtful comment.",
        ],
        "partial_agreement": [
            "We thank the reviewer for this suggestion. While we appreciate this perspective, we note that",
            "We appreciate this recommendation. Upon careful consideration, we believe",
            "We have carefully evaluated this suggestion. However,",
        ],
        "respectful_disagreement": [
            "We respectfully offer an alternative interpretation",
            "Upon careful reconsideration, we maintain that",
            "While we appreciate this perspective, we respectfully note that",
            "We respectfully maintain our position that",
            "We believe an alternative approach may be warranted",
        ],
        "limitations": [
            "We acknowledge this limitation and have addressed it by",
            "This constraint reflects the trade-off between",
            "We have added appropriate caveats regarding this limitation",
            "We recognize this constraint and have noted it in",
        ],
        "changes_made": [
            "We have revised the manuscript to clarify",
            "We have expanded the relevant section to include",
            "We have incorporated this suggestion by",
            "We have updated the text to reflect",
            "We have added the following clarification",
        ],
        "clarification": [
            "To enhance clarity, we have",
            "We have revised this section to more clearly state",
            "The text now explicitly states",
            "We have clarified this point by",
        ]
    }

    def __init__(self, polish_level: PolishLevel = PolishLevel.MODERATE):
        self.polish_level = polish_level
        self.context_noun = "approach"  # Default noun for replacements

    def analyze_tone(self, text: str) -> Tuple[float, List[ToneIssue]]:
        """
        Analyze the tone of a draft response.

        Returns:
            Tuple of (defensiveness_score, list_of_issues)
        """
        issues = []
        total_severity = 0

        for pattern, info in self.TONE_PATTERNS.items():
            matches = list(re.finditer(pattern, text, re.IGNORECASE))
            for match in matches:
                issue = ToneIssue(
                    original_phrase=match.group(),
                    issue_type=info["issue_type"],
                    severity=info["severity"],
                    position=(match.start(), match.end())
                )
                issues.append(issue)
                total_severity += info["severity"]

        # Calculate overall defensiveness score (0-1)
        word_count = len(text.split())
        if word_count > 0:
            defensiveness_score = min(1.0, total_severity / (word_count / 20 + 1))
        else:
            defensiveness_score = 0.0

        return defensiveness_score, issues

    def polish(
        self,
        reviewer_comment: str,
        draft_response: str,
        response_type: Optional[ResponseType] = None,
        polish_level: Optional[PolishLevel] = None
    ) -> PolishResult:
        """
        Polish a draft response to make it more professional and courteous.

        Args:
            reviewer_comment: The original reviewer comment
            draft_response: The author's draft response
            response_type: Type of response (accept/partial/decline)
            polish_level: Level of polishing to apply

        Returns:
            PolishResult containing polished response and metadata
        """
        if polish_level:
            self.polish_level = polish_level

        # Auto-detect response type if not provided
        if not response_type:
            response_type = self._detect_response_type(draft_response)

        # Analyze original tone
        original_tone_score, issues = self.analyze_tone(draft_response)

        # Generate polished response
        polished = self._generate_polished_response(
            reviewer_comment, draft_response, response_type, issues
        )

        # Calculate politeness score
        politeness_score = self._calculate_politeness(polished)

        # Generate improvements list
        improvements = self._generate_improvements_list(issues, polished)

        # Generate additional suggestions
        suggestions = self._generate_suggestions(reviewer_comment, polished)

        return PolishResult(
            polished_response=polished,
            original_tone_score=original_tone_score,
            politeness_score=politeness_score,
            improvements=improvements,
            suggestions=suggestions
        )

    def _detect_response_type(self, draft_response: str) -> ResponseType:
        """Auto-detect the type of response based on content."""
        decline_indicators = [
            r"\bnot change\b", r"\bnot modify\b", r"\bnot revise\b",
            r"\bdisagree\b", r"\bwon't\b", r"\bcannot\b",
            r"\bmaintain\b", r"\boriginal\b"
        ]
        accept_indicators = [
            r"\bhave revised\b", r"\bhave changed\b", r"\bhave updated\b",
            r"\bthank you\b", r"\bagree\b", r"\bmodified\b"
        ]

        decline_count = sum(1 for p in decline_indicators if re.search(p, draft_response, re.I))
        accept_count = sum(1 for p in accept_indicators if re.search(p, draft_response, re.I))

        if decline_count > accept_count:
            return ResponseType.DECLINE
        elif accept_count > 0:
            return ResponseType.ACCEPT
        else:
            return ResponseType.PARTIAL

    def _generate_polished_response(
        self,
        reviewer_comment: str,
        draft_response: str,
        response_type: ResponseType,
        issues: List[ToneIssue]
    ) -> str:
        """Generate the polished response text."""
        # Start with acknowledgment
        result_parts = []

        # Add appropriate opening based on response type
        if response_type == ResponseType.ACCEPT:
            opener = self._select_expression("acknowledgments")
            result_parts.append(opener)
            result_parts.append(self._transform_to_acceptance(draft_response))
        elif response_type == ResponseType.PARTIAL:
            opener = self._select_expression("partial_agreement")
            result_parts.append(opener)
            result_parts.append(self._transform_to_partial(draft_response))
        else:  # DECLINE
            opener = self._select_expression("acknowledgments")
            result_parts.append(opener)
            result_parts.append(self._transform_to_decline(draft_response))

        # Apply pattern-based replacements
        polished = " ".join(result_parts)

        # Apply specific pattern replacements
        for pattern, info in self.TONE_PATTERNS.items():
            replacement = info["replacement"]
            if "{noun}" in replacement:
                replacement = replacement.format(noun=self.context_noun)
            if "{verb}" in replacement:
                # Extract verb from context - simplified
                replacement = replacement.format(verb="make this change")
            polished = re.sub(pattern, replacement, polished, flags=re.IGNORECASE)

        # Clean up and format
        polished = self._clean_text(polished)
        return polished

    def _select_expression(self, category: str) -> str:
        """Select a polite expression from the library."""
        import random
        expressions = self.POLITE_EXPRESSIONS.get(category, ["Thank you for this comment."])
        return random.choice(expressions)

    def _transform_to_acceptance(self, draft: str) -> str:
        """Transform draft into acceptance language."""
        # Remove defensive prefixes
        cleaned = re.sub(r"^\s*(?:[Bb]ut|[Hh]owever)\s*,?\s*", "", draft)
        # Add change language if not present
        if not re.search(r"\b(?:have revised|have changed|have updated|have modified)\b", cleaned, re.I):
            return f"We have revised the manuscript as suggested. {cleaned}"
        return cleaned

    def _transform_to_partial(self, draft: str) -> str:
        """Transform draft into partial acceptance language."""
        # Start with partial acceptance framing
        return draft

    def _transform_to_decline(self, draft: str) -> str:
        """Transform draft into respectful decline language."""
        # Replace harsh refusals with polite alternatives
        return draft

    def _clean_text(self, text: str) -> str:
        """Clean up and format the text."""
        # Fix spacing
        text = re.sub(r"\s+", " ", text)
        # Fix punctuation spacing
        text = re.sub(r"\s+([.,;:!?])", r"\1", text)
        # Capitalize first letter
        text = text.strip()
        if text:
            text = text[0].upper() + text[1:]
        # Ensure period at end
        if text and not text.endswith((".", "!", "?")):
            text += "."
        return text

    def _calculate_politeness(self, text: str) -> float:
        """Calculate a politeness score for the text."""
        polite_indicators = [
            r"\bthank\b", r"\bappreciate\b", r"\bgrateful\b",
            r"\brespectfully\b", r"\bcarefully considered\b",
            r"\bvaluable\b", r"\binsightful\b", r"\bconstructive\b"
        ]
        rude_indicators = [
            r"\bwrong\b", r"\bincorrect\b", r"\bdisagree\b(?!\s+with)",
            r"\bnot\s+(?:correct|true|accurate)\b", r"\byou\s+(?:are|have)\s+(?:wrong|mistaken)"
        ]

        polite_count = sum(1 for p in polite_indicators if re.search(p, text, re.I))
        rude_count = sum(1 for p in rude_indicators if re.search(p, text, re.I))

        score = 0.5 + (polite_count * 0.1) - (rude_count * 0.2)
        return max(0.0, min(1.0, score))

    def _generate_improvements_list(
        self,
        issues: List[ToneIssue],
        polished: str
    ) -> List[Dict]:
        """Generate a list of improvements made."""
        improvements = []
        for issue in issues:
            # Find what this issue was transformed to (simplified)
            improvements.append({
                "original_phrase": issue.original_phrase,
                "polished_phrase": f"[Transformed {issue.issue_type}]",
                "issue_type": issue.issue_type,
                "severity": issue.severity
            })
        return improvements

    def _generate_suggestions(self, reviewer_comment: str, polished: str) -> List[str]:
        """Generate additional suggestions for improvement."""
        suggestions = []

        # Check for missing location references
        if not re.search(r"\b(?:[Pp]age|[Ll]ine|[Ff]igure|[Tt]able)\b", polished):
            suggestions.append("Consider adding page/line references to help reviewers locate changes.")

        # Check for rationale in decline situations
        if "respectfully" in polished.lower() and "because" not in polished.lower():
            suggestions.append("When declining suggestions, provide a brief rationale.")

        # Check length
        word_count = len(polished.split())
        if word_count < 20:
            suggestions.append("The response is quite brief; consider adding more detail about the changes made.")

        return suggestions


def interactive_mode():
    """Run in interactive mode to collect input from user."""
    print("=" * 70)
    print("Response Tone Polisher - Interactive Mode")
    print("=" * 70)
    print()
    print("This tool helps polish your responses to peer reviewers.")
    print("It transforms defensive or harsh language into professional, courteous prose.")
    print()

    # Get polish level
    print("Select polish level:")
    print(" 1. Light (minimal changes)")
    print(" 2. Moderate (recommended)")
    print(" 3. Heavy (aggressive polishing)")
    level_choice = input("Choice (1-3): ").strip() or "2"
    level_map = {
        "1": PolishLevel.LIGHT,
        "2": PolishLevel.MODERATE,
        "3": PolishLevel.HEAVY
    }
    polish_level = level_map.get(level_choice, PolishLevel.MODERATE)

    # Get reviewer comment
    print("\n" + "-" * 70)
    print("Paste the REVIEWER COMMENT below (type 'END' on new line when done):")
    print("-" * 70)
    lines = []
    while True:
        line = input()
        if line.strip() == "END":
            break
        lines.append(line)
    reviewer_comment = "\n".join(lines)

    # Get draft response
    print("\n" + "-" * 70)
    print("Paste your DRAFT RESPONSE below (type 'END' on new line when done):")
    print("-" * 70)
    lines = []
    while True:
        line = input()
        if line.strip() == "END":
            break
        lines.append(line)
    draft_response = "\n".join(lines)

    # Process
    print("\n" + "=" * 70)
    print("Polishing your response...")
    print("=" * 70)
    polisher = TonePolisher(polish_level=polish_level)
    result = polisher.polish(reviewer_comment, draft_response)

    # Display results
    print("\n📊 TONE ANALYSIS:")
    print(f" Original defensiveness score: {result.original_tone_score:.2f}/1.0")
    print(f" Politeness score: {result.politeness_score:.2f}/1.0")

    if result.improvements:
        print(f"\n🔧 IMPROVEMENTS MADE ({len(result.improvements)}):")
        for i, imp in enumerate(result.improvements[:5], 1):
            print(f" {i}. [{imp['issue_type'].replace('_', ' ').title()}]")
            # Truncate long phrases to 50 characters for display
            phrase = imp['original_phrase']
            shown = f"{phrase[:50]}..." if len(phrase) > 50 else phrase
            print(f" Original: \"{shown}\"")

    if result.suggestions:
        print("\n💡 SUGGESTIONS:")
        for i, sug in enumerate(result.suggestions, 1):
            print(f" {i}. {sug}")

    print("\n" + "=" * 70)
    print("✨ POLISHED RESPONSE:")
    print("=" * 70)
    print(result.polished_response)

    # Save option
    save = input("\n\nSave to file? (filename or 'n'): ").strip()
    if save and save.lower() != 'n':
        output = {
            "polished_response": result.polished_response,
            "original_tone_score": result.original_tone_score,
            "politeness_score": result.politeness_score,
            "improvements": result.improvements,
            "suggestions": result.suggestions
        }
        # Explicit encoding keeps the output independent of the platform default
        with open(save, 'w', encoding="utf-8") as f:
            json.dump(output, f, indent=2)
        print(f"✅ Saved to {save}")


def main():
    parser = argparse.ArgumentParser(
        description="Polish response letters by transforming defensive language into professional prose"
    )
    parser.add_argument(
        "--reviewer-comment", "-r",
        help="File containing reviewer comment, or the comment text directly"
    )
    parser.add_argument(
        "--draft-response", "-d",
        help="File containing draft response, or the text directly"
    )
    parser.add_argument(
        "--output", "-o",
        help="Output file for the polished response"
    )
    parser.add_argument(
        "--response-type", "-t",
        choices=["accept", "partial", "decline"],
        help="Type of response (default: auto-detect)"
    )
    parser.add_argument(
        "--polish-level", "-p",
        choices=["light", "moderate", "heavy"],
        default="moderate",
        help="Level of polishing (default: moderate)"
    )
    parser.add_argument(
        "--interactive", "-i",
        action="store_true",
        help="Run in interactive mode"
    )
    parser.add_argument(
        "--json", "-j",
        action="store_true",
        help="Output as JSON"
    )
    args = parser.parse_args()

    if args.interactive or (not args.reviewer_comment and not args.draft_response):
        interactive_mode()
        return

    # Read inputs
    reviewer_comment = args.reviewer_comment
    draft_response = args.draft_response

    # Check if inputs are files; otherwise treat them as literal text
    if reviewer_comment and os.path.isfile(reviewer_comment):
        with open(reviewer_comment, 'r', encoding="utf-8") as f:
            reviewer_comment = f.read()
    if draft_response and os.path.isfile(draft_response):
        with open(draft_response, 'r', encoding="utf-8") as f:
            draft_response = f.read()

    if not reviewer_comment or not draft_response:
        print("Error: Both --reviewer-comment and --draft-response are required.", file=sys.stderr)
        sys.exit(1)

    # Map response type
    response_type = None
    if args.response_type:
        response_type = ResponseType(args.response_type)

    # Map polish level
    polish_level = PolishLevel(args.polish_level)

    # Process
    polisher = TonePolisher(polish_level=polish_level)
    result = polisher.polish(reviewer_comment, draft_response, response_type, polish_level)

    # Output
    if args.json:
        output = {
            "polished_response": result.polished_response,
            "original_tone_score": result.original_tone_score,
            "politeness_score": result.politeness_score,
            "improvements": result.improvements,
            "suggestions": result.suggestions
        }
        output_text = json.dumps(output, indent=2)
    else:
        output_text = result.polished_response

    if args.output:
        with open(args.output, 'w', encoding="utf-8") as f:
            f.write(output_text)
        print(f"Polished response saved to {args.output}")
    else:
        print(output_text)


if __name__ == "__main__":
    main()
Determine whether an incident in a clinical trial is a "major deviation".
---
name: protocol-deviation-classifier
description: Determine whether an incident in a clinical trial is a "major deviation".
license: MIT
skill-author: AIPOCH
---
# Protocol Deviation Classifier
A clinical trial protocol deviation classification tool that, based on GCP and ICH E6 guidelines, automatically classifies each deviation as a "major deviation" or a "minor deviation".
## When to Use
- Use this skill when the task is to determine whether an incident in a clinical trial is a "major deviation".
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
See `## Features` below for related details.
- Scope-focused workflow aligned to: determining whether an incident in a clinical trial is a "major deviation".
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
- Python 3.8+
- No third-party dependencies (pure Python standard library implementation)
## Example Usage
See `## Usage` below for related details.
```bash
cd "20260318/scientific-skills/Data Analytics/protocol-deviation-classifier"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
python scripts/main.py classify --description "Subject visit delayed by 2 days" --type "Visit Window"
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Features
- **Automatic Classification**: Automatically determines severity based on deviation description
- **Risk Assessment**: Assesses impact on subject safety, data integrity, and scientific validity
- **Regulatory Basis**: Classification basis complies with GCP, ICH E6, and FDA/EMA guidelines
- **Report Generation**: Generates deviation classification reports that meet regulatory requirements
- **Chinese Support**: Full support for Chinese clinical trial scenarios
## Deviation Classification Standards
### Major/Critical Deviation
Deviations that may affect trial data integrity, subject safety, or trial scientific validity:
| Category | Examples |
|------|------|
| Informed Consent | Performing research procedures without informed consent, using expired/incorrect informed consent forms |
| Inclusion/Exclusion Criteria | Enrolling subjects who don't meet inclusion criteria, enrolling subjects who meet exclusion criteria |
| Investigational Product | Overdose administration, contraindicated concomitant medication, incorrect route of administration, randomization error |
| Safety | Not performing safety monitoring as required by protocol, missing SAE/SUSAR reports, delayed reporting |
| Blinding | Unblinding by unauthorized personnel, unrecorded emergency unblinding procedures |
| Data Integrity | Falsifying or fabricating data, systematic omission of critical data |
| Prohibited Operations | Violating key operational procedures of the trial protocol, not performing key efficacy assessments |
### Minor Deviation
Deviations unlikely to affect trial data integrity, subject safety, or trial scientific validity:
| Category | Examples |
|------|------|
| Visit Window | Slightly exceeding visit time window (e.g., within a few days), delay of non-critical visits |
| Sample Collection | Minor timing deviations in non-critical sample collection, slight delays in sample processing |
| Questionnaire Completion | Quality of life questionnaires/diary cards submitted a few days late |
| Data Recording | Delays in non-critical data recording, spelling/formatting errors |
| Procedure Execution | Adjustment of secondary procedure execution order, omission of non-critical assessments (e.g., height measurement) |
| Documentation | Delays in source document signatures, missing secondary documents (e.g., non-critical examination reports) |
## Usage
### Python API
```python
from scripts.main import DeviationClassifier

# Initialize classifier
classifier = DeviationClassifier()

# Classify single deviation
result = classifier.classify(
    description="Subject visit delayed by 2 days",
    deviation_type="Visit Window"
)
print(result.classification.en_name)  # "Minor Deviation"
print(result.confidence)              # Confidence score between 0 and 1
print(result.rationale)               # Classification rationale explanation

# Batch classification
deviations = [
    {"description": "Blood sample collected without informed consent", "type": "Informed Consent"},
    {"description": "Quality of life questionnaire submitted 3 days late", "type": "Data Collection"}
]
batch_results = classifier.classify_batch(deviations)

# Generate report
report = classifier.generate_report(batch_results)
```
### CLI Usage
```bash
# Classify single deviation
python scripts/main.py classify --description "Subject visit delayed by 2 days" --type "Visit Window"
# Batch classification from file
python scripts/main.py batch --input deviations.json --output report.json
# Interactive classification
python scripts/main.py interactive
# Assess deviation impact
python scripts/main.py assess \
--description "Subject accidentally took double dose of investigational drug" \
--safety-impact high \
--data-impact medium \
--scientific-impact medium
```
### Input Format
**JSON Input File Format:**
```json
[
  {
    "id": "DEV-001",
    "description": "Subject visit delayed by 2 days",
    "type": "Visit Window",
    "occurrence_date": "2024-01-15",
    "severity_factors": {
      "safety_impact": "none",
      "data_impact": "low",
      "scientific_impact": "low"
    }
  },
  {
    "id": "DEV-002",
    "description": "Blood collection performed without informed consent",
    "type": "Informed Consent",
    "severity_factors": {
      "safety_impact": "high",
      "data_impact": "high",
      "scientific_impact": "high"
    }
  }
]
```
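A minimal sketch of how a batch file in this format might be checked before it is handed to the classifier. The helper name `load_deviations` and the strict rejection of unknown levels are illustrative assumptions and not part of `scripts/main.py`, which instead falls back to `none` for unrecognized values.

```python
import json
from typing import Any, Dict, List

# Levels and keys taken from the input format shown above.
ALLOWED_LEVELS = {"none", "low", "medium", "high"}
IMPACT_KEYS = ("safety_impact", "data_impact", "scientific_impact")


def load_deviations(raw: str) -> List[Dict[str, Any]]:
    """Parse a JSON array of deviation records and check the documented fields.

    Hypothetical helper: raises ValueError on the first malformed record.
    """
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("Top-level JSON value must be an array of deviation records")
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"Record {index} is not a JSON object")
        if not str(record.get("description", "")).strip():
            raise ValueError(f"Record {index} has no description")
        factors = record.get("severity_factors", {})
        for key in IMPACT_KEYS:
            level = str(factors.get(key, "none")).lower()
            if level not in ALLOWED_LEVELS:
                raise ValueError(f"Record {index}: {key} has unsupported level {level!r}")
    return records


sample = json.dumps([
    {
        "id": "DEV-001",
        "description": "Subject visit delayed by 2 days",
        "type": "Visit Window",
        "severity_factors": {"safety_impact": "none", "data_impact": "low", "scientific_impact": "low"},
    }
])
print(len(load_deviations(sample)), "record(s) passed validation")
```

Failing early on a misspelled level such as `"hgih"` keeps a typo from silently downgrading a deviation; whether that is preferable to a permissive default depends on the workflow.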
### Output Format
**Classification Result:**
```json
{
  "id": "DEV-001",
  "classification": "Minor Deviation",
  "classification_en": "Minor Deviation",
  "confidence": 0.92,
  "rationale": "Visit time window slightly delayed (2 days), does not affect subject safety, data integrity, or trial scientific validity.",
  "risk_factors": {
    "safety_risk": "none",
    "data_integrity_risk": "low",
    "scientific_validity_risk": "none"
  },
  "regulatory_basis": [
    "ICH E6(R2) Section 4.5",
    "GCP Section 6.4.4"
  ],
  "recommended_actions": [
    "Document in file",
    "Track trends"
  ]
}
```
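As a rough illustration of how results in this shape could be consumed downstream, the sketch below tallies classifications and lists the IDs that are not minor. The function name `summarize` and the idea of an "escalation" list are assumptions made for the example; only the `id` and `classification_en` keys come from the format above.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def summarize(results: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Count results per classification label and collect non-minor IDs."""
    items: List[Dict[str, Any]] = list(results)
    counts = Counter(item["classification_en"] for item in items)
    needs_escalation = [
        item["id"] for item in items if item["classification_en"] != "Minor Deviation"
    ]
    return {
        "total": len(items),
        "counts": dict(counts),
        "needs_escalation": needs_escalation,
    }


example_results = [
    {"id": "DEV-001", "classification_en": "Minor Deviation"},
    {"id": "DEV-002", "classification_en": "Major Deviation"},
]
print(summarize(example_results))
```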
## Classification Algorithm
Classification is based on the following assessment dimensions:
1. **Subject Safety Impact**
   - None: No impact
   - Low: Minor impact
   - Medium: Moderate impact
   - High: Serious impact
2. **Data Integrity Impact**
   - None: No impact
   - Low: Minor impact on non-critical data
   - Medium: Partial impact on critical data
   - High: Serious damage to critical data
3. **Trial Scientific Validity Impact**
   - None: No impact
   - Low: Minor impact on statistical power
   - Medium: May affect primary endpoint
   - High: Seriously affects trial conclusion
**Classification Rules:**
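- Description indicates data falsification or fabrication → Critical Deviation
- Any dimension is High → Major Deviation
- Safety is Medium and either Data Integrity or Scientific Validity is Medium or higher → Major Deviation
- Informed consent issue that is not described as a non-critical or minor delay → Major Deviation
- All other cases → Minor Deviation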
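The impact-level rules above can be sketched as a small standalone function. This is an illustration only: `classify_by_rules` and the `LEVELS` table are hypothetical names, and the keyword-based checks that the packaged script applies to the free-text description (for example data falsification or informed consent) are deliberately left out.

```python
# Ordinal scores for the four documented impact levels.
LEVELS = {"none": 0, "low": 1, "medium": 2, "high": 3}


def classify_by_rules(safety: str, data: str, scientific: str) -> str:
    """Apply the documented impact-level rules to three level names."""
    s = LEVELS[safety.lower()]
    d = LEVELS[data.lower()]
    v = LEVELS[scientific.lower()]
    # Rule 1: any dimension rated High is a major deviation.
    if max(s, d, v) == LEVELS["high"]:
        return "Major Deviation"
    # Rule 2: Medium safety impact combined with Medium-or-higher data or scientific impact.
    if s == LEVELS["medium"] and max(d, v) >= LEVELS["medium"]:
        return "Major Deviation"
    # Rule 3: everything else is minor.
    return "Minor Deviation"


print(classify_by_rules("none", "low", "none"))
print(classify_by_rules("medium", "medium", "none"))
```

Note that Medium data and scientific impact without at least Medium safety impact still falls through to a minor deviation under these rules, which is one reason the final determination is left to quality assurance personnel.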
## Regulatory Basis
- ICH E6(R2) Good Clinical Practice Guideline
- ICH E6(R3) Good Clinical Practice Guideline (Draft)
- FDA 21 CFR Part 312 (IND Regulations)
- FDA Guidance for Industry: Oversight of Clinical Investigations
- EMA Reflection Paper on Risk Based Quality Management
- NMPA Good Clinical Practice for Drug Clinical Trials
## Notes
1. This tool provides classification recommendations only; the final determination must be confirmed by clinical quality assurance personnel.
2. Major and critical deviations must be reported to the sponsor and the ethics committee immediately.
3. Review deviation trends regularly and implement CAPA (Corrective and Preventive Actions).
4. Classification standards may vary by regulatory agency, trial type, and protocol requirements.
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
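One way the path-related checklist items could be enforced is sketched below. The helper `resolve_inside` and the directory name `workspace_dir` are hypothetical; the packaged script is not known to contain this check.

```python
from pathlib import Path


def resolve_inside(workspace: str, user_path: str) -> Path:
    """Resolve user_path relative to workspace and reject anything outside it."""
    root = Path(workspace).resolve()
    # Joining with an absolute user_path discards root, so absolute paths
    # outside the workspace are rejected by the same containment test.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"Path escapes the workspace: {user_path}")
    return candidate


print(resolve_inside("workspace_dir", "reports/report.json").name)
try:
    resolve_inside("workspace_dir", "../deviations.json")
except ValueError as exc:
    print(exc)
```

Resolving before comparing matters: a plain string check for `../` misses absolute paths and symlinked directories, whereas comparing resolved paths covers both.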
## Prerequisites
```bash
# Python dependencies
pip install -r requirements.txt
```
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
- Performance optimization
- Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `protocol-deviation-classifier` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `protocol-deviation-classifier` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
## Inputs to Collect
- Required inputs: the user goal, the primary data or source file, and the requested output format.
- Optional inputs: output directory, formatting preferences, and validation constraints.
- If a required input is unavailable, return a short clarification request before continuing.
## Output Contract
- Return a short summary, the main deliverables, and any assumptions that materially affect interpretation.
- If execution is partial, label what succeeded, what failed, and the next safe recovery step.
- Keep the final answer within the documented scope of the skill.
## Validation and Safety Rules
- Validate identifiers, file paths, and user-provided parameters before execution.
- Do not fabricate results, metrics, citations, or downstream conclusions.
- Use safe fallback behavior when dependencies, credentials, or required inputs are missing.
- Surface any execution failure with a concise diagnosis and recovery path.
FILE:references/runtime_checklist.md
# Runtime Checklist
- Category: `Data Analysis`
- Validate the user goal, required inputs, and output format before taking action.
- Ask a targeted clarification question when a required input is missing.
- Keep the response scoped to the documented workflow and state assumptions explicitly.
- Run a non-destructive smoke check before any file-dependent or data-dependent command.
- Recommended smoke check: `python -m py_compile scripts/main.py`
- If execution fails, stop and return a concise recovery path instead of fabricating results.
FILE:requirements.txt
# No third-party packages required: dataclasses and enum ship with the Python 3.8+ standard library.
FILE:scripts/main.py
#!/usr/bin/env python3
"""Protocol Deviation Classifier
Clinical trial protocol deviation classification tool
Based on the GCP/ICH E6 guidelines, it can automatically determine whether a deviation is a "major deviation" or a "minor deviation".
Technical: Risk-based quality management, GCP compliance assessment, deviation classification"""
import argparse
import json
import sys
import re
from enum import Enum
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional, Any, Tuple
from datetime import datetime
class Classification(Enum):
    """Deviation classification enumeration"""
    MAJOR = "major"
    MINOR = "minor"
    CRITICAL = "critical"

    def __str__(self):
        # Display labels match the terms used in the documentation and reports
        mapping = {
            Classification.MAJOR: "Major Deviation",
            Classification.MINOR: "Minor Deviation",
            Classification.CRITICAL: "Critical Deviation"
        }
        return mapping.get(self, self.value)

    @property
    def en_name(self):
        mapping = {
            Classification.MAJOR: "Major Deviation",
            Classification.MINOR: "Minor Deviation",
            Classification.CRITICAL: "Critical Deviation"
        }
        return mapping.get(self, self.value.title())
class RiskLevel(Enum):
    """Risk level enumeration"""
    NONE = "none"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

    def __str__(self):
        mapping = {
            RiskLevel.NONE: "none",
            RiskLevel.LOW: "low",
            RiskLevel.MEDIUM: "medium",
            RiskLevel.HIGH: "high"
        }
        return mapping.get(self, self.value)

    @property
    def score(self):
        """Returns the risk score used in the calculation"""
        scores = {
            RiskLevel.NONE: 0,
            RiskLevel.LOW: 1,
            RiskLevel.MEDIUM: 2,
            RiskLevel.HIGH: 3
        }
        return scores.get(self, 0)
@dataclass
class DeviationEvent:
    """Protocol deviation event data class"""
    id: str = ""
    description: str = ""
    deviation_type: str = ""
    occurrence_date: Optional[str] = None
    site_id: Optional[str] = None
    subject_id: Optional[str] = None
    safety_impact: RiskLevel = RiskLevel.NONE
    data_impact: RiskLevel = RiskLevel.NONE
    scientific_impact: RiskLevel = RiskLevel.NONE

    @classmethod
    def from_dict(cls, data: Dict) -> 'DeviationEvent':
        """Create event from dictionary"""
        def parse_risk(value):
            # Unrecognized or non-string values fall back to RiskLevel.NONE
            if isinstance(value, str):
                try:
                    return RiskLevel(value.lower())
                except ValueError:
                    return RiskLevel.NONE
            return RiskLevel.NONE

        factors = data.get('severity_factors', {})
        return cls(
            id=data.get('id', ''),
            description=data.get('description', ''),
            deviation_type=data.get('type', data.get('deviation_type', '')),
            occurrence_date=data.get('occurrence_date'),
            site_id=data.get('site_id'),
            subject_id=data.get('subject_id'),
            safety_impact=parse_risk(factors.get('safety_impact', 'none')),
            data_impact=parse_risk(factors.get('data_impact', 'none')),
            scientific_impact=parse_risk(factors.get('scientific_impact', 'none'))
        )
@dataclass
class ClassificationResult:
    """Classification result data class"""
    id: str
    classification: Classification
    confidence: float
    rationale: str
    safety_risk: RiskLevel
    data_integrity_risk: RiskLevel
    scientific_validity_risk: RiskLevel
    risk_score: int
    regulatory_basis: List[str] = field(default_factory=list)
    recommended_actions: List[str] = field(default_factory=list)
    key_indicators: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict:
        """Convert to dictionary"""
        return {
            "id": self.id,
            "classification": str(self.classification),
            "classification_en": self.classification.en_name,
            "confidence": round(self.confidence, 2),
            "rationale": self.rationale,
            "risk_factors": {
                "safety_risk": str(self.safety_risk),
                "data_integrity_risk": str(self.data_integrity_risk),
                "scientific_validity_risk": str(self.scientific_validity_risk)
            },
            "risk_score": self.risk_score,
            "regulatory_basis": self.regulatory_basis,
            "recommended_actions": self.recommended_actions,
            "key_indicators": self.key_indicators
        }
class DeviationClassifier:
    """Protocol deviation classifier

    Based on GCP/ICH E6 guidelines, automatically determine the severity of clinical trial deviations."""

    # Major deviation keyword patterns
    MAJOR_INDICATORS = {
        'informed_consent': [
            'informed consent', 'Not obtained.*Consent', 'consent form', r'informed consent',
            'not signed', r'consent', r'signed.*consent'
        ],
        'eligibility': [
            'Inclusion criteria', 'Exclusion criteria', 'Not in compliance.*Enter the group', r'eligibility',
            r'inclusion.*criteria', r'exclusion.*criteria', r'ineligible'
        ],
        'dosing': [
            'overdose', 'double dose', r'overdose', 'excess', 'Error.*Dose',
            r'wrong dose', r'double dose', r'dosing error'
        ],
        'concomitant': [
            'Concomitant medication', 'Contraindications.* Medication', r'concomitant', r'prohibited medication',
            r'forbidden drug'
        ],
        'randomization': [
            'Randomize.*Error', 'Wrong.*Random', r'randomization error',
            r'wrong randomization'
        ],
        'safety_reporting': [
            r'SAE', r'SUSAR', 'Security.*Report', 'False negative', 'Delayed reporting',
            r'serious adverse event', r'safety reporting'
        ],
        'blinding': [
            'Break the blindness', r'unblind', 'Breaking the blindness.*Not yet', 'Unauthorized.*Break the blind'
        ],
        'data_integrity': [
            'forgery', 'tamper', 'data.*false', r'falsified',
            r'fabricated', 'Data fraud'
        ],
        'critical_procedures': [
            'Key.* not executed', 'Not.*Key', 'Missing.*Primary endpoint',
            r'primary endpoint.*missed', r'critical procedure'
        ]
    }

    # Minor deviation keyword patterns
    MINOR_INDICATORS = {
        'visit_window': [
            'Visit.*Delay', 'Visit.* in advance', r'visit.*window',
            'Visit.*[12].*days', r'visit.*[12].*day'
        ],
        'sample_collection': [
            'sample.*delay', 'sampling.*time', r'sample.*delay',
            'Non-critical.*Sample', r'non-critical sample'
        ],
        'questionnaire': [
            'Questionnaire', 'diary cards', r'questionnaire', r'diary card',
            r'QOL', 'quality of life'
        ],
        'documentation': [
            'signature.*delay', 'Documentation.* is missing', r'document.*missing',
            r'signature.*delay', 'record.*delay'
        ],
        'non_critical': [
            'non-critical', 'secondary', 'slight', r'non-critical',
            r'minor', r'slight'
        ]
    }

    # Regulatory basis
    REGULATORY_BASIS = {
        'major': [
            "ICH E6(R2) Section 4.5 - Subject Safety",
            "ICH E6(R2) Section 4.9 - Informed Consent",
            "GCP Section 6.2 - Subject Rights",
            "FDA 21 CFR Part 312.60 - General Investigator Obligations"
        ],
        'minor': [
            "ICH E6(R2) Section 4.6 - Investigational Product",
            "ICH E6(R2) Section 5.1 - Trial Management",
            "GCP Section 6.4.4 - Protocol Compliance"
        ],
        'critical': [
            "ICH E6(R2) Section 2.13 - Data Integrity",
            "ICH E6(R2) Section 4.1.5 - Fraud Prevention",
            "FDA 21 CFR Part 312.70 - Disqualification of Investigators"
        ]
    }
    def __init__(self):
        self._compile_patterns()

    def _compile_patterns(self):
        """Compile regular expression patterns"""
        self.major_patterns = {
            category: [re.compile(p, re.IGNORECASE) for p in patterns]
            for category, patterns in self.MAJOR_INDICATORS.items()
        }
        self.minor_patterns = {
            category: [re.compile(p, re.IGNORECASE) for p in patterns]
            for category, patterns in self.MINOR_INDICATORS.items()
        }
    def classify(
        self,
        description: str,
        deviation_type: str = "",
        event_id: str = "",
        safety_impact: Optional[RiskLevel] = None,
        data_impact: Optional[RiskLevel] = None,
        scientific_impact: Optional[RiskLevel] = None
    ) -> ClassificationResult:
        """Classify an individual deviation event

        Args:
            description: Deviation description
            deviation_type: Deviation type
            event_id: Event ID
            safety_impact: Safety impact level (if known)
            data_impact: Data integrity impact level (if known)
            scientific_impact: Scientific impact level (if known)

        Returns:
            ClassificationResult: classification result"""
        # If no impact level is provided, it is inferred from the description.
        if safety_impact is None:
            safety_impact = self._assess_safety_impact(description, deviation_type)
        if data_impact is None:
            data_impact = self._assess_data_impact(description, deviation_type)
        if scientific_impact is None:
            scientific_impact = self._assess_scientific_impact(description, deviation_type)
        # Calculate the weighted risk score (safety counts most)
        risk_score = (
            safety_impact.score * 3 +
            data_impact.score * 2 +
            scientific_impact.score * 2
        )
        # Apply classification rules
        classification, confidence = self._apply_classification_rules(
            safety_impact, data_impact, scientific_impact, risk_score, description
        )
        # Generate the classification rationale
        rationale = self._generate_rationale(
            classification, safety_impact, data_impact, scientific_impact, description
        )
        # Extract key indicators
        key_indicators = self._extract_key_indicators(description, deviation_type)
        # Obtain regulatory basis
        regulatory_basis = self._get_regulatory_basis(classification, description)
        # Generate recommended actions
        recommended_actions = self._get_recommended_actions(classification)
        return ClassificationResult(
            id=event_id or self._generate_event_id(),
            classification=classification,
            confidence=confidence,
            rationale=rationale,
            safety_risk=safety_impact,
            data_integrity_risk=data_impact,
            scientific_validity_risk=scientific_impact,
            risk_score=risk_score,
            regulatory_basis=regulatory_basis,
            recommended_actions=recommended_actions,
            key_indicators=key_indicators
        )
    def classify_batch(self, events: List[Dict]) -> List[ClassificationResult]:
        """Classify a batch of deviation events

        Args:
            events: List of deviation event dictionaries

        Returns:
            List[ClassificationResult]: Classification result list"""
        results = []
        for event_data in events:
            event = DeviationEvent.from_dict(event_data)
            result = self.classify(
                description=event.description,
                deviation_type=event.deviation_type,
                event_id=event.id,
                safety_impact=event.safety_impact,
                data_impact=event.data_impact,
                scientific_impact=event.scientific_impact
            )
            results.append(result)
        return results
    def _assess_safety_impact(self, description: str, deviation_type: str) -> RiskLevel:
        """Assess the impact on subject safety"""
        text = f"{description} {deviation_type}".lower()
        # High safety impact keywords; short tokens use word boundaries so that
        # words such as "diet" or "studied" do not trigger a match
        high_risk = [
            r'overdose', 'double dose', 'Contraindications', 'allergic reaction',
            'serious adverse events', r'\bsae\b', r'\bdie[ds]?\b', 'life-threatening', 'Hospitalized',
            r'death', r'hospitalization'
        ]
        # Medium safety impact keywords
        medium_risk = [
            'adverse reactions', 'side effect', r'adverse event', 'Concomitant medication',
            r'concomitant', 'drug interactions', r'drug interaction',
            'Dosage adjustment', r'dose adjustment'
        ]
        for pattern in high_risk:
            if re.search(pattern, text, re.IGNORECASE):
                return RiskLevel.HIGH
        for pattern in medium_risk:
            if re.search(pattern, text, re.IGNORECASE):
                return RiskLevel.MEDIUM
        # Informed consent issues are treated as high impact even without specific harm
        if re.search('informed consent|consent', text, re.IGNORECASE):
            return RiskLevel.HIGH
        return RiskLevel.NONE
    def _assess_data_impact(self, description: str, deviation_type: str) -> RiskLevel:
        """Assess the impact on data integrity"""
        text = f"{description} {deviation_type}".lower()
        # High data impact
        high_patterns = [
            'forgery|tampering|false|falsif|fabricat',
            'data.*lost',
            'critical.*data.*missing'
        ]
        # Medium data impact
        medium_patterns = [
            'primary endpoint',
            'critical visit',
            'not.*assess'
        ]
        for pattern in high_patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return RiskLevel.HIGH
        for pattern in medium_patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return RiskLevel.MEDIUM
        # Every deviation is assumed to have at least a low data impact
        return RiskLevel.LOW
    def _assess_scientific_impact(self, description: str, deviation_type: str) -> RiskLevel:
        """Assess the impact on the scientific validity of the trial"""
        text = f"{description} {deviation_type}".lower()
        # High impact: primary endpoint, randomization, or enrollment errors
        high_patterns = [
            'randomize.*error|error.*random|randomiz',
            'primary endpoint',
            'Enter the group.*Error|Wrong.*Enter the group|Inconsistent.*Enter the group|ineligible'
        ]
        # Medium impact
        medium_patterns = [
            'Visit.*missing|missed visit',
            'Efficacy.*Assessment',
            'Break the blind|unblind'
        ]
        for pattern in high_patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return RiskLevel.HIGH
        for pattern in medium_patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return RiskLevel.MEDIUM
        return RiskLevel.NONE
    def _apply_classification_rules(
        self,
        safety_impact: RiskLevel,
        data_impact: RiskLevel,
        scientific_impact: RiskLevel,
        risk_score: int,
        description: str
    ) -> Tuple[Classification, float]:
        """Apply classification rules

        Classification rules:
        - Involving data falsification → critical deviation
        - Any dimension is High → major deviation
        - Safety dimension is Medium and data/science is either Medium+ → major deviation
        - Involving informed consent issues → major deviation
        - Other cases → minor deviation"""
        text = description.lower()
        # Check whether it is a critical deviation (data fraud)
        if re.search('forgery|tampering|false|falsif|fabricat', text):
            return Classification.CRITICAL, 0.98
        # Check for major deviations
        if safety_impact == RiskLevel.HIGH:
            return Classification.MAJOR, 0.95
        if data_impact == RiskLevel.HIGH or scientific_impact == RiskLevel.HIGH:
            return Classification.MAJOR, 0.90
        if safety_impact == RiskLevel.MEDIUM and (
            data_impact.score >= 2 or scientific_impact.score >= 2
        ):
            return Classification.MAJOR, 0.85
        # Check issues related to informed consent
        if re.search('Informed consent|Not obtained.*Consent|consent', text, re.IGNORECASE):
            if not re.search('non-critical|minor|delay', text, re.IGNORECASE):
                return Classification.MAJOR, 0.92
        # Minor deviation in all other cases; confidence drops as the risk score rises
        if risk_score <= 4:
            confidence = 0.90 - (risk_score * 0.05)
        else:
            confidence = 0.70
        return Classification.MINOR, max(0.65, confidence)
    def _generate_rationale(
        self,
        classification: Classification,
        safety_impact: RiskLevel,
        data_impact: RiskLevel,
        scientific_impact: RiskLevel,
        description: str
    ) -> str:
        """Generate the classification rationale"""
        reasons = []
        if classification == Classification.CRITICAL:
            reasons.append("Involves data forgery or tampering, seriously affecting the credibility of trial data.")
        elif classification == Classification.MAJOR:
            reasons.append("This deviation has the following high-risk characteristics:")
            if safety_impact == RiskLevel.HIGH:
                reasons.append("- Seriously affects subject safety")
            if data_impact == RiskLevel.HIGH:
                reasons.append("- Serious damage to data integrity")
            if scientific_impact == RiskLevel.HIGH:
                reasons.append("- Seriously damages the scientific validity of the trial")
            if safety_impact == RiskLevel.MEDIUM:
                reasons.append("- Moderate impact on subject safety")
            # Check informed consent
            if re.search('informed consent|consent', description, re.IGNORECASE):
                reasons.append("- Involves violations of informed consent procedures")
        else:
            reasons.append("This deviation has the following characteristics:")
            if safety_impact == RiskLevel.NONE:
                reasons.append("- Does not affect subject safety")
            if data_impact == RiskLevel.LOW:
                reasons.append("- Minor impact on data integrity")
            if scientific_impact == RiskLevel.NONE:
                reasons.append("- Does not affect the scientific validity of the trial")
            # Check for minor time delays
            if re.search('delay|postpone', description, re.IGNORECASE):
                reasons.append("- It is only a procedural delay and does not affect the core elements of the trial")
        return "\n".join(reasons)
def _extract_key_indicators(self, description: str, deviation_type: str) -> List[str]:
"""Extract key indicators"""
indicators = []
text = f"{description} {deviation_type}".lower()
# Check for significant deviation indicators
for category, patterns in self.major_patterns.items():
for pattern in patterns:
if pattern.search(text):
indicator_map = {
'informed_consent': 'Informed consent issues',
'eligibility': 'Inclusion/exclusion criteria violations',
'dosing': 'Administration/Dosage Issues',
'concomitant': 'Concomitant medication violations',
'randomization': 'Randomization issues',
'safety_reporting': 'Safety reporting issues',
'blinding': 'Blinding violations',
'data_integrity': 'Data integrity issues',
'critical_procedures': 'Missing key procedures'
}
ind = indicator_map.get(category, category)
if ind not in indicators:
indicators.append(ind)
break
# Check for minor deviation indicators
for category, patterns in self.minor_patterns.items():
for pattern in patterns:
if pattern.search(text):
indicator_map = {
'visit_window': 'Visit time window deviation',
'sample_collection': 'Sample collection deviation',
'questionnaire': 'Questionnaire/diary card deviation',
'documentation': 'Document/Signature Delay',
'non_critical': 'Non-critical procedural deviations'
}
ind = indicator_map.get(category, category)
if ind not in indicators:
indicators.append(ind)
break
return indicators[:5] # Returns up to 5 indicators
def _get_regulatory_basis(self, classification: Classification, description: str) -> List[str]:
"""Obtain regulatory basis"""
basis = []
text = description.lower()
if classification == Classification.CRITICAL:
basis = self.REGULATORY_BASIS['critical'].copy()
elif classification == Classification.MAJOR:
basis = self.REGULATORY_BASIS['major'].copy()
# Add specific regulations based on description
if re.search('informed consent|consent', text, re.IGNORECASE):
basis.append("ICH E6(R2) Section 4.8 - Informed Consent Requirements")
if re.search('Randomization|randomiz', text, re.IGNORECASE):
basis.append("ICH E9 - Statistical Principles for Clinical Trials")
else:
basis = self.REGULATORY_BASIS['minor'].copy()
return basis
def _get_recommended_actions(self, classification: Classification) -> List[str]:
"""Get recommended actions"""
if classification == Classification.CRITICAL:
return [
"Notify the sponsor and ethics committee immediately",
"Initiate root cause investigation",
"Implement corrective and preventive actions (CAPA)",
"Consider blacklisting researchers",
"Assess impact on overall trial data"
]
elif classification == Classification.MAJOR:
return [
"Record in the deviation log",
"Report to the sponsor within 24 hours",
"Report to ethics committee (if required by protocol)",
"Assess whether remedial action is needed",
"Track trends and assess whether there is a systemic issue"
]
else:
return [
"Record in the deviation log",
"Monitor trends",
"Resolve at the site level",
"Periodic summary reporting (e.g. quarterly)"
]
def _generate_event_id(self) -> str:
"""Generate event ID"""
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
return f"DEV-{timestamp}"
def generate_report(self, results: List[ClassificationResult]) -> Dict:
"""Generate deviation classification report
Args:
results: list of classified results
Returns:
Dict: report dictionary"""
if not results:
return {"error": "No results to report"}
total = len(results)
major_count = sum(1 for r in results if r.classification == Classification.MAJOR)
minor_count = sum(1 for r in results if r.classification == Classification.MINOR)
critical_count = sum(1 for r in results if r.classification == Classification.CRITICAL)
# Summarize various types of deviations
by_category = {}
for result in results:
for indicator in result.key_indicators:
by_category[indicator] = by_category.get(indicator, 0) + 1
return {
"report_date": datetime.now().isoformat(),
"summary": {
"total_deviations": total,
"critical_count": critical_count,
"major_count": major_count,
"minor_count": minor_count,
"major_rate": round(major_count / total * 100, 1) if total > 0 else 0,
"minor_rate": round(minor_count / total * 100, 1) if total > 0 else 0
},
"category_breakdown": by_category,
"critical_deviations": [
r.to_dict() for r in results
if r.classification == Classification.CRITICAL
],
"major_deviations": [
r.to_dict() for r in results
if r.classification == Classification.MAJOR
],
"minor_deviations": [
r.to_dict() for r in results
if r.classification == Classification.MINOR
],
"recommendations": self._generate_summary_recommendations(results)
}
def _generate_summary_recommendations(self, results: List[ClassificationResult]) -> List[str]:
"""Generate summary recommendations"""
recommendations = []
critical_count = sum(1 for r in results if r.classification == Classification.CRITICAL)
major_count = sum(1 for r in results if r.classification == Classification.MAJOR)
total = len(results)
if critical_count > 0:
recommendations.append(
f"⚠️ Found {critical_count} critical deviation(s); immediate root cause analysis is recommended"
)
if total > 0:
major_rate = major_count / total * 100
if major_rate > 20:
recommendations.append(
f"Major deviation rate ({major_rate:.1f}%) is high; strengthening site training is recommended"
)
elif major_rate > 10:
recommendations.append(
f"Major deviation rate ({major_rate:.1f}%) is moderate; close monitoring is recommended"
)
# Check trends
safety_issues = sum(
1 for r in results
if r.safety_risk in [RiskLevel.HIGH, RiskLevel.MEDIUM]
)
if safety_issues > 3:
recommendations.append(
f"Found {safety_issues} deviation(s) involving safety issues; evaluating subject protection measures is recommended"
)
if not recommendations:
recommendations.append("Deviation control is generally good; continue current practices.")
return recommendations
def main():
"""CLI entry point"""
parser = argparse.ArgumentParser(
description="Protocol Deviation Classifier - clinical trial protocol deviation classification tool",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""Example:
# Classify individual deviations
python scripts/main.py classify -d "Subject visit delayed by 2 days"
# Batch classification from files
python scripts/main.py batch -i deviations.json -o report.json
# Interactive classification
python scripts/main.py interactive"""
)
subparsers = parser.add_subparsers(dest="command", help="Available commands")
# classify command
classify_parser = subparsers.add_parser("classify", help="Classify individual deviations")
classify_parser.add_argument("-d", "--description", required=True, help="Deviation description")
classify_parser.add_argument("-t", "--type", default="", help="Deviation type")
classify_parser.add_argument("--id", default="", help="Event ID")
classify_parser.add_argument("--safety-impact",
choices=["none", "low", "medium", "high"],
default="none", help="Safety impact level")
classify_parser.add_argument("--data-impact",
choices=["none", "low", "medium", "high"],
default="low", help="Data integrity impact level")
classify_parser.add_argument("--scientific-impact",
choices=["none", "low", "medium", "high"],
default="none", help="Scientific validity impact level")
classify_parser.add_argument("-o", "--output", choices=["json", "table"],
default="table", help="Output format")
# batch command
batch_parser = subparsers.add_parser("batch", help="Batch classification")
batch_parser.add_argument("-i", "--input", required=True, help="Input JSON file path")
batch_parser.add_argument("-o", "--output", default="", help="Output file path")
batch_parser.add_argument("--format", choices=["json", "report"],
default="json", help="Output format")
# report command
report_parser = subparsers.add_parser("report", help="Generate summary report")
report_parser.add_argument("-i", "--input", required=True, help="Classification results JSON file")
# interactive command
subparsers.add_parser("interactive", help="Interactive classification")
# demo command
subparsers.add_parser("demo", help="Run sample classification")
args = parser.parse_args()
if not args.command:
parser.print_help()
sys.exit(1)
classifier = DeviationClassifier()
try:
if args.command == "classify":
# Analyze impact levels
safety = RiskLevel(args.safety_impact)
data = RiskLevel(args.data_impact)
scientific = RiskLevel(args.scientific_impact)
result = classifier.classify(
description=args.description,
deviation_type=args.type,
event_id=args.id,
safety_impact=safety,
data_impact=data,
scientific_impact=scientific
)
if args.output == "json":
print(json.dumps(result.to_dict(), ensure_ascii=False, indent=2))
else:
_print_table_result(result)
elif args.command == "batch":
# Read input file
with open(args.input, 'r', encoding='utf-8') as f:
events = json.load(f)
results = classifier.classify_batch(events)
if args.format == "report":
report = classifier.generate_report(results)
output = report
else:
output = [r.to_dict() for r in results]
if args.output:
with open(args.output, 'w', encoding='utf-8') as f:
json.dump(output, f, ensure_ascii=False, indent=2)
print(f"Results saved to: {args.output}")
else:
print(json.dumps(output, ensure_ascii=False, indent=2))
elif args.command == "report":
with open(args.input, 'r', encoding='utf-8') as f:
data = json.load(f)
# Assume that the input is a list of classification results
if isinstance(data, list):
# Convert to ClassificationResult object
results = []
for item in data:
result = ClassificationResult(
id=item.get('id', ''),
classification=Classification(item.get('classification_en', '').lower().split()[0]),
confidence=item.get('confidence', 0),
rationale=item.get('rationale', ''),
safety_risk=RiskLevel.NONE,
data_integrity_risk=RiskLevel.NONE,
scientific_validity_risk=RiskLevel.NONE,
risk_score=item.get('risk_score', 0),
regulatory_basis=item.get('regulatory_basis', []),
recommended_actions=item.get('recommended_actions', []),
key_indicators=item.get('key_indicators', [])
)
results.append(result)
report = classifier.generate_report(results)
print(json.dumps(report, ensure_ascii=False, indent=2))
elif args.command == "interactive":
_run_interactive(classifier)
elif args.command == "demo":
_run_demo(classifier)
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
def _print_table_result(result: ClassificationResult):
"""Print results in table format"""
print("\n" + "=" * 60)
print("Protocol deviation classification result")
print("=" * 60)
print(f"{'Event ID:':<20} {result.id}")
print(f"{'Classification results:':<20} {result.classification} ({result.classification.en_name})")
print(f"{'Confidence:':<20} {result.confidence*100:.1f}%")
print(f"{'Risk score:':<20} {result.risk_score}")
print("-" * 60)
print("Risk assessment:")
print(f" - Subject safety: {result.safety_risk}")
print(f" - Data integrity: {result.data_integrity_risk}")
print(f" - Scientific validity: {result.scientific_validity_risk}")
print("-" * 60)
print("Reason for classification:")
print(result.rationale)
if result.key_indicators:
print("-" * 60)
print("Key indicators:")
for indicator in result.key_indicators:
print(f" • {indicator}")
print("-" * 60)
print("Recommended actions:")
for action in result.recommended_actions:
print(f" • {action}")
print("=" * 60)
def _run_interactive(classifier: DeviationClassifier):
"""Run interactive classification"""
print("\n" + "=" * 60)
print("Protocol Deviation Classifier - Interactive Mode")
print("=" * 60)
print("Type 'quit', 'q', or 'exit' to leave")
while True:
print("\n" + "-" * 40)
description = input("Please enter a description of the deviation: ").strip()
if description.lower() in ['quit', 'q', 'exit']:
print("Goodbye!")
break
if not description:
print("Description cannot be empty, please re-enter.")
continue
deviation_type = input("Please enter deviation type (optional): ").strip()
print("Analyzing...")
result = classifier.classify(
description=description,
deviation_type=deviation_type
)
_print_table_result(result)
def _run_demo(classifier: DeviationClassifier):
"""Run sample classification"""
demo_cases = [
{
"id": "DEV-001",
"description": "Subject visit will be delayed by 2 days",
"type": "visit window"
},
{
"id": "DEV-002",
"description": "Collecting blood samples without obtaining informed consent",
"type": "informed consent"
},
{
"id": "DEV-003",
"description": "Subject mistakenly takes double dose of study drug",
"type": "Medication error"
},
{
"id": "DEV-004",
"description": "Submission of quality of life questionnaire delayed by 3 days",
"type": "data collection"
},
{
"id": "DEV-005",
"description": "Subjects who did not meet the inclusion criteria were enrolled (age exceeded the limit)",
"type": "Inclusion criteria"
}
]
print("\n" + "=" * 60)
print("Protocol Deviation Classifier - demo run")
print("=" * 60)
results = []
for case in demo_cases:
print(f"\n[Case {case['id']}]")
print(f"Description: {case['description']}")
print(f"Type: {case['type']}")
result = classifier.classify(
description=case['description'],
deviation_type=case['type'],
event_id=case['id']
)
results.append(result)
print(f"→ Classification result: {result.classification} (Confidence: {result.confidence*100:.0f}%)")
print(f" Risk score: {result.risk_score}")
# Generate summary report
print("\n" + "=" * 60)
print("Summary report")
print("=" * 60)
report = classifier.generate_report(results)
summary = report['summary']
print(f"Total deviations: {summary['total_deviations']}")
print(f"Critical deviations: {summary['critical_count']}")
print(f"Major deviations: {summary['major_count']} ({summary['major_rate']}%)")
print(f"Minor deviations: {summary['minor_count']} ({summary['minor_rate']}%)")
print("Recommendations:")
for rec in report['recommendations']:
print(f" • {rec}")
if __name__ == "__main__":
main()
Automatically convert uploaded drug application documents (Word/PDF) into XML skeleton structure compliant with eCTD 4.0/3.2.2 specifications.
---
name: ectd-xml-compiler
description: Automatically convert uploaded drug application documents (Word/PDF) into XML skeleton structure compliant with eCTD 4.0/3.2.2 specifications.
license: MIT
skill-author: AIPOCH
---
# eCTD XML Compiler
ID: 197
Automatically convert uploaded drug application documents (Word/PDF) into XML skeleton structure compliant with eCTD 4.0/3.2.2 specifications.
## When to Use
- Use this skill when the task is to convert uploaded drug application documents (Word/PDF) into an XML skeleton structure compliant with eCTD 4.0/3.2.2 specifications.
- Use this skill for academic writing tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
- Scope-focused workflow aligned to: Automatically convert uploaded drug application documents (Word/PDF) into XML skeleton structure compliant with eCTD 4.0/3.2.2 specifications.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
```
python-docx>=0.8.11 # Word document parsing
PyPDF2>=3.0.0 # PDF text extraction
lxml>=4.9.0 # XML processing
```
## Example Usage
See `## Usage` below for related details.
```bash
cd "20260318/scientific-skills/Academic Writing/ectd-xml-compiler"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Overview
eCTD (electronic Common Technical Document) is the ICH standard for submitting drug registration applications electronically to regulatory agencies such as FDA and EMA.
This tool parses uploaded drug application documents (Word/PDF) and converts them into an XML skeleton structure compliant with eCTD 4.0/3.2.2 specifications.
## eCTD Structure
```
eCTD/
├── m1/ # Module 1: Administrative Information and Prescribing Information (region-specific)
│ ├── m1.xml
│ └── ...
├── m2/ # Module 2: CTD Summaries
│ ├── m2.xml
│ └── ...
├── m3/ # Module 3: Quality
│ ├── m3.xml
│ └── ...
├── m4/ # Module 4: Nonclinical Study Reports
│ ├── m4.xml
│ └── ...
├── m5/ # Module 5: Clinical Study Reports
│ ├── m5.xml
│ └── ...
├── index.xml # Master index file
├── index-md5.txt # MD5 checksum file
└── dtd/ # DTD files
```
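The directory layout above can be sketched with `pathlib`. This is a minimal illustration, not the packaged implementation: the function name `create_skeleton` and the `MODULE_IDS` constant are hypothetical, and only the folder names come from the tree shown above.

```python
from pathlib import Path
from typing import List
import tempfile

# Folder names taken from the layout above; the constant name is illustrative.
MODULE_IDS = ["m1", "m2", "m3", "m4", "m5"]


def create_skeleton(root: Path) -> List[Path]:
    """Create the module folders and the dtd folder under root."""
    created = []
    for name in MODULE_IDS + ["dtd"]:
        path = root / name
        # exist_ok=True makes repeated runs safe
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_skeleton(Path(tmp) / "eCTD"):
            print(folder.name)
```

The sketch only creates folders; `index.xml`, `index-md5.txt`, and the per-module XML files would be written in later steps.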
## Usage
```text
python skills/ectd-xml-compiler/scripts/main.py [options] <input_files...>
```
### Arguments
| Argument | Description |
|----------|-------------|
| `input_files` | Input Word/PDF file paths (supports multiple) |
### Options
| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `--output` | `-o` | Output directory path | `./ectd-output` |
| `--module` | `-m` | Target module (m1-m5, auto) | `auto` |
| `--region` | `-r` | Target region (FDA, EMA, ICH) | `ICH` |
| `--version` | `-v` | eCTD version (3.2.2, 4.0) | `4.0` |
| `--dtd-path` | `-d` | Custom DTD path | Built-in DTD |
| `--validate` | | Validate generated XML | `False` |
### Examples
```text
# Basic usage - auto-detect module
python skills/ectd-xml-compiler/scripts/main.py document1.docx document2.pdf
# Specify output directory and module
python skills/ectd-xml-compiler/scripts/main.py -o ./my-ectd -m m3 quality-doc.docx
# FDA submission format
python skills/ectd-xml-compiler/scripts/main.py -r FDA -v 3.2.2 *.pdf
# Validate generated XML
python skills/ectd-xml-compiler/scripts/main.py --validate submission.pdf
```
## Input Document Processing
### Supported Formats
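- Microsoft Word (.docx; the script also accepts .doc, but `python-docx` reads only the .docx format, so convert legacy .doc files first)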
- PDF (.pdf)
### Document Parsing Logic
1. **Heading Recognition**: Extract the heading hierarchy from paragraph styles (Word) or from short numbered lines (PDF)
2. **TOC Mapping**: Auto-recognize section numbers (e.g., 3.2.S.1.1)
3. **Metadata Extraction**: Extract author, date, version, and other information
4. **Content Classification**: Map to corresponding eCTD modules based on keyword matching
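As a rough sketch of the TOC mapping step, a section number such as `3.2.S.1.1` can be split from its title with one regular expression. The pattern and the helper name `parse_section_heading` are assumptions made for illustration; the packaged script may use a different rule.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: a leading digit group, then dot-separated parts that may be
# digits or letters (to allow CTD-style numbers such as 3.2.S.1.1).
SECTION_RE = re.compile(r"^(\d+(?:\.[0-9A-Za-z]+)*)\s+(.+)$")


def parse_section_heading(line: str) -> Optional[Tuple[str, int, str]]:
    """Return (section number, nesting level, title) or None if no match."""
    match = SECTION_RE.match(line.strip())
    if not match:
        return None
    number, title = match.groups()
    # Nesting level is the number of dot-separated parts
    level = number.count(".") + 1
    return number, level, title


if __name__ == "__main__":
    print(parse_section_heading("3.2.S.1.1 Nomenclature"))
    print(parse_section_heading("Introduction"))
```

This is a heuristic: a body line that happens to start with a number (for example a dose) would also match, so a length limit or style check is usually combined with it.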
### Module Auto-Recognition Rules
| Keyword Pattern | Target Module |
|-----------------|---------------|
| Administrative, Label, Package Insert | m1 |
| Summary, Overview | m2 |
| Quality, CMC, API, Drug Product | m3 |
| Nonclinical, Toxicology, Pharmacokinetics | m4 |
| Clinical, Study, Trial | m5 |
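The rules in this table amount to a keyword score per module. The sketch below is one way to express that idea; the `MODULE_KEYWORDS` mapping mirrors the table rather than the script's exact list, and the `m3` default follows the fallback used in `scripts/main.py`.

```python
from typing import Dict, List

# Keywords mirror the table above (lowercased); the script's own list may differ.
MODULE_KEYWORDS: Dict[str, List[str]] = {
    "m1": ["administrative", "label", "package insert"],
    "m2": ["summary", "overview"],
    "m3": ["quality", "cmc", "api", "drug product"],
    "m4": ["nonclinical", "toxicology", "pharmacokinetics"],
    "m5": ["clinical", "study", "trial"],
}


def detect_module(text: str, default: str = "m3") -> str:
    """Pick the module whose keywords occur most often as substrings."""
    lowered = text.lower()
    scores = {
        module: sum(1 for keyword in keywords if keyword in lowered)
        for module, keywords in MODULE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default when nothing matched at all
    return best if scores[best] > 0 else default


if __name__ == "__main__":
    print(detect_module("Toxicology and pharmacokinetics of compound X"))
```

Because matching is by substring, `clinical` also matches inside `nonclinical` and `api` matches inside `therapies`. Word-boundary matching would be stricter, at the cost of a slightly more complex rule.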
## Output Structure
Generated eCTD skeleton contains:
### index.xml
Master index file containing references and sequence information for all modules.
### Module XML (m1.xml - m5.xml)
XML skeleton for each module, containing:
- Document hierarchy structure (`<leaf>`, `<node>`)
- Cross-references (`<cross-reference>`)
- Attribute definitions (ID, version, operation type)
### MD5 Checksums
MD5 checksum values for each file to ensure integrity.
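A per-file checksum can be computed with the standard library by reading the file in chunks, which keeps memory use flat for large PDFs. This is a small sketch of that technique; the helper name `md5_of_file` is illustrative, and the line format of `index-md5.txt` is not shown here because it is not specified in this section.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 4096) -> str:
    """Return the hex MD5 digest of a file, read chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.pdf"
        sample.write_bytes(b"example content")
        print(sample.name, md5_of_file(sample))
```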
## Installation
```text
# Install dependencies
pip install python-docx PyPDF2 lxml
```
## Validation
The `--validate` option runs these checks on the generated XML:
- DTD structure validation
- Required elements and attributes check
- Cross-reference integrity check
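To illustrate the required-attribute check, the sketch below scans `leaf` elements with the standard library and reports missing attributes. It is not the script's validator: the function name `missing_leaf_attributes` and the required set (`ID`, `operation`) are assumptions, and full DTD validation would need a validating parser such as `lxml`.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumed set of required attributes, chosen for illustration only.
REQUIRED_LEAF_ATTRS = ("ID", "operation")


def missing_leaf_attributes(xml_text: str) -> List[str]:
    """List every required attribute missing from a leaf element."""
    root = ET.fromstring(xml_text)
    problems = []
    for elem in root.iter():
        # Strip a '{namespace}' prefix if the parser added one
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name != "leaf":
            continue
        for attr in REQUIRED_LEAF_ATTRS:
            if attr not in elem.attrib:
                label = elem.get("ID", "<no ID>")
                problems.append(f"leaf {label}: missing {attr}")
    return problems


if __name__ == "__main__":
    sample = '<m3><leaf ID="a" operation="new"/><leaf ID="b"/></m3>'
    for problem in missing_leaf_attributes(sample):
        print(problem)
```

Returning a list of messages rather than raising on the first problem lets a report show every issue in one pass.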
## References
- [ICH eCTD Specification v4.0](https://www.ich.org/page/ectd)
- [FDA eCTD Guidance](https://www.fda.gov/drugs/electronic-regulatory-submission-and-review/ectd-technical-conformance-guide)
- [EMA eCTD Requirements](https://www.ema.europa.eu/en/human-regulatory/marketing-authorisation/application-procedures/electronic-application-forms)
## License
MIT License
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
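The path-traversal item in this checklist can be checked by resolving a candidate path and confirming it still sits under the workspace root. The sketch below assumes a helper named `is_inside_workspace`; it is an illustration of the check, not code from the packaged script.

```python
from pathlib import Path
import tempfile


def is_inside_workspace(candidate: str, workspace: str) -> bool:
    """Return True if candidate resolves to a location under workspace."""
    root = Path(workspace).resolve()
    # Joining with an absolute candidate replaces root, which is then rejected
    target = (root / candidate).resolve()
    try:
        target.relative_to(root)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workspace:
        print(is_inside_workspace("docs/report.pdf", workspace))
        print(is_inside_workspace("../outside.pdf", workspace))
```

Resolving before comparing matters: a plain string prefix test would accept `docs/../../outside.pdf`, while the resolved path shows it leaves the workspace.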
## Prerequisites
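Install the packages listed under `## Dependencies` (`python-docx`, `PyPDF2`, `lxml`). `scripts/main.py` imports them optionally and raises an error only when a document type needs a package that is missing.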
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
- Performance optimization
- Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `ectd-xml-compiler` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `ectd-xml-compiler` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## References
- [references/audit-reference.md](references/audit-reference.md) - Supported scope, audit commands, and fallback boundaries
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:references/audit-reference.md
# Audit Reference
## Scope
- Skill: `ectd-xml-compiler`
- Core purpose: Automatically convert uploaded drug application documents (Word/PDF) into XML skeleton structure compliant with eCTD 4.0/3.2.2 specifications.
- Use only within the documented workflow and category boundary defined in `SKILL.md`
## Supported Audit Paths
- `python -m py_compile scripts/main.py`
- `python scripts/main.py --help`
## Fallback Boundary
If required inputs are incomplete, the skill should still return:
- the missing required inputs
- the steps that can still be completed safely
- assumptions that need confirmation before execution
- the next checks before accepting the final deliverable
FILE:requirements.txt
python-docx>=0.8.11
lxml>=4.9.0
PyPDF2>=3.0.0
FILE:scripts/main.py
#!/usr/bin/env python3
"""eCTD XML Compiler
Automatically convert drug application documents (Word/PDF) into eCTD XML skeleton structures that comply with FDA/EMA requirements.
ID: 197"""
import argparse
import hashlib
import os
import re
import sys
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from xml.etree import ElementTree as ET
from xml.dom import minidom
# Optional imports with fallbacks
try:
from docx import Document
HAS_DOCX = True
except ImportError:
HAS_DOCX = False
try:
import PyPDF2
HAS_PYPDF2 = True
except ImportError:
HAS_PYPDF2 = False
try:
from lxml import etree
HAS_LXML = True
except ImportError:
HAS_LXML = False
class ECTDError(Exception):
"""eCTD handling errors"""
pass
class DocumentParser:
"""Document parser base class"""
def __init__(self, file_path: str):
self.file_path = file_path
self.content = ""
self.metadata = {}
self.headings = []
def parse(self) -> Dict:
"""Parse the document and return structured data"""
raise NotImplementedError
def detect_module(self) -> str:
"""Automatically detect the eCTD module it belongs to according to the content"""
content_lower = self.content.lower()
# Module detection rules
# Keywords are matched as lowercase substrings; exact duplicates are removed
# so that no keyword is counted twice
module_keywords = {
"m1": ["administrative", "label", "labeling", "package insert", "form 356h"],
"m2": ["summary", "overview", "introduction"],
"m3": ["quality", "cmc", "api", "drug product", "manufacture", "control"],
"m4": ["non-clinical", "nonclinical", "toxicology", "pharmacokinetics", "pharmacology"],
"m5": ["clinical", "study report", "efficacy", "safety"],
}
scores = {mod: 0 for mod in module_keywords}
for module, keywords in module_keywords.items():
for keyword in keywords:
if keyword.lower() in content_lower:
scores[module] += 1
# Returns the module with the highest score, default is m3
if max(scores.values()) > 0:
return max(scores, key=scores.get)
return "m3"
class WordParser(DocumentParser):
"""Word document parser"""
def parse(self) -> Dict:
if not HAS_DOCX:
raise ECTDError("python-docx is not installed and the Word document cannot be parsed. Please run: pip install python-docx")
doc = Document(self.file_path)
# Extract text content
paragraphs = []
for para in doc.paragraphs:
text = para.text.strip()
if text:
paragraphs.append(text)
# Detect title (based on style)
if para.style and para.style.name:
style_name = para.style.name.lower()
if 'heading' in style_name or 'title' in style_name:
level = self._extract_heading_level(style_name)
self.headings.append({
'level': level,
'text': text,
'style': para.style.name
})
self.content = "\n".join(paragraphs)
# Extract metadata
self.metadata = {
'author': doc.core_properties.author if doc.core_properties else '',
'created': doc.core_properties.created.isoformat() if doc.core_properties and doc.core_properties.created else '',
'modified': doc.core_properties.modified.isoformat() if doc.core_properties and doc.core_properties.modified else '',
'title': doc.core_properties.title if doc.core_properties else '',
}
return {
'content': self.content,
'metadata': self.metadata,
'headings': self.headings,
'paragraphs': paragraphs
}
def _extract_heading_level(self, style_name: str) -> int:
"""Extract heading level from style name"""
match = re.search(r'heading\s*(\d)', style_name, re.IGNORECASE)
if match:
return int(match.group(1))
match = re.search(r'title\s*(\d)', style_name, re.IGNORECASE)
if match:
return int(match.group(1))
return 1
class PDFParser(DocumentParser):
"""PDF document parser"""
def parse(self) -> Dict:
if not HAS_PYPDF2:
raise ECTDError("PyPDF2 is not installed and the PDF document cannot be parsed. Please run: pip install PyPDF2")
with open(self.file_path, 'rb') as f:
reader = PyPDF2.PdfReader(f)
# Extract metadata
meta = reader.metadata
self.metadata = {
'author': meta.get('/Author', '') if meta else '',
'creator': meta.get('/Creator', '') if meta else '',
'producer': meta.get('/Producer', '') if meta else '',
'title': meta.get('/Title', '') if meta else '',
'num_pages': len(reader.pages),
}
# Extract text
paragraphs = []
for page in reader.pages:
text = page.extract_text()
if text:
# Simple segmentation
for line in text.split('\n'):
line = line.strip()
if line:
paragraphs.append(line)
self.content = "\n".join(paragraphs)
# Try to identify headers (based on line length and formatting)
self.headings = self._extract_headings(paragraphs)
return {
'content': self.content,
'metadata': self.metadata,
'headings': self.headings,
'paragraphs': paragraphs
}
def _extract_headings(self, paragraphs: List[str]) -> List[Dict]:
"""Extract possible titles from PDF paragraphs"""
headings = []
for para in paragraphs[:50]: # Only check first 50 rows
# Titles are usually shorter and may include chapter numbers
if len(para) < 100:
# Detect section numbering patterns (e.g. 3.2.S.1.1, 1.1, or 2.3.4)
if re.match(r'^\d+(\.[0-9A-Za-z]+)*\s+[A-Za-z]', para) or \
re.match(r'^[\d\.]+\s+[\u4e00-\u9fa5]', para):
level = para.count('.') + 1
headings.append({
'level': min(level, 6),
'text': para,
'style': 'extracted'
})
return headings
class ECTDCompiler:
"""eCTD XML Compiler"""
def __init__(self, output_dir: str, region: str = "ICH", version: str = "4.0"):
self.output_dir = Path(output_dir)
self.region = region
self.version = version
self.modules = {f"m{i}": [] for i in range(1, 6)}
self.file_hashes = {}
def compile_document(self, file_path: str, target_module: Optional[str] = None):
"""Compile a single document"""
file_path = Path(file_path)
if not file_path.exists():
raise ECTDError(f"File does not exist: {file_path}")
# Select parser
suffix = file_path.suffix.lower()
if suffix in ['.docx', '.doc']:
parser = WordParser(str(file_path))
elif suffix == '.pdf':
parser = PDFParser(str(file_path))
else:
raise ECTDError(f"Unsupported file format: {suffix}")
# Parse document
data = parser.parse()
# Determine target module
if target_module is None or target_module == "auto":
module = parser.detect_module()
else:
module = target_module
# Store parsing results
self.modules[module].append({
'file_path': str(file_path),
'file_name': file_path.name,
'data': data,
'module': module
})
# Calculate file hash
self.file_hashes[file_path.name] = self._calculate_md5(file_path)
return module
def _calculate_md5(self, file_path: Path) -> str:
"""Calculate MD5 hash of file"""
hash_md5 = hashlib.md5()
with open(file_path, "rb") as f:
for chunk in iter(lambda: f.read(4096), b""):
hash_md5.update(chunk)
return hash_md5.hexdigest()
def generate_xml(self):
"""Generate eCTD XML skeleton"""
self.output_dir.mkdir(parents=True, exist_ok=True)
# Create module directory and XML
for module_id in ['m1', 'm2', 'm3', 'm4', 'm5']:
module_dir = self.output_dir / module_id
module_dir.mkdir(exist_ok=True)
self._generate_module_xml(module_id, module_dir)
# Generate main index file
self._generate_index_xml()
# Generate MD5 check file
self._generate_md5_file()
return self.output_dir
def _generate_module_xml(self, module_id: str, module_dir: Path):
"""Generate module XML file"""
root = ET.Element("ectd:ectd")
root.set("xmlns:ectd", f"http://www.ich.org/ectd/{self.version}")
root.set("xmlns:xlink", "http://www.w3.org/1999/xlink")
# Add module root node
module_elem = ET.SubElement(root, f"ectd:{module_id}")
documents = self.modules.get(module_id, [])
# If there is a document, create a leaf node
for doc in documents:
leaf = ET.SubElement(module_elem, "ectd:leaf")
leaf.set("ID", f"{module_id}-{doc['file_name']}")
leaf.set("xlink:href", f"{module_id}/{doc['file_name']}")
leaf.set("operation", "new")
# Add title information
title = ET.SubElement(leaf, "ectd:title")
title.text = doc['data']['metadata'].get('title', '') or doc['file_name']
# Add document properties
if doc['data']['headings']:
for heading in doc['data']['headings'][:5]: # Only include the first 5 headings
node = ET.SubElement(leaf, "ectd:node")
node.set("level", str(heading['level']))
node.text = heading['text']
# If there is no document, add placeholder instructions
if not documents:
placeholder = ET.SubElement(module_elem, "ectd:placeholder")
placeholder.text = f"No documents assigned to {module_id}"
# format and write
xml_str = self._prettify_xml(root)
xml_path = module_dir / f"{module_id}.xml"
with open(xml_path, 'w', encoding='utf-8') as f:
f.write(xml_str)
def _generate_index_xml(self):
"""Generate main index file"""
root = ET.Element("ectd:ectd")
root.set("xmlns:ectd", f"http://www.ich.org/ectd/{self.version}")
root.set("xmlns:xlink", "http://www.w3.org/1999/xlink")
root.set("dtd-version", self.version)
# Add header information
header = ET.SubElement(root, "ectd:header")
submission = ET.SubElement(header, "ectd:submission")
submission.set("type", "initial")
submission.set("date", datetime.now().strftime("%Y-%m-%d"))
applicant = ET.SubElement(header, "ectd:applicant")
applicant.text = "[Applicant name]"
product = ET.SubElement(header, "ectd:product")
product.set("name", "[drug name]")
# Add module reference
modules_elem = ET.SubElement(root, "ectd:modules")
for module_id in ['m1', 'm2', 'm3', 'm4', 'm5']:
module_ref = ET.SubElement(modules_elem, "ectd:module")
module_ref.set("id", module_id)
module_ref.set("xlink:href", f"{module_id}/{module_id}.xml")
module_ref.set("title", self._get_module_title(module_id))
# format and write
xml_str = self._prettify_xml(root)
index_path = self.output_dir / "index.xml"
with open(index_path, 'w', encoding='utf-8') as f:
f.write(xml_str)
def _get_module_title(self, module_id: str) -> str:
"""Get module title"""
titles = {
"m1": "Administrative Information and Prescribing Information",
"m2": "Common Technical Document Summaries",
"m3": "Quality",
"m4": "Nonclinical Study Reports",
"m5": "Clinical Study Reports",
}
return titles.get(module_id, module_id)
def _generate_md5_file(self):
"""Generate MD5 check file"""
md5_path = self.output_dir / "index-md5.txt"
with open(md5_path, 'w', encoding='utf-8') as f:
f.write("# eCTD MD5 Checksums\n")
f.write(f"# Generated: {datetime.now().isoformat()}\n")
f.write(f"# Version: {self.version}\n")
f.write(f"# Region: {self.region}\n\n")
for filename, hash_value in sorted(self.file_hashes.items()):
f.write(f"{hash_value} {filename}\n")
# Add generated XML file hash
for module_id in ['m1', 'm2', 'm3', 'm4', 'm5']:
xml_file = self.output_dir / module_id / f"{module_id}.xml"
if xml_file.exists():
hash_value = self._calculate_md5(xml_file)
f.write(f"{hash_value} {module_id}/{module_id}.xml\n")
# index.xml
index_file = self.output_dir / "index.xml"
if index_file.exists():
hash_value = self._calculate_md5(index_file)
f.write(f"{hash_value} index.xml\n")
def _prettify_xml(self, elem) -> str:
"""Formatted XML output"""
rough_string = ET.tostring(elem, encoding='unicode')
reparsed = minidom.parseString(rough_string)
return reparsed.toprettyxml(indent=" ")
def validate(self) -> List[str]:
"""Validate the generated XML structure"""
errors = []
# Basic verification
index_file = self.output_dir / "index.xml"
if not index_file.exists():
errors.append("Missing main index file index.xml")
for module_id in ['m1', 'm2', 'm3', 'm4', 'm5']:
module_xml = self.output_dir / module_id / f"{module_id}.xml"
if not module_xml.exists():
errors.append(f"Missing module file {module_id}.xml")
# DTD validation using lxml (if available)
if HAS_LXML:
errors.extend(self._dtd_validation())
return errors
def _dtd_validation(self) -> List[str]:
"""DTD verification"""
errors = []
# Here you can add the actual DTD validation logic
# Need to load ICH eCTD DTD file
return errors
def create_parser() -> argparse.ArgumentParser:
"""Create a command line argument parser"""
parser = argparse.ArgumentParser(
description="eCTD XML Compiler - Convert drug submission documents to eCTD XML skeleton structures",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""Example:
%(prog)s document.docx report.pdf
%(prog)s -o ./my-ectd -m m3 quality-doc.docx
%(prog)s -r FDA -v 3.2.2 *.pdf
%(prog)s --validate submission.pdf"""
)
parser.add_argument(
'input_files',
nargs='+',
help='Input Word/PDF file path (supports multiple)'
)
parser.add_argument(
'-o', '--output',
default='./ectd-output',
help='Output directory path (default: ./ectd-output)'
)
parser.add_argument(
'-m', '--module',
choices=['m1', 'm2', 'm3', 'm4', 'm5', 'auto'],
default='auto',
help='Target module (default: auto, detected automatically)'
)
parser.add_argument(
'-r', '--region',
choices=['FDA', 'EMA', 'ICH'],
default='ICH',
help='Target region (default: ICH)'
)
parser.add_argument(
'-v', '--version',
choices=['3.2.2', '4.0'],
default='4.0',
help='eCTD version (default: 4.0)'
)
parser.add_argument(
'-d', '--dtd-path',
help='Custom DTD path'
)
parser.add_argument(
'--validate',
action='store_true',
help='Validate generated XML'
)
parser.add_argument(
'--verbose',
action='store_true',
help='Show details'
)
return parser
def main():
"""main function"""
parser = create_parser()
args = parser.parse_args()
# Check dependencies
missing_deps = []
if not HAS_DOCX:
missing_deps.append("python-docx")
if not HAS_PYPDF2:
missing_deps.append("PyPDF2")
if missing_deps:
print(f"Warning: Missing optional dependencies: {', '.join(missing_deps)}")
print("Installation: pip install " + " ".join(missing_deps))
# Initialize the compiler
try:
compiler = ECTDCompiler(
output_dir=args.output,
region=args.region,
version=args.version
)
except Exception as e:
print(f"Error: Failed to initialize compiler - {e}", file=sys.stderr)
sys.exit(1)
# Process input files
processed = 0
for file_path in args.input_files:
if args.verbose:
print(f"Processing: {file_path}")
try:
module = compiler.compile_document(file_path, args.module)
if args.verbose:
print(f" -> assigned to module: {module}")
processed += 1
except ECTDError as e:
print(f"Error [{file_path}]: {e}", file=sys.stderr)
except Exception as e:
print(f"Error [{file_path}]: {str(e)}", file=sys.stderr)
if processed == 0:
print("Error: No files were successfully processed", file=sys.stderr)
sys.exit(1)
print(f"\nSuccessfully processed {processed} file(s)")
# Generate XML
print("\nGenerating eCTD XML skeleton...")
output_dir = compiler.generate_xml()
print(f"Output directory: {output_dir.absolute()}")
# Show module allocation statistics
print("Module assignment:")
for module_id in ['m1', 'm2', 'm3', 'm4', 'm5']:
count = len(compiler.modules[module_id])
if count > 0:
print(f" {module_id}: {count} documents")
# Verification (if requested)
if args.validate:
print("Validating XML structure...")
errors = compiler.validate()
if errors:
print("The following issues were found:")
for error in errors:
print(f" - {error}")
else:
print("Verification passed!")
print("Done!")
print(f"The eCTD skeleton has been generated in: {output_dir.absolute()}")
if __name__ == '__main__':
main()
Design clinical trial CRFs with proper validation rules
---
name: ecrf-designer
description: Design clinical trial CRFs with proper validation rules
version: 1.0.0
category: Pharma
tags: []
author: AIPOCH
license: MIT
status: Draft
risk_level: Medium
skill_type: Tool/Script
owner: AIPOCH
reviewer: ''
last_updated: '2026-02-06'
---
# eCRF Designer
Clinical data collection form design.
## Use Cases
- Case report form creation
- CDISC SDTM compliance
- EDC system setup
- Data validation rules
## Parameters
- `visit_schedule`: Time points
- `data_elements`: Variables to collect
- `cdisc_domain`: SDTM domain
## Returns
- CRF specifications
- Field validation rules
- Skip logic patterns
- Data dictionary
## Example
Demographics form with edit checks for age range
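As a hedged illustration of the example above, the sketch below shows one way an age-range edit check on a demographics form could look. The function names `age_on` and `check_age_range`, the 18 to 85 bounds, and the use of a reference date are assumptions for illustration; they are not part of the packaged script.

```python
from datetime import date
from typing import List


def age_on(birth_date: date, on_date: date) -> int:
    """Return completed years between birth_date and on_date."""
    years = on_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in on_date's year.
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def check_age_range(birth_date: date, on_date: date,
                    min_age: int = 18, max_age: int = 85) -> List[str]:
    """Hypothetical edit check: return query messages (empty list if clean)."""
    queries: List[str] = []
    if birth_date > on_date:
        queries.append("Date of birth is after the reference date")
        return queries
    age = age_on(birth_date, on_date)
    if not (min_age <= age <= max_age):
        queries.append(
            f"Age {age} is outside the allowed range {min_age}-{max_age}"
        )
    return queries
```

Returning a list of messages rather than raising keeps the check usable for batch review, where several fields on one form may each produce a query.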
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
## Prerequisites
No additional Python packages required.
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
- Performance optimization
- Additional feature support
FILE:scripts/main.py
#!/usr/bin/env python3
"""
eCRF Designer
Design clinical trial CRFs with proper validation rules.
"""
import argparse
import json
class eCRFDesigner:
"""Design electronic Case Report Forms."""
FIELD_TYPES = {
"text": {"validation": ["max_length", "regex"], "example": "Free text"},
"number": {"validation": ["min", "max", "decimal_places"], "example": "123.45"},
"integer": {"validation": ["min", "max"], "example": "42"},
"date": {"validation": ["min_date", "max_date"], "example": "2024-01-15"},
"choice": {"validation": ["options"], "example": "Single select"},
"multichoice": {"validation": ["options", "max_selections"], "example": "Multi select"},
"yesno": {"validation": [], "example": "Yes/No"},
"calculated": {"validation": ["formula"], "example": "Auto-calculated"}
}
CRF_TEMPLATES = {
"demographics": {
"fields": [
{"name": "subject_id", "type": "text", "required": True, "label": "Subject ID"},
{"name": "birth_date", "type": "date", "required": True, "label": "Date of Birth"},
{"name": "gender", "type": "choice", "options": ["Male", "Female", "Other"], "label": "Gender"},
{"name": "race", "type": "choice", "options": ["Asian", "Black", "White", "Other"], "label": "Race"}
]
},
"adverse_event": {
"fields": [
{"name": "ae_description", "type": "text", "required": True, "label": "AE Description"},
{"name": "severity", "type": "choice", "options": ["Mild", "Moderate", "Severe"], "label": "Severity"},
{"name": "start_date", "type": "date", "required": True, "label": "Start Date"},
{"name": "related", "type": "yesno", "label": "Related to Study Drug?"}
]
},
"efficacy": {
"fields": [
{"name": "visit_date", "type": "date", "required": True, "label": "Visit Date"},
{"name": "tumor_size", "type": "number", "min": 0, "label": "Tumor Size (mm)"},
{"name": "response", "type": "choice", "options": ["CR", "PR", "SD", "PD"], "label": "Response"}
]
}
}
def generate_crf(self, template_name, custom_fields=None):
"""Generate CRF from template."""
crf = {
"form_name": template_name,
"version": "1.0",
"fields": []
}
if template_name in self.CRF_TEMPLATES:
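# Copy each field dict so later edits (e.g. add_validation) do not mutate the shared template
crf["fields"] = [dict(field) for field in self.CRF_TEMPLATES[template_name]["fields"]]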
if custom_fields:
crf["fields"].extend(custom_fields)
return crf
def add_validation(self, field, rules):
"""Add validation rules to field."""
field["validation"] = rules
return field
def export_json(self, crf, output_file):
"""Export CRF to JSON."""
with open(output_file, 'w') as f:
json.dump(crf, f, indent=2)
def print_crf(self, crf):
"""Print CRF structure."""
print(f"\n{'='*60}")
print(f"eCRF: {crf['form_name']} (v{crf['version']})")
print(f"{'='*60}\n")
for field in crf["fields"]:
print(f"Field: {field['name']}")
print(f" Label: {field['label']}")
print(f" Type: {field['type']}")
print(f" Required: {field.get('required', False)}")
if "options" in field:
print(f" Options: {', '.join(field['options'])}")
print()
print(f"{'='*60}\n")
def main():
parser = argparse.ArgumentParser(description="eCRF Designer")
parser.add_argument("--template", "-t", choices=["demographics", "adverse_event", "efficacy"],
default="demographics", help="CRF template")
parser.add_argument("--output", "-o", default="crf.json", help="Output file")
parser.add_argument("--list-types", action="store_true", help="List field types")
args = parser.parse_args()
designer = eCRFDesigner()
if args.list_types:
print("\nAvailable field types:")
for ftype, info in designer.FIELD_TYPES.items():
print(f" {ftype}: {info['example']}")
return
crf = designer.generate_crf(args.template)
designer.print_crf(crf)
designer.export_json(crf, args.output)
print(f"CRF exported to: {args.output}")
if __name__ == "__main__":
main()
Generates detailed text descriptions of medical images and charts for.
---
name: visual-content-desc
description: Generates detailed text descriptions of medical images and charts for.
license: MIT
skill-author: AIPOCH
---
# Visual Content Desc
Creates accessible descriptions of medical visuals.
## When to Use
- Use this skill when the task needs detailed text descriptions of medical images and charts.
- Use this skill for academic writing tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
See `## Features` below for related details.
- Scope-focused workflow aligned to: Generates detailed text descriptions of medical images and charts for.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
See `## Prerequisites` below for related details.
- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `Third-party packages`: `not explicitly version-pinned in this skill package`. Add pinned versions if this skill needs stricter environment control.
## Example Usage
```bash
cd "20260318/scientific-skills/Academic Writing/visual-content-desc"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Features
- Image description generation
- Chart interpretation
- Alt text creation
- Figure legend assistance
## Input Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `image_type` | str | Yes | "microscopy", "chart", "scan" |
| `key_features` | list | Yes | Important visual elements |
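
The two parameters in the table can be checked before any description is generated. The following is a minimal sketch, assuming the three listed `image_type` values are the complete supported set; the helper name `validate_inputs` is hypothetical and does not appear in `scripts/main.py`.

```python
from typing import List

# Image types taken from the parameter table; treating them as the
# full supported set is an assumption.
ALLOWED_IMAGE_TYPES = {"microscopy", "chart", "scan"}


def validate_inputs(image_type: str, key_features: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the inputs are usable."""
    problems: List[str] = []
    if image_type not in ALLOWED_IMAGE_TYPES:
        problems.append(f"Unsupported image_type: {image_type!r}")
    if not key_features:
        problems.append("key_features must contain at least one item")
    elif any(not isinstance(f, str) or not f.strip() for f in key_features):
        problems.append("key_features must be non-empty strings")
    return problems
```

Collecting every problem in one pass lets the caller report all missing or invalid fields at once, which matches the skill's guidance to state exactly which inputs are missing.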
## Output Format
```json
{
"description": "string",
"alt_text": "string",
"figure_text": "string"
}
```
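
A consumer of this output may want to confirm that the three documented keys are present and hold strings. The sketch below is one possible check; the helper name `missing_or_invalid_keys` and the sample payload are illustrative assumptions.

```python
import json
from typing import Any, Dict, List

# Keys documented in the output format.
REQUIRED_KEYS = ("description", "alt_text", "figure_text")


def missing_or_invalid_keys(payload: Dict[str, Any]) -> List[str]:
    """Return the documented keys that are absent or not strings."""
    return [key for key in REQUIRED_KEYS if not isinstance(payload.get(key), str)]


# Sample payload invented for illustration.
result = json.loads(
    '{"description": "A chart.", "alt_text": "Chart", '
    '"figure_text": "Figure shows a chart."}'
)
print(missing_or_invalid_keys(result))  # prints []
```

Extra keys are tolerated on purpose, so a payload that carries additional fields still passes the check.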
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
## Prerequisites
No additional Python packages required.
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
- Performance optimization
- Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `visual-content-desc` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `visual-content-desc` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:references/guidelines.md
# Visual Content Desc - References
## Accessibility Standards
- WCAG Image Guidelines
- Alt Text Best Practices
- Medical Image Description Standards
FILE:scripts/main.py
#!/usr/bin/env python3
"""Visual Content Desc - Image description generator."""
import json
class VisualContentDesc:
"""Generates descriptions of medical images."""
def describe(self, image_type: str, key_features: list) -> dict:
"""Generate image description."""
features_text = ", ".join(key_features)
description = f"This {image_type} shows {features_text}."
alt_text = f"Medical {image_type}: {features_text}"
return {
"description": description,
"alt_text": alt_text[:150],
"figure_text": f"Figure shows {features_text}.",
"image_type": image_type
}
def main():
describer = VisualContentDesc()
result = describer.describe("microscopy", ["cellular structures", "stained nuclei", "tissue architecture"])
print(json.dumps(result, indent=2))
if __name__ == "__main__":
main()
Simulate standardized patient encounters for medical training, supporting OSCE-style history-taking practice, communication skills rehearsal, and educational...
---
name: virtual-patient-roleplay
description: Simulate standardized patient encounters for medical training, supporting OSCE-style history-taking practice, communication skills rehearsal, and educational debriefing.
license: MIT
skill-author: AIPOCH
---
# Virtual Patient Roleplay
Structured standardized-patient simulation for medical training and clinical interview practice.
> **Educational Disclaimer:** All output is for training simulation only. This skill does not provide real clinical diagnosis, treatment selection, or emergency instructions. Faculty supervision is required for formal assessment use.
## Quick Check
```bash
python -m py_compile scripts/main.py
python -c "from scripts.main import PatientSimulator; sim=PatientSimulator('chest_pain'); print(sim.ask('Where does the pain go?')['patient_response'])"
```
## When to Use
- Use this skill for OSCE-style history-taking practice, communication skills rehearsal, or debrief planning.
- Use this skill when a learner needs to practice clinical interviewing with a simulated patient response.
- Do not use this skill for real patient triage, clinical diagnosis, treatment selection, or emergency guidance.
## Workflow
1. Confirm the training goal, scenario type, learner level, and output focus (questioning, bedside manner, or debriefing).
2. Check whether the request is for live roleplay, case setup, feedback, or post-encounter summary.
3. Use the packaged simulator for supported scenarios; otherwise provide a manual roleplay scaffold without inventing unsupported medical certainty.
4. Return the patient response or teaching artifact with assumptions, missed-question prompts, and debrief notes.
5. If the request exceeds educational scope, stop and restate the boundary explicitly.
## Usage
```bash
python -c "from scripts.main import PatientSimulator; sim=PatientSimulator('chest_pain'); print(sim.ask('Where does the pain go?')['patient_response'])"
python -c "from scripts.main import PatientSimulator; sim=PatientSimulator('headache'); print(sim.ask('Did the pain start suddenly?')['patient_response'])"
```
## Parameters
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `scenario` | string | No | `chest_pain` | Scenario: `chest_pain`, `headache`, `abdominal_pain` |
| `student_question` | string | Yes (for interaction) | — | Learner question posed to the patient |
| `difficulty` | string | No | `intermediate` | Scenario difficulty level |
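
The defaults and requirements in the table can be applied before a simulator is created. This is a minimal sketch under the assumption that the three listed scenarios are the only supported ones; `normalize_request` is a hypothetical helper, not a function in `scripts/main.py`.

```python
from typing import Dict, Optional

# Scenario identifiers taken from the parameter table.
SUPPORTED_SCENARIOS = ("chest_pain", "headache", "abdominal_pain")


def normalize_request(student_question: str,
                      scenario: Optional[str] = None,
                      difficulty: Optional[str] = None) -> Dict[str, str]:
    """Apply the documented defaults and reject unsupported input."""
    scenario = scenario or "chest_pain"
    difficulty = difficulty or "intermediate"
    if scenario not in SUPPORTED_SCENARIOS:
        raise ValueError(f"Unsupported scenario: {scenario}")
    if not student_question or not student_question.strip():
        raise ValueError("student_question is required for interaction")
    return {
        "scenario": scenario,
        "difficulty": difficulty,
        "student_question": student_question.strip(),
    }
```

Raising `ValueError` for an unsupported scenario gives the caller a clear point at which to fall back to a manual scenario outline instead of claiming a simulated run succeeded.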
## Output
- Simulated patient response
- Scenario-specific cues and debrief elements
- Explicit reminder that output is educational, not clinical advice
## Scope Boundaries
- This skill supports training simulations, not real clinical triage.
- This skill does not provide diagnosis, treatment selection, or emergency instructions.
- This skill should not be used as a substitute for faculty supervision or patient care.
## Stress-Case Rules
For complex multi-constraint requests, always include these explicit blocks:
1. Training Objective
2. Scenario Assumptions
3. Roleplay Output
4. Educational Limits
5. Debrief and Next Checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate clinical certainty, real patient data, or verified diagnostic outcomes.
## Input Validation
This skill accepts: a scenario identifier and a learner question for standardized patient simulation in a medical training context.
If the request does not involve educational patient simulation — for example, asking for real clinical diagnosis, treatment recommendations, emergency triage, or non-medical roleplay — do not proceed with the workflow. Instead respond:
> "virtual-patient-roleplay is designed for medical training simulations only. Your request appears to be outside this scope. Please provide a scenario and learner question for educational practice, or use a more appropriate tool."
## References
- [references/references.md](references/references.md) — Educational standards and simulation frameworks
- [references/audit-reference.md](references/audit-reference.md) — Supported scope, audit commands, and fallback boundaries
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:references/audit-reference.md
# Audit Reference
## Supported Scope
- Simulate standardized-patient interactions for teaching and rehearsal.
- Return educational feedback, missed-question hints, and debrief support.
- Keep outputs bounded to simulation and training support.
## Stable Audit Commands
```bash
python -m py_compile scripts/main.py
python -c "from scripts.main import PatientSimulator; sim=PatientSimulator('chest_pain'); print(sim.ask('Where does the pain go?')['patient_response'])"
```
## Fallback Boundaries
- If the request asks for real diagnosis, treatment, or emergency advice, stop and restate that the skill is educational only.
- If the requested scenario is unsupported, provide a manual scenario outline rather than claiming a simulated run succeeded.
- If execution fails, return a structured teaching scaffold with learner goals, patient cues, and debrief prompts.
## Output Guardrails
- Separate simulated patient speech from teaching commentary.
- Keep hidden-case facts distinct from revealed information.
- State clearly that final clinical judgment belongs to qualified supervision and real patient evaluation.
FILE:references/guidelines.md
# Virtual Patient Roleplay - References
## Educational Standards
- ASPE (Association of Standardized Patient Educators) Standards of Practice
- AAMC OSCE Guidelines for Medical Schools
- ACGME Milestones for Clinical Skills Assessment
## Clinical Scenarios Reference
- NBME Step 2 CK Practice Materials
- USMLE Clinical Skills Examination Cases
- Standardized Patient Case Development Guidelines
## Communication Skills Framework
- Calgary-Cambridge Guide to the Medical Interview
- SEGUE Framework for Communication Assessment
- Kalamazoo Consensus Statement on Communication Skills
## Technical Implementation
- Python 3.8+
- JSON for case data storage
- Modular design for extensibility
FILE:references/references.md
# Virtual Patient Roleplay References
## Educational Standards
- ASPE Standards of Best Practice
- AAMC OSCE guidance for medical schools
- ACGME clinical-skills milestones
## Simulation Design
- Standardized patient case-design principles
- Structured debriefing guidance for learner feedback
- Communication frameworks for patient interviewing
FILE:requirements.txt
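# No third-party packages required.
# dataclasses and enum are part of the Python standard library and must not be installed from PyPI.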
FILE:scripts/main.py
#!/usr/bin/env python3
"""
Virtual Patient Roleplay - Standardized Patient Simulator
Simulates patient interactions for medical education training.
"""
import json
import random
from dataclasses import dataclass, field
from typing import List, Dict, Optional
from enum import Enum
class EmotionalState(Enum):
CALM = "calm"
ANXIOUS = "anxious"
IRRITABLE = "irritable"
DEPRESSED = "depressed"
CONFUSED = "confused"
@dataclass
class PatientProfile:
"""Patient demographic and background information."""
name: str
age: int
gender: str
occupation: str
medical_history: List[str] = field(default_factory=list)
allergies: List[str] = field(default_factory=list)
medications: List[str] = field(default_factory=list)
@dataclass
class ClinicalScenario:
"""Clinical case template."""
chief_complaint: str
history_of_present_illness: str
associated_symptoms: List[str]
key_findings: Dict[str, str]
emotional_state: EmotionalState
hidden_concerns: List[str]
class PatientSimulator:
"""Simulates standardized patient interactions."""
SCENARIOS = {
"chest_pain": {
"chief_complaint": "Chest pain for 2 hours",
"hpi": "Started while climbing stairs. Pressure-like, 8/10, radiates to left arm. Associated with shortness of breath and sweating.",
"symptoms": ["dyspnea", "diaphoresis", "nausea"],
"findings": {"vitals": "HR 110, BP 160/95", "ecg": "ST elevation in V1-V4"},
"emotion": EmotionalState.ANXIOUS,
"hidden": ["Family history of MI", "Smokes 1 pack/day", "Afraid of dying"]
},
"headache": {
"chief_complaint": "Severe headache for 3 days",
"hpi": "Worst headache of life. Sudden onset. Associated with photophobia and neck stiffness.",
"symptoms": ["photophobia", "phonophobia", "neck_stiffness"],
"findings": {"vitals": "BP 180/110", "neuro": "Positive Kernig sign"},
"emotion": EmotionalState.IRRITABLE,
"hidden": ["Had similar episode 6 months ago", "Takes ibuprofen daily"]
},
"abdominal_pain": {
"chief_complaint": "Right lower quadrant pain",
"hpi": "Started around belly button, moved to RLQ. Gradual worsening over 12 hours. Loss of appetite.",
"symptoms": ["anorexia", "nausea", "low_grade_fever"],
"findings": {"vitals": "T 37.8C, HR 95", "exam": "Tenderness at McBurney's point"},
"emotion": EmotionalState.CALM,
"hidden": ["Fear of surgery", "No health insurance"]
}
}
def __init__(self, scenario: str = "chest_pain", difficulty: str = "intermediate"):
self.scenario_type = scenario
self.difficulty = difficulty
# Deterministic seed keeps the same scenario setup reproducible across audits.
# A string seed is used because the built-in hash() is salted per process for strings.
self.seed = f"{scenario}:{difficulty}"
self.rng = random.Random(self.seed)
self.scenario = self._load_scenario(scenario)
self.conversation_history: List[Dict] = []
self.asked_questions: set = set()
self.revealed_info: set = set()
# Generate patient profile
self.patient = self._generate_patient_profile()
    def _load_scenario(self, scenario_type: str) -> Dict:
        """Load scenario template."""
        if scenario_type not in self.SCENARIOS:
            raise ValueError(f"Unknown scenario: {scenario_type}. Available: {list(self.SCENARIOS.keys())}")
        return self.SCENARIOS[scenario_type]

    def _generate_patient_profile(self) -> PatientProfile:
        """Generate random but consistent patient demographics."""
        first_names_m = ["James", "Robert", "John", "Michael", "David"]
        first_names_f = ["Mary", "Patricia", "Jennifer", "Linda", "Elizabeth"]
        last_names = ["Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia"]
        gender = self.rng.choice(["male", "female"])
        first_names = first_names_m if gender == "male" else first_names_f
        name = f"{self.rng.choice(first_names)} {self.rng.choice(last_names)}"
        age = self.rng.randint(35, 75) if self.scenario_type == "chest_pain" else self.rng.randint(18, 65)
        occupations = ["teacher", "retired", "office worker", "truck driver", "nurse"]
        return PatientProfile(
            name=name,
            age=age,
            gender=gender,
            occupation=self.rng.choice(occupations),
            medical_history=["hypertension"] if self.rng.random() > 0.5 else [],
            allergies=["penicillin"] if self.rng.random() > 0.7 else [],
            medications=["lisinopril"] if self.rng.random() > 0.5 else []
        )

    def ask(self, question: str) -> Dict:
        """
        Process student question and generate patient response.

        Args:
            question: Student's question text

        Returns:
            Response dictionary with patient answer and feedback
        """
        question_lower = question.lower()
        self.asked_questions.add(question_lower)
        # Determine response based on question type
        response = self._generate_response(question_lower)
        # Record in history
        exchange = {
            "question": question,
            "response": response["patient_response"],
            "revealed": list(self.revealed_info)
        }
        self.conversation_history.append(exchange)
        # Generate feedback
        feedback = self._generate_feedback()
        return {
            "patient_response": response["patient_response"],
            "emotional_state": self.scenario["emotion"].value,
            "physical_cues": response.get("physical_cues", ""),
            "hidden_information": self._get_unrevealed_info(),
            "conversation_history": self.conversation_history,
            "feedback": feedback
        }

    def _generate_response(self, question: str) -> Dict:
        """Generate contextual patient response."""
        responses = {
            "chest_pain": {
                "pain": ("It feels like an elephant is sitting on my chest. It's crushing and heavy.",
                         "Patient clutches chest while speaking"),
                "arm": ("Yes, it goes down my left arm. Feels numb and tingly.",
                        "Patient rubs left arm"),
                "family": ("My father died of a heart attack when he was 55.",
                           ""),
                "smoke": ("I smoke about a pack a day. I know I should quit...",
                          "Patient looks embarrassed"),
                "name": (f"I'm {self.patient.name}.",
                         ""),
                "default": ("I'm not sure I understand. Can you ask differently?",
                            "Patient looks confused")
            },
            "headache": {
                "head": ("It's the worst headache I've ever had. Like someone hit me with a hammer.",
                         "Patient holds head in hands, avoids light"),
                "light": ("Yes, the light hurts my eyes terribly.",
                          "Patient squints"),
                "neck": ("My neck is so stiff I can barely move it.",
                         "Patient demonstrates limited neck mobility"),
                "default": ("Could you repeat that? The pain makes it hard to concentrate.",
                            "Patient appears distracted")
            },
            "abdominal_pain": {
                "pain": ("It started near my belly button but now it's here on the right side.",
                         "Patient points to RLQ"),
                "eat": ("I haven't felt like eating anything since yesterday.",
                        ""),
                "fever": ("I think I might have a slight fever. I feel warm.",
                          "Patient touches forehead"),
                "surgery": ("I'm scared about needing surgery...",
                            "Patient voice trembles slightly"),
                "default": ("What do you mean?",
                            "Patient looks puzzled")
            }
        }
        scenario_responses = responses.get(self.scenario_type, responses["chest_pain"])
        # Match keywords (plain substring match against the lowercased question)
        for keyword, (answer, cue) in scenario_responses.items():
            if keyword != "default" and keyword in question:
                self.revealed_info.add(keyword)
                return {
                    "patient_response": answer,
                    "physical_cues": cue
                }
        # Default response
        return {
            "patient_response": scenario_responses.get("default", ("I'm not sure.", ""))[0],
            "physical_cues": scenario_responses.get("default", ("", ""))[1]
        }

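    def _get_unrevealed_info(self) -> List[str]:
        """Return information not yet revealed to student."""
        hidden = set(self.scenario.get("hidden", []))
        # revealed_info stores matched keywords (e.g. "pain"), while hidden stores
        # full sentences, so this difference only removes exact string matches.
        return list(hidden - self.revealed_info)
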
    def _generate_feedback(self) -> Dict:
        """Generate educational feedback."""
        feedback = {
            "communication_tips": [],
            "missed_questions": [],
            "strengths": []
        }
        # Check for missed critical questions
        critical_questions = {
            "chest_pain": ["pain", "arm", "family"],
            "headache": ["head", "light", "neck"],
            "abdominal_pain": ["pain", "eat", "fever"]
        }
        scenario_critical = critical_questions.get(self.scenario_type, [])
        asked_keywords = set()
        for q in self.asked_questions:
            for keyword in scenario_critical:
                if keyword in q:
                    asked_keywords.add(keyword)
        missed = set(scenario_critical) - asked_keywords
        if missed:
            # Sort so the order of tips is stable from run to run.
            feedback["missed_questions"] = [f"Consider asking about: {m}" for m in sorted(missed)]
        # Communication tips
        if len(self.conversation_history) > 5:
            feedback["communication_tips"].append("Good thoroughness in history taking.")
        else:
            feedback["communication_tips"].append("Consider asking more follow-up questions.")
        return feedback

    def get_case_summary(self) -> Dict:
        """Return complete case information for debriefing."""
        return {
            "patient_profile": {
                "name": self.patient.name,
                "age": self.patient.age,
                "gender": self.patient.gender,
                "occupation": self.patient.occupation
            },
            "chief_complaint": self.scenario["chief_complaint"],
            "history_of_present_illness": self.scenario["hpi"],
            "key_findings": self.scenario["findings"],
            "hidden_concerns": self.scenario.get("hidden", []),
            "questions_asked": len(self.asked_questions),
            "information_revealed": len(self.revealed_info),
            "conversation_transcript": self.conversation_history
        }

def main():
    """CLI interface for testing."""
    import sys
    scenario = sys.argv[1] if len(sys.argv) > 1 else "chest_pain"
    print(f"🩺 Virtual Patient Simulator - {scenario.upper()}\n")
    print("=" * 50)
    simulator = PatientSimulator(scenario=scenario)
    print(f"\nPatient: {simulator.patient.name}, {simulator.patient.age}y {simulator.patient.gender}")
    print(f"Occupation: {simulator.patient.occupation}")
    # Scenario-neutral opening; the chest-specific wording only fit one scenario.
    print("\nThe patient enters the room...")
    print("\nType your questions (or 'quit' to exit, 'summary' for debrief):\n")
    while True:
        try:
            question = input("> ").strip()
            if question.lower() == 'quit':
                break
            elif question.lower() == 'summary':
                summary = simulator.get_case_summary()
                print("\n" + "=" * 50)
                print("CASE DEBRIEF")
                print("=" * 50)
                print(json.dumps(summary, indent=2))
                continue
            if not question:
                continue
            result = simulator.ask(question)
            print(f"\n🧑 Patient: {result['patient_response']}")
            if result['physical_cues']:
                print(f" [Physical: {result['physical_cues']}]")
            if result['feedback']['missed_questions']:
                print(f"\n💡 Tip: {result['feedback']['missed_questions'][0]}")
        except (KeyboardInterrupt, EOFError):
            # EOFError must end the loop too; otherwise piped input that runs out
            # would be caught below and the prompt would repeat forever.
            break
        except Exception as e:
            print(f"Error: {e}")
    print("\nSession ended.")


if __name__ == "__main__":
    main()
Convert complex Venn diagrams with more than 4 sets to clearer Upset Plots.
---
name: upset-plot-converter
description: Convert complex Venn diagrams with more than 4 sets to clearer Upset Plots.
license: MIT
skill-author: AIPOCH
---
# Upset Plot Converter
Convert complex Venn diagrams (more than 4 sets) to clearer Upset Plots.
## When to Use
- Use this skill when the task is to convert complex Venn diagrams with more than 4 sets into clearer Upset Plots.
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
- Scope-focused workflow aligned to: Convert complex Venn diagrams with more than 4 sets to clearer Upset Plots.
- Packaged executable path(s): `scripts/main.py`.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
See `## Prerequisites` below for related details.
- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `matplotlib`: version unspecified. Declared in `requirements.txt`.
- `numpy`: version unspecified. Declared in `requirements.txt`.
## Example Usage
See `## Usage` below for related details.
```bash
cd "20260318/scientific-skills/Data Analytics/upset-plot-converter"
python -m py_compile scripts/main.py
python scripts/main.py
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
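2. The packaged script defines no `CONFIG` block and parses no CLI flags; to plot your own data, call `convert_venn_to_upset` or `upset_from_lists` from Python.
3. Run `python scripts/main.py` to render the built-in five-set example to `upset_plot.png`.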
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Usage
```python
from skills.upset_plot_converter.scripts.main import convert_venn_to_upset
# From set data
sets = {
'A': {1, 2, 3, 4, 5},
'B': {4, 5, 6, 7, 8},
'C': {3, 5, 7, 9, 10},
'D': {2, 4, 6, 8, 10},
'E': {1, 3, 5, 7, 9}
}
convert_venn_to_upset(sets, output_path="upset_plot.png")
# From list data
from skills.upset_plot_converter.scripts.main import upset_from_lists
set_names = ['Genes A', 'Genes B', 'Genes C', 'Genes D', 'Genes E']
lists = [
['gene1', 'gene2', 'gene3'],
['gene2', 'gene4', 'gene5'],
['gene3', 'gene5', 'gene6'],
['gene7', 'gene8', 'gene9'],
['gene1', 'gene10', 'gene11']
]
upset_from_lists(set_names, lists, output_path="gene_upset.png", title="Gene Intersections")
```
## Input
- **sets**: Dictionary of set names to sets/lists of elements, OR
- **set_names**: List of set names
- **lists**: List of lists (each containing elements)
- **output_path**: Path to save the output figure
- **title**: Optional title for the plot
- **min_subset_size**: Minimum subset size to display (default: 1)
- **max_intersections**: Maximum number of intersections to show (default: 30)
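How `min_subset_size` and `max_intersections` interact can be sketched as follows. This is a minimal illustration that mirrors the filter, sort, then truncate order used in `scripts/main.py`; the helper name `select_intersections` and the sample sizes are hypothetical and not part of the packaged script.

```python
from typing import Dict, FrozenSet, List, Tuple


def select_intersections(
    sizes: Dict[FrozenSet[str], int],
    min_subset_size: int = 1,
    max_intersections: int = 30,
) -> List[Tuple[FrozenSet[str], int]]:
    """Filter by size, sort largest first, then keep the top N (hypothetical helper)."""
    kept = [(names, n) for names, n in sizes.items() if n >= min_subset_size]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_intersections]


# Made-up intersection sizes, for illustration only.
example_sizes = {
    frozenset({"A"}): 1,
    frozenset({"A", "B"}): 4,
    frozenset({"B", "C"}): 2,
    frozenset({"A", "B", "C"}): 3,
}
selected = select_intersections(example_sizes, min_subset_size=2, max_intersections=2)
```

With these values the size-1 intersection is dropped by `min_subset_size`, and `max_intersections` then keeps only the two largest remaining groups.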
## Output
PNG file of the Upset Plot visualization.
## Notes
- When Venn diagrams exceed 4 sets, they become difficult to read
- Upset Plots provide a clearer alternative for visualizing set intersections
- A dot matrix beneath the bar chart marks which sets make up each intersection
- Bar heights give the number of elements that belong to exactly that combination of sets
- Intersections are sorted automatically by size, largest first, for better readability
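The grouping behind the bars can be sketched in a few lines. The example below assumes the same approach as `_compute_intersections` in `scripts/main.py` (each element is assigned to the exact combination of sets that contain it); the function name `exclusive_intersections` and the `demo` data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Set


def exclusive_intersections(sets: Dict[str, Set]) -> Dict[FrozenSet[str], Set]:
    """Group each element by the exact combination of sets that contain it."""
    groups: Dict[FrozenSet[str], Set] = defaultdict(set)
    for elem in set().union(*sets.values()):
        members = frozenset(name for name, s in sets.items() if elem in s)
        groups[members].add(elem)
    return dict(groups)


# Illustrative data only.
demo = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 5}}
groups = exclusive_intersections(demo)
```

Element `3` sits in all three sets, so it is counted only in the `A, B, C` column and not in the `A, B` column. This is why a bar can be smaller than the plain set intersection a Venn diagram region would suggest.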
## Requirements
- matplotlib
- numpy
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
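The path-traversal items in the checklist could be enforced with a small guard before any file is written. The sketch below is an assumption about how such a check might look; `is_inside_workspace` is a hypothetical helper and is not part of `scripts/main.py`.

```python
from pathlib import Path


def is_inside_workspace(candidate: str, workspace: str) -> bool:
    """Return True if `candidate` resolves to a location inside `workspace`."""
    root = Path(workspace).resolve()
    # Joining first means relative paths are interpreted against the workspace;
    # an absolute `candidate` replaces the root and is rejected below.
    target = (root / candidate).resolve()
    return target == root or root in target.parents
```

A caller would run this check on `output_path` and refuse to continue when it returns `False`.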
## Prerequisites
```text
# Python dependencies
pip install -r requirements.txt
```
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
  - Performance optimization
  - Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `upset-plot-converter` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `upset-plot-converter` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:requirements.txt
matplotlib
numpy
FILE:scripts/main.py
#!/usr/bin/env python3
"""
Upset Plot Converter
Convert complex Venn diagrams (>4 sets) to clearer Upset Plots.
"""
import matplotlib.pyplot as plt
import numpy as np
from collections import defaultdict
from typing import Dict, List, Set, Optional, Tuple


def _compute_intersections(sets: Dict[str, Set]) -> Dict[frozenset, Set]:
    """Compute all possible intersections between sets."""
    set_names = list(sets.keys())
    all_elements = set()
    for s in sets.values():
        all_elements.update(s)
    intersections = defaultdict(set)
    for elem in all_elements:
        # Find which sets contain this element
        containing_sets = frozenset(name for name, s in sets.items() if elem in s)
        if containing_sets:  # Only if element is in at least one set
            intersections[containing_sets].add(elem)
    return intersections


def _prepare_upset_data(
    intersections: Dict[frozenset, Set],
    set_names: List[str],
    min_subset_size: int = 1,
    max_intersections: int = 30
) -> Tuple[List[frozenset], List[int], np.ndarray]:
    """Prepare data for upset plot rendering."""
    # Filter by minimum subset size
    filtered = {k: v for k, v in intersections.items() if len(v) >= min_subset_size}
    # Sort by size (descending) and take top N
    sorted_items = sorted(filtered.items(), key=lambda x: len(x[1]), reverse=True)
    sorted_items = sorted_items[:max_intersections]
    if not sorted_items:
        return [], [], np.array([])
    intersection_sets = [item[0] for item in sorted_items]
    sizes = [len(item[1]) for item in sorted_items]
    # Create membership matrix (rows: intersections, columns: sets)
    n_sets = len(set_names)
    n_intersections = len(intersection_sets)
    membership = np.zeros((n_intersections, n_sets), dtype=int)
    for i, inter_set in enumerate(intersection_sets):
        for j, name in enumerate(set_names):
            if name in inter_set:
                membership[i, j] = 1
    return intersection_sets, sizes, membership


def _plot_upset(
    set_names: List[str],
    intersection_sets: List[frozenset],
    sizes: List[int],
    membership: np.ndarray,
    output_path: str,
    title: Optional[str] = None
) -> None:
    """Create the upset plot visualization."""
    n_sets = len(set_names)
    n_intersections = len(intersection_sets)
    if n_intersections == 0:
        print("No intersections to plot.")
        return
    # Figure dimensions
    bar_height = 0.6
    matrix_height = 0.4
    set_label_width = 0.15
    fig_width = max(10, n_intersections * 0.5 + 3)
    fig_height = max(6, n_sets * 0.5 + 2)
    fig = plt.figure(figsize=(fig_width, fig_height))
    gs = fig.add_gridspec(2, 2,
                          width_ratios=[set_label_width, 1 - set_label_width],
                          height_ratios=[bar_height, matrix_height],
                          wspace=0.05, hspace=0.1)
    # Bar chart (top right)
    ax_bars = fig.add_subplot(gs[0, 1])
    ax_bars.bar(range(n_intersections), sizes, color='#4CAF50', edgecolor='black', linewidth=0.5)
    ax_bars.set_xticks(range(n_intersections))
    ax_bars.set_xticklabels([])
    ax_bars.set_ylabel('Intersection Size', fontsize=10)
    ax_bars.set_xlim(-0.5, n_intersections - 0.5)
    # Add value labels on bars
    for i, size in enumerate(sizes):
        ax_bars.text(i, size + max(sizes) * 0.02, str(size),
                     ha='center', va='bottom', fontsize=8, fontweight='bold')
    # Set labels (bottom left)
    ax_set_labels = fig.add_subplot(gs[1, 0])
    ax_set_labels.set_xlim(0, 1)
    ax_set_labels.set_ylim(-0.5, n_sets - 0.5)
    for i, name in enumerate(reversed(set_names)):
        ax_set_labels.text(0.9, i, name, ha='right', va='center', fontsize=9)
    ax_set_labels.axis('off')
    # Matrix (bottom right)
    ax_matrix = fig.add_subplot(gs[1, 1])
    ax_matrix.set_xlim(-0.5, n_intersections - 0.5)
    ax_matrix.set_ylim(-0.5, n_sets - 0.5)
    # Draw grid lines
    for i in range(n_sets):
        ax_matrix.axhline(i - 0.5, color='gray', linewidth=0.5, alpha=0.3)
    for i in range(n_intersections + 1):
        ax_matrix.axvline(i - 0.5, color='gray', linewidth=0.5, alpha=0.3)
    # Draw dots and connecting lines
    colors = plt.cm.tab10(np.linspace(0, 1, n_sets))
    for col in range(n_intersections):
        active_rows = []
        for row in range(n_sets):
            if membership[col, row] == 1:
                # Draw circle; y is flipped so the first set sits at the top
                circle = plt.Circle((col, n_sets - 1 - row), 0.15,
                                    color=colors[row], ec='black', linewidth=1)
                ax_matrix.add_patch(circle)
                active_rows.append(n_sets - 1 - row)
        # Draw connecting line if multiple sets in intersection
        if len(active_rows) > 1:
            min_y, max_y = min(active_rows), max(active_rows)
            ax_matrix.plot([col, col], [min_y, max_y], 'k-', linewidth=2)
    ax_matrix.set_xticks(range(n_intersections))
    ax_matrix.set_xticklabels([])
    ax_matrix.set_yticks(range(n_sets))
    ax_matrix.set_yticklabels([])
    ax_matrix.set_aspect('equal')
    # Set total sizes on the left
    ax_totals = fig.add_subplot(gs[0, 0])
    ax_totals.set_xlim(0, 1)
    ax_totals.set_ylim(-0.5, n_sets - 0.5)
    # Hide the totals subplot (not needed for upset plot style)
    ax_totals.axis('off')
    if title:
        fig.suptitle(title, fontsize=14, fontweight='bold', y=0.98)
    fig.savefig(output_path, dpi=150, bbox_inches='tight', facecolor='white')
    plt.close(fig)
    print(f"Upset plot saved to: {output_path}")


def convert_venn_to_upset(
    sets: Dict[str, Set],
    output_path: str = "upset_plot.png",
    title: Optional[str] = None,
    min_subset_size: int = 1,
    max_intersections: int = 30
) -> None:
    """
    Convert a dictionary of sets to an Upset Plot.

    Args:
        sets: Dictionary mapping set names to sets of elements
        output_path: Path to save the output figure
        title: Optional title for the plot
        min_subset_size: Minimum subset size to display
        max_intersections: Maximum number of intersections to show
    """
    if len(sets) <= 4:
        print(f"Warning: Only {len(sets)} sets provided. Venn diagrams work well for ≤4 sets.")
        print("Upset plot will still be generated for comparison.")
    if len(sets) > 10:
        print(f"Warning: {len(sets)} sets may produce a very large plot.")
    set_names = list(sets.keys())
    # Convert all values to sets
    sets_converted = {name: set(s) for name, s in sets.items()}
    # Compute intersections
    intersections = _compute_intersections(sets_converted)
    # Prepare data
    intersection_sets, sizes, membership = _prepare_upset_data(
        intersections, set_names, min_subset_size, max_intersections
    )
    # Generate plot
    _plot_upset(set_names, intersection_sets, sizes, membership, output_path, title)


def upset_from_lists(
    set_names: List[str],
    lists: List[List],
    output_path: str = "upset_plot.png",
    title: Optional[str] = None,
    min_subset_size: int = 1,
    max_intersections: int = 30
) -> None:
    """
    Create an Upset Plot from lists of elements.

    Args:
        set_names: List of names for each set
        lists: List of lists, each containing elements of a set
        output_path: Path to save the output figure
        title: Optional title for the plot
        min_subset_size: Minimum subset size to display
        max_intersections: Maximum number of intersections to show
    """
    if len(set_names) != len(lists):
        raise ValueError("set_names and lists must have the same length")
    sets = {name: set(lst) for name, lst in zip(set_names, lists)}
    convert_venn_to_upset(
        sets, output_path, title, min_subset_size, max_intersections
    )


def print_intersection_stats(sets: Dict[str, Set]) -> None:
    """Print statistics about set intersections."""
    intersections = _compute_intersections(sets)
    print("\n=== Set Intersection Statistics ===")
    print(f"Total unique elements: {len(set().union(*sets.values()))}")
    print(f"Number of sets: {len(sets)}")
    print(f"Number of unique intersections: {len(intersections)}")
    print("\nIntersections (sorted by size):")
    sorted_items = sorted(intersections.items(), key=lambda x: len(x[1]), reverse=True)
    for inter_set, elements in sorted_items:
        if len(inter_set) == 0:
            continue
        set_names = ', '.join(sorted(inter_set))
        print(f" {set_names}: {len(elements)} elements")


# Example usage
if __name__ == "__main__":
    # Example with 5 sets
    sets = {
        'A': {1, 2, 3, 4, 5, 20, 21},
        'B': {4, 5, 6, 7, 8, 20, 22},
        'C': {3, 5, 7, 9, 10, 21, 22},
        'D': {2, 4, 6, 8, 10, 20, 23},
        'E': {1, 3, 5, 7, 9, 21, 23}
    }
    print_intersection_stats(sets)
    convert_venn_to_upset(sets, output_path="upset_plot.png", title="Example Upset Plot")
Mine unstructured clinical text from MIMIC-IV to extract diagnostic logic.
---
name: unstructured-medical-text-miner
description: Mine unstructured clinical text from MIMIC-IV to extract diagnostic logic.
license: MIT
skill-author: AIPOCH
---
# Unstructured Medical Text Miner (ID: 213)
## When to Use
- Use this skill when the task is to mine unstructured clinical text from MIMIC-IV to extract diagnostic logic.
- Use this skill for evidence insight tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
See `## Features` below for related details.
- Scope-focused workflow aligned to: Mine unstructured clinical text from MIMIC-IV to extract diagnostic logic.
- Packaged executable path(s): `scripts/__init__.py` plus 1 additional script(s).
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
```
pandas>=1.3.0
spacy>=3.4.0
scispacy>=0.5.1
radlex (for radiology terminology)
negspacy (for negation detection)
```
## Example Usage
See `## Usage` below for related details.
```bash
cd "20260318/scientific-skills/Evidence Insight/unstructured-medical-text-miner"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/__init__.py` with additional helper scripts under `scripts/`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
python scripts/main.py -h
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Overview
Mine the long-overlooked free-text data in MIMIC-IV, extracting diagnostic logic, order details, and progress notes from unstructured notes.
## Purpose
The MIMIC-IV database contains large amounts of structured data (vital signs, laboratory results, etc.), but its true clinical value is often hidden in unstructured text:
- Diagnostic reasoning chains in discharge summaries
- Subtle finding descriptions in imaging reports
- Treatment decision logic in progress notes
- Personalized medication considerations in orders
This Skill provides a complete text mining toolchain to transform raw medical text into analyzable structured insights.
## Features
### 1. Text Extraction
- **NOTEEVENTS**: Extract clinical notes from MIMIC-IV NOTE module
- **Radiology Reports**: Extract imaging diagnostic text
- **ECG Reports**: Parse ECG interpretation text
- **Discharge Summaries**: Extract complete diagnostic and treatment course
### 2. Information Extraction
- **Entity Recognition**: Diseases, symptoms, medications, procedures, anatomical sites
- **Relation Extraction**: Medication-disease treatment relationships, symptom-disease diagnostic relationships
- **Timeline Extraction**: Event occurrence times, disease progression sequence
- **Negation Detection**: Identify negated clinical findings (e.g., "no fever")
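Negation detection can be illustrated with a deliberately small rule-based sketch in the spirit of NegEx. It is not the skill's implementation (which the dependency list suggests relies on negspacy); the cue list, the `is_negated` helper, and the five-token window are all assumptions made for illustration.

```python
import re

# Small, illustrative cue list; real NegEx lexicons are much larger.
NEGATION_CUES = ("no", "denies", "without", "negative for")


def is_negated(sentence: str, finding: str, window: int = 5) -> bool:
    """Return True if a negation cue appears within `window` tokens before `finding`."""
    lowered = sentence.lower()
    idx = lowered.find(finding.lower())
    if idx == -1:
        return False
    # Keep only the last `window` word tokens that precede the finding.
    preceding = re.findall(r"[a-z]+", lowered[:idx])[-window:]
    context = " ".join(preceding)
    return any(re.search(rf"\b{re.escape(cue)}\b", context) for cue in NEGATION_CUES)
```

A fixed token window is a crude stand-in for negation scope: it does not stop at clause boundaries or handle cues that follow the finding, so it should be read as a teaching example only.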
### 3. Clinical Logic Parsing
- **Diagnostic Reasoning Chain**: Reasoning path from symptoms → examination → diagnosis
- **Treatment Decision Tree**: Clinical basis for medication selection and dosage adjustment
- **Disease Progression**: Disease progression and outcome descriptions
### 4. Structured Output
- FHIR-compatible clinical document format
- Knowledge graph-friendly triple format
- Temporal event sequences
## Usage
```python
from skills.unstructured_medical_text_miner.scripts.main import MedicalTextMiner
# Initialize miner
miner = MedicalTextMiner()
# Load MIMIC-IV note data
miner.load_notes(notes_path="path/to/noteevents.csv")
# Extract all text records for a specific patient
patient_texts = miner.get_patient_texts(subject_id=10000032)
# Execute complete information extraction
insights = miner.extract_insights(
text=patient_texts,
extract_entities=True,
extract_relations=True,
extract_timeline=True
)
```
## Input
### Data Sources
- MIMIC-IV NOTEEVENTS table (csv/parquet format)
- Discharge summary files
- Imaging report files
- Custom medical text
### Field Requirements
| Field Name | Description | Required |
|--------|------|------|
| subject_id | Patient unique identifier | Yes |
| hadm_id | Hospital admission record identifier | No |
| note_type | Note type (DS/RR/ECG, etc.) | Yes |
| note_text | Note text content | Yes |
| charttime | Record time | No |
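A pre-flight check for the required columns in the table above might look like the sketch below. It uses only the standard library; `missing_required_fields` and the inline sample CSV are hypothetical and are not taken from the packaged scripts.

```python
import csv
import io
from typing import Iterable, List

# Columns marked "Yes" in the Field Requirements table.
REQUIRED_FIELDS = ("subject_id", "note_type", "note_text")


def missing_required_fields(header: Iterable[str]) -> List[str]:
    """Return required column names that are absent from a CSV header."""
    present = {name.strip().lower() for name in header}
    return [name for name in REQUIRED_FIELDS if name not in present]


# Made-up sample input whose header lacks `note_type`.
sample_csv = io.StringIO("subject_id,hadm_id,note_text\n10000032,,Example note\n")
reader = csv.DictReader(sample_csv)
missing = missing_required_fields(reader.fieldnames or [])
```

Reporting the exact missing column names matches the error-handling guidance of stating which fields are absent before asking for more input.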
## Output
### Entity Extraction Results
```json
{
"entities": [
{
"text": "acute myocardial infarction",
"type": "DISEASE",
"start": 156,
"end": 183,
"confidence": 0.94
},
{
"text": "aspirin 81mg",
"type": "MEDICATION",
"start": 245,
"end": 257,
"attributes": {
"dose": "81mg",
"frequency": "daily"
}
}
]
}
```
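The `start` and `end` values in the sample are consistent with character offsets where `end` is exclusive, so `end - start` equals the length of `text`. That reading is an assumption drawn from the sample, and the helper `span_length_errors` below is hypothetical; it sketches a quick sanity check a consumer of this output could run.

```python
from typing import Dict, List


def span_length_errors(entities: List[Dict]) -> List[str]:
    """Return the text of every entity whose offsets disagree with its length."""
    return [
        ent["text"]
        for ent in entities
        if ent["end"] - ent["start"] != len(ent["text"])
    ]


# Entities copied from the sample output above (attributes omitted).
entities = [
    {"text": "acute myocardial infarction", "type": "DISEASE", "start": 156, "end": 183},
    {"text": "aspirin 81mg", "type": "MEDICATION", "start": 245, "end": 257},
]
errors = span_length_errors(entities)
```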
### Clinical Logic Graph
```json
{
"clinical_logic": {
"presenting_complaint": "chest pain",
"differential_diagnoses": ["ACS", "PE", "aortic dissection"],
"workup": ["ECG", "troponin", "CTA chest"],
"final_diagnosis": "STEMI",
"treatment_plan": ["PCI", "dual antiplatelet"]
}
}
```
### Temporal Events
```json
{
"timeline": [
{
"time": "2020-03-15 08:30",
"event": "admission",
"description": "presented with chest pain"
},
{
"time": "2020-03-15 09:15",
"event": "ECG",
"description": "ST elevation in V1-V4"
}
]
}
```
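Downstream code usually needs the timeline in chronological order and often the gap between events. The sketch below assumes the `time` strings always follow the `YYYY-MM-DD HH:MM` pattern seen in the sample; `sort_timeline` and `TIME_FORMAT` are hypothetical names.

```python
from datetime import datetime
from typing import Dict, List

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed from the sample timestamps


def sort_timeline(events: List[Dict]) -> List[Dict]:
    """Return events ordered by their parsed `time` field."""
    return sorted(events, key=lambda e: datetime.strptime(e["time"], TIME_FORMAT))


# Sample events from the output above, deliberately out of order.
timeline = [
    {"time": "2020-03-15 09:15", "event": "ECG", "description": "ST elevation in V1-V4"},
    {"time": "2020-03-15 08:30", "event": "admission", "description": "presented with chest pain"},
]
ordered = sort_timeline(timeline)
minutes_to_ecg = (
    datetime.strptime(ordered[1]["time"], TIME_FORMAT)
    - datetime.strptime(ordered[0]["time"], TIME_FORMAT)
).total_seconds() / 60
```

Parsing into `datetime` objects rather than comparing strings keeps the ordering correct even if the timestamp format is later changed in one place.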
## Configuration
```yaml
# config.yaml
extraction:
  entity_types: ["DISEASE", "SYMPTOM", "MEDICATION", "PROCEDURE", "ANATOMY"]
  relation_types: ["TREATS", "CAUSES", "CONTRAINDICATED_WITH"]
  enable_negation_detection: true
models:
  ner_model: "en_core_sci_lg"  # or "en_core_sci_scibert"
  relation_model: "custom_relation_extractor"
output:
  format: "json"  # json/fhir/kg
  include_raw_text: false
```
## CLI Usage
```text
# Process single file
python -m skills.unstructured_medical_text_miner.scripts.main \
  --input notes.csv \
  --output extracted.json \
  --extract all

# Process specific patient
python -m skills.unstructured_medical_text_miner.scripts.main \
  --subject-id 10000032 \
  --db-path mimic_iv.db \
  --output patient_insights.json
```
## References
1. MIMIC-IV Clinical Database: https://physionet.org/content/mimiciv/
2. scispacy: https://allenai.github.io/scispacy/
3. NegEx/negspacy for negation detection
4. FHIR Clinical Document specifications
## Author
Skill ID: 213
Category: Medical Data Mining
Complexity: Advanced
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
## Prerequisites
```text
# Python dependencies
pip install -r requirements.txt
```
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
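One way to report a failure point without exposing a full stack trace is to wrap the script call and summarise the outcome. This is a minimal sketch under stated assumptions: `run_and_report` and the keys of the returned dictionary are hypothetical and are not part of `scripts/main.py`.

```python
import subprocess
import sys


def run_and_report(cmd, timeout=60):
    """Run a command and summarise the outcome without raising.

    Hypothetical helper for illustration; the report keys are assumptions.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError as exc:
        return {"ok": False, "failure_point": "launch", "detail": str(exc)}
    except subprocess.TimeoutExpired:
        return {"ok": False, "failure_point": "timeout", "detail": f"exceeded {timeout}s"}
    if proc.returncode != 0:
        # Keep only the last stderr line so stack traces are not echoed in full
        lines = proc.stderr.strip().splitlines()
        return {
            "ok": False,
            "failure_point": f"exit code {proc.returncode}",
            "detail": lines[-1] if lines else "",
        }
    return {"ok": True, "failure_point": None, "detail": proc.stdout}


report = run_and_report([sys.executable, "scripts/main.py", "--help"])
if not report["ok"]:
    print(f"scripts/main.py failed at {report['failure_point']}: {report['detail']}")
```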
## Input Validation
This skill accepts requests that match the documented purpose of `unstructured-medical-text-miner` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `unstructured-medical-text-miner` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## References
- [references/audit-reference.md](references/audit-reference.md) - Supported scope, audit commands, and fallback boundaries
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
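The seven-part structure above can be kept consistent by rendering it from a single ordered list. The sketch below is illustrative only: `render_response` and the `(not provided)` placeholder are assumptions, not something the skill ships.

```python
SECTIONS = [
    "Objective",
    "Inputs Received",
    "Assumptions",
    "Workflow",
    "Deliverable",
    "Risks and Limits",
    "Next Checks",
]


def render_response(content: dict) -> str:
    """Render the seven sections in order; missing ones are flagged, not dropped."""
    lines = []
    for number, title in enumerate(SECTIONS, start=1):
        body = content.get(title, "").strip() or "(not provided)"
        lines.append(f"{number}. {title}: {body}")
    return "\n".join(lines)


print(render_response({
    "Objective": "Extract entities from one discharge summary",
    "Inputs Received": "--sample-text snippet",
}))
```

Flagging an empty section instead of omitting it keeps assumptions and limits visible even in a compressed response.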
FILE:config.yaml
# Unstructured Medical Text Miner Configuration
extraction:
# Entity types to extract
entity_types:
- DISEASE
- SYMPTOM
- MEDICATION
- PROCEDURE
- ANATOMY
- LAB_VALUE
# Relation types to extract
relation_types:
- TREATS
- CAUSES
- CONTRAINDICATED_WITH
- INDICATES
# Enable negation detection
enable_negation_detection: true
# Text preprocessing options
preprocessing:
lowercase: false
remove_deidentification_brackets: true # Remove markers such as [**Name**]
normalize_whitespace: true
models:
# spaCy model selection
# General-purpose options: en_core_web_sm, en_core_web_md, en_core_web_lg
# Biomedical (scispaCy) models: en_core_sci_lg, en_core_sci_scibert
ner_model: "en_core_sci_lg"
# Relation extraction model (custom)
relation_model: null
output:
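# Output format: json, fhir, kg (knowledge graph)
format: "json"
# Whether to include the raw text
include_raw_text: false
# Whether to include span positions
include_spans: true
# Confidence threshold
confidence_threshold: 0.5
# Processing rules for specific note types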
note_type_rules:
DS: # Discharge Summary
sections_to_extract:
- "Chief Complaint"
- "History of Present Illness"
- "Hospital Course"
- "Discharge Diagnosis"
- "Discharge Medications"
RR: # Radiology Report
sections_to_extract:
- "Clinical History"
- "Findings"
- "Impression"
ECG: # ECG Report
sections_to_extract:
- "Interpretation"
- "Conclusion"
# Performance settings
performance:
# Batch size
batch_size: 100
# Number of parallel worker threads
n_workers: 4
# Memory limit (MB)
memory_limit: 4096
FILE:example.py
"""Example usage of Unstructured Medical Text Miner
Demonstrates how to use the medical text miner to process MIMIC-IV data."""
from scripts.main import MedicalTextMiner
# Example 1: Processing a single text fragment
sample_discharge_summary = """
DISCHARGE SUMMARY
Patient: [**Name**] (M, 67 y/o)
Admission Date: [**Date**]
Discharge Date: [**Date**]
CHIEF COMPLAINT:
Chest pain for 3 hours
HISTORY OF PRESENT ILLNESS:
Patient presented with acute onset chest pain radiating to left arm.
No prior history of coronary artery disease. Denies dyspnea, nausea, or diaphoresis.
Vital signs on admission: BP 140/90, HR 98, RR 18, SpO2 97% on room air.
HOSPITAL COURSE:
ECG showed ST elevation in V1-V4 consistent with anterior STEMI.
Troponin I peaked at 15.6 ng/mL.
Patient underwent emergent cardiac catheterization with PCI to LAD.
Started on aspirin 81mg daily, clopidogrel 75mg daily, atorvastatin 40mg daily.
No complications post-procedure. Chest pain resolved.
DISCHARGE DIAGNOSIS:
1. Acute anterior ST-elevation myocardial infarction (STEMI), successfully treated with PCI
2. Hypertension
DISCHARGE MEDICATIONS:
1. Aspirin 81mg PO daily
2. Clopidogrel 75mg PO daily
3. Atorvastatin 40mg PO daily
4. Lisinopril 10mg PO daily
FOLLOW UP:
Cardiology clinic in 1 week
"""
def demo_single_text():
"""Demonstrate entity and relationship extraction from a single text"""
print("=" * 60)
print("DEMO 1: Single Text Extraction")
print("=" * 60)
miner = MedicalTextMiner()
# Extract complete insights
insights = miner.extract_insights(
sample_discharge_summary,
extract_entities=True,
extract_relations=True,
extract_timeline=True,
extract_logic=True
)
print(f"\nExtracted {insights['entity_count']} entities:")
for entity_type, entities in insights.get('entities_by_type', {}).items():
print(f"\n {entity_type}:")
for e in entities[:3]: # Only show first 3
neg_mark = " [NEGATED]" if e.get('negated') else ""
print(f" - {e['text']}{neg_mark}")
print(f"\nExtracted {insights.get('relation_count', 0)} relations:")
for r in insights.get('relations', [])[:3]:
print(f" - {r['subject']} --{r['predicate']}--> {r['object']}")
print("\nClinical Logic:")
logic = insights.get('clinical_logic', {})
if logic.get('presenting_complaint'):
print(f" Chief Complaint: {logic['presenting_complaint']}")
if logic.get('workup'):
print(f" Workup: {', '.join(logic['workup'])}")
return insights
def demo_patient_processing():
"""Demonstrates patient-level processing (requires MIMIC-IV data files)"""
print("\n" + "=" * 60)
print("DEMO 2: Patient Processing (requires MIMIC-IV data)")
print("=" * 60)
miner = MedicalTextMiner()
# NOTE: This requires actual MIMIC-IV data files
# Uncomment below and provide correct file path to run
# miner.load_notes("path/to/noteevents.csv", note_types=["DS", "RR"])
# patient_insights = miner.process_patient(subject_id=10000032)
#
# print(miner.generate_summary_report(patient_insights))
# miner.export_to_json(patient_insights, "patient_10000032_insights.json")
print("""
To run patient processing:
1. Download MIMIC-IV NOTEEVENTS data from PhysioNet
2. Run:
miner = MedicalTextMiner()
miner.load_notes("noteevents.csv", note_types=["DS"])
insights = miner.process_patient(subject_id=10000032)
miner.export_to_json(insights, "output.json")
""")
def demo_regex_patterns():
"""Demonstrates regular expression pattern matching"""
print("\n" + "=" * 60)
print("DEMO 3: Regex Pattern Matching")
print("=" * 60)
test_text = """
The patient was diagnosed with acute myocardial infarction.
Started on heparin drip and metformin 500mg twice daily.
No evidence of pneumonia. Troponin elevated at 5.2 ng/mL.
"""
miner = MedicalTextMiner()
entities = miner.extract_entities_regex(test_text)
print(f"\nFound {len(entities)} entities:")
for e in entities:
neg_mark = " [NEGATED]" if e.get('negated') else ""
print(f" - [{e['type']}] {e['text']}{neg_mark}")
if __name__ == "__main__":
# Run the demo
demo_single_text()
demo_regex_patterns()
demo_patient_processing()
FILE:references/audit-reference.md
# Audit Reference
## Scope
- Skill directory: `unstructured-medical-text-miner`
- Core purpose: Mine unstructured clinical text from MIMIC-IV to extract diagnostic logic.
- Use only within the documented workflow and category boundary defined in `SKILL.md`
## Supported Audit Paths
- `python -m py_compile scripts/main.py`
- `python scripts/main.py --help`
- `python scripts/main.py -h`
## Fallback Boundary
If required inputs are incomplete, the skill should still return:
- the missing required inputs
- the steps that can still be completed safely
- assumptions that need confirmation before execution
- the next checks before accepting the final deliverable
FILE:requirements.txt
negspacy
numpy
pandas
pyarrow
pyyaml
scispacy
spacy
tqdm
FILE:scripts/main.py
#!/usr/bin/env python3
"""Unstructured Medical Text Miner (Skill ID: 213)
Mines MIMIC-IV clinical text and extracts diagnostic and treatment information from unstructured notes.
Supports entity recognition, relation extraction, timeline analysis, and clinical logic extraction."""
import argparse
import json
import re
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Union, Any
import warnings
import pandas as pd
# Try importing optional dependencies
try:
import spacy
SPACY_AVAILABLE = True
except ImportError:
SPACY_AVAILABLE = False
warnings.warn("spacy not installed. NER functionality will be limited.")
try:
from negspacy.negation import Negex
NEGSPACY_AVAILABLE = True
except ImportError:
NEGSPACY_AVAILABLE = False
class MedicalTextMiner:
"""MIMIC-IV unstructured medical text miner
Main functions:
1. Load and preprocess note text such as NOTEEVENTS
2. Entity recognition (diseases, symptoms, medications, procedures, etc.)
3. Negation detection (identifying clinical findings that are negated)
4. Relation extraction (drug-disease relations, etc.)
5. Timeline extraction (sequence of disease progression)
6. Clinical logic analysis (diagnostic and treatment reasoning)"""
# Medical entity type definition
ENTITY_PATTERNS = {
"MEDICATION": [
r"\b(aspirin|warfarin|heparin|metformin|insulin|amoxicillin|"
r"azithromycin|morphine|fentanyl|propofol|norepinephrine|"
r"phenylephrine|dopamine|epinephrine)\b",
r"\b\w+\s*(?:mg|mcg|g|units?)\b" # dose mode
],
"DISEASE": [
r"\b(acute myocardial infarction|AMI|STEMI|NSTEMI|heart failure|"
r"pneumonia|sepsis|ARDS|COPD|asthma|diabetes mellitus|"
r"hypertension|atrial fibrillation|AFib|stroke|DVT|PE)\b",
r"\b\w+(?:itis|osis|emia|oma|pathy| syndrome)\b"
],
"SYMPTOM": [
r"\b(chest pain|dyspnea|shortness of breath|fatigue|nausea|"
r"vomiting|fever|chills|headache|dizziness|syncope|"
r"palpitations|cough|hemoptysis)\b"
],
"PROCEDURE": [
r"\b(PCI|CABG|ECG|EKG|echocardiogram|CT|MRI|chest X-ray|"
r"bronchoscopy|intubation|extubation|thoracentesis|"
r"paracentesis|lumbar puncture)\b"
],
"ANATOMY": [
r"\b(left ventricle|right ventricle|left atrium|right atrium|"
r"mitral valve|aortic valve|tricuspid valve|pulmonary valve|"
r"coronary artery|LAD|LCx|RCA|lung|liver|kidney|spleen)\b"
],
"LAB_VALUE": [
r"\b(troponin|creatinine|BNP|NT-proBNP|WBC|hemoglobin|platelet|"
r"INR|PTT|PT|glucose|HbA1c|lactate|pH|PCO2|PO2)\b"
]
}
# Negation cue patterns
NEGATION_CUES = [
r"\b(no|not|denies|denied|without|absent|negative for|"
r"ruled out|no evidence of|non-)\b"
]
# Time expression patterns
TIME_PATTERNS = [
r"\b(\d{1,2}/\d{1,2}/\d{2,4})\b", # MM/DD/YYYY
r"\b(\d{4}-\d{2}-\d{2})\b", # YYYY-MM-DD
r"\b(\d{1,2}:\d{2}(?::\d{2})?\s*(?:AM|PM|am|pm)?)\b", # time
r"\b(on admission|at presentation|post-operatively|"
r"pre-operatively|on day \d+|POD \#?\d+)\b"
]
def __init__(self, config: Optional[Dict] = None):
"""Initialize MedicalTextMiner
Args:
config: configuration dictionary, which can include model path, output format and other settings"""
self.config = config or {}
self.notes_df: Optional[pd.DataFrame] = None
self.nlp = None
# Initialize spaCy model (if available)
self._init_nlp()
def _init_nlp(self):
"""Initialize NLP model"""
if not SPACY_AVAILABLE:
return
model_name = self.config.get("ner_model", "en_core_web_sm")
try:
self.nlp = spacy.load(model_name)
# Add the negation detection component; importing negspacy registers the "negex" factory
if NEGSPACY_AVAILABLE:
self.nlp.add_pipe("negex", last=True)
except OSError:
warnings.warn(f"Model {model_name} not found. Running in regex-only mode.")
self.nlp = None
def load_notes(self, notes_path: Union[str, Path],
note_types: Optional[List[str]] = None) -> pd.DataFrame:
"""Load MIMIC-IV NOTEEVENTS data
Args:
notes_path: NOTEEVENTS file path (CSV or Parquet)
note_types: filter specific note types (such as ['DS', 'RR'])
Returns:
Loaded DataFrame"""
path = Path(notes_path)
if path.suffix == '.csv':
df = pd.read_csv(path)
elif path.suffix in ['.parquet', '.pq']:
df = pd.read_parquet(path)
else:
raise ValueError(f"Unsupported file format: {path.suffix}")
# Standardized column names
df.columns = df.columns.str.lower()
# Filter note types
if note_types and 'note_type' in df.columns:
df = df[df['note_type'].isin(note_types)]
# Convert time column
if 'charttime' in df.columns:
df['charttime'] = pd.to_datetime(df['charttime'], errors='coerce')
self.notes_df = df
print(f"Loaded {len(df)} notes from {path}")
return df
def get_patient_texts(self, subject_id: int,
hadm_id: Optional[int] = None) -> List[Dict]:
"""Get all text records for a specific patient
Args:
subject_id: patient ID
hadm_id: Optional hospitalization record ID
Returns:
List of text records for this patient"""
if self.notes_df is None:
raise ValueError("No notes loaded. Call load_notes() first.")
mask = self.notes_df['subject_id'] == subject_id
if hadm_id is not None:
mask = mask & (self.notes_df['hadm_id'] == hadm_id)
patient_notes = self.notes_df[mask].sort_values('charttime')
return patient_notes.to_dict('records')
def extract_entities_regex(self, text: str) -> List[Dict]:
"""Extract medical entities using regular expressions
Args:
text: input text
Returns:
Entity list"""
entities = []
for entity_type, patterns in self.ENTITY_PATTERNS.items():
for pattern in patterns:
for match in re.finditer(pattern, text, re.IGNORECASE):
entity = {
"text": match.group(),
"type": entity_type,
"start": match.start(),
"end": match.end(),
"source": "regex"
}
entities.append(entity)
# Flag negated entities
entities = self._detect_negation(text, entities)
return entities
def extract_entities_spacy(self, text: str) -> List[Dict]:
"""Extract medical entities using spaCy
Args:
text: input text
Returns:
Entity list"""
if self.nlp is None:
return []
doc = self.nlp(text)
entities = []
for ent in doc.ents:
entity = {
"text": ent.text,
"type": ent.label_,
"start": ent.start_char,
"end": ent.end_char,
"source": "spacy"
}
# Check for negation (using negspacy)
if NEGSPACY_AVAILABLE and hasattr(ent._, "negex"):
entity["negated"] = ent._.negex
entities.append(entity)
return entities
def _detect_negation(self, text: str, entities: List[Dict]) -> List[Dict]:
"""Check if an entity is negated
Args:
text: original text
entities: list of entities
Returns:
Entity list with a "negated" flag added to each entity"""
for entity in entities:
# Get the text window in front of the entity (up to 50 characters)
start_pos = max(0, entity["start"] - 50)
context = text[start_pos:entity["start"]]
# Check for negation cues
is_negated = any(
re.search(pattern, context, re.IGNORECASE)
for pattern in self.NEGATION_CUES
)
entity["negated"] = is_negated
return entities
def extract_relations(self, text: str, entities: List[Dict]) -> List[Dict]:
"""Extract relationships between entities from text
Args:
text: input text
entities: list of recognized entities
Returns:
List of relational triples"""
relations = []
# Simple pattern-based relationship extraction
# Drug-Disease Treatment Relationship
med_disease_pattern = r"(\w+)\s+(?:for|to treat|management of)\s+([\w\s]+?)(?:\.|,|;|$)"
for match in re.finditer(med_disease_pattern, text, re.IGNORECASE):
relations.append({
"subject": match.group(1),
"predicate": "TREATS",
"object": match.group(2).strip(),
"confidence": 0.7,
"source": "pattern"
})
# Symptom-disease relationship
symptom_disease_pattern = r"(\w+)\s+(?:due to|caused by|secondary to)\s+([\w\s]+?)(?:\.|,|;|$)"
for match in re.finditer(symptom_disease_pattern, text, re.IGNORECASE):
relations.append({
"subject": match.group(1),
"predicate": "CAUSED_BY",
"object": match.group(2).strip(),
"confidence": 0.6,
"source": "pattern"
})
return relations
def extract_timeline(self, text: str) -> List[Dict]:
"""Extract timing events from text
Args:
text: input text
Returns:
event timeline"""
events = []
sentences = re.split(r'[.!?]\s+', text)
event_indicators = [
"admitted", "presented", "developed", "underwent",
"started", "discontinued", "noted", "observed",
"worsening", "improvement", "resolved"
]
for sent in sentences:
# Detect time expression
time_match = None
for pattern in self.TIME_PATTERNS:
match = re.search(pattern, sent, re.IGNORECASE)
if match:
time_match = match.group()
break
# detect event verb
for indicator in event_indicators:
if re.search(rf"\b{indicator}\b", sent, re.IGNORECASE):
events.append({
"time_expression": time_match,
"event_verb": indicator,
"description": sent.strip(),
"raw_time": time_match
})
break
return events
def parse_clinical_logic(self, text: str) -> Dict:
"""Analyze the diagnosis and treatment logic chain
Args:
text: medical text
Returns:
Logical structure of diagnosis and treatment"""
logic = {
"presenting_complaint": None,
"differential_diagnoses": [],
"workup": [],
"final_diagnosis": None,
"treatment_plan": [],
"clinical_course": []
}
# Extract chief complaint
chief_complaint_patterns = [
r"chief complaint[:\s]+([^.]+)",
r"presented with[:\s]+([^.]+)",
r"\bPC[:\s]+([^.]+)"
]
for pattern in chief_complaint_patterns:
match = re.search(pattern, text, re.IGNORECASE)
if match:
logic["presenting_complaint"] = match.group(1).strip()
break
# Extract differential diagnosis
dd_pattern = r"(?:differential diagnosis|DDx)[:\s]+([^.]+(?:,[^.]+)*)"
match = re.search(dd_pattern, text, re.IGNORECASE)
if match:
logic["differential_diagnoses"] = [
d.strip() for d in match.group(1).split(",")
]
# Extract workup (tests and investigations)
workup_indicators = ["obtained", "performed", "ordered", "showed"]
for indicator in workup_indicators:
pattern = rf"(\w+(?:\s\w+)?)\s+{indicator}"
for match in re.finditer(pattern, text, re.IGNORECASE):
procedure = match.group(1)
if procedure.lower() in [
"ecg", "ekg", "ct", "mri", "x-ray", "echo",
"troponin", "bnp", "wbc"
]:
logic["workup"].append(procedure)
# Remove duplicates
logic["workup"] = list(set(logic["workup"]))
return logic
def extract_insights(self, text: str,
extract_entities: bool = True,
extract_relations: bool = True,
extract_timeline: bool = True,
extract_logic: bool = True) -> Dict:
"""Perform complete medical text insight extraction
Args:
text: input medical text
extract_entities: whether to extract entities
extract_relations: whether to extract relationships
extract_timeline: whether to extract timeline
extract_logic: whether to parse diagnosis and treatment logic
Returns:
Comprehensive insights"""
insights = {
"raw_text_length": len(text),
"extraction_timestamp": datetime.now().isoformat()
}
# Entity extraction
if extract_entities:
entities_regex = self.extract_entities_regex(text)
entities_spacy = self.extract_entities_spacy(text)
# Merge results, dropping spaCy entities that start at the same offset as a regex match
all_entities = entities_regex + [
e for e in entities_spacy
if not any(e["start"] == ex["start"] for ex in entities_regex)
]
# Group by type
entities_by_type = {}
for e in all_entities:
etype = e["type"]
if etype not in entities_by_type:
entities_by_type[etype] = []
entities_by_type[etype].append(e)
insights["entities"] = all_entities
insights["entities_by_type"] = entities_by_type
insights["entity_count"] = len(all_entities)
# Relation extraction
if extract_relations and extract_entities:
relations = self.extract_relations(text, insights.get("entities", []))
insights["relations"] = relations
insights["relation_count"] = len(relations)
# Timeline extraction
if extract_timeline:
timeline = self.extract_timeline(text)
insights["timeline"] = timeline
insights["event_count"] = len(timeline)
# Diagnosis and treatment logic analysis
if extract_logic:
logic = self.parse_clinical_logic(text)
insights["clinical_logic"] = logic
return insights
def process_patient(self, subject_id: int,
hadm_id: Optional[int] = None) -> Dict:
"""Process all text records for a single patient
Args:
subject_id: patient ID
hadm_id: Optional hospitalization record ID
Returns:
Comprehensive Patient Insights"""
notes = self.get_patient_texts(subject_id, hadm_id)
patient_insights = {
"subject_id": subject_id,
"hadm_id": hadm_id,
"note_count": len(notes),
"notes": []
}
for note in notes:
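# MIMIC note tables store the note body in `text`; `note_text` is kept as a fallback
text = note.get('text') or note.get('note_text')
if not isinstance(text, str) or not text:
continue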
note_insights = self.extract_insights(text)
note_insights.update({
"note_id": note.get('note_id'),
"note_type": note.get('note_type'),
"charttime": note.get('charttime')
})
patient_insights["notes"].append(note_insights)
# Aggregate all entities
all_entities = []
for note in patient_insights["notes"]:
all_entities.extend(note.get("entities", []))
patient_insights["aggregated_entities"] = self._aggregate_entities(all_entities)
return patient_insights
def _aggregate_entities(self, entities: List[Dict]) -> Dict:
"""Aggregate entities (deduplication and counting)"""
entity_map = {}
for e in entities:
key = (e["text"].lower(), e["type"])
if key not in entity_map:
entity_map[key] = {
"text": e["text"],
"type": e["type"],
"count": 0,
"negated_count": 0,
"instances": []
}
entity_map[key]["count"] += 1
if e.get("negated"):
entity_map[key]["negated_count"] += 1
entity_map[key]["instances"].append({
"start": e["start"],
"end": e["end"]
})
return {
"unique_count": len(entity_map),
"entities": list(entity_map.values())
}
def export_to_json(self, insights: Dict, output_path: str):
"""Export insights results to JSON file"""
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(insights, f, indent=2, ensure_ascii=False, default=str)
print(f"Exported insights to {output_path}")
def generate_summary_report(self, insights: Dict) -> str:
"""Generate text summary report"""
report = []
report.append("=" * 60)
report.append("MEDICAL TEXT MINING SUMMARY REPORT")
report.append("=" * 60)
if "subject_id" in insights:
report.append(f"\nPatient ID: {insights['subject_id']}")
if "note_count" in insights:
report.append(f"Total Notes Processed: {insights['note_count']}")
if "aggregated_entities" in insights:
agg = insights["aggregated_entities"]
report.append(f"\nUnique Entities Found: {agg['unique_count']}")
# Display grouped by type
by_type = {}
for e in agg["entities"]:
t = e["type"]
if t not in by_type:
by_type[t] = []
by_type[t].append(e)
for etype, entities in sorted(by_type.items()):
report.append(f"\n{etype} ({len(entities)} unique):")
for e in sorted(entities, key=lambda x: -x["count"])[:5]:
neg_info = f", {e['negated_count']} negated" if e['negated_count'] > 0 else ""
report.append(f" - {e['text']} ({e['count']} mentions{neg_info})")
return "\n".join(report)
def main():
"""CLI entry"""
parser = argparse.ArgumentParser(
description="MIMIC-IV Unstructured Medical Text Miner"
)
parser.add_argument(
"--input", "-i",
help="Path to NOTEEVENTS CSV/Parquet file"
)
parser.add_argument(
"--subject-id", "-s", type=int,
help="Process specific patient by subject_id"
)
parser.add_argument(
"--hadm-id", type=int,
help="Filter by hospital admission ID"
)
parser.add_argument(
"--output", "-o",
help="Output JSON file path"
)
parser.add_argument(
"--extract", choices=["entities", "relations", "timeline", "logic", "all"],
default="all",
help="What to extract from the text"
)
parser.add_argument(
"--note-types", nargs="+",
help="Filter by note types (e.g., DS RR ECG)"
)
parser.add_argument(
"--sample-text",
help="Process a single text snippet directly"
)
args = parser.parse_args()
miner = MedicalTextMiner()
# Process individual text fragments
if args.sample_text:
insights = miner.extract_insights(args.sample_text)
print(json.dumps(insights, indent=2, default=str))
return
# Requires input files
if not args.input:
parser.error("--input is required (unless using --sample-text)")
# Load data
miner.load_notes(args.input, args.note_types)
# Process specific patients or batch processing
if args.subject_id is not None:
insights = miner.process_patient(args.subject_id, args.hadm_id)
else:
# Handle all patients (simplified version)
print("Processing all patients...")
all_insights = []
for sid in miner.notes_df['subject_id'].unique()[:10]: # Limit to first 10 patients
patient_insights = miner.process_patient(sid)
all_insights.append(patient_insights)
insights = {
"patient_count": len(all_insights),
"patients": all_insights
}
# Generate summary
print("\n" + miner.generate_summary_report(insights))
# Export results
if args.output:
miner.export_to_json(insights, args.output)
else:
# Default output
default_output = "medical_text_insights.json"
miner.export_to_json(insights, default_output)
if __name__ == "__main__":
main()
FILE:scripts/__init__.py
"""Unstructured Medical Text Miner (Skill ID: 213)
Python package for mining MIMIC-IV medical text data"""
from .main import MedicalTextMiner
__version__ = "0.1.0"
__all__ = ["MedicalTextMiner"]
Generate USMLE Step 1/2 style clinical cases with patient history, physical.
---
name: usmle-case-generator
description: Generate USMLE Step 1/2 style clinical cases with patient history, physical.
license: MIT
skill-author: AIPOCH
---
# USMLE Case Generator
Generate USMLE Step 1 and Step 2 CK style clinical cases for medical education and board exam preparation.
## When to Use
- Use this skill when the task is to generate USMLE Step 1/2 style clinical cases with patient history, physical.
- Use this skill for academic writing tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
See `## Features` below for related details.
- Scope-focused workflow aligned to: Generate USMLE Step 1/2 style clinical cases with patient history, physical.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
- Python 3.8+
- No external API dependencies (template-based generation)
- Optional: LLM integration for case variation
## Example Usage
See `## Usage` below for related details.
```bash
cd "20260318/scientific-skills/Academic Writing/usmle-case-generator"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Features
- **Step 1 Cases**: Basic science concepts, pathophysiology, pharmacology
- **Step 2 Cases**: Clinical diagnosis, management, next best steps
- **Complete Vignettes**: History, physical exam, labs, imaging
- **Multiple Choice Questions**: Single best answer format
- **Answer Explanations**: Detailed rationale for learning
## Usage
```bash
# Generate a Step 1 case (pathophysiology focus)
python scripts/main.py --step 1 --topic cardiology --difficulty medium
# Generate a Step 2 case (clinical management focus)
python scripts/main.py --step 2 --topic nephrology --include-diagnosis
# Generate case with specific conditions
python scripts/main.py --step 2 --condition "diabetic ketoacidosis" --format json
```
## Parameters
| Parameter | Options | Description |
|-----------|---------|-------------|
| `--step` | 1, 2 | USMLE Step level |
| `--topic` | See references/topics.json | Medical specialty |
| `--condition` | Any condition | Specific disease/condition |
| `--difficulty` | easy, medium, hard | Case complexity |
| `--format` | text, json, markdown | Output format |
| `--include-diagnosis` | flag | Include answer key |
| `--count` | 1-10 | Number of cases to generate |
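As a rough illustration, the flags in this table could be declared with `argparse` as follows. This is a sketch, not the contents of `scripts/main.py`: making `--step` required and the defaults for `--difficulty` and `--count` are assumptions, while `text` as the default format follows the Output Formats section.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags from the Parameters table (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="USMLE case generator (sketch)")
    parser.add_argument("--step", type=int, choices=[1, 2], required=True)
    parser.add_argument("--topic", help="Medical specialty, e.g. cardiology")
    parser.add_argument("--condition", help="Specific disease or condition")
    parser.add_argument("--difficulty", choices=["easy", "medium", "hard"], default="medium")
    parser.add_argument("--format", choices=["text", "json", "markdown"], default="text")
    parser.add_argument("--include-diagnosis", action="store_true")
    # range(1, 11) restricts --count to the documented 1-10 interval
    parser.add_argument("--count", type=int, choices=range(1, 11), default=1)
    return parser


args = build_parser().parse_args(["--step", "2", "--topic", "nephrology", "--include-diagnosis"])
print(args.step, args.topic, args.format, args.include_diagnosis, args.count)
```

Using `choices` lets `argparse` reject out-of-range values before any case generation starts.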
## Topics Covered
- Cardiology
- Pulmonology
- Gastroenterology
- Nephrology
- Endocrinology
- Hematology/Oncology
- Infectious Disease
- Neurology
- Psychiatry
- Musculoskeletal
- Dermatology
- Obstetrics/Gynecology
- Pediatrics
- Surgery
## Case Structure
Each generated case includes:
1. **Patient Demographics**: Age, gender, relevant background
2. **Chief Complaint**: Presenting problem
3. **History of Present Illness**: Detailed symptom timeline
4. **Past Medical History**: Relevant comorbidities
5. **Medications**: Current drug regimen
6. **Allergies**: Drug/environmental allergies
7. **Family History**: Genetic conditions
8. **Social History**: Smoking, alcohol, occupation
9. **Physical Examination**: Vital signs, relevant findings
10. **Laboratory Studies**: CBC, CMP, specific markers
11. **Imaging/Diagnostics**: X-ray, CT, ECG, etc.
12. **Question**: USMLE-style multiple choice
13. **Answer Options**: 5 choices (A-E)
14. **Correct Answer**: With detailed explanation
15. **Educational Objectives**: Key learning points
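The fifteen components listed above map naturally onto a single record type. The dataclass below is a hypothetical schema for illustration; the field names and the `problems` check are assumptions and may not match what `references/case_templates.json` or `scripts/main.py` actually use.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClinicalCase:
    """Container for one generated vignette (hypothetical schema)."""

    demographics: str
    chief_complaint: str
    history_of_present_illness: str
    physical_examination: str
    question: str
    options: Dict[str, str]  # answer letter -> answer text
    correct_answer: str
    explanation: str
    past_medical_history: List[str] = field(default_factory=list)
    medications: List[str] = field(default_factory=list)
    allergies: List[str] = field(default_factory=list)
    family_history: str = ""
    social_history: str = ""
    laboratory_studies: Dict[str, str] = field(default_factory=dict)
    imaging: List[str] = field(default_factory=list)
    educational_objectives: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return structural problems; an empty list means the case is well-formed."""
        found = []
        if sorted(self.options) != ["A", "B", "C", "D", "E"]:
            found.append("options must be exactly A-E")
        if self.correct_answer not in self.options:
            found.append("correct_answer must be one of the options")
        return found


case = ClinicalCase(
    demographics="58-year-old man",
    chief_complaint="Chest pain",
    history_of_present_illness="Crushing substernal pain for 2 hours",
    physical_examination="Diaphoretic, BP 140/90",
    question="What is the most appropriate next step in management?",
    options={letter: f"Option {letter}" for letter in "ABCDE"},
    correct_answer="D",
    explanation="Placeholder rationale",
)
print(case.problems())  # []
```

A structural check like this only confirms the shape of a case; clinical accuracy still needs human review.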
## Output Formats
### Text Format (Default)
Plain text suitable for printing or reading.
### JSON Format
Structured data for integration with applications.
### Markdown Format
Formatted for documentation or web display.
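To make the difference between the three formats concrete, the sketch below renders the same case dictionary in each one. The function name `render_case` and the exact layouts are assumptions for illustration; the skill's real output may be laid out differently.

```python
import json


def render_case(case: dict, fmt: str = "text") -> str:
    """Render a case dict as text, JSON, or Markdown (layout is an assumption)."""
    if fmt == "json":
        return json.dumps(case, indent=2, ensure_ascii=False)
    if fmt not in ("text", "markdown"):
        raise ValueError(f"Unsupported format: {fmt}")
    lines = []
    for key, value in case.items():
        title = key.replace("_", " ").title()
        if fmt == "markdown":
            # One second-level heading per case component
            lines.append(f"## {title}\n\n{value}\n")
        else:
            lines.append(f"{title}: {value}")
    return "\n".join(lines)


case = {"chief_complaint": "Chest pain", "question": "What is the next best step?"}
print(render_case(case, "text"))
print(render_case(case, "markdown"))
```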
## Technical Difficulty
**High** - Requires medical knowledge validation and clinical accuracy.
⚠️ **Manual Review Required**: Generated cases should be reviewed by medical professionals before use in high-stakes educational settings.
## References
- `references/topics.json` - Medical specialty taxonomy
- `references/case_templates.json` - Case structure templates
- `references/usmle_patterns.md` - USMLE question patterns
- `references/conditions/` - Condition-specific case data
## Example Output
```
Case: A 58-year-old male with chest pain
A 58-year-old man presents to the emergency department with
crushing substernal chest pain radiating to his left arm,
beginning 2 hours ago at rest...
[History, physical, labs, ECG findings...]
Question: What is the most appropriate next step in management?
A. Administer aspirin and nitroglycerin
B. Order CT pulmonary angiography
C. Perform immediate synchronized cardioversion
D. Start heparin drip and call cardiology
E. Discharge with outpatient stress test
Correct Answer: D
Explanation: [Detailed rationale...]
```
## Safety & Limitations
- Cases are AI-generated and may contain inaccuracies
- Not a substitute for professional medical education
- Always verify clinical details with authoritative sources
- Intended for educational purposes only
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
## Prerequisites
```bash
# Python dependencies
pip install -r requirements.txt
```
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
  - Performance optimization
  - Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `usmle-case-generator` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `usmle-case-generator` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
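
A pre-flight check along the following lines is one way to decide whether a request is in scope before running the workflow. It is a sketch under stated assumptions: the field names come from `references/sample_input.json`, the allowed difficulty values come from the guidelines, and `validate_request` itself is a hypothetical helper that does not exist in `scripts/main.py`.

```python
# Keys seen in references/sample_input.json; all are treated as optional here.
KNOWN_FIELDS = {"specialty", "difficulty", "focus_area"}
# Difficulty labels used in the guidelines and CLI examples.
ALLOWED_DIFFICULTIES = {"Step 1", "Step 2"}


def validate_request(request: dict) -> list:
    """Return a list of problems; an empty list means the request can proceed."""
    problems = []
    for key in sorted(set(request) - KNOWN_FIELDS):
        problems.append(f"unknown field: {key}")
    for key in sorted(KNOWN_FIELDS & set(request)):
        value = request[key]
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key} must be a non-empty string")
    difficulty = request.get("difficulty")
    if (
        isinstance(difficulty, str)
        and difficulty.strip()
        and difficulty not in ALLOWED_DIFFICULTIES
    ):
        problems.append(f"unsupported difficulty: {difficulty}")
    return problems
```

An empty request is accepted because the command line also supports generating a random case with no arguments.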
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:references/case_templates.json
{
"case_structures": {
"step1_vignette": {
"name": "Step 1 - Basic Science Vignette",
"focus": "Pathophysiology, mechanisms, pharmacology",
"structure": [
"Patient Demographics (age, gender)",
"Brief Presentation (1-2 sentences)",
"Key Clinical Finding",
"Question about mechanism/pathogen"
],
"example_question_types": [
"What is the most likely mechanism?",
"Which pathogen is most likely responsible?",
"What is the pathophysiology?",
"Which enzyme is deficient?",
"What is the mode of inheritance?"
]
},
"step2_vignette": {
"name": "Step 2 - Clinical Vignette",
"focus": "Diagnosis, management, next steps",
"structure": [
"Patient Demographics",
"Chief Complaint",
"History of Present Illness",
"Past Medical History",
"Medications",
"Allergies",
"Family History",
"Social History",
"Physical Examination",
"Laboratory/Diagnostic Studies",
"Question"
],
"example_question_types": [
"What is the most likely diagnosis?",
"What is the most appropriate next step?",
"What is the best initial test?",
"What is the most accurate test?",
"What is the most appropriate treatment?"
]
}
},
"vital_signs_template": {
"temperature": {
"normal": "98.6°F (37.0°C)",
"low_grade_fever": "100-101°F",
"moderate_fever": "101-103°F",
"high_fever": "103-105°F",
"hypothermia": "<95°F"
},
"heart_rate": {
"bradycardia": "<60 bpm",
"normal": "60-100 bpm",
"mild_tachycardia": "100-120 bpm",
"moderate_tachycardia": "120-140 bpm",
"severe_tachycardia": ">140 bpm"
},
"blood_pressure": {
"hypotension": "<90/60 mmHg",
"low_normal": "90-100/60-70 mmHg",
"normal": "120/80 mmHg",
"elevated": "120-139/80-89 mmHg",
"stage1_htn": "140-159/90-99 mmHg",
"stage2_htn": "≥160/≥100 mmHg",
"hypertensive_emergency": ">180/120 mmHg"
},
"respiratory_rate": {
"bradypnea": "<12/min",
"normal": "12-20/min",
"tachypnea": ">20/min",
"severe": ">30/min"
},
"oxygen_saturation": {
"normal": "95-100%",
"mild_hypoxemia": "90-94%",
"moderate_hypoxemia": "85-89%",
"severe_hypoxemia": "<85%"
}
},
"history_components": {
"hpi_elements": [
"Location",
"Quality",
"Severity",
"Duration",
"Timing",
"Context",
"Modifying factors",
"Associated symptoms"
],
"ros_categories": [
"Constitutional",
"HEENT",
"Cardiovascular",
"Respiratory",
"Gastrointestinal",
"Genitourinary",
"Musculoskeletal",
"Integumentary",
"Neurological",
"Psychiatric",
"Endocrine",
"Hematologic/Lymphatic",
"Allergic/Immunologic"
],
"social_history_elements": [
"Tobacco use",
"Alcohol use",
"Drug use",
"Sexual history",
"Occupation",
"Living situation",
"Exercise",
"Diet",
"Travel history"
]
},
"physical_exam_template": {
"general": ["Appearance", "Level of consciousness", "Nutritional status"],
"vital_signs": ["Temperature", "Heart rate", "Blood pressure", "Respiratory rate", "O2 saturation"],
"skin": ["Color", "Turgor", "Lesions", "Rashes"],
"head": ["Normocephalic", "Trauma", "Tenderness"],
"eyes": ["Pupils", "Extraocular movements", "Fundoscopic exam"],
"ears": ["Canals", "Tympanic membranes"],
"nose": ["Septum", "Mucosa"],
"throat": ["Oropharynx", "Dentition"],
"neck": ["JVD", "Thyroid", "Lymph nodes", "Carotid bruits"],
"cardiovascular": ["Rate", "Rhythm", "Murmurs", "S1/S2", "Edema"],
"pulmonary": ["Inspection", "Palpation", "Percussion", "Auscultation"],
"abdominal": ["Inspection", "Auscultation", "Percussion", "Palpation"],
"musculoskeletal": ["Range of motion", "Strength", "Tenderness", "Deformity"],
"neurological": ["Mental status", "Cranial nerves", "Motor", "Sensory", "Reflexes", "Gait"]
},
"lab_reference_ranges": {
"cbc": {
"wbc": {"normal": "4,500-11,000/μL", "low": "<4,500", "high": ">11,000"},
"hemoglobin": {"male": "13.5-17.5 g/dL", "female": "12.0-16.0 g/dL"},
"hematocrit": {"male": "38.8-50.0%", "female": "34.9-44.5%"},
"platelets": {"normal": "150,000-400,000/μL", "low": "<150,000", "high": ">400,000"}
},
"cmp": {
"sodium": {"normal": "135-145 mEq/L"},
"potassium": {"normal": "3.5-5.0 mEq/L"},
"chloride": {"normal": "96-106 mEq/L"},
"co2": {"normal": "23-29 mEq/L"},
"bun": {"normal": "7-20 mg/dL"},
"creatinine": {"normal": "0.6-1.2 mg/dL"},
"glucose": {"normal": "70-100 mg/dL (fasting)"},
"calcium": {"normal": "8.5-10.5 mg/dL"},
"protein": {"normal": "6.0-8.3 g/dL"},
"albumin": {"normal": "3.5-5.0 g/dL"},
"bilirubin": {"normal": "0.1-1.2 mg/dL"},
"ast": {"normal": "10-40 U/L"},
"alt": {"normal": "7-56 U/L"},
"alkaline_phosphatase": {"normal": "44-147 U/L"}
},
"coagulation": {
"pt": {"normal": "11-13.5 seconds", "inr": "0.9-1.1"},
"ptt": {"normal": "25-35 seconds"}
},
"abg": {
"ph": {"normal": "7.35-7.45", "acidemia": "<7.35", "alkalemia": ">7.45"},
"pco2": {"normal": "35-45 mmHg"},
"po2": {"normal": "80-100 mmHg"},
"hco3": {"normal": "22-26 mEq/L"},
"base_excess": {"normal": "-2 to +2 mEq/L"}
},
"cardiac": {
"troponin_i": {"normal": "<0.04 ng/mL"},
"troponin_t": {"normal": "<0.01 ng/mL"},
"bnp": {"normal": "<100 pg/mL"},
"nt_probnp": {"normal": "<300 pg/mL"}
},
"urinalysis": {
"ph": {"normal": "4.6-8.0"},
"specific_gravity": {"normal": "1.005-1.030"},
"protein": {"normal": "negative"},
"glucose": {"normal": "negative"},
"ketones": {"normal": "negative"},
"blood": {"normal": "negative"},
"wbc": {"normal": "0-5 per HPF"},
"rbc": {"normal": "0-2 per HPF"}
}
},
"diagnostic_studies": {
"imaging": {
"chest_xray": ["PA and lateral views", "Specific findings by pathology"],
"ct_scan": ["With/without contrast", "Indications", "Findings"],
"mri": ["With/without contrast", "Indications", "Findings"],
"ultrasound": ["Type of scan", "Indications", "Findings"],
"echocardiogram": ["Transthoracic vs TEE", "Measurements", "Findings"],
"angiography": ["Type", "Indications", "Findings"]
},
"procedures": {
"ekg_ecg": ["Rate", "Rhythm", "Axis", "Intervals", "Morphology", "Findings"],
"lumbar_puncture": ["Opening pressure", "CSF analysis", "Cell count", "Glucose", "Protein", "Culture"],
"biopsy": ["Type", "Findings", "Pathology"],
"bronchoscopy": ["Indications", "Findings", "BAL results"],
"colonoscopy": ["Indications", "Findings", "Biopsy results"],
"egd": ["Indications", "Findings", "Biopsy results"]
}
}
}
FILE:references/guidelines.md
# USMLE Case Generator Guidelines
## Purpose
This skill generates realistic USMLE Step 1 and Step 2 style clinical case scenarios for medical education and exam preparation.
## Supported Medical Specialties
### Core Specialties
1. **Cardiology** - Heart failure, MI, arrhythmias
2. **Neurology** - Stroke, seizure, headache
3. **Gastroenterology** - Pancreatitis, liver disease, IBD
4. **Pulmonology** - COPD, asthma, pneumonia
5. **Nephrology** - AKI, CKD, electrolyte disorders
6. **Endocrinology** - DKA, thyroid disorders, adrenal disease
7. **Hematology** - Anemia, coagulopathies
8. **Infectious Disease** - Pneumonia, sepsis, endocarditis
9. **Rheumatology** - RA, SLE, vasculitis
10. **Emergency Medicine** - Anaphylaxis, trauma, toxicology
## Case Structure
### 1. Patient Vignette
- Chief complaint
- History of present illness
- Past medical history
- Medications
- Physical examination findings
- Vital signs
- Diagnostic studies
### 2. Multiple-Choice Question
- Single best answer format
- 5 options (A-E)
- Clinical scenario-based
- Tests high-yield concepts
### 3. Explanation
- Detailed rationale for correct answer
- Why distractors are incorrect
- Key learning objectives
- High-yield takeaways
## Difficulty Levels
### Step 1 (Basic Science)
- Focus on pathophysiology
- Mechanism-based questions
- Laboratory interpretation
- Basic pharmacology
### Step 2 (Clinical Knowledge)
- Focus on diagnosis and management
- Next best step questions
- Treatment algorithms
- Prognosis and complications
## Usage Guidelines
### Generating Cases
```python
from scripts.main import generate_case
# Random case
case = generate_case()
# Specific specialty
case = generate_case(specialty="Cardiology")
# Specific difficulty
case = generate_case(difficulty="Step 2")
# Combined
case = generate_case(specialty="Neurology", difficulty="Step 2")
```
### Command Line
```bash
# Random case
python scripts/main.py
# Specific specialty
python scripts/main.py --specialty Cardiology
# Specific difficulty
python scripts/main.py --difficulty "Step 2"
# Save to file
python scripts/main.py --specialty Neurology --output case.json
```
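
A saved case can then be read back for review. The sketch below assumes the file written by `--output` has the same shape as `references/sample_output.json` (a top-level `case` object with `metadata` and `question`); that shape is an assumption, and `summarize_case` is a hypothetical helper, not part of `scripts/main.py`.

```python
import json


def summarize_case(payload: dict) -> str:
    """Render specialty, question stem, options, and answer as plain text.

    Assumes the structure shown in references/sample_output.json.
    """
    case = payload["case"]
    question = case["question"]
    lines = [
        f'{case["metadata"]["specialty"]} / {case["metadata"]["difficulty"]}',
        question["stem"],
        *question["options"],
        f'Answer: {question["correct_answer"]}',
    ]
    return "\n".join(lines)


# Inline sample trimmed from references/sample_output.json.
SAMPLE = """
{"case": {"metadata": {"specialty": "Cardiology", "difficulty": "Step 2"},
          "question": {"stem": "What is the most appropriate initial medication to administer?",
                       "options": ["A. Lisinopril", "B. Metoprolol", "C. Furosemide",
                                   "D. Digoxin", "E. Amlodipine"],
                       "correct_answer": "C"}}}
"""

if __name__ == "__main__":
    print(summarize_case(json.loads(SAMPLE)))
```

To use it on a real file, replace `json.loads(SAMPLE)` with `json.load(open("case.json", encoding="utf-8"))`.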
## Case Template Format
Each case template includes:
- Template identifier
- Chief complaint
- History template with placeholders
- Physical examination findings
- Vital signs template
- Diagnostic studies template
- Question stem
- Options array (5 items)
- Correct answer letter
- Detailed explanation
- Learning objectives array
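
The template fields listed above could be modeled as a small dataclass with a consistency check. This is an illustrative sketch only: the class name, attribute names, and the `problems` method are assumptions, and the physical examination, vital signs, and diagnostic study fields are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CaseTemplate:
    """Hypothetical container for one case template (subset of the listed fields)."""

    template_id: str
    chief_complaint: str
    history_template: str  # may contain placeholders such as {age} and {gender}
    question_stem: str
    options: List[str]
    correct_answer: str
    explanation: str
    learning_objectives: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return consistency issues; an empty list means the template looks valid."""
        issues = []
        if len(self.options) != 5:
            issues.append(f"expected 5 options, got {len(self.options)}")
        # Options are assumed to start with their letter, e.g. "C. Furosemide".
        letters = [opt[:1] for opt in self.options]
        if self.correct_answer not in letters:
            issues.append(f"correct answer {self.correct_answer!r} matches no option")
        return issues
```

Checking the option count and the answer letter at load time catches the two mistakes most likely to produce an unanswerable question.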
## Educational Use Only
Cases are generated for educational purposes and should not replace clinical judgment or actual medical advice.
FILE:references/requirements.txt
# Requirements for USMLE Case Generator
## Python Version
- Python 3.8 or higher
## Standard Library Dependencies
- json (built-in)
- random (built-in)
- argparse (built-in)
- datetime (built-in)
- typing (built-in)
- dataclasses (built-in)
- enum (built-in)
- unittest (built-in) - for testing
## No External Dependencies Required
This skill uses only Python standard library modules for maximum portability.
## Installation
No installation required. Simply run:
```bash
python scripts/main.py
```
## Testing
```bash
cd scripts
python test_main.py
```
FILE:references/sample_input.json
{
"specialty": "Cardiology",
"difficulty": "Step 2",
"focus_area": "heart failure"
}
FILE:references/sample_output.json
{
"status": "success",
"case": {
"metadata": {
"specialty": "Cardiology",
"difficulty": "Step 2",
"topic": "Heart Failure",
"generated_at": "2026-02-05T10:30:00Z"
},
"vignette": {
"chief_complaint": "Progressive shortness of breath and bilateral leg swelling",
"history_present_illness": "68-year-old male with a history of hypertension and coronary artery disease presents with 3 days of worsening dyspnea on exertion. Patient reports orthopnea and paroxysmal nocturnal dyspnea. No prior similar episodes.",
"past_medical_history": "Hypertension, cardiology history as relevant",
"medications": "As clinically appropriate for case",
"physical_examination": "JVP elevated at 8 cm, bilateral basilar crackles, S3 gallop, 2+ pitting edema bilaterally",
"vitals": "BP 150/90, HR 92, RR 22, SpO2 92% on room air",
"diagnostic_studies": "Chest X-ray: cardiomegaly, pulmonary edema. BNP: 1250 pg/mL. Echo: EF 35%"
},
"question": {
"stem": "What is the most appropriate initial medication to administer?",
"options": [
"A. Lisinopril",
"B. Metoprolol",
"C. Furosemide",
"D. Digoxin",
"E. Amlodipine"
],
"correct_answer": "C",
"explanation": "Furosemide is the correct initial therapy for acute decompensated heart failure with volume overload. It provides rapid diuresis, reducing preload and relieving pulmonary congestion. While ACE inhibitors and beta-blockers are important for chronic management, diuresis is the priority in acute decompensation.",
"learning_objectives": [
"Acute heart failure management",
"Diuretic therapy",
"Volume overload"
]
}
}
}
FILE:references/topics.json
{
"medical_specialties": {
"cardiology": {
"display_name": "Cardiology",
"subtopics": [
"Acute Coronary Syndromes",
"Heart Failure",
"Arrhythmias",
"Valvular Heart Disease",
"Hypertension",
"Pericardial Disease",
"Congenital Heart Disease",
"Vascular Disease"
],
"step1_focus": ["Electrophysiology", "Hemodynamics", "Pharmacology", "Pathophysiology"],
"step2_focus": ["Diagnosis", "Management", "Acute Care", "Chronic Management"]
},
"pulmonology": {
"display_name": "Pulmonology",
"subtopics": [
"Obstructive Disease",
"Restrictive Disease",
"Infections",
"Vascular Disease",
"Sleep Disorders",
"Pleural Disease",
"Neoplasms"
],
"step1_focus": ["Ventilation/Perfusion", "Gas Exchange", "Respiratory Physiology"],
"step2_focus": ["Pulmonary Function Tests", "Imaging", "Treatment"]
},
"gastroenterology": {
"display_name": "Gastroenterology",
"subtopics": [
"Esophageal Disorders",
"Gastric Disorders",
"Small Intestine",
"Colorectal Disorders",
"Hepatobiliary",
"Pancreas",
"GI Emergencies"
],
"step1_focus": ["Digestive Physiology", "Enzymes", "Absorption"],
"step2_focus": ["Endoscopy", "Imaging", "Surgical vs Medical Management"]
},
"nephrology": {
"display_name": "Nephrology",
"subtopics": [
"Acute Kidney Injury",
"Chronic Kidney Disease",
"Glomerular Disease",
"Tubulointerstitial Disease",
"Electrolyte Disorders",
"Acid-Base Disorders",
"Genetic Kidney Disease"
],
"step1_focus": ["Renal Physiology", "Fluid/Electrolyte Balance", "Acid-Base"],
"step2_focus": ["Dialysis", "Transplant", "Nephrotic/Nephritic Syndromes"]
},
"endocrinology": {
"display_name": "Endocrinology",
"subtopics": [
"Diabetes Mellitus",
"Thyroid Disorders",
"Adrenal Disorders",
"Pituitary Disorders",
"Parathyroid Disorders",
"Gonadal Disorders",
"Bone Metabolism"
],
"step1_focus": ["Hormone Pathways", "Feedback Loops", "Receptors"],
"step2_focus": ["Replacement Therapy", "Suppression Therapy", "Monitoring"]
},
"hematology_oncology": {
"display_name": "Hematology/Oncology",
"subtopics": [
"Anemias",
"Coagulation Disorders",
"Leukemias",
"Lymphomas",
"Plasma Cell Disorders",
"Solid Tumors",
"Transfusion Medicine"
],
"step1_focus": ["Coagulation Cascade", "Cell Biology", "Oncogenes"],
"step2_focus": ["Staging", "Treatment Protocols", "Complications"]
},
"infectious_disease": {
"display_name": "Infectious Disease",
"subtopics": [
"Bacterial Infections",
"Viral Infections",
"Fungal Infections",
"Parasitic Infections",
"HIV/AIDS",
"Sepsis",
"Fever of Unknown Origin"
],
"step1_focus": ["Microbiology", "Antibiotic Mechanisms", "Immunology"],
"step2_focus": ["Empiric Therapy", "Prophylaxis", "Infection Control"]
},
"neurology": {
"display_name": "Neurology",
"subtopics": [
"Cerebrovascular Disease",
"Seizure Disorders",
"Neurodegenerative Disease",
"Demyelinating Disease",
"Neuromuscular Disease",
"Headache",
"Movement Disorders"
],
"step1_focus": ["Neuroanatomy", "Neurophysiology", "Neurotransmitters"],
"step2_focus": ["Imaging", "Lumbar Puncture", "Emergency Management"]
},
"psychiatry": {
"display_name": "Psychiatry",
"subtopics": [
"Mood Disorders",
"Anxiety Disorders",
"Psychotic Disorders",
"Personality Disorders",
"Substance Use Disorders",
"Eating Disorders",
"Neurocognitive Disorders"
],
"step1_focus": ["Neurotransmitters", "Pharmacology", "Development"],
"step2_focus": ["Diagnosis Criteria", "Psychotherapy", "Medication Management"]
},
"dermatology": {
"display_name": "Dermatology",
"subtopics": [
"Inflammatory Conditions",
"Infections",
"Neoplasms",
"Autoimmune Disease",
"Drug Reactions",
"Genetic Disorders"
],
"step1_focus": ["Skin Histology", "Immunology", "Pathology"],
"step2_focus": ["Visual Diagnosis", "Biopsy", "Treatment"]
},
"obstetrics_gynecology": {
"display_name": "Obstetrics & Gynecology",
"subtopics": [
"Prenatal Care",
"Labor and Delivery",
"Postpartum",
"Gynecologic Oncology",
"Reproductive Endocrinology",
"Family Planning"
],
"step1_focus": ["Embryology", "Hormonal Cycles", "Physiology"],
"step2_focus": ["Complications", "Screening", "Surgical Management"]
},
"pediatrics": {
"display_name": "Pediatrics",
"subtopics": [
"Newborn Care",
"Growth and Development",
"Genetic Disorders",
"Infectious Disease",
"Behavioral Issues",
"Adolescent Medicine"
],
"step1_focus": ["Developmental Milestones", "Congenital Anomalies", "Immunology"],
"step2_focus": ["Vaccination", "Screening", "Age-Appropriate Care"]
},
"surgery": {
"display_name": "Surgery",
"subtopics": [
"General Surgery",
"Trauma",
"Surgical Complications",
"Preoperative Evaluation",
"Postoperative Care",
"Surgical Emergencies"
],
"step1_focus": ["Anatomy", "Wound Healing", "Immunology"],
"step2_focus": ["Indications", "Contraindications", "Complications"]
},
"musculoskeletal": {
"display_name": "Musculoskeletal",
"subtopics": [
"Fractures",
"Arthritis",
"Muscle Disorders",
"Bone Disorders",
"Soft Tissue Injury",
"Rheumatologic Disease"
],
"step1_focus": ["Anatomy", "Biomechanics", "Connective Tissue"],
"step2_focus": ["Imaging", "Physical Exam", "Rehabilitation"]
},
"emergency_medicine": {
"display_name": "Emergency Medicine",
"subtopics": [
"Cardiac Emergencies",
"Respiratory Emergencies",
"Trauma",
"Toxicology",
"Environmental",
"Psychiatric Emergencies"
],
"step1_focus": ["Pathophysiology", "Pharmacology"],
"step2_focus": ["Triage", "Stabilization", "Procedures"]
}
},
"case_complexity": {
"easy": {
"description": "Classic presentation, single system involvement",
"vital_signs": "1-2 abnormal",
"diagnostics": "1-2 key findings",
"distractors": "Minimal"
},
"medium": {
"description": "Typical presentation with complicating factors",
"vital_signs": "2-3 abnormal",
"diagnostics": "2-3 key findings",
"distractors": "Moderate"
},
"hard": {
"description": "Atypical presentation, multisystem, confounders",
"vital_signs": "Multiple abnormalities",
"diagnostics": "Multiple findings requiring synthesis",
"distractors": "Challenging"
}
}
}
FILE:references/usmle_patterns.md
# USMLE Question Patterns and Strategies
## Step 1 Question Patterns
### Mechanism-Based Questions
**Format**: "What is the most likely mechanism/pathophysiology?"
**Key Features**:
- Focus on underlying disease processes
- Require understanding of cellular/molecular mechanisms
- Often test enzyme deficiencies or receptor abnormalities
**Example**:
> A patient with hemolytic anemia and dark urine after eating fava beans...
> Question: What enzyme deficiency is responsible?
> Answer: Glucose-6-phosphate dehydrogenase (G6PD) deficiency
### Pharmacology Questions
**Format**: "What is the mechanism of action?" or "What is the most likely adverse effect?"
**Key Features**:
- Drug mechanisms at molecular level
- Side effect profiles
- Drug interactions
- Antidote knowledge
**Example**:
> A patient on chronic therapy for tuberculosis develops orange discoloration of bodily fluids...
> Question: Which medication causes this effect?
> Answer: Rifampin
### Microbiology Questions
**Format**: "What is the most likely causative organism?"
**Key Features**:
- Organism identification by:
  - Gram stain characteristics
  - Culture requirements
  - Toxin production
  - Disease associations
  - Environmental exposures
**Example**:
> A patient with lung abscess following dental aspiration...
> Question: What organism is most likely?
> Answer: Anaerobes (Fusobacterium, Prevotella, Peptostreptococcus)
### Genetics Questions
**Format**: "What is the mode of inheritance?" or "What is the risk to offspring?"
**Key Features**:
- Mendelian inheritance patterns
- Chromosomal abnormalities
- Trinucleotide repeat disorders
- Imprinting disorders
### Biochemistry Questions
**Format**: "What is the deficient enzyme?" or "What metabolic pathway is affected?"
**Key Features**:
- Enzyme deficiencies
- Metabolic pathway disruptions
- Accumulation of substrates/deficiency of products
## Step 2 CK Question Patterns
### Diagnosis Questions
**Format**: "What is the most likely diagnosis?"
**Approach**:
1. Identify key clinical features
2. Recognize pattern/syndrome
3. Eliminate distractors based on:
   - Atypical features
   - Missing key findings
   - Wrong demographics
**Red Flags for Common Diagnoses**:
- STEMI: Crushing substernal pain, diaphoresis, ST elevation
- Appendicitis: Migration of pain, RLQ tenderness, anorexia
- Pancreatitis: Epigastric pain radiating to back, elevated amylase/lipase
- DKA: Hyperglycemia, anion gap metabolic acidosis, ketones
### Next Best Step Questions
**Format**: "What is the most appropriate next step in management?"
**Priority Hierarchy**:
1. **Stabilize patient first** (ABCs)
   - Airway protection
   - Breathing support
   - Circulation/IV access
2. **Life-threatening conditions** (don't delay)
   - STEMI → PCI
   - Stroke → tPA (if eligible)
   - Sepsis → Antibiotics + fluids
3. **Definitive diagnosis** (when needed)
   - Gold standard tests
   - Biopsy for tissue diagnosis
4. **Initial screening** (when appropriate)
   - Best initial test (often less invasive)
   - Risk stratification
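
The hierarchy above amounts to sorting candidate actions by tier. A minimal sketch, assuming hypothetical tier labels and a hypothetical `rank_actions` helper (neither appears in `scripts/main.py`):

```python
# Tier numbers mirror the four levels of the priority hierarchy.
PRIORITY_TIERS = {
    "stabilize": 1,
    "life_threatening": 2,
    "definitive_diagnosis": 3,
    "initial_screening": 4,
}


def rank_actions(actions):
    """Sort (description, tier) pairs so the highest-priority tier comes first.

    sorted() is stable, so actions within the same tier keep their input order.
    """
    return sorted(actions, key=lambda action: PRIORITY_TIERS[action[1]])


candidates = [
    ("Order best initial test", "initial_screening"),
    ("Secure the airway", "stabilize"),
    ("Send STEMI patient for PCI", "life_threatening"),
]
ranked = rank_actions(candidates)
```

Here `ranked` lists the airway step first, then PCI, then the screening test, which is the order a "next best step" question rewards.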
### Best Initial Test Questions
**Format**: "What is the best initial test?" or "What is the best initial step in diagnosis?"
**Characteristics of Best Initial Test**:
- High sensitivity
- Non-invasive or minimally invasive
- Cost-effective
- Guides further workup
**Examples**:
- Suspected PE with normal vitals → D-dimer
- Suspected DVT → Compression ultrasound
- Suspected MI → ECG + troponins
- Suspected pneumonia → Chest X-ray
### Most Accurate Test Questions
**Format**: "What is the most accurate test?" or "What is the gold standard?"
**Characteristics**:
- Highest sensitivity AND specificity
- May be invasive
- Definitive diagnosis
**Examples**:
- PE → CTPA (CT pulmonary angiography)
- Coronary artery disease → Cardiac catheterization
- GERD → 24-hour pH monitoring
- H. pylori → Endoscopy with biopsy
### Treatment Questions
**Format**: "What is the most appropriate treatment?" or "What is the drug of choice?"
**Key Considerations**:
1. First-line therapy (usually correct answer)
2. Contraindications in vignette
3. Patient factors (pregnancy, renal function, allergies)
4. Severity of condition
**Common First-Line Treatments**:
- HTN (general): Thiazide diuretic
- Type 2 DM: Metformin
- Community-acquired pneumonia: Macrolide or doxycycline
- H. pylori: Triple therapy (PPI + clarithromycin + amoxicillin)
## Answer Choice Patterns
### The "Next Step" Traps
**Wrong answers often include**:
- Tests that are too invasive for initial screening
- Treatments before diagnosis is confirmed
- Watchful waiting for serious conditions
- Secondary prevention when primary treatment needed
### The "Most Likely" Distractors
**Common wrong answers**:
- Rare presentations of common diseases
- Common presentations of rare diseases (when classic presentation fits)
- Conditions with similar symptoms but wrong demographics
- Partial matches missing key features
### The "EXCEPT" Questions
**Strategy**:
- All options but one are true/correct
- The remaining option is false/incorrect, and it is the answer
- Read carefully - "EXCEPT" is easy to miss
- Often tests contraindications or adverse effects
## Clinical Reasoning Strategies
### 1. Identify the Diagnosis First
Before answering, be confident in the diagnosis:
- What is the classic triad/presentation?
- Do all features fit?
- What doesn't fit (rule out)?
### 2. Know Your Vital Signs
Abnormal vitals guide urgency:
- **Hypotension**: Think shock, sepsis, hemorrhage
- **Fever + hypotension**: Sepsis until proven otherwise
- **Tachycardia + hypoxia**: PE, pneumothorax, severe pneumonia
- **Hypertensive emergency**: >180/120 with end-organ damage
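
These cues can be expressed as a simple rule check. The sketch below is illustrative, not clinical guidance: `vital_sign_flags` is a hypothetical helper, and the numeric cutoffs (systolic <90 mmHg, temperature ≥100°F, heart rate >100 bpm, SpO2 <95%) are assumptions borrowed from the ranges in `references/case_templates.json`.

```python
def vital_sign_flags(sbp, dbp, temp_f, hr, spo2):
    """Return the study cues triggered by a set of vital signs.

    Cutoffs are assumptions taken from the vital signs template, not from main.py.
    """
    hypotension = sbp < 90
    fever = temp_f >= 100.0
    flags = []
    if hypotension and fever:
        flags.append("fever + hypotension: sepsis until proven otherwise")
    elif hypotension:
        flags.append("hypotension: think shock, sepsis, hemorrhage")
    if hr > 100 and spo2 < 95:
        flags.append("tachycardia + hypoxia: PE, pneumothorax, severe pneumonia")
    if sbp > 180 or dbp > 120:
        flags.append("BP >180/120: hypertensive emergency if end-organ damage is present")
    return flags
```

For example, the "shock" vitals used in `scripts/main.py` (BP 85/50, 97.2°F, HR 120, SpO2 92%) trigger both the hypotension cue and the tachycardia-plus-hypoxia cue.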
### 3. Laboratory Pattern Recognition
Common patterns to recognize:
| Pattern | Possible Causes |
|---------|-----------------|
| High BUN/Cr ratio (>20:1) | Prerenal azotemia |
| Low BUN/Cr ratio (<15:1) | ATN, low protein intake |
| Anion gap metabolic acidosis | DKA, lactic acidosis, renal failure, toxins |
| Respiratory alkalosis | Anxiety, PE, hypoxemia, early salicylate toxicity |
| Elevated amylase/lipase | Pancreatitis |
| Elevated troponin | MI, myocarditis, PE, renal failure |
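
Two of these patterns reduce to simple arithmetic. The sketch below computes the BUN/Cr ratio using the cutoffs from the table and the serum anion gap using the standard formula Na − (Cl + HCO3), which the table does not spell out. The function names are hypothetical, and the "indeterminate" band between 15 and 20 is an assumption for ratios the table does not classify.

```python
def bun_cr_ratio(bun_mg_dl: float, creatinine_mg_dl: float) -> float:
    """BUN-to-creatinine ratio, both values in mg/dL."""
    return bun_mg_dl / creatinine_mg_dl


def interpret_bun_cr(ratio: float) -> str:
    """Map a ratio onto the table's two patterns; 15-20 is left unclassified."""
    if ratio > 20:
        return "suggests prerenal azotemia"
    if ratio < 15:
        return "suggests ATN or low protein intake"
    return "indeterminate"


def anion_gap(na: float, cl: float, hco3: float) -> float:
    """Serum anion gap in mEq/L: Na - (Cl + HCO3)."""
    return na - (cl + hco3)
```

With BUN 60 mg/dL and creatinine 2.0 mg/dL the ratio is 30, which falls in the prerenal range. With Na 140, Cl 100, and HCO3 18 mEq/L the gap is 22 mEq/L, above the commonly quoted normal range of roughly 8-12 mEq/L.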
### 4. Imaging Findings
Key patterns:
- **Air bronchograms**: Consolidation (pneumonia, pulmonary edema)
- **Kerley B lines**: CHF
- **Ground glass opacities**: PCP, ARDS, viral pneumonia
- **Westermark sign**: Pulmonary embolism (oligemia)
- **Hampton's hump**: Pulmonary infarction (wedge-shaped)
## Special Populations
### Pediatric Considerations
- Different vital sign normal ranges
- Developmental milestones
- Vaccination schedules
- Common pediatric conditions (RSV, croup, intussusception)
### Geriatric Considerations
- Atypical disease presentations
- Polypharmacy interactions
- Falls and functional assessment
- Delirium vs dementia
### Pregnancy Considerations
- Physiologic changes (anemia, elevated WBC)
- Contraindicated medications
- Pregnancy-specific conditions (preeclampsia, gestational diabetes)
- Radiation exposure concerns
## Time-Sensitive Emergencies
### The "Golden Windows"
- **STEMI**: Door-to-balloon <90 min, door-to-needle <30 min
- **Stroke**: tPA within 4.5 hours of symptom onset
- **Sepsis**: Antibiotics within 1 hour
- **Meningitis**: Antibiotics within 30 minutes
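
For drilling these windows, the list can be kept as a lookup table in minutes. This is a study-aid sketch: the key names and the `within_window` helper are hypothetical, and treating the limit itself as still inside the window is a simplifying assumption.

```python
# Time windows from the list above, converted to minutes.
GOLDEN_WINDOWS_MIN = {
    "stemi_door_to_balloon": 90,
    "stemi_door_to_needle": 30,
    "stroke_tpa": 270,  # 4.5 hours from symptom onset
    "sepsis_antibiotics": 60,
    "meningitis_antibiotics": 30,
}


def within_window(event: str, elapsed_min: float) -> bool:
    """Return True if elapsed_min is still inside the window for the event."""
    return elapsed_min <= GOLDEN_WINDOWS_MIN[event]
```

For example, `within_window("stroke_tpa", 240)` is true (4 hours after onset), while `within_window("sepsis_antibiotics", 75)` is false.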
### Prioritization Rules
When multiple actions seem reasonable:
1. ABCs always come first
2. Don't let the patient die waiting on a test (stabilize first)
3. Treat life-threatening before diagnostic
4. Definitive therapy over supportive care
## Common USMLE Associations
### Classic Triads
- **Beck's Triad** (cardiac tamponade): Hypotension, JVD, muffled heart sounds
- **Charcot's Triad** (cholangitis): Fever, RUQ pain, jaundice
- **Reynolds' Pentad**: Charcot's + shock + confusion
- **Virchow's Triad** (thrombosis): Stasis, hypercoagulability, endothelial injury
### Eponymous Signs
- **McBurney's point**: Appendicitis
- **Murphy's sign**: Cholecystitis
- **Rovsing's sign**: Appendicitis
- **Psoas sign**: Appendicitis or diverticulitis
- **Obturator sign**: Appendicitis or pelvic abscess
- **Kernig's/Brudzinski's**: Meningitis
### Drug Side Effect Patterns
- **ACE inhibitors**: Cough, hyperkalemia, angioedema
- **Statins**: Myopathy, hepatotoxicity
- **Amiodarone**: Pulmonary fibrosis, thyroid dysfunction, blue-gray skin
- **Lithium**: Tremor, diabetes insipidus, hypothyroidism
- **Corticosteroids**: Cushing syndrome, osteoporosis, hyperglycemia, immunosuppression
## Question Analysis Checklist
Before selecting an answer:
1. ✓ What is the diagnosis?
2. ✓ Is the patient stable or unstable?
3. ✓ Is this an emergency?
4. ✓ What is being asked (diagnosis, next step, best test, treatment)?
5. ✓ Are there contraindications in the vignette?
6. ✓ Does the answer make sense with the patient's presentation?
7. ✓ Have I ruled out the distractors?
## Practice Tips
1. **Read the last line first** - Know what is being asked
2. **Generate answer before looking** - Prevents bias from options
3. **Don't add information** - Work only with what's given
4. **Watch for distractors** - Common conditions that "sort of" fit
5. **Know normal values** - Essential for interpretation
6. **Trust your instincts** - First answer is often correct
## Score Interpretation Guide
### For Self-Assessment
| Performance | Percentage | Action |
|-------------|------------|--------|
| Excellent | >85% | Ready for exam |
| Good | 70-85% | Review weak areas |
| Fair | 60-70% | Significant study needed |
| Poor | <60% | Major content review required |
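
The table can be applied to a practice score with a short lookup. In this sketch `performance_band` is a hypothetical helper, and because the table's ranges share their endpoints (70% and 85% each appear in two rows), assigning exactly 85% to "Good" and exactly 70% to "Good" is an assumption.

```python
def performance_band(percent: float) -> tuple:
    """Return (performance, action) for a percentage score from 0 to 100.

    Boundary handling is an assumption; the table's ranges overlap at the edges.
    """
    if not 0 <= percent <= 100:
        raise ValueError(f"percent out of range: {percent}")
    if percent > 85:
        return ("Excellent", "Ready for exam")
    if percent >= 70:
        return ("Good", "Review weak areas")
    if percent >= 60:
        return ("Fair", "Significant study needed")
    return ("Poor", "Major content review required")
```

A score of 34 correct out of 40 is 85%, which this mapping reports as "Good" rather than "Excellent".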
### NBME Score Correlation
- Average Step 1: ~230-232 (historical three-digit scores; Step 1 has been reported as pass/fail since January 2022)
- Average Step 2 CK: ~245-247
- Target for competitive specialties: >250
---
*Note: This is for educational purposes. Always consult current USMLE content outlines and official resources for exam preparation.*
FILE:requirements.txt
main
FILE:scripts/main.py
#!/usr/bin/env python3
"""
USMLE Case Generator
Generates Step 1/2 style clinical cases for medical education.
"""
import argparse
import json
import random
import sys
from datetime import datetime
from pathlib import Path
class USMLECaseGenerator:
    """Generator for USMLE-style clinical cases."""

    def __init__(self):
        self.case_templates = self._load_case_templates()
        self.vital_signs = self._load_vital_signs()
        self.lab_ranges = self._load_lab_ranges()
        self.demographics = self._load_demographics()

    def _load_case_templates(self):
        """Load case structure templates."""
        return {
            "step1": {
                "focus": "pathophysiology, mechanism, basic science",
                "question_type": "What is the most likely mechanism/pathogen/findings?",
                "style": "basic_science_focused"
            },
            "step2": {
                "focus": "diagnosis, management, next best step",
                "question_type": "What is the most likely diagnosis/next step/best initial test?",
                "style": "clinical_focused"
            }
        }

    def _load_vital_signs(self):
        """Load vital sign abnormalities by condition."""
        return {
            "normal": {"temp": "98.6°F", "hr": "72 bpm", "bp": "120/80 mmHg", "rr": "16/min", "o2": "98%"},
            "fever": {"temp": "101.5°F", "hr": "95 bpm", "bp": "118/75 mmHg", "rr": "18/min", "o2": "97%"},
            "shock": {"temp": "97.2°F", "hr": "120 bpm", "bp": "85/50 mmHg", "rr": "24/min", "o2": "92%"},
            "hypertensive": {"temp": "98.8°F", "hr": "78 bpm", "bp": "175/105 mmHg", "rr": "16/min", "o2": "98%"},
            "tachycardic": {"temp": "99.1°F", "hr": "115 bpm", "bp": "125/82 mmHg", "rr": "20/min", "o2": "96%"},
        }

    def _load_lab_ranges(self):
        """Load normal and abnormal lab values."""
        return {
            "normal": {
                "wbc": "7,500/μL",
                "hgb": "14.0 g/dL",
                "plt": "250,000/μL",
                "na": "140 mEq/L",
                "k": "4.0 mEq/L",
                "cl": "100 mEq/L",
                "co2": "24 mEq/L",
                "bun": "15 mg/dL",
                "cr": "1.0 mg/dL",
                "glucose": "90 mg/dL",
                "ph": "7.40"
            },
            "leukocytosis": {"wbc": "15,000/μL"},
            "anemia": {"hgb": "8.5 g/dL"},
            "thrombocytopenia": {"plt": "80,000/μL"},
            "hyponatremia": {"na": "128 mEq/L"},
            "hyperkalemia": {"k": "5.8 mEq/L"},
            "hypokalemia": {"k": "2.8 mEq/L"},
            "azotemia": {"bun": "45 mg/dL", "cr": "2.5 mg/dL"},
            "hyperglycemia": {"glucose": "320 mg/dL"},
            "metabolic_acidosis": {"ph": "7.28", "co2": "18 mEq/L"},
        }

    def _load_demographics(self):
        """Load patient demographic options."""
        return {
            "adult_male": {"age_range": (25, 75), "gender": "male"},
            "adult_female": {"age_range": (25, 75), "gender": "female"},
            "elderly_male": {"age_range": (65, 85), "gender": "male"},
            "elderly_female": {"age_range": (65, 85), "gender": "female"},
            "young_adult_male": {"age_range": (18, 35), "gender": "male"},
            "young_adult_female": {"age_range": (18, 35), "gender": "female"},
            "child": {"age_range": (5, 12), "gender": "any"},
            "adolescent": {"age_range": (13, 17), "gender": "any"},
        }

    def get_condition_database(self):
        """Get database of medical conditions for case generation."""
        return {
            "cardiology": [
                {
                    "name": "Acute Myocardial Infarction",
                    "step": 2,
                    "demographics": "elderly_male",
                    "chief_complaint": "Chest pain",
                    "hpi": "A {age}-year-old {gender} presents with crushing substernal chest pain radiating to the left arm and jaw, onset 2 hours ago at rest. Associated with diaphoresis, nausea, and shortness of breath.",
                    "vitals": "tachycardic",
                    "pe_findings": "Diaphoretic, anxious. Heart: tachycardic, regular rhythm. Lungs: clear bilaterally.",
                    "ecg": "ST-segment elevation in leads II, III, and aVF",
                    "troponin": "Elevated",
                    "labs": "normal",
                    "question": "What is the most appropriate next step in management?",
                    "options": [
                        "A. Administer aspirin and obtain cardiac enzymes in 6 hours",
                        "B. Give nitroglycerin and discharge home if pain resolves",
                        "C. Start heparin and arrange for emergent cardiac catheterization",
                        "D. Perform CT pulmonary angiography to rule out PE",
                        "E. Order stress echocardiography as outpatient"
                    ],
                    "correct": "C",
                    "explanation": "This patient presents with STEMI (ST-elevation MI) based on ST elevations in inferior leads (II, III, aVF). The standard of care is emergent reperfusion therapy - primary PCI (percutaneous coronary intervention) is preferred if available within 90 minutes. Heparin anticoagulation is initiated while awaiting catheterization.",
                    "learning_objectives": ["STEMI diagnosis", "Emergent reperfusion therapy", "Primary PCI indications"]
                },
                {
                    "name": "Atrial Fibrillation",
                    "step": 2,
                    "demographics": "elderly_male",
                    "chief_complaint": "Palpitations",
                    "hpi": "A {age}-year-old {gender} with history of hypertension presents with sudden onset irregular heartbeat and palpitations for 3 days. Denies chest pain but reports mild dyspnea on exertion.",
                    "vitals": "tachycardic",
                    "pe_findings": "Irregularly irregular pulse. Heart rate approximately 110 bpm. No murmurs.",
                    "ecg": "Irregularly irregular rhythm, no discernible P waves, narrow QRS complexes",
                    "labs": "normal",
                    "question": "What is the most appropriate next step in management?",
                    "options": [
                        "A. Immediate synchronized cardioversion",
                        "B. Rate control with beta-blocker and anticoagulation assessment",
                        "C. Start aspirin daily",
                        "D. Schedule outpatient Holter monitor",
                        "E. Initiate amiodarone for rhythm control"
                    ],
                    "correct": "B",
                    "explanation": "For hemodynamically stable atrial fibrillation (>48 hours duration), the first step is rate control (beta-blocker or calcium channel blocker) and assessment for anticoagulation using CHA2DS2-VASc score. Cardioversion without anticoagulation or TEE increases stroke risk.",
                    "learning_objectives": ["AFib management", "Rate vs rhythm control", "CHA2DS2-VASc score"]
},
{
"name": "Heart Failure Exacerbation",
"step": 2,
"demographics": "elderly_female",
"chief_complaint": "Shortness of breath",
"hpi": "A {age}-year-old {gender} with known heart failure presents with worsening dyspnea over 3 days, orthopnea, and bilateral leg swelling. Non-compliant with sodium restriction and medications.",
"vitals": "tachycardic",
"pe_findings": "Jugular venous distention to jaw at 45 degrees. Bilateral crackles in lung bases. 2+ pitting edema in bilateral lower extremities.",
"cxr": "Cardiomegaly, pulmonary edema, pleural effusions",
"bnp": "Elevated",
"labs": "normal",
"question": "What is the most appropriate initial therapy?",
"options": [
"A. Start ACE inhibitor and beta-blocker immediately",
"B. IV loop diuretics and supplemental oxygen",
"C. Emergent hemodialysis",
"D. High-dose aspirin therapy",
"E. Immediate cardiac catheterization"
],
"correct": "B",
"explanation": "This patient presents with acute decompensated heart failure with volume overload. The initial management focuses on diuresis (IV loop diuretics) to relieve congestion and supplemental oxygen for hypoxemia. ACE inhibitors and beta-blockers are important chronic therapies but not for acute decompensation.",
"learning_objectives": ["Acute heart failure management", "Volume overload treatment", "Diuretic therapy"]
}
],
"pulmonology": [
{
"name": "Community Acquired Pneumonia",
"step": 2,
"demographics": "adult_male",
"chief_complaint": "Fever and cough",
"hpi": "A {age}-year-old {gender} presents with 4 days of productive cough with rust-colored sputum, fever to 102°F, and pleuritic chest pain. Recently had URI symptoms.",
"vitals": "fever",
"pe_findings": "Decreased breath sounds and dullness to percussion in right lower lobe. Bronchial breath sounds present. Tachycardic.",
"cxr": "Right lower lobe consolidation with air bronchograms",
"labs": "leukocytosis",
"question": "What is the most likely causative organism?",
"options": [
"A. Pseudomonas aeruginosa",
"B. Staphylococcus aureus",
"C. Streptococcus pneumoniae",
"D. Mycoplasma pneumoniae",
"E. Haemophilus influenzae"
],
"correct": "C",
"explanation": "Streptococcus pneumoniae is the most common cause of community-acquired pneumonia in adults, especially with classic presentation of sudden onset, high fever, productive cough with rust-colored sputum, and lobar consolidation on CXR. CURB-65 score helps determine need for hospitalization.",
"learning_objectives": ["CAP microbiology", "S. pneumoniae presentation", "CURB-65 criteria"]
},
{
"name": "Pulmonary Embolism",
"step": 2,
"demographics": "adult_female",
"chief_complaint": "Sudden dyspnea",
"hpi": "A {age}-year-old {gender} presents with sudden onset shortness of breath and pleuritic chest pain. Recently returned from 8-hour flight 2 days ago. History of oral contraceptive use.",
"vitals": "tachycardic",
"pe_findings": "Tachypneic, using accessory muscles. Clear lung sounds. Right calf swollen and tender.",
"ecg": "Sinus tachycardia, S1Q3T3 pattern",
"d_dimer": "Elevated",
"ctpa": "Filling defect in right pulmonary artery",
"question": "What is the most appropriate next diagnostic step?",
"options": [
"A. Start empiric heparin and obtain CT pulmonary angiography",
"B. Order ventilation-perfusion scan",
"C. Perform lower extremity Doppler ultrasound",
"D. Start warfarin and discharge home",
"E. Obtain cardiac catheterization"
],
"correct": "A",
"explanation": "This patient has a high pretest probability for PE. The Wells criteria met here are clinical signs of DVT, tachycardia, and PE as the most likely diagnosis; recent long-haul travel and oral contraceptive use are additional risk factors. The diagnostic test of choice is CT pulmonary angiography (CTPA). Empiric anticoagulation should be started while awaiting imaging when suspicion is high.",
"learning_objectives": ["PE diagnosis", "Wells criteria", "CTPA as gold standard"]
},
{
"name": "Asthma Exacerbation",
"step": 1,
"demographics": "young_adult_female",
"chief_complaint": "Wheezing",
"hpi": "A {age}-year-old {gender} with asthma presents with worsening wheezing and dyspnea for 2 days after exposure to cat dander. Has required rescue inhaler every 2 hours.",
"vitals": "normal",
"pe_findings": "Audible wheezing bilaterally. Prolonged expiratory phase. Using accessory muscles.",
"pfts": "FEV1/FVC ratio 0.65, increased by 15% after bronchodilator",
"question": "What is the mechanism of bronchial hyperresponsiveness in asthma?",
"options": [
"A. IgE-mediated mast cell degranulation and inflammation",
"B. Direct bacterial infection of bronchial epithelium",
"C. Autoimmune destruction of alveolar walls",
"D. Excessive parasympathetic innervation only",
"E. Genetic defect in surfactant production"
],
"correct": "A",
"explanation": "Asthma is characterized by chronic airway inflammation mediated by type I hypersensitivity (IgE-mediated). Exposure to allergens triggers mast cell degranulation, releasing histamine, leukotrienes, and prostaglandins causing bronchoconstriction, mucus production, and airway edema.",
"learning_objectives": ["Asthma pathophysiology", "Type I hypersensitivity", "Mast cell mediators"]
}
],
"gastroenterology": [
{
"name": "Acute Appendicitis",
"step": 2,
"demographics": "young_adult_male",
"chief_complaint": "Abdominal pain",
"hpi": "A {age}-year-old {gender} presents with periumbilical pain that migrated to RLQ over 12 hours. Nausea, vomiting, and anorexia. Low-grade fever.",
"vitals": "fever",
"pe_findings": "Tenderness at McBurney's point. Positive rebound tenderness and guarding. Rovsing's sign positive.",
"wbc": "13,500 with left shift",
"ct": "Dilated appendix with wall enhancement and periappendiceal fat stranding",
"question": "What is the most appropriate next step?",
"options": [
"A. Abdominal ultrasound",
"B. Emergent laparoscopic appendectomy",
"C. IV antibiotics and observation for 24 hours",
"D. Colonoscopy",
"E. CT chest/abdomen/pelvis with contrast"
],
"correct": "B",
"explanation": "Acute appendicitis is a surgical emergency. Once diagnosed clinically or confirmed with imaging, the definitive treatment is appendectomy. Delay increases risk of perforation. CT imaging (if obtained) shows the characteristic findings described.",
"learning_objectives": ["Appendicitis diagnosis", "Migration of pain", "Surgical management"]
},
{
"name": "Acute Pancreatitis",
"step": 2,
"demographics": "adult_male",
"chief_complaint": "Epigastric pain",
"hpi": "A {age}-year-old {gender} presents with severe epigastric pain radiating to the back after heavy alcohol use and large meal. Nausea and vomiting. History of gallstones.",
"vitals": "fever",
"pe_findings": "Severe epigastric tenderness with guarding. Decreased bowel sounds. Grey Turner's sign negative.",
"amylase": "850 U/L",
"lipase": "1,200 U/L",
"ct": "Enlarged pancreas with peripancreatic fat stranding",
"question": "What is the most common cause of acute pancreatitis in the United States?",
"options": [
"A. Hypertriglyceridemia",
"B. Gallstones",
"C. Alcohol abuse",
"D. Abdominal trauma",
"E. Medications"
],
"correct": "B",
"explanation": "In the United States, gallstones are the most common cause of acute pancreatitis (40-70%), followed by alcohol abuse (25-35%). The classic presentation is epigastric pain radiating to the back with elevated amylase and lipase (>3x upper limit of normal).",
"learning_objectives": ["Pancreatitis etiology", "Gallstone pancreatitis", "Diagnostic criteria"]
},
{
"name": "GERD Pathophysiology",
"step": 1,
"demographics": "adult_female",
"chief_complaint": "Heartburn",
"hpi": "A {age}-year-old {gender} presents with burning retrosternal discomfort after meals and when lying down, occurring 3-4 times per week for 2 months. Worse after spicy foods.",
"vitals": "normal",
"pe_findings": "Normal physical examination",
"question": "What is the primary mechanism of gastroesophageal reflux disease?",
"options": [
"A. Excessive gastric acid production only",
"B. Incompetent lower esophageal sphincter and transient relaxations",
"C. H. pylori infection of the esophageal mucosa",
"D. Autoimmune destruction of esophageal smooth muscle",
"E. Excessive histamine release from gastric G cells"
],
"correct": "B",
"explanation": "GERD is primarily caused by dysfunction of the lower esophageal sphincter (LES), including transient LES relaxations and decreased basal LES tone. This allows gastric contents to reflux into the esophagus. Contributing factors include hiatal hernia, obesity, and certain foods.",
"learning_objectives": ["GERD pathophysiology", "LES dysfunction", "Transient LES relaxations"]
}
],
"nephrology": [
{
"name": "Acute Kidney Injury",
"step": 2,
"demographics": "elderly_male",
"chief_complaint": "Decreased urine output",
"hpi": "A {age}-year-old {gender} underwent elective hip surgery 2 days ago. Now with decreased urine output (300 mL in 24h) and elevated creatinine from 1.0 to 2.8 mg/dL.",
"vitals": "normal",
"pe_findings": "Dry mucous membranes. Flat neck veins. No peripheral edema.",
"labs": "azotemia",
"bun_cr_ratio": ">20:1",
"fena": "<1%",
"question": "What is the most likely cause of this patient's acute kidney injury?",
"options": [
"A. Acute tubular necrosis",
"B. Prerenal azotemia from hypovolemia",
"C. Postrenal obstruction",
"D. Rhabdomyolysis",
"E. Acute interstitial nephritis"
],
"correct": "B",
"explanation": "This patient has prerenal azotemia secondary to hypovolemia from surgery and poor intake. The BUN/Cr ratio >20:1 and FENa <1% are classic findings of prerenal azotemia. Physical exam findings of volume depletion (dry mucous membranes, flat JVP) support this diagnosis.",
"learning_objectives": ["AKI classification", "Prerenal vs intrinsic", "FENa interpretation"]
},
{
"name": "Nephrotic Syndrome",
"step": 2,
"demographics": "child",
"chief_complaint": "Swelling",
"hpi": "A {age}-year-old child presents with periorbital edema in the morning and leg swelling. Parents report foamy urine. No recent illness.",
"vitals": "normal",
"pe_findings": "Periorbital edema, 2+ pitting edema in lower extremities. Ascites present.",
"urinalysis": "4+ protein, no RBCs",
"protein_creatinine_ratio": ">3.5",
"albumin": "2.1 g/dL",
"lipids": "Elevated cholesterol and triglycerides",
"question": "What is the most likely diagnosis?",
"options": [
"A. Acute post-streptococcal glomerulonephritis",
"B. Minimal change disease",
"C. IgA nephropathy",
"D. Rapidly progressive glomerulonephritis",
"E. Lupus nephritis"
],
"correct": "B",
"explanation": "Minimal change disease is the most common cause of nephrotic syndrome in children (80-90% of cases <10 years). The classic tetrad of nephrotic syndrome is heavy proteinuria (>3.5 g/day), hypoalbuminemia, edema, and hyperlipidemia. Minimal change disease typically has normal appearing glomeruli on light microscopy.",
"learning_objectives": ["Nephrotic syndrome", "Minimal change disease", "Nephrotic vs nephritic"]
}
],
"endocrinology": [
{
"name": "Diabetic Ketoacidosis",
"step": 2,
"demographics": "young_adult_female",
"chief_complaint": "Nausea and confusion",
"hpi": "A {age}-year-old {gender} with type 1 diabetes presents with nausea, vomiting, and confusion for 24 hours. Ran out of insulin 3 days ago. Also reports polyuria and polydipsia.",
"vitals": "tachycardic",
"pe_findings": "Kussmaul respirations (deep, rapid breathing). Dry mucous membranes. Fruity breath odor. Altered mental status.",
"labs": "hyperglycemia",
"ph": "7.15",
"bicarb": "12 mEq/L",
"anion_gap": "22",
"ketones": "Positive serum and urine",
"question": "What is the most appropriate initial management?",
"options": [
"A. Subcutaneous regular insulin only",
"B. IV fluid resuscitation followed by IV insulin drip",
"C. Oral hypoglycemic agents",
"D. Sodium bicarbonate infusion",
"E. Glucagon injection"
],
"correct": "B",
"explanation": "DKA management priorities: 1) Fluid resuscitation with isotonic saline (often 1-2L in first hour), 2) IV regular insulin (0.1 U/kg bolus then 0.1 U/kg/hr drip), 3) Potassium replacement when K+ <5.3, 4) Bicarbonate only if pH <6.9. Glucose monitoring with dextrose added when glucose <250.",
"learning_objectives": ["DKA management", "Fluid resuscitation priority", "Insulin therapy protocol"]
},
{
"name": "Hyperthyroidism",
"step": 2,
"demographics": "young_adult_female",
"chief_complaint": "Weight loss and tremor",
"hpi": "A {age}-year-old {gender} presents with 15-pound weight loss over 2 months despite increased appetite, heat intolerance, palpitations, and anxiety. Also notes tremor in hands.",
"vitals": "tachycardic",
"pe_findings": "Tremor in outstretched hands. Warm, moist skin. Lid lag present. Diffuse thyroid enlargement without nodules.",
"tsh": "0.01 mIU/L",
"free_t4": "Elevated",
"tsi": "Elevated",
"question": "What is the most likely diagnosis?",
"options": [
"A. Hashimoto's thyroiditis",
"B. Graves' disease",
"C. Toxic multinodular goiter",
"D. Subacute thyroiditis",
"E. Iatrogenic hyperthyroidism"
],
"correct": "B",
"explanation": "Graves' disease is the most common cause of hyperthyroidism. The combination of hyperthyroid symptoms (weight loss, heat intolerance, palpitations) with diffuse goiter and elevated TSI (thyroid-stimulating immunoglobulins) is diagnostic. Eye findings (lid lag, exophthalmos) may also be present.",
"learning_objectives": ["Graves' disease diagnosis", "TSI antibodies", "Hyperthyroidism workup"]
}
],
"infectious_disease": [
{
"name": "Meningitis",
"step": 2,
"demographics": "young_adult_male",
"chief_complaint": "Headache and fever",
"hpi": "A {age}-year-old {gender} presents with sudden onset severe headache, fever to 103°F, and neck stiffness. Photophobia and confusion noted by roommate.",
"vitals": "fever",
"pe_findings": "Nuchal rigidity. Positive Kernig's sign. Positive Brudzinski's sign. Altered mental status.",
"wbc": "15,000 with left shift",
"lp": "CSF: Elevated WBC, low glucose, high protein, gram-positive diplococci",
"question": "What is the most appropriate empiric treatment?",
"options": [
"A. Vancomycin + Ceftriaxone + Ampicillin",
"B. Azithromycin monotherapy",
"C. Metronidazole alone",
"D. Amoxicillin only",
"E. Ciprofloxacin"
],
"correct": "A",
"explanation": "Empiric therapy for suspected bacterial meningitis in adults is vancomycin (for possible resistant pneumococcus) plus ceftriaxone (or cefotaxime), with ampicillin added for Listeria coverage in adults >50 or immunocompromised patients. Of the listed options, only A provides adequate empiric coverage. Dexamethasone should be given before or with the first antibiotic dose.",
"learning_objectives": ["Meningitis empiric therapy", "CSF analysis", "Kernig/Brudzinski signs"]
},
{
"name": "HIV/AIDS Opportunistic Infection",
"step": 2,
"demographics": "adult_male",
"chief_complaint": "Dyspnea",
"hpi": "A {age}-year-old {gender} with HIV (last CD4 unknown, not on HAART) presents with 2 weeks of progressive dyspnea, nonproductive cough, and low-grade fever.",
"vitals": "fever",
"pe_findings": "Tachypneic. Decreased oxygen saturation to 88% on room air. Clear lung sounds bilaterally.",
"cd4": "45 cells/μL",
"cxr": "Bilateral diffuse interstitial infiltrates",
"abg": "PaO2 58 mmHg, increased A-a gradient",
"question": "What is the most likely diagnosis?",
"options": [
"A. Tuberculosis",
"B. Pneumocystis jirovecii pneumonia (PCP)",
"C. Bacterial community-acquired pneumonia",
"D. Kaposi's sarcoma",
"E. CMV pneumonitis"
],
"correct": "B",
"explanation": "PCP is the most common opportunistic infection in AIDS with CD4 <200. The classic presentation is subacute dyspnea, nonproductive cough, fever, hypoxemia with normal or minimal auscultatory findings, and diffuse interstitial infiltrates on CXR. Diagnosis confirmed by silver stain of induced sputum or BAL.",
"learning_objectives": ["PCP presentation", "AIDS opportunistic infections", "CD4 thresholds"]
}
],
"neurology": [
{
"name": "Acute Ischemic Stroke",
"step": 2,
"demographics": "elderly_male",
"chief_complaint": "Weakness",
"hpi": "A {age}-year-old {gender} with atrial fibrillation presents with sudden onset right-sided weakness and difficulty speaking. Last known well 2 hours ago.",
"vitals": "hypertensive",
"pe_findings": "Right facial droop, right arm and leg weakness (3/5), expressive aphasia present. Left gaze preference.",
"nihss": "12",
"ct_head": "No hemorrhage, early ischemic changes in left MCA territory",
"question": "What is the most appropriate next step?",
"options": [
"A. Start aspirin immediately",
"B. Administer IV tPA (thrombolytics)",
"C. Anticoagulate with heparin",
"D. Perform CT angiography only and observe",
"E. Aspirin and clopidogrel dual therapy"
],
"correct": "B",
"explanation": "For acute ischemic stroke presenting within 4.5 hours of symptom onset, IV tPA (alteplase) is indicated if there are no contraindications (CT ruling out hemorrhage, BP <185/110, no recent surgery or bleeding). Mechanical thrombectomy may be considered for large vessel occlusion.",
"learning_objectives": ["Acute stroke management", "tPA indications", "Time window for thrombolysis"]
},
{
"name": "Multiple Sclerosis",
"step": 1,
"demographics": "young_adult_female",
"chief_complaint": "Vision loss",
"hpi": "A {age}-year-old {gender} presents with painful vision loss in right eye over 3 days. Had episode of leg weakness 6 months ago that resolved spontaneously. Lives in northern climate.",
"vitals": "normal",
"pe_findings": "Decreased visual acuity OD. Afferent pupillary defect (Marcus Gunn pupil) on right. Internuclear ophthalmoplegia noted on left gaze.",
"mri_brain": "Multiple periventricular white matter lesions, ovoid, perpendicular to ventricles (Dawson's fingers)",
"csf": "Oligoclonal bands present",
"question": "What is the pathophysiology of this disease?",
"options": [
"A. Autoimmune demyelination of CNS white matter",
"B. Direct viral infection of oligodendrocytes",
"C. Vasculitic occlusion of cerebral vessels",
"D. Autoantibody attack on neuromuscular junction",
"E. Progressive neuronal apoptosis"
],
"correct": "A",
"explanation": "MS is an autoimmune demyelinating disease of the CNS characterized by T-cell mediated attack on myelin basic protein. Features include dissemination in time and space (multiple episodes, multiple locations), periventricular white matter lesions (Dawson's fingers), oligoclonal bands in CSF, and association with HLA-DR2 and vitamin D deficiency.",
"learning_objectives": ["MS pathophysiology", "Demyelination", "McDonald criteria"]
}
]
}
    def generate_case(self, step=2, topic=None, condition=None, difficulty="medium", include_answer=True):
        """Generate a USMLE-style clinical case."""
        # NOTE: `difficulty` is accepted but not currently used for filtering.
        conditions_db = self.get_condition_database()
        # Filter by topic if specified
        if topic and topic.lower() in conditions_db:
            available_conditions = conditions_db[topic.lower()]
        else:
            # Use all conditions
            available_conditions = []
            for topic_conditions in conditions_db.values():
                available_conditions.extend(topic_conditions)
        # Filter by step
        available_conditions = [c for c in available_conditions if c.get("step", 2) == step]
        if not available_conditions:
            return self._generate_generic_case(step, topic)
        # Select condition; fall back to a random one if the name does not match
        if condition:
            selected = next((c for c in available_conditions if condition.lower() in c["name"].lower()), None)
            if not selected:
                selected = random.choice(available_conditions)
        else:
            selected = random.choice(available_conditions)
        return self._format_case(selected, include_answer)
    def _format_case(self, condition, include_answer=True):
        """Format a case with demographics filled in."""
        # Get demographics
        demo_key = condition.get("demographics", "adult_male")
        demo = self.demographics.get(demo_key, self.demographics["adult_male"])
        age = random.randint(demo["age_range"][0], demo["age_range"][1])
        gender = demo["gender"] if demo["gender"] != "any" else random.choice(["male", "female"])
        # Fill in HPI
        hpi = condition.get("hpi", "").format(age=age, gender=gender)
        # Build case
        case = {
            "title": f"Case: {condition['name']}",
            "metadata": {
                "condition": condition['name'],
                "step": condition.get('step', 2),
                "topic": self._get_topic_for_condition(condition['name']),
                "difficulty": condition.get('difficulty', 'medium'),
                "generated_at": datetime.now().isoformat()
            },
            "case_content": {
                "chief_complaint": condition.get('chief_complaint', 'Unknown'),
                "history_of_present_illness": hpi,
                "vital_signs": self._get_vitals(condition.get('vitals', 'normal')),
                "physical_examination": condition.get('pe_findings', 'Normal'),
                "diagnostic_studies": self._get_diagnostics(condition),
                "laboratory_results": self._get_labs(condition.get('labs', 'normal'))
            },
            "question": {
                "text": condition.get('question', 'What is the diagnosis?'),
                "options": condition.get('options', ['A', 'B', 'C', 'D', 'E'])
            }
        }
        if include_answer:
            case["answer"] = {
                "correct": condition.get('correct', 'A'),
                "explanation": condition.get('explanation', ''),
                "learning_objectives": condition.get('learning_objectives', [])
            }
        return case
    def _get_vitals(self, vitals_key):
        """Get vital signs for a case."""
        return self.vital_signs.get(vitals_key, self.vital_signs["normal"])

    def _get_labs(self, labs_key):
        """Get lab values for a case."""
        base_labs = self.lab_ranges["normal"].copy()
        if labs_key in self.lab_ranges:
            base_labs.update(self.lab_ranges[labs_key])
        return base_labs
    def _get_diagnostics(self, condition):
        """Get diagnostic study results."""
        # Map condition keys to display labels. The stroke and MS cases store
        # their imaging under 'ct_head' and 'mri_brain', so those keys are
        # included alongside the generic 'ct' and 'mri'.
        study_labels = {
            'ecg': 'ECG',
            'cxr': 'Chest X-ray',
            'ct': 'CT',
            'ct_head': 'CT Head',
            'mri': 'MRI',
            'mri_brain': 'MRI Brain',
            'lp': 'Lumbar Puncture',
            'pfts': 'Pulmonary Function Tests',
        }
        return {label: condition[key] for key, label in study_labels.items() if key in condition}
    def _get_topic_for_condition(self, condition_name):
        """Get topic category for a condition."""
        db = self.get_condition_database()
        for topic, conditions in db.items():
            if any(c['name'] == condition_name for c in conditions):
                return topic
        return "general"

    def _generate_generic_case(self, step, topic):
        """Generate a generic case when specific one not available."""
        return {
            "title": "Generic Clinical Case",
            "metadata": {
                "step": step,
                "topic": topic or "general",
                "generated_at": datetime.now().isoformat()
            },
            "case_content": {
                "chief_complaint": "Patient presents with symptoms",
                "history_of_present_illness": "Please specify a condition or topic for a detailed case.",
                "vital_signs": self.vital_signs["normal"],
                "physical_examination": "Normal",
                "diagnostic_studies": {},
                "laboratory_results": self.lab_ranges["normal"]
            },
            "question": {
                "text": "What would you like to assess?",
                "options": ["A", "B", "C", "D", "E"]
            }
        }

    def format_output(self, case, format_type="text"):
        """Format case for output."""
        if format_type == "json":
            return json.dumps(case, indent=2)
        elif format_type == "markdown":
            return self._format_markdown(case)
        else:
            return self._format_text(case)
    def _format_text(self, case):
        """Format case as plain text."""
        content = case["case_content"]
        lines = [
            "=" * 60,
            case["title"],
            "=" * 60,
            "",
            f"CHIEF COMPLAINT: {content['chief_complaint']}",
            "",
            "HISTORY OF PRESENT ILLNESS:",
            content['history_of_present_illness'],
            "",
            "VITAL SIGNS:",
            f" Temperature: {content['vital_signs']['temp']}",
            f" Heart Rate: {content['vital_signs']['hr']}",
            f" Blood Pressure: {content['vital_signs']['bp']}",
            f" Respiratory Rate: {content['vital_signs']['rr']}",
            f" O2 Saturation: {content['vital_signs']['o2']}",
            "",
            "PHYSICAL EXAMINATION:",
            content['physical_examination'],
            "",
            "DIAGNOSTIC STUDIES:"
        ]
        if content['diagnostic_studies']:
            for study, result in content['diagnostic_studies'].items():
                lines.append(f" {study}: {result}")
        else:
            lines.append(" None")
        lines.extend([
            "",
            "LABORATORY RESULTS:",
        ])
        for lab, value in content['laboratory_results'].items():
            lines.append(f" {lab.upper()}: {value}")
        lines.extend([
            "",
            "-" * 60,
            "QUESTION:",
            case["question"]["text"],
            "",
        ])
        for option in case["question"]["options"]:
            lines.append(option)
        if "answer" in case:
            lines.extend([
                "",
                "-" * 60,
                f"CORRECT ANSWER: {case['answer']['correct']}",
                "",
                "EXPLANATION:",
                case['answer']['explanation'],
                "",
                "LEARNING OBJECTIVES:",
            ])
            for obj in case['answer']['learning_objectives']:
                lines.append(f" - {obj}")
        lines.append("=" * 60)
        return "\n".join(lines)
    def _format_markdown(self, case):
        """Format case as markdown."""
        content = case["case_content"]
        lines = [
            f"# {case['title']}",
            "",
            "## Case Information",
            # Generic cases carry no 'condition' in metadata, so use .get()
            # to avoid a KeyError when they are rendered as markdown.
            f"- **Condition**: {case['metadata'].get('condition', 'N/A')}",
            f"- **USMLE Step**: {case['metadata']['step']}",
            f"- **Topic**: {case['metadata']['topic']}",
            "",
            "## Chief Complaint",
            content['chief_complaint'],
            "",
            "## History of Present Illness",
            content['history_of_present_illness'],
            "",
            "## Vital Signs",
            "| Parameter | Value |",
            "|-----------|-------|",
            f"| Temperature | {content['vital_signs']['temp']} |",
            f"| Heart Rate | {content['vital_signs']['hr']} |",
            f"| Blood Pressure | {content['vital_signs']['bp']} |",
            f"| Respiratory Rate | {content['vital_signs']['rr']} |",
            f"| O2 Saturation | {content['vital_signs']['o2']} |",
            "",
            "## Physical Examination",
            content['physical_examination'],
            "",
            "## Diagnostic Studies",
        ]
        if content['diagnostic_studies']:
            for study, result in content['diagnostic_studies'].items():
                lines.append(f"- **{study}**: {result}")
        else:
            lines.append("None")
        lines.extend([
            "",
            "## Laboratory Results",
        ])
        for lab, value in content['laboratory_results'].items():
            lines.append(f"- **{lab.upper()}**: {value}")
        lines.extend([
            "",
            "## Question",
            case["question"]["text"],
            "",
        ])
        for option in case["question"]["options"]:
            lines.append(f"{option}")
        if "answer" in case:
            lines.extend([
                "",
                "---",
                "",
                f"## Answer: **{case['answer']['correct']}**",
                "",
                "### Explanation",
                case['answer']['explanation'],
                "",
                "### Learning Objectives",
            ])
            for obj in case['answer']['learning_objectives']:
                lines.append(f"- {obj}")
        return "\n".join(lines)
def main():
    parser = argparse.ArgumentParser(description="Generate USMLE-style clinical cases")
    parser.add_argument("--step", type=int, choices=[1, 2], default=2, help="USMLE Step level")
    parser.add_argument("--topic", type=str, help="Medical topic (cardiology, pulmonology, etc.)")
    parser.add_argument("--condition", type=str, help="Specific medical condition")
    parser.add_argument("--difficulty", type=str, choices=["easy", "medium", "hard"], default="medium")
    parser.add_argument("--format", type=str, choices=["text", "json", "markdown"], default="text")
    parser.add_argument("--include-diagnosis", action="store_true", help="Include answer key")
    parser.add_argument("--count", type=int, default=1, help="Number of cases to generate")
    args = parser.parse_args()
    generator = USMLECaseGenerator()
    for i in range(args.count):
        case = generator.generate_case(
            step=args.step,
            topic=args.topic,
            condition=args.condition,
            difficulty=args.difficulty,
            include_answer=args.include_diagnosis
        )
        output = generator.format_output(case, args.format)
        print(output)
        # Print a separator between cases, but not after the last one
        if i < args.count - 1:
            print("\n" + "=" * 60 + "\n")
    return 0


if __name__ == "__main__":
    sys.exit(main())
Assess translational gaps between preclinical models and human diseases.
---
name: translational-gap-analyzer
description: Assess translational gaps between preclinical models and human diseases.
license: MIT
skill-author: AIPOCH
---
# Translational Gap Analyzer
**ID**: 209
## When to Use
- Use this skill when the task is to assess translational gaps between preclinical models and human diseases.
- Use this skill for evidence insight tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
- Scope-focused workflow aligned to: Assess translational gaps between preclinical models and human diseases.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
- Python 3.8+
- Built-in libraries: argparse, json, sys
## Example Usage
See `## Usage` below for related details.
```bash
cd "20260318/scientific-skills/Evidence Insight/translational-gap-analyzer"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Description
Assesses the "translational gap" between basic research models (such as mice, zebrafish, cell lines) and human diseases, providing early warning of clinical translation failure risks. This system helps researchers identify potential translational barriers in preclinical research and improve clinical trial success rates through multi-dimensional analysis.
## Capabilities
- Evaluates anatomical/physiological differences between models and humans
- Analyzes pathological similarity of disease models
- Identifies interspecies differences in molecular pathways
- Evaluates pharmacokinetic differences
- Provides early warning of clinical trial failure risk factors
- Provides improvement recommendations to increase translation success rates
## Usage
```text
# Full assessment report
python scripts/main.py --model <model_type> --disease <disease_name> --full
# Quick risk assessment
python scripts/main.py --model <model_type> --disease <disease_name> --quick
# Compare multiple models
python scripts/main.py --models mouse,rat,primate --disease <disease_name> --compare
# Specify focus areas
python scripts/main.py --model mouse --disease "Alzheimer's" --focus metabolism,immune
```
## Arguments
| Argument | Description | Required |
|----------|-------------|----------|
| `--model` | Model type (mouse, rat, zebrafish, cell_line, organoid, primate) | Yes (unless --models) |
| `--models` | Multi-model comparison mode, comma-separated | No |
| `--disease` | Disease name or MeSH ID | Yes |
| `--focus` | Focus areas, comma-separated (anatomy, physiology, metabolism, immune, genetics, behavior) | No |
| `--full` | Generate full assessment report (forces Markdown output) | No |
| `--quick` | Quick risk assessment mode (forces table output) | No |
| `--compare` | Multi-model comparison mode (requires `--models`) | No |
| `--output` | Output file path | No |
| `--format` | Output format (json, markdown, table); default `table` | No |
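The `Required` column implies a cross-argument rule: `--model` is needed unless `--models` is supplied. The following is a minimal sketch of one way to enforce that rule with `argparse`. The helper names `build_parser` and `validate_args` are hypothetical and are not taken from `scripts/main.py`, which performs its own checks.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser covering a subset of the documented arguments."""
    parser = argparse.ArgumentParser(description="Argument-validation sketch")
    parser.add_argument("--model", type=str)
    parser.add_argument("--models", type=str)
    parser.add_argument("--disease", type=str, required=True)
    parser.add_argument(
        "--format", choices=["json", "markdown", "table"], default="table"
    )
    return parser


def validate_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv and reject calls that supply neither --model nor --models."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.model and not args.models:
        # parser.error prints a usage message and exits with status 2
        parser.error("one of --model or --models is required")
    return args


if __name__ == "__main__":
    demo = validate_args(["--models", "mouse,rat", "--disease", "cancer"])
    print(demo.models.split(","))
```

Failing early through `parser.error` keeps the usage message and exit status consistent with other argument errors.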
## Example Output
```json
{
"model": "mouse",
"disease": "Alzheimer's Disease",
"overall_gap_score": 6.8,
"risk_level": "HIGH",
"dimensions": {
"genetics": {"score": 8.5, "concerns": ["APOE4 differences", "Different tau pathology patterns"]},
"physiology": {"score": 7.0, "concerns": ["Brain structure differences", "Lifespan differences"]},
"metabolism": {"score": 6.5, "concerns": ["Significant drug metabolism differences"]},
"immune": {"score": 5.5, "concerns": ["Microglia functional differences", "Different neuroinflammation patterns"]},
"behavior": {"score": 6.0, "concerns": ["Limitations in cognitive assessment methods"]}
},
"clinical_failure_predictors": [
"Immune-related mechanism research may not translate",
"Drug clearance rate differences may lead to inappropriate dosing"
],
"recommendations": [
"Consider using humanized mouse models",
"Add non-human primate validation experiments",
"Focus on peripheral immune and central immune interactions"
]
}
```
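As a hedged illustration of consuming this output, the sketch below parses a trimmed copy of the sample report and lists the dimensions whose score is at least 7.0, the cutoff that `generate_failure_predictors` in `scripts/main.py` uses for high-gap dimensions. The function name `high_gap_dimensions` is a hypothetical helper, not part of the script.

```python
import json
from typing import Any, Dict, List

# Trimmed copy of the example output above (concerns omitted for brevity).
SAMPLE_REPORT = """
{
  "model": "mouse",
  "disease": "Alzheimer's Disease",
  "dimensions": {
    "genetics": {"score": 8.5},
    "physiology": {"score": 7.0},
    "metabolism": {"score": 6.5},
    "immune": {"score": 5.5},
    "behavior": {"score": 6.0}
  }
}
"""


def high_gap_dimensions(report: Dict[str, Any], threshold: float = 7.0) -> List[str]:
    """Return dimension names with score >= threshold, sorted alphabetically."""
    return sorted(
        name
        for name, dim in report["dimensions"].items()
        if dim["score"] >= threshold
    )


if __name__ == "__main__":
    report = json.loads(SAMPLE_REPORT)
    print(high_gap_dimensions(report))
```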
## Model Types
### Common Models
| Model | Applicable Scenarios | Typical Gaps |
|------|----------|----------|
| mouse | Genetic manipulation, basic research | Immune, metabolism, brain structure |
| rat | Behavioral studies, cardiovascular | Cognition, drug metabolism |
| zebrafish | Development, high-throughput screening | Anatomy, physiology |
| cell_line | Molecular mechanisms | Microenvironment, systemic |
| organoid | Human-specific research | Maturity, vascularization |
| primate | Preclinical validation | Cost, ethics |
## Gap Scoring System
- **Below 4.0** (LOW): Low gap, good translation prospects
- **4.0-6.4** (MEDIUM): Moderate gap, requires additional validation
- **6.5-7.9** (HIGH): High gap, significant translation risks exist
- **8.0-10** (CRITICAL): Extremely high gap, low translation likelihood
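The overall score maps to the `risk_level` label in the report. The sketch below mirrors the cutoffs used by `determine_risk_level` in `scripts/main.py` (8.0, 6.5, and 4.0); the function name `risk_level_for` is a hypothetical stand-in used only for illustration.

```python
def risk_level_for(overall_score: float) -> str:
    """Map an overall gap score (0-10) to a risk label using the script's cutoffs."""
    if overall_score >= 8.0:
        return "CRITICAL"
    if overall_score >= 6.5:
        return "HIGH"
    if overall_score >= 4.0:
        return "MEDIUM"
    return "LOW"


if __name__ == "__main__":
    for score in (3.9, 4.0, 6.8, 8.0):
        print(score, risk_level_for(score))
```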
## Files
- `SKILL.md` - This file
- `scripts/main.py` - Main analysis script
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python script executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
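Two checklist items (path validation without `../` traversal and output restricted to the workspace) can be covered by a single check. The following is a minimal sketch under the assumption that a workspace root directory is known; `resolve_inside_workspace` is a hypothetical helper and is not implemented in `scripts/main.py`.

```python
from pathlib import Path


def resolve_inside_workspace(workspace: str, user_path: str) -> Path:
    """Resolve user_path against workspace; raise ValueError if it escapes."""
    root = Path(workspace).resolve()
    # Joining with an absolute user_path discards root, so the check below
    # also rejects absolute paths that point outside the workspace.
    candidate = (root / user_path).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"path escapes workspace: {user_path}") from None
    return candidate


if __name__ == "__main__":
    print(resolve_inside_workspace(".", "reports/out.json"))
```

Resolving both paths before comparing them means `..` segments and symlinks are normalized first, so a string-prefix check is not needed.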
## Prerequisites
```text
# Python dependencies
pip install -r requirements.txt
```
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
- Performance optimization
- Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
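The first rule (state exactly which fields are missing) can be sketched as a small pre-flight check. This is an illustration only: the `REQUIRED_FIELDS` tuple and the function names are assumptions, chosen to match the `--model` and `--disease` arguments of a single-model run.

```python
from typing import Dict, List, Optional

# Assumption: the minimum inputs for a single-model analysis.
REQUIRED_FIELDS = ("model", "disease")


def missing_inputs(request: Dict[str, Optional[str]]) -> List[str]:
    """Return the required fields that are absent or empty, in a fixed order."""
    return [field for field in REQUIRED_FIELDS if not request.get(field)]


def fallback_message(request: Dict[str, Optional[str]]) -> str:
    """Build a short message that names only the missing fields."""
    missing = missing_inputs(request)
    if not missing:
        return "All required inputs present."
    return "Missing required inputs: " + ", ".join(missing)


if __name__ == "__main__":
    print(fallback_message({"model": "mouse"}))
```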
## Input Validation
This skill accepts requests that match the documented purpose of `translational-gap-analyzer` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `translational-gap-analyzer` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## References
- [references/audit-reference.md](references/audit-reference.md) - Supported scope, audit commands, and fallback boundaries
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:references/audit-reference.md
# Audit Reference
## Scope
- Skill: `translational-gap-analyzer`
- Core purpose: Assess translational gaps between preclinical models and human diseases.
- Use only within the documented workflow and category boundary defined in `SKILL.md`
## Supported Audit Paths
- `python -m py_compile scripts/main.py`
- `python scripts/main.py --help`
## Fallback Boundary
If required inputs are incomplete, the skill should still return:
- the missing required inputs
- the steps that can still be completed safely
- assumptions that need confirmation before execution
- the next checks before accepting the final deliverable
FILE:requirements.txt
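# No third-party dependencies: scripts/main.py uses only the Python standard library.
# dataclasses and enum ship with Python 3.7+ and should not be installed from PyPI.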
FILE:scripts/main.py
#!/usr/bin/env python3
"""Translational Gap Analyzer (ID: 209)
Assessing the translational gap between basic research models and human disease"""
import argparse
import json
import sys
from dataclasses import dataclass, asdict
from typing import Any, Dict, List, Optional
from enum import Enum
class RiskLevel(Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"
    CRITICAL = "CRITICAL"
@dataclass
class DimensionScore:
    """Score and concerns for a single dimension"""
    score: float  # 0-10; the higher the value, the greater the gap
    concerns: List[str]
    details: Dict[str, Any]
@dataclass
class GapAnalysisReport:
    """Translational gap analysis report"""
    model: str
    disease: str
    overall_gap_score: float
    risk_level: str
    dimensions: Dict[str, Dict[str, Any]]  # serialized DimensionScore per dimension
    clinical_failure_predictors: List[str]
    recommendations: List[str]
# Knowledge Base: Model-Disease Gap Data
TRANSLATIONAL_KNOWLEDGE = {
"mouse": {
"anatomy": {
"brain_structure": 7.5,
"organ_size_ratio": 6.0,
"vascular_pattern": 7.0,
"immune_organs": 5.5
},
"physiology": {
"lifespan": 9.0,
"heart_rate": 8.0,
"body_temperature": 7.5,
"reproductive_cycle": 8.5
},
"metabolism": {
"drug_clearance": 8.0,
"cytochrome_p450": 7.5,
"glucose_metabolism": 6.5,
"lipid_metabolism": 6.0
},
"immune": {
"innate_immunity": 6.0,
"adaptive_immunity": 5.5,
"cytokine_profile": 7.0,
"microglia_function": 8.5
},
"genetics": {
"gene_conservation": 4.0,
"regulatory_elements": 7.0,
"chromosome_structure": 6.5,
"splice_variants": 7.5
},
"behavior": {
"cognitive_assessment": 8.0,
"social_behavior": 7.5,
"motor_function": 5.0,
"pain_perception": 7.0
}
},
"rat": {
"anatomy": {
"brain_structure": 6.5,
"organ_size_ratio": 5.5,
"vascular_pattern": 6.0,
"immune_organs": 5.0
},
"physiology": {
"lifespan": 8.5,
"heart_rate": 7.5,
"body_temperature": 7.0,
"reproductive_cycle": 7.5
},
"metabolism": {
"drug_clearance": 7.5,
"cytochrome_p450": 7.0,
"glucose_metabolism": 6.0,
"lipid_metabolism": 5.5
},
"immune": {
"innate_immunity": 5.5,
"adaptive_immunity": 5.0,
"cytokine_profile": 6.5,
"microglia_function": 7.5
},
"genetics": {
"gene_conservation": 4.5,
"regulatory_elements": 6.5,
"chromosome_structure": 6.0,
"splice_variants": 7.0
},
"behavior": {
"cognitive_assessment": 7.0,
"social_behavior": 6.5,
"motor_function": 4.5,
"pain_perception": 6.5
}
},
"zebrafish": {
"anatomy": {
"brain_structure": 8.5,
"organ_structure": 8.0,
"vascular_pattern": 7.0,
"immune_organs": 8.5
},
"physiology": {
"water_vs_air": 9.0,
"temperature_regulation": 9.0,
"reproduction": 8.5,
"regeneration": 7.0
},
"metabolism": {
"drug_metabolism": 8.5,
"energy_metabolism": 7.5,
"xenobiotic_metabolism": 8.0
},
"immune": {
"adaptive_immunity": 9.0,
"innate_immunity": 7.0,
"inflammation": 7.5
},
"genetics": {
"gene_duplication": 8.5,
"conservation": 6.0,
"regeneration_genes": 6.5
},
"behavior": {
"cognitive_assessment": 8.5,
"social_behavior": 8.0,
"learning": 8.0
}
},
"cell_line": {
"anatomy": {
"tissue_architecture": 9.5,
"cell_interactions": 9.0,
"extracellular_matrix": 8.5,
"3d_structure": 9.0
},
"physiology": {
"systemic_regulation": 10.0,
"homeostasis": 9.5,
"stress_response": 7.0
},
"metabolism": {
"culture_medium_effects": 8.5,
"oxygen_levels": 8.0,
"nutrient_availability": 7.5
},
"immune": {
"immune_interactions": 9.5,
"inflammation_context": 9.0
},
"genetics": {
"genetic_drift": 8.0,
"passage_effects": 8.5,
"mutation_accumulation": 8.0
}
},
"organoid": {
"anatomy": {
"maturation": 7.5,
"vascularization": 9.0,
"size_limitations": 8.5,
"cell_diversity": 6.5
},
"physiology": {
"systemic_integration": 9.5,
"maturity_level": 7.5,
"functionality": 7.0
},
"metabolism": {
"nutrient_diffusion": 8.0,
"waste_removal": 8.5,
"metabolic_activity": 6.5
},
"immune": {
"immune_component": 8.5,
"inflammation_modeling": 7.5
},
"genetics": {
"genetic_stability": 6.0,
"patient_variability": 5.5
}
},
"primate": {
"anatomy": {
"brain_structure": 2.5,
"organ_structure": 3.0,
"vascular_pattern": 3.5,
"immune_organs": 3.0
},
"physiology": {
"lifespan": 4.5,
"reproductive_cycle": 5.0,
"metabolic_rate": 4.0,
"body_temperature": 3.0
},
"metabolism": {
"drug_metabolism": 4.5,
"cytochrome_p450": 4.0,
"glucose_metabolism": 3.5,
"lipid_metabolism": 3.5
},
"immune": {
"innate_immunity": 3.5,
"adaptive_immunity": 3.0,
"cytokine_profile": 4.0,
"microglia_function": 3.5
},
"genetics": {
"gene_similarity": 2.5,
"regulatory_elements": 4.0,
"chromosome_structure": 3.0
},
"behavior": {
"cognitive_assessment": 4.0,
"social_behavior": 3.5,
"emotional_response": 4.5
}
}
}
# disease-specific risk factors
DISEASE_SPECIFIC_FACTORS = {
"alzheimer": {
"mouse": {
"pathology_differences": ["Lack of natural tau pathology", "Different Aβ deposition patterns", "Differences in neurodegeneration rates"],
"genetic_risks": ["APOE4 model limited", "TREM2 functional differences"],
"immune_factors": ["Microglial reactivity differs", "Differences in neuroinflammation patterns"],
"drug_targets": ["The amyloid hypothesis has failed many times", "Tau pathology is difficult to replicate"]
},
"rat": {
"pathology_differences": ["tau pathology limited", "Aβ clearance differences"],
"genetic_risks": ["There are fewer APOE models"],
"behavioral_differences": ["Limitations of cognitive assessment methods"]
}
},
"parkinson": {
"mouse": {
"pathology_differences": ["Lack of natural Lewy bodies", "Dopamine system differences"],
"genetic_risks": ["LRRK2 mutations have different effects", "Limitations of SNCA overexpression model"],
"drug_targets": ["Dopamine replacement therapy model is effective but clinically variable"]
},
"rat": {
"toxin_models": ["MPTP model is very different from sporadic PD", "6-OHDA model limitations"],
"genetic_models": ["Genetic modeling progress is slow"]
}
},
"cancer": {
"mouse": {
"immune_factors": ["Differences in the immune system affect the effectiveness of immunotherapy", "Different tumor microenvironments"],
"metabolic_factors": ["Drug metabolism varies significantly", "Dose conversion is difficult"],
"genetic_risks": ["Driver mutations are conserved but contextually diverse"]
},
"cell_line": {
"model_limitations": ["lack of microenvironment", "2D vs 3D differences", "cell line evolution"]
}
},
"diabetes": {
"mouse": {
"metabolic_factors": ["Differences in insulin resistance mechanisms", "Different fat distribution"],
"immune_factors": ["Type 1 diabetes autoimmune differences", "Limitations of NOD mice"],
"drug_targets": ["GLP-1 analogues have been relatively successful", "SGLT2 inhibitor differences"]
},
"rat": {
"metabolic_factors": ["Differences between Zucker diabetic obese rats and human T2D"]
}
},
"autoimmune": {
"mouse": {
"immune_factors": ["Differences in immune tolerance mechanisms", "MHC systems are fundamentally different"],
"drug_targets": ["TNF inhibitors are relatively successful", "B cell targeting differences"],
"pathology_differences": ["Differences between disease-induced models and spontaneous diseases"]
}
},
"cardiovascular": {
"mouse": {
"anatomical_differences": ["Different distribution of coronary arteries", "Differences in myocardial regeneration ability"],
"physiological_differences": ["Extremely fast heart rate", "Differences in EKG Interpretation"],
"drug_targets": ["Statins are effective", "Differences in anticoagulant drug metabolism"]
},
"rat": {
"advantages": ["Cardiovascular surgery models are more mature"],
"differences": ["Still significantly different from clinical"]
}
}
}
def normalize_disease_name(disease: str) -> str:
    """Normalize a disease name to its knowledge-base key"""
    disease_lower = disease.lower().replace("'", "").replace(" ", "_")
    disease_mapping = {
        "alzheimer": "alzheimer",
        "alzheimers": "alzheimer",
        "alzheimers_disease": "alzheimer",
        "parkinson": "parkinson",
        "parkinsons": "parkinson",
        "parkinsons_disease": "parkinson",
        "cancer": "cancer",
        "tumor": "cancer",
        "diabetes": "diabetes",
        "autoimmune": "autoimmune",
        "cardiovascular": "cardiovascular",
        "heart_disease": "cardiovascular"
    }
    # Unknown names fall through unchanged (lowercased, underscores for spaces)
    return disease_mapping.get(disease_lower, disease_lower)
def get_disease_specific_risks(model: str, disease: str) -> List[str]:
    """Get disease-specific risks"""
    normalized_disease = normalize_disease_name(disease)
    risks = []
    if normalized_disease in DISEASE_SPECIFIC_FACTORS:
        disease_data = DISEASE_SPECIFIC_FACTORS[normalized_disease]
        if model in disease_data:
            model_data = disease_data[model]
            for category, items in model_data.items():
                if isinstance(items, list):
                    risks.extend(items)
    return risks
def calculate_dimension_score(model: str, dimension: str, focus_areas: List[str]) -> DimensionScore:
    """Calculate the score for a specific dimension"""
    if model not in TRANSLATIONAL_KNOWLEDGE:
        return DimensionScore(score=5.0, concerns=["Unknown model type"], details={})
    model_data = TRANSLATIONAL_KNOWLEDGE[model]
    if dimension not in model_data:
        return DimensionScore(score=5.0, concerns=["The dimension data is not available"], details={})
    dim_data = model_data[dimension]
    scores = list(dim_data.values())
    avg_score = sum(scores) / len(scores) if scores else 5.0
    # Identify key concerns
    concerns = []
    for key, value in dim_data.items():
        if value >= 7.0:
            concerns.append(f"{key}: significant difference (score: {value})")
        elif value >= 5.0:
            concerns.append(f"{key}: medium difference (score: {value})")
    # If this dimension is a focus area, list the raw scores instead
    if dimension in focus_areas:
        concerns = [f"{k}: {v}" for k, v in dim_data.items() if v >= 5.0]
    return DimensionScore(
        score=round(avg_score, 1),
        concerns=concerns[:5],  # cap at five concerns
        details=dim_data
    )
def determine_risk_level(overall_score: float) -> str:
    """Determine risk level"""
    if overall_score >= 8.0:
        return RiskLevel.CRITICAL.value
    elif overall_score >= 6.5:
        return RiskLevel.HIGH.value
    elif overall_score >= 4.0:
        return RiskLevel.MEDIUM.value
    else:
        return RiskLevel.LOW.value
def generate_failure_predictors(model: str, disease: str, dimensions: Dict[str, DimensionScore]) -> List[str]:
    """Generate clinical trial failure predictors"""
    predictors = []
    # Identify risks based on dimension scores
    high_gap_dims = [dim for dim, score in dimensions.items() if score.score >= 7.0]
    if "immune" in high_gap_dims:
        predictors.append("Research on immune-related mechanisms may be at risk of translation failure")
    if "metabolism" in high_gap_dims:
        predictors.append("Pharmacokinetic differences may lead to inappropriate clinical dosing")
    if "anatomy" in high_gap_dims and model in ["mouse", "rat"]:
        predictors.append("Differences in organ structure may affect drug distribution")
    if "behavior" in high_gap_dims and "alzheimer" in disease.lower():
        predictors.append("Interspecies differences in cognitive assessment methods may mask true efficacy")
    if "genetics" in high_gap_dims:
        predictors.append("Differences in gene regulatory networks may affect target effectiveness")
    # Add disease-specific risks
    disease_risks = get_disease_specific_risks(model, disease)
    predictors.extend(disease_risks[:3])  # Up to 3
    return predictors if predictors else ["No specific high-risk factors have been identified, but careful assessment is still required"]
def generate_recommendations(model: str, disease: str, dimensions: Dict[str, DimensionScore]) -> List[str]:
    """Generate improvement recommendations"""
    recommendations = []
    # Recommendations based on model type
    if model == "mouse":
        if dimensions.get("immune", DimensionScore(0, [], {})).score >= 7.0:
            recommendations.append("Consider using mouse models with humanized immune systems")
        if dimensions.get("metabolism", DimensionScore(0, [], {})).score >= 7.0:
            recommendations.append("Validate drug metabolism in vitro with human hepatocytes")
        if dimensions.get("behavior", DimensionScore(0, [], {})).score >= 7.0:
            recommendations.append("Add behavioral validation in non-human primates")
        recommendations.append("Consider using genetically humanized mice to increase translational relevance")
    elif model == "rat":
        recommendations.append("Use the rat's richer behavioral repertoire for cognitive assessment")
        if dimensions.get("metabolism", DimensionScore(0, [], {})).score >= 6.0:
            recommendations.append("Perform metabolic studies in humanized liver microsomes")
    elif model == "zebrafish":
        recommendations.append("Use mainly for high-throughput screening and developmental toxicity testing")
        recommendations.append("Verify positive results in mammalian models")
        recommendations.append("Focus on cardiovascular and neurodevelopmental toxicity assessment")
    elif model == "cell_line":
        recommendations.append("Improve physiological relevance using 3D culture or organoid models")
        recommendations.append("Combine with co-culture systems to simulate the tumor microenvironment")
        recommendations.append("Regularly verify cell line identity and genetic stability")
    elif model == "organoid":
        recommendations.append("Focus on organoid maturity and functional assessment")
        recommendations.append("Consider vascularization techniques to improve nutrient supply")
        recommendations.append("Combine with single-cell sequencing to verify cellular heterogeneity")
    elif model == "primate":
        recommendations.append("Use as the final preclinical validation model")
        recommendations.append("Strictly follow the 3R principles and design experiments rationally")
    # Disease-specific recommendations
    normalized_disease = normalize_disease_name(disease)
    if normalized_disease == "alzheimer":
        recommendations.append("Focus on pathological mechanisms beyond Aβ and tau")
        recommendations.append("Pay attention to neuroimmune and microglia research")
    elif normalized_disease == "cancer":
        recommendations.append("Pay attention to tumor microenvironment and immunotherapy evaluation")
        recommendations.append("Consider patient-derived xenograft models (PDX)")
    elif normalized_disease == "autoimmune":
        recommendations.append("Focus on human-specific immune targets")
        recommendations.append("Consider using humanized immune system models")
    return recommendations
def analyze_gap(model: str, disease: str, focus_areas: Optional[List[str]] = None) -> GapAnalysisReport:
    """Perform a translational gap analysis"""
    if focus_areas is None:
        focus_areas = []
    # When focus areas are given, only those dimensions are scored
    dimensions_to_analyze = focus_areas if focus_areas else [
        "anatomy", "physiology", "metabolism", "immune", "genetics", "behavior"
    ]
    dimensions = {}
    total_score = 0
    for dim in dimensions_to_analyze:
        dim_score = calculate_dimension_score(model, dim, focus_areas)
        dimensions[dim] = dim_score
        total_score += dim_score.score
    overall_score = total_score / len(dimensions) if dimensions else 5.0
    risk_level = determine_risk_level(overall_score)
    failure_predictors = generate_failure_predictors(model, disease, dimensions)
    recommendations = generate_recommendations(model, disease, dimensions)
    # Convert to serializable format
    serializable_dimensions = {}
    for dim, score in dimensions.items():
        serializable_dimensions[dim] = {
            "score": score.score,
            "concerns": score.concerns,
            "details": score.details
        }
    return GapAnalysisReport(
        model=model,
        disease=disease,
        overall_gap_score=round(overall_score, 1),
        risk_level=risk_level,
        dimensions=serializable_dimensions,
        clinical_failure_predictors=failure_predictors,
        recommendations=recommendations
    )
def format_report(report: GapAnalysisReport, output_format: str = "json") -> str:
    """Format report output"""
    if output_format == "json":
        return json.dumps(asdict(report), indent=2, ensure_ascii=False)
    elif output_format == "markdown":
        md = f"""# Translational Gap Analysis Report
## Basic information
- **Model**: {report.model}
- **Disease**: {report.disease}
- **Overall gap score**: {report.overall_gap_score}/10
- **Risk level**: {report.risk_level}
## Assessment by dimension
"""
        for dim, data in report.dimensions.items():
            md += f"\n### {dim.capitalize()}\n"
            md += f"- **Score**: {data['score']}\n"
            if data['concerns']:
                # End the label line so each concern renders as a nested bullet
                md += "- **Points of concern**:\n"
                for concern in data['concerns']:
                    md += f"  - {concern}\n"
        # Headings need their own lines; without the newlines they fuse with the bullets
        md += "\n## Predictors of clinical trial failure\n"
        for predictor in report.clinical_failure_predictors:
            md += f"- ⚠️ {predictor}\n"
        md += "\n## Improvement suggestions\n"
        for rec in report.recommendations:
            md += f"- 💡 {rec}\n"
        return md
    else:  # table format
        lines = []
        lines.append("=" * 70)
        lines.append(f"Translational Gap Analysis Report - {report.model} vs {report.disease}")
        lines.append("=" * 70)
        lines.append(f"Overall gap score: {report.overall_gap_score}/10 | Risk level: {report.risk_level}")
        lines.append("-" * 70)
        lines.append(f"{'Dimension':<15} {'Score':<10} {'Main concerns'}")
        lines.append("-" * 70)
        for dim, data in report.dimensions.items():
            concerns = "; ".join(data['concerns'][:2]) if data['concerns'] else "none"
            lines.append(f"{dim.capitalize():<15} {data['score']:<10} {concerns}")
        lines.append("-" * 70)
        lines.append("[Risk warning of clinical trial failure]")
        for i, predictor in enumerate(report.clinical_failure_predictors[:3], 1):
            lines.append(f"  {i}. {predictor}")
        lines.append("[Improvement suggestions]")
        for i, rec in enumerate(report.recommendations[:4], 1):
            lines.append(f"  {i}. {rec}")
        return "\n".join(lines)
def compare_models(models: List[str], disease: str, focus_areas: Optional[List[str]] = None) -> str:
    """Compare multiple models"""
    reports = []
    for model in models:
        report = analyze_gap(model, disease, focus_areas)
        reports.append(report)
    # Sort ascending: a lower score means a smaller gap, so it ranks first
    reports.sort(key=lambda x: x.overall_gap_score)
    output = ["=" * 80]
    output.append(f"Multi-model comparative analysis: {disease}")
    output.append("=" * 80)
    output.append(f"{'Model':<15} {'Gap score':<12} {'Risk level':<12} {'Rank'}")
    output.append("-" * 80)
    for i, report in enumerate(reports, 1):
        output.append(f"{report.model:<15} {report.overall_gap_score:<12} {report.risk_level:<12} #{i}")
    output.append("\n" + "=" * 80)
    output.append("Detailed analysis")
    output.append("=" * 80)
    for report in reports:
        output.append(f"\n[{report.model.upper()}]")
        output.append(f"Risk level: {report.risk_level} | Key risks: {', '.join(report.clinical_failure_predictors[:2])}")
        output.append(f"Main recommendation: {report.recommendations[0] if report.recommendations else 'N/A'}")
    return "\n".join(output)
def main():
    parser = argparse.ArgumentParser(
        description="Translational Gap Analyzer - Assessing the translational gap between basic research models and human disease",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""Example:
  python main.py --model mouse --disease "Alzheimer's" --full
  python main.py --model mouse --disease "Alzheimer's" --quick
  python main.py --models mouse,rat,primate --disease cancer --compare
  python main.py --model mouse --disease diabetes --focus metabolism,immune"""
    )
    parser.add_argument("--model", type=str, help="Model type (mouse, rat, zebrafish, cell_line, organoid, primate)")
    parser.add_argument("--models", type=str, help="Multi-model comparison mode, comma-separated")
    parser.add_argument("--disease", type=str, required=True, help="Disease name")
    parser.add_argument("--focus", type=str, help="Focus areas, comma-separated (anatomy, physiology, metabolism, immune, genetics, behavior)")
    parser.add_argument("--full", action="store_true", help="Generate full assessment report")
    parser.add_argument("--quick", action="store_true", help="Quick risk assessment mode")
    parser.add_argument("--compare", action="store_true", help="Multi-model comparison mode")
    parser.add_argument("--output", type=str, help="Output file path")
    parser.add_argument("--format", type=str, choices=["json", "markdown", "table"], default="table", help="Output format")
    args = parser.parse_args()
    # Parse focus areas
    focus_areas = []
    if args.focus:
        focus_areas = [f.strip() for f in args.focus.split(",")]
    # Multi-model comparison mode
    if args.compare or args.models:
        if not args.models:
            print("Error: Comparison mode requires the --models argument", file=sys.stderr)
            sys.exit(1)
        models = [m.strip() for m in args.models.split(",")]
        result = compare_models(models, args.disease, focus_areas)
        if args.output:
            with open(args.output, "w", encoding="utf-8") as f:
                f.write(result)
            print(f"Comparison report saved to: {args.output}")
        else:
            print(result)
        return
    # Single model analysis mode
    if not args.model:
        print("Error: Single-model analysis requires the --model argument", file=sys.stderr)
        sys.exit(1)
    # Perform analysis
    report = analyze_gap(args.model, args.disease, focus_areas)
    # Determine output format: --full and --quick override --format
    output_format = args.format
    if args.full:
        output_format = "markdown"
    elif args.quick:
        output_format = "table"
    result = format_report(report, output_format)
    # Output results
    if args.output:
        with open(args.output, "w", encoding="utf-8") as f:
            f.write(result)
        print(f"Report saved to: {args.output}")
    else:
        print(result)
if __name__ == "__main__":
    main()
Analyze data with `toxicity-structure-alert` using a reproducible workflow, explicit validation, and structured outputs for review-ready interpretation.
---
name: toxicity-structure-alert
description: Analyze data with `toxicity-structure-alert` using a reproducible workflow, explicit validation, and structured outputs for review-ready interpretation.
license: MIT
skill-author: AIPOCH
---
# Toxicity Structure Alert (Skill ID: 141)
Identify potential toxic structural alerts in drug molecules.
## When to Use
- Use this skill to identify potential toxic structural alerts in drug molecules by scanning their SMILES structures.
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
See `## Features` below for related details.
- Scope-focused workflow aligned to: Analyze data with `toxicity-structure-alert` using a reproducible workflow, explicit validation, and structured outputs for review-ready interpretation.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
- Python 3.8+
- RDKit
## Example Usage
See `## Usage` below for related details.
```bash
cd "20260318/scientific-skills/Data Analytics/toxicity-structure-alert"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
python scripts/main.py --input "O=[N+]([O-])c1ccccc1" --format json
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Features
- Scan molecular structures (SMILES/SMARTS)
- Identify known toxic structural alerts
- Assess potential toxicity risk levels
- Generate detailed reports
## Supported Alert Structures
| Alert Structure | Toxicity Type | Risk Level |
|---------|---------|---------|
| Aromatic Nitro | Mutagenicity | High |
| Aromatic Amine | Carcinogenicity | High |
| Epoxide | Alkylating Agent | High |
| Aldehyde | Reactive Toxicity | Medium |
| Acyl Chloride | Reactive Toxicity | Medium |
| Michael Acceptor | Electrophilic Toxicity | Medium |
| Hydrazine | Hepatotoxicity | High |
| Haloalkyl | Alkylating Agent | High |
| Quinone | Oxidative Stress | Medium |
| Thiol-Reactive Groups | Protein Binding | Low-Medium |
## Usage
```text
python -m py_compile scripts/main.py
# Example invocation: python scripts/main.py --input <smiles_string> [--format json|text]
```
### Parameters
- `--input, -i`: Input SMILES string (required)
- `--format, -f`: Output format, optional `json` or `text` (default: text)
- `--detail, -d`: Detail level, optional `basic`, `standard`, `full` (default: standard)
### Examples
```text
# Basic text output
python scripts/main.py -i "O=[N+]([O-])c1ccccc1"
# JSON format output
python scripts/main.py -i "O=C1OC1c1ccccc1" -f json
# Detailed report
python scripts/main.py -i "c1ccc2c(c1)ccc1c3ccccc3ccc21" -d full
```
### Python API
```python
from scripts.main import ToxicityAlertScanner
scanner = ToxicityAlertScanner()
result = scanner.scan("O=[N+]([O-])c1ccccc1")
print(result.alerts)
```
## Output Format
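### JSON Output
The example below illustrates the shape of the JSON report; the values are illustrative, and `scripts/main.py` additionally emits a `match_count` field for each alert.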
```json
{
  "input": "O=[N+]([O-])c1ccccc1",
  "mol_weight": 123.11,
  "alert_count": 1,
  "risk_score": 0.85,
  "risk_level": "HIGH",
  "alerts": [
    {
      "name": "Aromatic Nitro",
      "type": "mutagenic",
      "smarts": "[N+](=O)[O-]",
      "risk_level": "HIGH",
      "description": "May cause DNA damage and mutagenicity"
    }
  ],
  "recommendations": [
    "Recommend Ames test validation",
    "Consider structural optimization to reduce toxicity"
  ]
}
```
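A downstream tool would typically parse this JSON rather than the text report. The sketch below assumes the field names shown in the example above (`input`, `risk_level`, `alerts`, `name`); `summarize_report` is a hypothetical helper, not part of `scripts/main.py`.

```python
import json


def summarize_report(raw_json: str) -> str:
    """Return a one-line summary of a scanner JSON report (hypothetical helper)."""
    report = json.loads(raw_json)
    names = [alert["name"] for alert in report.get("alerts", [])]
    listed = ", ".join(names) if names else "none"
    return f'{report["input"]}: {report["risk_level"]} risk, alerts: {listed}'


# A trimmed report using the same field names as the example above
sample = json.dumps({
    "input": "O=[N+]([O-])c1ccccc1",
    "risk_level": "HIGH",
    "alerts": [{"name": "Aromatic Nitro"}],
})
print(summarize_report(sample))
```

Because `scripts/main.py` exits with status 1 when the risk level is HIGH, a caller that shells out to it should still read stdout on a non-zero exit rather than treating that exit code as a crash.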
## Risk Levels
- **HIGH**: Known significant toxicity, strongly recommended to avoid
- **MEDIUM**: Potential toxicity, further evaluation recommended
- **LOW**: Minor concern, can be considered based on specific circumstances
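The overall score and level come from a simple aggregation in `scripts/main.py`: each matched alert contributes its weight times its match count, the sum is halved and capped at 1.0, and the overall level is the most severe level among the matches. The standalone sketch below mirrors that logic; the name `aggregate_risk` and the tuple layout are illustrative and are not the script's API.

```python
from typing import List, Tuple

# Ordering used to pick the most severe level among matches
SEVERITY = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


def aggregate_risk(matches: List[Tuple[float, str, int]]) -> Tuple[float, str]:
    """Combine (weight, level, match_count) tuples into (score, overall level)."""
    if not matches:
        return 0.0, "NONE"
    total = sum(weight * count for weight, _, count in matches)
    score = round(min(total / 2.0, 1.0), 2)  # halve, then cap at 1.0
    level = max((lvl for _, lvl, _ in matches), key=SEVERITY.__getitem__)
    return score, level


print(aggregate_risk([(0.6, "MEDIUM", 1), (0.3, "LOW", 2)]))
```

One consequence of the cap is that two matches of weight 1.0 already saturate the score, so the score alone does not distinguish between heavily flagged molecules.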
## Notes
1. This tool relies on known structural alerts and cannot replace a comprehensive toxicological assessment.
2. Both false positives and false negatives are possible.
3. Use it alongside other ADMET prediction tools.
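The second note matters most when RDKit is unavailable and the script falls back to substring matching on the upper-cased SMILES. The sketch below imitates one of those checks (the `NC` substring used for aromatic primary amines); `naive_amine_hit` is a hypothetical stand-in, not the script's own function.

```python
def naive_amine_hit(smiles: str) -> bool:
    """Mimic the fallback check: upper-case the SMILES, then look for 'NC'."""
    return "NC" in smiles.upper()


print(naive_amine_hit("Nc1ccccc1"))  # aniline, a genuine aromatic amine
print(naive_amine_hit("NCC"))        # ethylamine is aliphatic but is flagged anyway
print(naive_amine_hit("c1ccccc1N"))  # aniline written in another atom order is missed
```

Upper-casing erases the aromatic/aliphatic distinction that SMILES encodes in letter case, which is why substructure matching with RDKit is the more reliable path.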
## References
- Ashby J., Tennant R.W. (1988) Chemical structure, Salmonella mutagenicity...
- Kazius J., McGuire R., Bursi R. (2005) Derivation and validation of toxicophores...
- Enoch S.J., Cronin M.T.D. (2010) A review of the electrophilic reaction chemistry...
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python script executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
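The path-related items in the checklist can be enforced with a small guard. This is a minimal sketch, assuming a single workspace root; `is_within_workspace` is a hypothetical helper and is not part of the packaged script.

```python
from pathlib import Path


def is_within_workspace(candidate: str, workspace: str) -> bool:
    """Return True if candidate resolves to a location inside workspace."""
    root = Path(workspace).resolve()
    # Joining with an absolute candidate discards root, so absolute paths
    # outside the workspace are rejected by the containment check below.
    target = (root / candidate).resolve()
    return target == root or root in target.parents


print(is_within_workspace("out/report.json", "/tmp/workspace"))
print(is_within_workspace("../secrets.txt", "/tmp/workspace"))
```

Resolving both paths before comparing them is what defeats `../` sequences; a plain string prefix check would not.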
## Prerequisites
```text
# Python dependencies
pip install -r requirements.txt
```
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
  - Performance optimization
  - Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `toxicity-structure-alert` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `toxicity-structure-alert` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
## Inputs to Collect
- Required inputs: the user goal, the primary data or source file, and the requested output format.
- Optional inputs: output directory, formatting preferences, and validation constraints.
- If a required input is unavailable, return a short clarification request before continuing.
## Output Contract
- Return a short summary, the main deliverables, and any assumptions that materially affect interpretation.
- If execution is partial, label what succeeded, what failed, and the next safe recovery step.
- Keep the final answer within the documented scope of the skill.
## Validation and Safety Rules
- Validate identifiers, file paths, and user-provided parameters before execution.
- Do not fabricate results, metrics, citations, or downstream conclusions.
- Use safe fallback behavior when dependencies, credentials, or required inputs are missing.
- Surface any execution failure with a concise diagnosis and recovery path.
FILE:references/runtime_checklist.md
# Runtime Checklist
- Category: `Data Analysis`
- Validate the user goal, required inputs, and output format before taking action.
- Ask a targeted clarification question when a required input is missing.
- Keep the response scoped to the documented workflow and state assumptions explicitly.
- Run a non-destructive smoke check before any file-dependent or data-dependent command.
- Recommended smoke check: `python -m py_compile scripts/main.py`
- If execution fails, stop and return a concise recovery path instead of fabricating results.
FILE:requirements.txt
rdkit
FILE:scripts/main.py
#!/usr/bin/env python3
"""Toxicity Structure Alert (Skill ID: 141)

Scan the drug molecular structure to identify potential toxicity warning structures.

Usage:
    python main.py --input <smiles> [--format json|text] [--detail level]
"""
import argparse
import json
import sys
from dataclasses import dataclass, asdict
from typing import List, Dict, Optional, Tuple
from enum import Enum


class RiskLevel(Enum):
    """Risk level of a structural alert."""
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"


@dataclass
class ToxicAlert:
    """Definition of a toxicity alert structure."""
    name: str
    name_en: str
    type: str
    smarts: str
    risk_level: RiskLevel
    description: str
    weight: float = 1.0


@dataclass
class AlertMatch:
    """An alert that matched the input molecule."""
    name: str
    type: str
    smarts: str
    risk_level: str
    description: str
    match_count: int = 1


@dataclass
class ScanResult:
    """Result of scanning one SMILES string."""
    input_smiles: str
    mol_weight: Optional[float]
    alert_count: int
    risk_score: float
    risk_level: str
    alerts: List[AlertMatch]
    recommendations: List[str]
class ToxicityAlertScanner:
    """Toxicity Alert Structural Scanner"""

    # Predefined toxicity warning structures
    TOXIC_ALERTS = [
        # High toxicity warning
        ToxicAlert(
            name="aromatic nitro",
            name_en="Aromatic Nitro",
            type="mutagenic",
            smarts="[N+](=O)[O-]c",
            risk_level=RiskLevel.HIGH,
            description="Nitroaromatic compounds may cause DNA damage, are mutagenic and potentially carcinogenic",
            weight=1.0
        ),
        ToxicAlert(
            name="Aromatic primary amines",
            name_en="Aromatic Primary Amine",
            type="carcinogenic",
            smarts="Nc1ccccc1",
            risk_level=RiskLevel.HIGH,
            description="Aromatic amines can be metabolically activated to produce electrophilic substances and form adducts with DNA.",
            weight=0.9
        ),
        ToxicAlert(
            name="Epoxide",
            name_en="Epoxide",
            type="alkylating",
            smarts="C1OC1",
            risk_level=RiskLevel.HIGH,
            description="Epoxides are highly reactive three-membered cyclic ethers that can act as alkylating agents to damage DNA.",
            weight=1.0
        ),
        ToxicAlert(
            name="aziridine",
            name_en="Aziridine",
            type="alkylating",
            smarts="C1NC1",
            risk_level=RiskLevel.HIGH,
            description="The aziridine ring is highly reactive and can be used as an alkylating agent",
            weight=1.0
        ),
        ToxicAlert(
            name="Hydrazine/Hydrazine",
            name_en="Hydrazine",
            type="hepatotoxic",
            smarts="[NX3][NX3]",
            risk_level=RiskLevel.HIGH,
            description="Hydrazines are hepatotoxic and may cause liver damage",
            weight=0.9
        ),
        ToxicAlert(
            name="Haloalkyl",
            name_en="Haloalkyl",
            type="alkylating",
            smarts="[C][F,Cl,Br,I]",
            risk_level=RiskLevel.HIGH,
            description="Haloalkyl groups can serve as leaving groups, forming electrophilic centers leading to alkylation",
            weight=0.85
        ),
        ToxicAlert(
            name="polycyclic aromatic hydrocarbons",
            name_en="Polycyclic Aromatic Hydrocarbon",
            type="carcinogenic",
            smarts="c1ccc2c(c1)ccc1c3ccccc3ccc21",
            risk_level=RiskLevel.HIGH,
            description="Polycyclic aromatic hydrocarbons can be metabolically activated to form carcinogenic epoxides",
            weight=0.95
        ),
        # Moderate toxicity warning
        ToxicAlert(
            name="Aldehyde group",
            name_en="Aldehyde",
            type="reactive",
            smarts="[CX3H1](=O)",
            risk_level=RiskLevel.MEDIUM,
            description="The aldehyde group is electrophilic and can form Schiff base with protein",
            weight=0.6
        ),
        ToxicAlert(
            name="acid chloride",
            name_en="Acyl Chloride",
            type="reactive",
            smarts="C(=O)Cl",
            risk_level=RiskLevel.MEDIUM,
            description="Acid chlorides are highly reactive and can undergo acylation reactions with nucleophilic groups",
            weight=0.7
        ),
        ToxicAlert(
            name="Michael receptor",
            name_en="Michael Acceptor",
            type="electrophilic",
            smarts="C=CC(=O)",
            risk_level=RiskLevel.MEDIUM,
            description="α,β-unsaturated carbonyl group can act as a Michael acceptor to covalently bind to sulfhydryl group",
            weight=0.65
        ),
        ToxicAlert(
            name="Quinones",
            name_en="Quinone",
            type="oxidative",
            smarts="O=C1C=CC(=O)C=C1",
            risk_level=RiskLevel.MEDIUM,
            description="Quinones can produce reactive oxygen species through redox cycles, causing oxidative stress",
            weight=0.7
        ),
        ToxicAlert(
            name="Nitroso",
            name_en="Nitroso",
            type="carcinogenic",
            smarts="N=O",
            risk_level=RiskLevel.MEDIUM,
            description="N-nitroso compounds are potentially carcinogenic",
            weight=0.75
        ),
        ToxicAlert(
            name="thioester",
            name_en="Thioester",
            type="reactive",
            smarts="C(=O)S",
            risk_level=RiskLevel.MEDIUM,
            description="Thioesters can undergo exchange reactions with sulfhydryl groups",
            weight=0.55
        ),
        # Low toxicity warning
        ToxicAlert(
            name="Thiol",
            name_en="Thiol",
            type="reactive",
            smarts="[SX2H]",
            risk_level=RiskLevel.LOW,
            description="Thiols are reactive and can participate in redox and metal chelation",
            weight=0.3
        ),
        ToxicAlert(
            name="Sulfonyl chloride",
            name_en="Sulfonyl Chloride",
            type="reactive",
            smarts="S(=O)(=O)Cl",
            risk_level=RiskLevel.LOW,
            description="Sulfonyl chloride reacts with nucleophiles",
            weight=0.4
        ),
    ]
    def __init__(self):
        self.alerts = self.TOXIC_ALERTS
        self._compile_patterns()

    def _compile_patterns(self):
        """Compile the SMARTS patterns (requires RDKit)."""
        try:
            from rdkit import Chem
            self.has_rdkit = True
            self._compiled_patterns = {}
            for alert in self.alerts:
                try:
                    pattern = Chem.MolFromSmarts(alert.smarts)
                    if pattern:
                        self._compiled_patterns[alert.name] = pattern
                except Exception:
                    pass
        except ImportError:
            self.has_rdkit = False
            print("Warning: RDKit not available. Falling back to basic pattern matching.", file=sys.stderr)

    def _parse_smiles(self, smiles: str) -> Tuple[Optional['Chem.Mol'], Optional[float]]:
        """Parse a SMILES string into (molecule, molecular weight)."""
        if not self.has_rdkit:
            return None, None
        from rdkit import Chem
        from rdkit.Chem import Descriptors
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None, None
        mol_weight = Descriptors.MolWt(mol)
        return mol, mol_weight

    def _match_alerts(self, mol) -> List[AlertMatch]:
        """Match the molecule against the toxicity alert structures."""
        matches = []
        if self.has_rdkit and mol is not None:
            # Exact substructure matching using RDKit
            from rdkit import Chem
            for alert in self.alerts:
                pattern = self._compiled_patterns.get(alert.name)
                if pattern:
                    try:
                        match_list = mol.GetSubstructMatches(pattern)
                        if match_list:
                            matches.append(AlertMatch(
                                name=alert.name,
                                type=alert.type,
                                smarts=alert.smarts,
                                risk_level=alert.risk_level.value,
                                description=alert.description,
                                match_count=len(match_list)
                            ))
                    except Exception:
                        pass
        else:
            # Degraded mode (RDKit missing or SMILES could not be parsed):
            # use simple string matching on the upper-cased SMILES
            smiles_upper = self._current_smiles.upper()
            matches = self._fallback_string_match(smiles_upper)
        return matches

    def _fallback_string_match(self, smiles: str) -> List[AlertMatch]:
        """Degraded mode: simple substring matching on the SMILES string."""
        matches = []
        # Simplified substrings per alert, written in upper case
        fallback_patterns = {
            "aromatic nitro": ("O=[N+]([O-])", "NO2"),
            "Aromatic primary amines": ("NC", "NH2"),
            "Epoxide": ("C1OC1",),
            "aziridine": ("C1NC1",),
            "Hydrazine/Hydrazine": ("NN",),
            "Haloalkyl": ("CL", "BR", "F", "I"),
            "polycyclic aromatic hydrocarbons": ("C=CC=CC=CC=CC",),
            "Aldehyde group": ("C=O", "CHO"),
            "acid chloride": ("C(=O)CL", "COCL"),
            "Michael receptor": ("C=CC=O",),
            "Quinones": ("O=C1C=CC(=O)",),
            "Nitroso": ("N=O", "N-O"),
            "Thiol": ("SH",),
        }
        alert_map = {a.name: a for a in self.alerts}
        matched_names = set()
        for alert_name, patterns in fallback_patterns.items():
            if alert_name in matched_names:
                continue
            for pattern in patterns:
                if pattern in smiles:
                    alert = alert_map.get(alert_name)
                    if alert:
                        matches.append(AlertMatch(
                            name=alert.name,
                            type=alert.type,
                            smarts=alert.smarts,
                            risk_level=alert.risk_level.value,
                            description=alert.description,
                            match_count=smiles.count(pattern)
                        ))
                        matched_names.add(alert_name)
                        break
        return matches
    def _calculate_risk_score(self, matches: List[AlertMatch]) -> Tuple[float, str]:
        """Calculate the risk score and overall risk level."""
        if not matches:
            return 0.0, "NONE"
        # Look up the weight and level of each matched alert
        alert_weights = {a.name: (a.weight, a.risk_level) for a in self.alerts}
        total_weight = 0.0
        max_risk = RiskLevel.LOW
        for match in matches:
            weight, risk = alert_weights.get(match.name, (0.3, RiskLevel.LOW))
            total_weight += weight * match.match_count
            if risk == RiskLevel.HIGH:
                max_risk = RiskLevel.HIGH
            elif risk == RiskLevel.MEDIUM and max_risk != RiskLevel.HIGH:
                max_risk = RiskLevel.MEDIUM
        # Calculate risk score (0-1)
        risk_score = min(total_weight / 2.0, 1.0)
        return round(risk_score, 2), max_risk.value

    def _generate_recommendations(self, matches: List[AlertMatch], risk_level: str) -> List[str]:
        """Generate recommendations from the matched alerts."""
        recommendations = []
        if risk_level == "HIGH":
            recommendations.append("⚠️ High-risk toxic structures are detected, and it is strongly recommended to conduct Ames test verification")
            recommendations.append("⚠️ Consider structural optimization to reduce potential toxicity risks")
        elif risk_level == "MEDIUM":
            recommendations.append("⚡ Medium risk structure detected, in vitro toxicity screening recommended")
            recommendations.append("⚡ Evaluate the impact of structural modifications on activity and toxicity")
        # Make recommendations based on specific alert types
        alert_types = set(m.type for m in matches)
        if "mutagenic" in alert_types or "carcinogenic" in alert_types:
            recommendations.append("🧬 It is recommended to conduct mutagenicity assessment (such as Ames, chromosomal aberration test)")
        if "hepatotoxic" in alert_types:
            recommendations.append("🫀 It is recommended to assess the risk of hepatotoxicity")
        if "alkylating" in alert_types:
            recommendations.append("⚗️ Alkylating agents are highly reactive and special attention needs to be paid to off-target effects")
        if "reactive" in alert_types:
            recommendations.append("🔄 Reactive groups may affect metabolic stability, it is recommended to evaluate plasma stability")
        if not recommendations:
            recommendations.append("✅ No significant toxicity warning structures were detected, but standard safety assessment is still recommended")
        return recommendations

    def scan(self, smiles: str) -> ScanResult:
        """Scan the SMILES string to identify toxicity warning structures.

        Args:
            smiles: input SMILES string

        Returns:
            ScanResult: scan result
        """
        self._current_smiles = smiles  # Saved for degraded (string-matching) mode
        mol, mol_weight = self._parse_smiles(smiles)
        matches = self._match_alerts(mol)
        risk_score, risk_level = self._calculate_risk_score(matches)
        recommendations = self._generate_recommendations(matches, risk_level)
        return ScanResult(
            input_smiles=smiles,
            mol_weight=round(mol_weight, 2) if mol_weight else None,
            alert_count=len(matches),
            risk_score=risk_score,
            risk_level=risk_level,
            alerts=matches,
            recommendations=recommendations
        )
    def format_text_output(self, result: ScanResult, detail: str = "standard") -> str:
        """Format the scan result as a plain-text report."""
        lines = []
        lines.append("=" * 60)
        lines.append("Toxic Structure Alert Scan Report")
        lines.append("=" * 60)
        lines.append("")
        lines.append(f"Input SMILES: {result.input_smiles}")
        if result.mol_weight:
            lines.append(f"Molecular weight: {result.mol_weight} Da")
        lines.append("")
        # Risk level display
        risk_emojis = {"HIGH": "🔴", "MEDIUM": "🟡", "LOW": "🟢", "NONE": "✅"}
        risk_emoji = risk_emojis.get(result.risk_level, "⚪")
        lines.append(f"Risk level: {risk_emoji} {result.risk_level}")
        lines.append(f"Risk score: {result.risk_score:.2f} / 1.0")
        lines.append(f"Number of alerts: {result.alert_count}")
        lines.append("")
        if result.alerts:
            lines.append("-" * 60)
            lines.append("Detected warning structures:")
            lines.append("-" * 60)
            for i, alert in enumerate(result.alerts, 1):
                lines.append(f"\n{i}. {alert.name}")
                lines.append(f" Type: {alert.type}")
                lines.append(f" Risk: {alert.risk_level}")
                if detail in ("standard", "full"):
                    lines.append(f" SMARTS: {alert.smarts}")
                if detail == "full":
                    lines.append(f" Description: {alert.description}")
                if alert.match_count > 1:
                    lines.append(f" Number of matches: {alert.match_count}")
            lines.append("")
        lines.append("-" * 60)
        lines.append("Recommendations:")
        lines.append("-" * 60)
        for rec in result.recommendations:
            lines.append(f" {rec}")
        lines.append("")
        lines.append("=" * 60)
        lines.append("NOTE: This tool is based on known alert structures and is not a substitute for a comprehensive toxicological assessment")
        lines.append("=" * 60)
        return "\n".join(lines)

    def format_json_output(self, result: ScanResult) -> str:
        """Format the scan result as a JSON string."""
        data = {
            "input": result.input_smiles,
            "mol_weight": result.mol_weight,
            "alert_count": result.alert_count,
            "risk_score": result.risk_score,
            "risk_level": result.risk_level,
            "alerts": [
                {
                    "name": a.name,
                    "type": a.type,
                    "smarts": a.smarts,
                    "risk_level": a.risk_level,
                    "description": a.description,
                    "match_count": a.match_count
                }
                for a in result.alerts
            ],
            "recommendations": result.recommendations
        }
        return json.dumps(data, ensure_ascii=False, indent=2)
def main():
    """Command-line entry point."""
    parser = argparse.ArgumentParser(
        description="Toxicity Structure Alert - scans drug molecular structure for toxicity alerts",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""Example:
  python main.py -i "O=[N+]([O-])c1ccccc1"
  python main.py -i "C1CCCCC1" -f json
  python main.py -i "c1ccc2c(c1)ccc1c3ccccc3ccc21" -d full"""
    )
    parser.add_argument(
        "--input", "-i",
        required=True,
        help="Input SMILES string"
    )
    parser.add_argument(
        "--format", "-f",
        choices=["json", "text"],
        default="text",
        help="Output format (default: text)"
    )
    parser.add_argument(
        "--detail", "-d",
        choices=["basic", "standard", "full"],
        default="standard",
        help="Verbosity (default: standard)"
    )
    args = parser.parse_args()

    # Create a scanner and perform the scan
    scanner = ToxicityAlertScanner()
    result = scanner.scan(args.input)

    # Output results
    if args.format == "json":
        print(scanner.format_json_output(result))
    else:
        print(scanner.format_text_output(result, args.detail))

    # Return a non-zero exit code if high risk is detected
    if result.risk_level == "HIGH":
        sys.exit(1)
    else:
        sys.exit(0)


if __name__ == "__main__":
    main()
Use when converting medical text between academic and patient-friendly tones, translating medical jargon for patients, adapting research papers for public au...
---
name: tone-adjuster
description: Use when converting medical text between academic and patient-friendly tones, translating medical jargon for patients, adapting research papers for public audiences, or rewriting clinical notes for patient handouts. Maintains medical accuracy while adjusting readability level.
license: MIT
skill-author: AIPOCH
---
# Medical Tone Adjuster
Convert medical text between academic rigor and patient-friendly language while preserving clinical accuracy.
## When to Use
- Use this skill when converting medical text between academic and patient-friendly tones, translating medical jargon for patients, adapting research papers for public audiences, or rewriting clinical notes for patient handouts. It maintains medical accuracy while adjusting the readability level.
- Use this skill for academic writing tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
- Scope-focused workflow for converting medical text between academic and patient-friendly tones while preserving medical accuracy.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `Third-party packages`: `not explicitly version-pinned in this skill package`. Add pinned versions if this skill needs stricter environment control.
## Example Usage
```bash
cd "20260318/scientific-skills/Academic Writing/tone-adjuster"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py "The patient suffered an acute myocardial infarction." patient_friendly
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Quick Start
```python
from scripts.main import ToneAdjuster

adjuster = ToneAdjuster()

# Academic → Patient-friendly
# adjust() returns a dict; the converted text is under "converted_text"
patient_text = adjuster.adjust(
    text="The patient presents with acute myocardial infarction...",
    target_tone="patient_friendly"
)["converted_text"]

# Patient-friendly → Academic
academic_text = adjuster.adjust(
    text="I had a heart attack...",
    target_tone="academic"
)["converted_text"]
```
## Core Capabilities
### 1. Academic to Patient-Friendly
```python
adjuster = ToneAdjuster()
result = adjuster.to_patient_friendly(
    "The patient exhibits tachycardia with irregular rhythm "
    "consistent with atrial fibrillation",
    reading_level="8th_grade"
)
```
**Conversion Rules:**
- Replace medical terms with common equivalents
- Shorten sentence length (aim for <15 words)
- Use active voice
- Remove unnecessary qualifiers
**Examples:**
| Academic | Patient-Friendly |
|----------|------------------|
| Myocardial infarction | Heart attack |
| Tachycardia | Fast heartbeat |
| Hypertension | High blood pressure |
| Benign prostatic hyperplasia | Enlarged prostate (non-cancerous) |
| Idiopathic | Unknown cause |
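The term replacement behind the table above can be sketched as a dictionary lookup with whole-word matching. This is a minimal illustration using three rows of the table; `TERMS` and `simplify` are hypothetical names, and the word-boundary handling is a choice made here for illustration.

```python
import re

# Subset of the examples table above
TERMS = {
    "myocardial infarction": "heart attack",
    "tachycardia": "fast heartbeat",
    "hypertension": "high blood pressure",
}


def simplify(text: str, terms: dict = TERMS) -> str:
    """Replace whole-phrase medical terms, longest first, case-insensitively."""
    for term in sorted(terms, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        text = pattern.sub(terms[term], text)
    return text


print(simplify("History of hypertension and myocardial infarction."))
```

Matching on word boundaries keeps a short term from rewriting the inside of a longer word. The replacement text is inserted as written, so sentence-initial capitalization is not preserved.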
### 2. Patient-Friendly to Academic
```python
result = adjuster.to_academic(
    "My stomach hurts after eating spicy food",
    add_citations=True
)
# Output: "The patient reports postprandial abdominal pain
# exacerbated by capsaicin-containing foods"
```
### 3. Reading Level Assessment
```python
metrics = adjuster.assess_reading_level(text)
print(f"Grade level: {metrics.grade_level}")
print(f"Medical terms: {metrics.jargon_count}")
print(f"Recommendations: {metrics.suggestions}")
```
**Reading Levels:**
- **5th-6th Grade**: Young patients, general public
- **8th Grade**: Most adult patients
- **12th Grade**: Educated lay audiences
- **College**: Healthcare professionals
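A grade level like those above is commonly estimated with the Flesch-Kincaid formula, which the skill's reference list names. The sketch below is one way to compute it; the syllable counter is a crude vowel-group heuristic chosen for illustration, and neither function is part of the packaged script, which uses a simpler sentence-length score.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, with at least one per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59


plain = "You had a heart attack. Your heart needs rest."
clinical = ("The patient experienced an acute myocardial infarction "
            "necessitating immediate revascularization.")
print(fk_grade(plain) < fk_grade(clinical))
```

Because the syllable count is approximate, the result is better used to compare two versions of the same text than as an absolute grade.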
### 4. Jargon Translation
```python
translations = adjuster.translate_jargon(
    text="Patient presents with dyspnea and orthopnea...",
    show_alternatives=True
)
```
**Common Medical Terms Dictionary:**
```json
{
  "dyspnea": {
    "patient_friendly": "shortness of breath",
    "explanation": "feeling like you can't get enough air"
  },
  "orthopnea": {
    "patient_friendly": "trouble breathing when lying down",
    "explanation": "need to prop up with pillows to breathe"
  }
}
```
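A dictionary in this shape can drive a translation that explains each term the first time it appears and uses only the plain wording afterwards. The sketch below assumes the two entries shown above; `translate_with_explanations` is a hypothetical helper, not a function of the packaged script.

```python
import re

GLOSSARY = {
    "dyspnea": {
        "patient_friendly": "shortness of breath",
        "explanation": "feeling like you can't get enough air",
    },
    "orthopnea": {
        "patient_friendly": "trouble breathing when lying down",
        "explanation": "need to prop up with pillows to breathe",
    },
}


def translate_with_explanations(text: str, glossary: dict = GLOSSARY) -> str:
    """Swap each term for its plain form; add the explanation on first use only."""
    for term, entry in glossary.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        first = f'{entry["patient_friendly"]} ({entry["explanation"]})'
        # Replace the first occurrence with the explained form
        text, replaced = pattern.subn(first, text, count=1)
        if replaced:
            # Any later occurrences get the plain form without the explanation
            text = pattern.sub(entry["patient_friendly"], text)
    return text


print(translate_with_explanations(
    "Patient presents with dyspnea. The dyspnea worsens at night."
))
```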
## CLI Usage
```text
# Convert text to patient-friendly language (the text and the target tone are positional arguments)
python scripts/main.py "The patient suffered an acute myocardial infarction." patient_friendly
# Convert text to academic tone
python scripts/main.py "I had a heart attack last year." academic
```
## Best Practices
**When Converting to Patient-Friendly:**
- ✅ Use "you" and "your" when appropriate
- ✅ Define terms in parentheses on first use
- ✅ Use analogies for complex concepts
- ✅ Keep paragraphs to 2-3 sentences
**When Converting to Academic:**
- ✅ Use precise medical terminology
- ✅ Include anatomical locations
- ✅ Specify temporal relationships
- ✅ Add objective measurements
## Common Pitfalls
❌ **Don't**: "Your heart has a problem"
✅ **Do**: "Your heart muscle shows signs of reduced blood flow"
❌ **Don't**: "The medicine might make you feel bad"
✅ **Do**: "This medication may cause nausea, dizziness, or fatigue"
## Quality Checklist
- [ ] Medical accuracy preserved
- [ ] No critical information lost
- [ ] Appropriate reading level achieved
- [ ] Tone matches intended audience
- [ ] All medical terms explained or translated
---
**Skill ID**: 202 | **Version**: 1.0 | **License**: MIT
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `tone-adjuster` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `tone-adjuster` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:references/guidelines.md
# Tone Adjuster - References
## Health Literacy
- NIH Plain Language Guidelines
- AHRQ Health Literacy Resources
- AMA Health Literacy Toolkit
## Readability
- Flesch-Kincaid Readability Tests
- Gunning Fog Index
- SMOG Readability Formula
FILE:scripts/main.py
#!/usr/bin/env python3
"""Tone Adjuster - Bidirectional medical text tone converter."""
import json
import re
from typing import Dict, List


class ToneAdjuster:
    """Adjusts tone of medical text."""

    JARGON_DICTIONARY = {
        "myocardial infarction": "heart attack",
        "cerebrovascular accident": "stroke",
        "hypertension": "high blood pressure",
        "hyperlipidemia": "high cholesterol",
        "malignancy": "cancer",
        "benign": "not cancerous",
        "idiopathic": "unknown cause",
        "iatrogenic": "caused by medical treatment",
        "acute": "sudden/severe",
        "chronic": "long-term",
        "exacerbation": "flare-up",
        "remission": "symptoms improved",
        "etiology": "cause",
        "pathogenesis": "how disease develops",
        "morbidity": "illness/complications",
        "mortality": "death rate"
    }

    def adjust(self, text: str, target_tone: str, level: str = "moderate") -> Dict:
        """Adjust text tone."""
        original_tone = self._assess_tone(text)
        if target_tone == "patient_friendly":
            converted = self._to_patient_friendly(text, level)
        elif target_tone == "academic":
            converted = self._to_academic(text)
        else:
            # Unknown target tone: return the text unchanged
            converted = text
        changes = self._identify_changes(text, converted)
        return {
            "converted_text": converted,
            "original_tone": original_tone,
            "target_tone": target_tone,
            "adjustment_level": level,
            "readability_score": self._calc_readability(converted),
            "changes_made": changes
        }

    def _to_patient_friendly(self, text: str, level: str) -> str:
        """Convert to patient-friendly language."""
        result = text
        # Replace jargon; match whole words only so that, for example,
        # "acute" is not rewritten inside "subacute"
        for medical, plain in self.JARGON_DICTIONARY.items():
            pattern = re.compile(r'\b' + re.escape(medical) + r'\b', re.IGNORECASE)
            result = pattern.sub(plain, result)
        # Swap common formal words for simpler ones
        result = result.replace("utilize", "use")
        result = result.replace("demonstrate", "show")
        result = result.replace("indicate", "suggest")
        result = result.replace("approximately", "about")
        return result

    def _to_academic(self, text: str) -> str:
        """Convert to academic tone."""
        # Reverse the dictionary
        reverse_dict = {v: k for k, v in self.JARGON_DICTIONARY.items()}
        result = text
        for plain, medical in reverse_dict.items():
            pattern = re.compile(r'\b' + re.escape(plain) + r'\b', re.IGNORECASE)
            result = pattern.sub(medical, result)
        return result

    def _assess_tone(self, text: str) -> str:
        """Assess current tone from the number of jargon terms present."""
        jargon_count = sum(1 for term in self.JARGON_DICTIONARY if term.lower() in text.lower())
        if jargon_count > 3:
            return "academic"
        elif jargon_count > 0:
            return "mixed"
        return "patient-friendly"

    def _identify_changes(self, original: str, converted: str) -> List[str]:
        """Identify changes made."""
        changes = []
        for medical, plain in self.JARGON_DICTIONARY.items():
            if medical.lower() in original.lower() and plain.lower() in converted.lower():
                changes.append(f"'{medical}' → '{plain}'")
        return changes[:5]  # Limit to 5 changes

    def _calc_readability(self, text: str) -> float:
        """Simple readability estimate (Flesch-inspired)."""
        words = len(text.split())
        sentences = max(1, text.count('.') + text.count('!') + text.count('?'))
        avg_words_per_sentence = words / sentences
        # Simple score on a 0-100 scale: higher means shorter sentences,
        # which is treated as more readable
        score = max(0, min(100, 100 - (avg_words_per_sentence - 10) * 5))
        return round(score, 1)


def main():
    import sys
    adjuster = ToneAdjuster()
    text = sys.argv[1] if len(sys.argv) > 1 else "The patient suffered an acute myocardial infarction."
    tone = sys.argv[2] if len(sys.argv) > 2 else "patient_friendly"
    result = adjuster.adjust(text, tone)
    print(json.dumps(result, indent=2))


if __name__ == "__main__":
    main()
Plan cross-time-zone meeting windows for distributed teams, providing region-by-region local time mappings and tradeoff analysis for scheduling decisions.
---
name: time-zone-planner
description: Plan cross-time-zone meeting windows for distributed teams, providing region-by-region local time mappings and tradeoff analysis for scheduling decisions.
license: MIT
skill-author: AIPOCH
---
# Time Zone Planner
Structured cross-time-zone meeting planning for distributed teams.
## Quick Check
```bash
python -m py_compile scripts/main.py
python scripts/main.py
```
## When to Use
- Use this skill when planning meeting windows across multiple regions or time zones.
- Use this skill when comparing candidate windows and tradeoffs for distributed team scheduling.
- Do not use this skill for calendar booking, live availability checking, or travel/legal decisions.
## Workflow
1. Confirm the participant regions, meeting duration, preferred local-hour ranges, and any hard constraints.
2. Check whether the request is for a quick overlap recommendation or a full tradeoff analysis.
3. Use the packaged script for baseline scheduling output; for complex requests, provide a manual comparison table with stated assumptions.
4. Return suggested meeting windows, region-by-region local times, and the tradeoffs behind the recommendation.
5. If timezone details or availability constraints are missing, stop and request the minimum missing fields.
## Usage
```text
python scripts/main.py
# Input: {"regions": ["US", "EU", "Asia"], "duration": 60}
```
## Parameters
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `regions` | list[string] | Yes | — | Region set, e.g. `["US/Eastern", "Europe/London", "Asia/Shanghai"]` |
| `duration` | integer | Yes | — | Meeting duration in minutes |
| `preferred_hours` | object | No | — | Per-region preferred local hour ranges, e.g. `{"US/Eastern": [9, 17]}` |
**Region format:** Use IANA timezone names (e.g., `US/Eastern`, `Europe/London`, `Asia/Shanghai`) for precise mapping. Short aliases like `"US"` or `"EU"` are accepted but will be mapped to a representative timezone with a note.
## Region Alias Mapping
| Alias | Mapped To | Note |
|-------|-----------|------|
| `US` | `US/Eastern` | Representative only; specify sub-region for accuracy |
| `EU` | `Europe/London` | Representative only; specify country for accuracy |
| `Asia` | `Asia/Shanghai` | Representative only; specify city for accuracy |
## Output
- Suggested meeting windows by region
- Local-time mapping for each included region
- DST assumption notes when applicable
- Tradeoff summary for each candidate window
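
As a rough illustration of the local-time mapping listed above, the sketch below converts one UTC start time into per-region local windows. It assumes fixed standard-time offsets and deliberately ignores DST, so a real recommendation would still need a DST assumption note. The `OFFSETS` table and the `local_times` helper are hypothetical names for illustration, not part of the packaged script, and the `EU` offset follows the script's CET mapping rather than the `Europe/London` alias.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed standard-time offsets in hours from UTC (DST is ignored).
OFFSETS = {"US": -5, "EU": 1, "Asia": 8}


def local_times(utc_start_hour: int, duration_min: int, offsets: dict) -> dict:
    """Map one UTC meeting window to a local 'HH:MM-HH:MM' string per region."""
    # The calendar date is arbitrary because the offsets are fixed.
    start = datetime(2026, 1, 15, utc_start_hour, 0, tzinfo=timezone.utc)
    end = start + timedelta(minutes=duration_min)
    mapping = {}
    for region, hours in offsets.items():
        tz = timezone(timedelta(hours=hours))
        mapping[region] = f"{start.astimezone(tz):%H:%M}-{end.astimezone(tz):%H:%M}"
    return mapping


if __name__ == "__main__":
    for region, window in local_times(13, 60, OFFSETS).items():
        print(region, window)
```

Under these assumptions a 13:00 UTC start lands at 08:00 in the US column, 14:00 in the EU column, and 21:00 in the Asia column.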
## Scope Boundaries
- This skill supports scheduling recommendations, not calendar booking.
- This skill does not validate current DST status from live internet sources.
- This skill does not decide business priority between teams without user-supplied rules.
- Manual confirmation is required before sending invites.
## Stress-Case Rules
For multi-constraint requests, always include these explicit blocks:
1. Assumptions
2. Hard Constraints
3. Recommended Window
4. Tradeoffs
5. Risks and Manual Checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate live calendar availability or confirmed participant agreement.
## Input Validation
This skill accepts: a list of participant regions and a meeting duration for cross-timezone scheduling recommendations.
If the request does not involve cross-timezone meeting planning — for example, asking to book calendar events, check live availability, make travel arrangements, or provide legal scheduling advice — do not proceed with the workflow. Instead respond:
> "time-zone-planner is designed to recommend meeting windows across time zones for distributed teams. Your request appears to be outside this scope. Please provide a region list and meeting duration, or use a more appropriate tool."
## References
- [references/guidelines.md](references/guidelines.md) — General time-zone planning notes
- [references/audit-reference.md](references/audit-reference.md) — Supported scope, audit commands, and fallback boundaries
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:POLISH_CHANGELOG.md
# POLISH_CHANGELOG — time-zone-planner
**Original Score:** 84
**Polish Date:** 2026-03-19
## Issues Addressed
### P0 / Veto Fixes
- None (no veto failures)
### P1 Fixes
- **Region alias ambiguity:** Added "Region Alias Mapping" section with a table showing how short aliases (`US`, `EU`, `Asia`) map to representative IANA timezones, with a note that sub-region specification is recommended for accuracy.
- **Parameters table enriched:** Added `preferred_hours` parameter and updated `regions` description to recommend IANA timezone names. Added format guidance for region input.
### P2 Fixes
- None beyond P1 fixes.
### QS-1 (Input Validation)
- Already present and well-formed.
### QS-2 (Progressive Disclosure)
- File is 115 lines — within 300-line limit. No content moved to references/.
### QS-3 (Canonical YAML Frontmatter)
- Already present with all four required fields.
FILE:references/audit-reference.md
# Audit Reference
## Supported Scope
- Recommend meeting windows across a fixed set of regions.
- Explain timezone tradeoffs and local-time conversions.
- Keep outputs bounded to scheduling support rather than live calendar coordination.
## Stable Audit Commands
```bash
python -m py_compile scripts/main.py
python scripts/main.py
```
## Fallback Boundaries
- If participant regions are missing, request them before proposing a final schedule.
- If the user asks for real-time DST validation or live availability checks, state that those require external confirmation.
- If conflicting constraints cannot all be satisfied, return the least-bad option and explain the tradeoff instead of inventing a perfect overlap.
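
One way to make the "least-bad option" concrete is to score every candidate UTC hour by how far each region's local start falls outside its preferred range. The sketch below is only an illustration under stated assumptions: fixed UTC offsets, preferred ranges that do not wrap past midnight, and a tie-break that favours the smallest worst-case shift. The function names are hypothetical and do not exist in `scripts/main.py`.

```python
from typing import Dict, Tuple


def circular_gap(a: float, b: float) -> float:
    """Shortest distance in hours between two clock times on a 24-hour dial."""
    d = abs(a - b) % 24
    return min(d, 24 - d)


def penalty(local_start: float, duration_h: float, pref: Tuple[int, int]) -> float:
    """Hours the start would need to shift so the whole meeting fits inside pref."""
    lo, hi = pref
    latest_start = hi - duration_h
    if lo <= local_start <= latest_start:
        return 0.0
    return min(circular_gap(local_start, lo), circular_gap(local_start, latest_start))


def least_bad_start(offsets: Dict[str, int],
                    prefs: Dict[str, Tuple[int, int]],
                    duration_h: float = 1.0) -> Tuple[int, float, float]:
    """Return (utc_hour, total_penalty, worst_penalty), minimising total, then worst."""
    best = None
    for utc_hour in range(24):
        scores = [penalty((utc_hour + off) % 24, duration_h, prefs[region])
                  for region, off in offsets.items()]
        candidate = (sum(scores), max(scores), utc_hour)
        if best is None or candidate < best:
            best = candidate
    total, worst, hour = best
    return hour, total, worst


if __name__ == "__main__":
    offsets = {"US": -5, "EU": 1, "Asia": 8}      # assumed fixed offsets
    prefs = {region: (9, 17) for region in offsets}  # assumed 09:00-17:00 everywhere
    print(least_bad_start(offsets, prefs))
```

Reporting both the total and the worst-case penalty keeps the tradeoff explicit instead of hiding which region absorbs the inconvenience.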
## Output Guardrails
- Separate known inputs from assumed time-window preferences.
- Keep recommended windows explicit by region.
- Call out any manual confirmation still required before a meeting is booked.
FILE:references/guidelines.md
# Time Zone Planner - References
## Time Zone Management
- Global Clinical Trial Coordination
- Time Zone Best Practices
FILE:scripts/main.py
#!/usr/bin/env python3
"""Time Zone Planner - Global meeting scheduler."""

import json


class TimeZonePlanner:
    """Plans meetings across time zones."""

    def plan(self, regions: list, duration: int) -> dict:
        """Find optimal meeting times."""
        # Simplified timezone mapping
        timezones = {
            "US": "EST/EDT (UTC-5/-4)",
            "EU": "CET/CEST (UTC+1/+2)",
            "Asia": "CST (UTC+8)"
        }

        suggested = []
        for region in regions:
            if region in timezones:
                suggested.append({
                    "region": region,
                    "local_time": "09:00",
                    "timezone": timezones[region]
                })

        return {
            "suggested_times": suggested,
            "optimal_window": "08:00-10:00 EST / 14:00-16:00 CET / 21:00-23:00 CST",
            "duration_minutes": duration
        }


def main():
    planner = TimeZonePlanner()
    result = planner.plan(["US", "EU", "Asia"], 60)
    print(json.dumps(result, indent=2))


if __name__ == "__main__":
    main()
Automated generation of baseline characteristics tables (Table 1) for clinical research papers.
---
name: table-1-generator
description: Automated generation of baseline characteristics tables (Table 1) for clinical research papers.
license: MIT
skill-author: AIPOCH
---
# Table 1 Generator
Automated generation of baseline characteristics tables (Table 1) for clinical research papers.
## When to Use
- Use this skill when the task requires automated generation of baseline characteristics tables (Table 1) for clinical research papers.
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
See `## Features` below for related details.
- Scope-focused workflow aligned to: Automated generation of baseline characteristics tables (Table 1) for clinical research papers.
- Packaged executable path(s): `scripts/main.py`.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
See `## Prerequisites` below for related details.
- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `numpy`: `unspecified`. Declared in `requirements.txt`.
- `pandas`: `unspecified`. Declared in `requirements.txt`.
- `scipy`: `unspecified`. Declared in `requirements.txt`.
## Example Usage
See `## Usage` below for related details.
```bash
cd "20260318/scientific-skills/Data Analytics/table-1-generator-advanced"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Usage
```text
python scripts/main.py --data patients.csv --group treatment --output table1.csv
```
## Parameters
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `--data` | str | Yes | - | Patient data CSV file path |
| `--group` | str | No | - | Grouping variable (e.g., treatment/control) |
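| `--vars` | str | No | - | Comma-separated variables to include in the table (defaults to all columns except the grouping variable) |
| `--output` | str | No | - | Output CSV file path for Table 1; if omitted, the table is only printed |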
## Features
- Automatic variable type detection
- Appropriate statistics (mean±SD, median[IQR], n(%))
- Group comparisons (t-test for two groups, ANOVA for more than two, chi-square for categorical variables)
- Missing data reporting
- APA formatting
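
The type-detection rule and the summary formats listed above can be sketched without any third-party packages. This is a minimal illustration using only the standard library, not the packaged pandas/scipy implementation; the helper names are hypothetical, the five-level cutoff mirrors the heuristic in `scripts/main.py`, and group comparisons are left out because they need a statistics library.

```python
import statistics
from collections import Counter


def detect_type(values, max_levels=5):
    """Non-numeric data, or numeric data with few distinct values, is categorical."""
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)
    if numeric and len(set(values)) > max_levels:
        return "continuous"
    return "categorical"


def summarize_continuous(values):
    """Format mean ± SD and median [Q1-Q3] for one continuous variable."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    # "inclusive" quartiles interpolate linearly between observed data points
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean_sd": f"{mean:.1f} ± {sd:.1f}",
        "median_iqr": f"{median:.1f} [{q1:.1f}-{q3:.1f}]",
    }


def summarize_categorical(values):
    """Format n (%) for each level of one categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {level: f"{n} ({100 * n / total:.1f}%)" for level, n in sorted(counts.items())}


if __name__ == "__main__":
    ages = [47, 54, 58, 62, 66, 70]   # illustrative values only
    sex = ["F", "M", "F", "F", "M"]   # illustrative values only
    print(detect_type(ages), summarize_continuous(ages))
    print(detect_type(sex), summarize_categorical(sex))
```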
## Output
- Table 1 (printed to the console; saved as CSV when `--output` is given)
- Statistical test results
- Formatted for publication
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
## Prerequisites
```text
# Python dependencies
pip install -r requirements.txt
```
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
  - Performance optimization
  - Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `table-1-generator` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `table-1-generator` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:requirements.txt
numpy
pandas
scipy
FILE:scripts/main.py
#!/usr/bin/env python3
"""
Table 1 Generator
Automated baseline characteristics table for clinical papers.
"""

import argparse

import pandas as pd
import numpy as np
from scipy import stats


class Table1Generator:
    """Generate baseline characteristics table."""

    def __init__(self, data):
        self.data = data

    def detect_var_type(self, var):
        """Detect variable type."""
        if pd.api.types.is_numeric_dtype(self.data[var]):
            # Numeric columns with few distinct values are treated as categorical
            unique_count = self.data[var].nunique()
            if unique_count <= 5:
                return "categorical"
            return "continuous"
        return "categorical"

    def summarize_continuous(self, var, group_var=None):
        """Summarize continuous variable."""
        if group_var:
            results = []
            # Sort once, ignoring missing group labels, so the summaries and
            # the statistical test use the same group order
            groups = sorted(self.data[group_var].dropna().unique())

            for group in groups:
                subset = self.data[self.data[group_var] == group][var].dropna()
                mean = subset.mean()
                std = subset.std()
                median = subset.median()
                q25 = subset.quantile(0.25)
                q75 = subset.quantile(0.75)
                n = len(subset)

                results.append({
                    "group": group,
                    "n": n,
                    "mean_sd": f"{mean:.1f} ± {std:.1f}",
                    "median_iqr": f"{median:.1f} [{q25:.1f}-{q75:.1f}]"
                })

            # Statistical test
            groups_data = [self.data[self.data[group_var] == g][var].dropna()
                           for g in groups]

            if len(groups) == 2:
                stat, pvalue = stats.ttest_ind(groups_data[0], groups_data[1])
                test = f"t-test p={pvalue:.3f}"
            else:
                stat, pvalue = stats.f_oneway(*groups_data)
                test = f"ANOVA p={pvalue:.3f}"

            return results, test
        else:
            subset = self.data[var].dropna()
            mean = subset.mean()
            std = subset.std()
            median = subset.median()
            q25 = subset.quantile(0.25)
            q75 = subset.quantile(0.75)
            n = len(subset)

            return [{
                "n": n,
                "mean_sd": f"{mean:.1f} ± {std:.1f}",
                "median_iqr": f"{median:.1f} [{q25:.1f}-{q75:.1f}]"
            }], ""

    def summarize_categorical(self, var, group_var=None):
        """Summarize categorical variable."""
        if group_var:
            crosstab = pd.crosstab(self.data[var], self.data[group_var])
            percentages = pd.crosstab(self.data[var], self.data[group_var],
                                      normalize='columns') * 100

            results = []
            for category in crosstab.index:
                row = {"category": category}
                for group in crosstab.columns:
                    n = crosstab.loc[category, group]
                    pct = percentages.loc[category, group]
                    row[f"group_{group}"] = f"{n} ({pct:.1f}%)"
                results.append(row)

            # Chi-square test
            chi2, pvalue, dof, expected = stats.chi2_contingency(crosstab)
            test = f"Chi-square p={pvalue:.3f}"

            return results, test
        else:
            counts = self.data[var].value_counts()
            percentages = self.data[var].value_counts(normalize=True) * 100

            results = []
            for category in counts.index:
                results.append({
                    "category": category,
                    "n_pct": f"{counts[category]} ({percentages[category]:.1f}%)"
                })

            return results, ""

    def generate_table(self, group_var=None, var_list=None):
        """Generate complete Table 1."""
        if var_list is None:
            var_list = [c for c in self.data.columns if c != group_var]

        table_rows = []

        for var in var_list:
            var_type = self.detect_var_type(var)

            if var_type == "continuous":
                results, test = self.summarize_continuous(var, group_var)
                table_rows.append({
                    "Variable": var,
                    "Type": "Continuous",
                    "Statistics": results,
                    "P-value": test
                })
            else:
                results, test = self.summarize_categorical(var, group_var)
                for r in results:
                    table_rows.append({
                        "Variable": f" {r.get('category', var)}",
                        "Type": "Categorical",
                        "Statistics": r,
                        # Show the test result only on the first category row
                        "P-value": test if r == results[0] else ""
                    })

        return table_rows

    def print_table(self, table_rows):
        """Print formatted table."""
        print("\n" + "=" * 80)
        print("TABLE 1: BASELINE CHARACTERISTICS")
        print("=" * 80)

        for row in table_rows:
            print(f"\n{row['Variable']} ({row['Type']})")
            print(f" Stats: {row['Statistics']}")
            if row['P-value']:
                print(f" Test: {row['P-value']}")

        print("=" * 80 + "\n")


def main():
    parser = argparse.ArgumentParser(description="Table 1 Generator")
    parser.add_argument("--data", "-d", required=True, help="Data CSV file")
    parser.add_argument("--group", "-g", help="Grouping variable")
    parser.add_argument("--vars", "-v", help="Comma-separated variables to include")
    parser.add_argument("--output", "-o", help="Output CSV file")

    args = parser.parse_args()

    # Load data
    data = pd.read_csv(args.data)

    # Select variables
    var_list = None
    if args.vars:
        var_list = [v.strip() for v in args.vars.split(",")]

    # Generate table
    generator = Table1Generator(data)
    table = generator.generate_table(args.group, var_list)

    # Print
    generator.print_table(table)

    # Save if requested
    if args.output:
        # Flatten for CSV output
        flat_rows = []
        for row in table:
            flat_row = {
                "Variable": row["Variable"],
                "Type": row["Type"],
                "P-value": row["P-value"]
            }
            if isinstance(row["Statistics"], list):
                # Continuous variables: one dict of statistics per group
                for i, stat in enumerate(row["Statistics"]):
                    for key, val in stat.items():
                        flat_row[f"Group{i}_{key}"] = val
            else:
                # Categorical variables: a single dict per category row
                for key, val in row["Statistics"].items():
                    flat_row[key] = val
            flat_rows.append(flat_row)

        df = pd.DataFrame(flat_rows)
        df.to_csv(args.output, index=False)
        print(f"Table saved to: {args.output}")


if __name__ == "__main__":
    main()
Automated abstract screening tool for systematic literature reviews with PRISMA workflow support.
---
name: systematic-review-screener
description: Automated abstract screening tool for systematic literature reviews with PRISMA workflow support.
license: MIT
skill-author: AIPOCH
---
# Systematic Review Screener
Automated abstract screening tool for systematic literature reviews with PRISMA workflow support.
## When to Use
- Use this skill when the task requires automated abstract screening for systematic literature reviews with PRISMA workflow support.
- Use this skill for evidence insight tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
See `## Features` below for related details.
- Scope-focused workflow aligned to: Automated abstract screening tool for systematic literature reviews with PRISMA workflow support.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
See `## Prerequisites` below for related details.
- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `dataclasses`: `unspecified`. Declared in `requirements.txt`.
- `yaml`: `unspecified`. Declared in `requirements.txt`.
## Example Usage
See `## Usage` below for related details.
```bash
cd "20260318/scientific-skills/Evidence Insight/systematic-review-screener"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
python scripts/main.py -h
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Overview
This skill screens academic abstracts against predefined inclusion/exclusion criteria, generating PRISMA-compliant outputs with decision rationale and confidence scores.
**Technical Difficulty: High** ⚠️ Manual verification recommended for final inclusion decisions.
## Features
- **Multi-format Input**: PubMed MEDLINE, EndNote XML, CSV/TSV
- **Criteria Matching**: Configurable inclusion/exclusion rules
- **Confidence Scoring**: 0-100% confidence for each decision (thresholds are given as fractions, e.g. `0.75` for 75%)
- **Conflict Detection**: Flags abstracts requiring human review
- **PRISMA Export**: Flow diagram data and screening log
- **Batch Processing**: Handles large reference sets efficiently
## Usage
### Basic Screening
```bash
# Run with default settings
python scripts/main.py --input references.csv --criteria criteria.yaml
```
### With PRISMA Export
```bash
python scripts/main.py --input references.xml --criteria criteria.yaml \
  --output results/ --prisma --format excel
```
### Confidence Threshold
```bash
python scripts/main.py --input refs.txt --criteria criteria.yaml \
  --threshold 0.8 --conflict-only
```
## Input Formats
### 1. CSV/TSV
Required columns: `title`, `abstract` (optional: `authors`, `year`, `doi`, `pmid`)
```csv
title,abstract,authors,year
```
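
A quick pre-flight check for the required columns might look like the sketch below. It is an assumption-laden illustration: `load_references` is a hypothetical helper, not a function from `scripts/main.py`, and the sample row is placeholder text rather than real bibliographic data.

```python
import csv
import io

REQUIRED_COLUMNS = {"title", "abstract"}


def load_references(handle):
    """Read reference rows from an open CSV handle after checking required columns."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(sorted(missing))}")
    # Skip rows with an empty title; they cannot be screened meaningfully
    return [row for row in reader if (row.get("title") or "").strip()]


if __name__ == "__main__":
    sample = io.StringIO(
        "title,abstract,authors,year\n"
        "Placeholder title,Placeholder abstract text,Placeholder A,2020\n"
    )
    print(len(load_references(sample)), "record(s) loaded")
```

For TSV input, `csv.DictReader(handle, delimiter="\t")` would be the corresponding call.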
### 2. PubMed MEDLINE
Standard .txt export from PubMed search.
### 3. EndNote XML
Export from EndNote with abstracts included.
## Criteria File (YAML)
See `references/criteria_template.yaml` for complete example:
```yaml
study_type:
  include:
    - "randomized controlled trial"
    - "systematic review"
  exclude:
    - "case report"
    - "letter"
    - "editorial"
population:
  include_keywords:
    - "adults"
    - "elderly"
  exclude_keywords:
    - "pediatric"
    - "children"
intervention:
  required:
    - "drug therapy"
    - "medication"
language:
  allowed: ["English"]
year_range:
  min: 2010
  max: 2024
confidence_threshold: 0.75
```
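
Once the criteria file has been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`), the `language` and `year_range` rules reduce to simple metadata checks. The sketch below uses a hard-coded dictionary instead of reading the file so that it runs without extra packages. `passes_metadata` and the `language` record field are hypothetical, and treating missing metadata as "not excluded" is an assumption rather than documented behaviour.

```python
# Subset of the criteria above, as it would look after YAML parsing
criteria = {
    "language": {"allowed": ["English"]},
    "year_range": {"min": 2010, "max": 2024},
}


def passes_metadata(record: dict, criteria: dict) -> tuple:
    """Return (passed, reasons) for the year and language rules only."""
    reasons = []

    year = record.get("year")
    year_range = criteria.get("year_range", {})
    low = year_range.get("min")
    high = year_range.get("max")
    if year is not None:
        if (low is not None and year < low) or (high is not None and year > high):
            reasons.append(f"year {year} outside {low}-{high}")

    allowed = criteria.get("language", {}).get("allowed")
    language = record.get("language")
    if allowed and language is not None and language not in allowed:
        reasons.append(f"language {language} not in {allowed}")

    return (not reasons, reasons)


if __name__ == "__main__":
    print(passes_metadata({"year": 2015, "language": "English"}, criteria))
    print(passes_metadata({"year": 2005, "language": "German"}, criteria))
```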
## Output Files
| File | Description |
|------|-------------|
| `screened_included.csv` | Records passing all criteria |
| `screened_excluded.csv` | Records failing one or more criteria |
| `conflicts.csv` | Low-confidence decisions requiring review |
| `prisma_data.json` | PRISMA flow diagram counts |
| `screening_log.json` | Full decision trail with rationale |
## PRISMA Workflow Support
Generates structured data for PRISMA 2020 flow diagram:
```json
{
  "identification": {
    "database_results": 1250,
    "register_results": 45,
    "other_sources": 12
  },
  "screening": {
    "records_screened": 1307,
    "records_excluded": 1150,
    "full_text_assessed": 157,
    "full_text_excluded": 89
  },
  "included": {
    "qualitative_synthesis": 68,
    "quantitative_synthesis": 42
  }
}
```
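
The counts in this example are internally consistent: 1250 + 45 + 12 = 1307 records screened, 1307 - 1150 = 157 assessed in full text, and 157 - 89 = 68 in the qualitative synthesis. A small consistency check along these lines can catch bookkeeping slips before a flow diagram is drawn. The sketch assumes no separate duplicate-removal or not-retrieved steps, as in the example; `check_prisma` is a hypothetical helper, not part of the packaged script.

```python
def check_prisma(data: dict) -> list:
    """Return a list of arithmetic inconsistencies in PRISMA flow counts."""
    problems = []
    ident = data["identification"]
    scr = data["screening"]
    inc = data["included"]

    identified = sum(ident.values())
    if scr["records_screened"] != identified:
        problems.append(f"screened {scr['records_screened']} != identified {identified}")

    expected_full_text = scr["records_screened"] - scr["records_excluded"]
    if scr["full_text_assessed"] != expected_full_text:
        problems.append(f"full_text_assessed should be {expected_full_text}")

    expected_included = scr["full_text_assessed"] - scr["full_text_excluded"]
    if inc["qualitative_synthesis"] != expected_included:
        problems.append(f"qualitative_synthesis should be {expected_included}")

    if inc["quantitative_synthesis"] > inc["qualitative_synthesis"]:
        problems.append("quantitative_synthesis exceeds qualitative_synthesis")

    return problems


if __name__ == "__main__":
    example = {
        "identification": {"database_results": 1250, "register_results": 45, "other_sources": 12},
        "screening": {"records_screened": 1307, "records_excluded": 1150,
                      "full_text_assessed": 157, "full_text_excluded": 89},
        "included": {"qualitative_synthesis": 68, "quantitative_synthesis": 42},
    }
    print(check_prisma(example) or "counts are consistent")
```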
## Configuration
### Environment Variables
```text
export SCREENING_THRESHOLD=0.75 # Default confidence threshold
export BATCH_SIZE=100 # Records per batch
export MAX_WORKERS=4 # Parallel processing workers
```
### Command Line Options
| Option | Description | Default |
|--------|-------------|---------|
| `--input` | Input file path | Required |
| `--criteria` | Criteria YAML path | Required |
| `--output` | Output directory | `./output` |
| `--format` | Output format: csv/excel/json | csv |
| `--threshold` | Confidence threshold | 0.75 |
| `--prisma` | Generate PRISMA data | False |
| `--conflict-only` | Export only conflicts | False |
| `--batch-size` | Processing batch size | 100 |
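
The option table above maps naturally onto an `argparse` parser. The sketch below mirrors the documented flags and defaults. Letting `SCREENING_THRESHOLD` and `BATCH_SIZE` supply the defaults, with command-line values taking precedence, is an assumption about how the two configuration layers interact, and `build_parser` is a hypothetical name.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser matching the documented options (illustrative only)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Systematic review screener (sketch)")
    parser.add_argument("--input", required=True, help="Input file path")
    parser.add_argument("--criteria", required=True, help="Criteria YAML path")
    parser.add_argument("--output", default="./output", help="Output directory")
    parser.add_argument("--format", choices=["csv", "excel", "json"], default="csv")
    # Assumed precedence: environment variable sets the default, CLI flag overrides it
    parser.add_argument("--threshold", type=float,
                        default=float(env.get("SCREENING_THRESHOLD", 0.75)))
    parser.add_argument("--prisma", action="store_true", help="Generate PRISMA data")
    parser.add_argument("--conflict-only", action="store_true", help="Export only conflicts")
    parser.add_argument("--batch-size", type=int,
                        default=int(env.get("BATCH_SIZE", 100)))
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--input", "refs.csv", "--criteria", "criteria.yaml"])
    print(vars(args))
```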
## Decision Algorithm
1. **Keyword Matching**: Exact and fuzzy keyword matching against title/abstract
2. **Inclusion Scoring**: Points for each inclusion criterion matched
3. **Exclusion Check**: Immediate exclusion if exclusion criterion detected
4. **Confidence Calculation**: Weighted score based on keyword presence and clarity
5. **Conflict Flagging**: Records with confidence < threshold flagged for manual review
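
The five steps above can be sketched roughly as follows. This is not the packaged implementation: it uses exact substring matching only (no fuzzy matching), takes confidence to be the fraction of inclusion terms found, and assigns full confidence to an exclusion hit. All of these are simplifying assumptions, and `Decision` and `screen` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Decision:
    label: str         # "include", "exclude", or "conflict"
    confidence: float  # 0.0-1.0
    rationale: str


def screen(title: str, abstract: str, include_terms: List[str],
           exclude_terms: List[str], threshold: float = 0.75) -> Decision:
    """Screen one record against keyword criteria."""
    text = f"{title} {abstract}".lower()

    # Exclusion check: any exclusion term ends the evaluation immediately
    excluded = [t for t in exclude_terms if t.lower() in text]
    if excluded:
        return Decision("exclude", 1.0, f"exclusion term(s): {', '.join(excluded)}")

    # Inclusion scoring: one point per inclusion term found
    matched = [t for t in include_terms if t.lower() in text]
    confidence = len(matched) / len(include_terms) if include_terms else 0.0

    # Conflict flagging: low confidence goes to manual review
    if confidence < threshold:
        return Decision("conflict", confidence,
                        f"matched {len(matched)} of {len(include_terms)} inclusion terms")
    return Decision("include", confidence, f"matched: {', '.join(matched)}")


if __name__ == "__main__":
    include = ["randomized controlled trial", "adults", "drug therapy"]
    exclude = ["case report"]
    print(screen("A randomized controlled trial of drug therapy",
                 "We enrolled adults with hypertension.", include, exclude))
```

Keeping the rationale string alongside the label is what makes a screening log auditable during manual review.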
## Limitations
- **Not for Final Decisions**: Tool provides recommendations; human review required for inclusion
- **Language Dependent**: Optimized for English abstracts
- **Structured Abstracts**: Performs better on structured abstracts (Background/Methods/Results/Conclusion)
- **Domain Specific**: Criteria must be tailored to research question
## References
- `references/criteria_template.yaml` - Complete criteria configuration example
- `references/prisma_2020_checklist.pdf` - PRISMA 2020 reporting guidelines
- `references/sample_references.csv` - Example input format
## Version
Version: 1.0.0
Last Updated: 2026-02-05
Classification: Research Tool - Requires Human Verification
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
## Prerequisites
```text
# Python dependencies
pip install -r requirements.txt
```
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
  - Performance optimization
  - Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `systematic-review-screener` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `systematic-review-screener` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:references/criteria_template.yaml
# Systematic Review Screening Criteria Template
# Copy and customize this file for your review
# =============================================================================
# STUDY TYPE CRITERIA
# =============================================================================
study_type:
# Study designs to INCLUDE
include:
- "randomized controlled trial"
- "randomised controlled trial"
- "RCT"
- "systematic review"
- "meta-analysis"
- "clinical trial"
- "controlled clinical trial"
- "cluster randomized"
# Study designs to EXCLUDE (immediate exclusion)
exclude:
- "case report"
- "case series"
- "letter"
- "editorial"
- "commentary"
- "opinion"
- "narrative review"
- "scoping review"
- "protocol"
- "conference abstract"
# =============================================================================
# POPULATION CRITERIA
# =============================================================================
population:
# Keywords indicating target population (inclusion)
include_keywords:
- "adults"
- "adult"
- "elderly"
- "older adults"
- "aged"
- "middle-aged"
- "geriatric"
# Keywords indicating excluded populations
exclude_keywords:
- "pediatric"
- "children"
- "adolescent"
- "infant"
- "neonatal"
- "animal"
- "in vitro"
- "cell line"
# =============================================================================
# INTERVENTION CRITERIA
# =============================================================================
intervention:
# At least one of these must be present for inclusion
required:
- "drug therapy"
- "medication"
- "pharmacotherapy"
- "treatment"
- "intervention"
- "therapy"
# =============================================================================
# OUTCOME CRITERIA (Optional - for information only)
# =============================================================================
outcome:
# Outcomes of interest (informational, not exclusionary)
interest:
- "efficacy"
- "effectiveness"
- "safety"
- "adverse events"
- "quality of life"
# =============================================================================
# LANGUAGE CRITERIA
# =============================================================================
language:
allowed:
- "English"
# Note: Language detection is heuristic-based
# =============================================================================
# PUBLICATION YEAR CRITERIA
# =============================================================================
year_range:
min: 2010
max: 2024
# =============================================================================
# SCREENING CONFIGURATION
# =============================================================================
# Confidence threshold (0.0 - 1.0)
# Records below this threshold are flagged for manual review
confidence_threshold: 0.75
# Minimum abstract length for reliable screening (characters)
min_abstract_length: 50
# =============================================================================
# EXAMPLE: SYSTEMATIC REVIEW ON DIABETES TREATMENT
# =============================================================================
# Uncomment and modify for your specific review:
#
# condition:
# required:
# - "diabetes"
# - "type 2 diabetes"
# - "T2DM"
# exclude:
# - "type 1 diabetes"
# - "gestational diabetes"
#
# intervention:
# required:
# - "metformin"
# - "SGLT2 inhibitor"
# - "GLP-1 agonist"
# exclude:
# - "insulin monotherapy"
#
# comparator:
# allowed:
# - "placebo"
# - "standard care"
# - "active comparator"
#
# outcome:
# primary:
# - "HbA1c"
# - "glycemic control"
# secondary:
# - "weight"
# - "cardiovascular"
# - "hypoglycemia"
FILE:references/sample_references.csv
title,abstract,authors,year,doi,pmid
"Effect of Exercise on Type 2 Diabetes: A Randomized Controlled Trial","Background: Physical activity is recommended for type 2 diabetes management. Methods: We conducted a randomized controlled trial of 200 adults with T2DM. Results: Exercise intervention reduced HbA1c by 0.8%. Conclusions: Exercise is effective for glycemic control.","Smith J; Jones A",2023,10.1000/example1,38000001
"Case Report: Rare Complication of Diabetes Medication","We present a case of a 45-year-old patient who developed...","Brown K",2022,10.1000/example2,38000002
"Systematic Review of SGLT2 Inhibitors in Elderly Patients","Objective: To review the safety and efficacy of SGLT2 inhibitors in adults over 65. Methods: Systematic review of RCTs. Results: 15 studies met inclusion criteria. Conclusions: SGLT2 inhibitors are safe and effective in elderly patients.","Davis M; Wilson R",2024,10.1000/example3,38000003
"Letter to Editor: Re: Diabetes Treatment Guidelines","I read with interest the recent article on guidelines...","Johnson T",2023,10.1000/example4,38000004
"Metformin vs Placebo in Pediatric Diabetes: A Randomized Trial","Background: Metformin is used off-label in children. Methods: RCT in 100 pediatric patients. Results: Significant improvement in glycemic control.","Lee S; Chen H",2021,10.1000/example5,38000005
"Pharmacotherapy for Type 2 Diabetes: A Meta-Analysis","Objective: To evaluate drug therapy options for T2DM. Methods: Meta-analysis of 50 RCTs. Results: Multiple agents showed efficacy. Conclusions: Individualized therapy is recommended.","Anderson P; Taylor L",2022,10.1000/example6,38000006
"Narrative Review of Diabetes Management Strategies","This review discusses various approaches to managing diabetes...","White G",2023,10.1000/example7,38000007
"Randomized Controlled Trial of GLP-1 Agonist in Adults with T2DM","Background: GLP-1 agonists show promise. Methods: Double-blind RCT in 300 adults. Results: Significant weight loss and HbA1c reduction.","Martinez C; Garcia R",2024,10.1000/example8,38000008
"Animal Study: Effects of Compound X on Diabetic Rats","Methods: 50 rats were given compound X...","Zhang Y",2020,10.1000/example9,38000009
"Cluster Randomized Trial of Diabetes Education Program","Background: Education improves outcomes. Methods: Cluster RCT in 20 clinics. Results: Improved self-management in intervention group.","Thompson E; Brown A",2023,10.1000/example10,38000010
FILE:requirements.txt
pyyaml
FILE:scripts/main.py
#!/usr/bin/env python3
"""
Systematic Review Abstract Screener
Automated screening of academic abstracts against inclusion/exclusion criteria.
Supports PRISMA workflow and generates screening reports.
Technical Difficulty: High - Manual verification required for final decisions.
"""
import argparse
import csv
import json
import re
import sys
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field, asdict
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Any, Tuple
import yaml
@dataclass
class ScreeningResult:
"""Result of screening a single reference."""
record_id: str
title: str
abstract: str
authors: str = ""
year: str = ""
doi: str = ""
pmid: str = ""
# Screening outcomes
include: bool = False
confidence: float = 0.0
decision_rationale: List[str] = field(default_factory=list)
matched_criteria: Dict[str, Any] = field(default_factory=dict)
exclusion_reasons: List[str] = field(default_factory=list)
requires_review: bool = False
def to_dict(self) -> Dict:
return asdict(self)
@dataclass
class PRISMAData:
"""PRISMA 2020 flow diagram data structure."""
identification: Dict[str, int] = field(default_factory=lambda: {
"database_results": 0,
"register_results": 0,
"other_sources": 0,
"duplicates_removed": 0
})
screening: Dict[str, int] = field(default_factory=lambda: {
"records_screened": 0,
"records_excluded": 0,
"full_text_assessed": 0,
"full_text_excluded": 0
})
included: Dict[str, int] = field(default_factory=lambda: {
"qualitative_synthesis": 0,
"quantitative_synthesis": 0
})
def to_dict(self) -> Dict:
return asdict(self)
class CriteriaMatcher:
"""Matches abstracts against inclusion/exclusion criteria."""
def __init__(self, criteria: Dict[str, Any]):
self.criteria = criteria
self.confidence_threshold = criteria.get('confidence_threshold', 0.75)
def _normalize_text(self, text: str) -> str:
"""Normalize text for matching."""
return text.lower().strip()
def _fuzzy_match(self, text: str, keywords: List[str]) -> Tuple[bool, List[str], float]:
"""
Check if any keywords appear in text.
Returns: (matched, matched_keywords, confidence_score)
"""
text_lower = self._normalize_text(text)
matched = []
total_score = 0.0
for keyword in keywords:
keyword_lower = self._normalize_text(keyword)
# Substring match (this also covers whole-word occurrences)
if keyword_lower in text_lower:
matched.append(keyword)
total_score += 1.0
# Looser match that ignores spaces in both keyword and text
elif keyword_lower.replace(' ', '') in text_lower.replace(' ', ''):
matched.append(keyword)
total_score += 0.7
confidence = min(1.0, total_score / max(1, len(keywords))) if keywords else 0.0
return len(matched) > 0, matched, confidence
def screen(self, reference: Dict[str, str]) -> ScreeningResult:
"""
Screen a single reference against criteria.
Args:
reference: Dictionary with title, abstract, and metadata
Returns:
ScreeningResult with decision and rationale
"""
title = reference.get('title', '')
abstract = reference.get('abstract', '')
full_text = f"{title} {abstract}"
result = ScreeningResult(
record_id=reference.get('record_id', ''),
title=title,
abstract=abstract,
authors=reference.get('authors', ''),
year=reference.get('year', ''),
doi=reference.get('doi', ''),
pmid=reference.get('pmid', '')
)
exclusion_confidence = 0.0
inclusion_confidence = 0.0
# Check exclusion criteria first (hard exclusions)
if 'study_type' in self.criteria:
exclude_types = self.criteria['study_type'].get('exclude', [])
matched, keywords, conf = self._fuzzy_match(full_text, exclude_types)
if matched:
result.exclude = True
result.exclusion_reasons.append(f"Excluded study type: {', '.join(keywords)}")
exclusion_confidence = max(exclusion_confidence, conf)
# Check population exclusions
if 'population' in self.criteria:
exclude_pop = self.criteria['population'].get('exclude_keywords', [])
matched, keywords, conf = self._fuzzy_match(full_text, exclude_pop)
if matched:
result.exclude = True
result.exclusion_reasons.append(f"Excluded population: {', '.join(keywords)}")
exclusion_confidence = max(exclusion_confidence, conf)
# Check year range
if 'year_range' in self.criteria and result.year:
try:
year = int(result.year)
year_min = self.criteria['year_range'].get('min', 1900)
year_max = self.criteria['year_range'].get('max', 2100)
if year < year_min or year > year_max:
result.exclude = True
result.exclusion_reasons.append(f"Year {year} outside range [{year_min}-{year_max}]")
exclusion_confidence = 1.0
except ValueError:
pass
# Check language
if 'language' in self.criteria:
allowed = self.criteria['language'].get('allowed', [])
# Simple language detection from abstract
lang_conf = self._detect_language_confidence(abstract)
if allowed and lang_conf < 0.5:
result.exclude = True
result.exclusion_reasons.append(f"Language not in allowed list: {allowed}")
# Check inclusion criteria (if not already excluded)
if not result.exclusion_reasons:
inclusion_scores = []
# Study type inclusion
if 'study_type' in self.criteria:
include_types = self.criteria['study_type'].get('include', [])
matched, keywords, conf = self._fuzzy_match(full_text, include_types)
if matched:
result.matched_criteria['study_type'] = keywords
inclusion_scores.append(conf)
result.decision_rationale.append(f"Matched study types: {', '.join(keywords)}")
# Population inclusion
if 'population' in self.criteria:
include_pop = self.criteria['population'].get('include_keywords', [])
matched, keywords, conf = self._fuzzy_match(full_text, include_pop)
if matched:
result.matched_criteria['population'] = keywords
inclusion_scores.append(conf)
result.decision_rationale.append(f"Matched population: {', '.join(keywords)}")
# Intervention required
if 'intervention' in self.criteria:
required = self.criteria['intervention'].get('required', [])
matched, keywords, conf = self._fuzzy_match(full_text, required)
if matched:
result.matched_criteria['intervention'] = keywords
inclusion_scores.append(conf)
result.decision_rationale.append(f"Matched intervention: {', '.join(keywords)}")
else:
# Missing required intervention
result.decision_rationale.append("No required intervention keywords found")
inclusion_scores.append(0.0)
# Calculate overall confidence
if inclusion_scores:
inclusion_confidence = sum(inclusion_scores) / len(inclusion_scores)
else:
inclusion_confidence = 0.5 # Neutral if no criteria
# Decision: include if reasonable confidence and no exclusions
result.include = inclusion_confidence >= self.confidence_threshold
# Final confidence calculation
if result.exclusion_reasons:
result.confidence = exclusion_confidence
result.include = False
else:
result.confidence = inclusion_confidence
# Flag for manual review if confidence below threshold
if result.confidence < self.confidence_threshold:
result.requires_review = True
return result
def _detect_language_confidence(self, text: str) -> float:
"""Simple heuristic for English language confidence."""
if not text:
return 0.0
# Common English words check
english_indicators = ['the', 'and', 'of', 'to', 'in', 'a', 'is', 'that', 'for', 'with']
text_lower = text.lower()
words = text_lower.split()
if not words:
return 0.0
indicator_count = sum(1 for word in words if word in english_indicators)
return min(1.0, indicator_count / max(1, len(words) * 0.1))
class ReferenceParser:
"""Parse references from various input formats."""
@staticmethod
def parse_csv(filepath: Path) -> List[Dict[str, str]]:
"""Parse CSV or TSV file."""
records = []
with open(filepath, 'r', encoding='utf-8-sig') as f:
# Detect delimiter
sample = f.read(1024)
f.seek(0)
delimiter = '\t' if '\t' in sample else ','
reader = csv.DictReader(f, delimiter=delimiter)
for idx, row in enumerate(reader):
record = {
'record_id': str(idx),
'title': row.get('title', ''),
'abstract': row.get('abstract', ''),
'authors': row.get('authors', ''),
'year': row.get('year', ''),
'doi': row.get('doi', ''),
'pmid': row.get('pmid', '')
}
records.append(record)
return records
@staticmethod
def parse_pubmed(filepath: Path) -> List[Dict[str, str]]:
"""Parse PubMed MEDLINE format."""
records = []
current_record = {}
with open(filepath, 'r', encoding='utf-8') as f:
for line in f:
line = line.rstrip()
if line.startswith('PMID-'):
if current_record:
current_record['record_id'] = current_record.get('pmid', '')
records.append(current_record)
current_record = {'pmid': line[6:].strip()}
# MEDLINE field tags are padded to four characters, so data starts at index 6
elif line.startswith('TI  -'):
current_record['title'] = line[6:].strip()
elif line.startswith('AB  -'):
current_record['abstract'] = line[6:].strip()
elif line.startswith('AU  -'):
authors = current_record.get('authors', '')
# Join authors with '; ' between names, with no trailing separator
current_record['authors'] = authors + '; ' + line[6:].strip() if authors else line[6:].strip()
elif line.startswith('DP  -'):
year_match = re.search(r'(\d{4})', line[6:])
if year_match:
current_record['year'] = year_match.group(1)
elif line.startswith('LID -') and '[doi]' in line:
doi_match = re.search(r'(10\.\S+)', line)
if doi_match:
current_record['doi'] = doi_match.group(1)
elif line.startswith(' ') and current_record.get('abstract'):
# Continuation of abstract
current_record['abstract'] += ' ' + line.strip()
elif line.startswith(' ') and current_record.get('title'):
# Continuation of title
current_record['title'] += ' ' + line.strip()
# Add last record
if current_record:
current_record['record_id'] = current_record.get('pmid', '')
records.append(current_record)
return records
@staticmethod
def parse_endnote(filepath: Path) -> List[Dict[str, str]]:
"""Parse EndNote XML format."""
records = []
tree = ET.parse(filepath)
root = tree.getroot()
for idx, record in enumerate(root.findall('.//record')):
rec = {'record_id': str(idx)}
# Title
title_elem = record.find('.//titles/title/style')
if title_elem is not None:
rec['title'] = title_elem.text or ''
# Abstract
abstract_elem = record.find('.//abstract/style')
if abstract_elem is not None:
rec['abstract'] = abstract_elem.text or ''
# Authors
authors = []
for author in record.findall('.//contributors/authors/author/style'):
if author.text:
authors.append(author.text)
rec['authors'] = '; '.join(authors)
# Year
year_elem = record.find('.//dates/year/style')
if year_elem is not None and year_elem.text:
rec['year'] = year_elem.text
# DOI
doi_elem = record.find('.//electronic-resource-num/style')
if doi_elem is not None and doi_elem.text:
rec['doi'] = doi_elem.text
records.append(rec)
return records
class ReportGenerator:
"""Generate screening reports in various formats."""
@staticmethod
def generate_prisma_data(results: List[ScreeningResult]) -> PRISMAData:
"""Generate PRISMA 2020 flow diagram data."""
data = PRISMAData()
total = len(results)
included = sum(1 for r in results if r.include and not r.requires_review)
excluded = sum(1 for r in results if not r.include and not r.requires_review)
conflicts = sum(1 for r in results if r.requires_review)
data.identification['database_results'] = total
data.screening['records_screened'] = total
data.screening['records_excluded'] = excluded
data.included['qualitative_synthesis'] = included
return data
@staticmethod
def save_csv(results: List[ScreeningResult], filepath: Path):
"""Save results to CSV."""
if not results:
return
fieldnames = ['record_id', 'title', 'authors', 'year', 'doi', 'pmid',
'include', 'confidence', 'requires_review',
'decision_rationale', 'exclusion_reasons', 'matched_criteria']
with open(filepath, 'w', newline='', encoding='utf-8') as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
for r in results:
row = r.to_dict()
# Convert lists/dicts to strings for CSV
row['decision_rationale'] = '|'.join(row['decision_rationale'])
row['exclusion_reasons'] = '|'.join(row['exclusion_reasons'])
row['matched_criteria'] = json.dumps(row['matched_criteria'])
writer.writerow(row)
@staticmethod
def save_json(results: List[ScreeningResult], filepath: Path):
"""Save results to JSON."""
with open(filepath, 'w', encoding='utf-8') as f:
json.dump([r.to_dict() for r in results], f, indent=2)
@staticmethod
def save_screening_log(results: List[ScreeningResult], prisma_data: PRISMAData, filepath: Path):
"""Save complete screening log with PRISMA data."""
log = {
'timestamp': datetime.now().isoformat(),
'total_records': len(results),
'included': sum(1 for r in results if r.include),
'excluded': sum(1 for r in results if not r.include),
'conflicts': sum(1 for r in results if r.requires_review),
'prisma_data': prisma_data.to_dict(),
'records': [r.to_dict() for r in results]
}
with open(filepath, 'w', encoding='utf-8') as f:
json.dump(log, f, indent=2)
def parse_arguments():
"""Parse command line arguments."""
parser = argparse.ArgumentParser(
description='Systematic Review Abstract Screener',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
%(prog)s --input references.csv --criteria criteria.yaml
%(prog)s --input refs.xml --criteria criteria.yaml --output results/ --prisma
%(prog)s --input refs.txt --criteria criteria.yaml --threshold 0.8 --conflict-only
"""
)
parser.add_argument('--input', '-i', required=True,
help='Input file (CSV, TSV, PubMed .txt, EndNote .xml)')
parser.add_argument('--criteria', '-c', required=True,
help='Criteria YAML configuration file')
parser.add_argument('--output', '-o', default='./output',
help='Output directory (default: ./output)')
parser.add_argument('--format', '-f', choices=['csv', 'json'], default='csv',
help='Output format (default: csv)')
parser.add_argument('--threshold', '-t', type=float, default=0.75,
help='Confidence threshold (default: 0.75)')
parser.add_argument('--prisma', '-p', action='store_true',
help='Generate PRISMA flow diagram data')
parser.add_argument('--conflict-only', action='store_true',
help='Export only conflicts requiring review')
parser.add_argument('--batch-size', '-b', type=int, default=100,
help='Processing batch size (default: 100)')
return parser.parse_args()
def load_criteria(filepath: Path) -> Dict[str, Any]:
"""Load criteria from YAML file."""
with open(filepath, 'r', encoding='utf-8') as f:
return yaml.safe_load(f)
def detect_file_format(filepath: Path) -> str:
"""Detect input file format from extension and content."""
suffix = filepath.suffix.lower()
if suffix in ['.csv']:
return 'csv'
elif suffix in ['.tsv']:
return 'tsv'
elif suffix in ['.xml']:
return 'xml'
elif suffix in ['.txt']:
# Check if PubMed format
with open(filepath, 'r', encoding='utf-8') as f:
sample = f.read(1024)
if 'PMID-' in sample:
return 'pubmed'
# Default to CSV
return 'csv'
def main():
"""Main entry point."""
args = parse_arguments()
input_path = Path(args.input)
criteria_path = Path(args.criteria)
output_dir = Path(args.output)
# Validate inputs
if not input_path.exists():
print(f"Error: Input file not found: {input_path}", file=sys.stderr)
sys.exit(1)
if not criteria_path.exists():
print(f"Error: Criteria file not found: {criteria_path}", file=sys.stderr)
sys.exit(1)
# Create output directory
output_dir.mkdir(parents=True, exist_ok=True)
# Load criteria
print(f"Loading criteria from {criteria_path}...")
criteria = load_criteria(criteria_path)
criteria['confidence_threshold'] = args.threshold
# Detect and parse input format
file_format = detect_file_format(input_path)
print(f"Detected format: {file_format}")
print(f"Parsing references from {input_path}...")
if file_format == 'pubmed':
references = ReferenceParser.parse_pubmed(input_path)
elif file_format == 'xml':
references = ReferenceParser.parse_endnote(input_path)
else:
references = ReferenceParser.parse_csv(input_path)
print(f"Loaded {len(references)} references")
if not references:
print("No references found. Exiting.", file=sys.stderr)
sys.exit(1)
# Initialize screener
matcher = CriteriaMatcher(criteria)
# Process references
print("Screening references...")
results = []
for i in range(0, len(references), args.batch_size):
batch = references[i:i + args.batch_size]
for ref in batch:
result = matcher.screen(ref)
results.append(result)
if (i // args.batch_size + 1) % 10 == 0:
print(f" Processed {min(i + args.batch_size, len(references))}/{len(references)}...")
print(f"Screening complete. {len(results)} records processed.")
# Filter if conflict-only
if args.conflict_only:
results = [r for r in results if r.requires_review]
print(f"Filtered to {len(results)} conflicts requiring review")
# Generate PRISMA data if requested
prisma_data = None
if args.prisma:
prisma_data = ReportGenerator.generate_prisma_data(results)
prisma_path = output_dir / 'prisma_data.json'
with open(prisma_path, 'w') as f:
json.dump(prisma_data.to_dict(), f, indent=2)
print(f"PRISMA data saved to {prisma_path}")
# Split results
included = [r for r in results if r.include and not r.requires_review]
excluded = [r for r in results if not r.include and not r.requires_review]
conflicts = [r for r in results if r.requires_review]
# Save results
print("\nSaving results...")
if args.format == 'csv':
if not args.conflict_only:
ReportGenerator.save_csv(included, output_dir / 'screened_included.csv')
ReportGenerator.save_csv(excluded, output_dir / 'screened_excluded.csv')
ReportGenerator.save_csv(conflicts, output_dir / 'conflicts.csv')
else:
if not args.conflict_only:
ReportGenerator.save_json(included, output_dir / 'screened_included.json')
ReportGenerator.save_json(excluded, output_dir / 'screened_excluded.json')
ReportGenerator.save_json(conflicts, output_dir / 'conflicts.json')
# Save screening log
if prisma_data is None:
prisma_data = ReportGenerator.generate_prisma_data(results)
ReportGenerator.save_screening_log(results, prisma_data, output_dir / 'screening_log.json')
# Print summary
print("\n" + "=" * 50)
print("SCREENING SUMMARY")
print("=" * 50)
total = len(results)
denom = total or 1  # --conflict-only can leave zero records; avoid dividing by zero
print(f"Total records: {total}")
print(f"Included: {len(included)} ({100*len(included)/denom:.1f}%)")
print(f"Excluded: {len(excluded)} ({100*len(excluded)/denom:.1f}%)")
print(f"Conflicts (review): {len(conflicts)} ({100*len(conflicts)/denom:.1f}%)")
print("=" * 50)
print(f"\nResults saved to: {output_dir}")
if conflicts:
print(f"\n⚠️ {len(conflicts)} records require manual review.")
print(f" Check: {output_dir}/conflicts.{args.format}")
print("\n⚠️ IMPORTANT: This tool provides screening recommendations.")
print(" Final inclusion decisions must be verified by human reviewers.")
if __name__ == '__main__':
main()
Design gene circuits for synthetic biology applications
---
name: synthetic-bio-circuit-designer
description: Design gene circuits for synthetic biology applications
version: 1.0.0
category: Bioinfo
tags: []
author: AIPOCH
license: MIT
status: Draft
risk_level: Medium
skill_type: Tool/Script
owner: AIPOCH
reviewer: ''
last_updated: '2026-02-06'
---
# Synthetic Bio Circuit Designer
Generate text designs for synthetic biology gene circuits: the component list, the operating states, and the expected behavior.
## Usage
```bash
python scripts/main.py --type toggle --p1 P1 --p2 P2
python scripts/main.py --type oscillator
```
## Circuit Types
- Toggle switch: Bistable genetic switch
- Oscillator: Repressilator circuit
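
To make "bistable" concrete, here is a minimal sketch of a two-repressor toggle switch integrated with forward Euler. It is not part of `scripts/main.py`. The dimensionless mutual-repression model, the function name `simulate_toggle`, and the parameter values (`alpha`, `n`, `dt`, `steps`) are all illustrative assumptions.

```python
def simulate_toggle(u0, v0, alpha=10.0, n=2.0, dt=0.01, steps=5000):
    """Euler-integrate a dimensionless two-repressor toggle switch.

    u and v are the concentrations of the two repressors. Each is produced
    at a rate that falls as the other accumulates (Hill repression with
    coefficient n) and is removed by first-order decay.
    All parameter values are assumptions chosen for illustration.
    """
    u, v = u0, v0
    for _ in range(steps):
        du = alpha / (1.0 + v ** n) - u
        dv = alpha / (1.0 + u ** n) - v
        u, v = u + dt * du, v + dt * dv
    return u, v


if __name__ == "__main__":
    # Two different starting points settle into two different stable states.
    print("start u-high:", simulate_toggle(5.0, 1.0))
    print("start v-high:", simulate_toggle(1.0, 5.0))
```

With these assumed parameters, a start that favors `u` ends with `u` high and `v` low, and the mirrored start ends the other way around. Those two outcomes correspond to the two states of the switch.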
## Output
- Circuit design
- Component list
- Expected behavior
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
## Prerequisites
No additional Python packages required.
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
- Performance optimization
- Additional feature support
FILE:scripts/main.py
#!/usr/bin/env python3
"""
Synthetic Bio Circuit Designer
Design gene circuits for synthetic biology applications.
"""
import argparse
class CircuitDesigner:
"""Design synthetic biology circuits."""
def design_toggle_switch(self, promoter1, promoter2):
"""Design a genetic toggle switch."""
design = f"""
Toggle Switch Circuit Design:
Components:
- Promoter 1: {promoter1} (drives Repressor 2)
- Promoter 2: {promoter2} (drives Repressor 1)
- Repressor 1: inhibits {promoter1}
- Repressor 2: inhibits {promoter2}
Operation:
State A: Repressor 1 OFF, Repressor 2 ON → {promoter1} active
State B: Repressor 1 ON, Repressor 2 OFF → {promoter2} active
Inducers:
- Inducer 1: inactivates Repressor 1
- Inducer 2: inactivates Repressor 2
"""
return design
def design_oscillator(self):
"""Design a repressilator circuit."""
design = """
Repressilator Circuit Design:
Ring topology with 3 genes:
Gene A represses Gene B
Gene B represses Gene C
Gene C represses Gene A
Expected behavior: Oscillating gene expression
Period: Depends on protein degradation rates
"""
return design
def main():
parser = argparse.ArgumentParser(description="Synthetic Bio Circuit Designer")
parser.add_argument("--type", "-t", choices=["toggle", "oscillator"],
required=True, help="Circuit type")
parser.add_argument("--p1", default="P1", help="Promoter 1")
parser.add_argument("--p2", default="P2", help="Promoter 2")
args = parser.parse_args()
designer = CircuitDesigner()
if args.type == "toggle":
design = designer.design_toggle_switch(args.p1, args.p2)
else:
design = designer.design_oscillator()
print(design)
if __name__ == "__main__":
main()
Suggest triage levels (Emergency, Urgent, Outpatient) based on red flag symptoms using a rule-based engine. For AI-assisted decision support only — not a sub...
---
name: symptom-checker-triage
description: Suggest triage levels (Emergency, Urgent, Outpatient) based on red flag symptoms using a rule-based engine. For AI-assisted decision support only — not a substitute for professional medical diagnosis.
license: MIT
skill-author: AIPOCH
---
# Symptom Checker Triage
Analyzes symptom descriptions and suggests triage levels (Emergency / Urgent / Outpatient) based on red flag identification. Provides rationale and recommended next steps. For AI-assisted decision support only.
## Quick Check
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
python scripts/main.py "Chest pain, difficulty breathing, lasting 30 minutes"
python scripts/main.py "Headache, fever 38.5 degrees, vomiting" --verbose
```
## When to Use
- Triage a patient symptom description to Emergency, Urgent, or Outpatient level
- Identify red flag symptoms in a clinical or research context
- Generate structured triage output for documentation or downstream processing
## Workflow
1. Confirm the symptom description is provided as natural language text.
2. Validate that the request is a symptom triage task; stop early if not.
3. Run `scripts/main.py` with the symptom string or use `--interactive` mode.
4. Return a structured result with triage level, red flags, rationale, and disclaimer.
5. If execution fails or inputs are incomplete, switch to the Fallback Template below.
## Fallback Template
If `scripts/main.py` fails or required fields are missing, respond with:
```
FALLBACK REPORT
───────────────────────────────────────
Objective : <triage goal>
Inputs Available : <symptom description provided>
Missing Inputs : <list exactly what is missing>
Partial Result : <any triage assessment that can be made safely>
Blocked Steps : <what could not be completed and why>
Disclaimer : This is AI-assisted advice only. Seek professional medical care.
Next Steps : <minimum info needed to complete>
───────────────────────────────────────
```
## Stress-Case Output Checklist
For complex multi-constraint requests, always include these sections explicitly:
- **Assumptions**: symptom keywords matched, confidence threshold applied
- **Constraints**: rule-based engine only; no differential diagnosis
- **Risks**: false positives and false negatives are possible; always defer to clinician
- **Unresolved Items**: ambiguous symptoms requiring clarification
## CLI Usage
```bash
# Direct symptom input
python scripts/main.py "Chest pain, radiating to left arm, sweating"
# Interactive mode
python scripts/main.py --interactive
# Verbose output
python scripts/main.py "Headache, fever" --verbose
# JSON output
python scripts/main.py "Abdominal pain, right lower quadrant tenderness" --json
```
## Output Format
```json
{
"triage_level": "emergency|urgent|outpatient",
"confidence": 0.85,
"red_flags": ["Chest pain", "Difficulty breathing"],
"reason": "Chest pain with difficulty breathing may indicate myocardial infarction or pulmonary embolism",
"recommendation": "Go to emergency department immediately",
"department": "Emergency/Cardiology",
"warning": "This is AI-assisted advice and cannot replace professional medical diagnosis"
}
```
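Downstream consumers may want to check a `--json` payload against the fields shown above before using it. The following is a minimal sketch, assuming the field names and level values listed in this section; the `validate_triage_payload` helper is hypothetical and is not part of `scripts/main.py`.

```python
import json

# Field names and allowed levels are taken from the sample output above.
EXPECTED_FIELDS = {
    "triage_level": str,
    "confidence": (int, float),
    "red_flags": list,
    "reason": str,
    "recommendation": str,
    "department": str,
    "warning": str,
}
ALLOWED_LEVELS = {"emergency", "urgent", "outpatient"}


def validate_triage_payload(raw: str) -> dict:
    """Parse a JSON string; raise ValueError if it does not match the documented shape."""
    payload = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in payload:
            raise ValueError(f"missing field: {name}")
        if not isinstance(payload[name], expected_type):
            raise ValueError(f"wrong type for field: {name}")
    if payload["triage_level"] not in ALLOWED_LEVELS:
        raise ValueError(f"unknown triage level: {payload['triage_level']}")
    if not 0.0 <= payload["confidence"] <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return payload
```

Rejecting unknown levels early keeps a malformed or partial result from being treated as a low-severity one.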
## Triage Levels
| Level | Description | Action |
|---|---|---|
| emergency | Life-threatening red flags present | Call emergency services or go to ED immediately |
| urgent | Serious but not immediately fatal | Seek care within 2–4 hours |
| outpatient | Non-urgent | Schedule outpatient appointment |
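The table does not say how to combine several findings that map to different levels; a natural reading is that the most severe level wins. The sketch below illustrates that reading only. `Severity`, `ACTIONS`, and `overall_level` are hypothetical names, and the ordering is an assumption rather than documented behavior.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical ordering of the three levels; a higher value is more severe."""
    OUTPATIENT = 0
    URGENT = 1
    EMERGENCY = 2


# Actions paraphrased from the table above.
ACTIONS = {
    Severity.EMERGENCY: "Call emergency services or go to ED immediately",
    Severity.URGENT: "Seek care within 2-4 hours",
    Severity.OUTPATIENT: "Schedule outpatient appointment",
}


def overall_level(levels):
    """Return the most severe level among lowercase level names (outpatient if empty)."""
    ranked = [Severity[name.upper()] for name in levels]
    return max(ranked, default=Severity.OUTPATIENT)
```

Ranking by level rather than by a per-symptom weight means a low-weight emergency finding cannot be outvoted by a high-weight urgent one.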
## Red Flag Categories
→ Full red flags reference: [references/red_flags.md](references/red_flags.md)
Key categories: Cardiovascular, Respiratory, Neurological, Gastrointestinal, Trauma/Poisoning, Obstetric.
## Disclaimer
> **Important**: This tool provides AI-assisted triage suggestions only. It **cannot replace professional medical diagnosis**. If in doubt, seek medical care immediately. Call emergency services in life-threatening situations.
## Input Validation
This skill accepts: natural language symptom descriptions in English or Chinese for triage level suggestion.
If the request does not involve symptom triage — for example, asking to diagnose a specific disease, prescribe medication, interpret lab results, or perform general medical Q&A — do not proceed. Instead respond:
> "`symptom-checker-triage` is designed to suggest triage levels based on symptom red flags. Your request appears to be outside this scope. Please provide a symptom description, or use a more appropriate tool. This tool does not provide diagnoses or treatment recommendations."
## Error Handling
- If no symptom description is provided, request it explicitly.
- If the task goes outside documented scope (diagnosis, prescription), stop immediately.
- If `scripts/main.py` fails, use the Fallback Template above.
- Do not fabricate triage levels, red flag matches, or medical advice.
## Output Requirements
Every final response must include:
1. **Objective** — what symptom set was triaged
2. **Inputs Received** — symptom description used
3. **Assumptions** — keyword matching applied, confidence threshold
4. **Triage Result** — level, red flags identified, rationale
5. **Risks and Limits** — AI-only, not a diagnosis, false positive/negative risk
6. **Next Checks** — always recommend professional medical evaluation
## Dependencies
- Python 3.8+
- No third-party dependencies (rule-based engine, standard library only)
FILE:requirements.txt
# No third-party packages required: dataclasses and enum ship with the Python 3.8+ standard library
FILE:scripts/main.py
#!/usr/bin/env python3
"""
Symptom Checker Triage (ID: 165)
Recommends a triage level (emergency, urgent, or outpatient) based on common symptom red flags
"""
import sys
import json
import argparse
import re
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass, asdict
from enum import Enum
class TriageLevel(Enum):
    EMERGENCY = "emergency"    # Life-threatening
    URGENT = "urgent"          # Urgent but not immediately life-threatening
    OUTPATIENT = "outpatient"  # Non-urgent


@dataclass
class TriageResult:
    triage_level: str
    confidence: float
    red_flags: List[str]
    reason: str
    recommendation: str
    department: str
    warning: str = "This is an AI-assisted recommendation and cannot replace professional medical diagnosis"
# Red flag definitions: symptom keywords -> (red flag name, triage level, weight, department suggestion, reason)
RED_FLAGS = {
# Cardiovascular system - urgent
"chest pain": ("Chest Pain", TriageLevel.EMERGENCY, 0.95, "Emergency/Cardiology", "Chest pain may indicate myocardial infarction, aortic dissection, or pulmonary embolism"),
"chest tightness": ("Chest Tightness", TriageLevel.EMERGENCY, 0.85, "Emergency/Cardiology", "Chest tightness may be a presentation of angina or myocardial infarction"),
"palpitations": ("Palpitations", TriageLevel.URGENT, 0.70, "Cardiology", "Palpitations may indicate cardiac arrhythmia"),
"syncope": ("Syncope", TriageLevel.EMERGENCY, 0.90, "Emergency/Neurology", "Syncope may suggest serious arrhythmia or cerebrovascular accident"),
"coma": ("Coma", TriageLevel.EMERGENCY, 1.0, "Emergency", "Coma is a life-threatening emergency"),
# Respiratory system - urgent
"dyspnea": ("Dyspnea", TriageLevel.EMERGENCY, 0.95, "Emergency/Pulmonology", "Severe dyspnea indicates risk of respiratory failure"),
"rapid breathing": ("Rapid Breathing", TriageLevel.URGENT, 0.75, "Emergency", "Rapid breathing may indicate pulmonary or cardiac disease"),
"hemoptysis": ("Hemoptysis", TriageLevel.EMERGENCY, 0.90, "Emergency/Pulmonology", "Hemoptysis may indicate serious pulmonary disease"),
"asphyxia": ("Asphyxia", TriageLevel.EMERGENCY, 1.0, "Emergency", "Asphyxia is a life-threatening emergency"),
# Nervous system - urgent
"severe headache": ("Severe Headache", TriageLevel.EMERGENCY, 0.85, "Emergency/Neurology", "Sudden severe headache may indicate subarachnoid hemorrhage"),
"worst headache of life": ("Worst Headache of Life", TriageLevel.EMERGENCY, 0.95, "Emergency/Neurology", "Suggests possible subarachnoid hemorrhage"),
"slurred speech": ("Slurred Speech", TriageLevel.EMERGENCY, 0.90, "Emergency/Neurology", "May be a sign of stroke"),
"hemiplegia": ("Hemiplegia", TriageLevel.EMERGENCY, 0.95, "Emergency/Neurology", "Classic symptom of stroke"),
"hemiparesis": ("Hemiparesis", TriageLevel.EMERGENCY, 0.95, "Emergency/Neurology", "Classic symptom of stroke"),
"convulsions": ("Convulsions", TriageLevel.URGENT, 0.80, "Emergency/Neurology", "May indicate epileptic seizure"),
"epilepsy": ("Epilepsy", TriageLevel.URGENT, 0.80, "Neurology", "Epileptic seizures require prompt medical attention"),
# Digestive system - urgent
"hematemesis": ("Hematemesis", TriageLevel.EMERGENCY, 0.95, "Emergency/Gastroenterology", "Upper gastrointestinal major bleeding"),
"melena": ("Melena", TriageLevel.URGENT, 0.80, "Gastroenterology", "May indicate upper gastrointestinal bleeding"),
"hematochezia": ("Hematochezia", TriageLevel.URGENT, 0.75, "Gastroenterology/Colorectal Surgery", "Gastrointestinal bleeding"),
"severe abdominal pain": ("Severe Abdominal Pain", TriageLevel.EMERGENCY, 0.85, "Emergency", "May be appendicitis, intestinal perforation, or other acute abdomen"),
"board-like abdomen": ("Board-like Abdomen", TriageLevel.EMERGENCY, 0.95, "Emergency/Surgery", "Indicates peritonitis"),
"absent bowel sounds": ("Absent Bowel Sounds", TriageLevel.URGENT, 0.75, "Surgery", "May indicate intestinal obstruction"),
# Other systems - urgent
"severe trauma": ("Severe Trauma", TriageLevel.EMERGENCY, 0.95, "Emergency/Surgery", "Severe trauma requires urgent management"),
"major hemorrhage": ("Major Hemorrhage", TriageLevel.EMERGENCY, 1.0, "Emergency", "Risk of hemorrhagic shock"),
"drug overdose": ("Drug Overdose", TriageLevel.EMERGENCY, 0.90, "Emergency", "Poisoning requires urgent gastric lavage or antidote"),
"poisoning": ("Poisoning", TriageLevel.EMERGENCY, 0.95, "Emergency", "Poisoning requires urgent management"),
# Pediatric/obstetric related
"decreased fetal movement": ("Decreased Fetal Movement", TriageLevel.URGENT, 0.80, "Obstetrics/Gynecology", "May indicate fetal distress"),
"vaginal bleeding": ("Vaginal Bleeding", TriageLevel.URGENT, 0.80, "Obstetrics/Gynecology", "Bleeding during pregnancy warrants concern for miscarriage or placental abruption"),
"febrile seizure": ("Febrile Seizure", TriageLevel.URGENT, 0.85, "Pediatric Emergency", "Febrile seizures in children require urgent management"),
}
# Symptom synonym expansion
SYNONYMS = {
"chest pain": ["chest ache", "precordial pain", "chest discomfort", "thoracic pain"],
"dyspnea": ["shortness of breath", "breathlessness", "difficulty breathing", "can't breathe", "suffocation feeling"],
# "worst headache of life" and "hemiparesis" have their own RED_FLAGS entries,
# so they are not listed as synonyms here (that would make those entries unreachable).
"severe headache": ["splitting headache", "explosive headache"],
"hemiplegia": ["one-sided limb weakness", "arm and leg weakness"],
"slurred speech": ["unclear speech", "difficulty speaking", "expressive difficulty"],
"coma": ["loss of consciousness", "unconscious", "unresponsive"],
"hematemesis": ["vomiting blood", "vomiting fresh blood"],
"melena": ["tarry stool", "black stool"],
"severe abdominal pain": ["severe stomach pain", "severe belly pain", "writhing in pain"],
"major hemorrhage": ["uncontrolled bleeding", "massive bleeding"],
}
# Outpatient-level symptom definitions
OUTPATIENT_SYMPTOMS = {
"mild headache": ("Mild Headache", TriageLevel.OUTPATIENT, 0.60, "Neurology", "Common symptom, recommend outpatient evaluation"),
"low-grade fever": ("Low-grade Fever", TriageLevel.OUTPATIENT, 0.50, "Internal Medicine", "Temperature <38.5°C can be seen in outpatient setting"),
"cough": ("Cough", TriageLevel.OUTPATIENT, 0.40, "Pulmonology", "Cough without dyspnea can be managed in outpatient setting"),
"runny nose": ("Rhinorrhea", TriageLevel.OUTPATIENT, 0.30, "ENT", "Common symptom of upper respiratory tract infection"),
"sore throat": ("Sore Throat", TriageLevel.OUTPATIENT, 0.40, "ENT", "Common in upper respiratory tract infection"),
"mild abdominal pain": ("Mild Abdominal Pain", TriageLevel.OUTPATIENT, 0.50, "Gastroenterology", "Mild abdominal pain without red flags"),
"diarrhea": ("Diarrhea", TriageLevel.OUTPATIENT, 0.50, "Gastroenterology", "Diarrhea without bloody stool or dehydration"),
"rash": ("Rash", TriageLevel.OUTPATIENT, 0.40, "Dermatology", "Rash without fever"),
"joint pain": ("Joint Pain", TriageLevel.OUTPATIENT, 0.40, "Rheumatology", "Chronic joint pain"),
"insomnia": ("Insomnia", TriageLevel.OUTPATIENT, 0.30, "Neurology/Psychology", "Sleep disorder"),
"fatigue": ("Fatigue", TriageLevel.OUTPATIENT, 0.40, "Internal Medicine", "Rule out anemia, hypothyroidism, etc."),
}
def expand_synonyms(text: str) -> str:
    """Lowercase the text and replace known synonyms with standard symptom names."""
    # Keywords and synonyms are stored in lowercase, so normalize case first;
    # otherwise input such as "Chest pain" would never match.
    expanded = text.lower()
    for standard, synonyms in SYNONYMS.items():
        for syn in synonyms:
            if syn in expanded:
                expanded = expanded.replace(syn, standard)
    return expanded


def extract_red_flags(text: str) -> List[Tuple[str, TriageLevel, float, str, str]]:
    """Extract red flags from a symptom description (case-insensitive)."""
    expanded_text = expand_synonyms(text)
    found_flags = []
    for keyword, (name, level, weight, dept, reason) in RED_FLAGS.items():
        if keyword in expanded_text:
            found_flags.append((name, level, weight, dept, reason))
    # Deduplicate by flag name and sort by weight, highest first
    seen = set()
    unique_flags = []
    for flag in sorted(found_flags, key=lambda x: x[2], reverse=True):
        if flag[0] not in seen:
            seen.add(flag[0])
            unique_flags.append(flag)
    return unique_flags
def calculate_temperature(text: str) -> Optional[float]:
    """Extract a body temperature in °C from text, or return None if none is found."""
    # Match various temperature formats (case-insensitive)
    patterns = [
        r'(\d+\.?\d*)\s*degrees?',
        r'(\d+\.?\d*)\s*°C',
        r'(\d+\.?\d*)\s*°',
        r'temperature\s*(\d+\.?\d*)',
        r'fever\s*(\d+\.?\d*)',
        r'temp\s*(\d+\.?\d*)',
    ]
    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            try:
                temp = float(match.group(1))
            except ValueError:
                continue
            # Values above 50 are assumed to be Fahrenheit and converted to Celsius
            if temp > 50:
                temp = (temp - 32) * 5 / 9
            # Skip implausible readings, such as the "3" in "fever 3 days"
            if 30 <= temp <= 45:
                return temp
    return None
def extract_outpatient_symptoms(text: str) -> List[Tuple[str, TriageLevel, float, str, str]]:
    """Extract outpatient-level symptoms from a symptom description (case-insensitive)."""
    found = []
    text_lower = text.lower()
    # Check temperature first
    temperature = calculate_temperature(text)
    for keyword, (name, level, weight, dept, reason) in OUTPATIENT_SYMPTOMS.items():
        if keyword in text_lower:
            found.append((name, level, weight, dept, reason))
    # If no specific symptoms found but temperature is present, assess based on temperature
    if not found and temperature is not None:
        if temperature < 38:
            found.append(("Low-grade Fever", TriageLevel.OUTPATIENT, 0.50, "Internal Medicine", f"Temperature {temperature:.1f}°C, recommend outpatient visit"))
        elif temperature < 39:
            found.append(("Moderate Fever", TriageLevel.OUTPATIENT, 0.60, "Fever Clinic", f"Temperature {temperature:.1f}°C, can be managed in outpatient setting"))
    # Fall back to common symptom keywords
    if not found:
        common_keywords = {
            "headache": ("Headache", TriageLevel.OUTPATIENT, 0.60, "Neurology", "Headache without red flags can be evaluated in outpatient setting"),
            "dizziness": ("Dizziness", TriageLevel.OUTPATIENT, 0.50, "Neurology/ENT", "Dizziness requires evaluation"),
            "nausea": ("Nausea", TriageLevel.OUTPATIENT, 0.50, "Gastroenterology", "Nausea without vomiting can be managed in outpatient setting"),
            "vomiting": ("Vomiting", TriageLevel.OUTPATIENT, 0.60, "Gastroenterology", "Non-severe vomiting can be managed in outpatient setting"),
            "cold": ("Cold Symptoms", TriageLevel.OUTPATIENT, 0.50, "Internal Medicine", "Common cold symptoms"),
            "fever": ("Fever", TriageLevel.OUTPATIENT, 0.60, "Fever Clinic", "Fever requires further evaluation"),
            "high temperature": ("Fever", TriageLevel.OUTPATIENT, 0.60, "Fever Clinic", "Fever requires further evaluation"),
        }
        for keyword, symptom_info in common_keywords.items():
            if keyword in text_lower:
                found.append(symptom_info)
    return found
def triage(symptom_text: str) -> TriageResult:
    """
    Main triage function

    Args:
        symptom_text: Symptom description text
    Returns:
        TriageResult: Triage result
    """
    if not symptom_text or not symptom_text.strip():
        return TriageResult(
            triage_level=TriageLevel.OUTPATIENT.value,
            confidence=0.0,
            red_flags=[],
            reason="No symptom description provided",
            recommendation="Please provide a symptom description for triage",
            department="Unknown"
        )
    # Extract red flags
    red_flags = extract_red_flags(symptom_text)
    # Extract temperature
    temperature = calculate_temperature(symptom_text)
    # Check high fever
    if temperature is not None and temperature >= 40:
        red_flags.append(("High Fever (>=40°C)", TriageLevel.EMERGENCY, 0.90, "Emergency", "A temperature of 40°C or higher requires urgent management"))
    elif temperature is not None and temperature >= 39:
        red_flags.append(("High Fever (>=39°C)", TriageLevel.URGENT, 0.70, "Fever Clinic", "High fever requires prompt medical attention"))
    # If red flags are present
    if red_flags:
        # Take the most severe level across all flags, independent of weight
        severity = {TriageLevel.OUTPATIENT: 0, TriageLevel.URGENT: 1, TriageLevel.EMERGENCY: 2}
        level = max((flag[1] for flag in red_flags), key=severity.get)
        confidence = min(0.95, max(flag[2] for flag in red_flags))
        flag_names = [flag[0] for flag in red_flags]
        departments = list(dict.fromkeys(flag[3] for flag in red_flags))  # Deduplicate while preserving order
        if level == TriageLevel.EMERGENCY:
            recommendation = "Proceed to emergency immediately, call 911 if necessary"
        elif level == TriageLevel.URGENT:
            recommendation = "Seek medical attention within 2-4 hours, emergency or fever clinic"
        else:
            recommendation = "Schedule an outpatient appointment as soon as possible"
        # Combine the reasons of the first three flags
        reasons = [f"{flag[0]}: {flag[4]}" for flag in red_flags[:3]]
        reason = "; ".join(reasons)
        return TriageResult(
            triage_level=level.value,
            confidence=confidence,
            red_flags=flag_names,
            reason=reason,
            recommendation=recommendation,
            department="/".join(departments[:2])
        )
    # Check outpatient-level symptoms
    outpatient = extract_outpatient_symptoms(symptom_text)
    if outpatient:
        max_symptom = max(outpatient, key=lambda x: x[2])
        return TriageResult(
            triage_level=TriageLevel.OUTPATIENT.value,
            confidence=max_symptom[2],
            red_flags=[],
            reason=f"Symptom '{max_symptom[0]}' has no red flags, suitable for outpatient management",
            recommendation="Schedule an outpatient appointment; seek urgent care if symptoms worsen",
            department=max_symptom[3]
        )
    # Unrecognized symptoms
    return TriageResult(
        triage_level=TriageLevel.OUTPATIENT.value,
        confidence=0.30,
        red_flags=[],
        reason="Unrecognized symptoms, recommend medical evaluation",
        recommendation="Schedule a general internal medicine outpatient appointment for initial assessment",
        department="Internal Medicine"
    )
def interactive_mode():
    """Interactive mode"""
    print("=" * 50)
    print("Symptom Checker Triage")
    print("=" * 50)
    print("Enter symptom description (e.g. 'chest pain, dyspnea'), type 'quit' to exit")
    print("-" * 50)
    while True:
        try:
            user_input = input("\nSymptom description: ").strip()
            if user_input.lower() in ['quit', 'exit', 'q']:
                print("Thank you for using. Goodbye!")
                break
            if not user_input:
                continue
            result = triage(user_input)
            print_result(result)
        except KeyboardInterrupt:
            print("\nThank you for using. Goodbye!")
            break
        except EOFError:
            break
def supports_color() -> bool:
    """Detect whether the terminal supports color"""
    import os
    return sys.stdout.isatty() and os.environ.get('TERM') not in ('dumb', '')


def print_result(result: TriageResult, verbose: bool = False):
    """Print triage result"""
    use_color = supports_color()
    level_colors = {
        "emergency": "\033[91m" if use_color else "",   # Red
        "urgent": "\033[93m" if use_color else "",      # Yellow
        "outpatient": "\033[92m" if use_color else "",  # Green
    }
    reset = "\033[0m" if use_color else ""
    level_display = {
        "emergency": "🔴 Emergency",
        "urgent": "🟠 Urgent",
        "outpatient": "🟢 Outpatient",
    }
    level = result.triage_level
    color = level_colors.get(level, "")
    print("\n" + "=" * 50)
    print(f"Triage Level: {color}{level_display.get(level, level)}{reset}")
    print(f"Confidence: {result.confidence:.0%}")
    if result.red_flags:
        print("\n🚨 Identified Red Flags:")
        for flag in result.red_flags:
            print(f"  - {flag}")
    print("\n📋 Triage Reason:")
    print(f"  {result.reason}")
    print("\n💡 Recommendation:")
    print(f"  {result.recommendation}")
    print(f"\n🏥 Suggested Department: {result.department}")
    if verbose:
        print(f"\n⚠️ Disclaimer: {result.warning}")
    print("=" * 50)
def main():
    parser = argparse.ArgumentParser(
        description="Symptom Checker Triage - recommends triage level based on red flags",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python main.py "chest pain, radiating to left arm, sweating"
  python main.py --interactive
  python main.py "headache, fever 38 degrees" --verbose
"""
    )
    parser.add_argument(
        "symptoms",
        nargs="?",
        help="Symptom description (e.g. 'chest pain, dyspnea')"
    )
    parser.add_argument(
        "-i", "--interactive",
        action="store_true",
        help="Enter interactive mode"
    )
    parser.add_argument(
        "-v", "--verbose",
        action="store_true",
        help="Show detailed information"
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="Output in JSON format"
    )
    args = parser.parse_args()
    if args.interactive:
        interactive_mode()
        return
    if not args.symptoms:
        parser.print_help()
        print("\nError: Please provide symptom description or use --interactive for interactive mode")
        sys.exit(1)
    result = triage(args.symptoms)
    if args.json:
        print(json.dumps(asdict(result), ensure_ascii=False, indent=2))
    else:
        print_result(result, verbose=args.verbose)


if __name__ == "__main__":
    main()
Analyze data with `survival-curve-risk-table` using a reproducible workflow, explicit validation, and structured outputs for review-ready interpretation.
---
name: survival-curve-risk-table
description: Analyze data with `survival-curve-risk-table` using a reproducible workflow, explicit validation, and structured outputs for review-ready interpretation.
license: MIT
skill-author: AIPOCH
---
# Survival Curve Risk Table Generator
## When to Use
- Use this skill when the task needs a "Number at risk" table automatically aligned and added below a Kaplan-Meier survival curve.
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
- Scope-focused workflow for adding aligned "Number at risk" tables to Kaplan-Meier survival curves, with explicit validation and structured outputs.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
See `## Prerequisites` below for related details.
- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `lifelines`: `unspecified`. Declared in `requirements.txt`.
- `matplotlib`: `unspecified`. Declared in `requirements.txt`.
- `numpy`: `unspecified`. Declared in `requirements.txt`.
- `pandas`: `unspecified`. Declared in `requirements.txt`.
- `pil`: `unspecified`. Declared in `requirements.txt`.
- `pillow`: `unspecified`. Declared in `requirements.txt`.
- `seaborn`: `unspecified`. Declared in `requirements.txt`.
## Example Usage
See `## Usage` below for related details.
```bash
cd "20260318/scientific-skills/Data Analytics/survival-curve-risk-table"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
# Example invocation: python scripts/main.py --help
# Example invocation: python scripts/main.py --input survival_data.csv --time-col time --event-col event --output risk_table.png
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Function Overview
Automatically adds "Number at risk" tables that meet clinical oncology journal standards to Kaplan-Meier survival curves, aligns the table's time points with the curve, and generates publication-quality combined figures.
## Usage Trigger Conditions
- Need to add number at risk tables to KM survival curves
- Generate survival plots that meet journal requirements such as NEJM, Lancet, JCO
- Risk tables needed in clinical trial reports
- Medical paper chart standardization before submission
- Precise alignment of risk tables with survival curve time axis
## Core Functions
### 1. Automatic Number at Risk Calculation
- Automatically calculate number at risk at each time point from survival data
- Support right-censored data processing
- Count remaining observed subjects by group
- Automatic handling of censoring events
### 2. Journal Standard Formats
- **NEJM Standard**: Clean time axis, groups arranged horizontally
- **Lancet Standard**: Complete statistical information, vertical alignment
- **JCO Standard**: Censoring symbols marked, group comparison
- Support custom journal templates
### 3. Precise Alignment
- Time axis precisely aligned with curve X-axis
- Automatic adjustment of table spacing and font size
- Responsive layout adapts to different image sizes
- Support horizontal/vertical layouts
### 4. Output Formats
- High-quality PNG/JPEG images
- PDF vector graphics
- SVG editable format
- PowerPoint embeddable format
## Usage
### Example 1: Basic Risk Table Generation
```text
# Example invocation: python scripts/main.py \
--input survival_data.csv \
--time-col time \
--event-col event \
--group-col treatment \
--output risk_table.png
```
### Example 2: Specify Journal Style
```text
# Example invocation: python scripts/main.py \
--input survival_data.csv \
--time-col time \
--event-col status \
--group-col arm \
--style NEJM \
--time-points 0,6,12,18,24,30,36 \
--output figure_1a.pdf
```
### Example 3: Combined Figure Generation (Curve + Risk Table)
```text
# Example invocation: python scripts/main.py \
--input survival_data.csv \
--time-col months \
--event-col death \
--group-col group \
--km-plot km_curve.png \
--combine \
--output combined_figure.png
```
### Example 4: Batch Generate Multi-Timepoint Tables
```text
# Example invocation: python scripts/main.py \
--input survival_data.csv \
--time-col time \
--event-col event \
--group-col treatment \
--time-points 0,12,24,36,48,60 \
--format both \
--output-dir ./output/
```
### Example 5: Using Existing Survival Data (Python API)
```python
from scripts.main import RiskTableGenerator
# Initialize generator
generator = RiskTableGenerator(
style="JCO",
time_points=[0, 6, 12, 18, 24, 30],
figure_size=(8, 6)
)
# Load survival data
generator.load_data(
df=survival_df,
time_col="time",
event_col="event",
group_col="treatment_arm"
)
# Generate risk table
generator.generate_risk_table(
output_path="risk_table.png",
show_censored=True
)
# Generate combined figure (KM curve + risk table)
generator.generate_combined_plot(
km_plot_path="km_curve.png",
output_path="combined_figure.pdf"
)
```
## Input Data Format
### CSV Format Example
```csv
time,event,treatment_arm
0,0,Experimental
3.2,1,Experimental
5.1,0,Experimental
12.3,1,Control
18.7,0,Control
24.0,1,Experimental
...
```
### Required Columns
| Column Name | Description | Type |
|------|------|------|
| time | Follow-up time (months) | Numeric |
| event | Event occurrence flag | 0=Censored, 1=Event |
| group | Treatment group (optional) | Text/Categorical |
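Before computing the table, it can help to confirm that the input actually carries the columns described above. The following is a minimal sketch using only the standard library; `load_survival_rows` and its defaults are hypothetical and do not describe how `scripts/main.py` reads data.

```python
import csv
import io


def load_survival_rows(handle, time_col="time", event_col="event", group_col=None):
    """Read (time, event, group) tuples from an open CSV file object, validating each row."""
    reader = csv.DictReader(handle)
    required = [time_col, event_col] + ([group_col] if group_col else [])
    missing = [col for col in required if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")
    rows = []
    for line_no, record in enumerate(reader, start=2):  # line 1 is the header
        time = float(record[time_col])   # raises ValueError on non-numeric text
        event = int(record[event_col])
        if time < 0:
            raise ValueError(f"line {line_no}: follow-up time must be non-negative")
        if event not in (0, 1):
            raise ValueError(f"line {line_no}: event must be 0 (censored) or 1 (event)")
        # The group column is optional; ungrouped data falls into a single "All" group.
        group = record[group_col] if group_col else "All"
        rows.append((time, event, group))
    return rows


# Usage with an in-memory file that mimics the CSV example above
sample = io.StringIO("time,event,treatment_arm\n3.2,1,Experimental\n12.3,1,Control\n")
rows = load_survival_rows(sample, group_col="treatment_arm")
```

Failing on the first malformed row, with its line number, is usually easier to debug than a silently wrong risk table.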
### Supported Data Formats
- CSV (.csv)
- Excel (.xlsx, .xls)
- SAS (.sas7bdat)
- RData (.rda, .rds)
- Python pickle (.pkl)
## Journal Style Configuration
### NEJM Style
```json
{
"style": "NEJM",
"font_family": "Helvetica",
"font_size": 8,
"time_points": [0, 6, 12, 18, 24, 30, 36],
"table_height": 0.15,
"show_grid": false,
"separator_lines": true
}
```
### Lancet Style
```json
{
"style": "Lancet",
"font_family": "Times New Roman",
"font_size": 9,
"time_points": [0, 12, 24, 36, 48, 60],
"table_height": 0.18,
"show_grid": true,
"header_bold": true
}
```
### JCO Style
```json
{
"style": "JCO",
"font_family": "Arial",
"font_size": 8,
"time_points": [0, 6, 12, 18, 24, 30],
"table_height": 0.16,
"show_censored": true,
"censor_symbol": "+"
}
```
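One way to work with these presets is to treat them as defaults and layer user overrides on top. The sketch below copies the values from the three JSON blocks above; `STYLE_PRESETS` and `resolve_style` are hypothetical names, and the actual script may store or merge its styles differently.

```python
import copy

# Values copied from the NEJM, Lancet, and JCO JSON blocks above.
STYLE_PRESETS = {
    "NEJM": {
        "font_family": "Helvetica", "font_size": 8,
        "time_points": [0, 6, 12, 18, 24, 30, 36],
        "table_height": 0.15, "show_grid": False, "separator_lines": True,
    },
    "Lancet": {
        "font_family": "Times New Roman", "font_size": 9,
        "time_points": [0, 12, 24, 36, 48, 60],
        "table_height": 0.18, "show_grid": True, "header_bold": True,
    },
    "JCO": {
        "font_family": "Arial", "font_size": 8,
        "time_points": [0, 6, 12, 18, 24, 30],
        "table_height": 0.16, "show_censored": True, "censor_symbol": "+",
    },
}


def resolve_style(name: str = "NEJM", **overrides) -> dict:
    """Return a copy of the named preset with any keyword overrides applied."""
    if name not in STYLE_PRESETS:
        raise ValueError(f"unknown style: {name!r}")
    # Deep copy so callers can edit the time_points list without touching the preset.
    config = copy.deepcopy(STYLE_PRESETS[name])
    config["style"] = name
    config.update(overrides)
    return config
```

For example, `resolve_style("Lancet", font_size=10)` keeps the Lancet time points and grid settings while changing only the font size.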
## Command Line Parameters
### Required Parameters
| Parameter | Description | Example |
|------|------|------|
| `--input` | Input data file path | `data.csv` |
| `--time-col` | Time column name | `time` |
| `--event-col` | Event column name | `event` |
### Optional Parameters
| Parameter | Description | Default Value |
|------|------|--------|
| `--group-col` | Group column name | `None` |
| `--output` | Output file path | `risk_table.png` |
| `--style` | Journal style | `NEJM` |
| `--time-points` | Time point list | Auto-calculated |
| `--format` | Output format | `png` |
| `--width` | Image width | `8` (inches) |
| `--height` | Image height | `6` (inches) |
| `--dpi` | Image resolution | `300` |
| `--font-size` | Font size | `8` |
| `--show-censored` | Show censored count | `False` |
| `--combine` | Combine with KM curve | `False` |
| `--km-plot` | KM curve image path | `None` |
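The two tables above can be mirrored in an `argparse` definition, which makes the required/optional split and the defaults easy to check. This is an illustrative sketch only: `build_parser` and `parse_time_points` are hypothetical, the comma-separated format for `--time-points` is inferred from the usage examples, and the parser in `scripts/main.py` may differ.

```python
import argparse


def parse_time_points(value: str) -> list:
    """Turn a comma-separated string such as "0,6,12" into a list of floats."""
    return [float(part) for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Risk table options (illustrative)")
    # Required parameters
    parser.add_argument("--input", required=True, help="Input data file path")
    parser.add_argument("--time-col", required=True, help="Time column name")
    parser.add_argument("--event-col", required=True, help="Event column name")
    # Optional parameters with the documented defaults
    parser.add_argument("--group-col", default=None)
    parser.add_argument("--output", default="risk_table.png")
    parser.add_argument("--style", default="NEJM")
    parser.add_argument("--time-points", type=parse_time_points, default=None)
    parser.add_argument("--format", default="png")
    parser.add_argument("--width", type=float, default=8)    # inches
    parser.add_argument("--height", type=float, default=6)   # inches
    parser.add_argument("--dpi", type=int, default=300)
    parser.add_argument("--font-size", type=int, default=8)
    parser.add_argument("--show-censored", action="store_true")
    parser.add_argument("--combine", action="store_true")
    parser.add_argument("--km-plot", default=None)
    return parser


# Usage: parse an explicit argument list instead of sys.argv
args = build_parser().parse_args(
    ["--input", "data.csv", "--time-col", "time", "--event-col", "event",
     "--time-points", "0,6,12"]
)
```

Leaving `--time-points` as `None` when it is omitted lets later code distinguish "auto-calculate" from an explicit list.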
## Output Format
### Standalone Risk Table
```
┌─────────────────────────────────────────────────────────┐
│ Number at risk │
├─────────┬─────┬─────┬─────┬─────┬─────┬─────┬───────────┤
│ Group │ 0 │ 12 │ 24 │ 36 │ 48 │ 60 │ 72 (mo) │
├─────────┼─────┼─────┼─────┼─────┼─────┼─────┼───────────┤
│ Exp │ 150 │ 142 │ 128 │ 105 │ 89 │ 72 │ 58 │
│ Control │ 148 │ 135 │ 118 │ 92 │ 76 │ 61 │ 45 │
└─────────┴─────┴─────┴─────┴─────┴─────┴─────┴───────────┘
```
### Combined Figure Layout
```
┌─────────────────────────────────────┐
│ │
│ Kaplan-Meier Survival Curve │
│ │
│ ━━━━━━━━━ Experimental │
│ ─ ─ ─ ─ ─ Control │
│ │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Number at risk │
│ Exp 150 142 128 105 89 72 │
│ Ctrl 148 135 118 92 76 61 │
│ 0 12 24 36 48 60 │
└─────────────────────────────────────┘
```
## Algorithm Description
### Number at Risk Calculation
```
For each time point t:
  For each group g:
    N_at_risk(t, g) = N_total(g)
                      - Σ(patients with an event at time ≤ t)
                      - Σ(patients censored at time < t)
```
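The counting rule above can be sketched in plain Python. The function follows the pseudocode literally, including its tie handling: a patient with an event exactly at `t` is no longer counted, while a patient censored exactly at `t` still is. Some packages instead count every subject whose observed time is at least `t`, so results can differ at tied times. `number_at_risk` is a hypothetical helper, not the script's implementation.

```python
from collections import defaultdict


def number_at_risk(records, time_points):
    """Count subjects at risk per group.

    records: iterable of (time, event, group) with event 1 = event, 0 = censored.
    Returns {group: [count at each time point]}.
    """
    by_group = defaultdict(list)
    for time, event, group in records:
        by_group[group].append((time, event))

    table = {}
    for group, subjects in by_group.items():
        counts = []
        for t in time_points:
            events = sum(1 for time, event in subjects if event == 1 and time <= t)
            censored = sum(1 for time, event in subjects if event == 0 and time < t)
            counts.append(len(subjects) - events - censored)
        table[group] = counts
    return table


# Usage with the six rows from the CSV example in this document
records = [
    (0, 0, "Experimental"), (3.2, 1, "Experimental"), (5.1, 0, "Experimental"),
    (12.3, 1, "Control"), (18.7, 0, "Control"), (24.0, 1, "Experimental"),
]
table = number_at_risk(records, [0, 12, 24])
# table == {"Experimental": [4, 1, 0], "Control": [2, 2, 0]}
```

A hand-checkable example like this one is also a convenient way to carry out the spot-check of at-risk counts before submission.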
### Time Point Selection Strategy
1. **Auto Mode**: Automatically select equally spaced time points based on data distribution
2. **Fixed Interval**: Select at specified intervals (e.g., every 6 months)
3. **Custom**: User-specified specific time points
4. **Event-driven**: Select based on event occurrence density
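The first two strategies reduce to simple arithmetic on the maximum follow-up time. A minimal sketch, assuming times are in months and that "equally spaced" means dividing the follow-up range into a fixed number of intervals; `fixed_interval_points`, `auto_points`, and the default of six intervals are hypothetical choices.

```python
import math


def fixed_interval_points(max_time, interval):
    """Time points at 0, interval, 2*interval, ... up to max_time."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    n_steps = math.floor(max_time / interval)
    return [i * interval for i in range(n_steps + 1)]


def auto_points(max_time, n_intervals=6):
    """Equally spaced time points from 0 to max_time (inclusive)."""
    if n_intervals <= 0:
        raise ValueError("n_intervals must be positive")
    step = max_time / n_intervals
    return [round(i * step, 2) for i in range(n_intervals + 1)]


# Usage
every_six_months = fixed_interval_points(36, 6)   # [0, 6, 12, 18, 24, 30, 36]
five_intervals = auto_points(60, n_intervals=5)   # [0.0, 12.0, 24.0, 36.0, 48.0, 60.0]
```

Whichever strategy is used, the resulting points should match the tick positions on the curve's X-axis so the table columns line up with the plot.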
## Quality Checklist
- [ ] Time axis precisely aligned with X-axis
- [ ] Number at risk calculations correct (can manually spot-check)
- [ ] Group labels clear and readable
- [ ] Font size meets journal requirements (≥8pt)
- [ ] Image resolution ≥300 DPI
- [ ] Color contrast meets accessibility standards
- [ ] Censoring marks (if present) clearly distinguishable
- [ ] Export format meets submission requirements
## Journal-Specific Notes
### NEJM
- Minimum font size 8pt
- Recommended time unit: months
- Fixed spacing between risk table and curve
### Lancet
- Minimum font size 9pt
- Support multi-group display (max 4 groups)
- Table grid lines optional
### JCO
- Need to label censoring information
- Support risk table + censoring table double-layer structure
- Recommend labeling median follow-up time
## FAQ
### Q: How are time points automatically determined?
A: By default, time points are taken from data quantiles (0%, 25%, 50%, 75%, 100%) or from fixed intervals (e.g., every 12 months).
### Q: How to handle multiple groups?
A: The group column is detected automatically, and up to 6 groups are supported. With more than 6 groups, the table is paginated or rendered in a smaller font.
### Q: Can it work with KM curves generated by Python/R?
A: Yes. External KM curve images can be imported and combined with the risk table (see `--km-plot` and `--combine`).
## Dependency Requirements
```
numpy >= 1.20.0
pandas >= 1.3.0
matplotlib >= 3.4.0
seaborn >= 0.11.0
lifelines >= 0.27.0 (optional, for survival analysis)
Pillow >= 8.0.0 (image processing)
```
## Related Tools
- lifelines: Python survival analysis library
- survminer: R package for survival curve visualization
- ggsurvplot: survminer's ggplot2-based plotting function, which can draw a risk table under the curve
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
## Prerequisites
```text
# Python dependencies
pip install -r requirements.txt
```
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
  - Performance optimization
  - Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `survival-curve-risk-table` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `survival-curve-risk-table` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
## Inputs to Collect
- Required inputs: the user goal, the primary data or source file, and the requested output format.
- Optional inputs: output directory, formatting preferences, and validation constraints.
- If a required input is unavailable, return a short clarification request before continuing.
## Output Contract
- Return a short summary, the main deliverables, and any assumptions that materially affect interpretation.
- If execution is partial, label what succeeded, what failed, and the next safe recovery step.
- Keep the final answer within the documented scope of the skill.
## Validation and Safety Rules
- Validate identifiers, file paths, and user-provided parameters before execution.
- Do not fabricate results, metrics, citations, or downstream conclusions.
- Use safe fallback behavior when dependencies, credentials, or required inputs are missing.
- Surface any execution failure with a concise diagnosis and recovery path.
FILE:README.md
# Survival Curve Risk Table Generator
Automatically aligns and adds "Number at risk" tables below Kaplan-Meier survival curves, following clinical oncology journal conventions (NEJM, Lancet, JCO).
## Install
```bash
pip install -r requirements.txt
```
## Quick Start
### 1. Create sample data
```bash
python scripts/main.py --create-sample-data sample_data.csv
```
### 2. Generate risk table
```bash
python scripts/main.py \
--input sample_data.csv \
--time-col time \
--event-col event \
--group-col treatment \
--style NEJM \
--output risk_table.png
```
### 3. Generate combination chart (KM curve + risk table)
```bash
python scripts/main.py \
--input sample_data.csv \
--time-col time \
--event-col event \
--group-col treatment \
--combine \
--output combined_figure.png
```
## Command line parameters
| Parameter | Description | Example |
|------|------|------|
| `--input` | Input data file path | `data.csv` |
| `--time-col` | Time column name | `time` |
| `--event-col` | Event column name (1=event, 0=censored) | `event` |
| `--group-col` | Grouping column name | `treatment` |
| `--output` | Output file path | `risk_table.png` |
| `--style` | Journal style (NEJM/Lancet/JCO) | `NEJM` |
| `--time-points` | Custom comma-separated time points | `0,6,12,18,24,30` |
| `--combine` | Generate combined figure (KM curve + risk table) | - |
| `--show-censored` | Show censored counts | - |
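The `--time-points` value is a single comma-separated string that has to be turned into numbers before use. A minimal sketch of that parsing step follows; the function name `parse_time_points` is hypothetical, and the sorting, de-duplication, and non-negativity check are assumptions added for illustration rather than documented behavior of the script.

```python
from typing import List


def parse_time_points(raw: str) -> List[float]:
    """Parse a comma-separated string such as "0,6,12" into sorted floats."""
    # float() raises ValueError on non-numeric parts, which surfaces bad input.
    points = [float(part.strip()) for part in raw.split(",") if part.strip()]
    if not points:
        raise ValueError("no time points given")
    if any(p < 0 for p in points):
        raise ValueError("time points must be non-negative")
    return sorted(set(points))


print(parse_time_points("0,6,12,18,24,30"))
# [0.0, 6.0, 12.0, 18.0, 24.0, 30.0]
```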
## Input data format
CSV format example:
```csv
time,event,treatment
12.5,1,Experimental
18.3,0,Experimental
24.0,1,Control
...
```
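Before running the generator it can help to confirm that a file matches this layout. The sketch below checks the required columns and the 0/1 event coding using only the standard library; `validate_survival_rows` is a hypothetical helper and the default column names are taken from the example above.

```python
import csv
import io
from typing import Dict, List


def validate_survival_rows(
    csv_text: str, time_col: str = "time", event_col: str = "event"
) -> List[Dict[str, str]]:
    """Return parsed rows, raising ValueError on missing columns or bad values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    missing = [c for c in (time_col, event_col) if c not in fieldnames]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        time_value = float(row[time_col])  # ValueError if not numeric
        if time_value < 0:
            raise ValueError(f"line {line_no}: time must be non-negative")
        if row[event_col] not in ("0", "1"):
            raise ValueError(f"line {line_no}: event must be 0 or 1")
        rows.append(row)
    return rows


sample = (
    "time,event,treatment\n"
    "12.5,1,Experimental\n"
    "18.3,0,Experimental\n"
    "24.0,1,Control\n"
)
print(len(validate_survival_rows(sample)))  # 3
```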
## Python API
```python
from scripts.main import RiskTableGenerator
# Initialize the generator
generator = RiskTableGenerator(style="NEJM")
# Load data
generator.load_data_from_file(
    "data.csv",
    time_col="time",
    event_col="event",
    group_col="treatment"
)
# Generate risk table
generator.generate_risk_table("risk_table.png")
# Generate combination diagram
generator.generate_combined_plot(output_path="combined.png")
```
## Journal style
- **NEJM**: New England Journal of Medicine style
- **Lancet**: The Lancet style
- **JCO**: Journal of Clinical Oncology style
## License
MIT License
FILE:references/runtime_checklist.md
# Runtime Checklist
- Category: `Data Analysis`
- Validate the user goal, required inputs, and output format before taking action.
- Ask a targeted clarification question when a required input is missing.
- Keep the response scoped to the documented workflow and state assumptions explicitly.
- Run a non-destructive smoke check before any file-dependent or data-dependent command.
- Recommended smoke check: `python -m py_compile scripts/main.py`
- If execution fails, stop and return a concise recovery path instead of fabricating results.
FILE:requirements.txt
lifelines
matplotlib
numpy
pandas
pillow
seaborn
FILE:scripts/main.py
#!/usr/bin/env python3
"""Survival Curve Risk Table Generator
Automatically align and add "Number at risk" table below Kaplan-Meier survival curve
Comply with clinical oncology journal standards (NEJM, Lancet, JCO, etc.)
Author: OpenClaw
Version: 1.0.0"""
import argparse
import sys
import warnings
from pathlib import Path
from typing import List, Optional, Tuple, Dict, Union
import json

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.gridspec import GridSpec

# Try importing optional dependencies
try:
    from PIL import Image
    HAS_PIL = True
except ImportError:
    HAS_PIL = False
    warnings.warn("PIL not installed. Image combining features disabled.")

try:
    from lifelines import KaplanMeierFitter
    from lifelines.statistics import logrank_test
    HAS_LIFELINES = True
except ImportError:
    HAS_LIFELINES = False
    warnings.warn("lifelines not installed. Survival analysis features limited.")


class RiskTableGenerator:
    """Survival Curve Risk Table Generator
    Main functions:
    1. Calculate the number of people at risk at each time point from survival data
    2. Generate tables that comply with journal standards
    3. Combine with KM curve to generate publication-level images"""

    # Predefined journal style configurations
    JOURNAL_STYLES = {
        "NEJM": {
            "font_family": "Helvetica",
            "font_size": 8,
            "table_height_ratio": 0.15,
            "show_grid": False,
            "separator_lines": True,
            "header_bold": True,
            "time_label": "mo",
        },
        "Lancet": {
            "font_family": "Times New Roman",
            "font_size": 9,
            "table_height_ratio": 0.18,
            "show_grid": True,
            "separator_lines": True,
            "header_bold": True,
            "time_label": "months",
        },
        "JCO": {
            "font_family": "Arial",
            "font_size": 8,
            "table_height_ratio": 0.16,
            "show_grid": False,
            "separator_lines": False,
            "header_bold": False,
            "time_label": "mo",
            "show_censored": True,
            "censor_symbol": "+",
        },
        "custom": {
            "font_family": "Arial",
            "font_size": 8,
            "table_height_ratio": 0.15,
            "show_grid": False,
            "separator_lines": True,
            "header_bold": True,
            "time_label": "mo",
        }
    }

    def __init__(
        self,
        style: str = "NEJM",
        time_points: Optional[List[float]] = None,
        figure_size: Tuple[float, float] = (8, 6),
        dpi: int = 300,
        custom_style: Optional[Dict] = None
    ):
        """Initialize risk table generator
        Args:
            style: journal style (NEJM, Lancet, JCO, custom)
            time_points: Custom time point list
            figure_size: image size (width, height) inches
            dpi: image resolution
            custom_style: style overrides applied on top of the selected style
                (for example font_size from the --font-size option)"""
        self.style_name = style
        self.style = self.JOURNAL_STYLES.get(style, self.JOURNAL_STYLES["NEJM"]).copy()
        # Apply overrides for every style so that --font-size is honored
        # even when a journal preset such as NEJM is selected.
        if custom_style:
            self.style.update(custom_style)
        self.time_points = time_points
        self.figure_size = figure_size
        self.dpi = dpi
        self.data = None
        self.time_col = None
        self.event_col = None
        self.group_col = None
        self.groups = None

    def load_data(
        self,
        df: pd.DataFrame,
        time_col: str,
        event_col: str,
        group_col: Optional[str] = None
    ) -> None:
        """Load survival data
        Args:
            df: survival data DataFrame
            time_col: time column name
            event_col: event column name (1=event, 0=censored)
            group_col: grouping column name (optional)"""
        self.data = df.copy()
        self.time_col = time_col
        self.event_col = event_col
        self.group_col = group_col
        # Validate required columns
        required_cols = [time_col, event_col]
        missing_cols = [col for col in required_cols if col not in df.columns]
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")
        if group_col and group_col not in df.columns:
            raise ValueError(f"Group column '{group_col}' not found in data")
        # Extract group information
        if group_col:
            self.groups = df[group_col].unique().tolist()
        else:
            self.groups = ["All"]
            self.data["__group__"] = "All"
            self.group_col = "__group__"
        # Data type checking
        if not pd.api.types.is_numeric_dtype(self.data[time_col]):
            raise ValueError(f"Time column '{time_col}' must be numeric")
        # Make sure the event column is numeric
        self.data[event_col] = pd.to_numeric(self.data[event_col], errors='coerce')
        # Automatically determine time points (if not specified)
        if self.time_points is None:
            self._auto_select_time_points()

    def load_data_from_file(
        self,
        file_path: str,
        time_col: str,
        event_col: str,
        group_col: Optional[str] = None
    ) -> None:
        """Load survival data from file
        Supported formats: CSV, Excel, SAS, pickle"""
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")
        suffix = path.suffix.lower()
        if suffix == '.csv':
            df = pd.read_csv(file_path)
        elif suffix in ['.xlsx', '.xls']:
            df = pd.read_excel(file_path)
        elif suffix == '.sas7bdat':
            df = pd.read_sas(file_path)
        elif suffix in ['.pkl', '.pickle']:
            df = pd.read_pickle(file_path)
        else:
            raise ValueError(f"Unsupported file format: {suffix}")
        self.load_data(df, time_col, event_col, group_col)

    def _auto_select_time_points(self, n_points: int = 7) -> None:
        """Automatically select time points
        Strategy: Use equally spaced time points to cover the 0 to maximum follow-up time of the data"""
        if self.data is None:
            raise ValueError("No data loaded")
        max_time = self.data[self.time_col].max()
        # Generate equidistant time points (including 0 and the maximum value)
        self.time_points = np.linspace(0, max_time, n_points).tolist()
        self.time_points = [round(t, 1) for t in self.time_points]

    def calculate_number_at_risk(self) -> pd.DataFrame:
        """Calculate the number of people at risk at each time point
        Returns:
            DataFrame with one row per group and one column per time point;
            each value is the number of people still at risk at that time"""
        if self.data is None:
            raise ValueError("No data loaded")
        results = []
        for group in self.groups:
            group_data = self.data[self.data[self.group_col] == group]
            n_total = len(group_data)
            row = {"Group": group}
            for t in self.time_points:
                # Number at risk = total - events strictly before t - censored strictly before t.
                # Subjects whose event or censoring time equals t are still at risk at t,
                # so both comparisons use "<" (the original mixed "<=" and "<").
                events_before = len(group_data[
                    (group_data[self.time_col] < t) &
                    (group_data[self.event_col] == 1)
                ])
                censored_before = len(group_data[
                    (group_data[self.time_col] < t) &
                    (group_data[self.event_col] == 0)
                ])
                n_at_risk = n_total - events_before - censored_before
                row[f"t_{t}"] = max(0, n_at_risk)
            results.append(row)
        return pd.DataFrame(results)

    def calculate_censored_counts(self) -> pd.DataFrame:
        """Calculate the censored number of people in each time interval (for JCO style)"""
        if self.data is None:
            raise ValueError("No data loaded")
        results = []
        for group in self.groups:
            group_data = self.data[self.data[self.group_col] == group]
            row = {"Group": group}
            for i, t in enumerate(self.time_points):
                if i == 0:
                    censored_in_interval = 0
                else:
                    t_prev = self.time_points[i - 1]
                    censored_in_interval = len(group_data[
                        (group_data[self.time_col] > t_prev) &
                        (group_data[self.time_col] <= t) &
                        (group_data[self.event_col] == 0)
                    ])
                row[f"t_{t}"] = censored_in_interval
            results.append(row)
        return pd.DataFrame(results)

    def generate_risk_table(
        self,
        output_path: str,
        show_censored: Optional[bool] = None,
        show_events: bool = False,
        title: Optional[str] = None
    ) -> None:
        """Generate a standalone risk table image
        Args:
            output_path: output file path
            show_censored: whether to display the censored number of people (default is read from the style configuration)
            show_events: whether to display the number of events
            title: table title"""
        if self.data is None:
            raise ValueError("No data loaded. Call load_data() first.")
        if show_censored is None:
            show_censored = self.style.get("show_censored", False)
        # Compute the table values
        risk_df = self.calculate_number_at_risk()
        # Set font
        plt.rcParams['font.family'] = self.style.get("font_family", "Arial")
        # Create the figure
        fig, ax = plt.subplots(figsize=self.figure_size, dpi=self.dpi)
        ax.axis('off')
        # Prepare tabular data
        table_data = self._prepare_table_data(risk_df, show_censored)
        # Draw the table
        self._draw_risk_table(ax, table_data, title)
        # Save
        plt.tight_layout()
        plt.savefig(output_path, dpi=self.dpi, bbox_inches='tight',
                    facecolor='white', edgecolor='none')
        plt.close()
        print(f"Risk table saved to: {output_path}")

    def _prepare_table_data(
        self,
        risk_df: pd.DataFrame,
        show_censored: bool
    ) -> List[List[str]]:
        """Prepare tabular data"""
        # Header. Format with ":g" so fractional time points (e.g. 7.5) are not
        # truncated to integers, which could produce duplicate column labels.
        time_label = self.style.get("time_label", "mo")
        headers = ["Number at risk"]
        for t in self.time_points:
            if t == self.time_points[-1]:
                headers.append(f"{t:g} ({time_label})")
            else:
                headers.append(f"{t:g}")
        # Data rows
        rows = []
        for _, row in risk_df.iterrows():
            group_name = row["Group"]
            data_row = [group_name]
            for t in self.time_points:
                data_row.append(str(int(row[f"t_{t}"])))
            rows.append(data_row)
        # Optionally add censored counts (JCO style)
        if show_censored:
            censored_df = self.calculate_censored_counts()
            symbol = self.style.get('censor_symbol', '+')
            for _, row in censored_df.iterrows():
                group_name = row["Group"]
                # Include the group name so each censoring row can be matched to its group
                data_row = [f"  {group_name} ({symbol})"]
                for t in self.time_points:
                    count = int(row[f"t_{t}"])
                    data_row.append(str(count) if count > 0 else "")
                rows.append(data_row)
        return [headers] + rows

    def _draw_risk_table(
        self,
        ax,
        table_data: List[List[str]],
        title: Optional[str] = None
    ) -> None:
        """Draw a risk table"""
        font_size = self.style.get("font_size", 8)
        show_grid = self.style.get("show_grid", False)
        header_bold = self.style.get("header_bold", True)
        # Create table
        table = ax.table(
            cellText=table_data[1:],
            colLabels=table_data[0],
            loc='center',
            cellLoc='center'
        )
        table.auto_set_font_size(False)
        table.set_fontsize(font_size)
        table.scale(1, 2)
        # Set style
        for i, key in enumerate(table.get_celld().keys()):
            cell = table.get_celld()[key]
            row, col = key
            # Header style
            if row == 0:
                cell.set_text_props(fontweight='bold' if header_bold else 'normal')
                cell.set_facecolor('#f0f0f0')
                cell.set_edgecolor('black' if show_grid else 'none')
            else:
                # Data row style
                cell.set_edgecolor('black' if show_grid else 'none')
            # The first column (group name) is left aligned
            if col == 0:
                cell.set_text_props(ha='left')
                cell._loc = 'left'
        # Add title
        if title:
            ax.set_title(title, fontsize=font_size + 2, fontweight='bold', pad=10)

    def generate_combined_plot(
        self,
        km_plot_path: Optional[str] = None,
        output_path: str = "combined_survival_plot.png",
        km_ax: Optional[matplotlib.axes.Axes] = None,
        show_km_plot: bool = True,
        km_title: Optional[str] = None
    ) -> None:
        """Generate a combined figure of KM curve and risk table
        Args:
            km_plot_path: external KM curve image path (optional)
            output_path: output file path
            km_ax: pre-generated KM curve axes (optional)
            show_km_plot: whether to display KM curve
            km_title: KM curve title"""
        if not HAS_LIFELINES and km_ax is None and km_plot_path is None:
            raise ImportError("lifelines is required for generating KM plots. "
                              "Install with: pip install lifelines")
        # Set font
        plt.rcParams['font.family'] = self.style.get("font_family", "Arial")
        # Create the figure layout
        fig_height = self.figure_size[1]
        table_height_ratio = self.style.get("table_height_ratio", 0.15)
        if show_km_plot:
            fig = plt.figure(figsize=(self.figure_size[0], fig_height), dpi=self.dpi)
            gs = GridSpec(2, 1, height_ratios=[1 - table_height_ratio, table_height_ratio],
                          hspace=0.05)
            # Upper part: KM curve
            ax_km = fig.add_subplot(gs[0])
            if km_plot_path and Path(km_plot_path).exists():
                # Use external images
                img = plt.imread(km_plot_path)
                ax_km.imshow(img)
                ax_km.axis('off')
            elif km_ax is not None:
                # Use the provided axes (complicated, not implemented yet)
                pass
            else:
                # Automatically generate KM curve
                self._plot_km_curve(ax_km, km_title)
            # Lower part: risk table
            ax_table = fig.add_subplot(gs[1])
            ax_table.axis('off')
            # Prepare and draw tables
            risk_df = self.calculate_number_at_risk()
            show_censored = self.style.get("show_censored", False)
            table_data = self._prepare_table_data(risk_df, show_censored)
            self._draw_risk_table(ax_table, table_data)
            # Align X axis
            if hasattr(self, '_km_xlim'):
                ax_table.set_xlim(self._km_xlim)
        else:
            # Generate risk table only
            self.generate_risk_table(output_path)
            return
        plt.savefig(output_path, dpi=self.dpi, bbox_inches='tight',
                    facecolor='white', edgecolor='none')
        plt.close()
        print(f"Combined plot saved to: {output_path}")

    def _plot_km_curve(
        self,
        ax: matplotlib.axes.Axes,
        title: Optional[str] = None
    ) -> None:
        """Use lifelines to draw KM curve"""
        if not HAS_LIFELINES:
            raise ImportError("lifelines is required for generating KM plots")
        colors = plt.cm.tab10(np.linspace(0, 1, len(self.groups)))
        for i, group in enumerate(self.groups):
            group_data = self.data[self.data[self.group_col] == group]
            kmf = KaplanMeierFitter()
            kmf.fit(
                durations=group_data[self.time_col],
                event_observed=group_data[self.event_col],
                label=group
            )
            kmf.plot_survival_function(
                ax=ax,
                ci_show=False,
                color=colors[i],
                linewidth=2
            )
        ax.set_xlabel(f"Time ({self.style.get('time_label', 'months')})", fontsize=10)
        ax.set_ylabel("Survival Probability", fontsize=10)
        ax.set_ylim(0, 1.05)
        ax.legend(loc='lower left', frameon=False)
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        if title:
            ax.set_title(title, fontsize=12, fontweight='bold')
        # Save X-axis range for alignment
        self._km_xlim = ax.get_xlim()

    def export_risk_table_data(self, output_path: str) -> None:
        """Export risk table data to CSV"""
        risk_df = self.calculate_number_at_risk()
        # Rename columns to friendlier names; ":g" keeps fractional time points
        # distinct instead of truncating them to the same integer label.
        time_label = self.style.get("time_label", "mo")
        col_map = {"Group": "Group"}
        for t in self.time_points:
            col_map[f"t_{t}"] = f"{t:g} {time_label}"
        risk_df = risk_df.rename(columns=col_map)
        risk_df.to_csv(output_path, index=False)
        print(f"Risk table data exported to: {output_path}")


def create_sample_data(output_path: str, n_patients: int = 300, seed: int = 42) -> None:
    """Create sample survival data for testing"""
    np.random.seed(seed)
    # Generate two groups of equal size
    n_per_group = n_patients // 2
    # Experimental group: better survival
    exp_times = np.random.exponential(scale=36, size=n_per_group)
    exp_events = np.random.binomial(1, 0.6, n_per_group)
    # Control group
    ctrl_times = np.random.exponential(scale=24, size=n_per_group)
    ctrl_events = np.random.binomial(1, 0.7, n_per_group)
    # Combine both groups
    df = pd.DataFrame({
        'time': np.concatenate([exp_times, ctrl_times]),
        'event': np.concatenate([exp_events, ctrl_events]),
        'treatment': ['Experimental'] * n_per_group + ['Control'] * n_per_group
    })
    # Administrative censoring: follow-up beyond 60 months is censored at 60
    df.loc[df['time'] > 60, 'event'] = 0
    df.loc[df['time'] > 60, 'time'] = 60
    df.to_csv(output_path, index=False)
    print(f"Sample data created: {output_path}")


def main():
    """Command line entry"""
    parser = argparse.ArgumentParser(
        description="Generate Number at Risk table for Kaplan-Meier survival curves",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""Examples:
  # Basic usage
  python main.py --input data.csv --time-col time --event-col event --output risk_table.png
  # Specify journal style
  python main.py --input data.csv --time-col time --event-col event --style NEJM --output figure.pdf
  # Combined figure (automatically generate KM curve)
  python main.py --input data.csv --time-col time --event-col event --combine --output combined.png
  # Specify time points
  python main.py --input data.csv --time-col time --event-col event --time-points 0,6,12,18,24,30,36
  # Create sample data
  python main.py --create-sample-data sample.csv"""
    )
    # Input parameters
    parser.add_argument('--input', '-i', type=str, help='Input data file path')
    parser.add_argument('--time-col', type=str, help='Time column name')
    parser.add_argument('--event-col', type=str, help='Event column name (1=event, 0=censored)')
    parser.add_argument('--group-col', type=str, help='Group column name (optional)')
    # Output parameters
    parser.add_argument('--output', '-o', type=str, default='risk_table.png',
                        help='Output file path (default: risk_table.png)')
    parser.add_argument('--output-dir', type=str, help='Output directory for batch processing')
    # Style parameters
    parser.add_argument('--style', type=str, default='NEJM',
                        choices=['NEJM', 'Lancet', 'JCO', 'custom'],
                        help='Journal style (default: NEJM)')
    parser.add_argument('--time-points', type=str,
                        help='Comma-separated time points (e.g., 0,6,12,18,24)')
    # Image parameters
    parser.add_argument('--width', type=float, default=8, help='Figure width in inches (default: 8)')
    parser.add_argument('--height', type=float, default=6, help='Figure height in inches (default: 6)')
    parser.add_argument('--dpi', type=int, default=300, help='DPI resolution (default: 300)')
    parser.add_argument('--font-size', type=int, help='Font size (overrides style default)')
    # Feature switches
    parser.add_argument('--combine', action='store_true',
                        help='Generate combined KM plot with risk table')
    parser.add_argument('--km-plot', type=str, help='Path to existing KM plot image')
    parser.add_argument('--show-censored', action='store_true',
                        help='Show censored counts in table')
    parser.add_argument('--show-events', action='store_true',
                        help='Show event counts in table')
    parser.add_argument('--export-data', action='store_true',
                        help='Export risk table data as CSV')
    # Tools
    parser.add_argument('--create-sample-data', type=str, metavar='PATH',
                        help='Create sample survival data for testing')
    args = parser.parse_args()
    # Create sample data
    if args.create_sample_data:
        create_sample_data(args.create_sample_data)
        return
    # Validate required parameters
    if not args.input:
        parser.error("--input is required (or use --create-sample-data to generate test data)")
    if not args.time_col or not args.event_col:
        parser.error("--time-col and --event-col are required")
    # Parse time points
    time_points = None
    if args.time_points:
        time_points = [float(x.strip()) for x in args.time_points.split(',')]
    # Custom style
    custom_style = {}
    if args.font_size:
        custom_style['font_size'] = args.font_size
    # Initialize the generator
    generator = RiskTableGenerator(
        style=args.style,
        time_points=time_points,
        figure_size=(args.width, args.height),
        dpi=args.dpi,
        custom_style=custom_style if custom_style else None
    )
    # Load data
    print(f"Loading data from: {args.input}")
    generator.load_data_from_file(
        args.input,
        args.time_col,
        args.event_col,
        args.group_col
    )
    print(f"Loaded {len(generator.data)} records")
    print(f"Groups: {generator.groups}")
    print(f"Time points: {generator.time_points}")
    # Export data
    if args.export_data:
        # Swap the extension with pathlib so outputs such as .svg or .jpg do not
        # resolve to the same path as the image and get overwritten.
        data_output = str(Path(args.output).with_suffix('.csv'))
        generator.export_risk_table_data(data_output)
    # Generate image
    if args.combine:
        print("Generating combined plot...")
        generator.generate_combined_plot(
            km_plot_path=args.km_plot,
            output_path=args.output
        )
    else:
        print("Generating risk table...")
        generator.generate_risk_table(
            output_path=args.output,
            # Pass None when the flag is absent so the style default
            # (e.g. JCO shows censored counts) still applies.
            show_censored=args.show_censored or None,
            show_events=args.show_events
        )
    print("Done!")


if __name__ == "__main__":
    main()
Use study limitations drafter for academic writing workflows that need structured execution, explicit assumptions, and clear output boundaries.
---
name: study-limitations-drafter
description: Use study limitations drafter for academic writing workflows that need structured execution, explicit assumptions, and clear output boundaries.
license: MIT
skill-author: AIPOCH
---
# Study Limitations Drafter
Professional limitation statement generator.
## When to Use
- Use this skill when a manuscript, grant proposal, or peer-review response needs a professionally worded study limitations statement.
- Use this skill for academic writing tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
- Scope-focused workflow for drafting study limitation statements with structured execution, explicit assumptions, and clear output boundaries.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
See `## Prerequisites` below for related details.
- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `Third-party packages`: `not explicitly version-pinned in this skill package`. Add pinned versions if this skill needs stricter environment control.
## Example Usage
```bash
cd "20260318/scientific-skills/Academic Writing/study-limitations-drafter"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Use Cases
- Manuscript limitation sections
- Grant proposal risk assessment
- Peer review responses
## Parameters
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `limitations` | list[str] | Yes | - | List of study design constraints |
| `severity` | str | No | "minor" | Impact level: "minor" or "major" |
| `mitigation` | list[str] | No | - | Steps taken to address each limitation |
## Returns
- Professionally worded limitation paragraphs
- Balanced tone (honest but not self-defeating)
- Mitigation strategies
## Example
"While the single-center design limits generalizability..."
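To make the parameters above concrete, here is a minimal sketch of how `limitations`, `severity`, and `mitigation` could be combined into one paragraph. The function `draft_limitations`, its opener sentences, and the pairing of each mitigation with the limitation at the same index are assumptions for illustration; the packaged `scripts/main.py` may assemble its output differently.

```python
from typing import List, Optional


def draft_limitations(
    limitations: List[str],
    severity: str = "minor",
    mitigation: Optional[List[str]] = None,
) -> str:
    """Join limitation statements, each followed by its mitigation if given."""
    if not limitations:
        raise ValueError("limitations must contain at least one item")
    if severity not in ("minor", "major"):
        raise ValueError('severity must be "minor" or "major"')
    mitigation = mitigation or []
    if severity == "minor":
        opener = "This study has several limitations."
    else:
        opener = ("This study has important limitations that should be "
                  "considered when interpreting the findings.")
    sentences = [opener]
    for index, item in enumerate(limitations):
        sentence = item.rstrip(".") + "."
        # Pair each limitation with the mitigation at the same position, if any.
        if index < len(mitigation):
            sentence += f" To address this, {mitigation[index].rstrip('.')}."
        sentences.append(sentence)
    return " ".join(sentences)


print(draft_limitations(
    ["The single-center design limits generalizability"],
    mitigation=["we report site characteristics in detail"],
))
```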
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
## Prerequisites
No additional Python packages required.
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
  - Performance optimization
  - Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `study-limitations-drafter` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `study-limitations-drafter` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## References
- [references/audit-reference.md](references/audit-reference.md) - Supported scope, audit commands, and fallback boundaries
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:references/audit-reference.md
# Audit Reference
## Scope
- Skill: `study-limitations-drafter`
- Core purpose: Use study limitations drafter for academic writing workflows that need structured execution, explicit assumptions, and clear output boundaries.
- Use only within the documented workflow and category boundary defined in `SKILL.md`
## Supported Audit Paths
- `python -m py_compile scripts/main.py`
- `python scripts/main.py --help`
## Fallback Boundary
If required inputs are incomplete, the skill should still return:
- the missing required inputs
- the steps that can still be completed safely
- assumptions that need confirmation before execution
- the next checks before accepting the final deliverable
FILE:scripts/main.py
#!/usr/bin/env python3
"""
Study Limitations Drafter
Transform study design flaws into professional limitation statements.
"""
import argparse
class LimitationsDrafter:
    """Draft study limitations sections."""
    LIMITATION_TEMPLATES = {
        "sample_size": {
            "issue": "Small sample size",
            "statement": "The sample size in this study was limited, which may have reduced statistical power to detect smaller effects."
        },
        "retrospective": {
            "issue": "Retrospective design",
            "statement": "The retrospective nature of this study introduces potential selection bias and limits causal inference."
        },
        "single_center": {
            "issue": "Single-center study",
            "statement": "Findings from this single-center study may not generalize to other populations or healthcare settings."
        },
        "short_followup": {
            "issue": "Short follow-up",
            "statement": "The relatively short follow-up period limits assessment of long-term outcomes and delayed effects."
        },
        "missing_data": {
            "issue": "Missing data",
            "statement": "Missing data for some variables may have introduced bias, though we employed appropriate imputation methods."
        }
    }
    def generate_limitations(self, issues):
        """Generate limitations section."""
        sections = []
        sections.append("STUDY LIMITATIONS")
        sections.append("-"*60)
        sections.append("")
        for issue in issues:
            if issue in self.LIMITATION_TEMPLATES:
                template = self.LIMITATION_TEMPLATES[issue]
                sections.append(f"• {template['statement']}")
                sections.append("")
        sections.append("Despite these limitations, this study provides valuable insights...")
        return "\n".join(sections)
def main():
    parser = argparse.ArgumentParser(description="Study Limitations Drafter")
    parser.add_argument("--issues", "-i", required=True,
                        help="Comma-separated limitation types")
    parser.add_argument("--output", "-o", help="Output file")
    args = parser.parse_args()
    drafter = LimitationsDrafter()
    issues = [i.strip() for i in args.issues.split(",")]
    text = drafter.generate_limitations(issues)
    print(text)
    if args.output:
        with open(args.output, 'w') as f:
            f.write(text)
        print(f"\nSaved to: {args.output}")
if __name__ == "__main__":
    main()
Recommends appropriate statistical methods (T-test vs ANOVA, etc.) based.
---
name: statistical-analysis-advisor
description: Recommends appropriate statistical methods (T-test vs ANOVA, etc.) based.
license: MIT
skill-author: AIPOCH
---
# Statistical Analysis Advisor
Intelligent statistical test recommendation engine that guides users through selecting the right statistical methods for their data.
## When to Use
- Use this skill when the task needs a recommendation of appropriate statistical methods (T-test vs ANOVA, etc.).
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
- Scope-focused workflow aligned to recommending appropriate statistical methods (T-test vs ANOVA, etc.).
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
See `## Prerequisites` below for related details.
- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `dataclasses`: `unspecified`. Declared in `requirements.txt`.
- `enum`: `unspecified`. Declared in `requirements.txt`.
## Example Usage
See `## Usage` below for related details.
```bash
cd "20260318/scientific-skills/Data Analytics/statistical-analysis-advisor"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Capabilities
1. **Statistical Test Selection**
   - Compares and recommends between T-test, ANOVA, Chi-square, Mann-Whitney, Kruskal-Wallis, etc.
   - Considers data type, distribution, sample size, and research question
   - Provides decision tree logic for test selection
2. **Assumption Checking**
   - Normality tests (Shapiro-Wilk, Kolmogorov-Smirnov)
   - Homogeneity of variance (Levene's test, Bartlett's test)
   - Independence verification
   - Outlier detection guidance
3. **Power Analysis & Sample Size**
   - Effect size estimation (Cohen's d, eta-squared, Cramér's V)
   - Sample size calculations for desired power
   - Post-hoc power analysis
## Usage
```python
from scripts.main import StatisticalAdvisor
advisor = StatisticalAdvisor()
# Get test recommendation
recommendation = advisor.recommend_test(
data_type="continuous",
groups=2,
independent=True,
distribution="normal"
)
# Check assumptions
assumptions = advisor.check_assumptions(
data=[group1, group2],
test_type="independent_ttest"
)
# Power analysis
power = advisor.calculate_power(
effect_size=0.5,
alpha=0.05,
sample_size=30
)
```
## Input Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| data_type | str | "continuous", "categorical", "ordinal" |
| groups | int | Number of groups/comparison levels |
| independent | bool | Independent or paired/related samples |
| distribution | str | "normal", "non-normal", "unknown" |
| sample_size | int | Current or planned sample size |
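
The parameter table above can be checked before any recommendation is attempted. The following is a minimal sketch of such a check, not the packaged implementation: `validate_request` and the two `ALLOWED_*` sets are hypothetical names introduced here, and treating `sample_size` as optional is an assumption based on the `recommend_test` call shown in the usage example.

```python
# Hypothetical pre-flight check for the documented input parameters.
# Allowed values are copied from the parameter table; names are illustrative.
ALLOWED_DATA_TYPES = {"continuous", "categorical", "ordinal"}
ALLOWED_DISTRIBUTIONS = {"normal", "non-normal", "unknown"}
REQUIRED_FIELDS = ("data_type", "groups", "independent", "distribution")


def validate_request(params):
    """Return a list of problems; an empty list means the request can proceed."""
    problems = []
    # Report missing fields first so the caller can ask for exactly those
    for field in REQUIRED_FIELDS:
        if field not in params:
            problems.append(f"missing: {field}")
    if "data_type" in params and params["data_type"] not in ALLOWED_DATA_TYPES:
        problems.append("invalid: data_type")
    if "distribution" in params and params["distribution"] not in ALLOWED_DISTRIBUTIONS:
        problems.append("invalid: distribution")
    if "groups" in params and not (isinstance(params["groups"], int) and params["groups"] >= 1):
        problems.append("invalid: groups")
    if "independent" in params and not isinstance(params["independent"], bool):
        problems.append("invalid: independent")
    # sample_size is assumed optional; validate it only when supplied
    if "sample_size" in params and not (
        isinstance(params["sample_size"], int) and params["sample_size"] > 0
    ):
        problems.append("invalid: sample_size")
    return problems


ok_request = {"data_type": "continuous", "groups": 2,
              "independent": True, "distribution": "normal"}
bad_request = {"data_type": "nominal", "groups": 2, "independent": True}
print(validate_request(ok_request))
print(validate_request(bad_request))
```

Returning a list of named problems, rather than raising on the first one, fits the error-handling rule of stating exactly which fields are missing.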
## Technical Difficulty: High ⚠️
**Warning**: Statistical recommendations have significant implications for research validity. This skill requires human verification of all recommendations before application in published research.
## References
- See `references/statistical_tests_guide.md` for detailed test selection criteria
- See `references/assumption_tests.md` for assumption checking procedures
- See `references/power_analysis_guide.md` for power calculation methods
## Limitations
- Does not perform actual data analysis (recommendations only)
- Cannot access raw data directly
- Complex multivariate designs may require specialized consultation
- Bayesian alternatives not covered comprehensively
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
## Prerequisites
```text
# Python dependencies
pip install -r requirements.txt
```
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
  - Performance optimization
  - Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `statistical-analysis-advisor` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `statistical-analysis-advisor` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:references/assumption_tests.md
# Assumption Checking Procedures
## Overview
Statistical tests rely on specific assumptions. This guide provides systematic procedures for checking these assumptions and actions to take when assumptions are violated.
## 1. Normality Assumption
### 1.1 Visual Methods
#### Histogram with Normal Overlay
```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
def plot_normality_check(data, title="Normality Check"):
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    # Histogram
    axes[0].hist(data, bins='auto', density=True, alpha=0.7)
    mu, sigma = np.mean(data), np.std(data)
    x = np.linspace(min(data), max(data), 100)
    axes[0].plot(x, stats.norm.pdf(x, mu, sigma), 'r-', lw=2)
    axes[0].set_title(f'Histogram (μ={mu:.2f}, σ={sigma:.2f})')
    # Q-Q Plot
    stats.probplot(data, dist="norm", plot=axes[1])
    axes[1].set_title('Q-Q Plot')
    plt.suptitle(title)
    plt.tight_layout()
    return fig
```
#### Q-Q Plot Interpretation
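With theoretical quantiles on the x-axis (as `stats.probplot` draws them):
- **Points on diagonal line**: Consistent with a normal distribution
- **S-shape**: Tails heavier or lighter than normal
- **U-shape (convex curve)**: Right skew
- **Inverted U (concave curve)**: Left skew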
### 1.2 Statistical Tests
#### Shapiro-Wilk Test
**Best for:** n < 50
```python
from scipy import stats
statistic, p_value = stats.shapiro(data)
# H0: Data comes from normal distribution
# If p < 0.05, reject normality
```
**Properties:**
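- Among the most powerful normality tests for small and moderate samples
- Low power with very small samples; with very large samples it flags even trivial departures
- Recommended for n < 50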
#### Kolmogorov-Smirnov Test
**Best for:** Larger samples, comparing to theoretical distribution
```python
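# Mean and SD are estimated from the data, so the p-value is only approximate (a Lilliefors correction addresses this)
statistic, p_value = stats.kstest(data, 'norm', args=(np.mean(data), np.std(data, ddof=1)))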
```
**Note:** Less powerful than Shapiro-Wilk for normality testing.
#### Anderson-Darling Test
**Best for:** Detecting tail deviations
```python
result = stats.anderson(data, dist='norm')
# Compare statistic to critical values at different significance levels
```
### 1.3 When to Care About Normality
| Test | Normality Importance | Sample Size Robustness |
|------|---------------------|----------------------|
| T-Test | Moderate | Robust with n > 30 per group |
| ANOVA | Moderate-High | Robust with balanced n > 30 |
| Regression | High (residuals) | Depends on design |
### 1.4 Actions if Normality Violated
1. **Transform data:**
   - Log transformation (right skew)
   - Square root transformation (count data)
   - Box-Cox transformation (find optimal λ)
2. **Use non-parametric alternative:**
   - Mann-Whitney U instead of t-test
   - Kruskal-Wallis instead of ANOVA
3. **Use robust methods:**
   - Trimmed means
   - Bootstrap confidence intervals
## 2. Homogeneity of Variance (Homoscedasticity)
### 2.1 Levene's Test
**Recommended for:** Non-normal distributions
```python
from scipy import stats
statistic, p_value = stats.levene(group1, group2, group3)
# H0: Equal variances across groups
# If p < 0.05, variances are unequal
```
### 2.2 Bartlett's Test
**Recommended for:** Normal distributions (more powerful)
```python
statistic, p_value = stats.bartlett(group1, group2, group3)
```
**Caution:** Sensitive to non-normality.
### 2.3 Fligner-Killeen Test
**Recommended for:** Non-normal data, very robust
```python
statistic, p_value = stats.fligner(group1, group2, group3)
```
### 2.4 Actions if Homogeneity Violated
| Test | Alternative |
|------|-------------|
| Independent T-Test | Welch's T-Test |
| One-Way ANOVA | Welch's ANOVA |
| Repeated Measures | Use mixed models |
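
To make the first row of the table concrete, here is a minimal standard-library sketch of Welch's t statistic with the Welch-Satterthwaite degrees of freedom. The function name `welch_t` and the sample data are illustrative assumptions; in practice `scipy.stats.ttest_ind(group1, group2, equal_var=False)` performs the same test and also returns a p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group1, group2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    # Squared standard error of each group mean (sample variance / n)
    se1_sq = variance(group1) / n1
    se2_sq = variance(group2) / n2
    t = (mean(group1) - mean(group2)) / sqrt(se1_sq + se2_sq)
    # Degrees of freedom shrink toward the smaller group when variances differ
    df = (se1_sq + se2_sq) ** 2 / (se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1))
    return t, df


# Made-up data: the second group is far more variable than the first
t_stat, df = welch_t([10.1, 9.8, 10.3, 10.0], [12.5, 8.0, 15.2, 9.9, 13.1])
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

With equal group sizes and equal variances the degrees of freedom reduce to n1 + n2 - 2, matching the pooled t-test.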
## 3. Independence Assumption
### 3.1 Checking Independence
**Study Design Review:**
- Were subjects randomly assigned?
- Are measurements truly independent?
- Any clustering/nesting present?
### 3.2 Durbin-Watson Test (for regression/time series)
```python
from statsmodels.stats.stattools import durbin_watson
dw_stat = durbin_watson(residuals)
# ~2: No autocorrelation
# < 1.5 or > 2.5: Possible autocorrelation
```
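For readers who want to see what the statistic measures, the sketch below computes it by hand: the sum of squared successive differences of the residuals divided by the sum of squared residuals. The name `durbin_watson_stat` and the toy residuals are illustrative; the `statsmodels` function above remains the practical choice.

```python
def durbin_watson_stat(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Residuals that flip sign every step (negative autocorrelation) push the
# statistic above 2; slowly drifting residuals push it toward 0.
alternating = durbin_watson_stat([1.0, -1.0, 1.0, -1.0])
drifting = durbin_watson_stat([1.0, 1.0, 1.0, 1.0])
print(alternating, drifting)
```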
### 3.3 Actions if Independence Violated
- Use mixed-effects models
- Include cluster-robust standard errors
- Use generalized estimating equations (GEE)
## 4. Linearity (for Regression/Correlation)
### 4.1 Visual Check
```python
# Residuals vs Fitted plot
plt.scatter(fitted_values, residuals)
plt.axhline(y=0, color='r', linestyle='--')
```
**Patterns:**
- **Random scatter**: Linearity satisfied
- **Curved pattern**: Non-linearity present
- **Funnel shape**: Heteroscedasticity
### 4.2 Actions if Non-linear
- Polynomial terms
- Transform variables
- Use non-linear regression
- Splines/GAMs
## 5. Sphericity (Repeated Measures)
### 5.1 Mauchly's Test
```r
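# In R (base stats): mauchly.test() expects a multivariate linear model fitted to wide-format data
# wide_df, t1, t2, t3 are placeholder names: one row per subject, one column per time point
mlm <- lm(cbind(t1, t2, t3) ~ 1, data = wide_df)
mauchly.test(mlm, X = ~ 1)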
```
### 5.2 Corrections
- Greenhouse-Geisser (ε < 0.75)
- Huynh-Feldt (ε > 0.75)
## 6. Outlier Detection
### 6.1 Methods
#### Z-Score Method
```python
z_scores = np.abs(stats.zscore(data))
outliers = np.where(z_scores > 3)[0] # |z| > 3
```
#### IQR Method
```python
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
outliers = data[(data < Q1 - 1.5*IQR) | (data > Q3 + 1.5*IQR)]
```
#### Modified Z-Score (using MAD)
```python
median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad
outliers = np.where(np.abs(modified_z) > 3.5)[0]
```
### 6.2 Actions for Outliers
1. **Investigate**: Data entry error?
2. **Report**: Analyze with and without outliers
3. **Transform**: Reduce outlier impact
4. **Robust methods**: Use resistant statistics
## 7. Multicollinearity (Regression)
### 7.1 Variance Inflation Factor (VIF)
```python
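import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
# X: pandas DataFrame of predictors; include an intercept column so the VIFs are meaningful
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]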
```
**Interpretation:**
- VIF = 1: No correlation
- VIF 1-5: Moderate correlation
- VIF > 5: High correlation (problematic)
- VIF > 10: Severe multicollinearity
### 7.2 Actions
- Remove correlated predictors
- Combine into composite variable
- Use PCA
- Ridge regression
## 8. Comprehensive Assumption Check Function
```python
def comprehensive_assumption_check(data, groups=None, test_type="ttest"):
    """
    Comprehensive assumption checking for common statistical tests.
    Parameters:
    -----------
    data : array-like
        Data to check (or list of arrays for multiple groups)
    groups : list of array-like, optional
        Separate groups for group comparisons
    test_type : str
        Type of test planned ("ttest", "anova", "regression")
    Returns:
    --------
    dict : Assumption check results
    """
    results = {}
    if groups is not None:
        # Multiple groups
        all_data = np.concatenate(groups)
        # Normality per group
        results['normality'] = {}
        for i, group in enumerate(groups):
            stat, pval = stats.shapiro(group)
            results['normality'][f'group_{i+1}'] = {
                'statistic': stat,
                'p_value': pval,
                'passed': pval > 0.05
            }
        # Homogeneity
        stat, pval = stats.levene(*groups)
        results['homogeneity'] = {
            'test': 'Levene',
            'statistic': stat,
            'p_value': pval,
            'passed': pval > 0.05
        }
        # Sample sizes
        results['sample_sizes'] = [len(g) for g in groups]
    else:
        # Single sample or paired differences
        stat, pval = stats.shapiro(data)
        results['normality'] = {
            'statistic': stat,
            'p_value': pval,
            'passed': pval > 0.05
        }
    return results
```
## 9. Quick Reference: Assumption Violation Actions
| Assumption | Test | If Violated |
|------------|------|-------------|
| Normality | Shapiro-Wilk | Non-parametric test or transformation |
| Homogeneity | Levene's | Welch's test or Brown-Forsythe |
| Independence | Design review | Mixed models or GEE |
| Linearity | Residual plots | Polynomial/spline terms |
| Sphericity | Mauchly's | GG/HF corrections |
| Multicollinearity | VIF | Remove/combine variables |
## References
1. Tabachnick, B. G., & Fidell, L. S. (2019). Using Multivariate Statistics (7th ed.).
2. Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.).
3. Wilcox, R. R. (2017). Introduction to Robust Estimation and Hypothesis Testing (4th ed.).
FILE:references/power_analysis_guide.md
# Power Analysis Guide
## Overview
Statistical power is the probability of correctly rejecting a false null hypothesis (1 - β). This guide covers power analysis for common statistical tests and sample size determination.
## 1. Key Concepts
### 1.1 Components of Power Analysis
| Component | Symbol | Description |
|-----------|--------|-------------|
| Significance Level | α | Probability of Type I error (typically 0.05) |
| Power | 1-β | Probability of detecting true effect (typically 0.80) |
| Effect Size | d, f, r | Standardized magnitude of effect |
| Sample Size | n | Number of observations |
### 1.2 Effect Size Conventions
#### Cohen's d (for t-tests)
```
d = (M1 - M2) / SD_pooled
```
| Effect Size | d | Interpretation |
|-------------|---|----------------|
| Small | 0.2 | Barely perceptible |
| Medium | 0.5 | Observable difference |
| Large | 0.8 | Obvious difference |
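
As a worked illustration of the formula above, this minimal sketch computes Cohen's d from two samples with the pooled sample standard deviation. The function name `cohens_d` and the data are made up for the example.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group1, group2):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)


# Illustrative data only
treatment = [5.1, 6.0, 5.8, 6.4, 5.5]
control = [4.8, 5.2, 5.0, 5.6, 4.9]
d = cohens_d(treatment, control)
print(f"Cohen's d = {d:.2f}")
```

The sign follows the order of the arguments, so swapping the groups flips the sign but not the magnitude.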
#### Cohen's f (for ANOVA)
```
f = σ_means / σ_within
```
| Effect Size | f | η² |
|-------------|---|-----|
| Small | 0.1 | 0.01 |
| Medium | 0.25 | 0.06 |
| Large | 0.4 | 0.14 |
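
The two columns of this table are linked by f = sqrt(η² / (1 − η²)). A short sketch of the conversion in both directions follows; the function names are illustrative.

```python
from math import sqrt


def eta_squared_to_f(eta_sq):
    """Convert eta-squared (proportion of variance explained) to Cohen's f."""
    return sqrt(eta_sq / (1.0 - eta_sq))


def f_to_eta_squared(f):
    """Convert Cohen's f back to eta-squared."""
    return f ** 2 / (1.0 + f ** 2)


# Reproduce the benchmark rows of the table (values are rounded there)
for eta_sq in (0.01, 0.06, 0.14):
    print(f"eta^2 = {eta_sq:.2f} -> f = {eta_squared_to_f(eta_sq):.3f}")
```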
#### Cramér's V (for Chi-Square)
```
V = sqrt(χ² / (n * (k - 1)))
where k = min(number of rows, number of columns)
```
| Effect Size | V |
|-------------|---|
| Small | 0.1 |
| Medium | 0.3 |
| Large | 0.5 |
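
A minimal sketch of Cramér's V computed directly from a contingency table is shown below, with k taken as the smaller of the row and column counts. The function name `cramers_v` and the example table are assumptions for illustration, and no continuity correction is applied.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return sqrt(chi2 / (n * (k - 1)))


# Made-up 2x3 table of counts
v = cramers_v([[20, 15, 5], [10, 15, 25]])
print(f"Cramér's V = {v:.2f}")
```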
## 2. Power Calculation Methods
### 2.1 Analytical Approach
#### Two-Sample T-Test
```python
import scipy.stats as stats
import numpy as np
def power_ttest(effect_size, n_per_group, alpha=0.05):
    """Calculate power for independent t-test."""
    # Non-centrality parameter
    ncp = effect_size * np.sqrt(n_per_group / 2)
    # Critical t-value
    df = 2 * n_per_group - 2
    t_crit = stats.t.ppf(1 - alpha/2, df)
    # Power calculation
    power = 1 - stats.nct.cdf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)
    return power
```
#### Sample Size Formula (Two-Sample T-Test)
```python
def sample_size_ttest(effect_size, power=0.8, alpha=0.05):
    """Calculate required sample size per group."""
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return int(np.ceil(n))
```
### 2.2 Simulation Approach
```python
def power_by_simulation(effect_size, n_per_group, n_simulations=10000, alpha=0.05):
    """
    Estimate power through Monte Carlo simulation.
    More flexible for complex designs.
    """
    significant_results = 0
    for _ in range(n_simulations):
        # Simulate data under alternative hypothesis
        group1 = np.random.normal(0, 1, n_per_group)
        group2 = np.random.normal(effect_size, 1, n_per_group)
        # Perform test
        _, p_value = stats.ttest_ind(group1, group2)
        if p_value < alpha:
            significant_results += 1
    return significant_results / n_simulations
```
## 3. Practical Power Analysis Examples
### 3.1 Two-Sample T-Test
```python
import numpy as np
from statsmodels.stats.power import TTestIndPower
power_analysis = TTestIndPower()
# Calculate power
effect_size = 0.5 # Medium effect
alpha = 0.05
n_per_group = 30
power = power_analysis.power(effect_size=effect_size,
nobs1=n_per_group,
alpha=alpha)
print(f"Power: {power:.2%}")
# Calculate sample size
required_n = power_analysis.solve_power(effect_size=effect_size,
power=0.8,
alpha=alpha)
print(f"Required n per group: {int(np.ceil(required_n))}")
```
### 3.2 One-Way ANOVA
```python
from statsmodels.stats.power import FTestAnovaPower
power_analysis = FTestAnovaPower()
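# Calculate power for 3 groups
effect_size = 0.25 # f = 0.25 (medium)
alpha = 0.05
n_per_group = 20
k_groups = 3
# FTestAnovaPower expects nobs as the TOTAL sample size across all groups
power = power_analysis.power(effect_size=effect_size,
                             nobs=n_per_group * k_groups,
                             alpha=alpha,
                             k_groups=k_groups)
# Calculate sample size (solve_power also returns the total N)
required_total = power_analysis.solve_power(effect_size=effect_size,
                                            power=0.8,
                                            alpha=alpha,
                                            k_groups=k_groups)
required_n_per_group = required_total / k_groups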
```
### 3.3 Chi-Square Test
```python
def power_chi_square(effect_size, n, df=1, alpha=0.05):
    """
    Approximate power for chi-square test.
    effect_size: Cramér's V or phi coefficient
    """
    # Non-centrality parameter
    ncp = n * effect_size ** 2
    # Critical value
    crit = stats.chi2.ppf(1 - alpha, df)
    # Power
    power = 1 - stats.ncx2.cdf(crit, df, ncp)
    return power
```
## 4. Sample Size Determination
### 4.1 Planning Table: T-Test
| Effect Size | Power = 0.80 | Power = 0.90 | Power = 0.95 |
|-------------|--------------|--------------|--------------|
| Small (d=0.2) | 393/group | 526/group | 651/group |
| Medium (d=0.5) | 64/group | 86/group | 105/group |
| Large (d=0.8) | 26/group | 34/group | 42/group |
### 4.2 Planning Table: ANOVA (3 groups)
| Effect Size | Power = 0.80 | Power = 0.90 |
|-------------|--------------|--------------|
| Small (f=0.1) | ≈323/group | ≈423/group |
| Medium (f=0.25) | ≈53/group | ≈69/group |
| Large (f=0.4) | ≈22/group | ≈28/group |
### 4.3 Accounting for Attrition
```python
def adjust_for_attrition(n_required, attrition_rate=0.20):
    """
    Adjust sample size for expected dropout.
    Parameters:
    -----------
    n_required : int
        Required completers
    attrition_rate : float
        Expected dropout rate (0.20 = 20%)
    Returns:
    --------
    int : Adjusted enrollment target
    """
    n_enroll = int(np.ceil(n_required / (1 - attrition_rate)))
    return n_enroll
# Example
n_needed = 64 # Completers needed
n_enroll = adjust_for_attrition(n_needed, attrition_rate=0.20)
print(f"Enroll {n_enroll} to get {n_needed} completers with 20% attrition")
```
## 5. Power Curves
```python
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.stats.power import TTestIndPower
def plot_power_curve(effect_sizes, sample_sizes, alpha=0.05):
    """Generate power curves for different sample sizes."""
    # Use the independent t-test calculator explicitly rather than a global
    power_analysis = TTestIndPower()
    fig, ax = plt.subplots(figsize=(10, 6))
    for n in sample_sizes:
        powers = []
        for es in effect_sizes:
            power = power_analysis.power(effect_size=es, nobs1=n, alpha=alpha)
            powers.append(power)
        ax.plot(effect_sizes, powers, label=f'n={n}', marker='o')
    ax.axhline(y=0.8, color='r', linestyle='--', label='Target Power (80%)')
    ax.set_xlabel('Effect Size (Cohen\'s d)')
    ax.set_ylabel('Power')
    ax.set_title('Power Curves for Independent T-Test')
    ax.legend()
    ax.grid(True, alpha=0.3)
    return fig
# Generate curve
effect_sizes = np.arange(0.1, 1.0, 0.1)
sample_sizes = [20, 30, 50, 100]
plot_power_curve(effect_sizes, sample_sizes)
```
## 6. Post-Hoc Power Analysis
**Warning:** Post-hoc power analysis is controversial. Better to report:
- Observed effect size
- Confidence interval
- p-value
If you must calculate observed power:
```python
def observed_power_from_pvalue(p_value, n_per_group, alpha=0.05):
    """
    Calculate observed power from obtained p-value.
    Not recommended for general use.
    """
    # Back-calculate effect size from p-value
    df = 2 * n_per_group - 2
    t_observed = stats.t.ppf(1 - p_value/2, df)
    # Approximate effect size
    d_observed = t_observed * np.sqrt(2/n_per_group)
    # Calculate power at observed effect
    power = power_analysis.power(effect_size=d_observed,
                                 nobs1=n_per_group,
                                 alpha=alpha)
    return power, d_observed
```
## 7. Sensitivity Analysis
```python
def sensitivity_analysis(n_per_group, min_power=0.8, alpha=0.05):
    """
    Determine minimum detectable effect size for given sample.
    """
    # Binary search for effect size that achieves target power
    low, high = 0.01, 2.0
    while high - low > 0.001:
        mid = (low + high) / 2
        power = power_analysis.power(effect_size=mid,
                                     nobs1=n_per_group,
                                     alpha=alpha)
        if power < min_power:
            low = mid
        else:
            high = mid
    return mid
# Example: What effect size can we detect with n=50?
mde = sensitivity_analysis(50, min_power=0.8)
print(f"Minimum detectable effect size (d): {mde:.2f}")
```
## 8. Multiplicity Corrections
When conducting multiple tests, adjust sample size calculations:
```python
def bonferroni_adjusted_alpha(alpha, n_tests):
    """Conservative correction for multiple comparisons."""
    return alpha / n_tests
def fdr_adjusted_sample_size(n_original, n_tests, alpha=0.05):
    """
    Approximate adjustment for FDR control.
    Conservative approach: increase sample size by factor.
    """
    # Harmonic-sum factor sum(1/i), the constant used by Benjamini-Yekutieli;
    # inflating n by it is a rough, conservative heuristic, not an exact result
    adjustment = np.sum(1 / np.arange(1, n_tests + 1))
    return int(np.ceil(n_original * adjustment))
```
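Since the adjustment above refers to FDR control, a minimal sketch of the Benjamini-Hochberg step-up procedure itself may help show what is being controlled. The function name `benjamini_hochberg` and the p-values are illustrative; `statsmodels.stats.multitest.multipletests(..., method="fdr_bh")` provides a maintained implementation.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * q / m
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            max_rank = rank
    # Step-up rule: reject every hypothesis ranked at or below that k
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


decisions = benjamini_hochberg([0.01, 0.04, 0.03, 0.20], q=0.05)
print(decisions)
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value further down the ranking meets its threshold.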
## 9. Equivalence and Non-Inferiority
```python
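def sample_size_equivalence(margin, sd, alpha=0.05, power=0.8):
    """
    Sample size per group for equivalence test (TOST),
    assuming the true difference between groups is zero.
    Parameters:
    -----------
    margin : float
        Equivalence margin
    sd : float
        Standard deviation
    """
    # Two one-sided tests approach: each test uses alpha (not alpha/2)
    z_alpha = stats.norm.ppf(1 - alpha)
    # With a true difference of zero, the type II error is split across both tests
    z_beta = stats.norm.ppf(1 - (1 - power) / 2)
    n = 2 * ((z_alpha + z_beta) * sd / margin) ** 2
    return int(np.ceil(n))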
```
## 10. Reporting Guidelines
Include in methods section:
```
Sample size was determined based on a power analysis for an
independent-samples t-test. With an expected medium effect
size (Cohen's d = 0.5), α = 0.05, and desired power of 0.80,
a minimum of 64 participants per group (128 total) was required.
To account for an anticipated 15% attrition rate, we aimed to
enroll 152 participants (76 per group).
```
## 11. Software Comparison
| Software | Function/Package | Pros | Cons |
|----------|-----------------|------|------|
| Python | statsmodels | Free, flexible | Requires coding |
| R | pwr package | Comprehensive | Learning curve |
| G*Power | GUI application | User-friendly | Limited customization |
| Stata | power command | Integrated | Commercial |
## References
1. Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.).
2. Faul, F., et al. (2009). G*Power 3.1. Behavior Research Methods, 41(4), 1149-1160.
3. Perugini, M., et al. (2018). A Practical Primer To Power Analysis.
4. Lakens, D. (2021). Sample Size Justification. PsyArXiv.
FILE:references/statistical_tests_guide.md
# Statistical Tests Guide
## Overview
This guide provides detailed criteria for selecting appropriate statistical tests based on research design, data characteristics, and distributional assumptions.
## 1. Two-Group Comparisons
### 1.1 Independent Samples
#### T-Test (Independent)
**Use when:**
- Continuous outcome variable
- Two independent groups
- Approximately normal distribution
- Homogeneity of variances (can use Welch's if violated)
**Formula:**
```
t = (M1 - M2) / sqrt((s1²/n1) + (s2²/n2))
```
**Assumptions:**
1. Independence of observations
2. Normality (robust to violations with n > 30 per group)
3. Homogeneity of variance (use Welch's t-test if violated)
#### Mann-Whitney U Test (Wilcoxon Rank-Sum)
**Use when:**
- Ordinal or continuous non-normal data
- Two independent groups
- Assumption of normality severely violated
- Outliers present
**Advantages:**
- Robust to outliers
- No normality assumption (interpreting it as a comparison of medians additionally assumes similarly shaped distributions)
- Works with ordinal data
**Disadvantages:**
- Less powerful than t-test when normality holds
- Tests different null hypothesis (distribution equality vs. mean equality)
### 1.2 Paired/Related Samples
#### Paired T-Test
**Use when:**
- Same subjects measured twice
- Matched pairs design
- Differences are approximately normal
**Formula:**
```
t = M_diff / (SD_diff / sqrt(n))
```
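The formula above can be traced with a few lines of standard-library Python. This is a sketch for illustration: the function name `paired_t` and the before/after values are made up, and `scipy.stats.ttest_rel` is the usual way to obtain the statistic together with a p-value.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Paired t statistic and degrees of freedom from matched measurements."""
    # Work on the within-pair differences only
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Illustrative scores for the same five subjects measured twice
t_stat, df = paired_t(before=[72, 68, 75, 80, 66], after=[75, 70, 74, 85, 70])
print(f"t = {t_stat:.3f}, df = {df}")
```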
#### Wilcoxon Signed-Rank Test
**Use when:**
- Paired design with non-normal differences
- Ordinal paired data
- Robust alternative to paired t-test
## 2. Three or More Groups (ANOVA Family)
### 2.1 One-Way ANOVA
**Use when:**
- Continuous outcome
- Three or more independent groups
- Normal distribution within each group
- Homogeneity of variances
**Formula:**
```
F = MS_between / MS_within
```
**Post-hoc tests:**
- Tukey HSD (all pairwise comparisons)
- Bonferroni (conservative, planned comparisons)
- Scheffé (complex comparisons)
### 2.2 Welch's ANOVA
**Use when:**
- Heterogeneous variances across groups
- More robust than standard ANOVA
### 2.3 Kruskal-Wallis H Test
**Use when:**
- Non-normal distributions
- Ordinal data with 3+ groups
- Non-parametric alternative to one-way ANOVA
**Post-hoc:** Dunn's test with Bonferroni correction
### 2.4 Repeated Measures ANOVA
**Use when:**
- Same subjects measured at multiple time points
- Assesses within-subject effects
**Assumptions:**
1. Sphericity (Mauchly's test)
2. Normality of residuals
3. No significant outliers
**Correction for sphericity violation:**
- Greenhouse-Geisser (conservative)
- Huynh-Feldt (less conservative)
### 2.5 Friedman Test
**Use when:**
- Repeated measures with non-normal data
- Non-parametric alternative to repeated measures ANOVA
## 3. Categorical Data Analysis
### 3.1 Chi-Square Test of Independence
**Use when:**
- Two categorical variables
- Independent observations
- Expected frequencies ≥ 5
**Formula:**
```
χ² = Σ((O - E)² / E)
```
**Requirements:**
- No more than 20% of cells with expected count < 5
- No cells with expected count < 1
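
The two requirements above can be checked mechanically from the observed table. The sketch below computes expected counts under independence and applies both rules; `expected_counts` and `chi_square_requirements_met` are hypothetical helper names, and the tables are made up.

```python
def expected_counts(table):
    """Expected cell counts under independence for a list-of-rows table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_requirements_met(table):
    """True if at most 20% of expected counts are below 5 and none is below 1."""
    cells = [e for row in expected_counts(table) for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return share_below_5 <= 0.20 and min(cells) >= 1


large_table = [[20, 30], [30, 20]]   # every expected count is 25
sparse_table = [[1, 2], [3, 4]]      # every expected count is below 5
print(chi_square_requirements_met(large_table))
print(chi_square_requirements_met(sparse_table))
```

When the check fails for a 2×2 table, Fisher's exact test in the next section is the usual fallback.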
### 3.2 Fisher's Exact Test
**Use when:**
- Small sample sizes
- Expected cell counts < 5
- 2×2 contingency tables
**Advantages:**
- Exact p-values (not approximate)
- Valid for any sample size
**Limitations:**
- Computationally intensive for large tables
### 3.3 McNemar's Test
**Use when:**
- Paired categorical data
- Before/after design with binary outcome
- Matched case-control studies
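This section gives no formula, so the sketch below assumes the common uncorrected form of McNemar's statistic, χ² = (b - c)² / (b + c) with 1 degree of freedom, where `b` and `c` are the counts of discordant pairs. The function name and counts are hypothetical.

```python
def mcnemar_statistic(b, c):
    """Uncorrected McNemar chi-square from the two discordant-pair counts.

    b: pairs that changed from positive to negative
    c: pairs that changed from negative to positive
    """
    if b + c == 0:
        raise ValueError("no discordant pairs; the statistic is undefined")
    return (b - c) ** 2 / (b + c)


# Hypothetical before/after study: 15 pairs switched one way, 5 the other
stat = mcnemar_statistic(15, 5)
print(f"McNemar chi2 = {stat:.2f} (1 df)")
```

Concordant pairs do not enter the statistic; with few discordant pairs an exact binomial version is usually preferred.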
## 4. Correlation Analysis
### 4.1 Pearson Correlation
**Use when:**
- Two continuous variables
- Linear relationship
- Bivariate normality
### 4.2 Spearman's Rank Correlation
**Use when:**
- Ordinal data
- Non-linear monotonic relationship
- Outliers present
- Non-normal distributions
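To make the "non-linear monotonic relationship" case concrete, the sketch below computes Spearman's rho as the Pearson correlation of average ranks and compares it with the plain Pearson coefficient on a curved but monotonic data set. The helper names and data are hypothetical; in practice `scipy.stats.spearmanr` does the same job.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical monotonic but non-linear relationship (y = x squared)
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
rho = spearman_rho(x, y)
r = pearson(x, y)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```

Because the ranks of `x` and `y` agree exactly, rho reaches 1 while Pearson's r stays below 1.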
### 4.3 Kendall's Tau
**Use when:**
- Small sample sizes
- Tied ranks present
- More robust than Spearman
## 5. Regression Analysis
### 5.1 Linear Regression
**Use when:**
- Continuous outcome
- Linear relationship with predictors
- Independent errors
**Assumptions:**
1. Linearity
2. Independence of errors
3. Homoscedasticity
4. Normality of residuals
5. No multicollinearity
### 5.2 Logistic Regression
**Use when:**
- Binary outcome variable
- Predicting probability of event
### 5.3 Poisson Regression
**Use when:**
- Count data as outcome
- Rate outcomes (counts per unit of exposure, modeled with an offset term)
## 6. Decision Tree Summary
```
Data Type?
├── Continuous
│   ├── 2 Groups
│   │   ├── Independent → T-Test / Mann-Whitney
│   │   └── Paired → Paired T-Test / Wilcoxon
│   └── 3+ Groups
│       ├── Independent → ANOVA / Kruskal-Wallis
│       └── Repeated → RM-ANOVA / Friedman
├── Categorical
│   ├── 2×2 Table → Chi-Square / Fisher's Exact
│   └── Larger Table → Chi-Square
└── Ordinal
    └── Non-parametric tests (Mann-Whitney, Kruskal-Wallis)
```
## 7. Effect Size Measures
| Test | Effect Size | Small | Medium | Large |
|------|-------------|-------|--------|-------|
| T-Test | Cohen's d | 0.2 | 0.5 | 0.8 |
| ANOVA | η² (eta-squared) | 0.01 | 0.06 | 0.14 |
| ANOVA | f | 0.1 | 0.25 | 0.4 |
| Chi-Square | Cramér's V | 0.1 | 0.3 | 0.5 |
| Correlation | r | 0.1 | 0.3 | 0.5 |
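As an example of applying the first row of the table, the sketch below computes Cohen's d with a pooled standard deviation and labels it using the thresholds listed above. The function names, the "negligible" label for values under 0.2, and the data are assumptions made for this illustration.

```python
import math
import statistics


def cohens_d(group1, group2):
    """Cohen's d for two independent groups, using the pooled sample SD."""
    n1, n2 = len(group1), len(group2)
    var1 = statistics.variance(group1)
    var2 = statistics.variance(group2)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd


def label_cohens_d(d):
    """Map |d| onto the small / medium / large thresholds from the table."""
    magnitude = abs(d)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    if magnitude >= 0.2:
        return "small"
    return "negligible"


# Hypothetical scores for two independent groups
d = cohens_d([2, 4, 6, 8], [1, 3, 5, 7])
print(f"d = {d:.3f} ({label_cohens_d(d)})")
```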
## 8. Software Implementation
### Python
```python
import scipy.stats as stats
# T-Test
stats.ttest_ind(group1, group2)
stats.ttest_ind(group1, group2, equal_var=False) # Welch's
# ANOVA
stats.f_oneway(group1, group2, group3)
# Non-parametric
stats.mannwhitneyu(group1, group2)
stats.kruskal(group1, group2, group3)
# Chi-Square
stats.chi2_contingency(contingency_table)
```
### R
```r
# T-Test
t.test(outcome ~ group, data = df, var.equal = TRUE) # Student's (pooled variance)
t.test(outcome ~ group, data = df) # Welch's (R default: var.equal = FALSE)
# ANOVA
aov(outcome ~ group, data = df)
# Non-parametric
wilcox.test(outcome ~ group, data = df)
kruskal.test(outcome ~ group, data = df)
# Chi-Square
chisq.test(table)
fisher.test(table)
```
## References
1. Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.). SAGE.
2. Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.).
3. Wilcox, R. R. (2012). Introduction to Robust Estimation and Hypothesis Testing.
FILE:requirements.txt
# No third-party dependencies: dataclasses and enum are part of the Python standard library (3.7+).
FILE:scripts/main.py
#!/usr/bin/env python3
"""
Statistical Analysis Advisor
Recommends appropriate statistical methods based on dataset characteristics.
"""
from dataclasses import dataclass
from typing import List, Dict, Optional, Tuple, Any
from enum import Enum
import math
class DataType(Enum):
CONTINUOUS = "continuous"
CATEGORICAL = "categorical"
ORDINAL = "ordinal"
class Distribution(Enum):
NORMAL = "normal"
NON_NORMAL = "non-normal"
UNKNOWN = "unknown"
@dataclass
class TestRecommendation:
"""Result container for statistical test recommendations."""
recommended_test: str
alternative_tests: List[str]
assumptions: List[str]
warnings: List[str]
rationale: str
@dataclass
class AssumptionCheck:
"""Result container for assumption checking."""
assumption: str
test_name: str
satisfied: bool
details: str
recommendation: str
@dataclass
class PowerAnalysis:
"""Result container for power analysis."""
effect_size: float
sample_size: int
alpha: float
power: float
interpretation: str
class StatisticalAdvisor:
"""
Intelligent advisor for statistical test selection and experimental design.
"""
def __init__(self):
self.test_decision_tree = self._build_decision_tree()
self.assumption_tests = self._load_assumption_tests()
def _build_decision_tree(self) -> Dict:
"""Build the statistical test decision tree."""
return {
DataType.CONTINUOUS: {
2: { # Two groups
True: { # Independent
Distribution.NORMAL: "Independent Samples T-Test",
Distribution.NON_NORMAL: "Mann-Whitney U Test",
Distribution.UNKNOWN: "Mann-Whitney U Test (robust)"
},
False: { # Paired
Distribution.NORMAL: "Paired Samples T-Test",
Distribution.NON_NORMAL: "Wilcoxon Signed-Rank Test",
Distribution.UNKNOWN: "Wilcoxon Signed-Rank Test (robust)"
}
},
"3+": { # Three or more groups
True: { # Independent
Distribution.NORMAL: "One-Way ANOVA",
Distribution.NON_NORMAL: "Kruskal-Wallis H Test",
Distribution.UNKNOWN: "Kruskal-Wallis H Test (robust)"
},
False: { # Repeated measures
Distribution.NORMAL: "Repeated Measures ANOVA",
Distribution.NON_NORMAL: "Friedman Test",
Distribution.UNKNOWN: "Friedman Test (robust)"
}
}
},
DataType.CATEGORICAL: {
2: {
True: {
"default": "Chi-Square Test of Independence",
"expected_low": "Fisher's Exact Test"
}
},
"3+": {
True: {
"default": "Chi-Square Test of Independence"
}
}
},
DataType.ORDINAL: {
2: {
True: {
"default": "Mann-Whitney U Test"
},
False: {
"default": "Wilcoxon Signed-Rank Test"
}
},
"3+": {
True: {
"default": "Kruskal-Wallis H Test"
},
False: {
"default": "Friedman Test"
}
}
}
}
def _load_assumption_tests(self) -> Dict:
"""Load assumption checking procedures."""
return {
"independent_ttest": {
"normality": {
"test": "Shapiro-Wilk",
"alternative": "Kolmogorov-Smirnov",
"action_if_violated": "Use Mann-Whitney U Test"
},
"homogeneity": {
"test": "Levene's Test",
"alternative": "Bartlett's Test",
"action_if_violated": "Use Welch's T-Test"
},
"independence": {
"check": "Study design review",
"action_if_violated": "Non-independent data requires mixed models"
}
},
"paired_ttest": {
"normality": {
"test": "Shapiro-Wilk on differences",
"action_if_violated": "Use Wilcoxon Signed-Rank Test"
},
"independence_of_pairs": {
"check": "Study design review",
"action_if_violated": "Requires specialized paired analysis"
}
},
"anova": {
"normality": {
"test": "Shapiro-Wilk (per group)",
"action_if_violated": "Consider transformation or Kruskal-Wallis"
},
"homogeneity": {
"test": "Levene's Test",
"action_if_violated": "Use Welch's ANOVA or Brown-Forsythe"
},
"independence": {
"check": "Study design review",
"action_if_violated": "Requires mixed-effects models"
}
},
"chi_square": {
"expected_frequencies": {
"rule": "All expected counts ≥ 5",
"alternative_rule": "No more than 20% below 5",
"action_if_violated": "Use Fisher's Exact Test"
},
"independence": {
"check": "Study design review",
"action_if_violated": "Use McNemar's Test for paired data"
}
}
}
def recommend_test(
self,
data_type: str,
groups: int,
independent: bool,
distribution: str = "unknown",
paired: Optional[bool] = None,
expected_cell_counts: Optional[int] = None
) -> TestRecommendation:
"""
Recommend appropriate statistical test based on data characteristics.
Args:
data_type: "continuous", "categorical", or "ordinal"
groups: Number of groups (2 or more)
independent: Whether samples are independent
distribution: "normal", "non-normal", or "unknown"
paired: Whether samples are paired (overrides independent if set)
expected_cell_counts: For chi-square, minimum expected count
Returns:
TestRecommendation with recommended test and alternatives
"""
# Normalize inputs
data_type_enum = DataType(data_type.lower())
dist_enum = Distribution(distribution.lower())
# Handle paired override
if paired is not None:
independent = not paired
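        # Determine group key (fewer than two groups is not a group comparison)
        if groups < 2:
            raise ValueError("groups must be at least 2")
        group_key = 2 if groups == 2 else "3+"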
warnings = []
alternative_tests = []
# Special handling for categorical data
if data_type_enum == DataType.CATEGORICAL:
if expected_cell_counts is not None and expected_cell_counts < 5:
recommended = "Fisher's Exact Test"
alternative_tests = ["Chi-Square with Yates' correction", "Bootstrap methods"]
warnings.append("Low expected cell counts - Fisher's Exact Test preferred")
else:
recommended = "Chi-Square Test of Independence"
alternative_tests = ["Fisher's Exact Test (if sparse)", "G-test"]
assumptions = [
"Independence of observations",
"Expected frequency ≥ 5 in at least 80% of cells",
"Mutual exclusivity of categories"
]
rationale = f"Chi-square tests are appropriate for categorical data with {groups} groups."
else:
# Lookup in decision tree
try:
tree = self.test_decision_tree[data_type_enum][group_key][independent]
if isinstance(tree, dict) and Distribution.NORMAL in tree:
recommended = tree.get(dist_enum, tree[Distribution.UNKNOWN])
else:
recommended = tree.get("default", "Consultation recommended")
# Determine alternatives
if data_type_enum == DataType.CONTINUOUS:
if groups == 2:
if independent:
alternative_tests = ["Welch's T-Test", "Yuen's Trimmed Mean T-Test"]
else:
alternative_tests = ["Permutation Test", "Bootstrap CI"]
else: # 3+ groups
if independent:
alternative_tests = ["Welch's ANOVA", "Brown-Forsythe Test"]
else:
alternative_tests = ["Mixed-effects Model", "GEE"]
assumptions = self._get_assumptions_for_test(recommended)
rationale = self._get_rationale(recommended, data_type_enum, groups, independent, dist_enum)
except KeyError:
recommended = "Consultation recommended"
alternative_tests = []
assumptions = []
warnings.append("Complex design - specialized consultation advised")
rationale = "The specified design requires specialized statistical consultation."
# Add sample size warnings
if groups >= 2 and data_type_enum == DataType.CONTINUOUS:
warnings.extend(self._check_sample_size_recommendations(groups))
return TestRecommendation(
recommended_test=recommended,
alternative_tests=alternative_tests,
assumptions=assumptions,
warnings=warnings,
rationale=rationale
)
def _get_assumptions_for_test(self, test_name: str) -> List[str]:
"""Get assumptions list for a specific test."""
test_key = test_name.lower().replace(" ", "_").replace("-", "_")
assumption_map = {
"independent": "independent_ttest",
"paired": "paired_ttest",
"anova": "anova",
"chi_square": "chi_square",
"mann_whitney": "independent_ttest", # Uses same assumptions check
"wilcoxon": "paired_ttest",
"kruskal_wallis": "anova"
}
for key, value in assumption_map.items():
if key in test_key:
test_info = self.assumption_tests.get(value, {})
return [f"{k}: {v.get('test', v.get('check', 'Required'))}"
for k, v in test_info.items()]
return ["Consult references/assumption_tests.md for detailed requirements"]
def _get_rationale(
self,
test_name: str,
data_type: DataType,
groups: int,
independent: bool,
distribution: Distribution
) -> str:
"""Generate rationale for test recommendation."""
parts = []
parts.append(f"Selected {test_name} based on:")
parts.append(f"- Data type: {data_type.value}")
parts.append(f"- Groups: {groups}")
parts.append(f"- Sample relationship: {'Independent' if independent else 'Paired/Related'}")
parts.append(f"- Distribution: {distribution.value}")
if "mann-whitney" in test_name.lower() or "wilcoxon" in test_name.lower() or "kruskal" in test_name.lower():
parts.append("\nNon-parametric test chosen due to non-normal distribution or ordinal data.")
parts.append("These tests are robust to outliers and don't assume normality.")
if "welch" in test_name.lower():
parts.append("\nWelch's test used for unequal variances.")
return "\n".join(parts)
def _check_sample_size_recommendations(self, groups: int) -> List[str]:
"""Check and return sample size recommendations."""
warnings = []
if groups == 2:
warnings.append("Minimum n=20 per group recommended for robust normality tests")
else:
warnings.append(f"Minimum n=30 per group recommended for ANOVA with {groups} groups")
return warnings
def check_assumptions(
self,
test_type: str,
sample_sizes: Optional[List[int]] = None,
normality_pvalues: Optional[List[float]] = None,
levene_pvalue: Optional[float] = None
) -> List[AssumptionCheck]:
"""
Check statistical assumptions for a given test.
Args:
test_type: Type of test being considered
sample_sizes: List of sample sizes per group
normality_pvalues: P-values from normality tests
levene_pvalue: P-value from Levene's homogeneity test
Returns:
List of AssumptionCheck results
"""
results = []
test_key = test_type.lower().replace(" ", "_")
# Check normality
if normality_pvalues is not None:
for i, pval in enumerate(normality_pvalues):
satisfied = pval > 0.05
results.append(AssumptionCheck(
assumption=f"Normality (Group {i+1})",
test_name="Shapiro-Wilk",
satisfied=satisfied,
details=f"p-value = {pval:.4f}",
recommendation="Use non-parametric alternative" if not satisfied else "Assumption satisfied"
))
# Check homogeneity
if levene_pvalue is not None:
satisfied = levene_pvalue > 0.05
results.append(AssumptionCheck(
assumption="Homogeneity of Variance",
test_name="Levene's Test",
satisfied=satisfied,
details=f"p-value = {levene_pvalue:.4f}",
recommendation="Use Welch's test" if not satisfied else "Assumption satisfied"
))
# Sample size check
if sample_sizes is not None:
for i, n in enumerate(sample_sizes):
satisfied = n >= 30
results.append(AssumptionCheck(
assumption=f"Adequate Sample Size (Group {i+1})",
test_name="Rule of Thumb",
satisfied=satisfied,
details=f"n = {n}",
recommendation="Consider power analysis" if not satisfied else "Sample size adequate"
))
return results
def calculate_power(
self,
effect_size: float,
alpha: float = 0.05,
sample_size: Optional[int] = None,
power: Optional[float] = None,
groups: int = 2
) -> PowerAnalysis:
"""
Calculate statistical power or required sample size.
Args:
effect_size: Cohen's d (for t-tests) or f (for ANOVA)
alpha: Significance level (default 0.05)
sample_size: Sample size per group (if known)
power: Desired power (if calculating sample size)
groups: Number of groups
Returns:
PowerAnalysis with results
Note:
Either sample_size OR power must be provided (not both)
"""
if sample_size is None and power is None:
raise ValueError("Either sample_size or power must be provided")
if sample_size is not None and power is not None:
raise ValueError("Provide only sample_size OR power, not both")
        # Approximate calculation using the normal approximation.
        # For more precise results, use a specialized library such as statsmodels.
        import statistics
        if effect_size == 0:
            raise ValueError("effect_size must be non-zero")
        normal = statistics.NormalDist()
        # Two-sided critical value derived from alpha (1.96 when alpha = 0.05)
        z_alpha = normal.inv_cdf(1 - alpha / 2)
        if power is not None:
            # Required sample size per group for two groups:
            # n = 2 * ((z_(1-alpha/2) + z_power) / effect_size)^2
            z_power = normal.inv_cdf(power)
            # Round up so the requested power is not undershot
            n_per_group = math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)
            if groups > 2:
                # Rough adjustment for ANOVA
                n_per_group = math.ceil(n_per_group * (1 + 0.1 * (groups - 2)))
            calculated_power = power
            calculated_n = n_per_group
            interpretation = f"To achieve {power*100:.0f}% power with effect size {effect_size}, need n={n_per_group} per group."
        else:
            # Achieved power for the given per-group sample size
            ncp = abs(effect_size) * math.sqrt(sample_size / 2)  # non-centrality parameter
            calculated_power = 1 - normal.cdf(z_alpha - ncp)
            calculated_power = min(calculated_power, 0.999)  # Cap at 99.9%
            calculated_n = sample_size
            if calculated_power >= 0.8:
                interpretation = f"Power ({calculated_power:.1%}) is adequate (≥80%)."
            elif calculated_power >= 0.6:
                interpretation = f"Power ({calculated_power:.1%}) is marginal. Consider increasing sample size."
            else:
                interpretation = f"Power ({calculated_power:.1%}) is insufficient. Increase sample size or effect size."
return PowerAnalysis(
effect_size=effect_size,
sample_size=calculated_n,
alpha=alpha,
power=calculated_power,
interpretation=interpretation
)
def get_effect_size_interpretation(self, effect_size: float, metric: str = "cohens_d") -> str:
"""
Interpret effect size magnitude.
Args:
effect_size: Calculated effect size value
metric: Type of effect size metric
Returns:
Interpretation string
"""
interpretations = {
"cohens_d": {
(0, 0.2): "Negligible",
(0.2, 0.5): "Small",
(0.5, 0.8): "Medium",
(0.8, float('inf')): "Large"
},
"eta_squared": {
                (0, 0.01): "Negligible",
                (0.01, 0.06): "Small",
                (0.06, 0.14): "Medium",
                (0.14, float('inf')): "Large"
            },
            "cramers_v": {
                (0, 0.1): "Negligible",
                (0.1, 0.3): "Small",
                (0.3, 0.5): "Medium",
                (0.5, float('inf')): "Large"
}
}
metric_interp = interpretations.get(metric, interpretations["cohens_d"])
for (lower, upper), label in metric_interp.items():
if lower <= abs(effect_size) < upper:
return f"{label} effect size ({metric} = {effect_size:.3f})"
return f"Effect size = {effect_size:.3f} (interpretation not available)"
def compare_tests(self, scenario: Dict[str, Any]) -> Dict[str, TestRecommendation]:
"""
Compare multiple test options for a given scenario.
Args:
scenario: Dictionary with data characteristics
Returns:
Dictionary mapping test names to recommendations
"""
recommendations = {}
# Parametric option
if scenario.get("distribution") == "normal":
rec = self.recommend_test(**scenario)
recommendations["parametric"] = rec
# Non-parametric option
non_normal_scenario = scenario.copy()
non_normal_scenario["distribution"] = "non-normal"
rec = self.recommend_test(**non_normal_scenario)
recommendations["non_parametric"] = rec
# Robust option
robust_scenario = scenario.copy()
robust_scenario["distribution"] = "unknown"
rec = self.recommend_test(**robust_scenario)
recommendations["robust"] = rec
return recommendations
def main():
"""Example usage and testing."""
advisor = StatisticalAdvisor()
# Example 1: Two-group comparison with normal distribution
print("=" * 60)
print("Example 1: Two-group independent comparison")
print("=" * 60)
rec = advisor.recommend_test(
data_type="continuous",
groups=2,
independent=True,
distribution="normal"
)
print(f"Recommended: {rec.recommended_test}")
print(f"Alternatives: {', '.join(rec.alternative_tests)}")
print(f"Rationale:\n{rec.rationale}")
print(f"Assumptions:")
for assumption in rec.assumptions:
print(f" - {assumption}")
# Example 2: Power analysis
print("\n" + "=" * 60)
print("Example 2: Power Analysis")
print("=" * 60)
power_result = advisor.calculate_power(
effect_size=0.5,
alpha=0.05,
sample_size=30
)
print(f"Effect size: {power_result.effect_size}")
print(f"Sample size: {power_result.sample_size}")
print(f"Achieved power: {power_result.power:.1%}")
print(f"Interpretation: {power_result.interpretation}")
# Example 3: Sample size calculation
print("\n" + "=" * 60)
print("Example 3: Sample Size Calculation")
print("=" * 60)
size_result = advisor.calculate_power(
effect_size=0.5,
alpha=0.05,
power=0.8
)
print(f"For 80% power with effect size 0.5:")
print(f"Required sample size: {size_result.sample_size} per group")
print(f"Total N: {size_result.sample_size * 2}")
# Example 4: Effect size interpretation
print("\n" + "=" * 60)
print("Example 4: Effect Size Interpretation")
print("=" * 60)
for es in [0.1, 0.4, 0.7, 1.2]:
interp = advisor.get_effect_size_interpretation(es, "cohens_d")
print(f"Cohen's d = {es}: {interp}")
if __name__ == "__main__":
main()
Map spatial transcriptomics data from 10x Genomics Visium/Xenium onto.
---
name: spatial-transcriptomics-mapper
description: Map spatial transcriptomics data from 10x Genomics Visium/Xenium onto tissue section images.
license: MIT
skill-author: AIPOCH
---
# Spatial Transcriptomics Mapper (ID: 196)
## When to Use
- Use this skill when the task is to map spatial transcriptomics data from 10x Genomics Visium/Xenium onto tissue section images.
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.
## Key Features
See `## Features` below for related details.
- Scope-focused workflow aligned to mapping spatial transcriptomics data from 10x Genomics Visium/Xenium onto tissue section images.
- Packaged executable path(s): `scripts/__init__.py` plus 1 additional script(s).
- Structured execution path designed to keep outputs consistent and reviewable.
## Dependencies
See `## Prerequisites` below for related details.
- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `h5py`: `unspecified`. Declared in `requirements.txt`.
- `matplotlib`: `unspecified`. Declared in `requirements.txt`.
- `numpy`: `unspecified`. Declared in `requirements.txt`.
- `pandas`: `unspecified`. Declared in `requirements.txt`.
- `pillow` (imported as `PIL`): `unspecified`. Required by `scripts/main.py`.
- `pyarrow`: `unspecified`. Declared in `requirements.txt`.
- `scanpy`: `unspecified`. Declared in `requirements.txt`.
- `seaborn`: `unspecified`. Declared in `requirements.txt`.
- `squidpy`: `unspecified`. Declared in `requirements.txt`.
- `tifffile`: `unspecified`. Declared in `requirements.txt`.
## Example Usage
See `## Usage` below for related details.
```bash
cd "20260318/scientific-skills/Data Analytics/spatial-transcriptomics-mapper"
python -m py_compile scripts/main.py
python scripts/main.py --help
```
Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.
## Implementation Details
See `## Workflow` below for related details.
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/__init__.py` with additional helper scripts under `scripts/`.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
## Quick Check
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
```bash
python -m py_compile scripts/main.py
```
## Audit-Ready Commands
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```
## Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
## Description
Spatial Transcriptomics analysis tool for processing 10x Genomics Visium or Xenium data, projecting gene expression data back onto tissue section images to draw "gene-space" distribution maps. Supports gene expression visualization, spatial clustering analysis, and morphological feature correlation.
## Features
- **Visium Data Processing**: Supports Space Ranger output (filtered_feature_bc_matrix.h5, spatial/tissue_positions_list.csv, spatial/tissue_lowres_image.png)
- **Xenium Data Processing**: Supports the Xenium output bundle (cell_feature_matrix.h5, transcripts.parquet, nucleus_boundaries.parquet)
- **Gene Expression Mapping**: Projects expression of specified genes onto tissue images
- **Spatial Clustering Visualization**: Displays spatial distribution of Seurat/Scanpy clustering results
- **Multi-gene Joint Analysis**: Supports combined visualization of multiple genes
- **High-resolution Output**: Supports high-resolution image export
## Installation
```text
# Required dependencies
pip install scanpy squidpy matplotlib seaborn pillow numpy pandas h5py
# Optional: For Xenium data processing
pip install pyarrow dask
# Optional: For advanced image processing
pip install opencv-python scikit-image
```
## Quick Start - Test Data
Generate sample Visium data to test the tool:
```text
# Generate test data
python scripts/generate_test_data.py \
--platform visium \
--output ./test_data/visium_sample \
--n-spots 500 \
--n-genes 1000
# Run analysis on test data
python scripts/main.py \
--platform visium \
--data-dir ./test_data/visium_sample \
--gene GENE_0000 \
--output ./test_output/
```
## Usage
### Basic - Visium Data
```text
python scripts/main.py \
--platform visium \
--data-dir /path/to/spaceranger/outs/ \
--gene PIK3CA \
--output ./output/
```
### Basic - Xenium Data
```text
python scripts/main.py \
--platform xenium \
--data-dir /path/to/xenium/outs/ \
--gene PIK3CA \
--output ./output/
```
### Multiple Genes
```text
python scripts/main.py \
--platform visium \
--data-dir /path/to/data/ \
--genes PIK3CA,PTEN,EGFR \
--mode overlay \
--output ./output/
```
### With Clustering Results
```text
python scripts/main.py \
--platform visium \
--data-dir /path/to/data/ \
--cluster-file ./clusters.csv \
--output ./output/
```
## Input File Structure
### Visium (Space Ranger output)
```
outs/
├── filtered_feature_bc_matrix.h5    # Gene expression matrix
├── raw_feature_bc_matrix.h5         # Raw counts (optional)
├── spatial/
│   ├── tissue_positions_list.csv    # Spot positions
│   ├── tissue_lowres_image.png      # Low-res H&E image
│   ├── tissue_hires_image.png       # High-res H&E image
│   └── scalefactors_json.json       # Scale factors
└── web_summary.html
```
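Before running the mapper, it can help to confirm that a Space Ranger `outs/` directory has the layout shown above. The sketch below is a hypothetical pre-flight check, not part of `scripts/main.py`; the choice of which files count as required (`REQUIRED_VISIUM_FILES`) is an assumption based on the tree above, with the raw matrix and high-resolution image treated as optional.

```python
import tempfile
from pathlib import Path

# Assumed minimum set of files; adjust to what the analysis actually reads.
REQUIRED_VISIUM_FILES = (
    "filtered_feature_bc_matrix.h5",
    "spatial/tissue_positions_list.csv",
    "spatial/tissue_lowres_image.png",
    "spatial/scalefactors_json.json",
)


def missing_visium_files(outs_dir):
    """Return the required files (relative paths) that are absent from outs_dir."""
    root = Path(outs_dir)
    return [rel for rel in REQUIRED_VISIUM_FILES if not (root / rel).is_file()]


# Demonstration on a throwaway directory that holds only two of the files
with tempfile.TemporaryDirectory() as tmp:
    outs = Path(tmp)
    (outs / "spatial").mkdir()
    (outs / "filtered_feature_bc_matrix.h5").touch()
    (outs / "spatial" / "scalefactors_json.json").touch()
    missing = missing_visium_files(outs)

print("Missing files:", missing)
```

An empty list means the directory matches the expected layout; anything else can be reported back to the user before loading starts.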
### Xenium
```
outs/
├── cell_feature_matrix.h5           # Cell x gene matrix
├── transcripts.parquet              # Transcript coordinates
├── nucleus_boundaries.parquet       # Nucleus boundaries
├── cell_boundaries.parquet          # Cell boundaries
├── morphology_focus.ome.tif         # Morphology image
└── experiment.xenium
```
## Output Files
- `{gene}_spatial_map.png`: Single gene spatial expression map
- `{gene}_heatmap.png`: Gene expression heatmap
- `multi_gene_overlay.png`: Multi-gene overlay map (if using --mode overlay)
- `cluster_spatial_map.png`: Cluster spatial distribution map
- `combined_report.html`: Comprehensive HTML report
## Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--platform` | str | required | Platform type: visium or xenium |
| `--data-dir` | str | required | Data directory path |
| `--gene` | str | optional | Single gene name |
| `--genes` | list | optional | Multiple genes, comma-separated |
| `--mode` | str | single | Mode: single/overlay/multi |
| `--cluster-file` | str | optional | Clustering result CSV file path |
| `--output` | str | ./output | Output directory |
| `--dpi` | int | 300 | Output image DPI |
| `--cmap` | str | viridis | Color map scheme |
| `--spot-size` | float | 1.0 | Visium spot size factor |
| `--alpha` | float | 0.8 | Transparency (0-1) |
| `--min-count` | int | 0 | Minimum expression filter |
| `--crop` | str | optional | Crop region (x1,y1,x2,y2) |
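Two of the parameters above carry structured values in a single string: `--genes` (comma-separated names) and `--crop` (`x1,y1,x2,y2`). The sketch below shows one way such values could be parsed and validated with `argparse`. It is a hypothetical illustration, not the parser used in `scripts/main.py`; the helper names, the integer pixel coordinates, and the `x2 > x1`, `y2 > y1` check are assumptions.

```python
import argparse


def parse_gene_list(value):
    """Split 'A,B,C' into ['A', 'B', 'C'], ignoring stray whitespace."""
    genes = [g.strip() for g in value.split(",") if g.strip()]
    if not genes:
        raise argparse.ArgumentTypeError("expected at least one gene name")
    return genes


def parse_crop(value):
    """Parse 'x1,y1,x2,y2' into a tuple of four integers."""
    parts = value.split(",")
    if len(parts) != 4:
        raise argparse.ArgumentTypeError("expected x1,y1,x2,y2")
    try:
        x1, y1, x2, y2 = (int(p) for p in parts)
    except ValueError:
        raise argparse.ArgumentTypeError("crop coordinates must be integers") from None
    if x2 <= x1 or y2 <= y1:
        raise argparse.ArgumentTypeError("require x2 > x1 and y2 > y1")
    return x1, y1, x2, y2


parser = argparse.ArgumentParser(description="Hypothetical argument sketch")
parser.add_argument("--genes", type=parse_gene_list)
parser.add_argument("--crop", type=parse_crop)

# Parse an explicit argument list instead of sys.argv for demonstration
args = parser.parse_args(["--genes", "PIK3CA, PTEN,EGFR", "--crop", "0,0,512,256"])
print(args.genes, args.crop)
```

Raising `argparse.ArgumentTypeError` lets `argparse` report a malformed value as a normal usage error instead of a traceback.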
## Examples
### Example 1: Single Gene Visualization
```text
python scripts/main.py \
--platform visium \
--data-dir ./visium_sample/outs/ \
--gene EPCAM \
--cmap Reds \
--output ./results/
```
### Example 2: Tumor Marker Combination
```text
python scripts/main.py \
--platform visium \
--data-dir ./breast_cancer/outs/ \
--genes PIK3CA,ERBB2,ESR1,PGR \
--mode multi \
--cmap plasma \
--output ./tumor_markers/
```
### Example 3: Xenium Subcellular Resolution
```text
python scripts/main.py \
--platform xenium \
--data-dir ./xenium_lung/outs/ \
--genes SFTPB,SFTPC,SCGB1A1 \
--dpi 600 \
--output ./xenium_results/
```
### Example 4: Spatial Clustering Visualization
```text
python scripts/main.py \
--platform visium \
--data-dir ./sample/outs/ \
--cluster-file ./seurat_clusters.csv \
--output ./clusters/
```
## API Usage
```python
from skills.spatial_transcriptomics_mapper.scripts.main import SpatialMapper
# Initialize
mapper = SpatialMapper(
platform="visium",
data_dir="/path/to/data",
output_dir="./output"
)
# Load data
mapper.load_data()
# Plot single gene
mapper.plot_gene_spatial(
gene="PIK3CA",
cmap="viridis",
save_path="./output/pik3ca.png"
)
# Plot multiple genes
mapper.plot_multi_genes(
genes=["PIK3CA", "PTEN", "EGFR"],
mode="grid",
save_path="./output/multi.png"
)
# Get spatial statistics
stats = mapper.get_spatial_stats(gene="PIK3CA")
```
## Notes
- Visium data uses the low-resolution image by default to speed up processing; pass `--hires` to use the high-resolution image instead.
- For large Xenium datasets, use `--crop` to restrict processing to a region of interest.
- Color map reference: https://matplotlib.org/stable/tutorials/colors/colormaps.html
- For large samples, consider `--downsample` to reduce resolution.
- `--hires` and `--downsample` are not listed in the Parameters table; confirm they are available with `python scripts/main.py --help`.
## References
- 10x Genomics Visium: https://www.10xgenomics.com/products/spatial-gene-expression
- 10x Genomics Xenium: https://www.10xgenomics.com/platforms/xenium
- Scanpy: https://scanpy.readthedocs.io/
- Squidpy: https://squidpy.readthedocs.io/
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
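The two path-related items in the checklist (no `../` traversal, output restricted to the workspace) can be enforced with a small helper. The sketch below is a hypothetical example of such a check, not code taken from `scripts/main.py`; the function name `resolve_inside` is an assumption, and it relies on `Path.is_relative_to`, which requires Python 3.9 or later.

```python
import tempfile
from pathlib import Path


def resolve_inside(workspace, user_path):
    """Resolve user_path against workspace and reject anything outside it."""
    root = Path(workspace).resolve()
    # Joining with an absolute user_path discards root, so that case is
    # rejected by the same containment check below.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes workspace: {user_path}")
    return candidate


# Demonstration with a temporary workspace
with tempfile.TemporaryDirectory() as workspace:
    safe = resolve_inside(workspace, "output/pik3ca.png")
    try:
        resolve_inside(workspace, "../outside.txt")
        escaped = False
    except ValueError:
        escaped = True

print(safe.name, "| traversal rejected:", escaped)
```

Resolving both paths before comparing them means symlinks and `..` segments are normalized first, so a textual prefix match is never relied on.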
## Prerequisites
```text
# Python dependencies
pip install -r requirements.txt
```
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
- Performance optimization
- Additional feature support
## Output Requirements
Every final response should make these items explicit when they are relevant:
- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks
## Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
## Input Validation
This skill accepts requests that match the documented purpose of `spatial-transcriptomics-mapper` and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
> `spatial-transcriptomics-mapper` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
## Response Template
Use the following fixed structure for non-trivial requests:
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
FILE:requirements.txt
h5py
matplotlib
numpy
pandas
pillow
pyarrow
scanpy
seaborn
squidpy
tifffile
FILE:scripts/main.py
#!/usr/bin/env python3
"""Spatial Transcriptomics Mapper
Process 10x Visium or Xenium data to project gene expression back to tissue section images
Author: OpenClaw
Version: 1.0.0"""
import os
import sys
import json
import argparse
import warnings
from pathlib import Path
from typing import List, Dict, Optional, Tuple, Union
import logging
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.colors import Normalize, LinearSegmentedColormap
from PIL import Image
# Optional imports with fallbacks
try:
    import scanpy as sc
    SCANPY_AVAILABLE = True
except ImportError:
    SCANPY_AVAILABLE = False
    warnings.warn("Scanpy not available. Some features may be limited.")
try:
    import squidpy as sq
    SQUIDPY_AVAILABLE = True
except ImportError:
    SQUIDPY_AVAILABLE = False
try:
    import seaborn as sns
    SEABORN_AVAILABLE = True
except ImportError:
    SEABORN_AVAILABLE = False
try:
    import pyarrow.parquet as pq
    PYARROW_AVAILABLE = True
except ImportError:
    PYARROW_AVAILABLE = False
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
class SpatialMapper:
    """Spatial transcriptomics data mapper; supports the Visium and Xenium platforms"""
    def __init__(
        self,
        platform: str,
        data_dir: str,
        output_dir: str = "./output",
        dpi: int = 300,
        cmap: str = "viridis"
    ):
        """Initialize SpatialMapper
        Parameters
        ----------
        platform : str
            Platform type: 'visium' or 'xenium'
        data_dir : str
            Data directory path
        output_dir : str
            Output directory path
        dpi : int
            Output image DPI
        cmap : str
            Matplotlib colormap name"""
        self.platform = platform.lower()
        self.data_dir = Path(data_dir)
        self.output_dir = Path(output_dir)
        self.dpi = dpi
        self.cmap = cmap
        # Validate platform
        if self.platform not in ["visium", "xenium"]:
            raise ValueError(f"Unsupported platform: {platform}. Use 'visium' or 'xenium'.")
        # Create output directory
        self.output_dir.mkdir(parents=True, exist_ok=True)
        # Data containers
        self.adata = None  # AnnData object (spot matrix for Visium, cell matrix for Xenium)
        self.gene_names = []
        self.image = None
        self.scale_factors = {}
        self.positions = None
        # Xenium specific
        self.transcripts = None
        self.boundaries = None
        logger.info(f"Initialized SpatialMapper for {platform}")
        logger.info(f"Data directory: {data_dir}")
        logger.info(f"Output directory: {output_dir}")
    def load_data(self) -> None:
        """Load spatial transcriptomics data for the configured platform"""
        if self.platform == "visium":
            self._load_visium_data()
        elif self.platform == "xenium":
            self._load_xenium_data()
    def _load_visium_data(self) -> None:
        """Load Visium data (Space Ranger output)"""
        if not SCANPY_AVAILABLE:
            raise ImportError("Scanpy is required for Visium data processing. Install with: pip install scanpy")
        logger.info("Loading Visium data...")
        # Check for required files
        matrix_file = self.data_dir / "filtered_feature_bc_matrix.h5"
        if not matrix_file.exists():
            # Try alternative path structure
            matrix_file = self.data_dir / "filtered_gene_bc_matrices" / "h5" / "filtered_feature_bc_matrix.h5"
        if not matrix_file.exists():
            raise FileNotFoundError(f"Matrix file not found: {matrix_file}")
        # Load AnnData. count_file is resolved relative to data_dir, so pass the
        # relative path (passing None fails when the alternative location is used).
        self.adata = sc.read_visium(
            self.data_dir,
            count_file=str(matrix_file.relative_to(self.data_dir)),
            load_images=True
        )
        # Store gene names
        self.gene_names = self.adata.var_names.tolist()
        logger.info(f"Loaded {len(self.gene_names)} genes and {self.adata.n_obs} spots")
        # Load scale factors
        scalefactors_file = self.data_dir / "spatial" / "scalefactors_json.json"
        if scalefactors_file.exists():
            with open(scalefactors_file) as f:
                self.scale_factors = json.load(f)
        # Load tissue image
        img_path = self.data_dir / "spatial" / "tissue_lowres_image.png"
        if img_path.exists():
            self.image = np.array(Image.open(img_path))
            logger.info(f"Loaded tissue image: {self.image.shape}")
    def _load_xenium_data(self) -> None:
        """Load Xenium data"""
        logger.info("Loading Xenium data...")
        # Load cell feature matrix
        matrix_file = self.data_dir / "cell_feature_matrix.h5"
        if matrix_file.exists() and SCANPY_AVAILABLE:
            self.adata = sc.read_10x_h5(str(matrix_file))
            self.gene_names = self.adata.var_names.tolist()
            logger.info(f"Loaded {len(self.gene_names)} genes")
        # Load transcripts
        transcripts_file = self.data_dir / "transcripts.parquet"
        if transcripts_file.exists():
            if PYARROW_AVAILABLE:
                self.transcripts = pq.read_table(transcripts_file).to_pandas()
                logger.info(f"Loaded {len(self.transcripts)} transcripts")
            else:
                logger.warning("PyArrow not available, skipping transcripts loading")
        # Load nucleus boundaries
        boundaries_file = self.data_dir / "nucleus_boundaries.parquet"
        if boundaries_file.exists() and PYARROW_AVAILABLE:
            self.boundaries = pq.read_table(boundaries_file).to_pandas()
            logger.info(f"Loaded {len(self.boundaries)} boundary points")
        # Load morphology image
        img_file = self.data_dir / "morphology_focus.ome.tif"
        if img_file.exists():
            try:
                from tifffile import imread
                self.image = imread(str(img_file))
                logger.info(f"Loaded morphology image: {self.image.shape}")
            except ImportError:
                logger.warning("tifffile not available for OME-TIFF loading")
    def plot_gene_spatial(
        self,
        gene: str,
        save_path: Optional[str] = None,
        title: Optional[str] = None,
        spot_size: float = 1.0,
        alpha: float = 0.8,
        vmin: Optional[float] = None,
        vmax: Optional[float] = None,
        show_image: bool = True,
        figsize: Tuple[int, int] = (10, 10)
    ) -> plt.Figure:
        """Plot the spatial expression of a single gene
        Parameters
        ----------
        gene : str
            Gene name
        save_path : str, optional
            Path to save the figure to
        title : str, optional
            Plot title
        spot_size : float
            Spot size factor
        alpha : float
            Transparency
        vmin, vmax : float, optional
            Color range
        show_image : bool
            Whether to display the tissue image
        figsize : tuple
            Figure size
        Returns
        -------
        fig : matplotlib.figure.Figure"""
        if gene not in self.gene_names:
            raise ValueError(f"Gene '{gene}' not found in dataset")
        if self.platform == "visium":
            return self._plot_visium_gene(
                gene, save_path, title, spot_size, alpha,
                vmin, vmax, show_image, figsize
            )
        else:
            return self._plot_xenium_gene(
                gene, save_path, title, alpha, vmin, vmax, figsize
            )
    def _plot_visium_gene(
        self,
        gene: str,
        save_path: Optional[str],
        title: Optional[str],
        spot_size: float,
        alpha: float,
        vmin: Optional[float],
        vmax: Optional[float],
        show_image: bool,
        figsize: Tuple[int, int]
    ) -> plt.Figure:
        """Draw a Visium single-gene spatial map"""
        fig, ax = plt.subplots(figsize=figsize)
        # Get gene expression
        gene_idx = self.gene_names.index(gene)
        expression = self.adata.X[:, gene_idx].toarray().flatten() if hasattr(self.adata.X, 'toarray') else self.adata.X[:, gene_idx]
        # Get spatial coordinates
        spatial = self.adata.obsm['spatial']
        # Plot tissue image background
        if show_image and self.image is not None:
            ax.imshow(self.image)
            # Scale coordinates to image
            if 'tissue_lowres_scalef' in self.scale_factors:
                scale = self.scale_factors['tissue_lowres_scalef']
                spatial_scaled = spatial * scale
            else:
                spatial_scaled = spatial
        else:
            spatial_scaled = spatial
        # Filter zero expression for better visualization
        nonzero_mask = expression > 0
        nonzero_values = expression[nonzero_mask]
        # Default color limits: 1st/99th percentile of nonzero expression.
        # Explicit None checks keep a user-supplied 0 and avoid percentiles of an empty array.
        if vmin is None and nonzero_values.size:
            vmin = np.percentile(nonzero_values, 1)
        if vmax is None and nonzero_values.size:
            vmax = np.percentile(nonzero_values, 99)
        # Plot spots
        scatter = ax.scatter(
            spatial_scaled[nonzero_mask, 0],
            spatial_scaled[nonzero_mask, 1],
            c=nonzero_values,
            cmap=self.cmap,
            s=spot_size * 50,
            alpha=alpha,
            edgecolors='none',
            vmin=vmin,
            vmax=vmax
        )
        # Add colorbar
        cbar = plt.colorbar(scatter, ax=ax, shrink=0.5)
        cbar.set_label(f'{gene} Expression', fontsize=10)
        # Styling
        ax.set_title(title or f'{gene} Spatial Expression', fontsize=14, fontweight='bold')
        ax.set_xlabel('X coordinate', fontsize=10)
        ax.set_ylabel('Y coordinate', fontsize=10)
        ax.axis('off' if show_image else 'on')
        plt.tight_layout()
        # Save figure
        if save_path:
            plt.savefig(save_path, dpi=self.dpi, bbox_inches='tight',
                        facecolor='white', edgecolor='none')
            logger.info(f"Saved figure to {save_path}")
        return fig
    def _plot_xenium_gene(
        self,
        gene: str,
        save_path: Optional[str],
        title: Optional[str],
        alpha: float,
        vmin: Optional[float],
        vmax: Optional[float],
        figsize: Tuple[int, int]
    ) -> plt.Figure:
        """Draw a Xenium single-gene transcript map"""
        fig, ax = plt.subplots(figsize=figsize)
        if self.transcripts is not None:
            # Filter transcripts for this gene
            gene_transcripts = self.transcripts[self.transcripts['feature_name'] == gene]
            if len(gene_transcripts) == 0:
                logger.warning(f"No transcripts found for gene {gene}")
                ax.text(0.5, 0.5, f"No transcripts found for {gene}",
                        ha='center', va='center', transform=ax.transAxes)
            else:
                # Color by quality value when the column exists; otherwise use a
                # constant array (a bare scalar is not a valid `c` for scatter)
                if 'qv' in gene_transcripts.columns:
                    qv = gene_transcripts['qv']
                else:
                    qv = np.full(len(gene_transcripts), 20.0)
                # Plot transcripts
                scatter = ax.scatter(
                    gene_transcripts['x_location'],
                    gene_transcripts['y_location'],
                    c=qv,
                    cmap=self.cmap,
                    s=1,
                    alpha=alpha,
                    vmin=0 if vmin is None else vmin,
                    vmax=40 if vmax is None else vmax
                )
                cbar = plt.colorbar(scatter, ax=ax, shrink=0.5)
                cbar.set_label('Quality Value', fontsize=10)
                logger.info(f"Plotted {len(gene_transcripts)} transcripts for {gene}")
        else:
            ax.text(0.5, 0.5, "Transcript data not available",
                    ha='center', va='center', transform=ax.transAxes)
        # Styling
        ax.set_title(title or f'{gene} Transcript Locations', fontsize=14, fontweight='bold')
        ax.set_aspect('equal')
        ax.axis('off')
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=self.dpi, bbox_inches='tight',
                        facecolor='white', edgecolor='none')
            logger.info(f"Saved figure to {save_path}")
        return fig
    def plot_multi_genes(
        self,
        genes: List[str],
        mode: str = "grid",
        save_path: Optional[str] = None,
        spot_size: float = 1.0,
        alpha: float = 0.8,
        figsize_per_gene: Tuple[int, int] = (5, 5)
    ) -> plt.Figure:
        """Plot spatial maps for multiple genes
        Parameters
        ----------
        genes : list
            Gene name list
        mode : str
            Layout mode: 'grid' or 'overlay'
        save_path : str, optional
            Path to save the figure to
        spot_size : float
            Spot size
        alpha : float
            Transparency
        figsize_per_gene : tuple
            Subplot size for each gene
        Returns
        -------
        fig : matplotlib.figure.Figure"""
        # Validate genes
        missing_genes = [g for g in genes if g not in self.gene_names]
        if missing_genes:
            logger.warning(f"Genes not found: {missing_genes}")
            genes = [g for g in genes if g in self.gene_names]
        if not genes:
            raise ValueError("No valid genes to plot")
        if mode == "grid":
            n_genes = len(genes)
            n_cols = min(3, n_genes)
            n_rows = (n_genes + n_cols - 1) // n_cols
            fig, axes = plt.subplots(
                n_rows, n_cols,
                figsize=(figsize_per_gene[0] * n_cols, figsize_per_gene[1] * n_rows)
            )
            axes = np.atleast_1d(axes).flatten()
            for i, gene in enumerate(genes):
                ax = axes[i]
                self._plot_gene_on_axis(ax, gene, spot_size, alpha)
            # Hide unused subplots
            for i in range(len(genes), len(axes)):
                axes[i].axis('off')
        elif mode == "overlay":
            fig, ax = plt.subplots(figsize=(10, 10))
            self._plot_genes_overlay(ax, genes, spot_size, alpha)
        else:
            raise ValueError(f"Unknown mode: {mode}")
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=self.dpi, bbox_inches='tight',
                        facecolor='white', edgecolor='none')
            logger.info(f"Saved multi-gene figure to {save_path}")
        return fig
    def _plot_gene_on_axis(
        self,
        ax: plt.Axes,
        gene: str,
        spot_size: float,
        alpha: float
    ) -> None:
        """Plot gene expression on the given axis (Visium only)"""
        if self.platform != "visium" or self.adata is None:
            # Warn instead of silently leaving the axis empty
            logger.warning(f"Per-axis gene plots are only implemented for Visium data; skipping {gene}")
            return
        gene_idx = self.gene_names.index(gene)
        expression = self.adata.X[:, gene_idx].toarray().flatten() if hasattr(self.adata.X, 'toarray') else self.adata.X[:, gene_idx]
        spatial = self.adata.obsm['spatial']
        # Scale coordinates
        if 'tissue_lowres_scalef' in self.scale_factors:
            scale = self.scale_factors['tissue_lowres_scalef']
            spatial_scaled = spatial * scale
        else:
            spatial_scaled = spatial
        # Plot
        nonzero_mask = expression > 0
        scatter = ax.scatter(
            spatial_scaled[nonzero_mask, 0],
            spatial_scaled[nonzero_mask, 1],
            c=expression[nonzero_mask],
            cmap=self.cmap,
            s=spot_size * 30,
            alpha=alpha,
            edgecolors='none'
        )
        # Add tissue image background
        if self.image is not None:
            ax.imshow(self.image, alpha=0.3)
        ax.set_title(gene, fontsize=12, fontweight='bold')
        ax.axis('off')
        plt.colorbar(scatter, ax=ax, shrink=0.5)
    def _plot_genes_overlay(
        self,
        ax: plt.Axes,
        genes: List[str],
        spot_size: float,
        alpha: float
    ) -> None:
        """Overlay multiple genes on one axis, one color per gene (Visium only)"""
        if self.platform != "visium" or self.adata is None:
            return
        # Use distinct colors for each gene
        colors = plt.cm.Set1(np.linspace(0, 1, len(genes)))
        spatial = self.adata.obsm['spatial']
        if 'tissue_lowres_scalef' in self.scale_factors:
            scale = self.scale_factors['tissue_lowres_scalef']
            spatial_scaled = spatial * scale
        else:
            spatial_scaled = spatial
        for i, gene in enumerate(genes):
            gene_idx = self.gene_names.index(gene)
            expression = self.adata.X[:, gene_idx].toarray().flatten() if hasattr(self.adata.X, 'toarray') else self.adata.X[:, gene_idx]
            nonzero_mask = expression > 0
            if not nonzero_mask.any():
                # .max() of an empty array raises, so skip genes with no expression
                logger.warning(f"Gene '{gene}' has no nonzero expression; skipping in overlay")
                continue
            ax.scatter(
                spatial_scaled[nonzero_mask, 0],
                spatial_scaled[nonzero_mask, 1],
                c=[colors[i]],
                s=spot_size * 30 * (expression[nonzero_mask] / expression[nonzero_mask].max()),
                alpha=alpha,
                edgecolors='none',
                label=gene
            )
        if self.image is not None:
            ax.imshow(self.image, alpha=0.3)
        ax.legend(loc='upper right', title='Genes')
        ax.set_title('Multi-gene Overlay', fontsize=14, fontweight='bold')
        ax.axis('off')
    def plot_cluster_spatial(
        self,
        cluster_file: Optional[str] = None,
        cluster_key: str = "cluster",
        save_path: Optional[str] = None,
        spot_size: float = 1.5,
        figsize: Tuple[int, int] = (10, 10)
    ) -> plt.Figure:
        """Plot spatial clustering
        Parameters
        ----------
        cluster_file : str, optional
            Clustering result CSV file path
        cluster_key : str
            Clustering obs key in AnnData
        save_path : str, optional
            Path to save the figure to
        spot_size : float
            Spot size
        figsize : tuple
            Figure size
        Returns
        -------
        fig : matplotlib.figure.Figure"""
        if self.platform != "visium" or self.adata is None:
            raise ValueError("Cluster spatial plot only supported for Visium data")
        # Load cluster assignments if provided
        if cluster_file:
            clusters = pd.read_csv(cluster_file, index_col=0)
            cluster_col = clusters.columns[0]
            self.adata.obs['cluster'] = clusters[cluster_col].astype('category')
        elif cluster_key in self.adata.obs:
            self.adata.obs['cluster'] = self.adata.obs[cluster_key].astype('category')
        else:
            raise ValueError(f"No cluster information found. Provide cluster_file or ensure '{cluster_key}' exists in adata.obs")
        fig, ax = plt.subplots(figsize=figsize)
        # Get coordinates
        spatial = self.adata.obsm['spatial']
        if 'tissue_lowres_scalef' in self.scale_factors:
            scale = self.scale_factors['tissue_lowres_scalef']
            spatial_scaled = spatial * scale
        else:
            spatial_scaled = spatial
        # Get unique clusters
        clusters = self.adata.obs['cluster'].cat.categories
        # Take the 20 discrete tab20 colors; they are reused cyclically beyond 20 clusters
        colors = plt.cm.tab20(np.arange(20))
        # Plot each cluster
        for i, cluster in enumerate(clusters):
            mask = (self.adata.obs['cluster'] == cluster).to_numpy()
            ax.scatter(
                spatial_scaled[mask, 0],
                spatial_scaled[mask, 1],
                c=[colors[i % 20]],
                s=spot_size * 50,
                alpha=0.8,
                edgecolors='black',
                linewidths=0.3,
                label=f'Cluster {cluster}'
            )
        # Add tissue image
        if self.image is not None:
            ax.imshow(self.image, alpha=0.3, zorder=0)
        ax.legend(loc='upper right', bbox_to_anchor=(1.2, 1.0), title='Clusters')
        ax.set_title('Spatial Cluster Distribution', fontsize=14, fontweight='bold')
        ax.axis('off')
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=self.dpi, bbox_inches='tight',
                        facecolor='white', edgecolor='none')
            logger.info(f"Saved cluster plot to {save_path}")
        return fig
    def get_spatial_stats(self, gene: str) -> Dict:
        """Get spatial statistics for a gene
        Parameters
        ----------
        gene : str
            Gene name
        Returns
        -------
        dict
            Statistics dictionary"""
        if gene not in self.gene_names:
            raise ValueError(f"Gene '{gene}' not found")
        gene_idx = self.gene_names.index(gene)
        if self.adata is not None:
            expression = self.adata.X[:, gene_idx].toarray().flatten() if hasattr(self.adata.X, 'toarray') else self.adata.X[:, gene_idx]
        else:
            return {}
        spatial = self.adata.obsm.get('spatial', None)
        stats = {
            'gene': gene,
            'total_counts': int(expression.sum()),
            'mean_expression': float(expression.mean()),
            'max_expression': float(expression.max()),
            'expressed_spots': int((expression > 0).sum()),
            'detection_rate': float((expression > 0).mean()),
        }
        if spatial is not None:
            # Find hotspot (location with highest expression)
            hotspot_idx = expression.argmax()
            stats['hotspot_location'] = {
                'x': float(spatial[hotspot_idx, 0]),
                'y': float(spatial[hotspot_idx, 1])
            }
        return stats
    def generate_report(
        self,
        genes: List[str],
        output_file: str = "spatial_report.html"
    ) -> str:
        """Generate a comprehensive report in HTML format
        Parameters
        ----------
        genes : list
            List of genes to include in the report
        output_file : str
            Output HTML file name
        Returns
        -------
        str
            Output file path"""
        report_path = self.output_dir / output_file
        # Generate plots for each gene
        plot_paths = []
        for gene in genes[:6]:  # Limit to 6 genes for report
            if gene in self.gene_names:
                plot_path = self.output_dir / f"{gene}_spatial_map.png"
                self.plot_gene_spatial(gene, save_path=str(plot_path))
                plot_paths.append((gene, f"{gene}_spatial_map.png"))
        # Generate cluster plot if available
        cluster_plot = None
        if self.adata is not None and 'cluster' in self.adata.obs:
            cluster_plot_path = self.output_dir / "cluster_spatial_map.png"
            self.plot_cluster_spatial(save_path=str(cluster_plot_path))
            cluster_plot = "cluster_spatial_map.png"
        # Build HTML
        html_content = f"""
<!DOCTYPE html>
<html>
<head>
<title>Spatial Transcriptomics Report</title>
<style>
body {{ font-family: Arial, sans-serif; margin: 40px; background-color: #f5f5f5; }}
h1 {{ color: #333; border-bottom: 3px solid #4CAF50; padding-bottom: 10px; }}
h2 {{ color: #555; margin-top: 30px; }}
.gene-grid {{ display: grid; grid-template-columns: repeat(auto-fit, minmax(400px, 1fr)); gap: 20px; margin-top: 20px; }}
.gene-card {{ background: white; border-radius: 8px; padding: 15px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); }}
.gene-card img {{ width: 100%; height: auto; border-radius: 4px; }}
.gene-name {{ font-size: 18px; font-weight: bold; color: #4CAF50; margin-bottom: 10px; }}
.stats {{ background: white; padding: 20px; border-radius: 8px; margin-top: 20px; }}
table {{ width: 100%; border-collapse: collapse; }}
th, td {{ padding: 12px; text-align: left; border-bottom: 1px solid #ddd; }}
th {{ background-color: #4CAF50; color: white; }}
.info {{ background: #e3f2fd; padding: 15px; border-radius: 8px; margin-bottom: 20px; }}
</style>
</head>
<body>
<h1>🧬 Spatial Transcriptomics Analysis Report</h1>
<div class="info">
<strong>Platform:</strong> {self.platform.upper()}<br>
<strong>Sample:</strong> {self.data_dir.name}<br>
<strong>Total Genes:</strong> {len(self.gene_names)}<br>
{f"<strong>Total Spots:</strong> {self.adata.n_obs}" if self.adata else ""}
</div>
<h2>📊 Gene Expression Maps</h2>
<div class="gene-grid">
"""
        for gene, plot_file in plot_paths:
            stats = self.get_spatial_stats(gene)
            html_content += f"""
<div class="gene-card">
<div class="gene-name">{gene}</div>
<img src="{plot_file}" alt="{gene} expression">
<div style="margin-top: 10px; font-size: 12px; color: #666;">
Expression: {stats.get('total_counts', 'N/A')} counts |
Detection rate: {stats.get('detection_rate', 0):.1%}
</div>
</div>
"""
        html_content += """
</div>
"""
        if cluster_plot:
            html_content += f"""
<h2>🔬 Spatial Clusters</h2>
<div class="gene-card">
<img src="{cluster_plot}" alt="Cluster map" style="max-width: 600px;">
</div>
"""
        # Add statistics table
        html_content += """
<h2>📈 Expression Statistics</h2>
<div class="stats">
<table>
<tr>
<th>Gene</th>
<th>Total Counts</th>
<th>Mean Expression</th>
<th>Max Expression</th>
<th>Detection Rate</th>
</tr>
"""
        for gene in genes[:10]:
            if gene in self.gene_names:
                stats = self.get_spatial_stats(gene)
                html_content += f"""
<tr>
<td>{gene}</td>
<td>{stats.get('total_counts', 0):,}</td>
<td>{stats.get('mean_expression', 0):.4f}</td>
<td>{stats.get('max_expression', 0):.1f}</td>
<td>{stats.get('detection_rate', 0):.1%}</td>
</tr>
"""
        html_content += """
</table>
</div>
<footer style="margin-top: 40px; text-align: center; color: #999; font-size: 12px;">
Generated by Spatial Transcriptomics Mapper v1.0
</footer>
</body>
</html>
"""
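        # Write as UTF-8 explicitly: the report contains emoji, which can fail
        # under a platform default encoding such as cp1252
        with open(report_path, 'w', encoding='utf-8') as f:
            f.write(html_content)
        logger.info(f"Generated HTML report: {report_path}")
        return str(report_path)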
def parse_args() -> argparse.Namespace:
    """Parse command-line arguments"""
    parser = argparse.ArgumentParser(
        description="Spatial Transcriptomics Mapper - Project gene expression back to tissue section images",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""Examples:
  # Visium single gene analysis
  python main.py --platform visium --data-dir ./visium/outs --gene PIK3CA --output ./results/
  # Xenium multi-gene analysis
  python main.py --platform xenium --data-dir ./xenium/outs --genes SFTPB,SFTPC --mode multi
  # Spatial clustering visualization
  python main.py --platform visium --data-dir ./data/outs --cluster-file ./clusters.csv"""
    )
    # Required arguments
    parser.add_argument(
        "--platform",
        type=str,
        required=True,
        choices=["visium", "xenium"],
        help="Spatial transcriptomics platform"
    )
    parser.add_argument(
        "--data-dir",
        type=str,
        required=True,
        help="Data directory path"
    )
    # Gene selection
    gene_group = parser.add_mutually_exclusive_group()
    gene_group.add_argument(
        "--gene",
        type=str,
        help="Single gene name"
    )
    gene_group.add_argument(
        "--genes",
        type=str,
        help="Multiple genes, separated by commas (e.g. PIK3CA,PTEN,EGFR)"
    )
    # Analysis options
    parser.add_argument(
        "--mode",
        type=str,
        default="single",
        choices=["single", "overlay", "multi", "cluster"],
        help="Analysis mode (default: single)"
    )
    parser.add_argument(
        "--cluster-file",
        type=str,
        help="Clustering result CSV file path (barcode,cluster)"
    )
    # Output options
    parser.add_argument(
        "--output",
        type=str,
        default="./output",
        help="Output directory (default: ./output)"
    )
    parser.add_argument(
        "--dpi",
        type=int,
        default=300,
        help="Output image DPI (default: 300)"
    )
    parser.add_argument(
        "--cmap",
        type=str,
        default="viridis",
        help="Matplotlib colormap name (default: viridis)"
    )
    # Visualization options
    parser.add_argument(
        "--spot-size",
        type=float,
        default=1.0,
        help="Visium spot size factor (default: 1.0)"
    )
    parser.add_argument(
        "--alpha",
        type=float,
        default=0.8,
        help="Transparency 0-1 (default: 0.8)"
    )
    parser.add_argument(
        "--min-count",
        type=int,
        default=0,
        help="Minimum expression filter (default: 0)"
    )
    # Advanced options
    parser.add_argument(
        "--crop",
        type=str,
        help="Cropping area (format: x1,y1,x2,y2)"
    )
    parser.add_argument(
        "--report",
        action="store_true",
        help="Generate HTML report"
    )
    parser.add_argument(
        "-v", "--verbose",
        action="store_true",
        help="Verbose output"
    )
    return parser.parse_args()
def main():
    """Command-line entry point"""
    args = parse_args()
    # Set log level
    if args.verbose:
        logging.getLogger().setLevel(logging.DEBUG)
    # Validate data directory
    if not os.path.exists(args.data_dir):
        logger.error(f"Data directory not found: {args.data_dir}")
        sys.exit(1)
    # Initialize mapper
    try:
        mapper = SpatialMapper(
            platform=args.platform,
            data_dir=args.data_dir,
            output_dir=args.output,
            dpi=args.dpi,
            cmap=args.cmap
        )
        # Load data
        logger.info("Loading data...")
        mapper.load_data()
        # Determine genes to analyze
        genes_to_analyze = []
        if args.gene:
            genes_to_analyze = [args.gene]
        elif args.genes:
            genes_to_analyze = [g.strip() for g in args.genes.split(",")]
        # Execute based on mode
        if args.mode == "cluster" or args.cluster_file:
            # Cluster visualization mode
            logger.info("Generating cluster spatial map...")
            output_path = Path(args.output) / "cluster_spatial_map.png"
            mapper.plot_cluster_spatial(
                cluster_file=args.cluster_file,
                save_path=str(output_path),
                spot_size=args.spot_size
            )
        elif args.mode == "multi" and len(genes_to_analyze) > 1:
            # Multi-gene grid mode
            logger.info(f"Generating multi-gene plot for: {genes_to_analyze}")
            output_path = Path(args.output) / "multi_gene_grid.png"
            mapper.plot_multi_genes(
                genes=genes_to_analyze,
                mode="grid",
                save_path=str(output_path),
                spot_size=args.spot_size,
                alpha=args.alpha
            )
        elif args.mode == "overlay" and len(genes_to_analyze) > 1:
            # Overlay mode
            logger.info(f"Generating overlay plot for: {genes_to_analyze}")
            output_path = Path(args.output) / "multi_gene_overlay.png"
            mapper.plot_multi_genes(
                genes=genes_to_analyze,
                mode="overlay",
                save_path=str(output_path),
                spot_size=args.spot_size,
                alpha=args.alpha
            )
        elif genes_to_analyze:
            # Single gene mode
            for gene in genes_to_analyze:
                if gene in mapper.gene_names:
                    logger.info(f"Processing gene: {gene}")
                    output_path = Path(args.output) / f"{gene}_spatial_map.png"
                    mapper.plot_gene_spatial(
                        gene=gene,
                        save_path=str(output_path),
                        spot_size=args.spot_size,
                        alpha=args.alpha
                    )
                    # Print stats
                    stats = mapper.get_spatial_stats(gene)
                    logger.info(f"  Total counts: {stats.get('total_counts', 'N/A')}")
                    logger.info(f"  Detection rate: {stats.get('detection_rate', 0):.1%}")
                else:
                    logger.warning(f"Gene '{gene}' not found in dataset")
                    logger.info(f"Available genes (first 10): {mapper.gene_names[:10]}")
        else:
            logger.info("No genes specified. Use --gene or --genes to specify genes to analyze.")
            logger.info(f"Dataset contains {len(mapper.gene_names)} genes")
        # Generate HTML report if requested
        if args.report and genes_to_analyze:
            logger.info("Generating HTML report...")
            report_path = mapper.generate_report(genes=genes_to_analyze)
            logger.info(f"Report saved to: {report_path}")
        logger.info("Analysis complete!")
    except Exception as e:
        logger.error(f"Error during analysis: {e}")
        if args.verbose:
            import traceback
            traceback.print_exc()
        sys.exit(1)
if __name__ == "__main__":
    main()
FILE:scripts/__init__.py
# Spatial Transcriptomics Mapper
from .main import SpatialMapper
__version__ = "1.0.0"
__all__ = ["SpatialMapper"]