AIpoch

@clawhub-aipoch-ai-772015cadb


Vector Text Fixer


---
name: vector-text-fixer
description: Fix garbled text in PDF/SVG vector graphics caused by font encoding issues, making files editable in AI tools. Supports batch processing and JSON export for manual correction.
license: MIT
skill-author: AIPOCH
---

# Vector Text Fixer

Fixes garbled text in PDF/SVG vector graphics caused by font embedding problems, encoding errors, or missing font substitution. Outputs repaired files or editable JSON for AI tool import.

## Quick Check

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
python scripts/main.py --input document.pdf --output fixed.pdf
python scripts/main.py --input diagram.svg --output fixed.svg
```

## When to Use

- Fix garbled/box characters in PDF files caused by font embedding issues
- Repair SVG text encoding errors before editing in Illustrator or Inkscape
- Batch-process a folder of PDF/SVG files with garbled text
- Export a text map JSON for manual correction in AI editors

## Workflow

1. Confirm input file path (PDF or SVG) or batch folder, and desired output path.
2. Validate that the request involves PDF/SVG garbled text repair; stop early if not.
3. Run `scripts/main.py --input <file> --output <file>` or `--batch <folder>`.
4. Return a structured result separating repaired blocks, skipped blocks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the Fallback Template below.

## Fallback Template

If `scripts/main.py` fails or required fields are missing, respond with:

```
FALLBACK REPORT
───────────────────────────────────────
Objective        : <repair goal>
Inputs Available : <file path or batch folder provided>
Missing Inputs   : <list exactly what is missing>
  Note: --input requires a valid PDF or SVG file path, not a text string.
        For batch mode use --batch <folder_path> instead.
Partial Result   : <any blocks repaired safely>
Blocked Steps    : <what could not be completed and why>
Next Steps       : <minimum info needed to complete>
───────────────────────────────────────
```
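The template above can also be rendered programmatically. The following is a minimal sketch, not part of `scripts/main.py`: the `format_fallback_report` helper and the `FIELDS` list are hypothetical names, and the rule width is only an approximation of the template.

```python
from typing import Dict

# Field order follows the fallback template (hypothetical helper, for illustration)
FIELDS = [
    "Objective",
    "Inputs Available",
    "Missing Inputs",
    "Partial Result",
    "Blocked Steps",
    "Next Steps",
]


def format_fallback_report(values: Dict[str, str]) -> str:
    """Render the fallback fields in template order; absent fields show as 'none'."""
    rule = "─" * 39
    lines = ["FALLBACK REPORT", rule]
    for name in FIELDS:
        # Pad each label to 17 characters so the colons line up
        lines.append(f"{name:<17}: {values.get(name, 'none')}")
    lines.append(rule)
    return "\n".join(lines)


report = format_fallback_report({
    "Objective": "Repair garbled text in doc.pdf",
    "Missing Inputs": "--output path",
})
print(report)
```

Fields that were not supplied are printed as `none`, so the reader can see which parts of the report are empty.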

## Stress-Case Output Checklist

For complex multi-constraint requests, always include these sections explicitly:

- **Assumptions**: repair level default (standard), encoding auto-detected
- **Constraints**: encrypted PDFs require password unlock first; scanned PDFs need OCR first
- **Risks**: severely damaged files may not be fully repairable; rare fonts may not map correctly
- **Unresolved Items**: blocks with confidence < 0.3 flagged for manual review
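
The manual-review rule in the checklist can be sketched as a small partition step. This is an illustration only, assuming block dictionaries that carry a `confidence` key as in the JSON export; `split_by_confidence` and `REVIEW_THRESHOLD` are hypothetical names that do not appear in `scripts/main.py`.

```python
from typing import Dict, List, Tuple

REVIEW_THRESHOLD = 0.3  # from the checklist: confidence < 0.3 is flagged for manual review


def split_by_confidence(
    blocks: List[Dict], threshold: float = REVIEW_THRESHOLD
) -> Tuple[List[Dict], List[Dict]]:
    """Return (accepted, unresolved); blocks without a confidence count as unresolved."""
    accepted, unresolved = [], []
    for block in blocks:
        if block.get("confidence", 0.0) < threshold:
            unresolved.append(block)
        else:
            accepted.append(block)
    return accepted, unresolved


blocks = [
    {"id": "tb_001", "confidence": 0.9},
    {"id": "tb_002", "confidence": 0.1},
    {"id": "tb_003", "confidence": 0.3},
]
accepted, unresolved = split_by_confidence(blocks)
print([b["id"] for b in unresolved])  # ['tb_002']
```

A block that sits exactly on the threshold is accepted, because the rule is a strict "less than".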

## Supported Scenarios

**PDF Garbled Text:**
- Box/question mark issues from font embedding problems
- Garbled text from encoding conversion errors
- Missing font substitution characters
- Multi-language mixed encoding issues
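
One common encoding conversion error is UTF-8 text that was decoded as Latin-1. A minimal sketch of the round-trip repair follows; `repair_utf8_as_latin1` is a hypothetical helper name, and the approach only covers this one mismatch.

```python
from typing import Optional


def repair_utf8_as_latin1(text: str) -> Optional[str]:
    """Undo UTF-8 bytes that were decoded as Latin-1; return None if the round trip fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes are not valid UTF-8
        return None


print(repair_utf8_as_latin1("CafÃ©"))  # Café
print(repair_utf8_as_latin1("Café"))   # None (already correct, so the round trip fails)
```

Returning `None` rather than the input makes it explicit that no repair was applied, which keeps correct text from being altered.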

**SVG Garbled Text:**
- Text entity encoding errors
- Special character escaping issues
- Invalid font reference display abnormalities
- XML encoding declaration errors
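
Text entity errors in SVG often come from entities that were escaped more than once. The sketch below uses the standard library's `html.unescape`; the `unescape_repeatedly` helper and its `max_passes` parameter are hypothetical names for illustration.

```python
import html


def unescape_repeatedly(text: str, max_passes: int = 3) -> str:
    """Apply html.unescape until the text stops changing or max_passes is reached."""
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


print(unescape_repeatedly("Caf&amp;eacute; &amp;amp; Bar"))  # Café & Bar
```

Repeated unescaping can also decode entity-like text that was meant literally, so the result is best treated as a suggestion for review.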

## CLI Usage

```bash
# Fix single PDF
python scripts/main.py --input document.pdf --output fixed.pdf

# Fix single SVG
python scripts/main.py --input diagram.svg --output fixed.svg

# Batch process folder
python scripts/main.py --batch ./input_folder --output ./output_folder

# Interactive repair
python scripts/main.py --input doc.pdf --interactive

# Export editable JSON
python scripts/main.py --input doc.pdf --export-json editable.json

# Specify repair level
python scripts/main.py --input doc.pdf --output fixed.pdf --repair-level aggressive
```

## Parameters

| Parameter | Required | Description | Default |
|---|---|---|---|
| `--input` | Yes* | Input PDF or SVG file path | — |
| `--batch` | Yes* | Batch input folder path | — |
| `--output` | Batch only | Output file or folder path; required with `--batch`, optional with `--input` | — |
| `--repair-level` | No | `minimal` / `standard` / `aggressive` | `standard` |
| `--interactive` | No | Enable interactive repair mode | False |
| `--export-json` | No | Export editable JSON format | — |
| `--encoding` | No | Source file encoding (default: auto-detect) | auto |

*Exactly one of `--input` or `--batch` is required; the script treats them as mutually exclusive.

## Repair Levels

- **Minimal**: Only obvious errors (replacement characters, null bytes); maximum original integrity
- **Standard**: Common encoding issues + smart font replacement; balanced repair rate and accuracy
- **Aggressive**: Full text re-encoding + OCR-assisted recognition; for severely garbled documents

## Output Format (JSON Export)

```json
{
  "file_type": "pdf",
  "pages": [{
    "page_num": 1,
    "text_blocks": [{
      "id": "tb_001",
      "bbox": [100, 200, 300, 220],
      "original_text": "?????",
      "detected_encoding": "UTF-8",
      "confidence": 0.3,
      "suggested_fix": "Sample Text"
    }]
  }],
  "repair_summary": {
    "total_blocks": 15,
    "fixed_blocks": 12,
    "skipped_blocks": 3
  }
}
```
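
A consumer of the export might collect the suggested fixes by block id. This is a minimal sketch, assuming the export matches the layout shown above; adjust the key names if the file produced by the script is laid out differently. `collect_suggested_fixes` is a hypothetical helper and the second block in the sample is made up for illustration.

```python
import json
from typing import Dict

# Sample data in the documented layout; tb_002 is a made-up clean block
SAMPLE_EXPORT = """
{
  "file_type": "pdf",
  "pages": [{
    "page_num": 1,
    "text_blocks": [
      {"id": "tb_001", "original_text": "?????", "confidence": 0.3, "suggested_fix": "Sample Text"},
      {"id": "tb_002", "original_text": "Title", "confidence": 1.0, "suggested_fix": ""}
    ]
  }]
}
"""


def collect_suggested_fixes(export: Dict) -> Dict[str, str]:
    """Map block id to suggested_fix for every block with a non-empty suggestion."""
    fixes = {}
    for page in export.get("pages", []):
        for block in page.get("text_blocks", []):
            if block.get("suggested_fix"):
                fixes[block["id"]] = block["suggested_fix"]
    return fixes


fixes = collect_suggested_fixes(json.loads(SAMPLE_EXPORT))
print(fixes)  # {'tb_001': 'Sample Text'}
```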

## Input Validation

This skill accepts: PDF (.pdf) or SVG (.svg) file paths, or a folder path for batch processing, where the files contain garbled or unreadable text caused by font/encoding issues.

If the request does not involve PDF/SVG garbled text repair — for example, asking to convert file formats, edit PDF content directly, perform OCR on scanned images, or process non-vector files — do not proceed. Instead respond:

> "`vector-text-fixer` is designed to fix garbled text in PDF/SVG vector graphics caused by font encoding issues. Your request appears to be outside this scope. Please provide a valid PDF or SVG file path, or use a more appropriate tool."

## Error Handling

- If `--input` receives a text string instead of a file path, report the error and request a valid file path.
- If the file is encrypted, report that password unlock is required before processing.
- If the task goes outside documented scope, stop instead of guessing.
- If `scripts/main.py` fails, use the Fallback Template above.
- Do not fabricate repaired text content or execution outcomes.
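
The first rule above, rejecting a text string passed where a file path is expected, can be sketched as a pre-flight check. This is an illustration under stated assumptions: `validate_input` and `SUPPORTED_SUFFIXES` are hypothetical names and are not part of `scripts/main.py`.

```python
import tempfile
from pathlib import Path
from typing import Optional

SUPPORTED_SUFFIXES = {".pdf", ".svg"}


def validate_input(path_str: str) -> Optional[str]:
    """Return an error message, or None when the path looks usable."""
    path = Path(path_str)
    try:
        exists = path.is_file()
    except OSError:
        # Very long or otherwise invalid strings can make the OS reject the lookup
        exists = False
    if not exists:
        return f"Not a file: {path_str!r}. Provide a valid PDF or SVG file path."
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        return f"Unsupported file format: {path.suffix or '(none)'}"
    return None


# Demo with temporary files so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "document.pdf"
    pdf.write_bytes(b"%PDF-1.4\n")
    txt = Path(tmp) / "notes.txt"
    txt.write_text("hello")
    results = [
        validate_input(str(pdf)),
        validate_input(str(txt)),
        validate_input("please fix this garbled text"),
    ]
print(results)
```

The check only looks at existence and extension; it does not confirm that the file content is a well-formed PDF or SVG.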

## Output Requirements

Every final response must include:

1. **Objective** — file(s) repaired and repair level used
2. **Inputs Received** — file path, repair level, encoding settings
3. **Assumptions** — defaults applied (repair level, encoding detection)
4. **Result** — output file path, blocks fixed vs skipped
5. **Risks and Limits** — confidence thresholds, manual review blocks
6. **Next Checks** — review low-confidence blocks manually before use

## Limitations

- Encrypted PDFs require password unlock before processing
- Severely damaged vector files may not be fully repairable
- Some rare fonts may not map correctly
- Scanned PDFs require OCR recognition first

## Dependencies

```
pdfplumber >= 0.10.0
PyMuPDF >= 1.23.0
cairosvg >= 2.7.0
beautifulsoup4 >= 4.12.0
fonttools >= 4.40.0
chardet >= 5.0.0
Pillow >= 10.0.0
```

FILE:requirements.txt
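beautifulsoup4>=4.12.0
# lxml backs the 'xml' parser that scripts/main.py requests from BeautifulSoup
lxml
# PyMuPDF provides the `fitz` module imported by scripts/main.py
PyMuPDF>=1.23.0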

FILE:scripts/main.py
#!/usr/bin/env python3
"""
Vector Text Fixer - Fix garbled text in PDF/SVG vector graphics
"""

import argparse
import json
import os
import re
import sys
from pathlib import Path
from typing import Dict, List, Optional, Tuple, Any
from dataclasses import dataclass, asdict


@dataclass
class TextBlock:
    """Text block data structure"""
    id: str
    bbox: List[float]  # [x0, y0, x1, y1]
    original_text: str
    page_num: int = 1
    font_info: Optional[Dict[str, Any]] = None
    confidence: float = 1.0
    suggested_fix: str = ""
    is_garbled: bool = False


class GarbledTextDetector:
    """Garbled text detector"""
    
    # Common garbled/replacement characters
    REPLACEMENT_CHARS = {
        '\ufffd',  # replacement character
        '\u25a1',  # white square
        '\u25a0',  # black square
        '\u25af',  # white rectangle
        '\u2588',  # full block
        '\ufffe',  # non-character
        '\uffff',  # non-character
        '?',       # question mark substitute (note: a legitimate '?' in clean text is counted too)
    }
    
    # Common garbled patterns
    GARBLED_PATTERNS = [
        r'[\u0000-\u0008\u000b-\u000c\u000e-\u001f]',  # Control characters
        r'[\ufffd\u25a1\u25a0\u25af\u2588\ufffe\uffff]',  # Replacement characters
        r'[?]{2,}',  # Consecutive replacement characters
        r'(?:\\x[0-9a-fA-F]{2}){2,}',  # Escape sequences
        r'[\x80-\x9f]',  # C1 control characters
    ]
    
    def __init__(self):
        self.compiled_patterns = [re.compile(p) for p in self.GARBLED_PATTERNS]
    
    def is_garbled(self, text: str) -> Tuple[bool, float]:
        """
        Detect whether text is garbled
        Returns: (is_garbled, confidence), where confidence runs from 0 to 1
        and drops as more garbling evidence accumulates
        """
        if not text or not isinstance(text, str):
            return False, 1.0
        
        text_len = len(text)
        if text_len == 0:
            return False, 1.0
        
        garbled_score = 0.0
        
        # 1. Check replacement characters
        replacement_count = sum(1 for c in text if c in self.REPLACEMENT_CHARS)
        garbled_score += (replacement_count / text_len) * 0.5
        
        # 2. Check garbled patterns
        for pattern in self.compiled_patterns:
            matches = pattern.findall(text)
            if matches:
                garbled_score += len(matches) / text_len * 0.3
        
        # 3. Check abnormal character distribution
        if self._has_abnormal_distribution(text):
            garbled_score += 0.2
        
        # 4. Check for mixed encoding signs
        if self._has_encoding_mixed(text):
            garbled_score += 0.15
        
        is_garbled = garbled_score > 0.15 or replacement_count > 0
        confidence = max(0.0, 1.0 - min(garbled_score, 1.0))
        
        return is_garbled, confidence
    
    def _has_abnormal_distribution(self, text: str) -> bool:
        """Check whether character distribution is abnormal"""
        if len(text) < 3:
            return False
        
        # Count ratio of non-printable characters
        unprintable = sum(1 for c in text if ord(c) < 32 and c not in '\t\n\r')
        ratio = unprintable / len(text)
        return ratio > 0.3
    
    def _has_encoding_mixed(self, text: str) -> bool:
        """Detect whether mixed encoding signs are present"""
        # Detect signs of UTF-8 multi-byte characters being incorrectly parsed
        # e.g.: Ã© should be é (UTF-8 bytes parsed as Latin-1)
        mixed_patterns = [
            r'Ã[\xa0-\xbf]',  # UTF-8 misread as Latin-1
            r'Â[\x80-\xbf]',
            r'Ã¢',
            r'Ã£',
        ]
        for pattern in mixed_patterns:
            if re.search(pattern, text):
                return True
        return False


class PDFFixer:
    """PDF text fixer"""
    
    def __init__(self, detector: GarbledTextDetector):
        self.detector = detector
        self.text_blocks: List[TextBlock] = []
    
    def fix(self, input_path: str, output_path: str, 
            repair_level: str = "standard") -> Dict[str, Any]:
        """
        Fix garbled text in a PDF file
        """
        try:
            import fitz  # PyMuPDF
        except ImportError:
            return {
                "success": False,
                "error": "PyMuPDF (fitz) is required. Install: pip install PyMuPDF"
            }
        
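        try:
            doc = fitz.open(input_path)
        except Exception as e:
            return {"success": False, "error": f"Cannot open PDF: {str(e)}"}
        
        # Encrypted PDFs must be unlocked before processing (see Limitations)
        if doc.needs_pass:
            doc.close()
            return {"success": False, "error": "PDF is encrypted; password unlock is required before processing"}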
        
        self.text_blocks = []
        repair_count = 0
        
        for page_num in range(len(doc)):
            page = doc[page_num]
            blocks = self._extract_text_blocks(page, page_num + 1)
            
            for block in blocks:
                is_garbled, confidence = self.detector.is_garbled(block.original_text)
                
                if is_garbled:
                    block.is_garbled = True
                    block.confidence = confidence
                    block.suggested_fix = self._suggest_fix(
                        block.original_text, 
                        repair_level
                    )
                    repair_count += 1
                
                self.text_blocks.append(block)
        
        # Generate repair report
        result = {
            "success": True,
            "file_type": "pdf",
            "pages": len(doc),
            "total_blocks": len(self.text_blocks),
            "garbled_blocks": repair_count,
            "text_blocks": [asdict(b) for b in self.text_blocks],
            "output_path": output_path
        }
        
        doc.close()
        return result
    
    def _extract_text_blocks(self, page, page_num: int) -> List[TextBlock]:
        """Extract text blocks from a PDF page"""
        blocks = []
        
        try:
            import fitz
            
            # Get text blocks on the page
            text_dict = page.get_text("dict")
            
            block_id = 0
            for block in text_dict.get("blocks", []):
                if "lines" in block:  # Text block
                    for line in block["lines"]:
                        for span in line.get("spans", []):
                            text = span.get("text", "")
                            if text.strip():
                                bbox = span.get("bbox", [0, 0, 0, 0])
                                font_info = {
                                    "font": span.get("font", "Unknown"),
                                    "size": span.get("size", 0),
                                    "flags": span.get("flags", 0),
                                    "color": span.get("color", 0)
                                }
                                
                                tb = TextBlock(
                                    id=f"p{page_num}_b{block_id}",
                                    bbox=list(bbox),
                                    original_text=text,
                                    page_num=page_num,
                                    font_info=font_info
                                )
                                blocks.append(tb)
                                block_id += 1
        except Exception as e:
            print(f"Warning: Error extracting text: {e}")
        
        return blocks
    
    def _suggest_fix(self, garbled_text: str, repair_level: str) -> str:
        """Suggest repair text based on garbled content"""
        # More complex repair logic can be implemented here
        # Currently returns a placeholder prompting the user to input manually
        
        if repair_level == "minimal":
            # Minimal repair: only remove replacement characters
            return garbled_text.replace('\ufffd', '').strip()
        
        elif repair_level == "aggressive":
            # Deep repair: attempt to decode common encoding errors
            return self._try_decode_fixes(garbled_text)
        
        else:  # standard
            # Standard repair: mark for user confirmation
            if all(c in GarbledTextDetector.REPLACEMENT_CHARS for c in garbled_text):
                return f"[Manual input required - original: {len(garbled_text)} garbled characters]"
            else:
                return garbled_text.replace('\ufffd', '[?]').strip()
    
    def _try_decode_fixes(self, text: str) -> str:
        """Attempt multiple encoding repairs"""
        # Common encoding error pattern fixes
        fixes = []
        
        # UTF-8 parsed as Latin-1
        try:
            fixed = text.encode('latin-1').decode('utf-8')
            fixes.append(fixed)
        except (UnicodeEncodeError, UnicodeDecodeError):
            pass
        
        # GBK/GB2312 issue (decode errors are ignored, so only encoding can fail)
        try:
            fixed = text.encode('latin-1').decode('gbk', errors='ignore')
            fixes.append(fixed)
        except UnicodeEncodeError:
            pass
        
        # Return the first reasonable fix; skip empty results, which the
        # detector would otherwise report as "not garbled"
        for fix in fixes:
            if fix and not self.detector.is_garbled(fix)[0]:
                return fix
        
        return "[Manual input required]"


class SVGFixer:
    """SVG text fixer"""
    
    def __init__(self, detector: GarbledTextDetector):
        self.detector = detector
        self.text_elements: List[TextBlock] = []
    
    def fix(self, input_path: str, output_path: str,
            repair_level: str = "standard") -> Dict[str, Any]:
        """
        Fix garbled text in an SVG file
        """
        try:
            from bs4 import BeautifulSoup
        except ImportError:
            return {
                "success": False,
                "error": "BeautifulSoup4 is required. Install: pip install beautifulsoup4"
            }
        
        try:
            with open(input_path, 'r', encoding='utf-8', errors='ignore') as f:
                content = f.read()
        except Exception as e:
            return {"success": False, "error": f"Cannot read SVG: {str(e)}"}
        
        soup = BeautifulSoup(content, 'xml')
        
        # Extract all text elements
        self.text_elements = []
        repair_count = 0
        
        text_tags = soup.find_all(['text', 'tspan', 'textPath'])
        
        for idx, tag in enumerate(text_tags):
            text_content = tag.get_text()
            if not text_content.strip():
                continue
            
            is_garbled, confidence = self.detector.is_garbled(text_content)
            
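            # Get position and font information
            x = tag.get('x', '0')
            y = tag.get('y', '0')
            font_family = tag.get('font-family', 'default')
            font_size = tag.get('font-size', '12')
            
            # SVG coordinates may carry units or value lists (e.g. "10px", "1 2 3");
            # fall back to 0 instead of crashing on values float() cannot parse
            try:
                x_pos, y_pos = float(x), float(y)
            except (TypeError, ValueError):
                x_pos, y_pos = 0.0, 0.0
            
            tb = TextBlock(
                id=f"text_{idx}",
                bbox=[x_pos, y_pos, 0, 0],
                original_text=text_content,
                font_info={
                    "font_family": font_family,
                    "font_size": font_size
                },
                page_num=1
            )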
            
            if is_garbled:
                tb.is_garbled = True
                tb.confidence = confidence
                tb.suggested_fix = self._suggest_fix(text_content, repair_level)
                repair_count += 1
            
            self.text_elements.append(tb)
        
        # Get SVG basic information
        svg_tag = soup.find('svg')
        svg_info = {
            "width": svg_tag.get('width', 'unknown') if svg_tag else 'unknown',
            "height": svg_tag.get('height', 'unknown') if svg_tag else 'unknown',
            "viewBox": svg_tag.get('viewBox', '') if svg_tag else ''
        }
        
        result = {
            "success": True,
            "file_type": "svg",
            "svg_info": svg_info,
            "total_elements": len(self.text_elements),
            "garbled_elements": repair_count,
            "text_elements": [asdict(t) for t in self.text_elements],
            "output_path": output_path
        }
        
        return result
    
    def _suggest_fix(self, garbled_text: str, repair_level: str) -> str:
        """Suggest SVG text repair"""
        if repair_level == "minimal":
            return garbled_text.replace('\ufffd', '').strip()
        elif repair_level == "aggressive":
            return self._try_xml_entity_fix(garbled_text)
        else:
            if '\ufffd' in garbled_text:
                return f"[Manual input required - original: {len(garbled_text)} garbled characters]"
            return garbled_text
    
    def _try_xml_entity_fix(self, text: str) -> str:
        """Attempt to fix XML entity encoding issues"""
        import html
        # Decode HTML entities
        decoded = html.unescape(text)
        if not self.detector.is_garbled(decoded)[0]:
            return decoded
        return "[Manual input required]"


class VectorTextFixer:
    """Vector text fixer main class"""
    
    def __init__(self):
        self.detector = GarbledTextDetector()
        self.pdf_fixer = PDFFixer(self.detector)
        self.svg_fixer = SVGFixer(self.detector)
    
    def fix_file(self, input_path: str, output_path: str,
                 repair_level: str = "standard") -> Dict[str, Any]:
        """
        Automatically select repair method based on file type
        """
        input_path = Path(input_path)
        
        if not input_path.exists():
            return {"success": False, "error": f"File not found: {input_path}"}
        
        suffix = input_path.suffix.lower()
        
        if suffix == '.pdf':
            return self.pdf_fixer.fix(str(input_path), output_path, repair_level)
        elif suffix == '.svg':
            return self.svg_fixer.fix(str(input_path), output_path, repair_level)
        else:
            return {"success": False, "error": f"Unsupported file format: {suffix}"}
    
    def batch_fix(self, input_folder: str, output_folder: str,
                  repair_level: str = "standard") -> List[Dict[str, Any]]:
        """
        Batch repair PDF/SVG files in a folder
        """
        input_folder = Path(input_folder)
        output_folder = Path(output_folder)
        output_folder.mkdir(parents=True, exist_ok=True)
        
        results = []
        
        for file_path in input_folder.iterdir():
            if file_path.suffix.lower() in ['.pdf', '.svg']:
                output_path = output_folder / f"fixed_{file_path.name}"
                result = self.fix_file(str(file_path), str(output_path), repair_level)
                results.append(result)
        
        return results
    
    def export_editable_json(self, input_path: str, output_path: str) -> Dict[str, Any]:
        """
        Export editable JSON format for manual repair in AI tools
        """
        result = self.fix_file(input_path, "", repair_level="standard")
        
        if not result.get("success"):
            return result
        
        # Add editable markers
        editable_data = {
            "file_info": {
                "original_path": input_path,
                "exported_at": self._get_timestamp(),
                "tool": "Vector Text Fixer v1.0.0"
            },
            "repair_data": result
        }
        
        # Add user-editable fields
        if result["file_type"] == "pdf":
            for block in editable_data["repair_data"]["text_blocks"]:
                block["user_editable"] = block.get("suggested_fix", "")
        elif result["file_type"] == "svg":
            for elem in editable_data["repair_data"]["text_elements"]:
                elem["user_editable"] = elem.get("suggested_fix", "")
        
        # Save JSON
        try:
            with open(output_path, 'w', encoding='utf-8') as f:
                json.dump(editable_data, f, ensure_ascii=False, indent=2)
            editable_data["export_success"] = True
        except Exception as e:
            editable_data["export_success"] = False
            editable_data["export_error"] = str(e)
        
        return editable_data
    
    def _get_timestamp(self) -> str:
        """Get current timestamp"""
        from datetime import datetime
        return datetime.now().isoformat()


def main():
    """Command-line entry point"""
    parser = argparse.ArgumentParser(
        description='Vector Text Fixer - Fix garbled text in PDF/SVG vector graphics'
    )
    
    # Input options
    input_group = parser.add_mutually_exclusive_group(required=True)
    input_group.add_argument('--input', '-i', help='Input file path (PDF or SVG)')
    input_group.add_argument('--batch', '-b', help='Batch process input folder')
    
    # Output options
    parser.add_argument('--output', '-o', help='Output file/folder path')
    parser.add_argument('--export-json', '-j', help='Export editable JSON format')
    
    # Repair options
    parser.add_argument('--repair-level', '-r', 
                       choices=['minimal', 'standard', 'aggressive'],
                       default='standard',
                       help='Repair level (default: standard)')
    parser.add_argument('--interactive', action='store_true',
                       help='Enable interactive repair mode')
    
    args = parser.parse_args()
    
    # Create fixer instance
    fixer = VectorTextFixer()
    
    # Export JSON mode
    if args.export_json:
        if not args.input:
            print("Error: --export-json requires --input to be specified")
            sys.exit(1)
        
        result = fixer.export_editable_json(args.input, args.export_json)
        
        if result.get("export_success"):
            print(f"✓ JSON export successful: {args.export_json}")
            data = result['repair_data']
            print(f"  File type: {data.get('file_type')}")
            # Fall back to the SVG keys only when the PDF keys are absent, so a count of 0 still prints
            print(f"  Detected text blocks: {data.get('total_blocks', data.get('total_elements'))}")
            print(f"  Garbled text blocks: {data.get('garbled_blocks', data.get('garbled_elements'))}")
        else:
            print(f"✗ Export failed: {result.get('export_error', 'Unknown error')}")
            sys.exit(1)
        return
    
    # Batch processing mode
    if args.batch:
        if not args.output:
            print("Error: Batch processing requires --output to be specified")
            sys.exit(1)
        
        print(f"Starting batch processing: {args.batch}")
        results = fixer.batch_fix(args.batch, args.output, args.repair_level)
        
        success_count = sum(1 for r in results if r.get("success"))
        total_count = len(results)
        
        print(f"\nProcessing complete: {success_count}/{total_count} files succeeded")
        for r in results:
            if r.get("success"):
                garbled = r.get("garbled_blocks") or r.get("garbled_elements", 0)
                print(f"  ✓ {r.get('output_path')} (garbled: {garbled})")
            else:
                print(f"  ✗ {r.get('error', 'Unknown error')}")
        return
    
    # Single file processing mode
    if args.input:
        print(f"Processing file: {args.input}")
        
        result = fixer.fix_file(args.input, args.output or "", args.repair_level)
        
        if result.get("success"):
            print(f"✓ Analysis complete")
            print(f"  File type: {result.get('file_type')}")
            
            if result.get('file_type') == 'pdf':
                print(f"  Pages: {result.get('pages')}")
                print(f"  Text blocks: {result.get('total_blocks')}")
                print(f"  Garbled blocks: {result.get('garbled_blocks')}")
            else:
                print(f"  Text elements: {result.get('total_elements')}")
                print(f"  Garbled elements: {result.get('garbled_elements')}")
            
            # Show garbled text details
            blocks = result.get('text_blocks') or result.get('text_elements', [])
            garbled_blocks = [b for b in blocks if b.get('is_garbled')]
            
            if garbled_blocks:
                print(f"\nDetected garbled text:")
                for i, block in enumerate(garbled_blocks[:5], 1):
                    orig = block.get('original_text', '')[:50]
                    sugg = block.get('suggested_fix', '')[:50]
                    print(f"  {i}. ID: {block.get('id')}")
                    print(f"     Original: {orig}")
                    print(f"     Suggested: {sugg}")
                    print(f"     Confidence: {block.get('confidence', 0):.2f}")
                
                if len(garbled_blocks) > 5:
                    print(f"  ... and {len(garbled_blocks) - 5} more garbled text items")
            
            if args.output:
                print(f"\nOutput path: {args.output}")
                print("Tip: Use --export-json to export editable format for manual repair")
        else:
            print(f"✗ Processing failed: {result.get('error', 'Unknown error')}")
            sys.exit(1)


if __name__ == "__main__":
    main()


Variant Pathogenicity Predictor


---
name: variant-pathogenicity-predictor
description: Integrate REVEL, CADD, PolyPhen scores to predict variant pathogenicity.
license: MIT
skill-author: AIPOCH
---
# Variant Pathogenicity Predictor

Integrate REVEL, CADD, PolyPhen and other scores to predict variant pathogenicity.

## When to Use

- Use this skill when the task is to integrate REVEL, CADD, and PolyPhen scores to predict variant pathogenicity.
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

- Scope-focused workflow for integrating REVEL, CADD, and PolyPhen scores to predict variant pathogenicity.
- Packaged executable path(s): `scripts/main.py`.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

See `## Prerequisites` below for related details.

- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `Third-party packages`: `not explicitly version-pinned in this skill package`. Add pinned versions if this skill needs stricter environment control.

## Example Usage

See `## Usage` below for related details.

```bash
cd "20260318/scientific-skills/Data Analytics/variant-pathogenicity-predictor"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Usage

```bash
python scripts/main.py --variant "chr17:43094692:G:A" --gene "BRCA1"
python scripts/main.py --vcf variants.vcf --output report.json
```
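
For the `--vcf` path, the data lines of a VCF file can be turned into the same `chr:pos:ref:alt` strings that `--variant` accepts. The following is a minimal sketch that relies only on the standard VCF column order (CHROM, POS, ID, REF, ALT); `iter_vcf_variants` is a hypothetical helper, and how the packaged script reads VCF input is not shown in this document.

```python
import io
from typing import Iterator, TextIO


def iter_vcf_variants(handle: TextIO) -> Iterator[str]:
    """Yield chr:pos:ref:alt strings, one per ALT allele, from VCF text."""
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines (##) and the header (#CHROM)
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a usable data line
        chrom, pos, _id, ref, alt = fields[:5]
        for allele in alt.split(","):  # multi-allelic sites list ALT alleles comma-separated
            yield f"{chrom}:{pos}:{ref}:{allele}"


sample = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr17\t43094692\t.\tG\tA,T\t.\t.\t.\n"
)
variants = list(iter_vcf_variants(sample))
print(variants)  # ['chr17:43094692:G:A', 'chr17:43094692:G:T']
```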

## Parameters

- `--variant`: Variant in format chr:pos:ref:alt
- `--vcf`: VCF file with variants
- `--output`: Output report path (for example `report.json`)
- `--gene`: Gene symbol
- `--scores`: Prediction scores to use (REVEL,CADD,PolyPhen)

## Integrated Scores

- REVEL (Rare Exome Variant Ensemble Learner)
- CADD (Combined Annotation Dependent Depletion)
- PolyPhen-2 (Polymorphism Phenotyping)
- SIFT (Sorting Intolerant From Tolerant)
- MutationTaster

## Output

- Pathogenicity classification
- ACMG guideline interpretation
- Individual score breakdown
- Confidence assessment
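
As a worked example of how the packaged `scripts/main.py` combines scores, the sketch below mirrors its weighting (REVEL 0.4, CADD capped at 30 and scaled to 0-1 with weight 0.3, PolyPhen 0.3) and its classification cut-offs, using the script's `--demo` values. The helper names `composite_score` and `classify` are illustrative assumptions; the script keeps this logic inside `VariantPathogenicityPredictor.predict`.

```python
def composite_score(revel: float, cadd: float, polyphen: float) -> float:
    """Weighted blend mirroring scripts/main.py (helper name is illustrative)."""
    cadd_norm = min(cadd / 30, 1.0)  # CADD is scaled by 30 and capped at 1.0
    return revel * 0.4 + cadd_norm * 0.3 + polyphen * 0.3


def classify(composite: float) -> str:
    """Map a composite score to the five tiers used by the script."""
    if composite >= 0.9:
        return "Pathogenic"
    if composite >= 0.7:
        return "Likely Pathogenic"
    if composite >= 0.3:
        return "Uncertain Significance"
    if composite >= 0.1:
        return "Likely Benign"
    return "Benign"


# Demo values from the script: REVEL 0.85, CADD 25, PolyPhen 0.92
score = composite_score(0.85, 25, 0.92)
print(f"{score:.3f}", classify(score))  # 0.866 Likely Pathogenic
```

The arithmetic is 0.85 × 0.4 + (25 / 30) × 0.3 + 0.92 × 0.3 = 0.34 + 0.25 + 0.276 = 0.866, which falls in the 0.7 to 0.9 band.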

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
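
The two path-related items in the checklist (no `../` traversal, output restricted to the workspace) can be checked with a small guard like the one sketched below. This is a minimal illustration, not part of the packaged script; the function name `resolve_inside` and the idea of passing a workspace root are assumptions.

```python
from pathlib import Path


def resolve_inside(workspace: Path, user_path: str) -> Path:
    """Return the resolved path, or raise ValueError if it escapes the workspace."""
    root = workspace.resolve()
    # Joining with an absolute user_path discards root, which the check below catches
    candidate = (root / user_path).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"path escapes workspace: {user_path}") from None
    return candidate
```

Resolving before comparing matters: a plain string check for `..` would miss symlinks and absolute paths, whereas `relative_to` on resolved paths rejects both.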

## Prerequisites

No additional Python packages required.

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
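
One way to "report the failure point" without echoing a full stack trace is to wrap the script call and summarize the outcome. The sketch below is an assumption about how a caller might do this; `run_script` and the result keys (`ok`, `stage`, `detail`) are hypothetical names, not something the packaged script provides.

```python
import subprocess
import sys


def run_script(args: list[str], timeout: float = 60.0) -> dict:
    """Run a command and summarize the outcome without raising."""
    try:
        proc = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError as exc:
        return {"ok": False, "stage": "launch", "detail": str(exc)}
    except subprocess.TimeoutExpired:
        return {"ok": False, "stage": "timeout", "detail": f"no result after {timeout}s"}
    if proc.returncode != 0:
        # Keep only the last stderr line so stack traces are not echoed verbatim
        lines = proc.stderr.strip().splitlines()
        return {"ok": False, "stage": "execution", "detail": lines[-1] if lines else ""}
    return {"ok": True, "stage": "done", "detail": proc.stdout}
```

For this skill the call would look like `run_script([sys.executable, "scripts/main.py", "--demo"])`; a result with `ok` set to `False` is the cue to switch to the manual fallback.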

## Input Validation

This skill accepts requests that match the documented purpose of `variant-pathogenicity-predictor` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `variant-pathogenicity-predictor` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.

FILE:scripts/main.py
#!/usr/bin/env python3
"""
Variant Pathogenicity Predictor
Integrate REVEL, CADD, PolyPhen scores for variant classification.
"""

import argparse
import json


class VariantPathogenicityPredictor:
    """Predict variant pathogenicity."""
    
    def predict(self, variant_data):
        """Predict pathogenicity from multiple scores."""
        
        scores = {
            "REVEL": variant_data.get("REVEL", 0.5),
            "CADD": variant_data.get("CADD", 15),
            "PolyPhen": variant_data.get("PolyPhen", 0.5)
        }
        
        # Normalize CADD (higher = more damaging)
        cadd_norm = min(scores["CADD"] / 30, 1.0)
        
        # Composite score
        composite = (scores["REVEL"] * 0.4 + 
                    cadd_norm * 0.3 + 
                    scores["PolyPhen"] * 0.3)
        
        # Classification
        if composite >= 0.9:
            classification = "Pathogenic"
        elif composite >= 0.7:
            classification = "Likely Pathogenic"
        elif composite >= 0.3:
            classification = "Uncertain Significance"
        elif composite >= 0.1:
            classification = "Likely Benign"
        else:
            classification = "Benign"
        
        return {
            "composite_score": composite,
            "classification": classification,
            "individual_scores": scores
        }


def main():
    parser = argparse.ArgumentParser(description="Variant Pathogenicity Predictor")
    parser.add_argument("--revel", type=float, help="REVEL score")
    parser.add_argument("--cadd", type=float, help="CADD score")
    parser.add_argument("--polyphen", type=float, help="PolyPhen score")
    parser.add_argument("--demo", action="store_true", help="Run demo")
    
    args = parser.parse_args()
    
    predictor = VariantPathogenicityPredictor()
    
    if args.demo:
        variant = {"REVEL": 0.85, "CADD": 25, "PolyPhen": 0.92}
    else:
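        # Compare against None so that a legitimate score of 0 is not replaced by the default
        variant = {
            "REVEL": args.revel if args.revel is not None else 0.5,
            "CADD": args.cadd if args.cadd is not None else 15,
            "PolyPhen": args.polyphen if args.polyphen is not None else 0.5
        }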
    
    result = predictor.predict(variant)
    
    print(f"\n{'='*60}")
    print("VARIANT PATHOGENICITY PREDICTION")
    print(f"{'='*60}\n")
    
    print(f"Composite Score: {result['composite_score']:.3f}")
    print(f"Classification: {result['classification']}")
    print("\nIndividual Scores:")
    for tool, score in result['individual_scores'].items():
        print(f"  {tool}: {score}")
    
    print(f"\n{'='*60}\n")


if __name__ == "__main__":
    main()


Target Novelty Scorer

Skill

Score the novelty of biological targets through literature mining and.

---
name: target-novelty-scorer
description: Score the novelty of biological targets through literature mining and.
license: MIT
skill-author: AIPOCH
---
# Target Novelty Scorer

ID: 177

## When to Use

- Use this skill when the task is to score the novelty of biological targets through literature mining.
- Use this skill for evidence insight tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

See `## Features` below for related details.

- Scope-focused workflow for scoring the novelty of biological targets through literature mining.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

- Python 3.9+
- requests
- pandas
- biopython (Entrez API)
- numpy

## Example Usage

See `## Usage` below for related details.

```bash
cd "20260318/scientific-skills/Evidence Insight/target-novelty-scorer"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Description

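Scores the novelty of biological targets through literature mining. The skill analyzes publications in academic databases such as PubMed and PubMed Central to assess how heavily a target has been studied, how distinct it is from well-known targets, and how much room for innovation remains. Note that the packaged `scripts/main.py` currently generates simulated search results in place of live API calls.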

## Features

- 🔬 **Literature Retrieval**: Automatically retrieve literature related to targets from PubMed and other databases
- 📊 **Novelty Scoring**: Calculate target novelty score based on multi-dimensional indicators (0-100)
- 📈 **Trend Analysis**: Analyze temporal trends in target research
- 🧬 **Cross-validation**: Verify current research status of targets by combining multiple databases
- 📝 **Report Generation**: Generate detailed novelty analysis reports

## Scoring Criteria

1. **Research Heat (0-25 points)**: Number of related publications and citations in recent years
2. **Uniqueness (0-25 points)**: Distinction from known popular targets
3. **Research Depth (0-20 points)**: Progress of preclinical/clinical research
4. **Collaboration Network (0-15 points)**: Diversity of research institutions/teams
5. **Temporal Trend (0-15 points)**: Research growth trends in recent years
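
The five point ranges above add up to 100. The sketch below shows one way to combine them, assuming the overall score is the plain sum of the dimension points, which is what the ranges and the JSON example in this document suggest. `MAX_POINTS` and `total_novelty` are illustrative names, and the sample values are taken from that JSON example.

```python
# Maximum points per dimension, as listed in the scoring criteria
MAX_POINTS = {
    "research_heat": 25,
    "uniqueness": 25,
    "research_depth": 20,
    "collaboration": 15,
    "trend": 15,
}


def total_novelty(points: dict[str, float]) -> float:
    """Clamp each dimension to its documented range and sum to a 0-100 score."""
    total = 0.0
    for name, cap in MAX_POINTS.items():
        total += min(cap, max(0.0, points.get(name, 0.0)))
    return round(total, 1)


example = {"research_heat": 18.5, "uniqueness": 20.0, "research_depth": 15.2,
           "collaboration": 12.0, "trend": 6.8}
print(total_novelty(example))  # 72.5
```

Clamping before summing keeps a single out-of-range dimension from pushing the total past 100.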

## Usage

### Basic Usage

```text
cd /Users/z04030865/.openclaw/workspace/skills/target-novelty-scorer
python scripts/main.py --target "PD-L1"
```

### Advanced Options

```text
python scripts/main.py \
  --target "BRCA1" \
  --db pubmed \
  --years 10 \
  --output report.json \
  --format json
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--target` | string | required | Target molecule name or gene symbol |
| `--db` | string | pubmed | Data source (pubmed, pmc, all) |
| `--years` | int | 5 | Analysis year range |
| `--output` | string | stdout | Output file path |
| `--format` | string | text | Output format (text, json, csv) |
| `--verbose` | flag | false | Verbose output |

## Output Format

### JSON Output

```json
{
  "target": "PD-L1",
  "novelty_score": 72.5,
  "confidence": 0.85,
  "breakdown": {
    "research_heat": 18.5,
    "uniqueness": 20.0,
    "research_depth": 15.2,
    "collaboration": 12.0,
    "trend": 6.8
  },
  "metadata": {
    "total_papers": 15234,
    "recent_papers": 3421,
    "clinical_trials": 89,
    "analysis_date": "2026-02-06"
  },
  "interpretation": "This target has moderate novelty, with moderate research heat in recent years..."
}
```

## API Requirements

- NCBI API Key (for PubMed retrieval)
- Optional: Europe PMC API
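
Because the packaged searcher is simulated, the following is only a sketch of what a live PubMed count query could look like with the NCBI E-utilities ESearch endpoint. The function names are assumptions, the API key is passed in by the caller rather than hardcoded, and `fetch_count` needs network access, so it is defined here but not called.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, start_year: int, end_year: int,
                      api_key: Optional[str] = None) -> str:
    """Build an ESearch URL that asks PubMed for a hit count in a date range."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": "0",               # only the count is needed, not the ID list
        "datetype": "pdat",          # filter on publication date
        "mindate": str(start_year),
        "maxdate": str(end_year),
    }
    if api_key:
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urllib.parse.urlencode(params)}"


def fetch_count(url: str, timeout: float = 10.0) -> int:
    """Fetch the URL and read esearchresult.count (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return int(payload["esearchresult"]["count"])
```

Keeping URL construction separate from the network call makes the query easy to inspect and test offline.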

## Installation

```text
pip install -r requirements.txt
```

## License

MIT License - Part of OpenClaw Bioinformatics Skills Collection

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python scripts with tools | High |
| Network Access | External API calls | High |
| File System Access | Read/write data | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Data handled securely | Medium |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] API requests use HTTPS only
- [ ] Input validated against allowed patterns
- [ ] API timeout and retry mechanisms implemented
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no internal paths exposed)
- [ ] Dependencies audited
- [ ] No exposure of internal service architecture
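
Two of the checklist items, input validation against allowed patterns and retry handling for API calls, are sketched below. The allow-list pattern, the 64-character limit, and the helper names are hypothetical choices for illustration; the packaged script does not currently implement them.

```python
import re
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical allow-list: letters, digits, spaces, underscores, hyphens; at most 64 characters
TARGET_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9 _-]{0,63}")


def validate_target(name: str) -> str:
    """Return the trimmed target name, or raise ValueError if it is not allowed."""
    name = name.strip()
    if not TARGET_PATTERN.fullmatch(name):
        raise ValueError(f"unsupported target name: {name!r}")
    return name


def with_retries(call: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Invoke `call`, retrying on OSError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")
```

Catching `OSError` covers the network failures raised by `urllib` (including timeouts) while leaving programming errors such as `KeyError` unretried.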

## Prerequisites

```text
# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.

## Input Validation

This skill accepts requests that match the documented purpose of `target-novelty-scorer` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `target-novelty-scorer` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## References

- [references/audit-reference.md](references/audit-reference.md) - Supported scope, audit commands, and fallback boundaries

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.

FILE:references/audit-reference.md
# Audit Reference

## Scope

- Skill: `target-novelty-scorer`
- Core purpose: Score the novelty of biological targets through literature mining.
- Use only within the documented workflow and category boundary defined in `SKILL.md`

## Supported Audit Paths

- `python -m py_compile scripts/main.py`
- `python scripts/main.py --help`

## Fallback Boundary

If required inputs are incomplete, the skill should still return:

- the missing required inputs
- the steps that can still be completed safely
- assumptions that need confirmation before execution
- the next checks before accepting the final deliverable
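
The four items in the fallback boundary map naturally onto a small record that is rendered the same way every time. The sketch below is an assumption about how that could be structured; `FallbackReport` and its field names are illustrative and not defined anywhere in the skill.

```python
from dataclasses import dataclass, field


@dataclass
class FallbackReport:
    """Illustrative container for the four items the fallback should return."""
    missing_inputs: list[str] = field(default_factory=list)
    safe_steps: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    next_checks: list[str] = field(default_factory=list)

    def render(self) -> str:
        sections = [
            ("Missing required inputs", self.missing_inputs),
            ("Steps that can still be completed safely", self.safe_steps),
            ("Assumptions needing confirmation", self.assumptions),
            ("Next checks", self.next_checks),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            # An empty section is shown explicitly rather than silently dropped
            lines.extend(f"  - {item}" for item in (items or ["none"]))
        return "\n".join(lines)


report = FallbackReport(missing_inputs=["--target"],
                        safe_steps=["python scripts/main.py --help"])
print(report.render())
```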

FILE:requirements.txt
numpy

FILE:scripts/main.py
#!/usr/bin/env python3
"""
Target Novelty Scorer
Target novelty scoring tool based on literature mining

Usage:
    python main.py --target "PD-L1"
    python main.py --target "BRCA1" --years 10 --format json
"""

import argparse
import json
import sys
import time
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Dict, List, Optional

import numpy as np


@dataclass
class NoveltyScore:
    """Novelty score data structure"""
    target: str
    novelty_score: float
    confidence: float
    breakdown: Dict[str, float]
    metadata: Dict
    interpretation: str


class PubMedSearcher:
    """PubMed literature searcher (simulated implementation)"""
    
    def __init__(self, api_key: Optional[str] = None):
        self.api_key = api_key
        self.base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
    
    def search(self, query: str, years: int = 5) -> Dict:
        """
        Search PubMed literature
        
        Actual implementation should call NCBI E-utilities API
        Here using simulated data for demonstration
        """
        # Simulated search results
        current_year = datetime.now().year
        
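        # Generate simulated data based on target name (actual implementation should call real API).
        # Built-in hash() is salted per process for strings, so use CRC32 for a reproducible seed.
        import zlib  # local import keeps this change self-contained
        np.random.seed(zlib.crc32(query.encode("utf-8")))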
        
        total_papers = np.random.randint(100, 50000)
        recent_papers = int(total_papers * np.random.uniform(0.1, 0.5))
        clinical_trials = np.random.randint(0, min(200, total_papers // 100))
        
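        # Generate year distribution; keep at least 1 paper per year so that
        # np.random.randint never receives an empty range (or a zero divisor)
        per_year = max(1, recent_papers // max(1, years))
        year_distribution = {
            str(year): np.random.randint(per_year // 2, per_year * 2)
            for year in range(current_year - years, current_year + 1)
        }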
        
        return {
            "query": query,
            "total_count": total_papers,
            "recent_count": recent_papers,
            "clinical_trials": clinical_trials,
            "year_distribution": year_distribution,
            "search_year": current_year,
            "years_analyzed": years
        }


class NoveltyScorer:
    """Target novelty scorer"""
    
    def __init__(self):
        self.searcher = PubMedSearcher()
        
        # Scoring weight configuration
        self.weights = {
            "research_heat": 0.25,      # Research heat 25%
            "uniqueness": 0.25,         # Uniqueness 25%
            "research_depth": 0.20,     # Research depth 20%
            "collaboration": 0.15,      # Collaboration network 15%
            "trend": 0.15               # Time trend 15%
        }
    
    def _score_research_heat(self, data: Dict) -> float:
        """
        Score: Research heat (0-25 points)
        Based on recent literature count and citations
        """
        total = data.get("total_count", 0)
        recent = data.get("recent_count", 0)
        
        # Literature quantity score (0-15 points)
        if total < 500:
            quantity_score = 12 + (total / 500) * 3  # Scarce: high score
        elif total < 5000:
            quantity_score = 8 + (5000 - total) / 4500 * 4
        elif total < 20000:
            quantity_score = 4 + (20000 - total) / 15000 * 4
        else:
            quantity_score = max(0, 4 - (total - 20000) / 50000)
        
        # Recent activity score (0-10 points)
        recent_ratio = recent / total if total > 0 else 0
        recency_score = min(10, recent_ratio * 20)
        
        return min(25, quantity_score + recency_score)
    
    def _score_uniqueness(self, data: Dict, target: str) -> float:
        """
        Score: Uniqueness (0-25 points)
        Distinctiveness from known hot targets
        """
        total = data.get("total_count", 0)
        
        # Hot target list (examples)
        hot_targets = ["p53", "KRAS", "EGFR", "HER2", "PD-L1", "PD-1", "VEGF"]
        
        # A target that is itself on the hot list gets a low uniqueness score
        if target.upper() in [t.upper() for t in hot_targets]:
            base_score = 5
        else:
            # Uniqueness based on literature count
            if total < 1000:
                base_score = 20 + min(5, (1000 - total) / 200)
            elif total < 5000:
                base_score = 15 + (5000 - total) / 4000 * 5
            elif total < 15000:
                base_score = 8 + (15000 - total) / 10000 * 7
            else:
                base_score = max(0, 8 - (total - 15000) / 20000 * 8)
        
        return min(25, base_score)
    
    def _score_research_depth(self, data: Dict) -> float:
        """
        Score: Research depth (0-20 points)
        Progress level of preclinical/clinical research
        """
        clinical_trials = data.get("clinical_trials", 0)
        total = data.get("total_count", 0)
        
        # Clinical trial count score
        if clinical_trials == 0:
            clinical_score = 8  # Early target, unknown potential
        elif clinical_trials < 10:
            clinical_score = 12 + clinical_trials * 0.5
        elif clinical_trials < 50:
            clinical_score = 16 + (clinical_trials - 10) * 0.1
        else:
            clinical_score = max(10, 20 - (clinical_trials - 50) * 0.05)
        
        # Basic research depth (based on total literature count)
        if total < 1000:
            basic_score = 2
        elif total < 5000:
            basic_score = 3
        else:
            basic_score = 4
        
        return min(20, clinical_score + basic_score)
    
    def _score_collaboration(self, data: Dict) -> float:
        """
        Score: Collaboration network (0-15 points)
        Diversity of research institutions/teams distribution
        """
        total = data.get("total_count", 0)
        
        # Estimated collaboration diversity based on literature count
        if total < 100:
            diversity_score = 5
        elif total < 1000:
            diversity_score = 8 + (total - 100) / 900 * 4
        elif total < 5000:
            diversity_score = 12 + (total - 1000) / 4000 * 3
        else:
            diversity_score = 15
        
        return min(15, diversity_score)
    
    def _score_trend(self, data: Dict) -> float:
        """
        Score: Time trend (0-15 points)
        Recent research growth trend
        """
        year_dist = data.get("year_distribution", {})
        
        if not year_dist or len(year_dist) < 2:
            return 7.5  # Neutral score
        
        # Calculate growth trend
        years = sorted(year_dist.keys())
        values = [year_dist[y] for y in years]
        
        if len(values) >= 2:
            # Compare the mean of the earlier half with the mean of the later half
            early_avg = np.mean(values[:len(values)//2])
            recent_avg = np.mean(values[len(values)//2:])
            
            if early_avg > 0:
                growth_rate = (recent_avg - early_avg) / early_avg
            else:
                growth_rate = 0
            
            # Convert to 0-15 points
            if growth_rate > 1.0:  # Growth over 100%
                trend_score = 15
            elif growth_rate > 0.5:
                trend_score = 12 + (growth_rate - 0.5) * 6
            elif growth_rate > 0.2:
                trend_score = 9 + (growth_rate - 0.2) * 10
            elif growth_rate > 0:
                trend_score = 6 + growth_rate * 15
            elif growth_rate > -0.2:
                trend_score = max(0, 6 + growth_rate * 30)
            else:
                trend_score = max(0, 3 + (growth_rate + 0.2) * 15)
        else:
            trend_score = 7.5
        
        return min(15, max(0, trend_score))
    
    def calculate_confidence(self, data: Dict) -> float:
        """Calculate confidence (0-1)"""
        total = data.get("total_count", 0)
        
        # More literature, higher confidence
        if total < 50:
            return 0.4
        elif total < 200:
            return 0.6
        elif total < 1000:
            return 0.75
        elif total < 5000:
            return 0.85
        else:
            return 0.90
    
    def generate_interpretation(self, score: float, data: Dict) -> str:
        """Generate score interpretation"""
        total = data.get("total_count", 0)
        clinical = data.get("clinical_trials", 0)
        
        if score >= 80:
            level = "Extremely High Novelty"
            desc = "This target has limited research but unique value, representing a highly promising innovative direction."
        elif score >= 65:
            level = "High Novelty"
            desc = "This target has some research foundation but still has significant exploration space, recommended for focused attention."
        elif score >= 50:
            level = "Medium Novelty"
            desc = "This target has moderate research heat, requiring further evaluation of its differentiation advantages."
        elif score >= 35:
            level = "Low Novelty"
            desc = "This target already has substantial research, with intense competition, requiring breakthroughs in specific niche areas."
        else:
            level = "Very Low Novelty"
            desc = "This target is a mature research direction with limited innovation space, requiring careful evaluation of input-output ratio."
        
        details = f"Total literature: {total}, Clinical trials: {clinical}."
        
        return f"[{level}] {desc} {details}"
    
    def score(self, target: str, years: int = 5) -> NoveltyScore:
        """
        Calculate target novelty score
        
        Args:
            target: Target name or gene symbol
            years: Analysis year range
            
        Returns:
            NoveltyScore object
        """
        # Search literature data
        search_result = self.searcher.search(target, years)
        
        # Calculate scores for each dimension
        breakdown = {
            "research_heat": self._score_research_heat(search_result),
            "uniqueness": self._score_uniqueness(search_result, target),
            "research_depth": self._score_research_depth(search_result),
            "collaboration": self._score_collaboration(search_result),
            "trend": self._score_trend(search_result)
        }
        
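        # Calculate total score: each dimension is already expressed in points on its own
        # scale (25 + 25 + 20 + 15 + 15 = 100), so the total is their plain sum. Applying
        # self.weights on top would weight every dimension twice and cap the total at 84.
        novelty_score = round(sum(breakdown.values()), 1)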
        
        # Calculate confidence
        confidence = self.calculate_confidence(search_result)
        
        # Build metadata
        metadata = {
            "total_papers": search_result["total_count"],
            "recent_papers": search_result["recent_count"],
            "clinical_trials": search_result["clinical_trials"],
            "analysis_date": datetime.now().strftime("%Y-%m-%d"),
            "years_analyzed": years
        }
        
        # Generate interpretation
        interpretation = self.generate_interpretation(novelty_score, search_result)
        
        return NoveltyScore(
            target=target,
            novelty_score=novelty_score,
            confidence=round(confidence, 2),
            breakdown={k: round(v * 100 / 25, 1) for k, v in breakdown.items()},
            metadata=metadata,
            interpretation=interpretation
        )


def format_text_output(result: NoveltyScore) -> str:
    """Format text output"""
    lines = [
        "=" * 60,
        f"Target Novelty Score Report: {result.target}",
        "=" * 60,
        "",
        f"Overall Score: {result.novelty_score}/100",
        f"Confidence: {result.confidence * 100:.0f}%",
        "",
        "Dimension Scores:",
        f"  - Research Heat:   {result.breakdown['research_heat']:.1f}/100",
        f"  - Uniqueness:     {result.breakdown['uniqueness']:.1f}/100",
        f"  - Research Depth:   {result.breakdown['research_depth']:.1f}/100",
        f"  - Collaboration Network:   {result.breakdown['collaboration']:.1f}/100",
        f"  - Time Trend:   {result.breakdown['trend']:.1f}/100",
        "",
        "Statistics:",
        f"  - Total Literature: {result.metadata['total_papers']:,}",
        f"  - Recent Literature: {result.metadata['recent_papers']:,}",
        f"  - Clinical Trials: {result.metadata['clinical_trials']}",
        f"  - Analysis Date: {result.metadata['analysis_date']}",
        "",
        f"Interpretation: {result.interpretation}",
        "",
        "=" * 60
    ]
    return "\n".join(lines)


def format_csv_output(result: NoveltyScore) -> str:
    """Format CSV output"""
    headers = [
        "target", "novelty_score", "confidence",
        "research_heat", "uniqueness", "research_depth",
        "collaboration", "trend", "total_papers",
        "recent_papers", "clinical_trials", "interpretation"
    ]
    
    values = [
        result.target,
        result.novelty_score,
        result.confidence,
        result.breakdown['research_heat'],
        result.breakdown['uniqueness'],
        result.breakdown['research_depth'],
        result.breakdown['collaboration'],
        result.breakdown['trend'],
        result.metadata['total_papers'],
        result.metadata['recent_papers'],
        result.metadata['clinical_trials'],
        f'"{result.interpretation}"'
    ]
    
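    # Emit the header row together with the values so the CSV is self-describing
    return ",".join(headers) + "\n" + ",".join(map(str, values))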


def main():
    """Main function"""
    parser = argparse.ArgumentParser(
        description="Target Novelty Scoring Tool - Based on Literature Mining",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python main.py --target "PD-L1"
  python main.py --target "BRCA1" --years 10 --format json
  python main.py --target "EGFR" --output report.json
        """
    )
    
    parser.add_argument(
        "--target", "-t",
        required=True,
        help="Target name or gene symbol (e.g.: PD-L1, BRCA1)"
    )
    parser.add_argument(
        "--db", "-d",
        choices=["pubmed", "pmc", "all"],
        default="pubmed",
        help="Data source (default: pubmed)"
    )
    parser.add_argument(
        "--years", "-y",
        type=int,
        default=5,
        help="Analysis year range (default: 5)"
    )
    parser.add_argument(
        "--output", "-o",
        help="Output file path (default: stdout)"
    )
    parser.add_argument(
        "--format", "-f",
        choices=["text", "json", "csv"],
        default="text",
        help="Output format (default: text)"
    )
    parser.add_argument(
        "--verbose", "-v",
        action="store_true",
        help="Verbose output mode"
    )
    
    args = parser.parse_args()
    
    # Execute scoring
    try:
        scorer = NoveltyScorer()
        result = scorer.score(args.target, args.years)
        
        # Format output
        if args.format == "json":
            output = json.dumps(asdict(result), ensure_ascii=False, indent=2)
        elif args.format == "csv":
            output = format_csv_output(result)
        else:
            output = format_text_output(result)
        
        # Output results
        if args.output:
            with open(args.output, "w", encoding="utf-8") as f:
                f.write(output)
            print(f"Report saved to: {args.output}")
        else:
            print(output)
        
        # Verbose mode
        if args.verbose and args.format == "text":
            print(f"\nRaw Data: {json.dumps(result.metadata, ensure_ascii=False)}")
        
        return 0
        
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return 1


if __name__ == "__main__":
    sys.exit(main())


Response Tone Polisher

Skill

Polishes response letters by transforming defensive or harsh language.

---
name: response-tone-polisher
description: Polishes response letters by transforming defensive or harsh language.
license: MIT
skill-author: AIPOCH
---
# Response Tone Polisher

Polishes response letters to peer reviewers by softening harsh or defensive language while preserving the author's position and scientific integrity.

## When to Use

- Use this skill when the task is to polish a response letter to peer reviewers by transforming defensive or harsh language.
- Use this skill for academic writing tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

- **Tone Analysis**: Identifies defensive, confrontational, or overly direct language
- **Polite Transformation**: Converts harsh statements into courteous academic prose
- **Position Preservation**: Maintains the author's scientific stance while improving delivery
- **Context Awareness**: Adapts based on response type (acceptance, partial acceptance, respectful decline)
- **Academic Expression Library**: Built-in collection of polished academic phrasings

## Dependencies

See `## Prerequisites` below for related details.

- `Python`: `3.10+`. Repository baseline for current packaged skills.
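- `dataclasses`: `unspecified`. Declared in `requirements.txt`; part of the standard library since Python 3.7, so no separate install is needed.
- `enum`: `unspecified`. Declared in `requirements.txt`; part of the standard library since Python 3.4, so no separate install is needed.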

## Example Usage

```bash
cd "20260318/scientific-skills/Academic Writing/response-tone-polisher"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Overview` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```

## Overview

This skill analyzes author draft responses to reviewer comments and transforms confrontational or defensive phrasing into professional, diplomatic academic language. It helps researchers maintain positive relationships with reviewers while standing firm on scientifically justified positions.

## Usage Examples

### Basic Usage

```
Input:
Reviewer: The sample size is too small for meaningful conclusions.
Draft Response: I disagree. Our sample size is standard in this field.

Output:
We appreciate the reviewer's concern regarding sample size. While we acknowledge 
that larger samples provide greater statistical power, our sample size is consistent 
with established conventions in this field and meets the requirements for adequate 
power analysis (as detailed in the Methods section).
```

### Defensive Language Transformation

| Original (Defensive) | Polished (Professional) |
|---------------------|------------------------|
| "I will not change this." | "We have carefully considered this suggestion and respectfully maintain our original approach because..." |
| "The reviewer is wrong." | "We respectfully offer a different interpretation..." |
| "This is unnecessary." | "We appreciate this suggestion; however, we believe the current presentation adequately addresses this point." |
| "We already explained this." | "We have expanded our explanation to enhance clarity (Page X, Lines Y-Z)." |
| "That's not our fault." | "We acknowledge this limitation and have added appropriate caveats to the Discussion." |

## Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `reviewer_comment` | str | Yes | The reviewer's original comment or criticism |
| `draft_response` | str | Yes | Author's initial draft response (may contain harsh/defensive language) |
| `response_type` | str | No | One of: `accept`, `partial`, `decline` (default: auto-detect) |
| `polish_level` | str | No | `light`, `moderate`, `heavy` (default: moderate) |
| `preserve_meaning` | bool | No | Ensure scientific position is preserved (default: true) |

## Output Format

```json
{
  "polished_response": "string",
  "original_tone_score": "float (0-1, higher = more defensive)",
  "improvements": [
    {
      "original_phrase": "string",
      "polished_phrase": "string",
      "issue_type": "string"
    }
  ],
  "suggestions": ["string"],
  "politeness_score": "float (0-1)"
}
```
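
As an illustration of the schema above, a result object can be serialized with `dataclasses.asdict`. This is a sketch under assumptions: `ExampleResult` and `to_json` are hypothetical names, and the range check simply enforces the 0-1 bounds stated in the schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class ExampleResult:
    """Hypothetical stand-in carrying the fields listed in the schema."""
    polished_response: str
    original_tone_score: float  # 0-1, higher = more defensive
    politeness_score: float     # 0-1
    improvements: List[Dict[str, str]] = field(default_factory=list)
    suggestions: List[str] = field(default_factory=list)


def to_json(result: ExampleResult) -> str:
    """Serialize the result, rejecting scores outside the documented 0-1 range."""
    for name in ("original_tone_score", "politeness_score"):
        value = getattr(result, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0, 1], got {value}")
    return json.dumps(asdict(result), ensure_ascii=False, indent=2)
```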

## Tone Patterns Detected

The skill identifies and transforms:

### 1. Direct Refusals
- "No" / "We won't" → "We respectfully decline to..."
- "We can't" → "We are unable to..."

### 2. Defensive Statements
- "But we already..." → "We have now clarified..."
- "This is not correct" → "We respectfully note that..."

### 3. Blame Shifting
- "The reviewer misunderstood" → "We apologize for the lack of clarity; we have revised..."
- "This is standard" → "This approach aligns with established conventions..."

### 4. Emotional Language
- "Unfortunately" (overused) → [removed or softened]
- "Obviously" → [removed]
- "Clearly" → [removed or context-dependent]
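
A few of the patterns above can be sketched as ordered regular-expression substitutions. This is only an illustration: `EXAMPLE_RULES` and `apply_rules` are hypothetical names, the rule list is a small subset chosen for this example, and the issue-type labels are assumed from the section headings.

```python
import re
from typing import List, Tuple

# Illustrative subset of the patterns listed above; not the packaged rule set.
# Each rule is (regex, replacement, issue_type).
EXAMPLE_RULES: List[Tuple[str, str, str]] = [
    (r"\bWe can't\b", "We are unable to", "direct_refusal"),
    (r"\bWe won't\b", "We respectfully decline to", "direct_refusal"),
    (r"\bThe reviewer misunderstood\b",
     "We apologize for the lack of clarity; we have revised",
     "blame_shifting"),
]


def apply_rules(text: str) -> Tuple[str, List[str]]:
    """Apply each rule in order and report which issue types fired."""
    fired: List[str] = []
    for pattern, replacement, issue_type in EXAMPLE_RULES:
        text, count = re.subn(pattern, replacement, text)
        if count:
            fired.append(issue_type)
    return text, fired
```

Phrase-level substitution can leave the rest of a sentence awkward or shift its meaning, which is one reason the polished output still needs human review.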

## Polite Academic Expressions

### Acknowledging Reviewers
- "We thank the reviewer for this insightful observation."
- "We appreciate the reviewer's careful attention to this detail."
- "We are grateful for this constructive feedback."
- "This is an excellent point."

### Expressing Disagreement Diplomatically
- "We respectfully offer an alternative interpretation..."
- "Upon careful reconsideration, we believe..."
- "While we appreciate this perspective, we note that..."
- "We respectfully maintain our position that..."

### Explaining Limitations
- "We acknowledge this limitation and have addressed it by..."
- "This constraint reflects the trade-off between..."
- "We have added appropriate caveats regarding this limitation."

### Describing Changes
- "We have revised the manuscript to clarify..."
- "We have expanded the relevant section to include..."
- "We have incorporated this suggestion by..."

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Command Line Usage

```bash
# Interactive mode
python scripts/main.py --interactive

# File-based
python scripts/main.py \
  --reviewer-comment "comment.txt" \
  --draft-response "draft.txt" \
  --output "polished.txt"

# Direct input
python scripts/main.py \
  --reviewer "The data is insufficient." \
  --draft "You are wrong. We have enough data." \
  --polish-level heavy
```

## Python API

```python
from scripts.main import TonePolisher

polisher = TonePolisher()
result = polisher.polish(
    reviewer_comment="The methodology is flawed.",
    draft_response="No it's not. We did it right.",
    response_type="decline",
    polish_level="moderate"
)

print(result["polished_response"])
```

## References

- `references/polite_expressions.json` - Curated library of academic polite expressions
- `references/tone_patterns.md` - Common defensive patterns and their transformations
- `references/examples/` - Before/after polishing examples
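
Several entries in `references/polite_expressions.json` contain `{placeholder}` fields. A minimal sketch of filling them with `str.format` follows; the excerpt is inlined so the example runs on its own, and the `fill` helper is a hypothetical name, not part of the packaged script.

```python
import json

# Excerpt copied from references/polite_expressions.json, inlined for a standalone example.
EXCERPT = json.loads("""
{
  "accepting_changes": {
    "detailed": [
      "We have expanded the {section} section to include {detail} (Page {page}, Lines {lines})."
    ]
  },
  "location_indicators": [
    "(Page {page}, Lines {start}-{end})"
  ]
}
""")


def fill(template: str, **values: object) -> str:
    """Fill a template; a missing placeholder raises KeyError instead of leaving a gap."""
    return template.format(**values)


sentence = fill(
    EXCERPT["accepting_changes"]["detailed"][0],
    section="Methods",
    detail="the power analysis",
    page=7,
    lines="112-118",
)
print(sentence)
```

Letting a missing value raise `KeyError` keeps an unfilled `{page}` or `{lines}` marker from slipping into a letter that goes to reviewers.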

## Limitations

- Does not verify scientific accuracy of responses
- Requires human review for complex nuanced disagreements
- May over-soften; authors should verify position is still clear
- Best for English-language responses

## Quality Checklist

After polishing, verify:
- [ ] Original scientific position is preserved
- [ ] No confrontational language remains
- [ ] Professional tone throughout
- [ ] Clear acknowledgment of reviewer's effort
- [ ] Specific changes are still referenced
- [ ] Response directly addresses the comment

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python script (`scripts/main.py`) executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
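
The two path-related items in the checklist (no `../` traversal, output restricted to the workspace) can be approximated with `pathlib`. This is a sketch under assumptions: `resolve_inside` is a hypothetical helper, and the workspace root is whatever directory the caller passes in.

```python
from pathlib import Path


def resolve_inside(workspace: Path, user_path: str) -> Path:
    """Return the resolved path, or raise ValueError if it escapes the workspace."""
    root = workspace.resolve()
    # Joining with an absolute user_path discards root, so absolute paths
    # outside the workspace are rejected by the same check below.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"path escapes workspace: {user_path}")
    return candidate
```

Resolving before comparing means that `..` segments and symlinks are normalized first, so the check applies to the real target rather than to the literal string.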

## Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.

## Input Validation

This skill accepts requests that match the documented purpose of `response-tone-polisher` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `response-tone-polisher` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.

FILE:references/polite_expressions.json
{
  "defensive_patterns": {
    "direct_refusal": {
      "patterns": [
        "I will not change this",
        "We will not modify",
        "We cannot do this",
        "No, we won't",
        "This is not possible"
      ],
      "replacements": [
        "We have carefully considered this suggestion and respectfully maintain our original approach",
        "Upon thorough evaluation, we believe our current approach is appropriate",
        "We respectfully decline to make this change because"
      ]
    },
    "confrontational": {
      "patterns": [
        "I disagree",
        "We disagree",
        "This is wrong",
        "You are incorrect",
        "The reviewer is mistaken"
      ],
      "replacements": [
        "We respectfully offer a different interpretation",
        "We note that an alternative view may be warranted",
        "Upon careful reconsideration, we maintain that"
      ]
    },
    "defensive": {
      "patterns": [
        "But we already explained",
        "We already stated this",
        "This was mentioned before",
        "We previously noted"
      ],
      "replacements": [
        "We have now expanded our explanation to enhance clarity",
        "We have revised this section to provide additional detail",
        "To improve clarity, we have added"
      ]
    },
    "blame_shifting": {
      "patterns": [
        "The reviewer misunderstood",
        "This was misinterpreted",
        "You didn't understand",
        "This is not our fault"
      ],
      "replacements": [
        "We apologize for the lack of clarity; we have revised",
        "We recognize that our explanation was insufficient; we have clarified",
        "We acknowledge this limitation and have addressed it"
      ]
    },
    "dismissive": {
      "patterns": [
        "This is unnecessary",
        "This is not needed",
        "Obviously",
        "Clearly",
        "It is evident that"
      ],
      "replacements": [
        "While we appreciate this suggestion, we believe the current presentation is appropriate",
        "We have considered this and maintain that",
        "We note that"
      ]
    }
  },
  "polite_acknowledgments": {
    "standard": [
      "We thank the reviewer for this insightful observation.",
      "We appreciate the reviewer's careful attention to this detail.",
      "We are grateful for this constructive feedback.",
      "We value this thoughtful comment.",
      "Thank you for raising this important point."
    ],
    "major_comment": [
      "We are grateful for this significant observation.",
      "This is an excellent and important point.",
      "We deeply appreciate this critical feedback."
    ],
    "minor_comment": [
      "Thank you for this helpful suggestion.",
      "We appreciate this refinement.",
      "This is a valuable correction."
    ]
  },
  "respectful_disagreement": {
    "opening": [
      "We respectfully offer an alternative interpretation.",
      "Upon careful reconsideration, we believe that",
      "While we appreciate this perspective, we respectfully note that",
      "We respectfully maintain our position that",
      "We believe an alternative approach may be warranted because"
    ],
    "rationale": [
      "our data strongly support the current interpretation",
      "the established methodology in this field supports our approach",
      "the specific constraints of our study necessitate this approach",
      "additional analysis (described below) confirms our original conclusion"
    ],
    "closing": [
      "We hope this explanation adequately addresses the reviewer's concern.",
      "We have added further discussion of this point to the manuscript.",
      "We believe this approach best serves the scientific integrity of the study."
    ]
  },
  "accepting_changes": {
    "simple": [
      "We have revised the manuscript accordingly.",
      "This has been addressed in the revised version.",
      "We have incorporated this suggestion."
    ],
    "detailed": [
      "We have expanded the {section} section to include {detail} (Page {page}, Lines {lines}).",
      "We have added {content} as suggested (Page {page}).",
      "The {section} has been revised to clarify {point} (Lines {lines})."
    ]
  },
  "explaining_limitations": {
    "sample_size": [
      "We acknowledge that our sample size is limited; however, it is consistent with similar studies in this field.",
      "While we recognize the constraint of our sample size, our power analysis indicates adequate statistical power.",
      "We have added a discussion of this limitation to the manuscript."
    ],
    "methodological": [
      "This methodological choice reflects the trade-off between {factor1} and {factor2}.",
      "We selected this approach because {rationale}, while acknowledging that {alternative} is also valid.",
      "We have added appropriate caveats regarding this methodological choice."
    ],
    "general": [
      "We acknowledge this limitation and have addressed it by {action}.",
      "We recognize this constraint and have noted it in the Discussion section.",
      "This limitation is now explicitly discussed in the revised manuscript."
    ]
  },
  "clarification_phrases": [
    "To enhance clarity, we have",
    "We have revised this section to more clearly state that",
    "The text now explicitly states",
    "We have clarified this point by",
    "For improved clarity, we have added"
  ],
  "location_indicators": [
    "(Page {page}, Lines {start}-{end})",
    "(Page {page})",
    "(Lines {start}-{end})",
    "(Figure {number})",
    "(Table {number}, Page {page})"
  ]
}

FILE:references/tone_patterns.md
# Tone Transformation Patterns

This document catalogs common defensive or harsh language patterns in peer review responses and their polished academic alternatives.

## Pattern Categories

### 1. Direct Refusals

When you must decline a reviewer's suggestion, avoid blunt refusals.

| ❌ Defensive | ✅ Polished |
|-------------|------------|
| "I will not change this." | "We have carefully considered this suggestion and respectfully maintain our original approach because..." |
| "We can't do this." | "We are unable to make this change due to [reason]. However, we have..." |
| "This is not possible." | "This modification is not feasible within the scope of our study because..." |
| "No, we won't do that." | "We respectfully decline to make this change because [rationale]. Instead, we have..." |
| "I refuse to change this." | "Upon careful evaluation, we believe the current formulation best serves the manuscript because..." |

### 2. Expressing Disagreement

Disagreement is acceptable; rudeness is not.

| ❌ Defensive | ✅ Polished |
|-------------|------------|
| "You're wrong." | "We respectfully offer a different interpretation..." |
| "The reviewer is incorrect." | "We note that an alternative view may be warranted..." |
| "This is not true." | "We respectfully suggest that the evidence may support an alternative conclusion..." |
| "I disagree completely." | "Upon careful reconsideration, we maintain that..." |
| "That's not right." | "We believe the data are best interpreted as..." |
| "The reviewer misunderstood." | "We apologize for the lack of clarity; we have revised the text to clarify that..." |

### 3. Defensive Responses

Don't defend by pointing out what you already did.

| ❌ Defensive | ✅ Polished |
|-------------|------------|
| "But we already explained this." | "We have now expanded our explanation in Section X to provide additional detail..." |
| "We stated this clearly in the original." | "We have revised this section to enhance clarity regarding..." |
| "This was mentioned on page X." | "To make this more prominent, we have added additional emphasis to the discussion on page X..." |
| "We already discussed this limitation." | "We have expanded our discussion of this limitation to better address the reviewer's concern..." |
| "This point was already made." | "We have strengthened the presentation of this point by..." |

### 4. Dismissive Language

Never dismiss a reviewer's comment, even if it seems trivial.

| ❌ Defensive | ✅ Polished |
|-------------|------------|
| "This is unnecessary." | "While we appreciate this suggestion, we believe the current presentation adequately addresses this point because..." |
| "This doesn't matter." | "We have considered this point and believe the current approach is appropriate because..." |
| "This is obvious." | "We have ensured this point is clearly stated in the revised manuscript..." |
| "This is standard practice." | "This approach aligns with established conventions in the field..." |
| "Everyone knows this." | "We have added a brief explanation of this concept for clarity..." |

### 5. Blame Shifting

Take responsibility for any misunderstanding.

| ❌ Defensive | ✅ Polished |
|-------------|------------|
| "The reviewer didn't understand." | "We recognize that our explanation was insufficient; we have revised the text to clarify..." |
| "This was clear in the original." | "We have enhanced the clarity of this section by..." |
| "The figure caption was clear." | "We have revised the figure caption to provide additional clarity..." |
| "This is not our fault." | "We acknowledge this limitation and have added appropriate caveats to the Discussion..." |
| "The reviewer missed this." | "We have made this point more prominent in the revised manuscript..." |

### 6. Emotional Language

Keep emotions out of academic writing.

| ❌ Defensive | ✅ Polished |
|-------------|------------|
| "Unfortunately, we cannot..." | "We are unable to..." |
| "We are disappointed that..." | [Remove entirely] |
| "This is frustrating because..." | "This presents a challenge because..." |
| "We strongly believe..." | "We maintain that..." |
| "We feel that..." | "We believe that..." |

### 7. Hedges and Qualifiers

Academic writing uses measured, precise language.

| ❌ Too Strong | ✅ Measured |
|--------------|------------|
| "This proves that..." | "These results suggest that..." |
| "This clearly shows..." | "These findings indicate that..." |
| "Obviously..." | [Remove or provide evidence] |
| "Undoubtedly..." | "The evidence suggests..." |
| "It is certain that..." | "It is likely that..." |

## Response Templates by Scenario

### Accepting a Suggestion

**Structure:**
1. Acknowledge the value of the suggestion
2. Describe the specific change made
3. Provide location reference

**Template:**
```
We thank the reviewer for this valuable suggestion. We have revised the 
[section] to [describe change]. Specifically, we have [specific action] 
(Page X, Lines Y-Z).
```

### Partial Acceptance

**Structure:**
1. Acknowledge the suggestion
2. Explain what was implemented
3. Explain what was not and why
4. Provide rationale

**Template:**
```
We appreciate the reviewer's suggestion regarding [topic]. We have adopted 
part of this recommendation by [action taken]. However, we have maintained 
our original approach to [aspect] because [brief rationale]. We have added 
a discussion of this rationale to the manuscript (Page X).
```

### Respectful Decline

**Structure:**
1. Thank the reviewer
2. State your position respectfully
3. Provide clear rationale
4. Offer a compromise if possible

**Template:**
```
We thank the reviewer for this suggestion. Upon careful consideration, we 
respectfully maintain our original [approach/formulation/interpretation] 
because [clear rationale supported by evidence]. We hope this explanation 
adequately addresses the reviewer's concern. To enhance clarity, we have 
added additional discussion of this point (Page X, Lines Y-Z).
```

## Key Phrases Library

### Opening Acknowledgments
- "We thank the reviewer for this insightful observation."
- "We appreciate the reviewer's careful attention to this detail."
- "We are grateful for this constructive feedback."
- "This is an excellent point."
- "We value this thoughtful comment."

### Expressing Alternative Views
- "We respectfully offer an alternative interpretation..."
- "Upon careful reconsideration, we believe..."
- "While we appreciate this perspective, we respectfully note that..."
- "We respectfully maintain our position that..."
- "An alternative view might be that..."

### Describing Limitations
- "We acknowledge this limitation and have addressed it by..."
- "This constraint reflects the trade-off between..."
- "We have added appropriate caveats regarding this limitation..."
- "We recognize this constraint and have noted it in the Discussion..."

### Describing Changes
- "We have revised the manuscript to clarify..."
- "We have expanded the relevant section to include..."
- "We have incorporated this suggestion by..."
- "The text has been updated to reflect..."
- "We have added the following clarification..."

### Closing Statements
- "We hope this explanation adequately addresses the reviewer's concern."
- "We believe these revisions have substantially improved the manuscript."
- "We are grateful for the opportunity to strengthen our work."

## Common Pitfalls

### 1. Over-Apologizing
Avoid excessive apologies. One "thank you" is sufficient.

❌ "We sincerely apologize for this error and are deeply sorry..."
✅ "We thank the reviewer for identifying this error. We have corrected..."

### 2. Over-Explaining
Keep rationales concise. Reviewers don't need your entire thought process.

❌ "We spent many weeks considering this and consulted with several colleagues and reviewed the literature extensively and..."
✅ "Based on [specific reference/standard practice], we maintain that..."

### 3. Being Too Vague
Be specific about what you changed.

❌ "We have addressed this concern."
✅ "We have added a paragraph discussing [specific topic] (Page X, Lines Y-Z)."

### 4. Promising Too Much
Don't promise changes you won't make.

❌ "We will conduct additional experiments..." (if you won't)
✅ "We respectfully note that additional experiments are beyond the scope..."

## Quick Reference: Severity Levels

### Light Polish (Defensiveness 0.3-0.5)
- Remove obvious emotional language
- Add basic acknowledgments
- Fix direct refusals

### Moderate Polish (Defensiveness 0.5-0.7)
- Transform confrontational language
- Add rationale for disagreements
- Improve overall flow

### Heavy Polish (Defensiveness 0.7+)
- Complete restructuring of response
- Add comprehensive acknowledgments
- Multiple revision passes
- Suggest alternative compromises
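
The three ranges above share their boundary values (0.5 and 0.7). One way to make the mapping unambiguous is sketched below; it assumes that a boundary score goes to the heavier level and that scores under 0.3 need no polishing. Neither rule is stated in this document, and `suggest_polish_level` is a hypothetical helper.

```python
from typing import Optional


def suggest_polish_level(defensiveness: float) -> Optional[str]:
    """Map a 0-1 defensiveness score to a polish level; None means below the light threshold."""
    if not 0.0 <= defensiveness <= 1.0:
        raise ValueError("defensiveness must be within [0, 1]")
    if defensiveness >= 0.7:
        return "heavy"
    if defensiveness >= 0.5:  # assumption: boundary scores go to the heavier level
        return "moderate"
    if defensiveness >= 0.3:
        return "light"
    return None  # assumption: no polishing suggested below 0.3
```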

FILE:requirements.txt
dataclasses
enum

FILE:scripts/main.py
#!/usr/bin/env python3
"""
Response Tone Polisher

Polishes response letters to peer reviewers by transforming defensive or harsh 
language into professional, courteous academic prose.

Usage:
    python main.py --interactive
    python main.py --reviewer-comment "comment.txt" --draft-response "draft.txt"
    python main.py --reviewer "The data is insufficient" --draft "We disagree"

Author: AI Assistant
Version: 1.0.0
"""

import argparse
import json
import re
import sys
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Dict, Tuple
import os


class ResponseType(Enum):
    """Types of response to reviewer comments."""
    ACCEPT = "accept"
    PARTIAL = "partial"
    DECLINE = "decline"


class PolishLevel(Enum):
    """Levels of tone polishing."""
    LIGHT = "light"
    MODERATE = "moderate"
    HEAVY = "heavy"


@dataclass
class ToneIssue:
    """Represents a tone issue found in text."""
    original_phrase: str
    issue_type: str
    severity: float  # 0-1, higher = more problematic
    position: Tuple[int, int]  # start, end positions


@dataclass
class PolishResult:
    """Result of tone polishing."""
    polished_response: str
    original_tone_score: float
    politeness_score: float
    improvements: List[Dict]
    suggestions: List[str]


class TonePolisher:
    """
    Main class for polishing response letter tone.
    
    Transforms defensive or harsh language into professional, 
    courteous academic prose while preserving scientific positions.
    """
    
    # Defensive/harsh patterns and their polished alternatives
    TONE_PATTERNS = {
        # Direct refusals
        r"\b[Ii] will not (?:change|modify|correct|revise|update)\b": {
            "replacement": "We have carefully considered this suggestion and respectfully maintain our original {noun}",
            "issue_type": "direct_refusal",
            "severity": 0.9
        },
        r"\b[Ww]e will not (?:change|modify|correct|revise|update)\b": {
            "replacement": "We have carefully considered this suggestion and respectfully maintain our original {noun}",
            "issue_type": "direct_refusal",
            "severity": 0.9
        },
        r"\b[Nn]o,? (?:we )?(?:will not|won't|cannot|can't)\b": {
            "replacement": "We respectfully decline to {verb} because",
            "issue_type": "direct_refusal",
            "severity": 0.85
        },
        r"\b[Ii] (?:disagree|don't agree)\b": {
            "replacement": "We respectfully offer a different interpretation",
            "issue_type": "confrontational",
            "severity": 0.7
        },
        r"\b[Ww]e (?:disagree|don't agree)\b": {
            "replacement": "We respectfully offer a different interpretation",
            "issue_type": "confrontational",
            "severity": 0.7
        },
        
        # Defensive statements
        r"\b[Tt]he reviewer is (?:wrong|incorrect|mistaken)\b": {
            "replacement": "We respectfully note that",
            "issue_type": "defensive",
            "severity": 0.95
        },
        r"\b[Tt]his is (?:wrong|incorrect|not correct)\b": {
            "replacement": "We respectfully suggest an alternative view",
            "issue_type": "defensive",
            "severity": 0.8
        },
        r"\b[Ww]e (?:already|previously) (?:stated|mentioned|explained|noted)\b": {
            "replacement": "We have now expanded our explanation to enhance clarity",
            "issue_type": "defensive",
            "severity": 0.75
        },
        r"\b[Bb]ut we (?:already|did)\b": {
            "replacement": "We have now clarified",
            "issue_type": "defensive",
            "severity": 0.7
        },
        
        # Blame shifting
        r"\b[Tt]he reviewer (?:misunderstood|misinterpreted|did not understand)\b": {
            "replacement": "We apologize for the lack of clarity; we have revised",
            "issue_type": "blame_shifting",
            "severity": 0.9
        },
        r"\b[Tt]his was not our (?:fault|responsibility|error)\b": {
            "replacement": "We acknowledge this limitation and have added appropriate caveats",
            "issue_type": "blame_shifting",
            "severity": 0.85
        },
        
        # Dismissive language
        r"\b[Tt]his is (?:unnecessary|not needed|redundant)\b": {
            "replacement": "While we appreciate this suggestion, we believe the current presentation adequately addresses this point",
            "issue_type": "dismissive",
            "severity": 0.75
        },
        r"\b[Tt]his is (?:obvious|clear|evident)\b": {
            "replacement": "This point is addressed",
            "issue_type": "dismissive",
            "severity": 0.6
        },
        r"\b[Ii]t is (?:obvious|clear|evident) that\b": {
            "replacement": "We note that",
            "issue_type": "dismissive",
            "severity": 0.65
        },
        
        # Overly casual/direct
        r"\b[Ii]t's not our (?:fault|problem)\b": {
            "replacement": "This reflects an inherent limitation that we have now addressed",
            "issue_type": "unprofessional",
            "severity": 0.9
        },
        r"\b[Nn]ot our (?:fault|responsibility)\b": {
            "replacement": "We acknowledge this constraint",
            "issue_type": "unprofessional",
            "severity": 0.85
        },
        
        # Emotional/overly apologetic
        r"\b[Uu]nfortunately,? we (?:cannot|can't|are unable to)\b": {
            "replacement": "We are unable to",
            "issue_type": "emotional",
            "severity": 0.4
        },
        r"\b[Ww]e (?:cannot|can't) (?:possibly|really)\b": {
            "replacement": "We are unable to",
            "issue_type": "emotional",
            "severity": 0.5
        },
    }
    
    # Polite academic expression library
    POLITE_EXPRESSIONS = {
        "acknowledgments": [
            "We thank the reviewer for this insightful observation.",
            "We appreciate the reviewer's careful attention to this detail.",
            "We are grateful for this constructive feedback.",
            "This is an excellent point.",
            "We value this thoughtful comment.",
        ],
        "partial_agreement": [
            "We thank the reviewer for this suggestion. While we appreciate this perspective, we note that",
            "We appreciate this recommendation. Upon careful consideration, we believe",
            "We have carefully evaluated this suggestion. However,",
        ],
        "respectful_disagreement": [
            "We respectfully offer an alternative interpretation",
            "Upon careful reconsideration, we maintain that",
            "While we appreciate this perspective, we respectfully note that",
            "We respectfully maintain our position that",
            "We believe an alternative approach may be warranted",
        ],
        "limitations": [
            "We acknowledge this limitation and have addressed it by",
            "This constraint reflects the trade-off between",
            "We have added appropriate caveats regarding this limitation",
            "We recognize this constraint and have noted it in",
        ],
        "changes_made": [
            "We have revised the manuscript to clarify",
            "We have expanded the relevant section to include",
            "We have incorporated this suggestion by",
            "We have updated the text to reflect",
            "We have added the following clarification",
        ],
        "clarification": [
            "To enhance clarity, we have",
            "We have revised this section to more clearly state",
            "The text now explicitly states",
            "We have clarified this point by",
        ]
    }
    
    def __init__(self, polish_level: PolishLevel = PolishLevel.MODERATE):
        self.polish_level = polish_level
        self.context_noun = "approach"  # Default noun for replacements
        
    def analyze_tone(self, text: str) -> Tuple[float, List[ToneIssue]]:
        """
        Analyze the tone of a draft response.
        
        Returns:
            Tuple of (defensiveness_score, list_of_issues)
        """
        issues = []
        total_severity = 0
        
        for pattern, info in self.TONE_PATTERNS.items():
            matches = list(re.finditer(pattern, text, re.IGNORECASE))
            for match in matches:
                issue = ToneIssue(
                    original_phrase=match.group(),
                    issue_type=info["issue_type"],
                    severity=info["severity"],
                    position=(match.start(), match.end())
                )
                issues.append(issue)
                total_severity += info["severity"]
        
        # Calculate overall defensiveness score (0-1)
        word_count = len(text.split())
        if word_count > 0:
            defensiveness_score = min(1.0, total_severity / (word_count / 20 + 1))
        else:
            defensiveness_score = 0.0
            
        return defensiveness_score, issues
    
    def polish(
        self,
        reviewer_comment: str,
        draft_response: str,
        response_type: Optional[ResponseType] = None,
        polish_level: Optional[PolishLevel] = None
    ) -> PolishResult:
        """
        Polish a draft response to make it more professional and courteous.
        
        Args:
            reviewer_comment: The original reviewer comment
            draft_response: The author's draft response
            response_type: Type of response (accept/partial/decline)
            polish_level: Level of polishing to apply
            
        Returns:
            PolishResult containing polished response and metadata
        """
        if polish_level:
            self.polish_level = polish_level
            
        # Auto-detect response type if not provided
        if not response_type:
            response_type = self._detect_response_type(draft_response)
        
        # Analyze original tone
        original_tone_score, issues = self.analyze_tone(draft_response)
        
        # Generate polished response
        polished = self._generate_polished_response(
            reviewer_comment, draft_response, response_type, issues
        )
        
        # Calculate politeness score
        politeness_score = self._calculate_politeness(polished)
        
        # Generate improvements list
        improvements = self._generate_improvements_list(issues, polished)
        
        # Generate additional suggestions
        suggestions = self._generate_suggestions(reviewer_comment, polished)
        
        return PolishResult(
            polished_response=polished,
            original_tone_score=original_tone_score,
            politeness_score=politeness_score,
            improvements=improvements,
            suggestions=suggestions
        )
    
    def _detect_response_type(self, draft_response: str) -> ResponseType:
        """Auto-detect the type of response based on content."""
        decline_indicators = [
            r"\bnot change\b", r"\bnot modify\b", r"\bnot revise\b",
            r"\bdisagree\b", r"\bwon't\b", r"\bcannot\b",
            r"\bmaintain\b", r"\boriginal\b"
        ]
        accept_indicators = [
            r"\bhave revised\b", r"\bhave changed\b", r"\bhave updated\b",
            r"\bthank you\b", r"\bagree\b", r"\bmodified\b"
        ]
        
        decline_count = sum(1 for p in decline_indicators if re.search(p, draft_response, re.I))
        accept_count = sum(1 for p in accept_indicators if re.search(p, draft_response, re.I))
        
        if decline_count > accept_count:
            return ResponseType.DECLINE
        elif accept_count > 0:
            return ResponseType.ACCEPT
        else:
            return ResponseType.PARTIAL
    
    def _generate_polished_response(
        self,
        reviewer_comment: str,
        draft_response: str,
        response_type: ResponseType,
        issues: List[ToneIssue]
    ) -> str:
        """Generate the polished response text."""
        
        # Start with acknowledgment
        result_parts = []
        
        # Add appropriate opening based on response type
        if response_type == ResponseType.ACCEPT:
            opener = self._select_expression("acknowledgments")
            result_parts.append(opener)
            result_parts.append(self._transform_to_acceptance(draft_response))
        elif response_type == ResponseType.PARTIAL:
            opener = self._select_expression("partial_agreement")
            result_parts.append(opener)
            result_parts.append(self._transform_to_partial(draft_response))
        else:  # DECLINE
            opener = self._select_expression("acknowledgments")
            result_parts.append(opener)
            result_parts.append(self._transform_to_decline(draft_response))
        
        # Apply pattern-based replacements
        polished = " ".join(result_parts)
        
        # Apply specific pattern replacements
        for pattern, info in self.TONE_PATTERNS.items():
            # Substitute placeholders with str.replace so a template that
            # contains both {noun} and {verb} cannot raise KeyError
            replacement = (
                info["replacement"]
                .replace("{noun}", self.context_noun)
                # Simplified: the verb is not extracted from context
                .replace("{verb}", "make this change")
            )
            
            polished = re.sub(pattern, replacement, polished, flags=re.IGNORECASE)
        
        # Clean up and format
        polished = self._clean_text(polished)
        
        return polished
    
    def _select_expression(self, category: str) -> str:
        """Select a polite expression from the library."""
        import random
        expressions = self.POLITE_EXPRESSIONS.get(category, ["Thank you for this comment."])
        return random.choice(expressions)
    
    def _transform_to_acceptance(self, draft: str) -> str:
        """Transform draft into acceptance language."""
        # Remove defensive prefixes
        cleaned = re.sub(r"^\s*(?:[Bb]ut|[Hh]owever)\s*,?\s*", "", draft)
        
        # Add change language if not present
        if not re.search(r"\b(?:have revised|have changed|have updated|have modified)\b", cleaned, re.I):
            return f"We have revised the manuscript as suggested. {cleaned}"
        return cleaned
    
    def _transform_to_partial(self, draft: str) -> str:
        """Transform draft into partial acceptance language."""
        # Placeholder: the draft is returned unchanged; the partial-agreement
        # framing comes from the opener added by the caller
        return draft
    
    def _transform_to_decline(self, draft: str) -> str:
        """Transform draft into respectful decline language."""
        # Placeholder: the draft is returned unchanged; harsh phrasing is
        # softened later by the TONE_PATTERNS replacements
        return draft
    
    def _clean_text(self, text: str) -> str:
        """Clean up and format the text."""
        # Fix spacing
        text = re.sub(r"\s+", " ", text)
        # Fix punctuation spacing
        text = re.sub(r"\s+([.,;:!?])", r"\1", text)
        # Capitalize first letter
        text = text.strip()
        if text:
            text = text[0].upper() + text[1:]
        # Ensure period at end
        if text and not text.endswith((".", "!", "?")):
            text += "."
        return text
    
    def _calculate_politeness(self, text: str) -> float:
        """Calculate a politeness score for the text."""
        polite_indicators = [
            r"\bthank\b", r"\bappreciate\b", r"\bgrateful\b",
            r"\brespectfully\b", r"\bcarefully considered\b",
            r"\bvaluable\b", r"\binsightful\b", r"\bconstructive\b"
        ]
        
        rude_indicators = [
            r"\bwrong\b", r"\bincorrect\b", r"\bdisagree\b(?!\s+with)",
            r"\bnot\s+(?:correct|true|accurate)\b", r"\byou\s+(?:are|have)\s+(?:wrong|mistaken)"
        ]
        
        polite_count = sum(1 for p in polite_indicators if re.search(p, text, re.I))
        rude_count = sum(1 for p in rude_indicators if re.search(p, text, re.I))
        
        score = 0.5 + (polite_count * 0.1) - (rude_count * 0.2)
        return max(0.0, min(1.0, score))
    
    def _generate_improvements_list(
        self,
        issues: List[ToneIssue],
        polished: str
    ) -> List[Dict]:
        """Generate a list of improvements made."""
        improvements = []
        for issue in issues:
            # Find what this issue was transformed to (simplified)
            improvements.append({
                "original_phrase": issue.original_phrase,
                "polished_phrase": f"[Transformed {issue.issue_type}]",
                "issue_type": issue.issue_type,
                "severity": issue.severity
            })
        return improvements
    
    def _generate_suggestions(self, reviewer_comment: str, polished: str) -> List[str]:
        """Generate additional suggestions for improvement."""
        suggestions = []
        
        # Check for missing location references
        if not re.search(r"\b(?:[Pp]age|[Ll]ine|[Ff]igure|[Tt]able)\b", polished):
            suggestions.append("Consider adding page/line references to help reviewers locate changes.")
        
        # Check for rationale in decline situations
        if "respectfully" in polished.lower() and "because" not in polished.lower():
            suggestions.append("When declining suggestions, provide a brief rationale.")
        
        # Check length
        word_count = len(polished.split())
        if word_count < 20:
            suggestions.append("The response is quite brief; consider adding more detail about the changes made.")
        
        return suggestions


def interactive_mode():
    """Run in interactive mode to collect input from user."""
    print("=" * 70)
    print("Response Tone Polisher - Interactive Mode")
    print("=" * 70)
    print()
    print("This tool helps polish your responses to peer reviewers.")
    print("It transforms defensive or harsh language into professional, courteous prose.")
    print()
    
    # Get polish level
    print("Select polish level:")
    print("  1. Light (minimal changes)")
    print("  2. Moderate (recommended)")
    print("  3. Heavy (aggressive polishing)")
    level_choice = input("Choice (1-3): ").strip() or "2"
    
    level_map = {
        "1": PolishLevel.LIGHT,
        "2": PolishLevel.MODERATE,
        "3": PolishLevel.HEAVY
    }
    polish_level = level_map.get(level_choice, PolishLevel.MODERATE)
    
    # Get reviewer comment
    print("\n" + "-" * 70)
    print("Paste the REVIEWER COMMENT below (type 'END' on new line when done):")
    print("-" * 70)
    lines = []
    while True:
        line = input()
        if line.strip() == "END":
            break
        lines.append(line)
    reviewer_comment = "\n".join(lines)
    
    # Get draft response
    print("\n" + "-" * 70)
    print("Paste your DRAFT RESPONSE below (type 'END' on new line when done):")
    print("-" * 70)
    lines = []
    while True:
        line = input()
        if line.strip() == "END":
            break
        lines.append(line)
    draft_response = "\n".join(lines)
    
    # Process
    print("\n" + "=" * 70)
    print("Polishing your response...")
    print("=" * 70)
    
    polisher = TonePolisher(polish_level=polish_level)
    result = polisher.polish(reviewer_comment, draft_response)
    
    # Display results
    print("\n📊 TONE ANALYSIS:")
    print(f"   Original defensiveness score: {result.original_tone_score:.2f}/1.0")
    print(f"   Politeness score: {result.politeness_score:.2f}/1.0")
    
    if result.improvements:
        print(f"\n🔧 IMPROVEMENTS MADE ({len(result.improvements)}):")
        for i, imp in enumerate(result.improvements[:5], 1):
            print(f"   {i}. [{imp['issue_type'].replace('_', ' ').title()}]")
            # Truncate long phrases to 50 characters for display
            phrase = imp['original_phrase']
            if len(phrase) > 50:
                phrase = phrase[:50] + "..."
            print(f"      Original: \"{phrase}\"")
    
    if result.suggestions:
        print(f"\n💡 SUGGESTIONS:")
        for i, sug in enumerate(result.suggestions, 1):
            print(f"   {i}. {sug}")
    
    print("\n" + "=" * 70)
    print("✨ POLISHED RESPONSE:")
    print("=" * 70)
    print(result.polished_response)
    
    # Save option
    save = input("\n\nSave to file? (filename or 'n'): ").strip()
    if save and save.lower() != 'n':
        output = {
            "polished_response": result.polished_response,
            "original_tone_score": result.original_tone_score,
            "politeness_score": result.politeness_score,
            "improvements": result.improvements,
            "suggestions": result.suggestions
        }
        with open(save, 'w') as f:
            json.dump(output, f, indent=2)
        print(f"✅ Saved to {save}")


def main():
    parser = argparse.ArgumentParser(
        description="Polish response letters by transforming defensive language into professional prose"
    )
    parser.add_argument(
        "--reviewer-comment", "-r",
        help="File containing reviewer comment, or the comment text directly"
    )
    parser.add_argument(
        "--draft-response", "-d",
        help="File containing draft response, or the text directly"
    )
    parser.add_argument(
        "--output", "-o",
        help="Output file for the polished response"
    )
    parser.add_argument(
        "--response-type", "-t",
        choices=["accept", "partial", "decline"],
        help="Type of response (default: auto-detect)"
    )
    parser.add_argument(
        "--polish-level", "-p",
        choices=["light", "moderate", "heavy"],
        default="moderate",
        help="Level of polishing (default: moderate)"
    )
    parser.add_argument(
        "--interactive", "-i",
        action="store_true",
        help="Run in interactive mode"
    )
    parser.add_argument(
        "--json", "-j",
        action="store_true",
        help="Output as JSON"
    )
    
    args = parser.parse_args()
    
    if args.interactive or (not args.reviewer_comment and not args.draft_response):
        interactive_mode()
        return
    
    # Read inputs
    reviewer_comment = args.reviewer_comment
    draft_response = args.draft_response
    
    # Check if inputs are files
    if reviewer_comment and os.path.isfile(reviewer_comment):
        with open(reviewer_comment, 'r', encoding='utf-8') as f:
            reviewer_comment = f.read()
    
    if draft_response and os.path.isfile(draft_response):
        with open(draft_response, 'r', encoding='utf-8') as f:
            draft_response = f.read()
    
    if not reviewer_comment or not draft_response:
        # Report usage errors on stderr so stdout stays clean for piped output
        print("Error: Both --reviewer-comment and --draft-response are required.", file=sys.stderr)
        sys.exit(1)
    
    # Map response type
    response_type = None
    if args.response_type:
        response_type = ResponseType(args.response_type)
    
    # Map polish level
    polish_level = PolishLevel(args.polish_level)
    
    # Process
    polisher = TonePolisher(polish_level=polish_level)
    result = polisher.polish(reviewer_comment, draft_response, response_type, polish_level)
    
    # Output
    if args.json:
        output = {
            "polished_response": result.polished_response,
            "original_tone_score": result.original_tone_score,
            "politeness_score": result.politeness_score,
            "improvements": result.improvements,
            "suggestions": result.suggestions
        }
        output_text = json.dumps(output, indent=2)
    else:
        output_text = result.polished_response
    
    if args.output:
        with open(args.output, 'w', encoding='utf-8') as f:
            f.write(output_text)
        print(f"Polished response saved to {args.output}")
    else:
        print(output_text)


if __name__ == "__main__":
    main()


Protocol Deviation Classifier

Skill

Determine whether an incident in a clinical trial is a "major deviation".

---
name: protocol-deviation-classifier
description: Determine whether an incident in a clinical trial is a "major deviation".
license: MIT
skill-author: AIPOCH
---
# Protocol Deviation Classifier

A clinical trial protocol deviation classification tool. Based on GCP and ICH E6 guidelines, it automatically determines whether a deviation is a "major deviation" or a "minor deviation".

## When to Use

- Use this skill when the task is to determine whether an incident in a clinical trial is a "major deviation".
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

See `## Features` below for related details.

- Scope-focused workflow aligned to: determining whether an incident in a clinical trial is a "major deviation".
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

- Python 3.8+
- No third-party dependencies (pure Python standard library implementation)

## Example Usage

See `## Usage` below for related details.

```bash
cd "20260318/scientific-skills/Data Analytics/protocol-deviation-classifier"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
python scripts/main.py --input "Audit validation sample with explicit symptoms, history, assessment, and next-step plan." --format json
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Features

- **Automatic Classification**: Automatically determines severity based on deviation description
- **Risk Assessment**: Assesses impact on subject safety, data integrity, and scientific validity
- **Regulatory Basis**: Classification basis complies with GCP, ICH E6, and FDA/EMA guidelines
- **Report Generation**: Generates deviation classification reports that meet regulatory requirements
- **Chinese Support**: Full support for Chinese clinical trial scenarios

## Deviation Classification Standards

### Major/Critical Deviation

Deviations that may affect trial data integrity, subject safety, or trial scientific validity:

| Category | Examples |
|------|------|
| Informed Consent | Performing research procedures without informed consent, using expired/incorrect informed consent forms |
| Inclusion/Exclusion Criteria | Enrolling subjects who don't meet inclusion criteria, enrolling subjects who meet exclusion criteria |
| Investigational Product | Overdose administration, contraindicated concomitant medication, incorrect route of administration, randomization error |
| Safety | Not performing safety monitoring as required by protocol, missing SAE/SUSAR reports, delayed reporting |
| Blinding | Unblinding by unauthorized personnel, unrecorded emergency unblinding procedures |
| Data Integrity | Falsifying/fabricating data, systematic missing of critical data |
| Prohibited Operations | Violating key operational procedures of trial protocol, not performing key efficacy assessments |

### Minor Deviation

Deviations unlikely to affect trial data integrity, subject safety, or trial scientific validity:

| Category | Examples |
|------|------|
| Visit Window | Slightly exceeding visit time window (e.g., within a few days), delay of non-critical visits |
| Sample Collection | Minor timing deviations in non-critical sample collection, slight delays in sample processing |
| Questionnaire Completion | Quality of life questionnaires/diary cards submitted a few days late |
| Data Recording | Delays in non-critical data recording, spelling/formatting errors |
| Procedure Execution | Adjustment of secondary procedure execution order, omission of non-critical assessments (e.g., height measurement) |
| Documentation | Delays in source document signatures, missing secondary documents (e.g., non-critical examination reports) |

## Usage

### Python API

```python
from scripts.main import DeviationClassifier

# Initialize classifier
classifier = DeviationClassifier()

# Classify single deviation
result = classifier.classify(
    description="Subject visit delayed by 2 days",
    deviation_type="Visit Window"
)
print(result.classification)  # "Minor Deviation"
print(result.confidence)      # 0.92
print(result.rationale)       # Classification rationale explanation

# Batch classification
deviations = [
    {"description": "Blood sample collected without informed consent", "type": "Informed Consent"},
    {"description": "Quality of life questionnaire submitted 3 days late", "type": "Data Collection"}
]
batch_results = classifier.classify_batch(deviations)

# Generate report
report = classifier.generate_report(batch_results)
```

### CLI Usage

```bash
# Classify single deviation
python scripts/main.py classify --description "Subject visit delayed by 2 days" --type "Visit Window"

# Batch classification from file
python scripts/main.py batch --input deviations.json --output report.json

# Interactive classification
python scripts/main.py interactive

# Assess deviation impact
python scripts/main.py assess \
  --description "Subject accidentally took double dose of investigational drug" \
  --safety-impact high \
  --data-impact medium \
  --scientific-impact medium
```

### Input Format

**JSON Input File Format:**

```json
[
  {
    "id": "DEV-001",
    "description": "Subject visit delayed by 2 days",
    "type": "Visit Window",
    "occurrence_date": "2024-01-15",
    "severity_factors": {
      "safety_impact": "none",
      "data_impact": "low",
      "scientific_impact": "low"
    }
  },
  {
    "id": "DEV-002",
    "description": "Blood collection performed without informed consent",
    "type": "Informed Consent",
    "severity_factors": {
      "safety_impact": "high",
      "data_impact": "high",
      "scientific_impact": "high"
    }
  }
]
```
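
As a hedged illustration, an input file in this format can be checked before it is classified. The sketch below is not taken from `scripts/main.py`. The function name `validate_records`, the set of required fields, and the treatment of `occurrence_date` and `severity_factors` as optional are assumptions based only on the example above.

```python
import json

# Assumptions for illustration: these names and rules are hypothetical.
ALLOWED_LEVELS = {"none", "low", "medium", "high"}
REQUIRED_FIELDS = ("id", "description", "type")
IMPACT_KEYS = ("safety_impact", "data_impact", "scientific_impact")


def validate_records(raw: str) -> list:
    """Parse a JSON array of deviation records and return a list of problems."""
    records = json.loads(raw)
    if not isinstance(records, list):
        return ["top-level JSON value must be an array"]

    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {index}: expected an object")
            continue
        label = record.get("id", f"record {index}")
        # Required fields must be present and non-empty
        for field in REQUIRED_FIELDS:
            if not record.get(field):
                problems.append(f"{label}: missing '{field}'")
        # Impact levels, when given, must use one of the four known values
        factors = record.get("severity_factors") or {}
        if not isinstance(factors, dict):
            problems.append(f"{label}: 'severity_factors' must be an object")
            continue
        for key in IMPACT_KEYS:
            level = factors.get(key)
            if level is not None and level not in ALLOWED_LEVELS:
                problems.append(f"{label}: '{key}' has unknown level '{level}'")
    return problems


sample = json.dumps([{
    "id": "DEV-001",
    "description": "Subject visit delayed by 2 days",
    "type": "Visit Window",
    "severity_factors": {"safety_impact": "none", "data_impact": "minor"},
}])
print(validate_records(sample))
```

Returning a list of problems rather than raising on the first one lets a caller report every missing field in a single clarification request.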

### Output Format

**Classification Result:**

```json
{
  "id": "DEV-001",
  "classification": "Minor Deviation",
  "classification_en": "Minor Deviation",
  "confidence": 0.92,
  "rationale": "Visit time window slightly delayed (2 days), does not affect subject safety, data integrity, or trial scientific validity.",
  "risk_factors": {
    "safety_risk": "none",
    "data_integrity_risk": "low",
    "scientific_validity_risk": "none"
  },
  "regulatory_basis": [
    "ICH E6(R2) Section 4.5",
    "GCP Section 6.4.4"
  ],
  "recommended_actions": [
    "Document in file",
    "Track trends"
  ]
}
```
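
A batch of results in this shape could be condensed into a short summary for a report. The following is a minimal sketch under stated assumptions: `summarize_results` is a hypothetical helper, not the `generate_report` method of the packaged script, and it relies only on the `id` and `classification` fields shown above.

```python
from collections import Counter


def summarize_results(results: list) -> dict:
    """Count results per classification and collect the IDs of major deviations."""
    counts = Counter(r["classification"] for r in results)
    major_ids = [r["id"] for r in results if r["classification"] == "Major Deviation"]
    return {"total": len(results), "counts": dict(counts), "major_ids": major_ids}


# Illustrative records only; other fields of the result object are omitted
results = [
    {"id": "DEV-001", "classification": "Minor Deviation"},
    {"id": "DEV-002", "classification": "Major Deviation"},
]
print(summarize_results(results))
```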

## Classification Algorithm

Classification is based on the following assessment dimensions:

1. **Subject Safety Impact**
   - None: No impact
   - Low: Minor impact
   - Medium: Moderate impact
   - High: Serious impact

2. **Data Integrity Impact**
   - None: No impact
   - Low: Minor impact on non-critical data
   - Medium: Partial impact on critical data
   - High: Serious damage to critical data

3. **Trial Scientific Validity Impact**
   - None: No impact
   - Low: Minor impact on statistical power
   - Medium: May affect primary endpoint
   - High: Seriously affects trial conclusion

**Classification Rules:**
- Any dimension is High → Major Deviation
- Safety is Medium and either Data Integrity or Scientific Validity is Medium or higher → Major Deviation
- All other cases → Minor Deviation
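
The three rules can be sketched as a small function. This is an illustration of the rules as written here, not the code in `scripts/main.py`. The name `classify_by_impact` and the numeric ordering in `LEVELS` are assumptions.

```python
# Assumed ordering of the four impact levels (hypothetical helper table)
LEVELS = {"none": 0, "low": 1, "medium": 2, "high": 3}


def classify_by_impact(safety: str, data: str, scientific: str) -> str:
    """Return "Major Deviation" or "Minor Deviation" for three impact levels."""
    s, d, v = (LEVELS[level.lower()] for level in (safety, data, scientific))
    # Rule 1: any dimension rated High
    if max(s, d, v) == LEVELS["high"]:
        return "Major Deviation"
    # Rule 2: Safety is Medium and Data or Scientific is Medium or higher
    if s == LEVELS["medium"] and max(d, v) >= LEVELS["medium"]:
        return "Major Deviation"
    # Rule 3: everything else
    return "Minor Deviation"


print(classify_by_impact("none", "low", "low"))       # Minor Deviation
print(classify_by_impact("medium", "medium", "low"))  # Major Deviation
print(classify_by_impact("low", "medium", "medium"))  # Minor Deviation
```

The last call shows a consequence of the rules as stated: Medium data and scientific impact alone do not produce a major classification unless safety impact is also Medium or any dimension reaches High.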

## Regulatory Basis

- ICH E6(R2) Good Clinical Practice Guideline
- ICH E6(R3) Good Clinical Practice Guideline (Draft)
- FDA 21 CFR Part 312 (IND Regulations)
- FDA Guidance for Industry: Oversight of Clinical Investigations
- EMA Reflection Paper on Risk Based Quality Management
- NMPA Good Clinical Practice for Drug Clinical Trials

## Notes

1. This tool provides classification recommendations only; the final determination must be confirmed by clinical quality assurance personnel.
2. Serious/critical deviations must be reported to the sponsor and the ethics committee immediately.
3. Regularly review deviation trends and implement CAPA (Corrective and Preventive Actions).
4. Classification standards may vary by regulatory agency, trial type, and protocol requirements.

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited

## Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.

## Input Validation

This skill accepts requests that match the documented purpose of `protocol-deviation-classifier` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `protocol-deviation-classifier` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
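
The fixed structure above can be sketched as a small renderer. The section titles come from the list; the function name `render_response` and the "Not stated" placeholder are assumptions made for illustration.

```python
from typing import Dict

# Section titles taken from the response template, in order.
SECTIONS = [
    "Objective",
    "Inputs Received",
    "Assumptions",
    "Workflow",
    "Deliverable",
    "Risks and Limits",
    "Next Checks",
]


def render_response(content: Dict[str, str]) -> str:
    """Render all seven sections, marking any that were not supplied."""
    lines = []
    for number, title in enumerate(SECTIONS, start=1):
        body = content.get(title, "Not stated")
        lines.append(f"{number}. {title}: {body}")
    return "\n".join(lines)
```

Keeping missing sections visible, rather than dropping them, makes it obvious when assumptions or limits were never stated.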

## Inputs to Collect

- Required inputs: the user goal, the primary data or source file, and the requested output format.
- Optional inputs: output directory, formatting preferences, and validation constraints.
- If a required input is unavailable, return a short clarification request before continuing.
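
A minimal sketch of the required-input check, assuming the request arrives as a dictionary. The field names `goal`, `source`, and `output_format` are hypothetical labels for the three required inputs listed above.

```python
from typing import Dict, List, Optional

# Hypothetical field names for the three required inputs.
REQUIRED_INPUTS = ("goal", "source", "output_format")


def missing_inputs(request: Dict[str, object]) -> List[str]:
    """Return the required fields that are absent or blank."""
    return [
        name
        for name in REQUIRED_INPUTS
        if not str(request.get(name) or "").strip()
    ]


def clarification_request(request: Dict[str, object]) -> Optional[str]:
    """Build a short clarification message, or None when nothing is missing."""
    missing = missing_inputs(request)
    if not missing:
        return None
    return "Missing required input(s): " + ", ".join(missing) + ". Please provide them to continue."
```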

## Output Contract

- Return a short summary, the main deliverables, and any assumptions that materially affect interpretation.
- If execution is partial, label what succeeded, what failed, and the next safe recovery step.
- Keep the final answer within the documented scope of the skill.

## Validation and Safety Rules

- Validate identifiers, file paths, and user-provided parameters before execution.
- Do not fabricate results, metrics, citations, or downstream conclusions.
- Use safe fallback behavior when dependencies, credentials, or required inputs are missing.
- Surface any execution failure with a concise diagnosis and recovery path.
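
As an illustration of validating file paths and parameters before execution, the sketch below checks a batch input file before it is handed to the script. The function name `validate_batch_input` and the specific checks are assumptions; the only grounded detail is that each event carries a `description` field.

```python
import json
from pathlib import Path
from typing import Dict, List


def validate_batch_input(path_str: str) -> List[Dict]:
    """Load a batch file and fail early with a specific message."""
    path = Path(path_str)
    if path.suffix.lower() != ".json":
        raise ValueError(f"Expected a .json file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")
    with path.open(encoding="utf-8") as handle:
        events = json.load(handle)  # json.JSONDecodeError is a ValueError
    if not isinstance(events, list):
        raise ValueError("Expected a JSON array of deviation events")
    for index, event in enumerate(events):
        if not isinstance(event, dict):
            raise ValueError(f"Event {index} is not a JSON object")
        if not str(event.get("description", "")).strip():
            raise ValueError(f"Event {index} has no description")
    return events
```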

FILE:references/runtime_checklist.md
# Runtime Checklist

- Category: `Data Analysis`
- Validate the user goal, required inputs, and output format before taking action.
- Ask a targeted clarification question when a required input is missing.
- Keep the response scoped to the documented workflow and state assumptions explicitly.
- Run a non-destructive smoke check before any file-dependent or data-dependent command.
- Recommended smoke check: `python -m py_compile scripts/main.py`
- If execution fails, stop and return a concise recovery path instead of fabricating results.
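
The smoke check can also be done from Python. This sketch uses the built-in `compile()` rather than `py_compile`, an assumed substitution chosen because it checks syntax without writing a `.pyc` file; the function name `smoke_check` is hypothetical.

```python
from pathlib import Path
from typing import Tuple


def smoke_check(path: str) -> Tuple[bool, str]:
    """Check that a Python file parses, without executing or caching it."""
    try:
        source = Path(path).read_text(encoding="utf-8")
        compile(source, str(path), "exec")
    except (OSError, SyntaxError, ValueError) as exc:
        # Report the exception type so the recovery path can be specific.
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"
```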

FILE:requirements.txt
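# No third-party packages are required.
# dataclasses and enum are part of the Python standard library (3.7+) and must not be installed from PyPI.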

FILE:scripts/main.py
#!/usr/bin/env python3
"""Protocol Deviation Classifier

Clinical trial protocol deviation classification tool.

Based on the GCP/ICH E6 guidelines, it classifies each deviation as a critical, major, or minor deviation.
Techniques: risk-based quality management, GCP compliance assessment, deviation classification.
"""

import argparse
import json
import sys
import re
from enum import Enum
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional, Any, Tuple
from datetime import datetime


class Classification(Enum):
    """Deviation classification enumeration"""
    MAJOR = "major"
    MINOR = "minor"
    CRITICAL = "critical"
    
    def __str__(self):
        mapping = {
            Classification.MAJOR: "major deviation",
            Classification.MINOR: "minor deviation",
            Classification.CRITICAL: "critical deviation"
        }
        return mapping.get(self, self.value)
    
    @property
    def en_name(self):
        mapping = {
            Classification.MAJOR: "Major Deviation",
            Classification.MINOR: "Minor Deviation",
            Classification.CRITICAL: "Critical Deviation"
        }
        return mapping.get(self, self.value.title())


class RiskLevel(Enum):
    """Risk level enumeration"""
    NONE = "none"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    
    def __str__(self):
        mapping = {
            RiskLevel.NONE: "none",
            RiskLevel.LOW: "low",
            RiskLevel.MEDIUM: "medium",
            RiskLevel.HIGH: "high"
        }
        return mapping.get(self, self.value)
    
    @property
    def score(self):
        """Returns the risk score used in the calculation"""
        scores = {
            RiskLevel.NONE: 0,
            RiskLevel.LOW: 1,
            RiskLevel.MEDIUM: 2,
            RiskLevel.HIGH: 3
        }
        return scores.get(self, 0)


@dataclass
class DeviationEvent:
    """Protocol deviation event data class"""
    id: str = ""
    description: str = ""
    deviation_type: str = ""
    occurrence_date: Optional[str] = None
    site_id: Optional[str] = None
    subject_id: Optional[str] = None
    safety_impact: Optional[RiskLevel] = None
    data_impact: Optional[RiskLevel] = None
    scientific_impact: Optional[RiskLevel] = None
    
    @classmethod
    def from_dict(cls, data: Dict) -> 'DeviationEvent':
        """Create an event from a dictionary.
        
        Impact levels that are missing or unrecognized in `severity_factors`
        stay None so that DeviationClassifier.classify() can infer them from
        the description instead of silently treating them as "none".
        """
        def parse_risk(value) -> Optional[RiskLevel]:
            if isinstance(value, str):
                try:
                    return RiskLevel(value.strip().lower())
                except ValueError:
                    return None
            return None
        
        factors = data.get('severity_factors') or {}
        return cls(
            id=data.get('id', ''),
            description=data.get('description', ''),
            deviation_type=data.get('type', data.get('deviation_type', '')),
            occurrence_date=data.get('occurrence_date'),
            site_id=data.get('site_id'),
            subject_id=data.get('subject_id'),
            safety_impact=parse_risk(factors.get('safety_impact')),
            data_impact=parse_risk(factors.get('data_impact')),
            scientific_impact=parse_risk(factors.get('scientific_impact'))
        )


@dataclass
class ClassificationResult:
    """Classification result data class"""
    id: str
    classification: Classification
    confidence: float
    rationale: str
    safety_risk: RiskLevel
    data_integrity_risk: RiskLevel
    scientific_validity_risk: RiskLevel
    risk_score: int
    regulatory_basis: List[str] = field(default_factory=list)
    recommended_actions: List[str] = field(default_factory=list)
    key_indicators: List[str] = field(default_factory=list)
    
    def to_dict(self) -> Dict:
        """Convert to dictionary"""
        return {
            "id": self.id,
            "classification": str(self.classification),
            "classification_en": self.classification.en_name,
            "confidence": round(self.confidence, 2),
            "rationale": self.rationale,
            "risk_factors": {
                "safety_risk": str(self.safety_risk),
                "data_integrity_risk": str(self.data_integrity_risk),
                "scientific_validity_risk": str(self.scientific_validity_risk)
            },
            "risk_score": self.risk_score,
            "regulatory_basis": self.regulatory_basis,
            "recommended_actions": self.recommended_actions,
            "key_indicators": self.key_indicators
        }


class DeviationClassifier:
    """Protocol deviation classifier
    
    Based on GCP/ICH E6 guidelines, automatically determines the severity of clinical trial deviations."""
    
    # Major deviation keyword patterns (regular expressions, compiled case-insensitively)
    MAJOR_INDICATORS = {
        'informed_consent': [
            r'informed consent', r'Not obtained.*Consent', r'consent form',
            r'not signed', r'consent', r'signed.*consent'
        ],
        'eligibility': [
            r'Inclusion criteria', r'Exclusion criteria', r'Not in compliance.*Enter the group', r'eligibility',
            r'inclusion.*criteria', r'exclusion.*criteria', r'ineligible'
        ],
        'dosing': [
            r'overdose', r'double dose', r'excess', r'Error.*Dose',
            r'wrong dose', r'dosing error'
        ],
        'concomitant': [
            r'Concomitant medication', r'Contraindications.* Medication', r'concomitant', r'prohibited medication',
            r'forbidden drug'
        ],
        'randomization': [
            r'Randomize.*Error', r'Wrong.*Random', r'randomization error',
            r'wrong randomization'
        ],
        'safety_reporting': [
            r'SAE', r'SUSAR', r'Safety.*Report', r'False negative', r'Delayed reporting',
            r'serious adverse event'
        ],
        'blinding': [
            r'Break the blindness', r'unblind', r'Breaking the blindness.*Not yet', r'Unauthorized.*Break the blind'
        ],
        'data_integrity': [
            r'forgery', r'tamper', r'data.*false', r'falsified',
            r'fabricated', r'Data fraud'
        ],
        'critical_procedures': [
            r'Key.* not executed', r'Not.*Key', r'Missing.*Primary endpoint',
            r'primary endpoint.*missed', r'critical procedure'
        ]
    }
    
    # Minor deviation keyword patterns (duplicate and subsumed patterns removed)
    MINOR_INDICATORS = {
        'visit_window': [
            r'Visit.*Delay', r'Visit.* in advance', r'visit.*window',
            r'visit.*[12].*day'
        ],
        'sample_collection': [
            r'sample.*delay', r'sampling.*time', r'non-critical.*sample'
        ],
        'questionnaire': [
            r'questionnaire', r'diary card', r'QOL', r'quality of life'
        ],
        'documentation': [
            r'signature.*delay', r'Documentation.* is missing', r'document.*missing',
            r'record.*delay'
        ],
        'non_critical': [
            r'non-critical', r'secondary', r'slight', r'minor'
        ]
    }
    
    # regulatory basis
    REGULATORY_BASIS = {
        'major': [
            "ICH E6(R2) Section 4.5 - Subject Safety",
            "ICH E6(R2) Section 4.9 - Informed Consent",
            "GCP Section 6.2 - Subject Rights",
            "FDA 21 CFR Part 312.60 - General Investigator Obligations"
        ],
        'minor': [
            "ICH E6(R2) Section 4.6 - Investigational Product",
            "ICH E6(R2) Section 5.1 - Trial Management",
            "GCP Section 6.4.4 - Protocol Compliance"
        ],
        'critical': [
            "ICH E6(R2) Section 2.13 - Data Integrity",
            "ICH E6(R2) Section 4.1.5 - Fraud Prevention",
            "FDA 21 CFR Part 312.70 - Disqualification of Investigators"
        ]
    }
    
    def __init__(self):
        self._compile_patterns()
    
    def _compile_patterns(self):
        """Compile regular expression pattern"""
        self.major_patterns = {
            category: [re.compile(p, re.IGNORECASE) for p in patterns]
            for category, patterns in self.MAJOR_INDICATORS.items()
        }
        self.minor_patterns = {
            category: [re.compile(p, re.IGNORECASE) for p in patterns]
            for category, patterns in self.MINOR_INDICATORS.items()
        }
    
    def classify(
        self,
        description: str,
        deviation_type: str = "",
        event_id: str = "",
        safety_impact: Optional[RiskLevel] = None,
        data_impact: Optional[RiskLevel] = None,
        scientific_impact: Optional[RiskLevel] = None
    ) -> ClassificationResult:
        """Classify individual deviation events
        
        Args:
            description: Deviation description
            deviation_type: deviation type
            event_id: event ID
            safety_impact: Safety impact level (if known)
            data_impact: Data integrity impact level (if known)
            scientific_impact: scientific impact level (if known)
        
        Returns:
            ClassificationResult: classification result"""
        # If no impact level is provided, it will be automatically determined based on the description.
        if safety_impact is None:
            safety_impact = self._assess_safety_impact(description, deviation_type)
        if data_impact is None:
            data_impact = self._assess_data_impact(description, deviation_type)
        if scientific_impact is None:
            scientific_impact = self._assess_scientific_impact(description, deviation_type)
        
        # Calculate risk score
        risk_score = (
            safety_impact.score * 3 +
            data_impact.score * 2 +
            scientific_impact.score * 2
        )
        
        # Apply classification rules
        classification, confidence = self._apply_classification_rules(
            safety_impact, data_impact, scientific_impact, risk_score, description
        )
        
        # Generate classification reasons
        rationale = self._generate_rationale(
            classification, safety_impact, data_impact, scientific_impact, description
        )
        
        # Get key metrics
        key_indicators = self._extract_key_indicators(description, deviation_type)
        
        # Obtain regulatory basis
        regulatory_basis = self._get_regulatory_basis(classification, description)
        
        # Generate recommended actions
        recommended_actions = self._get_recommended_actions(classification)
        
        return ClassificationResult(
            id=event_id or self._generate_event_id(),
            classification=classification,
            confidence=confidence,
            rationale=rationale,
            safety_risk=safety_impact,
            data_integrity_risk=data_impact,
            scientific_validity_risk=scientific_impact,
            risk_score=risk_score,
            regulatory_basis=regulatory_basis,
            recommended_actions=recommended_actions,
            key_indicators=key_indicators
        )
    
    def classify_batch(self, events: List[Dict]) -> List[ClassificationResult]:
        """Classify a batch of deviation events
        
        Args:
            events: Dictionary list of deviation events
        
        Returns:
            List[ClassificationResult]: Classification result list"""
        results = []
        for event_data in events:
            event = DeviationEvent.from_dict(event_data)
            result = self.classify(
                description=event.description,
                deviation_type=event.deviation_type,
                event_id=event.id,
                safety_impact=event.safety_impact,
                data_impact=event.data_impact,
                scientific_impact=event.scientific_impact
            )
            results.append(result)
        return results
    
    def _assess_safety_impact(self, description: str, deviation_type: str) -> RiskLevel:
        """Assess the impact on subject safety"""
        text = f"{description} {deviation_type}".lower()
        
        # High safety impact keywords
        # (word boundaries keep 'die' from matching inside words such as 'studies')
        high_risk = [
            r'overdose', r'double dose', r'contraindications', r'allergic reaction',
            r'serious adverse events', r'sae', r'\bdie[ds]?\b', r'life-threatening',
            r'hospitalized', r'hospitalization', r'death'
        ]
        
        # Medium safety impact keywords
        medium_risk = [
            r'adverse reactions', r'side effect', r'adverse event', r'concomitant',
            r'drug interaction', r'dosage adjustment', r'dose adjustment'
        ]
        
        for pattern in high_risk:
            if re.search(pattern, text, re.IGNORECASE):
                return RiskLevel.HIGH
        
        for pattern in medium_risk:
            if re.search(pattern, text, re.IGNORECASE):
                return RiskLevel.MEDIUM
        
        # Informed consent issues count as high safety impact even when no specific harm is described
        if re.search('informed consent|consent', text, re.IGNORECASE):
            return RiskLevel.HIGH
        
        return RiskLevel.NONE
    
    def _assess_data_impact(self, description: str, deviation_type: str) -> RiskLevel:
        """Assess the impact on data integrity"""
        text = f"{description} {deviation_type}".lower()
        
        # High data impact
        high_patterns = [
            'forgery|tampering|false|falsif|fabricat',
            'data.*lost',
            'critical.*data.*missing'
        ]
        
        # Medium data impact
        medium_patterns = [
            'primary endpoint',
            'critical visit',
            'not.*assess'
        ]
        
        for pattern in high_patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return RiskLevel.HIGH
        
        for pattern in medium_patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return RiskLevel.MEDIUM
        
        return RiskLevel.LOW
    
    def _assess_scientific_impact(self, description: str, deviation_type: str) -> RiskLevel:
        """Assessing the impact on the scientific validity of the trial"""
        text = f"{description} {deviation_type}".lower()
        
        # Check whether it affects the primary endpoint or randomization
        high_patterns = [
            'randomize.*error|error.*random|randomiz',
            'primary endpoint',
            'Enter the group.*Error|Wrong.*Enter the group|Inconsistent.*Enter the group|ineligible'
        ]
        
        # medium impact
        medium_patterns = [
            'Visit.*missing|missed visit',
            'Efficacy.*Assessment|efficacy assessment',
            'Break the blind|unblind'
        ]
        
        for pattern in high_patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return RiskLevel.HIGH
        
        for pattern in medium_patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return RiskLevel.MEDIUM
        
        return RiskLevel.NONE
    
    def _apply_classification_rules(
        self,
        safety_impact: RiskLevel,
        data_impact: RiskLevel,
        scientific_impact: RiskLevel,
        risk_score: int,
        description: str
    ) -> Tuple[Classification, float]:
        """Apply classification rules.
        
        Rules, checked in order:
        - Description mentions data falsification → critical deviation
        - Safety impact is High → major deviation
        - Data or scientific impact is High → major deviation
        - Safety impact is Medium and data or scientific impact is Medium or higher → major deviation
        - Informed consent issue not described as non-critical, minor, or a delay → major deviation
        - All other cases → minor deviation"""
        text = description.lower()
        
        # Check whether it is a critical deviation (data fraud)
        if re.search('forgery|tampering|false|falsif|fabricat', text):
            return Classification.CRITICAL, 0.98
        
        # Check for major deviations
        if safety_impact == RiskLevel.HIGH:
            return Classification.MAJOR, 0.95
        
        if data_impact == RiskLevel.HIGH or scientific_impact == RiskLevel.HIGH:
            return Classification.MAJOR, 0.90
        
        if safety_impact == RiskLevel.MEDIUM and (
            data_impact.score >= 2 or scientific_impact.score >= 2
        ):
            return Classification.MAJOR, 0.85
        
        # Check issues related to informed consent
        if re.search('Informed consent|Not obtained.*Consent|consent', text, re.IGNORECASE):
            if not re.search('non-critical|minor|delay', text, re.IGNORECASE):
                return Classification.MAJOR, 0.92
        
        # Minor deviations in other cases
        if risk_score <= 4:
            confidence = 0.90 - (risk_score * 0.05)
        else:
            confidence = 0.70
        
        return Classification.MINOR, max(0.65, confidence)
    
    def _generate_rationale(
        self,
        classification: Classification,
        safety_impact: RiskLevel,
        data_impact: RiskLevel,
        scientific_impact: RiskLevel,
        description: str
    ) -> str:
        """Generate classification reasons"""
        reasons = []
        
        if classification == Classification.CRITICAL:
            reasons.append("Involves data falsification or tampering, which seriously undermines the credibility of the trial data.")
        elif classification == Classification.MAJOR:
            reasons.append("This deviation has the following high-risk characteristics:")
            if safety_impact == RiskLevel.HIGH:
                reasons.append("- Serious impact on subject safety")
            if data_impact == RiskLevel.HIGH:
                reasons.append("- Serious damage to data integrity")
            if scientific_impact == RiskLevel.HIGH:
                reasons.append("- Serious damage to the scientific validity of the trial")
            if safety_impact == RiskLevel.MEDIUM:
                reasons.append("- Moderate impact on subject safety")
            
            # Check informed consent
            if re.search('informed consent|consent', description, re.IGNORECASE):
                reasons.append("- Involves a violation of informed consent procedures")
        else:
            reasons.append("This deviation has the following characteristics:")
            if safety_impact == RiskLevel.NONE:
                reasons.append("- Does not affect subject safety")
            if data_impact == RiskLevel.LOW:
                reasons.append("- Minor impact on data integrity")
            if scientific_impact == RiskLevel.NONE:
                reasons.append("- Does not affect the scientific validity of the trial")
            
            # Check for minor time delays
            if re.search('delay|postpone', description, re.IGNORECASE):
                reasons.append("- Procedural delay only; the core elements of the trial are unaffected")
        
        return "\n".join(reasons)
    
    def _extract_key_indicators(self, description: str, deviation_type: str) -> List[str]:
        """Extract key indicators"""
        indicators = []
        text = f"{description} {deviation_type}".lower()
        
        # Check for major deviation indicators
        for category, patterns in self.major_patterns.items():
            for pattern in patterns:
                if pattern.search(text):
                    indicator_map = {
                        'informed_consent': 'Informed consent issues',
                        'eligibility': 'Inclusion/exclusion criteria violations',
                        'dosing': 'Dosing/administration issues',
                        'concomitant': 'Concomitant medication violations',
                        'randomization': 'Randomization issues',
                        'safety_reporting': 'Safety reporting issues',
                        'blinding': 'Blinding violations',
                        'data_integrity': 'Data integrity issues',
                        'critical_procedures': 'Missed critical procedures'
                    }
                    ind = indicator_map.get(category, category)
                    if ind not in indicators:
                        indicators.append(ind)
                    break
        
        # Check for minor deviation indicators
        for category, patterns in self.minor_patterns.items():
            for pattern in patterns:
                if pattern.search(text):
                    indicator_map = {
                        'visit_window': 'Visit window deviation',
                        'sample_collection': 'Sample collection deviation',
                        'questionnaire': 'Questionnaire/diary card deviation',
                        'documentation': 'Documentation/signature delay',
                        'non_critical': 'Non-critical procedural deviation'
                    }
                    ind = indicator_map.get(category, category)
                    if ind not in indicators:
                        indicators.append(ind)
                    break
        
        return indicators[:5]  # Returns up to 5 indicators
    
    def _get_regulatory_basis(self, classification: Classification, description: str) -> List[str]:
        """Obtain regulatory basis"""
        basis = []
        text = description.lower()
        
        if classification == Classification.CRITICAL:
            basis = self.REGULATORY_BASIS['critical'].copy()
        elif classification == Classification.MAJOR:
            basis = self.REGULATORY_BASIS['major'].copy()
            
            # Add specific regulations based on description
            if re.search('informed consent|consent', text, re.IGNORECASE):
                basis.append("ICH E6(R2) Section 4.8 - Informed Consent Requirements")
            if re.search('Randomization|randomiz', text, re.IGNORECASE):
                basis.append("ICH E9 - Statistical Principles for Clinical Trials")
        else:
            basis = self.REGULATORY_BASIS['minor'].copy()
        
        return basis
    
    def _get_recommended_actions(self, classification: Classification) -> List[str]:
        """Get recommended actions"""
        if classification == Classification.CRITICAL:
            return [
                "Notify the sponsor and ethics committee immediately",
                "Initiate a root cause investigation",
                "Implement corrective and preventive actions (CAPA)",
                "Consider whether investigator disqualification is warranted",
                "Assess the impact on overall trial data"
            ]
        elif classification == Classification.MAJOR:
            return [
                "Record in the deviation log",
                "Report to the sponsor within 24 hours",
                "Report to the ethics committee (if required by the protocol)",
                "Assess whether remedial action is needed",
                "Track trends and assess whether there is a systemic issue"
            ]
        else:
            return [
                "Record in the deviation log",
                "Monitor trends",
                "Resolve at the site level",
                "Include in periodic summary reporting (e.g. quarterly)"
            ]
    
    def _generate_event_id(self) -> str:
        """Generate event ID"""
        timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
        return f"DEV-{timestamp}"
    
    def generate_report(self, results: List[ClassificationResult]) -> Dict:
        """Generate deviation classification report
        
        Args:
            results: list of classified results
        
        Returns:
            Dict: report dictionary"""
        if not results:
            return {"error": "No results to report"}
        
        total = len(results)
        major_count = sum(1 for r in results if r.classification == Classification.MAJOR)
        minor_count = sum(1 for r in results if r.classification == Classification.MINOR)
        critical_count = sum(1 for r in results if r.classification == Classification.CRITICAL)
        
        # Summarize various types of deviations
        by_category = {}
        for result in results:
            for indicator in result.key_indicators:
                by_category[indicator] = by_category.get(indicator, 0) + 1
        
        return {
            "report_date": datetime.now().isoformat(),
            "summary": {
                "total_deviations": total,
                "critical_count": critical_count,
                "major_count": major_count,
                "minor_count": minor_count,
                "major_rate": round(major_count / total * 100, 1) if total > 0 else 0,
                "minor_rate": round(minor_count / total * 100, 1) if total > 0 else 0
            },
            "category_breakdown": by_category,
            "critical_deviations": [
                r.to_dict() for r in results 
                if r.classification == Classification.CRITICAL
            ],
            "major_deviations": [
                r.to_dict() for r in results 
                if r.classification == Classification.MAJOR
            ],
            "minor_deviations": [
                r.to_dict() for r in results 
                if r.classification == Classification.MINOR
            ],
            "recommendations": self._generate_summary_recommendations(results)
        }
    
    def _generate_summary_recommendations(self, results: List[ClassificationResult]) -> List[str]:
        """Generate summary recommendations"""
        recommendations = []
        
        critical_count = sum(1 for r in results if r.classification == Classification.CRITICAL)
        major_count = sum(1 for r in results if r.classification == Classification.MAJOR)
        total = len(results)
        
        if critical_count > 0:
            recommendations.append(
                f"⚠️ Found {critical_count} critical deviation(s); immediate root cause analysis is recommended"
            )
        
        if total > 0:
            major_rate = major_count / total * 100
            if major_rate > 20:
                recommendations.append(
                    f"Major deviation rate ({major_rate:.1f}%) is high; strengthening site training is recommended"
                )
            elif major_rate > 10:
                recommendations.append(
                    f"Major deviation rate ({major_rate:.1f}%) is moderate; close monitoring is recommended"
                )
        
        # Check trends
        safety_issues = sum(
            1 for r in results 
            if r.safety_risk in [RiskLevel.HIGH, RiskLevel.MEDIUM]
        )
        if safety_issues > 3:
            recommendations.append(
                f"Found {safety_issues} deviations involving safety issues; an evaluation of subject protection measures is recommended"
            )
        
        if not recommendations:
            recommendations.append("Deviation control is generally good; continue current practices.")
        
        return recommendations


def main():
    """CLI entry point"""
    parser = argparse.ArgumentParser(
        description="Protocol Deviation Classifier - clinical trial protocol deviation classification tool",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""Examples:
  # Classify a single deviation
  python scripts/main.py classify -d "Subject visit delayed by 2 days"
  
  # Batch classification from a file
  python scripts/main.py batch -i deviations.json -o report.json
  
  # Interactive classification
  python scripts/main.py interactive"""
    )
    
    subparsers = parser.add_subparsers(dest="command", help="Available commands")
    
    # classify command
    classify_parser = subparsers.add_parser("classify", help="Classify individual deviations")
    classify_parser.add_argument("-d", "--description", required=True, help="Deviation description")
    classify_parser.add_argument("-t", "--type", default="", help="Deviation type")
    classify_parser.add_argument("--id", default="", help="Event ID")
    classify_parser.add_argument("--safety-impact",
                                 choices=["none", "low", "medium", "high"],
                                 default="none", help="Safety impact level")
    classify_parser.add_argument("--data-impact",
                                 choices=["none", "low", "medium", "high"],
                                 default="low", help="Data integrity impact level")
    classify_parser.add_argument("--scientific-impact",
                                 choices=["none", "low", "medium", "high"],
                                 default="none", help="Scientific impact level")
    classify_parser.add_argument("-o", "--output", choices=["json", "table"],
                                 default="table", help="Output format")
    
    # batch command
    batch_parser = subparsers.add_parser("batch", help="Batch classification")
    batch_parser.add_argument("-i", "--input", required=True, help="Input JSON file path")
    batch_parser.add_argument("-o", "--output", default="", help="Output file path")
    batch_parser.add_argument("--format", choices=["json", "report"],
                              default="json", help="Output format")
    
    # report command
    report_parser = subparsers.add_parser("report", help="Generate summary report")
    report_parser.add_argument("-i", "--input", required=True, help="Classification results JSON file")
    
    # interactive command
    subparsers.add_parser("interactive", help="interactive classification")
    
    # demo command
    subparsers.add_parser("demo", help="Run sample classification")
    
    args = parser.parse_args()
    
    if not args.command:
        parser.print_help()
        sys.exit(1)
    
    classifier = DeviationClassifier()
    
    try:
        if args.command == "classify":
            # Analyze impact levels
            safety = RiskLevel(args.safety_impact)
            data = RiskLevel(args.data_impact)
            scientific = RiskLevel(args.scientific_impact)
            
            result = classifier.classify(
                description=args.description,
                deviation_type=args.type,
                event_id=args.id,
                safety_impact=safety,
                data_impact=data,
                scientific_impact=scientific
            )
            
            if args.output == "json":
                print(json.dumps(result.to_dict(), ensure_ascii=False, indent=2))
            else:
                _print_table_result(result)
        
        elif args.command == "batch":
            # Read input file
            with open(args.input, 'r', encoding='utf-8') as f:
                events = json.load(f)
            
            results = classifier.classify_batch(events)
            
            if args.format == "report":
                report = classifier.generate_report(results)
                output = report
            else:
                output = [r.to_dict() for r in results]
            
            if args.output:
                with open(args.output, 'w', encoding='utf-8') as f:
                    json.dump(output, f, ensure_ascii=False, indent=2)
                print(f"Results saved to: {args.output}")
            else:
                print(json.dumps(output, ensure_ascii=False, indent=2))
        
        elif args.command == "report":
            with open(args.input, 'r', encoding='utf-8') as f:
                data = json.load(f)
            
            # Assume that the input is a list of classification results
            if isinstance(data, list):
                # Convert to ClassificationResult object
                results = []
                for item in data:
                    result = ClassificationResult(
                        id=item.get('id', ''),
                        classification=Classification(item.get('classification_en', '').lower().split()[0]),
                        confidence=item.get('confidence', 0),
                        rationale=item.get('rationale', ''),
                        safety_risk=RiskLevel.NONE,
                        data_integrity_risk=RiskLevel.NONE,
                        scientific_validity_risk=RiskLevel.NONE,
                        risk_score=item.get('risk_score', 0),
                        regulatory_basis=item.get('regulatory_basis', []),
                        recommended_actions=item.get('recommended_actions', []),
                        key_indicators=item.get('key_indicators', [])
                    )
                    results.append(result)
                
                report = classifier.generate_report(results)
                print(json.dumps(report, ensure_ascii=False, indent=2))
        
        elif args.command == "interactive":
            _run_interactive(classifier)
        
        elif args.command == "demo":
            _run_demo(classifier)
    
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


def _print_table_result(result: ClassificationResult):
    """Print results in table format"""
    print("\n" + "=" * 60)
    print("Protocol deviation classification result")
    print("=" * 60)
    print(f"{'Event ID:':<20} {result.id}")
    print(f"{'Classification:':<20} {result.classification} ({result.classification.en_name})")
    print(f"{'Confidence:':<20} {result.confidence*100:.1f}%")
    print(f"{'Risk score:':<20} {result.risk_score}")
    print("-" * 60)
    print("Risk assessment:")
    print(f"  - Subject safety: {result.safety_risk}")
    print(f"  - Data integrity: {result.data_integrity_risk}")
    print(f"  - Scientific validity: {result.scientific_validity_risk}")
    print("-" * 60)
    print("Reason for classification:")
    print(result.rationale)
    if result.key_indicators:
        print("-" * 60)
        print("Key indicators:")
        for indicator in result.key_indicators:
            print(f"  • {indicator}")
    print("-" * 60)
    print("Recommended actions:")
    for action in result.recommended_actions:
        print(f"  • {action}")
    print("=" * 60)


def _run_interactive(classifier: DeviationClassifier):
    """Run interactive classification"""
    print("\n" + "=" * 60)
    print("Protocol Deviation Classifier - Interactive Mode")
    print("=" * 60)
    print("Type 'quit' or 'q' to exit")
    
    while True:
        print("\n" + "-" * 40)
        description = input("Please enter a description of the deviation:").strip()
        
        if description.lower() in ['quit', 'q', 'exit']:
            print("goodbye!")
            break
        
        if not description:
            print("Description cannot be empty, please re-enter.")
            continue
        
        deviation_type = input("Please enter deviation type (optional):").strip()
        
        print("Analyzing...")
        result = classifier.classify(
            description=description,
            deviation_type=deviation_type
        )
        
        _print_table_result(result)


def _run_demo(classifier: DeviationClassifier):
    """Run sample classification"""
    demo_cases = [
        {
            "id": "DEV-001",
            "description": "Subject visit will be delayed by 2 days",
            "type": "visit window"
        },
        {
            "id": "DEV-002",
            "description": "Collecting blood samples without obtaining informed consent",
            "type": "informed consent"
        },
        {
            "id": "DEV-003",
            "description": "Subject mistakenly takes double dose of study drug",
            "type": "Medication error"
        },
        {
            "id": "DEV-004",
            "description": "Submission of quality of life questionnaire delayed by 3 days",
            "type": "data collection"
        },
        {
            "id": "DEV-005",
            "description": "Subjects who did not meet the inclusion criteria were enrolled (age exceeded the limit)",
            "type": "Inclusion criteria"
        }
    ]
    
    print("\n" + "=" * 60)
    print("Protocol Deviation Classifier - Demo Run")
    print("=" * 60)
    
    results = []
    for case in demo_cases:
        print(f"\n[Case {case['id']}]")
        print(f"Description: {case['description']}")
        print(f"Type: {case['type']}")
        
        result = classifier.classify(
            description=case['description'],
            deviation_type=case['type'],
            event_id=case['id']
        )
        results.append(result)
        
        print(f"→ Classification: {result.classification} (Confidence: {result.confidence*100:.0f}%)")
        print(f"   Risk score: {result.risk_score}")
    
    # Generate summary report
    print("\n" + "=" * 60)
    print("Summary report")
    print("=" * 60)
    
    report = classifier.generate_report(results)
    summary = report['summary']
    
    print(f"Total deviations: {summary['total_deviations']}")
    print(f"Critical deviations: {summary['critical_count']}")
    print(f"Major deviations: {summary['major_count']} ({summary['major_rate']}%)")
    print(f"Minor deviations: {summary['minor_count']} ({summary['minor_rate']}%)")
    
    print("Recommendations:")
    for rec in report['recommendations']:
        print(f"  • {rec}")


if __name__ == "__main__":
    main()


eCTD XML Compiler

Skill

Automatically convert uploaded drug application documents (Word/PDF) into XML skeleton structure compliant with eCTD 4.0/3.2.2 specifications.

---
name: ectd-xml-compiler
description: Automatically convert uploaded drug application documents (Word/PDF) into XML skeleton structure compliant with eCTD 4.0/3.2.2 specifications.
license: MIT
skill-author: AIPOCH
---
# eCTD XML Compiler

ID: 197

Automatically convert uploaded drug application documents (Word/PDF) into XML skeleton structure compliant with eCTD 4.0/3.2.2 specifications.

## When to Use

- Use this skill when the task is to automatically convert uploaded drug application documents (Word/PDF) into an XML skeleton structure compliant with eCTD 4.0/3.2.2 specifications.
- Use this skill for academic writing tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

- Scope-focused workflow aligned to: Automatically convert uploaded drug application documents (Word/PDF) into XML skeleton structure compliant with eCTD 4.0/3.2.2 specifications.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

```
python-docx>=0.8.11    # Word document parsing
PyPDF2>=3.0.0          # PDF text extraction
lxml>=4.9.0            # XML processing
```

## Example Usage

See `## Usage` below for related details.

```bash
cd "20260318/scientific-skills/Academic Writing/ectd-xml-compiler"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Overview

eCTD (electronic Common Technical Document) is the standard established by ICH for submitting drug registration applications electronically to regulatory agencies such as FDA and EMA.

This tool parses uploaded drug application documents (Word/PDF) and converts them into an XML skeleton structure compliant with eCTD 4.0/3.2.2 specifications.

## eCTD Structure

```
eCTD/
├── m1/  # Module 1: Administrative Information and Prescribing Information (region-specific)
│   ├── m1.xml
│   └── ...
├── m2/  # Module 2: CTD Summaries
│   ├── m2.xml
│   └── ...
├── m3/  # Module 3: Quality
│   ├── m3.xml
│   └── ...
├── m4/  # Module 4: Nonclinical Study Reports
│   ├── m4.xml
│   └── ...
├── m5/  # Module 5: Clinical Study Reports
│   ├── m5.xml
│   └── ...
├── index.xml      # Master index file
├── index-md5.txt  # MD5 checksum file
└── dtd/           # DTD files
```

## Usage

```text
python skills/ectd-xml-compiler/scripts/main.py [options] <input_files...>
```

### Arguments

| Argument | Description |
|----------|-------------|
| `input_files` | Input Word/PDF file paths (supports multiple) |

### Options

| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `--output` | `-o` | Output directory path | `./ectd-output` |
| `--module` | `-m` | Target module (m1-m5, auto) | `auto` |
| `--region` | `-r` | Target region (FDA, EMA, ICH) | `ICH` |
| `--version` | `-v` | eCTD version (3.2.2, 4.0) | `4.0` |
| `--dtd-path` | `-d` | Custom DTD path | Built-in DTD |
| `--validate` |  | Validate generated XML | `False` |

### Examples

```bash
# Basic usage - auto-detect module
python skills/ectd-xml-compiler/scripts/main.py document1.docx document2.pdf

# Specify output directory and module
python skills/ectd-xml-compiler/scripts/main.py -o ./my-ectd -m m3 quality-doc.docx

# FDA submission format
python skills/ectd-xml-compiler/scripts/main.py -r FDA -v 3.2.2 *.pdf

# Validate generated XML
python skills/ectd-xml-compiler/scripts/main.py --validate submission.pdf
```

## Input Document Processing

### Supported Formats
- Microsoft Word (.docx; legacy .doc files pass the extension check, but `python-docx` reads only .docx, so convert them first)
- PDF (.pdf)

### Document Parsing Logic
1. **Heading Recognition**: Extract the heading hierarchy from paragraph style names (Word) or section-numbering patterns (PDF)
2. **TOC Mapping**: Auto-recognize section numbers (e.g., 3.2.S.1.1)
3. **Metadata Extraction**: Extract author, date, version, and other information
4. **Content Classification**: Map to corresponding eCTD modules based on keyword matching
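
The TOC-mapping step can be sketched as follows. This is a minimal illustration, not the packaged implementation: `parse_section` is a hypothetical helper, and it assumes that a CTD section number starts with the number of the module it belongs to (so `3.2.S.1.1` sits in Module 3).

```python
import re

# A section number is a leading integer followed by dot-separated segments,
# where later segments may be letters (e.g. "S" or "P" in Module 3).
SECTION_RE = re.compile(r"^(\d+(?:\.[0-9A-Za-z]+)*)\s+(\S.*)$")


def parse_section(line: str):
    """Split a heading line into section number, title, depth, and module."""
    match = SECTION_RE.match(line.strip())
    if not match:
        return None
    number, title = match.groups()
    first = number.split(".")[0]
    # Assumption: only leading numbers 1-5 correspond to eCTD modules.
    module = f"m{first}" if first in {"1", "2", "3", "4", "5"} else None
    return {
        "number": number,
        "title": title,
        "level": number.count(".") + 1,
        "module": module,
    }
```

Lines without a leading section number return `None`, so they can fall through to keyword-based classification.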

### Module Auto-Recognition Rules

| Keyword Pattern | Target Module |
|------------|----------|
| Administrative, Label, Package Insert | m1 |
| Summary, Overview | m2 |
| Quality, CMC, API, Drug Product | m3 |
| Nonclinical, Toxicology, Pharmacokinetics | m4 |
| Clinical, Study, Trial | m5 |
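
The rules above amount to a simple scoring pass: count keyword hits per module and pick the highest. The sketch below is a variation on that idea, assuming whole-word, case-insensitive matching so that `clinical` does not also match inside `nonclinical`. The names `RULES` and `detect_module` are illustrative, and the `m3` default is an assumption taken from the packaged script.

```python
import re

# Keyword table copied from the rules above, lower-cased for matching.
RULES = {
    "m1": ["administrative", "label", "package insert"],
    "m2": ["summary", "overview"],
    "m3": ["quality", "cmc", "api", "drug product"],
    "m4": ["nonclinical", "toxicology", "pharmacokinetics"],
    "m5": ["clinical", "study", "trial"],
}


def detect_module(text: str, default: str = "m3") -> str:
    """Return the module with the most keyword hits, or `default` if none hit."""
    lowered = text.lower()
    scores = {}
    for module, keywords in RULES.items():
        # \b anchors keep "clinical" from matching inside "nonclinical".
        scores[module] = sum(
            1 for kw in keywords if re.search(rf"\b{re.escape(kw)}\b", lowered)
        )
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

Ties resolve to the first module in table order, so an explicit `--module` value is the safer choice for ambiguous documents.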

## Output Structure

Generated eCTD skeleton contains:

### index.xml
Master index file containing references and sequence information for all modules.

### Module XML (m1.xml - m5.xml)
XML skeleton for each module, containing:
- Document hierarchy structure (`<leaf>`, `<node>`)
- Cross-references (`<cross-reference>`)
- Attribute definitions (ID, version, operation type)

### MD5 Checksums
MD5 checksum values for each file to ensure integrity.
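
A checksum file is only useful if it can be re-verified. The sketch below shows one way to do that, assuming each non-comment line of `index-md5.txt` has the form `<md5>  <relative path>` and that `#` lines are header comments. `verify_checksums` and `md5_of` are hypothetical helpers, not part of the packaged script.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Hash a file in chunks so large documents are not loaded at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(output_dir: str, manifest: str = "index-md5.txt") -> dict:
    """Map each listed path to "ok", "mismatch", or "missing"."""
    root = Path(output_dir)
    status = {}
    for line in (root / manifest).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip header comments and blank lines
        expected, rel_path = line.split(None, 1)
        target = root / rel_path
        if not target.exists():
            status[rel_path] = "missing"
        elif md5_of(target) == expected:
            status[rel_path] = "ok"
        else:
            status[rel_path] = "mismatch"
    return status
```

Entries that cannot be found under the output directory are reported as `missing` rather than raising, which keeps the report complete when source documents live elsewhere.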

## Installation

```bash
# Install dependencies
pip install python-docx PyPDF2 lxml
```

## Validation

The `--validate` option runs the following checks on the generated XML:
- DTD structure validation
- Required elements and attributes check
- Cross-reference integrity check
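
The required-attribute check can be approximated with the standard library alone. This is a minimal sketch under stated assumptions: every `<leaf>` is expected to carry `ID`, `operation`, and `xlink:href`, and `check_leaves` is an illustrative name rather than the packaged validator.

```python
from xml.etree import ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
# Attributes assumed to be mandatory on every leaf element.
REQUIRED = ["ID", "operation", f"{{{XLINK_NS}}}href"]


def check_leaves(xml_text: str) -> list:
    """Return one message per required attribute missing from a leaf element."""
    root = ET.fromstring(xml_text)
    problems = []
    for elem in root.iter():
        # Drop the "{uri}" prefix so any eCTD namespace version matches.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name != "leaf":
            continue
        for attr in REQUIRED:
            if attr not in elem.attrib:
                problems.append(f"leaf {elem.get('ID', '?')!r} is missing {attr}")
    return problems
```

An empty list means every leaf passed. Full DTD validation still needs a validating parser and the DTD files.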

## References

- [ICH eCTD Specification v4.0](https://www.ich.org/page/ectd)
- [FDA eCTD Guidance](https://www.fda.gov/drugs/electronic-regulatory-submission-and-review/ectd-technical-conformance-guide)
- [EMA eCTD Requirements](https://www.ema.europa.eu/en/human-regulatory/marketing-authorisation/application-procedures/electronic-application-forms)

## License

MIT License

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
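
The two path-related items can be enforced before any file is read or written. A minimal sketch, assuming a single workspace root; `resolve_inside` is a hypothetical helper and not part of `scripts/main.py`.

```python
from pathlib import Path


def resolve_inside(workspace: str, user_path: str) -> Path:
    """Resolve `user_path` against `workspace` and reject anything that escapes it."""
    root = Path(workspace).resolve()
    candidate = (root / user_path).resolve()
    try:
        # relative_to raises ValueError when candidate is not under root.
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"path escapes workspace: {user_path}") from None
    return candidate
```

Resolving both paths first means `../` segments, symlinks, and absolute paths are all compared in canonical form.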

## Prerequisites

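Python 3 plus the packages listed under `## Dependencies` (`python-docx`, `PyPDF2`, `lxml`). The script imports them optionally and raises an `ECTDError` if the parser library needed for an input file is missing.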

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.

## Input Validation

This skill accepts requests that match the documented purpose of `ectd-xml-compiler` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `ectd-xml-compiler` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## References

- [references/audit-reference.md](references/audit-reference.md) - Supported scope, audit commands, and fallback boundaries

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.

FILE:references/audit-reference.md
# Audit Reference

## Scope

- Skill: `ectd-xml-compiler`
- Core purpose: Automatically convert uploaded drug application documents (Word/PDF) into XML skeleton structure compliant with eCTD 4.0/3.2.2 specifications.
- Use only within the documented workflow and category boundary defined in `SKILL.md`

## Supported Audit Paths

- `python -m py_compile scripts/main.py`
- `python scripts/main.py --help`

## Fallback Boundary

If required inputs are incomplete, the skill should still return:

- the missing required inputs
- the steps that can still be completed safely
- assumptions that need confirmation before execution
- the next checks before accepting the final deliverable

FILE:requirements.txt
python-docx>=0.8.11
lxml>=4.9.0
PyPDF2>=3.0.0

FILE:scripts/main.py
#!/usr/bin/env python3
"""eCTD XML Compiler
Automatically convert drug application documents (Word/PDF) into eCTD XML skeleton structures that comply with FDA/EMA requirements.

ID: 197"""

import argparse
import hashlib
import os
import re
import sys
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from xml.etree import ElementTree as ET
from xml.dom import minidom

# Optional imports with fallbacks
try:
    from docx import Document
    HAS_DOCX = True
except ImportError:
    HAS_DOCX = False

try:
    import PyPDF2
    HAS_PYPDF2 = True
except ImportError:
    HAS_PYPDF2 = False

try:
    from lxml import etree
    HAS_LXML = True
except ImportError:
    HAS_LXML = False


class ECTDError(Exception):
    """eCTD handling errors"""
    pass


class DocumentParser:
    """Document parser base class"""
    
    def __init__(self, file_path: str):
        self.file_path = file_path
        self.content = ""
        self.metadata = {}
        self.headings = []
    
    def parse(self) -> Dict:
        """Parse the document and return structured data"""
        raise NotImplementedError
    
    def detect_module(self) -> str:
        """Automatically detect the eCTD module it belongs to according to the content"""
        content_lower = self.content.lower()
        
        # Module detection rules
        # Keywords are matched case-insensitively as substrings, so list each one once
        module_keywords = {
            "m1": ["administrative", "label", "package insert", "labeling", "form 356h"],
            "m2": ["summary", "overview", "introduction"],
            "m3": ["quality", "cmc", "api", "drug product", "manufacture", "control"],
            "m4": ["non-clinical", "nonclinical", "toxicology", "pharmacokinetics", "pharmacology"],
            "m5": ["clinical", "study report", "efficacy", "safety"],
        }
        
        scores = {mod: 0 for mod in module_keywords}
        for module, keywords in module_keywords.items():
            for keyword in keywords:
                if keyword.lower() in content_lower:
                    scores[module] += 1
        
        # Returns the module with the highest score, default is m3
        if max(scores.values()) > 0:
            return max(scores, key=scores.get)
        return "m3"


class WordParser(DocumentParser):
    """Word document parser"""
    
    def parse(self) -> Dict:
        if not HAS_DOCX:
            raise ECTDError("python-docx is not installed and the Word document cannot be parsed. Please run: pip install python-docx")
        
        doc = Document(self.file_path)
        
        # Extract text content
        paragraphs = []
        for para in doc.paragraphs:
            text = para.text.strip()
            if text:
                paragraphs.append(text)
                # Detect title (based on style)
                if para.style and para.style.name:
                    style_name = para.style.name.lower()
                    if 'heading' in style_name or 'title' in style_name:
                        level = self._extract_heading_level(style_name)
                        self.headings.append({
                            'level': level,
                            'text': text,
                            'style': para.style.name
                        })
        
        self.content = "\n".join(paragraphs)
        
        # Extract metadata
        self.metadata = {
            'author': doc.core_properties.author if doc.core_properties else '',
            'created': doc.core_properties.created.isoformat() if doc.core_properties and doc.core_properties.created else '',
            'modified': doc.core_properties.modified.isoformat() if doc.core_properties and doc.core_properties.modified else '',
            'title': doc.core_properties.title if doc.core_properties else '',
        }
        
        return {
            'content': self.content,
            'metadata': self.metadata,
            'headings': self.headings,
            'paragraphs': paragraphs
        }
    
    def _extract_heading_level(self, style_name: str) -> int:
        """Extract heading level from style name"""
        match = re.search(r'heading\s*(\d)', style_name, re.IGNORECASE)
        if match:
            return int(match.group(1))
        # style_name is already lower-cased by the caller, so match "title" in lower case
        match = re.search(r'title\s*(\d)', style_name, re.IGNORECASE)
        if match:
            return int(match.group(1))
        return 1


class PDFParser(DocumentParser):
    """PDF document parser"""
    
    def parse(self) -> Dict:
        if not HAS_PYPDF2:
            raise ECTDError("PyPDF2 is not installed and the PDF document cannot be parsed. Please run: pip install PyPDF2")
        
        with open(self.file_path, 'rb') as f:
            reader = PyPDF2.PdfReader(f)
            
            # Extract metadata
            meta = reader.metadata
            self.metadata = {
                'author': meta.get('/Author', '') if meta else '',
                'creator': meta.get('/Creator', '') if meta else '',
                'producer': meta.get('/Producer', '') if meta else '',
                'title': meta.get('/Title', '') if meta else '',
                'num_pages': len(reader.pages),
            }
            
            # Extract text
            paragraphs = []
            for page in reader.pages:
                text = page.extract_text()
                if text:
                    # Simple segmentation
                    for line in text.split('\n'):
                        line = line.strip()
                        if line:
                            paragraphs.append(line)
            
            self.content = "\n".join(paragraphs)
            
            # Try to identify headings (based on line length and section numbering)
            self.headings = self._extract_headings(paragraphs)
        
        return {
            'content': self.content,
            'metadata': self.metadata,
            'headings': self.headings,
            'paragraphs': paragraphs
        }
    
    def _extract_headings(self, paragraphs: List[str]) -> List[Dict]:
        """Extract likely headings from PDF text lines"""
        headings = []
        for para in paragraphs[:50]:  # Only check the first 50 lines
            # Headings are usually short and start with a section number
            if len(para) < 100:
                # Match section numbering such as 3.2.S.1.1, 1.1, or 2.3.4
                # (segments after the first may be letters, e.g. "S" or "P")
                if re.match(r'^\d+(\.[0-9A-Za-z]+)*\s+[A-Za-z]', para) or \
                   re.match(r'^[\d\.]+\s+[\u4e00-\u9fa5]', para):
                    # Count dots in the section number only, not in the heading text
                    number = para.split()[0].rstrip('.')
                    level = number.count('.') + 1
                    headings.append({
                        'level': min(level, 6),
                        'text': para,
                        'style': 'extracted'
                    })
        return headings


class ECTDCompiler:
    """eCTD XML Compiler"""
    
    def __init__(self, output_dir: str, region: str = "ICH", version: str = "4.0"):
        self.output_dir = Path(output_dir)
        self.region = region
        self.version = version
        self.modules = {f"m{i}": [] for i in range(1, 6)}
        self.file_hashes = {}
    
    def compile_document(self, file_path: str, target_module: Optional[str] = None):
        """Compile a single document"""
        file_path = Path(file_path)
        
        if not file_path.exists():
            raise ECTDError(f"File does not exist: {file_path}")
        
        # Select parser
        suffix = file_path.suffix.lower()
        if suffix in ['.docx', '.doc']:
            parser = WordParser(str(file_path))
        elif suffix == '.pdf':
            parser = PDFParser(str(file_path))
        else:
            raise ECTDError(f"Unsupported file format: {suffix}")
        
        # Parse document
        data = parser.parse()
        
        # Determine target module
        if target_module is None or target_module == "auto":
            module = parser.detect_module()
        else:
            module = target_module
        
        # Store parsing results
        self.modules[module].append({
            'file_path': str(file_path),
            'file_name': file_path.name,
            'data': data,
            'module': module
        })
        
        # Calculate file hash
        self.file_hashes[file_path.name] = self._calculate_md5(file_path)
        
        return module
    
    def _calculate_md5(self, file_path: Path) -> str:
        """Calculate MD5 hash of file"""
        hash_md5 = hashlib.md5()
        with open(file_path, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash_md5.update(chunk)
        return hash_md5.hexdigest()
    
    def generate_xml(self):
        """Generate eCTD XML skeleton"""
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        # Create module directory and XML
        for module_id in ['m1', 'm2', 'm3', 'm4', 'm5']:
            module_dir = self.output_dir / module_id
            module_dir.mkdir(exist_ok=True)
            self._generate_module_xml(module_id, module_dir)
        
        # Generate main index file
        self._generate_index_xml()
        
        # Generate MD5 check file
        self._generate_md5_file()
        
        return self.output_dir
    
    def _generate_module_xml(self, module_id: str, module_dir: Path):
        """Generate module XML file"""
        root = ET.Element("ectd:ectd")
        root.set("xmlns:ectd", f"http://www.ich.org/ectd/{self.version}")
        root.set("xmlns:xlink", "http://www.w3.org/1999/xlink")
        
        # Add module root node
        module_elem = ET.SubElement(root, f"ectd:{module_id}")
        
        documents = self.modules.get(module_id, [])
        
        # If there is a document, create a leaf node
        for doc in documents:
            leaf = ET.SubElement(module_elem, "ectd:leaf")
            leaf.set("ID", f"{module_id}-{doc['file_name']}")
            leaf.set("xlink:href", f"{module_id}/{doc['file_name']}")
            leaf.set("operation", "new")
            
            # Add title information
            title = ET.SubElement(leaf, "ectd:title")
            title.text = doc['data']['metadata'].get('title', '') or doc['file_name']
            
            # Add document properties
            if doc['data']['headings']:
                for heading in doc['data']['headings'][:5]:  # Only include the first 5 titles
                    node = ET.SubElement(leaf, "ectd:node")
                    node.set("level", str(heading['level']))
                    node.text = heading['text']
        
        # If there is no document, add placeholder instructions
        if not documents:
            placeholder = ET.SubElement(module_elem, "ectd:placeholder")
            placeholder.text = f"No documents assigned to {module_id}"
        
        # format and write
        xml_str = self._prettify_xml(root)
        xml_path = module_dir / f"{module_id}.xml"
        with open(xml_path, 'w', encoding='utf-8') as f:
            f.write(xml_str)
    
    def _generate_index_xml(self):
        """Generate main index file"""
        root = ET.Element("ectd:ectd")
        root.set("xmlns:ectd", f"http://www.ich.org/ectd/{self.version}")
        root.set("xmlns:xlink", "http://www.w3.org/1999/xlink")
        root.set("dtd-version", self.version)
        
        # Add header information
        header = ET.SubElement(root, "ectd:header")
        
        submission = ET.SubElement(header, "ectd:submission")
        submission.set("type", "initial")
        submission.set("date", datetime.now().strftime("%Y-%m-%d"))
        
        applicant = ET.SubElement(header, "ectd:applicant")
        applicant.text = "[Applicant name]"
        
        product = ET.SubElement(header, "ectd:product")
        product.set("name", "[drug name]")
        
        # Add module reference
        modules_elem = ET.SubElement(root, "ectd:modules")
        for module_id in ['m1', 'm2', 'm3', 'm4', 'm5']:
            module_ref = ET.SubElement(modules_elem, "ectd:module")
            module_ref.set("id", module_id)
            module_ref.set("xlink:href", f"{module_id}/{module_id}.xml")
            module_ref.set("title", self._get_module_title(module_id))
        
        # format and write
        xml_str = self._prettify_xml(root)
        index_path = self.output_dir / "index.xml"
        with open(index_path, 'w', encoding='utf-8') as f:
            f.write(xml_str)
    
    def _get_module_title(self, module_id: str) -> str:
        """Get module title"""
        titles = {
            "m1": "Administrative Information and Prescribing Information",
            "m2": "Common Technical Document Summaries",
            "m3": "Quality",
            "m4": "Nonclinical Study Reports",
            "m5": "Clinical Study Reports",
        }
        return titles.get(module_id, module_id)
    
    def _generate_md5_file(self):
        """Generate MD5 check file"""
        md5_path = self.output_dir / "index-md5.txt"
        with open(md5_path, 'w', encoding='utf-8') as f:
            f.write("# eCTD MD5 Checksums\n")
            f.write(f"# Generated: {datetime.now().isoformat()}\n")
            f.write(f"# Version: {self.version}\n")
            f.write(f"# Region: {self.region}\n\n")
            
            for filename, hash_value in sorted(self.file_hashes.items()):
                f.write(f"{hash_value}  {filename}\n")
            
            # Add generated XML file hash
            for module_id in ['m1', 'm2', 'm3', 'm4', 'm5']:
                xml_file = self.output_dir / module_id / f"{module_id}.xml"
                if xml_file.exists():
                    hash_value = self._calculate_md5(xml_file)
                    f.write(f"{hash_value}  {module_id}/{module_id}.xml\n")
            
            # index.xml
            index_file = self.output_dir / "index.xml"
            if index_file.exists():
                hash_value = self._calculate_md5(index_file)
                f.write(f"{hash_value}  index.xml\n")
    
    def _prettify_xml(self, elem) -> str:
        """Formatted XML output"""
        rough_string = ET.tostring(elem, encoding='unicode')
        reparsed = minidom.parseString(rough_string)
        return reparsed.toprettyxml(indent="  ")
    
    def validate(self) -> List[str]:
        """Validate the generated XML structure"""
        errors = []
        
        # Basic verification
        index_file = self.output_dir / "index.xml"
        if not index_file.exists():
            errors.append("Missing main index file index.xml")
        
        for module_id in ['m1', 'm2', 'm3', 'm4', 'm5']:
            module_xml = self.output_dir / module_id / f"{module_id}.xml"
            if not module_xml.exists():
                errors.append(f"Missing module file {module_id}.xml")
        
        # DTD validation using lxml (if available)
        if HAS_LXML:
            errors.extend(self._dtd_validation())
        
        return errors
    
    def _dtd_validation(self) -> List[str]:
        """DTD verification"""
        errors = []
        # Placeholder: DTD validation is not implemented yet.
        # It requires loading the ICH eCTD DTD file.
        return errors


def create_parser() -> argparse.ArgumentParser:
    """Create a command line argument parser"""
    parser = argparse.ArgumentParser(
        description="eCTD XML Compiler - Convert drug submission documents to eCTD XML skeleton structures",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""Example:
  %(prog)s document.docx report.pdf
  %(prog)s -o ./my-ectd -m m3 quality-doc.docx
  %(prog)s -r FDA -v 3.2.2 *.pdf
  %(prog)s --validate submission.pdf"""
    )
    
    parser.add_argument(
        'input_files',
        nargs='+',
        help='Input Word/PDF file path (supports multiple)'
    )
    
    parser.add_argument(
        '-o', '--output',
        default='./ectd-output',
        help='Output directory path (default: ./ectd-output)'
    )
    
    parser.add_argument(
        '-m', '--module',
        choices=['m1', 'm2', 'm3', 'm4', 'm5', 'auto'],
        default='auto',
        help='Target module (default: auto, detected automatically)'
    )
    
    parser.add_argument(
        '-r', '--region',
        choices=['FDA', 'EMA', 'ICH'],
        default='ICH',
        help='Target region (default: ICH)'
    )
    
    parser.add_argument(
        '-v', '--version',
        choices=['3.2.2', '4.0'],
        default='4.0',
        help='eCTD version (default: 4.0)'
    )
    
    parser.add_argument(
        '-d', '--dtd-path',
        help='Custom DTD path'
    )
    
    parser.add_argument(
        '--validate',
        action='store_true',
        help='Validate generated XML'
    )
    
    parser.add_argument(
        '--verbose',
        action='store_true',
        help='Show details'
    )
    
    return parser


def main():
    """main function"""
    parser = create_parser()
    args = parser.parse_args()
    
    # Check dependencies
    missing_deps = []
    if not HAS_DOCX:
        missing_deps.append("python-docx")
    if not HAS_PYPDF2:
        missing_deps.append("PyPDF2")
    
    if missing_deps:
        print(f"Warning: Missing optional dependencies: {', '.join(missing_deps)}")
        print("Install with: pip install " + " ".join(missing_deps))
    
    # Initialize the compiler
    try:
        compiler = ECTDCompiler(
            output_dir=args.output,
            region=args.region,
            version=args.version
        )
    except Exception as e:
        print(f"Error: Failed to initialize compiler - {e}", file=sys.stderr)
        sys.exit(1)
    
    # Process input files
    processed = 0
    for file_path in args.input_files:
        if args.verbose:
            print(f"Processing: {file_path}")
        
        try:
            module = compiler.compile_document(file_path, args.module)
            if args.verbose:
                print(f"  -> assigned to module: {module}")
            processed += 1
        except ECTDError as e:
            print(f"Error [{file_path}]: {e}", file=sys.stderr)
        except Exception as e:
            print(f"Error [{file_path}]: {e}", file=sys.stderr)
    
    if processed == 0:
        print("Error: No files were successfully processed", file=sys.stderr)
        sys.exit(1)
    
    print(f"\nSuccessfully processed {processed} files")
    
    # Generate XML
    print("\nGenerating eCTD XML skeleton...")
    output_dir = compiler.generate_xml()
    print(f"Output directory: {output_dir.absolute()}")
    
    # Show module allocation statistics
    print("Module assignment:")
    for module_id in ['m1', 'm2', 'm3', 'm4', 'm5']:
        count = len(compiler.modules[module_id])
        if count > 0:
            print(f"  {module_id}: {count} documents")
    
    # Verification (if requested)
    if args.validate:
        print("Validating XML structure...")
        errors = compiler.validate()
        if errors:
            print("The following issues were found:")
            for error in errors:
                print(f"  - {error}")
        else:
            print("Validation passed!")
    
    print("Done!")
    print(f"The eCTD skeleton has been generated in: {output_dir.absolute()}")


if __name__ == '__main__':
    main()


eCRF Designer

Skill

Design clinical trial CRFs with proper validation rules

---
name: ecrf-designer
description: Design clinical trial CRFs with proper validation rules
version: 1.0.0
category: Pharma
tags: []
author: AIPOCH
license: MIT
status: Draft
risk_level: Medium
skill_type: Tool/Script
owner: AIPOCH
reviewer: ''
last_updated: '2026-02-06'
---

# eCRF Designer

Clinical data collection form design.

## Use Cases
- Case report form creation
- CDISC SDTM compliance
- EDC system setup
- Data validation rules

## Parameters
- `visit_schedule`: Time points
- `data_elements`: Variables to collect
- `cdisc_domain`: SDTM domain

## Returns
- CRF specifications
- Field validation rules
- Skip-logic patterns
- Data dictionary

## Example
Demographics form with edit checks for age range
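
A minimal sketch of what such an edit check could look like, assuming age is derived from the `birth_date` field of the demographics template and that the protocol allows ages 18 to 75. The function names `age_on` and `check_age_range`, the bounds, and the message wording are illustrative assumptions, not part of the packaged script.

```python
from datetime import date
from typing import Optional


def age_on(birth_date: date, reference: date) -> int:
    """Return completed years between birth_date and reference."""
    years = reference.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def check_age_range(birth_date: date, reference: date,
                    min_age: int = 18, max_age: int = 75) -> Optional[str]:
    """Hypothetical edit check: return a query message, or None if the value passes."""
    age = age_on(birth_date, reference)
    if age < min_age or age > max_age:
        return f"Age {age} is outside the allowed range {min_age}-{max_age}"
    return None
```

Returning a message instead of raising keeps the check usable as a data query: the form can still be saved while the discrepancy is flagged for review.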

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited

## Prerequisites

No additional Python packages required.

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

FILE:scripts/main.py
#!/usr/bin/env python3
"""
eCRF Designer
Design clinical trial CRFs with proper validation rules.
"""

import argparse
import json


class eCRFDesigner:
    """Design electronic Case Report Forms."""
    
    FIELD_TYPES = {
        "text": {"validation": ["max_length", "regex"], "example": "Free text"},
        "number": {"validation": ["min", "max", "decimal_places"], "example": "123.45"},
        "integer": {"validation": ["min", "max"], "example": "42"},
        "date": {"validation": ["min_date", "max_date"], "example": "2024-01-15"},
        "choice": {"validation": ["options"], "example": "Single select"},
        "multichoice": {"validation": ["options", "max_selections"], "example": "Multi select"},
        "yesno": {"validation": [], "example": "Yes/No"},
        "calculated": {"validation": ["formula"], "example": "Auto-calculated"}
    }
    
    CRF_TEMPLATES = {
        "demographics": {
            "fields": [
                {"name": "subject_id", "type": "text", "required": True, "label": "Subject ID"},
                {"name": "birth_date", "type": "date", "required": True, "label": "Date of Birth"},
                {"name": "gender", "type": "choice", "options": ["Male", "Female", "Other"], "label": "Gender"},
                {"name": "race", "type": "choice", "options": ["Asian", "Black", "White", "Other"], "label": "Race"}
            ]
        },
        "adverse_event": {
            "fields": [
                {"name": "ae_description", "type": "text", "required": True, "label": "AE Description"},
                {"name": "severity", "type": "choice", "options": ["Mild", "Moderate", "Severe"], "label": "Severity"},
                {"name": "start_date", "type": "date", "required": True, "label": "Start Date"},
                {"name": "related", "type": "yesno", "label": "Related to Study Drug?"}
            ]
        },
        "efficacy": {
            "fields": [
                {"name": "visit_date", "type": "date", "required": True, "label": "Visit Date"},
                {"name": "tumor_size", "type": "number", "min": 0, "label": "Tumor Size (mm)"},
                {"name": "response", "type": "choice", "options": ["CR", "PR", "SD", "PD"], "label": "Response"}
            ]
        }
    }
    
    def generate_crf(self, template_name, custom_fields=None):
        """Generate CRF from template."""
        crf = {
            "form_name": template_name,
            "version": "1.0",
            "fields": []
        }
        
        if template_name in self.CRF_TEMPLATES:
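            # Copy each field dict so later edits (e.g. add_validation) do not
            # mutate the shared class-level template.
            crf["fields"] = [dict(f) for f in self.CRF_TEMPLATES[template_name]["fields"]]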
        
        if custom_fields:
            crf["fields"].extend(custom_fields)
        
        return crf
    
    def add_validation(self, field, rules):
        """Add validation rules to field."""
        field["validation"] = rules
        return field
    
    def export_json(self, crf, output_file):
        """Export CRF to JSON."""
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(crf, f, indent=2)
    
    def print_crf(self, crf):
        """Print CRF structure."""
        print(f"\n{'='*60}")
        print(f"eCRF: {crf['form_name']} (v{crf['version']})")
        print(f"{'='*60}\n")
        
        for field in crf["fields"]:
            print(f"Field: {field['name']}")
            print(f"  Label: {field['label']}")
            print(f"  Type: {field['type']}")
            print(f"  Required: {field.get('required', False)}")
            if "options" in field:
                print(f"  Options: {', '.join(field['options'])}")
            print()
        
        print(f"{'='*60}\n")


def main():
    parser = argparse.ArgumentParser(description="eCRF Designer")
    parser.add_argument("--template", "-t", choices=["demographics", "adverse_event", "efficacy"],
                       default="demographics", help="CRF template")
    parser.add_argument("--output", "-o", default="crf.json", help="Output file")
    parser.add_argument("--list-types", action="store_true", help="List field types")
    
    args = parser.parse_args()
    
    designer = eCRFDesigner()
    
    if args.list_types:
        print("\nAvailable field types:")
        for ftype, info in designer.FIELD_TYPES.items():
            print(f"  {ftype}: {info['example']}")
        return
    
    crf = designer.generate_crf(args.template)
    designer.print_crf(crf)
    designer.export_json(crf, args.output)
    print(f"CRF exported to: {args.output}")


if __name__ == "__main__":
    main()


Visual Content Desc

Skill

Generates detailed text descriptions of medical images and charts for.

---
name: visual-content-desc
description: Generates detailed text descriptions of medical images and charts for.
license: MIT
skill-author: AIPOCH
---
# Visual Content Desc

Creates accessible descriptions of medical visuals.

## When to Use

- Use this skill when the task needs detailed text descriptions of medical images and charts.
- Use this skill for academic writing tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

See `## Features` below for related details.

- Scope-focused workflow for generating detailed text descriptions of medical images and charts.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

See `## Prerequisites` below for related details.

- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `Third-party packages`: `not explicitly version-pinned in this skill package`. Add pinned versions if this skill needs stricter environment control.

## Example Usage

```bash
cd "20260318/scientific-skills/Academic Writing/visual-content-desc"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Features

- Image description generation
- Chart interpretation
- Alt text creation
- Figure legend assistance

## Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `image_type` | str | Yes | "microscopy", "chart", "scan" |
| `key_features` | list | Yes | Important visual elements |
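
The two parameters can be checked before any description is generated. The sketch below is one possible pre-flight check; treating the three listed image types as a closed set is an assumption, and `validate_inputs` is a hypothetical helper that does not exist in `scripts/main.py`.

```python
from typing import List

# Image types listed in the parameter table; treating them as a closed set is an assumption.
ALLOWED_IMAGE_TYPES = {"microscopy", "chart", "scan"}


def validate_inputs(image_type: str, key_features: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    if image_type not in ALLOWED_IMAGE_TYPES:
        problems.append(f"image_type must be one of {sorted(ALLOWED_IMAGE_TYPES)}")
    if not key_features:
        problems.append("key_features must contain at least one item")
    elif not all(isinstance(f, str) and f.strip() for f in key_features):
        problems.append("key_features must be non-empty strings")
    return problems
```

Collecting every problem in one pass makes it possible to name all missing fields at once and request only the minimum additional information.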

## Output Format

```json
{
  "description": "string",
  "alt_text": "string",
  "figure_text": "string"
}
```
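
A consumer of this output might verify the shape before using it. The sketch below only requires the three documented keys and tolerates extras, since the packaged `describe` method also returns an `image_type` key. `missing_keys` is a hypothetical helper introduced here for illustration.

```python
import json

# The three keys documented in the output format above.
REQUIRED_KEYS = {"description", "alt_text", "figure_text"}


def missing_keys(payload: str) -> set:
    """Return the documented keys that are absent from a JSON result string."""
    data = json.loads(payload)
    if not isinstance(data, dict):
        # A non-object payload has none of the required keys.
        return set(REQUIRED_KEYS)
    return REQUIRED_KEYS - set(data)
```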

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
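
The two path-related checklist items (no `../` traversal, output restricted to the workspace) could be enforced with a check along these lines. This is a sketch under the assumption that a single workspace root is known; `is_inside_workspace` is a hypothetical name, not a function in the packaged script.

```python
from pathlib import Path


def is_inside_workspace(candidate: str, workspace: str) -> bool:
    """Return True if candidate resolves to a location inside workspace."""
    root = Path(workspace).resolve()
    # Joining with an absolute candidate discards root, so absolute paths
    # outside the workspace are rejected by the same check.
    target = (root / candidate).resolve()
    try:
        # relative_to raises ValueError when target is not under root.
        target.relative_to(root)
    except ValueError:
        return False
    return True
```

Resolving both paths before comparing them handles `..` segments and symlinked roots, which a plain string-prefix comparison would miss.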

## Prerequisites

No additional Python packages required.

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
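
One way to report a script failure without exposing a full stack trace is sketched below. It assumes the script is launched as a child Python process; `run_script` and the shape of the returned dictionary are illustrative choices, not part of the skill.

```python
import subprocess
import sys
from typing import Dict, List


def run_script(args: List[str]) -> Dict[str, object]:
    """Run the current Python interpreter with args and summarize the outcome."""
    completed = subprocess.run(
        [sys.executable, *args], capture_output=True, text=True
    )
    if completed.returncode != 0:
        # Keep only the last stderr line: enough to name the failure point
        # without echoing the whole traceback.
        tail = completed.stderr.strip().splitlines()[-1:]
        return {
            "status": "failed",
            "exit_code": completed.returncode,
            "failure_point": tail[0] if tail else "",
        }
    return {"status": "ok", "stdout": completed.stdout}
```

A call such as `run_script(["scripts/main.py"])` would then yield either the script output or a short failure summary that can be placed in the fallback response.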

## Input Validation

This skill accepts requests that match the documented purpose of `visual-content-desc` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `visual-content-desc` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
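
The seven-part structure can be rendered mechanically so that no section is dropped silently. The sketch below is one possible approach; `render_response` and the "Not provided" filler are assumptions made for illustration.

```python
from typing import Dict

# Section titles in the order given by the response template.
SECTIONS = [
    "Objective",
    "Inputs Received",
    "Assumptions",
    "Workflow",
    "Deliverable",
    "Risks and Limits",
    "Next Checks",
]


def render_response(content: Dict[str, str]) -> str:
    """Render all seven sections in order, marking any that were not supplied."""
    lines = []
    for number, title in enumerate(SECTIONS, start=1):
        body = content.get(title, "Not provided")
        lines.append(f"{number}. {title}: {body}")
    return "\n".join(lines)
```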

FILE:references/guidelines.md
# Visual Content Desc - References

## Accessibility Standards
- WCAG Image Guidelines
- Alt Text Best Practices
- Medical Image Description Standards

FILE:scripts/main.py
#!/usr/bin/env python3
"""Visual Content Desc - Image description generator."""

import json

class VisualContentDesc:
    """Generates descriptions of medical images."""
    
    def describe(self, image_type: str, key_features: list) -> dict:
        """Generate image description."""
        
        features_text = ", ".join(key_features)
        
        description = f"This {image_type} shows {features_text}."
        alt_text = f"Medical {image_type}: {features_text}"
        
        return {
            "description": description,
            "alt_text": alt_text[:150],
            "figure_text": f"Figure shows {features_text}.",
            "image_type": image_type
        }

def main():
    describer = VisualContentDesc()
    result = describer.describe("microscopy", ["cellular structures", "stained nuclei", "tissue architecture"])
    print(json.dumps(result, indent=2))

if __name__ == "__main__":
    main()


Virtual Patient Roleplay

Skill

Simulate standardized patient encounters for medical training, supporting OSCE-style history-taking practice, communication skills rehearsal, and educational...

---
name: virtual-patient-roleplay
description: Simulate standardized patient encounters for medical training, supporting OSCE-style history-taking practice, communication skills rehearsal, and educational debriefing.
license: MIT
skill-author: AIPOCH
---
# Virtual Patient Roleplay

Structured standardized-patient simulation for medical training and clinical interview practice.

> **Educational Disclaimer:** All output is for training simulation only. This skill does not provide real clinical diagnosis, treatment selection, or emergency instructions. Faculty supervision is required for formal assessment use.

## Quick Check

```bash
python -m py_compile scripts/main.py
python -c "from scripts.main import PatientSimulator; sim=PatientSimulator('chest_pain'); print(sim.ask('Where does the pain go?')['patient_response'])"
```

## When to Use

- Use this skill for OSCE-style history-taking practice, communication skills rehearsal, or debrief planning.
- Use this skill when a learner needs to practice clinical interviewing with a simulated patient response.
- Do not use this skill for real patient triage, clinical diagnosis, treatment selection, or emergency guidance.

## Workflow

1. Confirm the training goal, scenario type, learner level, and output focus (questioning, bedside manner, or debriefing).
2. Check whether the request is for live roleplay, case setup, feedback, or post-encounter summary.
3. Use the packaged simulator for supported scenarios; otherwise provide a manual roleplay scaffold without inventing unsupported medical certainty.
4. Return the patient response or teaching artifact with assumptions, missed-question prompts, and debrief notes.
5. If the request exceeds educational scope, stop and restate the boundary explicitly.

## Usage

```text
python -c "from scripts.main import PatientSimulator; sim=PatientSimulator('chest_pain'); print(sim.ask('Where does the pain go?')['patient_response'])"
python -c "from scripts.main import PatientSimulator; sim=PatientSimulator('headache'); print(sim.ask('Did the pain start suddenly?')['patient_response'])"
```

## Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `scenario` | string | No | `chest_pain` | Scenario: `chest_pain`, `headache`, `abdominal_pain` |
| `student_question` | string | Yes (for interaction) | — | Learner question posed to the patient |
| `difficulty` | string | No | `intermediate` | Scenario difficulty level |
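
The parameter table can be mirrored as a small request object that applies the documented defaults and rejects unsupported scenarios early. `RoleplayRequest` is a hypothetical wrapper written for this sketch; the packaged `PatientSimulator` takes `scenario` and `difficulty` directly.

```python
from dataclasses import dataclass

# Scenario identifiers listed in the parameter table.
SUPPORTED_SCENARIOS = ("chest_pain", "headache", "abdominal_pain")


@dataclass
class RoleplayRequest:
    """Illustrative container for one learner interaction."""
    student_question: str
    scenario: str = "chest_pain"
    difficulty: str = "intermediate"

    def __post_init__(self) -> None:
        if self.scenario not in SUPPORTED_SCENARIOS:
            raise ValueError(
                f"Unsupported scenario: {self.scenario}. "
                f"Available: {list(SUPPORTED_SCENARIOS)}"
            )
        if not self.student_question.strip():
            raise ValueError("student_question is required for interaction")
```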

## Output

- Simulated patient response
- Scenario-specific cues and debrief elements
- Explicit reminder that output is educational, not clinical advice

## Scope Boundaries

- This skill supports training simulations, not real clinical triage.
- This skill does not provide diagnosis, treatment selection, or emergency instructions.
- This skill should not be used as a substitute for faculty supervision or patient care.

## Stress-Case Rules

For complex multi-constraint requests, always include these explicit blocks:

1. Training Objective
2. Scenario Assumptions
3. Roleplay Output
4. Educational Limits
5. Debrief and Next Checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate clinical certainty, real patient data, or verified diagnostic outcomes.

## Input Validation

This skill accepts: a scenario identifier and a learner question for standardized patient simulation in a medical training context.

If the request does not involve educational patient simulation — for example, asking for real clinical diagnosis, treatment recommendations, emergency triage, or non-medical roleplay — do not proceed with the workflow. Instead respond:
> "virtual-patient-roleplay is designed for medical training simulations only. Your request appears to be outside this scope. Please provide a scenario and learner question for educational practice, or use a more appropriate tool."

## References

- [references/references.md](references/references.md) — Educational standards and simulation frameworks
- [references/audit-reference.md](references/audit-reference.md) — Supported scope, audit commands, and fallback boundaries

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.

FILE:references/audit-reference.md
# Audit Reference

## Supported Scope

- Simulate standardized-patient interactions for teaching and rehearsal.
- Return educational feedback, missed-question hints, and debrief support.
- Keep outputs bounded to simulation and training support.

## Stable Audit Commands

```bash
python -m py_compile scripts/main.py
python -c "from scripts.main import PatientSimulator; sim=PatientSimulator('chest_pain'); print(sim.ask('Where does the pain go?')['patient_response'])"
```

## Fallback Boundaries

- If the request asks for real diagnosis, treatment, or emergency advice, stop and restate that the skill is educational only.
- If the requested scenario is unsupported, provide a manual scenario outline rather than claiming a simulated run succeeded.
- If execution fails, return a structured teaching scaffold with learner goals, patient cues, and debrief prompts.

## Output Guardrails

- Separate simulated patient speech from teaching commentary.
- Keep hidden-case facts distinct from revealed information.
- State clearly that final clinical judgment belongs to qualified supervision and real patient evaluation.
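
These guardrails can be made structural by keeping the four kinds of content in separate fields. The sketch below is one way to do that; `RoleplayTurn` and its two view methods are hypothetical names and are not taken from `scripts/main.py`.

```python
from dataclasses import dataclass, field
from typing import Dict, List

SUPERVISION_NOTE = (
    "Final clinical judgment belongs to qualified supervision "
    "and real patient evaluation."
)


@dataclass
class RoleplayTurn:
    """One simulated exchange, with learner-facing and faculty-facing parts kept apart."""
    patient_speech: str
    teaching_commentary: List[str] = field(default_factory=list)
    revealed_facts: List[str] = field(default_factory=list)
    hidden_facts: List[str] = field(default_factory=list)

    def learner_view(self) -> Dict[str, object]:
        """What the learner sees during the encounter: no commentary, no hidden facts."""
        return {
            "patient_speech": self.patient_speech,
            "revealed_facts": list(self.revealed_facts),
        }

    def debrief_view(self) -> Dict[str, object]:
        """Full picture for the debrief, including the supervision reminder."""
        return {
            **self.learner_view(),
            "teaching_commentary": list(self.teaching_commentary),
            "hidden_facts": list(self.hidden_facts),
            "note": SUPERVISION_NOTE,
        }
```

Because hidden facts never appear in `learner_view`, a case detail cannot leak into the live encounter by accident.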

FILE:references/guidelines.md
# Virtual Patient Roleplay - References

## Educational Standards
- ASPE (Association of Standardized Patient Educators) Standards of Practice
- AAMC OSCE Guidelines for Medical Schools
- ACGME Milestones for Clinical Skills Assessment

## Clinical Scenarios Reference
- NBME Step 2 CK Practice Materials
- USMLE Clinical Skills Examination Cases
- Standardized Patient Case Development Guidelines

## Communication Skills Framework
- Calgary-Cambridge Guide to the Medical Interview
- SEGUE Framework for Communication Assessment
- Kalamazoo Consensus Statement on Communication Skills

## Technical Implementation
- Python 3.8+
- JSON for case data storage
- Modular design for extensibility

FILE:references/references.md
# Virtual Patient Roleplay References

## Educational Standards

- ASPE Standards of Best Practice
- AAMC OSCE guidance for medical schools
- ACGME clinical-skills milestones

## Simulation Design

- Standardized patient case-design principles
- Structured debriefing guidance for learner feedback
- Communication frameworks for patient interviewing

FILE:requirements.txt
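# No third-party packages required: dataclasses and enum are part of the
# Python standard library and should not be installed from PyPI.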

FILE:scripts/main.py
#!/usr/bin/env python3
"""
Virtual Patient Roleplay - Standardized Patient Simulator
Simulates patient interactions for medical education training.
"""

import json
import random
from dataclasses import dataclass, field
from typing import List, Dict, Optional
from enum import Enum


class EmotionalState(Enum):
    CALM = "calm"
    ANXIOUS = "anxious"
    IRRITABLE = "irritable"
    DEPRESSED = "depressed"
    CONFUSED = "confused"


@dataclass
class PatientProfile:
    """Patient demographic and background information."""
    name: str
    age: int
    gender: str
    occupation: str
    medical_history: List[str] = field(default_factory=list)
    allergies: List[str] = field(default_factory=list)
    medications: List[str] = field(default_factory=list)


@dataclass
class ClinicalScenario:
    """Clinical case template."""
    chief_complaint: str
    history_of_present_illness: str
    associated_symptoms: List[str]
    key_findings: Dict[str, str]
    emotional_state: EmotionalState
    hidden_concerns: List[str]


class PatientSimulator:
    """Simulates standardized patient interactions."""
    
    SCENARIOS = {
        "chest_pain": {
            "chief_complaint": "Chest pain for 2 hours",
            "hpi": "Started while climbing stairs. Pressure-like, 8/10, radiates to left arm. Associated with shortness of breath and sweating.",
            "symptoms": ["dyspnea", "diaphoresis", "nausea"],
            "findings": {"vitals": "HR 110, BP 160/95", "ecg": "ST elevation in V1-V4"},
            "emotion": EmotionalState.ANXIOUS,
            "hidden": ["Family history of MI", "Smokes 1 pack/day", "Afraid of dying"]
        },
        "headache": {
            "chief_complaint": "Severe headache for 3 days",
            "hpi": "Worst headache of life. Sudden onset. Associated with photophobia and neck stiffness.",
            "symptoms": ["photophobia", "phonophobia", "neck_stiffness"],
            "findings": {"vitals": "BP 180/110", "neuro": "Positive Kernig sign"},
            "emotion": EmotionalState.IRRITABLE,
            "hidden": ["Had similar episode 6 months ago", "Takes ibuprofen daily"]
        },
        "abdominal_pain": {
            "chief_complaint": "Right lower quadrant pain",
            "hpi": "Started around belly button, moved to RLQ. Gradual worsening over 12 hours. Loss of appetite.",
            "symptoms": ["anorexia", "nausea", "low_grade_fever"],
            "findings": {"vitals": "T 37.8C, HR 95", "exam": "Tenderness at McBurney's point"},
            "emotion": EmotionalState.CALM,
            "hidden": ["Fear of surgery", "No health insurance"]
        }
    }
    
    def __init__(self, scenario: str = "chest_pain", difficulty: str = "intermediate"):
        self.scenario_type = scenario
        self.difficulty = difficulty
        # hash() of a str is salted per process, so derive the seed from a stable
        # CRC32 digest to keep the same scenario setup reproducible across audits.
        import zlib
        self.seed = zlib.crc32(f"{scenario}:{difficulty}".encode("utf-8"))
        self.rng = random.Random(self.seed)
        self.scenario = self._load_scenario(scenario)
        self.conversation_history: List[Dict] = []
        self.asked_questions: set = set()
        self.revealed_info: set = set()
        
        # Generate patient profile
        self.patient = self._generate_patient_profile()
        
    def _load_scenario(self, scenario_type: str) -> Dict:
        """Load scenario template."""
        if scenario_type not in self.SCENARIOS:
            raise ValueError(f"Unknown scenario: {scenario_type}. Available: {list(self.SCENARIOS.keys())}")
        return self.SCENARIOS[scenario_type]
    
    def _generate_patient_profile(self) -> PatientProfile:
        """Generate random but consistent patient demographics."""
        first_names_m = ["James", "Robert", "John", "Michael", "David"]
        first_names_f = ["Mary", "Patricia", "Jennifer", "Linda", "Elizabeth"]
        last_names = ["Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia"]
        
        gender = self.rng.choice(["male", "female"])
        first_names = first_names_m if gender == "male" else first_names_f
        
        name = f"{self.rng.choice(first_names)} {self.rng.choice(last_names)}"
        age = self.rng.randint(35, 75) if self.scenario_type == "chest_pain" else self.rng.randint(18, 65)
        
        occupations = ["teacher", "retired", "office worker", "truck driver", "nurse"]
        
        return PatientProfile(
            name=name,
            age=age,
            gender=gender,
            occupation=self.rng.choice(occupations),
            medical_history=["hypertension"] if self.rng.random() > 0.5 else [],
            allergies=["penicillin"] if self.rng.random() > 0.7 else [],
            medications=["lisinopril"] if self.rng.random() > 0.5 else []
        )
    
    def ask(self, question: str) -> Dict:
        """
        Process student question and generate patient response.
        
        Args:
            question: Student's question text
            
        Returns:
            Response dictionary with patient answer and feedback
        """
        question_lower = question.lower()
        self.asked_questions.add(question_lower)
        
        # Determine response based on question type
        response = self._generate_response(question_lower)
        
        # Record in history
        exchange = {
            "question": question,
            "response": response["patient_response"],
            "revealed": list(self.revealed_info)
        }
        self.conversation_history.append(exchange)
        
        # Generate feedback
        feedback = self._generate_feedback()
        
        return {
            "patient_response": response["patient_response"],
            "emotional_state": self.scenario["emotion"].value,
            "physical_cues": response.get("physical_cues", ""),
            "hidden_information": self._get_unrevealed_info(),
            "conversation_history": self.conversation_history,
            "feedback": feedback
        }
    
    def _generate_response(self, question: str) -> Dict:
        """Generate contextual patient response."""
        responses = {
            "chest_pain": {
                "pain": ("It feels like an elephant is sitting on my chest. It's crushing and heavy.",
                        "Patient clutches chest while speaking"),
                "arm": ("Yes, it goes down my left arm. Feels numb and tingly.",
                       "Patient rubs left arm"),
                "family": ("My father died of a heart attack when he was 55.",
                          ""),
                "smoke": ("I smoke about a pack a day. I know I should quit...",
                         "Patient looks embarrassed"),
                "name": (f"I'm {self.patient.name}.",
                        ""),
                "default": ("I'm not sure I understand. Can you ask differently?",
                           "Patient looks confused")
            },
            "headache": {
                "head": ("It's the worst headache I've ever had. Like someone hit me with a hammer.",
                        "Patient holds head in hands, avoids light"),
                "light": ("Yes, the light hurts my eyes terribly.",
                         "Patient squints"),
                "neck": ("My neck is so stiff I can barely move it.",
                        "Patient demonstrates limited neck mobility"),
                "default": ("Could you repeat that? The pain makes it hard to concentrate.",
                           "Patient appears distracted")
            },
            "abdominal_pain": {
                "pain": ("It started near my belly button but now it's here on the right side.",
                        "Patient points to RLQ"),
                "eat": ("I haven't felt like eating anything since yesterday.",
                       ""),
                "fever": ("I think I might have a slight fever. I feel warm.",
                         "Patient touches forehead"),
                "surgery": ("I'm scared about needing surgery...",
                           "Patient voice trembles slightly"),
                "default": ("What do you mean?",
                           "Patient looks puzzled")
            }
        }
        
        scenario_responses = responses.get(self.scenario_type, responses["chest_pain"])
        
        # Match keywords
        for keyword, (answer, cue) in scenario_responses.items():
            if keyword != "default" and keyword in question:
                self.revealed_info.add(keyword)
                return {
                    "patient_response": answer,
                    "physical_cues": cue
                }
        
        # Default response
        return {
            "patient_response": scenario_responses.get("default", ("I'm not sure.", ""))[0],
            "physical_cues": scenario_responses.get("default", ("", ""))[1]
        }
    
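    def _get_unrevealed_info(self) -> List[str]:
        """Return hidden concerns that no revealed keyword has touched on yet."""
        hidden = self.scenario.get("hidden", [])
        # revealed_info holds matched keywords (e.g. "family"), not the hidden
        # strings themselves, so compare by case-insensitive substring.
        return [
            item for item in hidden
            if not any(keyword in item.lower() for keyword in self.revealed_info)
        ]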
    
    def _generate_feedback(self) -> Dict:
        """Generate educational feedback."""
        feedback = {
            "communication_tips": [],
            "missed_questions": [],
            "strengths": []
        }
        
        # Check for missed critical questions
        critical_questions = {
            "chest_pain": ["pain", "arm", "family"],
            "headache": ["head", "light", "neck"],
            "abdominal_pain": ["pain", "eat", "fever"]
        }
        
        scenario_critical = critical_questions.get(self.scenario_type, [])
        asked_keywords = set()
        for q in self.asked_questions:
            for keyword in scenario_critical:
                if keyword in q:
                    asked_keywords.add(keyword)
        
        missed = set(scenario_critical) - asked_keywords
        if missed:
            feedback["missed_questions"] = [f"Consider asking about: {m}" for m in missed]
        
        # Communication tips
        if len(self.conversation_history) > 5:
            feedback["communication_tips"].append("Good thoroughness in history taking.")
        else:
            feedback["communication_tips"].append("Consider asking more follow-up questions.")
        
        return feedback
    
    def get_case_summary(self) -> Dict:
        """Return complete case information for debriefing."""
        return {
            "patient_profile": {
                "name": self.patient.name,
                "age": self.patient.age,
                "gender": self.patient.gender,
                "occupation": self.patient.occupation
            },
            "chief_complaint": self.scenario["chief_complaint"],
            "history_of_present_illness": self.scenario["hpi"],
            "key_findings": self.scenario["findings"],
            "hidden_concerns": self.scenario.get("hidden", []),
            "questions_asked": len(self.asked_questions),
            "information_revealed": len(self.revealed_info),
            "conversation_transcript": self.conversation_history
        }


def main():
    """CLI interface for testing."""
    import sys
    
    scenario = sys.argv[1] if len(sys.argv) > 1 else "chest_pain"
    
    print(f"🩺 Virtual Patient Simulator - {scenario.upper()}\n")
    print("=" * 50)
    
    simulator = PatientSimulator(scenario=scenario)
    
    print(f"\nPatient: {simulator.patient.name}, {simulator.patient.age}y {simulator.patient.gender}")
    print(f"Occupation: {simulator.patient.occupation}")
    print(f"\nThe patient enters the room. Chief complaint: {simulator.scenario['chief_complaint']}")
    print("\nType your questions (or 'quit' to exit, 'summary' for debrief):\n")
    
    while True:
        try:
            question = input("> ").strip()
            
            if question.lower() == 'quit':
                break
            elif question.lower() == 'summary':
                summary = simulator.get_case_summary()
                print("\n" + "=" * 50)
                print("CASE DEBRIEF")
                print("=" * 50)
                print(json.dumps(summary, indent=2))
                continue
            
            if not question:
                continue
            
            result = simulator.ask(question)
            print(f"\n🧑 Patient: {result['patient_response']}")
            if result['physical_cues']:
                print(f"   [Physical: {result['physical_cues']}]")
            
            if result['feedback']['missed_questions']:
                print(f"\n💡 Tip: {result['feedback']['missed_questions'][0]}")
            
        except KeyboardInterrupt:
            break
        except Exception as e:
            print(f"Error: {e}")
    
    print("\nSession ended.")


if __name__ == "__main__":
    main()


Upset Plot Converter

Skill

Convert complex Venn diagrams with more than 4 sets to clearer Upset Plots.

---
name: upset-plot-converter
description: Convert complex Venn diagrams with more than 4 sets to clearer Upset Plots.
license: MIT
skill-author: AIPOCH
---
# Upset Plot Converter

Convert complex Venn diagrams (more than 4 sets) to clearer Upset Plots.

## When to Use

- Use this skill when the task is to convert complex Venn diagrams with more than 4 sets to clearer Upset Plots.
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

- Scope-focused workflow aligned to converting complex Venn diagrams with more than 4 sets to clearer Upset Plots.
- Packaged executable path(s): `scripts/main.py`.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

See `## Prerequisites` below for installation details.

- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `matplotlib`: `unspecified`. Declared in `requirements.txt`.
- `numpy`: `unspecified`. Declared in `requirements.txt`.

## Example Usage

See `## Usage` below for the Python API.

```bash
cd "20260318/scientific-skills/Data Analytics/upset-plot-converter"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```
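
The same parse check can be run from Python, for example inside a test harness. The following is a minimal sketch using the standard-library `py_compile` module; the helper name `can_compile` is hypothetical and not part of the skill.

```python
import py_compile
import tempfile
from pathlib import Path


def can_compile(script_path: str) -> bool:
    """Return True if the file parses as valid Python (hypothetical helper)."""
    try:
        # doraise=True raises PyCompileError on a syntax error instead of printing it.
        # The bytecode goes to a temp directory so nothing is written next to the script.
        with tempfile.TemporaryDirectory() as tmp:
            py_compile.compile(script_path, cfile=str(Path(tmp) / "check.pyc"), doraise=True)
        return True
    except py_compile.PyCompileError:
        return False
```

Calling `can_compile("scripts/main.py")` from the skill directory would mirror the shell command above.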

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Usage

```python
from skills.upset_plot_converter.scripts.main import convert_venn_to_upset

# From set data
sets = {
    'A': {1, 2, 3, 4, 5},
    'B': {4, 5, 6, 7, 8},
    'C': {3, 5, 7, 9, 10},
    'D': {2, 4, 6, 8, 10},
    'E': {1, 3, 5, 7, 9}
}
convert_venn_to_upset(sets, output_path="upset_plot.png")

# From list data
from skills.upset_plot_converter.scripts.main import upset_from_lists
set_names = ['Genes A', 'Genes B', 'Genes C', 'Genes D', 'Genes E']
lists = [
    ['gene1', 'gene2', 'gene3'],
    ['gene2', 'gene4', 'gene5'],
    ['gene3', 'gene5', 'gene6'],
    ['gene7', 'gene8', 'gene9'],
    ['gene1', 'gene10', 'gene11']
]
upset_from_lists(set_names, lists, output_path="gene_upset.png", title="Gene Intersections")
```

## Input

- **sets**: Dictionary of set names to sets/lists of elements, OR
- **set_names**: List of set names
- **lists**: List of lists (each containing elements)
- **output_path**: Path to save the output figure
- **title**: Optional title for the plot
- **min_subset_size**: Minimum subset size to display (default: 1)
- **max_intersections**: Maximum number of intersections to show (default: 30)
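
To make the last two parameters concrete, the sketch below applies the same order of operations that `_prepare_upset_data` uses in `scripts/main.py`: drop intersections smaller than `min_subset_size`, sort by size in descending order, then keep the first `max_intersections`. The `select_intersections` helper and the sample sizes are illustrative only.

```python
from typing import Dict, FrozenSet, List, Tuple


def select_intersections(
    sizes: Dict[FrozenSet[str], int],
    min_subset_size: int = 1,
    max_intersections: int = 30,
) -> List[Tuple[FrozenSet[str], int]]:
    """Filter by minimum size, sort largest first, then keep the top N."""
    kept = [(k, n) for k, n in sizes.items() if n >= min_subset_size]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_intersections]


# Made-up intersection sizes for illustration.
sample = {
    frozenset({"A"}): 1,
    frozenset({"A", "B"}): 4,
    frozenset({"B", "C"}): 2,
    frozenset({"A", "B", "C"}): 7,
}
top = select_intersections(sample, min_subset_size=2, max_intersections=2)
print([(sorted(k), n) for k, n in top])
# [(['A', 'B', 'C'], 7), (['A', 'B'], 4)]
```

Because truncation happens after sorting, a low `max_intersections` always drops the smallest intersections first.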

## Output

PNG file of the Upset Plot visualization.

## Notes

- When Venn diagrams exceed 4 sets, they become difficult to read
- Upset Plots provide a clearer alternative for visualizing set intersections
- The x-axis shows set intersections as dot patterns
- Bar heights represent the size of each intersection
- Automatically sorts intersections by size for better readability
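
Each bar counts the elements that belong to exactly one combination of sets, so the bars together partition the union of all sets. A minimal standalone sketch of that grouping, following the approach of `_compute_intersections` in `scripts/main.py` (the `exclusive_intersections` name and the sample data are hypothetical):

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Hashable, Set


def exclusive_intersections(
    sets: Dict[str, Set[Hashable]],
) -> Dict[FrozenSet[str], Set[Hashable]]:
    """Group every element by the exact combination of sets that contain it."""
    groups: Dict[FrozenSet[str], Set[Hashable]] = defaultdict(set)
    for elem in set().union(*sets.values()):
        # The membership signature is the set of names whose set holds this element.
        members = frozenset(name for name, s in sets.items() if elem in s)
        groups[members].add(elem)
    return dict(groups)


groups = exclusive_intersections({"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 5}})
for members, elems in groups.items():
    print(sorted(members), sorted(elems))
```

Here element `3` is counted only under the combination A, B, C and not again under A and B, which is why the group sizes sum to the number of unique elements.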

## Requirements

- matplotlib
- numpy

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python script executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
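
The two path-related items can be checked before any file is read or written. The sketch below is one possible approach using `pathlib`, not something `scripts/main.py` currently does; the `resolve_inside_workspace` helper is hypothetical and relies on `Path.is_relative_to`, available from Python 3.9.

```python
from pathlib import Path


def resolve_inside_workspace(user_path: str, workspace: str) -> Path:
    """Resolve user_path against workspace and reject anything that escapes it."""
    root = Path(workspace).resolve()
    candidate = (root / user_path).resolve()
    # resolve() collapses ".." segments, so an escaping path no longer sits under root.
    # An absolute user_path replaces root entirely and is rejected the same way.
    if not candidate.is_relative_to(root):
        raise ValueError(f"Path escapes workspace: {user_path}")
    return candidate
```

A caller would pass the validated result as `output_path`, for example `convert_venn_to_upset(sets, output_path=str(resolve_inside_workspace("upset_plot.png", ".")))`.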

## Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
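
One way to capture the failure point when `scripts/main.py` fails is to run it in a subprocess and turn the outcome into a small report. This is a sketch under assumptions: the `run_script` helper and the report keys (`ok`, `failure_point`, `detail`) are hypothetical and not defined by the skill.

```python
import subprocess
import sys
from typing import Dict, List


def run_script(args: List[str], timeout: float = 60.0) -> Dict[str, object]:
    """Run a Python script and return a structured success/failure report."""
    try:
        proc = subprocess.run(
            [sys.executable, *args],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "failure_point": "timeout", "detail": f"no result after {timeout}s"}
    if proc.returncode != 0:
        # Keep only the last stderr line so the report does not expose a full stack trace.
        lines = proc.stderr.strip().splitlines()
        return {"ok": False, "failure_point": "non-zero exit", "detail": lines[-1] if lines else ""}
    return {"ok": True, "failure_point": None, "detail": proc.stdout.strip()}
```

For this skill the call would be `run_script(["scripts/main.py"])`; a report with `ok` set to `False` is the cue to describe the manual fallback instead of claiming a result.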

## Input Validation

This skill accepts requests that match the documented purpose of `upset-plot-converter` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `upset-plot-converter` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
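
If responses are assembled programmatically, the fixed structure can be enforced with a small renderer. The sketch below is illustrative; `SECTIONS` copies the seven headings above, while `render_response` and the "Not provided" placeholder are assumptions.

```python
from typing import Dict

SECTIONS = [
    "Objective",
    "Inputs Received",
    "Assumptions",
    "Workflow",
    "Deliverable",
    "Risks and Limits",
    "Next Checks",
]


def render_response(content: Dict[str, str]) -> str:
    """Render the seven sections in fixed order; missing ones are flagged, not dropped."""
    lines = []
    for number, title in enumerate(SECTIONS, start=1):
        body = content.get(title, "Not provided")
        lines.append(f"{number}. {title}: {body}")
    return "\n".join(lines)
```

Flagging a missing section rather than omitting it keeps gaps in assumptions or limits visible to the reader.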

FILE:requirements.txt
matplotlib
numpy

FILE:scripts/main.py
#!/usr/bin/env python3
"""
Upset Plot Converter
Convert complex Venn diagrams (>4 sets) to clearer Upset Plots.
"""

import matplotlib.pyplot as plt
import numpy as np
from collections import defaultdict
from typing import Dict, List, Set, Optional, Tuple


def _compute_intersections(sets: Dict[str, Set]) -> Dict[frozenset, Set]:
    """Compute all possible intersections between sets."""
    set_names = list(sets.keys())
    all_elements = set()
    for s in sets.values():
        all_elements.update(s)
    
    intersections = defaultdict(set)
    
    for elem in all_elements:
        # Find which sets contain this element
        containing_sets = frozenset(name for name, s in sets.items() if elem in s)
        if containing_sets:  # Only if element is in at least one set
            intersections[containing_sets].add(elem)
    
    return intersections


def _prepare_upset_data(
    intersections: Dict[frozenset, Set],
    set_names: List[str],
    min_subset_size: int = 1,
    max_intersections: int = 30
) -> Tuple[List[frozenset], List[int], np.ndarray]:
    """Prepare data for upset plot rendering."""
    # Filter by minimum subset size
    filtered = {k: v for k, v in intersections.items() if len(v) >= min_subset_size}
    
    # Sort by size (descending) and take top N
    sorted_items = sorted(filtered.items(), key=lambda x: len(x[1]), reverse=True)
    sorted_items = sorted_items[:max_intersections]
    
    if not sorted_items:
        return [], [], np.array([])
    
    intersection_sets = [item[0] for item in sorted_items]
    sizes = [len(item[1]) for item in sorted_items]
    
    # Create membership matrix
    n_sets = len(set_names)
    n_intersections = len(intersection_sets)
    membership = np.zeros((n_intersections, n_sets), dtype=int)
    
    for i, inter_set in enumerate(intersection_sets):
        for j, name in enumerate(set_names):
            if name in inter_set:
                membership[i, j] = 1
    
    return intersection_sets, sizes, membership


def _plot_upset(
    set_names: List[str],
    intersection_sets: List[frozenset],
    sizes: List[int],
    membership: np.ndarray,
    output_path: str,
    title: Optional[str] = None
) -> None:
    """Create the upset plot visualization."""
    n_sets = len(set_names)
    n_intersections = len(intersection_sets)
    
    if n_intersections == 0:
        print("No intersections to plot.")
        return
    
    # Layout proportions: bar chart on top, membership matrix below
    bar_height = 0.6
    matrix_height = 0.4
    set_label_width = 0.15
    
    fig_width = max(10, n_intersections * 0.5 + 3)
    fig_height = max(6, n_sets * 0.5 + 2)
    
    fig = plt.figure(figsize=(fig_width, fig_height))
    gs = fig.add_gridspec(2, 2, 
                          width_ratios=[set_label_width, 1 - set_label_width],
                          height_ratios=[bar_height, matrix_height],
                          wspace=0.05, hspace=0.1)
    
    # Bar chart (top right)
    ax_bars = fig.add_subplot(gs[0, 1])
    ax_bars.bar(range(n_intersections), sizes, color='#4CAF50', edgecolor='black', linewidth=0.5)
    ax_bars.set_xticks(range(n_intersections))
    ax_bars.set_xticklabels([])
    ax_bars.set_ylabel('Intersection Size', fontsize=10)
    ax_bars.set_xlim(-0.5, n_intersections - 0.5)
    
    # Add value labels on bars
    for i, (size, inter_set) in enumerate(zip(sizes, intersection_sets)):
        ax_bars.text(i, size + max(sizes) * 0.02, str(size), 
                    ha='center', va='bottom', fontsize=8, fontweight='bold')
    
    # Set labels (bottom left)
    ax_set_labels = fig.add_subplot(gs[1, 0])
    ax_set_labels.set_xlim(0, 1)
    ax_set_labels.set_ylim(-0.5, n_sets - 0.5)
    for i, name in enumerate(reversed(set_names)):
        ax_set_labels.text(0.9, i, name, ha='right', va='center', fontsize=9)
    ax_set_labels.axis('off')
    
    # Matrix (bottom right)
    ax_matrix = fig.add_subplot(gs[1, 1])
    ax_matrix.set_xlim(-0.5, n_intersections - 0.5)
    ax_matrix.set_ylim(-0.5, n_sets - 0.5)
    
    # Draw grid lines
    for i in range(n_sets):
        ax_matrix.axhline(i - 0.5, color='gray', linewidth=0.5, alpha=0.3)
    for i in range(n_intersections + 1):
        ax_matrix.axvline(i - 0.5, color='gray', linewidth=0.5, alpha=0.3)
    
    # Draw dots and connecting lines
    colors = plt.cm.tab10(np.linspace(0, 1, n_sets))
    
    for col in range(n_intersections):
        active_rows = []
        for row in range(n_sets):
            if membership[col, row] == 1:
                # Draw circle
                circle = plt.Circle((col, n_sets - 1 - row), 0.15, 
                                   color=colors[row], ec='black', linewidth=1)
                ax_matrix.add_patch(circle)
                active_rows.append(n_sets - 1 - row)
        
        # Draw connecting line if multiple sets in intersection
        if len(active_rows) > 1:
            min_y, max_y = min(active_rows), max(active_rows)
            ax_matrix.plot([col, col], [min_y, max_y], 'k-', linewidth=2)
    
    ax_matrix.set_xticks(range(n_intersections))
    ax_matrix.set_xticklabels([])
    ax_matrix.set_yticks(range(n_sets))
    ax_matrix.set_yticklabels([])
    ax_matrix.set_aspect('equal')
    
    # Top-left cell is a spacer: it keeps the grid aligned but draws nothing
    ax_totals = fig.add_subplot(gs[0, 0])
    ax_totals.axis('off')
    
    if title:
        fig.suptitle(title, fontsize=14, fontweight='bold', y=0.98)
    
    # Save and close this specific figure rather than relying on pyplot's current figure
    fig.savefig(output_path, dpi=150, bbox_inches='tight', facecolor='white')
    plt.close(fig)
    
    print(f"Upset plot saved to: {output_path}")


def convert_venn_to_upset(
    sets: Dict[str, Set],
    output_path: str = "upset_plot.png",
    title: Optional[str] = None,
    min_subset_size: int = 1,
    max_intersections: int = 30
) -> None:
    """
    Convert a dictionary of sets to an Upset Plot.
    
    Args:
        sets: Dictionary mapping set names to sets of elements
        output_path: Path to save the output figure
        title: Optional title for the plot
        min_subset_size: Minimum subset size to display
        max_intersections: Maximum number of intersections to show
    """
    if len(sets) <= 4:
        print(f"Warning: Only {len(sets)} sets provided. Venn diagrams work well for ≤4 sets.")
        print("Upset plot will still be generated for comparison.")
    
    if len(sets) > 10:
        print(f"Warning: {len(sets)} sets may produce a very large plot.")
    
    set_names = list(sets.keys())
    
    # Convert all values to sets
    sets_converted = {name: set(s) for name, s in sets.items()}
    
    # Compute intersections
    intersections = _compute_intersections(sets_converted)
    
    # Prepare data
    intersection_sets, sizes, membership = _prepare_upset_data(
        intersections, set_names, min_subset_size, max_intersections
    )
    
    # Generate plot
    _plot_upset(set_names, intersection_sets, sizes, membership, output_path, title)


def upset_from_lists(
    set_names: List[str],
    lists: List[List],
    output_path: str = "upset_plot.png",
    title: Optional[str] = None,
    min_subset_size: int = 1,
    max_intersections: int = 30
) -> None:
    """
    Create an Upset Plot from lists of elements.
    
    Args:
        set_names: List of names for each set
        lists: List of lists, each containing elements of a set
        output_path: Path to save the output figure
        title: Optional title for the plot
        min_subset_size: Minimum subset size to display
        max_intersections: Maximum number of intersections to show
    """
    if len(set_names) != len(lists):
        raise ValueError("set_names and lists must have the same length")
    
    sets = {name: set(lst) for name, lst in zip(set_names, lists)}
    
    convert_venn_to_upset(
        sets, output_path, title, min_subset_size, max_intersections
    )


def print_intersection_stats(sets: Dict[str, Set]) -> None:
    """Print statistics about set intersections."""
    intersections = _compute_intersections(sets)
    
    print("\n=== Set Intersection Statistics ===")
    print(f"Total unique elements: {len(set().union(*sets.values()))}")
    print(f"Number of sets: {len(sets)}")
    print(f"Number of unique intersections: {len(intersections)}")
    print("\nIntersections (sorted by size):")
    
    sorted_items = sorted(intersections.items(), key=lambda x: len(x[1]), reverse=True)
    
    for inter_set, elements in sorted_items:
        if len(inter_set) == 0:
            continue
        set_names = ', '.join(sorted(inter_set))
        print(f"  {set_names}: {len(elements)} elements")


# Example usage
if __name__ == "__main__":
    # Example with 5 sets
    sets = {
        'A': {1, 2, 3, 4, 5, 20, 21},
        'B': {4, 5, 6, 7, 8, 20, 22},
        'C': {3, 5, 7, 9, 10, 21, 22},
        'D': {2, 4, 6, 8, 10, 20, 23},
        'E': {1, 3, 5, 7, 9, 21, 23}
    }
    
    print_intersection_stats(sets)
    convert_venn_to_upset(sets, output_path="upset_plot.png", title="Example Upset Plot")


Unstructured Medical Text Miner

Skill

Mine unstructured clinical text from MIMIC-IV to extract diagnostic logic.

---
name: unstructured-medical-text-miner
description: Mine unstructured clinical text from MIMIC-IV to extract diagnostic logic.
license: MIT
skill-author: AIPOCH
---
# Unstructured Medical Text Miner (ID: 213)

## When to Use

- Use this skill when the task is to mine unstructured clinical text from MIMIC-IV and extract diagnostic logic.
- Use this skill for evidence insight tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

See `## Features` below for related details.

- Scope-focused workflow aligned to: Mine unstructured clinical text from MIMIC-IV to extract diagnostic logic.
- Packaged executable path(s): `scripts/__init__.py` plus one additional script (`scripts/main.py`, the entry point used by the commands below).
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

```
pandas>=1.3.0
spacy>=3.4.0
scispacy>=0.5.1
radlex (for radiology terminology)
negspacy (for negation detection)
```

## Example Usage

See `## Usage` below for the Python API.

```bash
cd "20260318/scientific-skills/Evidence Insight/unstructured-medical-text-miner"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/__init__.py` with additional helper scripts under `scripts/`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
python scripts/main.py -h
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Overview

Mine the long-overlooked free-text data in MIMIC-IV, extracting diagnostic logic, order details, and progress-note content from unstructured clinical notes.

## Purpose

The MIMIC-IV database contains large amounts of structured data (vital signs, laboratory results, etc.), but its true clinical value is often hidden in unstructured text:
- Diagnostic reasoning chains in discharge summaries
- Subtle finding descriptions in imaging reports
- Treatment decision logic in progress notes
- Personalized medication considerations in orders

This Skill provides a complete text mining toolchain to transform raw medical text into analyzable structured insights.

## Features

### 1. Text Extraction
- **NOTEEVENTS**: Extract clinical notes from MIMIC-IV NOTE module
- **Radiology Reports**: Extract imaging diagnostic text
- **ECG Reports**: Parse ECG interpretation text
- **Discharge Summaries**: Extract complete diagnostic and treatment course

### 2. Information Extraction
- **Entity Recognition**: Diseases, symptoms, medications, procedures, anatomical sites
- **Relation Extraction**: Medication-disease treatment relationships, symptom-disease diagnostic relationships
- **Timeline Extraction**: Event occurrence times, disease progression sequence
- **Negation Detection**: Identify negated clinical findings (e.g., "no fever")
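
As a rough illustration of negation detection, the sketch below flags a finding as negated when a cue word precedes it within the same sentence. The cue list and the `is_negated` helper are assumptions made for illustration, not the skill's actual implementation, and a production system would rely on a fuller lexicon such as the one shipped with negspacy.

```python
import re

# Hypothetical, deliberately short cue list used only for this illustration
NEGATION_CUES = re.compile(r"\b(no|not|denies|without|negative for)\b", re.IGNORECASE)


def is_negated(text: str, start: int, window: int = 50) -> bool:
    """Return True if a negation cue precedes the span within the same sentence."""
    context = text[max(0, start - window):start]
    # Keep only the part after the last sentence boundary in the window
    context = re.split(r"[.!?\n]", context)[-1]
    return NEGATION_CUES.search(context) is not None


note = "Patient has no fever. Cough persists."
print(is_negated(note, note.index("fever")))  # cue "no" sits in the same sentence
print(is_negated(note, note.index("Cough")))  # the only cue is in the previous sentence
```

Restricting the context to the current sentence is a design choice that keeps a cue such as "no" from spilling over onto findings in the following sentence.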

### 3. Clinical Logic Parsing
- **Diagnostic Reasoning Chain**: Reasoning path from symptoms → examination → diagnosis
- **Treatment Decision Tree**: Clinical basis for medication selection and dosage adjustment
- **Disease Progression**: Disease progression and outcome descriptions

### 4. Structured Output
- FHIR-compatible clinical document format
- Knowledge graph-friendly triple format
- Temporal event sequences
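
One way to picture the knowledge graph-friendly triple format is as `(subject, predicate, object)` tuples. The sketch below assumes relation records shaped like the dictionaries built by `extract_relations` in `scripts/main.py` (`subject`, `predicate`, `object`, `confidence`). The `to_triples` helper, its `min_confidence` parameter, and the sample records are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def to_triples(relations: List[Dict], min_confidence: float = 0.0) -> List[Triple]:
    """Convert relation dicts into (subject, predicate, object) tuples."""
    return [
        (r["subject"], r["predicate"], r["object"])
        for r in relations
        if r.get("confidence", 1.0) >= min_confidence
    ]


# Made-up sample relations, for illustration only
relations = [
    {"subject": "aspirin", "predicate": "TREATS", "object": "STEMI", "confidence": 0.7},
    {"subject": "dyspnea", "predicate": "CAUSED_BY", "object": "heart failure", "confidence": 0.6},
]
triples = to_triples(relations, min_confidence=0.65)
print(triples)
```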

## Usage

```python
from skills.unstructured_medical_text_miner.scripts.main import MedicalTextMiner

# Initialize miner
miner = MedicalTextMiner()

# Load MIMIC-IV note data
miner.load_notes(notes_path="path/to/noteevents.csv")

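# Extract all text records for a specific patient (a list of note dicts)
patient_texts = miner.get_patient_texts(subject_id=10000032)

# Execute complete information extraction on each note.
# extract_insights expects a single string, not the list of records.
insights = [
    miner.extract_insights(
        text=note["note_text"],
        extract_entities=True,
        extract_relations=True,
        extract_timeline=True
    )
    for note in patient_texts
]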
```

## Input

### Data Sources
- MIMIC-IV NOTEEVENTS table (csv/parquet format)
- Discharge summary files
- Imaging report files
- Custom medical text

### Field Requirements
| Field Name | Description | Required |
|--------|------|------|
| subject_id | Patient unique identifier | Yes |
| hadm_id | Hospital admission record identifier | No |
| note_type | Note type (DS/RR/ECG, etc.) | Yes |
| note_text | Note text content | Yes |
| charttime | Record time | No |
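
The required fields in the table can be checked before any extraction runs. The following is a minimal sketch that assumes each note is available as a plain dictionary; the `missing_fields` helper is hypothetical and not part of the packaged script.

```python
from typing import Dict, List

# Required fields taken from the table above
REQUIRED_FIELDS = ("subject_id", "note_type", "note_text")


def missing_fields(record: Dict) -> List[str]:
    """Return the required fields that are absent or empty in a note record."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


record = {"subject_id": 10000032, "note_type": "DS", "note_text": ""}
print(missing_fields(record))
```

Reporting the exact missing field names fits the error-handling guidance of stating which inputs are missing rather than guessing.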

## Output

### Entity Extraction Results
```json
{
  "entities": [
    {
      "text": "acute myocardial infarction",
      "type": "DISEASE",
      "start": 156,
      "end": 183,
      "confidence": 0.94
    },
    {
      "text": "aspirin 81mg",
      "type": "MEDICATION",
      "start": 245,
      "end": 257,
      "attributes": {
        "dose": "81mg",
        "frequency": "daily"
      }
    }
  ]
}
```
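
In the example above, `end - start` equals the length of `text` for both entities, which is consistent with end-exclusive character offsets (the same convention as Python slicing). Under that assumption, a span can be sanity-checked against the source note as sketched below; `span_is_consistent` and the sample note are hypothetical.

```python
from typing import Dict


def span_is_consistent(note_text: str, entity: Dict) -> bool:
    """Check that note_text[start:end] reproduces the entity's text."""
    return note_text[entity["start"]:entity["end"]] == entity["text"]


# Made-up note fragment for illustration
note_text = "Started aspirin 81mg daily."
entity = {"text": "aspirin 81mg", "type": "MEDICATION", "start": 8, "end": 20}
print(span_is_consistent(note_text, entity))
```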

### Clinical Logic Graph
```json
{
  "clinical_logic": {
    "presenting_complaint": "chest pain",
    "differential_diagnoses": ["ACS", "PE", "aortic dissection"],
    "workup": ["ECG", "troponin", "CTA chest"],
    "final_diagnosis": "STEMI",
    "treatment_plan": ["PCI", "dual antiplatelet"]
  }
}
```

### Temporal Events
```json
{
  "timeline": [
    {
      "time": "2020-03-15 08:30",
      "event": "admission",
      "description": "presented with chest pain"
    },
    {
      "time": "2020-03-15 09:15",
      "event": "ECG",
      "description": "ST elevation in V1-V4"
    }
  ]
}
```
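
Because events may be extracted out of order, a consumer will often want to sort them chronologically. The sketch below assumes every event carries a `time` string in the `YYYY-MM-DD HH:MM` form shown above; `sort_events` is a hypothetical helper.

```python
from datetime import datetime
from typing import Dict, List

TIME_FORMAT = "%Y-%m-%d %H:%M"  # matches the "time" values in the example above


def sort_events(timeline: List[Dict]) -> List[Dict]:
    """Return events in chronological order based on their "time" field."""
    return sorted(timeline, key=lambda e: datetime.strptime(e["time"], TIME_FORMAT))


timeline = [
    {"time": "2020-03-15 09:15", "event": "ECG"},
    {"time": "2020-03-15 08:30", "event": "admission"},
]
print([e["event"] for e in sort_events(timeline)])
```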

## Configuration

```yaml
# config.yaml
extraction:
  entity_types: ["DISEASE", "SYMPTOM", "MEDICATION", "PROCEDURE", "ANATOMY"]
  relation_types: ["TREATS", "CAUSES", "CONTRAINDICATED_WITH"]
  enable_negation_detection: true
  
models:
  ner_model: "en_core_sci_lg"  # or "en_core_sci_scibert"
  relation_model: "custom_relation_extractor"
  
output:
  format: "json"  # json/fhir/kg
  include_raw_text: false
```
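
The constructor in `scripts/main.py` looks up `ner_model` at the top level of the dictionary it receives, whereas the file above nests it under `models`. One possible way to bridge the two, assuming the YAML has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`), is to flatten the sections. The `flatten_config` helper is hypothetical.

```python
from typing import Dict


def flatten_config(cfg: Dict) -> Dict:
    """Merge the top-level sections of a parsed config into one flat dict."""
    flat: Dict = {}
    for section in cfg.values():
        if isinstance(section, dict):
            # Later sections overwrite earlier ones if key names collide
            flat.update(section)
    return flat


# Stand-in for the result of parsing config.yaml
parsed = {
    "extraction": {"enable_negation_detection": True},
    "models": {"ner_model": "en_core_sci_lg"},
    "output": {"format": "json", "include_raw_text": False},
}
flat = flatten_config(parsed)
print(flat["ner_model"])
```

The flattened dictionary could then be passed as `MedicalTextMiner(config=flat)`.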

## CLI Usage

```bash
# Process single file
python -m skills.unstructured_medical_text_miner.scripts.main \
  --input notes.csv \
  --output extracted.json \
  --extract all

# Process specific patient
python -m skills.unstructured_medical_text_miner.scripts.main \
  --subject-id 10000032 \
  --db-path mimic_iv.db \
  --output patient_insights.json
```

## References

1. MIMIC-IV Clinical Database: https://physionet.org/content/mimiciv/
2. scispacy: https://allenai.github.io/scispacy/
3. NegEx/negspacy for negation detection
4. FHIR Clinical Document specifications

## Author

- Skill ID: 213
- Category: Medical Data Mining
- Complexity: Advanced

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
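
The path-related items in this checklist can be approximated with a containment check. The sketch below is one possible approach rather than the skill's own validation logic; `is_within_workspace` and the example paths are hypothetical.

```python
from pathlib import Path


def is_within_workspace(candidate: str, workspace: str) -> bool:
    """Return True if candidate resolves to a location inside workspace."""
    root = Path(workspace).resolve()
    # Resolving collapses any "../" segments before the comparison
    target = (root / candidate).resolve()
    return target == root or root in target.parents


print(is_within_workspace("out/extracted.json", "/tmp/workspace"))
print(is_within_workspace("../secrets.txt", "/tmp/workspace"))
```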

## Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.

## Input Validation

This skill accepts requests that match the documented purpose of `unstructured-medical-text-miner` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `unstructured-medical-text-miner` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## References

- [references/audit-reference.md](references/audit-reference.md) - Supported scope, audit commands, and fallback boundaries

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.

FILE:config.yaml
# Unstructured Medical Text Miner Configuration

extraction:
  # Entity types to extract
  entity_types:
    - DISEASE
    - SYMPTOM
    - MEDICATION
    - PROCEDURE
    - ANATOMY
    - LAB_VALUE
  
  # Relation types to extract
  relation_types:
    - TREATS
    - CAUSES
    - CONTRAINDICATED_WITH
    - INDICATES
  
  # Enable negation detection
  enable_negation_detection: true
  
  # Text preprocessing options
  preprocessing:
    lowercase: false
    remove_deidentification_brackets: true  # Strip markers such as [**Name**]
    normalize_whitespace: true

models:
  # spaCy model selection
  # General-purpose options: en_core_web_sm, en_core_web_md, en_core_web_lg
  # Biomedical options: en_core_sci_lg, en_core_sci_scibert
  ner_model: "en_core_sci_lg"
  
  # Relation extraction model (custom)
  relation_model: null

output:
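  # Output format: json, fhir, kg (knowledge graph)
  format: "json"
  
  # Whether to include the raw text
  include_raw_text: false
  
  # Whether to include span positions
  include_spans: true
  
  # Confidence threshold
  confidence_threshold: 0.5

# Processing rules for specific note types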
note_type_rules:
  DS:  # Discharge Summary
    sections_to_extract:
      - "Chief Complaint"
      - "History of Present Illness"
      - "Hospital Course"
      - "Discharge Diagnosis"
      - "Discharge Medications"
  
  RR:  # Radiology Report
    sections_to_extract:
      - "Clinical History"
      - "Findings"
      - "Impression"
  
  ECG:  # ECG Report
    sections_to_extract:
      - "Interpretation"
      - "Conclusion"

# Performance settings
performance:
  # Batch size
  batch_size: 100
  
  # Number of parallel workers
  n_workers: 4
  
  # Memory limit (MB)
  memory_limit: 4096

FILE:example.py
"""Example usage of Unstructured Medical Text Miner

Demonstrate how to use a medical text miner to process MIMIC-IV data"""

from scripts.main import MedicalTextMiner

# Example 1: Processing a single text fragment
sample_discharge_summary = """
DISCHARGE SUMMARY

Patient: [**Name**] (M, 67 y/o)
Admission Date: [**Date**]
Discharge Date: [**Date**]

CHIEF COMPLAINT:
Chest pain for 3 hours

HISTORY OF PRESENT ILLNESS:
Patient presented with acute onset chest pain radiating to left arm. 
No prior history of coronary artery disease. Denies dyspnea, nausea, or diaphoresis.
Vital signs on admission: BP 140/90, HR 98, RR 18, SpO2 97% on room air.

HOSPITAL COURSE:
ECG showed ST elevation in V1-V4 consistent with anterior STEMI.
Troponin I peaked at 15.6 ng/mL.
Patient underwent emergent cardiac catheterization with PCI to LAD.
Started on aspirin 81mg daily, clopidogrel 75mg daily, atorvastatin 40mg daily.
No complications post-procedure. Chest pain resolved.

DISCHARGE DIAGNOSIS:
1. Acute anterior ST-elevation myocardial infarction (STEMI), successfully treated with PCI
2. Hypertension

DISCHARGE MEDICATIONS:
1. Aspirin 81mg PO daily
2. Clopidogrel 75mg PO daily  
3. Atorvastatin 40mg PO daily
4. Lisinopril 10mg PO daily

FOLLOW UP:
Cardiology clinic in 1 week
"""


def demo_single_text():
    """Demonstrate entity and relationship extraction from a single text"""
    print("=" * 60)
    print("DEMO 1: Single Text Extraction")
    print("=" * 60)
    
    miner = MedicalTextMiner()
    
    # Extract complete insights
    insights = miner.extract_insights(
        sample_discharge_summary,
        extract_entities=True,
        extract_relations=True,
        extract_timeline=True,
        extract_logic=True
    )
    
    print(f"\nExtracted {insights['entity_count']} entities:")
    for entity_type, entities in insights.get('entities_by_type', {}).items():
        print(f"\n  {entity_type}:")
        for e in entities[:3]:  # Only show first 3
            neg_mark = " [NEGATED]" if e.get('negated') else ""
            print(f"    - {e['text']}{neg_mark}")
    
    print(f"\nExtracted {insights.get('relation_count', 0)} relations:")
    for r in insights.get('relations', [])[:3]:
        print(f"  - {r['subject']} --{r['predicate']}--> {r['object']}")
    
    print("\nClinical Logic:")
    logic = insights.get('clinical_logic', {})
    if logic.get('presenting_complaint'):
        print(f"  Chief Complaint: {logic['presenting_complaint']}")
    if logic.get('workup'):
        print(f"  Workup: {', '.join(logic['workup'])}")
    
    return insights


def demo_patient_processing():
    """Demonstrates patient-level processing (requires MIMIC-IV data files)"""
    print("\n" + "=" * 60)
    print("DEMO 2: Patient Processing (requires MIMIC-IV data)")
    print("=" * 60)
    
    miner = MedicalTextMiner()
    
    # NOTE: This requires actual MIMIC-IV data files
    # Uncomment below and provide correct file path to run
    
    # miner.load_notes("path/to/noteevents.csv", note_types=["DS", "RR"])
    # patient_insights = miner.process_patient(subject_id=10000032)
    # 
    # print(miner.generate_summary_report(patient_insights))
    # miner.export_to_json(patient_insights, "patient_10000032_insights.json")
    
    print("""
    To run patient processing:
    
    1. Download MIMIC-IV NOTEEVENTS data from PhysioNet
    2. Run:
       
       miner = MedicalTextMiner()
       miner.load_notes("noteevents.csv", note_types=["DS"])
       insights = miner.process_patient(subject_id=10000032)
       miner.export_to_json(insights, "output.json")
    """)


def demo_regex_patterns():
    """Demonstrates regular expression pattern matching"""
    print("\n" + "=" * 60)
    print("DEMO 3: Regex Pattern Matching")
    print("=" * 60)
    
    test_text = """
    The patient was diagnosed with acute myocardial infarction.
    Started on heparin drip and metformin 500mg twice daily.
    No evidence of pneumonia. Troponin elevated at 5.2 ng/mL.
    """
    
    miner = MedicalTextMiner()
    entities = miner.extract_entities_regex(test_text)
    
    print(f"\nFound {len(entities)} entities:")
    for e in entities:
        neg_mark = " [NEGATED]" if e.get('negated') else ""
        print(f"  - [{e['type']}] {e['text']}{neg_mark}")


if __name__ == "__main__":
    # Run the demo
    demo_single_text()
    demo_regex_patterns()
    demo_patient_processing()

FILE:references/audit-reference.md
# Audit Reference

## Scope

- Skill directory: `unstructured-medical-text-miner`
- Core purpose: Mine unstructured clinical text from MIMIC-IV to extract diagnostic logic.
- Use only within the documented workflow and category boundary defined in `SKILL.md`.

## Supported Audit Paths

- `python -m py_compile scripts/main.py`
- `python scripts/main.py --help`
- `python scripts/main.py -h`

## Fallback Boundary

If required inputs are incomplete, the skill should still return:

- the missing required inputs
- the steps that can still be completed safely
- assumptions that need confirmation before execution
- the next checks before accepting the final deliverable

FILE:requirements.txt
negspacy
numpy
pandas
pyarrow
pyyaml
scispacy
spacy
tqdm

FILE:scripts/main.py
#!/usr/bin/env python3
"""Unstructured Medical Text Miner (Skill ID: 213)

Mines MIMIC-IV clinical text and extracts diagnostic and treatment information from unstructured notes.
Supports entity recognition, relation extraction, timeline analysis, and clinical logic parsing."""

import argparse
import json
import re
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Union, Any
import warnings

import pandas as pd

# Try importing optional dependencies
try:
    import spacy
    SPACY_AVAILABLE = True
except ImportError:
    SPACY_AVAILABLE = False
    warnings.warn("spacy not installed. NER functionality will be limited.")

try:
    from negspacy.negation import Negex
    NEGSPACY_AVAILABLE = True
except ImportError:
    NEGSPACY_AVAILABLE = False


class MedicalTextMiner:
    """MIMIC-IV unstructured medical text miner
    
    Main functions:
    1. Load and preprocess text data such as NOTEEVENTS
    2. Entity recognition (diseases, symptoms, medications, procedures, etc.)
    3. Negation detection (identifying clinical findings that are negated)
    4. Relation extraction (medication-disease relations, etc.)
    5. Timeline extraction (disease progression sequence)
    6. Clinical logic parsing (diagnostic and treatment reasoning)"""
    
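    # Regex patterns for each medical entity type
    ENTITY_PATTERNS = {
        "MEDICATION": [
            r"\b(aspirin|warfarin|heparin|metformin|insulin|amoxicillin|"
            r"azithromycin|morphine|fentanyl|propofol|norepinephrine|"
            r"phenylephrine|dopamine|epinephrine)\b",
            # Dose pattern: a number followed by a unit, e.g. "81mg" or "500 mg".
            # Requiring digits avoids tagging ordinary words that end in "g".
            r"\b\d+(?:\.\d+)?\s*(?:mg|mcg|g|units?)\b"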
        ],
        "DISEASE": [
            r"\b(acute myocardial infarction|AMI|STEMI|NSTEMI|heart failure|"
            r"pneumonia|sepsis|ARDS|COPD|asthma|diabetes mellitus|"
            r"hypertension|atrial fibrillation|AFib|stroke|DVT|PE)\b",
            r"\b\w+(?:itis|osis|emia|oma|pathy| syndrome)\b"
        ],
        "SYMPTOM": [
            r"\b(chest pain|dyspnea|shortness of breath|fatigue|nausea|"
            r"vomiting|fever|chills|headache|dizziness|syncope|"
            r"palpitations|cough|hemoptysis)\b"
        ],
        "PROCEDURE": [
            r"\b(PCI|CABG|ECG|EKG|echocardiogram|CT|MRI|chest X-ray|"
            r"bronchoscopy|intubation|extubation|thoracentesis|"
            r"paracentesis|lumbar puncture)\b"
        ],
        "ANATOMY": [
            r"\b(left ventricle|right ventricle|left atrium|right atrium|"
            r"mitral valve|aortic valve|tricuspid valve|pulmonary valve|"
            r"coronary artery|LAD|LCx|RCA|lung|liver|kidney|spleen)\b"
        ],
        "LAB_VALUE": [
            r"\b(troponin|creatinine|BNP|NT-proBNP|WBC|hemoglobin|platelet|"
            r"INR|PTT|PT|glucose|HbA1c|lactate|pH|PCO2|PO2)\b"
        ]
    }
    
    # Negation cue patterns
    NEGATION_CUES = [
        r"\b(no|not|denies|denied|without|absent|negative for|"
        r"ruled out|no evidence of|non-)\b"
    ]
    
    # time pattern
    TIME_PATTERNS = [
        r"\b(\d{1,2}/\d{1,2}/\d{2,4})\b",  # MM/DD/YYYY
        r"\b(\d{4}-\d{2}-\d{2})\b",       # YYYY-MM-DD
        r"\b(\d{1,2}:\d{2}(?::\d{2})?\s*(?:AM|PM|am|pm)?)\b",  # time
        r"\b(on admission|at presentation|post-operatively|"
        r"pre-operatively|on day \d+|POD \#?\d+)\b"
    ]
    
    def __init__(self, config: Optional[Dict] = None):
        """Initialize MedicalTextMiner
        
        Args:
            config: configuration dictionary, which can include model path, output format and other settings"""
        self.config = config or {}
        self.notes_df: Optional[pd.DataFrame] = None
        self.nlp = None
        
        # Initialize spaCy model (if available)
        self._init_nlp()
        
    def _init_nlp(self):
        """Initialize NLP model"""
        if not SPACY_AVAILABLE:
            return
            
        model_name = self.config.get("ner_model", "en_core_web_sm")
        try:
            self.nlp = spacy.load(model_name)
            
            # Add the negation detection component. Importing negspacy above
            # registers the "negex" factory, so the pipe is added by name.
            if NEGSPACY_AVAILABLE:
                self.nlp.add_pipe("negex", last=True)
                
        except OSError:
            warnings.warn(f"Model {model_name} not found. Running in regex-only mode.")
            self.nlp = None
    
    def load_notes(self, notes_path: Union[str, Path], 
                   note_types: Optional[List[str]] = None) -> pd.DataFrame:
        """Load MIMIC-IV NOTEEVENTS data
        
        Args:
            notes_path: NOTEEVENTS file path (CSV or Parquet)
            note_types: filter specific note types (such as ['DS', 'RR'])
            
        Returns:
            Loaded DataFrame"""
        path = Path(notes_path)
        
        if path.suffix == '.csv':
            df = pd.read_csv(path)
        elif path.suffix in ['.parquet', '.pq']:
            df = pd.read_parquet(path)
        else:
            raise ValueError(f"Unsupported file format: {path.suffix}")
        
        # Standardized column names
        df.columns = df.columns.str.lower()
        
        # Filter note types
        if note_types and 'note_type' in df.columns:
            df = df[df['note_type'].isin(note_types)]
        
        # Convert time column
        if 'charttime' in df.columns:
            df['charttime'] = pd.to_datetime(df['charttime'], errors='coerce')
        
        self.notes_df = df
        print(f"Loaded {len(df)} notes from {path}")
        
        return df
    
    def get_patient_texts(self, subject_id: int, 
                          hadm_id: Optional[int] = None) -> List[Dict]:
        """Get all text records for a specific patient
        
        Args:
            subject_id: patient ID
            hadm_id: Optional hospitalization record ID
            
        Returns:
            List of text records for this patient"""
        if self.notes_df is None:
            raise ValueError("No notes loaded. Call load_notes() first.")
        
        mask = self.notes_df['subject_id'] == subject_id
        if hadm_id is not None:
            mask = mask & (self.notes_df['hadm_id'] == hadm_id)
        
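        patient_notes = self.notes_df[mask]
        # charttime is an optional field, so sort only when it is present
        if 'charttime' in patient_notes.columns:
            patient_notes = patient_notes.sort_values('charttime')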
        
        return patient_notes.to_dict('records')
    
    def extract_entities_regex(self, text: str) -> List[Dict]:
        """Extract medical entities using regular expressions
        
        Args:
            text: input text
            
        Returns:
            Entity list"""
        entities = []
        
        for entity_type, patterns in self.ENTITY_PATTERNS.items():
            for pattern in patterns:
                for match in re.finditer(pattern, text, re.IGNORECASE):
                    entity = {
                        "text": match.group(),
                        "type": entity_type,
                        "start": match.start(),
                        "end": match.end(),
                        "source": "regex"
                    }
                    entities.append(entity)
        
        # Flag negated entities
        entities = self._detect_negation(text, entities)
        
        return entities
    
    def extract_entities_spacy(self, text: str) -> List[Dict]:
        """Extract medical entities using spaCy
        
        Args:
            text: input text
            
        Returns:
            Entity list"""
        if self.nlp is None:
            return []
        
        doc = self.nlp(text)
        entities = []
        
        for ent in doc.ents:
            entity = {
                "text": ent.text,
                "type": ent.label_,
                "start": ent.start_char,
                "end": ent.end_char,
                "source": "spacy"
            }
            
            # Check for negation (using negspacy)
            if NEGSPACY_AVAILABLE and hasattr(ent._, "negex"):
                entity["negated"] = ent._.negex
            
            entities.append(entity)
        
        return entities
    
    def _detect_negation(self, text: str, entities: List[Dict]) -> List[Dict]:
        """Check if an entity is negated
        
        Args:
            text: original text
            entities: list of entities
            
        Returns:
            Entity list with a "negated" flag added to each entity"""
        for entity in entities:
            # Get the text window in front of the entity (up to 50 characters)
            start_pos = max(0, entity["start"] - 50)
            context = text[start_pos:entity["start"]]
            
            # Keep only the current sentence so that a cue in an earlier
            # sentence does not negate this entity
            context = re.split(r"[.!?;]", context)[-1]
            
            # Check for negation cues
            is_negated = any(
                re.search(pattern, context, re.IGNORECASE)
                for pattern in self.NEGATION_CUES
            )
            entity["negated"] = is_negated
        
        return entities
    
    def extract_relations(self, text: str, entities: List[Dict]) -> List[Dict]:
        """Extract relationships between entities from text
        
        Args:
            text: input text
            entities: list of recognized entities
            
        Returns:
            List of relational triples"""
        relations = []
        
        # Simple pattern-based relationship extraction
        # Drug-Disease Treatment Relationship
        med_disease_pattern = r"(\w+)\s+(?:for|to treat|management of)\s+([\w\s]+?)(?:\.|,|;|$)"
        for match in re.finditer(med_disease_pattern, text, re.IGNORECASE):
            relations.append({
                "subject": match.group(1),
                "predicate": "TREATS",
                "object": match.group(2).strip(),
                "confidence": 0.7,
                "source": "pattern"
            })
        
        # Symptom-disease relationship
        symptom_disease_pattern = r"(\w+)\s+(?:due to|caused by|secondary to)\s+([\w\s]+?)(?:\.|,|;|$)"
        for match in re.finditer(symptom_disease_pattern, text, re.IGNORECASE):
            relations.append({
                "subject": match.group(1),
                "predicate": "CAUSED_BY",
                "object": match.group(2).strip(),
                "confidence": 0.6,
                "source": "pattern"
            })
        
        return relations
    
    def extract_timeline(self, text: str) -> List[Dict]:
        """Extract timeline events from text
        
        Args:
            text: input text
            
        Returns:
            event timeline"""
        events = []
        sentences = re.split(r'[.!?]\s+', text)
        
        event_indicators = [
            "admitted", "presented", "developed", "underwent", 
            "started", "discontinued", "noted", "observed",
            "worsening", "improvement", "resolved"
        ]
        
        for sent in sentences:
            # Detect time expression
            time_match = None
            for pattern in self.TIME_PATTERNS:
                match = re.search(pattern, sent, re.IGNORECASE)
                if match:
                    time_match = match.group()
                    break
            
            # detect event verb
            for indicator in event_indicators:
                if re.search(rf"\b{indicator}\b", sent, re.IGNORECASE):
                    events.append({
                        "time_expression": time_match,
                        "event_verb": indicator,
                        "description": sent.strip(),
                        "raw_time": time_match
                    })
                    break
        
        return events
    
    def parse_clinical_logic(self, text: str) -> Dict:
        """Analyze the diagnosis and treatment logic chain
        
        Args:
            text: medical text
            
        Returns:
            Logical structure of diagnosis and treatment"""
        logic = {
            "presenting_complaint": None,
            "differential_diagnoses": [],
            "workup": [],
            "final_diagnosis": None,
            "treatment_plan": [],
            "clinical_course": []
        }
        
        # Extract chief complaint
        chief_complaint_patterns = [
            r"chief complaint[:\s]+([^.]+)",
            r"presented with[:\s]+([^.]+)",
            r"\bPC[:\s]+([^.]+)"
        ]
        for pattern in chief_complaint_patterns:
            match = re.search(pattern, text, re.IGNORECASE)
            if match:
                logic["presenting_complaint"] = match.group(1).strip()
                break
        
        # Extract differential diagnosis
        dd_pattern = r"(?:differential diagnosis|DDx)[:\s]+([^.]+(?:,[^.]+)*)"
        match = re.search(dd_pattern, text, re.IGNORECASE)
        if match:
            logic["differential_diagnoses"] = [
                d.strip() for d in match.group(1).split(",")
            ]
        
        # Extract workup (tests and examinations)
        workup_terms = {
            "ecg", "ekg", "ct", "mri", "x-ray", "echo",
            "troponin", "bnp", "wbc"
        }
        workup_indicators = ["obtained", "performed", "ordered", "showed"]
        for indicator in workup_indicators:
            # Capture the single token (hyphens allowed) right before the indicator
            pattern = rf"\b([\w-]+)\s+{indicator}\b"
            for match in re.finditer(pattern, text, re.IGNORECASE):
                procedure = match.group(1)
                if procedure.lower() in workup_terms:
                    logic["workup"].append(procedure)
        
        # Remove duplicates while preserving first-seen order
        logic["workup"] = list(dict.fromkeys(logic["workup"]))
        
        return logic
    
    def extract_insights(self, text: str,
                        extract_entities: bool = True,
                        extract_relations: bool = True,
                        extract_timeline: bool = True,
                        extract_logic: bool = True) -> Dict:
        """Perform complete medical text insight extraction
        
        Args:
            text: input medical text
            extract_entities: whether to extract entities
            extract_relations: whether to extract relationships
            extract_timeline: whether to extract timeline
            extract_logic: whether to parse diagnosis and treatment logic
            
        Returns:
            Comprehensive insights"""
        insights = {
            "raw_text_length": len(text),
            "extraction_timestamp": datetime.now().isoformat()
        }
        
        # Entity extraction
        if extract_entities:
            entities_regex = self.extract_entities_regex(text)
            entities_spacy = self.extract_entities_spacy(text)
            
            # Merge results (remove duplicates)
            all_entities = entities_regex + [
                e for e in entities_spacy
                if not any(e["start"] == ex["start"] for ex in entities_regex)
            ]
            
            # Group by type
            entities_by_type = {}
            for e in all_entities:
                etype = e["type"]
                if etype not in entities_by_type:
                    entities_by_type[etype] = []
                entities_by_type[etype].append(e)
            
            insights["entities"] = all_entities
            insights["entities_by_type"] = entities_by_type
            insights["entity_count"] = len(all_entities)
        
        # Relation extraction
        if extract_relations and extract_entities:
            relations = self.extract_relations(text, insights.get("entities", []))
            insights["relations"] = relations
            insights["relation_count"] = len(relations)
        
        # Timeline extraction
        if extract_timeline:
            timeline = self.extract_timeline(text)
            insights["timeline"] = timeline
            insights["event_count"] = len(timeline)
        
        # Diagnosis and treatment logic analysis
        if extract_logic:
            logic = self.parse_clinical_logic(text)
            insights["clinical_logic"] = logic
        
        return insights
    
    def process_patient(self, subject_id: int, 
                       hadm_id: Optional[int] = None) -> Dict:
        """Process all text records for a single patient
        
        Args:
            subject_id: patient ID
            hadm_id: Optional hospitalization record ID
            
        Returns:
            Comprehensive Patient Insights"""
        notes = self.get_patient_texts(subject_id, hadm_id)
        
        patient_insights = {
            "subject_id": subject_id,
            "hadm_id": hadm_id,
            "note_count": len(notes),
            "notes": []
        }
        
        for note in notes:
            text = note.get('note_text', '')
            if not text:
                continue
            
            note_insights = self.extract_insights(text)
            note_insights.update({
                "note_id": note.get('note_id'),
                "note_type": note.get('note_type'),
                "charttime": note.get('charttime')
            })
            
            patient_insights["notes"].append(note_insights)
        
        # Aggregate all entities
        all_entities = []
        for note in patient_insights["notes"]:
            all_entities.extend(note.get("entities", []))
        
        patient_insights["aggregated_entities"] = self._aggregate_entities(all_entities)
        
        return patient_insights
    
    def _aggregate_entities(self, entities: List[Dict]) -> Dict:
        """Aggregate entities (deduplication and counting)"""
        entity_map = {}
        
        for e in entities:
            key = (e["text"].lower(), e["type"])
            if key not in entity_map:
                entity_map[key] = {
                    "text": e["text"],
                    "type": e["type"],
                    "count": 0,
                    "negated_count": 0,
                    "instances": []
                }
            
            entity_map[key]["count"] += 1
            if e.get("negated"):
                entity_map[key]["negated_count"] += 1
            entity_map[key]["instances"].append({
                "start": e["start"],
                "end": e["end"]
            })
        
        return {
            "unique_count": len(entity_map),
            "entities": list(entity_map.values())
        }
    
    def export_to_json(self, insights: Dict, output_path: str):
        """Export insights results to JSON file"""
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(insights, f, indent=2, ensure_ascii=False, default=str)
        print(f"Exported insights to {output_path}")
    
    def generate_summary_report(self, insights: Dict) -> str:
        """Generate text summary report"""
        report = []
        report.append("=" * 60)
        report.append("MEDICAL TEXT MINING SUMMARY REPORT")
        report.append("=" * 60)
        
        if "subject_id" in insights:
            report.append(f"\nPatient ID: {insights['subject_id']}")
        
        if "note_count" in insights:
            report.append(f"Total Notes Processed: {insights['note_count']}")
        
        if "aggregated_entities" in insights:
            agg = insights["aggregated_entities"]
            report.append(f"\nUnique Entities Found: {agg['unique_count']}")
            
            # Display grouped by type
            by_type = {}
            for e in agg["entities"]:
                t = e["type"]
                if t not in by_type:
                    by_type[t] = []
                by_type[t].append(e)
            
            for etype, entities in sorted(by_type.items()):
                report.append(f"\n{etype} ({len(entities)} unique):")
                for e in sorted(entities, key=lambda x: -x["count"])[:5]:
                    neg_info = f", {e['negated_count']} negated" if e['negated_count'] > 0 else ""
                    report.append(f"  - {e['text']} ({e['count']} mentions{neg_info})")
        
        return "\n".join(report)


def main():
    """CLI entry"""
    parser = argparse.ArgumentParser(
        description="MIMIC-IV Unstructured Medical Text Miner"
    )
    parser.add_argument(
        "--input", "-i",
        help="Path to NOTEEVENTS CSV/Parquet file"
    )
    parser.add_argument(
        "--subject-id", "-s", type=int,
        help="Process specific patient by subject_id"
    )
    parser.add_argument(
        "--hadm-id", type=int,
        help="Filter by hospital admission ID"
    )
    parser.add_argument(
        "--output", "-o",
        help="Output JSON file path"
    )
    parser.add_argument(
        "--extract", choices=["entities", "relations", "timeline", "logic", "all"],
        default="all",
        help="What to extract from the text"
    )
    parser.add_argument(
        "--note-types", nargs="+",
        help="Filter by note types (e.g., DS RR ECG)"
    )
    parser.add_argument(
        "--sample-text",
        help="Process a single text snippet directly"
    )
    
    args = parser.parse_args()
    
    miner = MedicalTextMiner()
    
    # Process individual text fragments
    if args.sample_text:
        insights = miner.extract_insights(args.sample_text)
        print(json.dumps(insights, indent=2, default=str))
        return
    
    # Requires input files
    if not args.input:
        parser.error("--input is required (unless using --sample-text)")
    
    # Load data
    miner.load_notes(args.input, args.note_types)
    
    # Process specific patients or batch processing
    if args.subject_id is not None:
        insights = miner.process_patient(args.subject_id, args.hadm_id)
    else:
        # Handle all patients (simplified version)
        print("Processing all patients...")
        all_insights = []
        for sid in miner.notes_df['subject_id'].unique()[:10]:  # Limit to first 10 patients
            patient_insights = miner.process_patient(sid)
            all_insights.append(patient_insights)
        insights = {
            "patient_count": len(all_insights),
            "patients": all_insights
        }
    
    # Generate summary
    print("\n" + miner.generate_summary_report(insights))
    
    # Export results
    if args.output:
        miner.export_to_json(insights, args.output)
    else:
        # Default output
        default_output = "medical_text_insights.json"
        miner.export_to_json(insights, default_output)


if __name__ == "__main__":
    main()

FILE:scripts/__init__.py
"""Unstructured Medical Text Miner (Skill ID: 213)

Python package for mining MIMIC-IV medical text data"""

from .main import MedicalTextMiner

__version__ = "0.1.0"
__all__ = ["MedicalTextMiner"]


USMLE Case Scenario Generator

Skill

Generate USMLE Step 1/2 style clinical cases with patient history, physical.

---
name: usmle-case-generator
description: Generate USMLE Step 1/2 style clinical cases with patient history, physical.
license: MIT
skill-author: AIPOCH
---
# USMLE Case Generator

Generate USMLE Step 1 and Step 2 CK style clinical cases for medical education and board exam preparation.

## When to Use

- Use this skill when the task is to generate USMLE Step 1/2 style clinical cases with patient history and physical examination findings.
- Use this skill for academic writing tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

See `## Features` below for related details.

- Scope-focused workflow for generating USMLE Step 1/2 style clinical cases with patient history and physical examination findings.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

- Python 3.8+
- No external API dependencies (template-based generation)
- Optional: LLM integration for case variation

## Example Usage

See `## Usage` below for related details.

```bash
cd "20260318/scientific-skills/Academic Writing/usmle-case-generator"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Features

- **Step 1 Cases**: Basic science concepts, pathophysiology, pharmacology
- **Step 2 Cases**: Clinical diagnosis, management, next best steps
- **Complete Vignettes**: History, physical exam, labs, imaging
- **Multiple Choice Questions**: Single best answer format
- **Answer Explanations**: Detailed rationale for learning

## Usage

```bash
# Generate a Step 1 case (pathophysiology focus)
python scripts/main.py --step 1 --topic cardiology --difficulty medium

# Generate a Step 2 case (clinical management focus)
python scripts/main.py --step 2 --topic nephrology --include-diagnosis

# Generate case with specific conditions
python scripts/main.py --step 2 --condition "diabetic ketoacidosis" --format json
```

## Parameters

| Parameter | Options | Description |
|-----------|---------|-------------|
| `--step` | 1, 2 | USMLE Step level |
| `--topic` | See references/topics.json | Medical specialty |
| `--condition` | Any condition | Specific disease/condition |
| `--difficulty` | easy, medium, hard | Case complexity |
| `--format` | text, json, markdown | Output format |
| `--include-diagnosis` | flag | Include answer key |
| `--count` | 1-10 | Number of cases to generate |
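
The parameter table can be expressed as an `argparse` parser. The following is a minimal sketch, not the packaged `scripts/main.py`: the function name `build_parser`, the `output_format` destination, and the defaults for `--step`, `--difficulty`, and `--count` are assumptions made for illustration (only the `text` default is stated in this document).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the parameter table; not the packaged CLI."""
    parser = argparse.ArgumentParser(description="USMLE case generator (sketch)")
    # Defaults for --step, --difficulty, and --count are assumed, not documented.
    parser.add_argument("--step", type=int, choices=[1, 2], default=1,
                        help="USMLE Step level")
    parser.add_argument("--topic", help="Medical specialty key, e.g. cardiology")
    parser.add_argument("--condition", help="Specific disease or condition")
    parser.add_argument("--difficulty", choices=["easy", "medium", "hard"],
                        default="medium", help="Case complexity")
    parser.add_argument("--format", dest="output_format",
                        choices=["text", "json", "markdown"], default="text",
                        help="Output format")
    parser.add_argument("--include-diagnosis", action="store_true",
                        help="Include the answer key")
    parser.add_argument("--count", type=int, choices=range(1, 11), default=1,
                        metavar="1-10", help="Number of cases to generate")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--step", "2", "--condition", "diabetic ketoacidosis", "--format", "json"]
    )
    print(args.step, args.condition, args.output_format, args.count)
```

Using `choices` for the enumerated options lets `argparse` reject unsupported values before any case generation starts.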

## Topics Covered

- Cardiology
- Pulmonology
- Gastroenterology
- Nephrology
- Endocrinology
- Hematology/Oncology
- Infectious Disease
- Neurology
- Psychiatry
- Musculoskeletal
- Dermatology
- Obstetrics/Gynecology
- Pediatrics
- Surgery

## Case Structure

Each generated case includes:

1. **Patient Demographics**: Age, gender, relevant background
2. **Chief Complaint**: Presenting problem
3. **History of Present Illness**: Detailed symptom timeline
4. **Past Medical History**: Relevant comorbidities
5. **Medications**: Current drug regimen
6. **Allergies**: Drug/environmental allergies
7. **Family History**: Genetic conditions
8. **Social History**: Smoking, alcohol, occupation
9. **Physical Examination**: Vital signs, relevant findings
10. **Laboratory Studies**: CBC, CMP, specific markers
11. **Imaging/Diagnostics**: X-ray, CT, ECG, etc.
12. **Question**: USMLE-style multiple choice
13. **Answer Options**: 5 choices (A-E)
14. **Correct Answer**: With detailed explanation
15. **Educational Objectives**: Key learning points
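
A completeness check over these 15 components might look like the sketch below. The snake_case field names and the `validate_case` helper are hypothetical; the actual keys produced by `scripts/main.py` are not shown in this document.

```python
from typing import Dict, List

# Hypothetical snake_case keys for the 15 components listed above.
REQUIRED_FIELDS: List[str] = [
    "demographics", "chief_complaint", "history_of_present_illness",
    "past_medical_history", "medications", "allergies", "family_history",
    "social_history", "physical_examination", "laboratory_studies",
    "imaging_diagnostics", "question", "answer_options", "correct_answer",
    "educational_objectives",
]


def validate_case(case: Dict) -> List[str]:
    """Return a list of problems; an empty list means the case looks complete."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in case]
    # The structure above calls for exactly five options labeled A-E.
    options = case.get("answer_options", [])
    if len(options) != 5:
        problems.append(f"expected 5 answer options, found {len(options)}")
    if case.get("correct_answer") not in {"A", "B", "C", "D", "E"}:
        problems.append("correct_answer must be one of A-E")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every gap to a reviewer in a single pass.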

## Output Formats

### Text Format (Default)
Plain text suitable for printing or reading.

### JSON Format
Structured data for integration with applications.

### Markdown Format
Formatted for documentation or web display.
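
As an illustration of how one case could be rendered in each of the three formats, here is a small sketch. The `render_case` helper and the `title`, `vignette`, `question`, and `options` keys are assumptions for this example, not the packaged script's data model.

```python
import json
from typing import Dict


def render_case(case: Dict, output_format: str = "text") -> str:
    """Render a case dict as text, JSON, or Markdown (hypothetical helper)."""
    if output_format == "json":
        return json.dumps(case, indent=2, ensure_ascii=False)
    if output_format == "markdown":
        lines = [f"## {case['title']}", "", case["vignette"], "",
                 f"**Question:** {case['question']}", ""]
        lines += [f"- {option}" for option in case["options"]]
        return "\n".join(lines)
    if output_format == "text":
        lines = [f"Case: {case['title']}", "", case["vignette"], "",
                 f"Question: {case['question']}", ""]
        lines += list(case["options"])
        return "\n".join(lines)
    raise ValueError(f"unsupported format: {output_format}")
```

Keeping the case as a plain dict until the final step means the JSON output is a lossless dump, while the text and Markdown renderings are views over the same data.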

## Technical Difficulty

**High** - Requires medical knowledge validation and clinical accuracy.

⚠️ **Manual Review Required**: Generated cases should be reviewed by medical professionals before use in high-stakes educational settings.

## References

- `references/topics.json` - Medical specialty taxonomy
- `references/case_templates.json` - Case structure templates
- `references/usmle_patterns.md` - USMLE question patterns
- `references/conditions/` - Condition-specific case data

## Example Output

```
Case: A 58-year-old male with chest pain

A 58-year-old man presents to the emergency department with 
crushing substernal chest pain radiating to his left arm, 
beginning 2 hours ago at rest...

[History, physical, labs, ECG findings...]

Question: What is the most appropriate next step in management?

A. Administer aspirin and nitroglycerin
B. Order CT pulmonary angiography
C. Perform immediate synchronized cardioversion
D. Start heparin drip and call cardiology
E. Discharge with outpatient stress test

Correct Answer: D
Explanation: [Detailed rationale...]
```

## Safety & Limitations

- Cases are AI-generated and may contain inaccuracies
- Not a substitute for professional medical education
- Always verify clinical details with authoritative sources
- Intended for educational purposes only

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
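
Two of the checklist items, path validation without `../` traversal and restricting output to the workspace, can be sketched with `pathlib`. The helper name `resolve_inside_workspace` is hypothetical, and this is one possible approach rather than what `scripts/main.py` is known to do.

```python
from pathlib import Path


def resolve_inside_workspace(user_path: str, workspace: str) -> Path:
    """Resolve user_path under workspace and reject anything that escapes it."""
    root = Path(workspace).resolve()
    # Joining with an absolute user_path discards root, so the check below
    # also rejects absolute paths that point outside the workspace.
    candidate = (root / user_path).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"path escapes workspace: {user_path}") from None
    return candidate
```

`relative_to` is used instead of `Path.is_relative_to` so the sketch also runs on Python 3.8, the minimum version listed under Dependencies.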

## Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
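
One way to report a failure point without fabricating an outcome is to wrap the script call and summarize what happened. The sketch below assumes a hypothetical `run_main` helper and a 60-second timeout; neither comes from the packaged skill.

```python
import subprocess
import sys
from typing import Dict, List


def run_main(script_args: List[str], script: str = "scripts/main.py") -> Dict:
    """Run the script and summarize the outcome without raising."""
    try:
        result = subprocess.run(
            [sys.executable, script, *script_args],
            capture_output=True, text=True, timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return {"ok": False, "failure_point": type(exc).__name__, "detail": str(exc)}
    if result.returncode != 0:
        # Keep only the last stderr line so full stack traces are not echoed back.
        stderr_lines = result.stderr.strip().splitlines()
        detail = stderr_lines[-1] if stderr_lines else ""
        return {"ok": False,
                "failure_point": f"exit code {result.returncode}",
                "detail": detail}
    return {"ok": True, "output": result.stdout}
```

A caller can then branch on `ok` and, on failure, pass `failure_point` and `detail` into the manual fallback instead of guessing at what the script would have produced.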

## Input Validation

This skill accepts requests that match the documented purpose of `usmle-case-generator` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `usmle-case-generator` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
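
The gate described above can be sketched as a small pre-check. The rule used here (a step level of 1 or 2 plus either a topic or a condition) is an assumption for illustration; the document does not spell out which inputs are strictly required.

```python
from typing import Dict, Optional

REFUSAL = (
    "`usmle-case-generator` only handles its documented workflow. "
    "Please provide the missing required inputs or switch to a more suitable skill."
)


def check_request(request: Dict) -> Optional[str]:
    """Return None when the request can proceed, else the refusal message."""
    # Assumed rule: a valid step level plus either a topic or a condition.
    if request.get("step") not in (1, 2):
        return REFUSAL
    if not (request.get("topic") or request.get("condition")):
        return REFUSAL
    return None
```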

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
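
A minimal sketch of the fixed structure, assuming a hypothetical `format_response` helper that takes section text keyed by section name:

```python
from typing import Dict

SECTIONS = [
    "Objective", "Inputs Received", "Assumptions", "Workflow",
    "Deliverable", "Risks and Limits", "Next Checks",
]


def format_response(content: Dict[str, str]) -> str:
    """Number the seven sections in order, flagging any that were not supplied."""
    lines = []
    for number, name in enumerate(SECTIONS, start=1):
        lines.append(f"{number}. {name}: {content.get(name, 'not provided')}")
    return "\n".join(lines)
```

Marking absent sections as "not provided" keeps gaps visible instead of silently dropping them.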

FILE:references/case_templates.json
{
  "case_structures": {
    "step1_vignette": {
      "name": "Step 1 - Basic Science Vignette",
      "focus": "Pathophysiology, mechanisms, pharmacology",
      "structure": [
        "Patient Demographics (age, gender)",
        "Brief Presentation (1-2 sentences)",
        "Key Clinical Finding",
        "Question about mechanism/pathogen"
      ],
      "example_question_types": [
        "What is the most likely mechanism?",
        "Which pathogen is most likely responsible?",
        "What is the pathophysiology?",
        "Which enzyme is deficient?",
        "What is the mode of inheritance?"
      ]
    },
    "step2_vignette": {
      "name": "Step 2 - Clinical Vignette",
      "focus": "Diagnosis, management, next steps",
      "structure": [
        "Patient Demographics",
        "Chief Complaint",
        "History of Present Illness",
        "Past Medical History",
        "Medications",
        "Allergies",
        "Family History",
        "Social History",
        "Physical Examination",
        "Laboratory/Diagnostic Studies",
        "Question"
      ],
      "example_question_types": [
        "What is the most likely diagnosis?",
        "What is the most appropriate next step?",
        "What is the best initial test?",
        "What is the most accurate test?",
        "What is the most appropriate treatment?"
      ]
    }
  },
  "vital_signs_template": {
    "temperature": {
      "normal": "98.6°F (37.0°C)",
      "low_grade_fever": "100-101°F",
      "moderate_fever": "101-103°F",
      "high_fever": "103-105°F",
      "hypothermia": "<95°F"
    },
    "heart_rate": {
      "bradycardia": "<60 bpm",
      "normal": "60-100 bpm",
      "mild_tachycardia": "100-120 bpm",
      "moderate_tachycardia": "120-140 bpm",
      "severe_tachycardia": ">140 bpm"
    },
    "blood_pressure": {
      "hypotension": "<90/60 mmHg",
      "low_normal": "90-100/60-70 mmHg",
      "normal": "<120/80 mmHg",
      "elevated": "120-139/80-89 mmHg",
      "stage1_htn": "140-159/90-99 mmHg",
      "stage2_htn": "≥160/≥100 mmHg",
      "hypertensive_emergency": ">180/120 mmHg"
    },
    "respiratory_rate": {
      "bradypnea": "<12/min",
      "normal": "12-20/min",
      "tachypnea": ">20/min",
      "severe": ">30/min"
    },
    "oxygen_saturation": {
      "normal": "95-100%",
      "mild_hypoxemia": "90-94%",
      "moderate_hypoxemia": "85-89%",
      "severe_hypoxemia": "<85%"
    }
  },
  "history_components": {
    "hpi_elements": [
      "Location",
      "Quality",
      "Severity",
      "Duration",
      "Timing",
      "Context",
      "Modifying factors",
      "Associated symptoms"
    ],
    "ros_categories": [
      "Constitutional",
      "HEENT",
      "Cardiovascular",
      "Respiratory",
      "Gastrointestinal",
      "Genitourinary",
      "Musculoskeletal",
      "Integumentary",
      "Neurological",
      "Psychiatric",
      "Endocrine",
      "Hematologic/Lymphatic",
      "Allergic/Immunologic"
    ],
    "social_history_elements": [
      "Tobacco use",
      "Alcohol use",
      "Drug use",
      "Sexual history",
      "Occupation",
      "Living situation",
      "Exercise",
      "Diet",
      "Travel history"
    ]
  },
  "physical_exam_template": {
    "general": ["Appearance", "Level of consciousness", "Nutritional status"],
    "vital_signs": ["Temperature", "Heart rate", "Blood pressure", "Respiratory rate", "O2 saturation"],
    "skin": ["Color", "Turgor", "Lesions", "Rashes"],
    "head": ["Normocephalic", "Trauma", "Tenderness"],
    "eyes": ["Pupils", "Extraocular movements", "Fundoscopic exam"],
    "ears": ["Canals", "Tympanic membranes"],
    "nose": ["Septum", "Mucosa"],
    "throat": ["Oropharynx", "Dentition"],
    "neck": ["JVD", "Thyroid", "Lymph nodes", "Carotid bruits"],
    "cardiovascular": ["Rate", "Rhythm", "Murmurs", "S1/S2", "Edema"],
    "pulmonary": ["Inspection", "Palpation", "Percussion", "Auscultation"],
    "abdominal": ["Inspection", "Auscultation", "Percussion", "Palpation"],
    "musculoskeletal": ["Range of motion", "Strength", "Tenderness", "Deformity"],
    "neurological": ["Mental status", "Cranial nerves", "Motor", "Sensory", "Reflexes", "Gait"]
  },
  "lab_reference_ranges": {
    "cbc": {
      "wbc": {"normal": "4,500-11,000/μL", "low": "<4,500", "high": ">11,000"},
      "hemoglobin": {"male": "13.5-17.5 g/dL", "female": "12.0-16.0 g/dL"},
      "hematocrit": {"male": "38.8-50.0%", "female": "34.9-44.5%"},
      "platelets": {"normal": "150,000-400,000/μL", "low": "<150,000", "high": ">400,000"}
    },
    "cmp": {
      "sodium": {"normal": "135-145 mEq/L"},
      "potassium": {"normal": "3.5-5.0 mEq/L"},
      "chloride": {"normal": "96-106 mEq/L"},
      "co2": {"normal": "23-29 mEq/L"},
      "bun": {"normal": "7-20 mg/dL"},
      "creatinine": {"normal": "0.6-1.2 mg/dL"},
      "glucose": {"normal": "70-100 mg/dL (fasting)"},
      "calcium": {"normal": "8.5-10.5 mg/dL"},
      "protein": {"normal": "6.0-8.3 g/dL"},
      "albumin": {"normal": "3.5-5.0 g/dL"},
      "bilirubin": {"normal": "0.1-1.2 mg/dL"},
      "ast": {"normal": "10-40 U/L"},
      "alt": {"normal": "7-56 U/L"},
      "alkaline_phosphatase": {"normal": "44-147 U/L"}
    },
    "coagulation": {
      "pt": {"normal": "11-13.5 seconds", "inr": "0.9-1.1"},
      "ptt": {"normal": "25-35 seconds"}
    },
    "abg": {
      "ph": {"normal": "7.35-7.45", "acidemia": "<7.35", "alkalemia": ">7.45"},
      "pco2": {"normal": "35-45 mmHg"},
      "po2": {"normal": "80-100 mmHg"},
      "hco3": {"normal": "22-26 mEq/L"},
      "base_excess": {"normal": "-2 to +2 mEq/L"}
    },
    "cardiac": {
      "troponin_i": {"normal": "<0.04 ng/mL"},
      "troponin_t": {"normal": "<0.01 ng/mL"},
      "bnp": {"normal": "<100 pg/mL"},
      "nt_probnp": {"normal": "<300 pg/mL"}
    },
    "urinalysis": {
      "ph": {"normal": "4.6-8.0"},
      "specific_gravity": {"normal": "1.005-1.030"},
      "protein": {"normal": "negative"},
      "glucose": {"normal": "negative"},
      "ketones": {"normal": "negative"},
      "blood": {"normal": "negative"},
      "wbc": {"normal": "0-5 per HPF"},
      "rbc": {"normal": "0-2 per HPF"}
    }
  },
  "diagnostic_studies": {
    "imaging": {
      "chest_xray": ["PA and lateral views", "Specific findings by pathology"],
      "ct_scan": ["With/without contrast", "Indications", "Findings"],
      "mri": ["With/without contrast", "Indications", "Findings"],
      "ultrasound": ["Type of scan", "Indications", "Findings"],
      "echocardiogram": ["Transthoracic vs TEE", "Measurements", "Findings"],
      "angiography": ["Type", "Indications", "Findings"]
    },
    "procedures": {
      "ekg_ecg": ["Rate", "Rhythm", "Axis", "Intervals", "Morphology", "Findings"],
      "lumbar_puncture": ["Opening pressure", "CSF analysis", "Cell count", "Glucose", "Protein", "Culture"],
      "biopsy": ["Type", "Findings", "Pathology"],
      "bronchoscopy": ["Indications", "Findings", "BAL results"],
      "colonoscopy": ["Indications", "Findings", "Biopsy results"],
      "egd": ["Indications", "Findings", "Biopsy results"]
    }
  }
}

FILE:references/guidelines.md
# USMLE Case Generator Guidelines

## Purpose
This skill generates realistic USMLE Step 1 and Step 2 style clinical case scenarios for medical education and exam preparation.

## Supported Medical Specialties

### Core Specialties
1. **Cardiology** - Heart failure, MI, arrhythmias
2. **Neurology** - Stroke, seizure, headache
3. **Gastroenterology** - Pancreatitis, liver disease, IBD
4. **Pulmonology** - COPD, asthma, pneumonia
5. **Nephrology** - AKI, CKD, electrolyte disorders
6. **Endocrinology** - DKA, thyroid disorders, adrenal disease
7. **Hematology** - Anemia, coagulopathies
8. **Infectious Disease** - Pneumonia, sepsis, endocarditis
9. **Rheumatology** - RA, SLE, vasculitis
10. **Emergency Medicine** - Anaphylaxis, trauma, toxicology

## Case Structure

### 1. Patient Vignette
- Chief complaint
- History of present illness
- Past medical history
- Medications
- Physical examination findings
- Vital signs
- Diagnostic studies

### 2. Multiple-Choice Question
- Single best answer format
- 5 options (A-E)
- Clinical scenario-based
- Tests high-yield concepts

### 3. Explanation
- Detailed rationale for correct answer
- Why distractors are incorrect
- Key learning objectives
- High-yield takeaways

## Difficulty Levels

### Step 1 (Basic Science)
- Focus on pathophysiology
- Mechanism-based questions
- Laboratory interpretation
- Basic pharmacology

### Step 2 (Clinical Knowledge)
- Focus on diagnosis and management
- Next best step questions
- Treatment algorithms
- Prognosis and complications

## Usage Guidelines

### Generating Cases
```python
from scripts.main import generate_case

# Random case
case = generate_case()

# Specific specialty
case = generate_case(specialty="Cardiology")

# Specific difficulty
case = generate_case(difficulty="Step 2")

# Combined
case = generate_case(specialty="Neurology", difficulty="Step 2")
```

### Command Line
```bash
# Random case
python scripts/main.py

# Specific specialty
python scripts/main.py --specialty Cardiology

# Specific difficulty
python scripts/main.py --difficulty "Step 2"

# Save to file
python scripts/main.py --specialty Neurology --output case.json
```

## Case Template Format

Each case template includes:
- Template identifier
- Chief complaint
- History template with placeholders
- Physical examination findings
- Vital signs template
- Diagnostic studies template
- Question stem
- Options array (5 items)
- Correct answer letter
- Detailed explanation
- Learning objectives array

## Educational Use Only
Cases are generated for educational purposes and should not replace clinical judgment or actual medical advice.

FILE:references/requirements.txt
# Requirements for USMLE Case Generator

## Python Version
- Python 3.8 or higher

## Standard Library Dependencies
- json (built-in)
- random (built-in)
- argparse (built-in)
- datetime (built-in)
- typing (built-in)
- dataclasses (built-in)
- enum (built-in)
- unittest (built-in) - for testing

## No External Dependencies Required
This skill uses only Python standard library modules for maximum portability.

## Installation
No installation required. Simply run:
```bash
python scripts/main.py
```

## Testing
```bash
cd scripts
python test_main.py
```

FILE:references/sample_input.json
{
  "specialty": "Cardiology",
  "difficulty": "Step 2",
  "focus_area": "heart failure"
}

FILE:references/sample_output.json
{
  "status": "success",
  "case": {
    "metadata": {
      "specialty": "Cardiology",
      "difficulty": "Step 2",
      "topic": "Heart Failure",
      "generated_at": "2026-02-05T10:30:00Z"
    },
    "vignette": {
      "chief_complaint": "Progressive shortness of breath and bilateral leg swelling",
      "history_present_illness": "68-year-old male with a history of hypertension and coronary artery disease presents with 3 days of worsening dyspnea on exertion. Patient reports orthopnea and paroxysmal nocturnal dyspnea. No prior similar episodes.",
      "past_medical_history": "Hypertension, cardiology history as relevant",
      "medications": "As clinically appropriate for case",
      "physical_examination": "JVP elevated at 8 cm, bilateral basilar crackles, S3 gallop, 2+ pitting edema bilaterally",
      "vitals": "BP 150/90, HR 92, RR 22, SpO2 92% on room air",
      "diagnostic_studies": "Chest X-ray: cardiomegaly, pulmonary edema. BNP: 1250 pg/mL. Echo: EF 35%"
    },
    "question": {
      "stem": "What is the most appropriate initial medication to administer?",
      "options": [
        "A. Lisinopril",
        "B. Metoprolol",
        "C. Furosemide",
        "D. Digoxin",
        "E. Amlodipine"
      ],
      "correct_answer": "C",
      "explanation": "Furosemide is the correct initial therapy for acute decompensated heart failure with volume overload. It provides rapid diuresis, reducing preload and relieving pulmonary congestion. While ACE inhibitors and beta-blockers are important for chronic management, diuresis is the priority in acute decompensation.",
      "learning_objectives": [
        "Acute heart failure management",
        "Diuretic therapy",
        "Volume overload"
      ]
    }
  }
}

FILE:references/topics.json
{
  "medical_specialties": {
    "cardiology": {
      "display_name": "Cardiology",
      "subtopics": [
        "Acute Coronary Syndromes",
        "Heart Failure",
        "Arrhythmias",
        "Valvular Heart Disease",
        "Hypertension",
        "Pericardial Disease",
        "Congenital Heart Disease",
        "Vascular Disease"
      ],
      "step1_focus": ["Electrophysiology", "Hemodynamics", "Pharmacology", "Pathophysiology"],
      "step2_focus": ["Diagnosis", "Management", "Acute Care", "Chronic Management"]
    },
    "pulmonology": {
      "display_name": "Pulmonology",
      "subtopics": [
        "Obstructive Disease",
        "Restrictive Disease",
        "Infections",
        "Vascular Disease",
        "Sleep Disorders",
        "Pleural Disease",
        "Neoplasms"
      ],
      "step1_focus": ["Ventilation/Perfusion", "Gas Exchange", "Respiratory Physiology"],
      "step2_focus": ["Pulmonary Function Tests", "Imaging", "Treatment"]
    },
    "gastroenterology": {
      "display_name": "Gastroenterology",
      "subtopics": [
        "Esophageal Disorders",
        "Gastric Disorders",
        "Small Intestine",
        "Colorectal Disorders",
        "Hepatobiliary",
        "Pancreas",
        "GI Emergencies"
      ],
      "step1_focus": ["Digestive Physiology", "Enzymes", "Absorption"],
      "step2_focus": ["Endoscopy", "Imaging", "Surgical vs Medical Management"]
    },
    "nephrology": {
      "display_name": "Nephrology",
      "subtopics": [
        "Acute Kidney Injury",
        "Chronic Kidney Disease",
        "Glomerular Disease",
        "Tubulointerstitial Disease",
        "Electrolyte Disorders",
        "Acid-Base Disorders",
        "Genetic Kidney Disease"
      ],
      "step1_focus": ["Renal Physiology", "Fluid/Electrolyte Balance", "Acid-Base"],
      "step2_focus": ["Dialysis", "Transplant", "Nephrotic/Nephritic Syndromes"]
    },
    "endocrinology": {
      "display_name": "Endocrinology",
      "subtopics": [
        "Diabetes Mellitus",
        "Thyroid Disorders",
        "Adrenal Disorders",
        "Pituitary Disorders",
        "Parathyroid Disorders",
        "Gonadal Disorders",
        "Bone Metabolism"
      ],
      "step1_focus": ["Hormone Pathways", "Feedback Loops", "Receptors"],
      "step2_focus": ["Replacement Therapy", "Suppression Therapy", "Monitoring"]
    },
    "hematology_oncology": {
      "display_name": "Hematology/Oncology",
      "subtopics": [
        "Anemias",
        "Coagulation Disorders",
        "Leukemias",
        "Lymphomas",
        "Plasma Cell Disorders",
        "Solid Tumors",
        "Transfusion Medicine"
      ],
      "step1_focus": ["Coagulation Cascade", "Cell Biology", "Oncogenes"],
      "step2_focus": ["Staging", "Treatment Protocols", "Complications"]
    },
    "infectious_disease": {
      "display_name": "Infectious Disease",
      "subtopics": [
        "Bacterial Infections",
        "Viral Infections",
        "Fungal Infections",
        "Parasitic Infections",
        "HIV/AIDS",
        "Sepsis",
        "Fever of Unknown Origin"
      ],
      "step1_focus": ["Microbiology", "Antibiotic Mechanisms", "Immunology"],
      "step2_focus": ["Empiric Therapy", "Prophylaxis", "Infection Control"]
    },
    "neurology": {
      "display_name": "Neurology",
      "subtopics": [
        "Cerebrovascular Disease",
        "Seizure Disorders",
        "Neurodegenerative Disease",
        "Demyelinating Disease",
        "Neuromuscular Disease",
        "Headache",
        "Movement Disorders"
      ],
      "step1_focus": ["Neuroanatomy", "Neurophysiology", "Neurotransmitters"],
      "step2_focus": ["Imaging", "Lumbar Puncture", "Emergency Management"]
    },
    "psychiatry": {
      "display_name": "Psychiatry",
      "subtopics": [
        "Mood Disorders",
        "Anxiety Disorders",
        "Psychotic Disorders",
        "Personality Disorders",
        "Substance Use Disorders",
        "Eating Disorders",
        "Neurocognitive Disorders"
      ],
      "step1_focus": ["Neurotransmitters", "Pharmacology", "Development"],
      "step2_focus": ["Diagnosis Criteria", "Psychotherapy", "Medication Management"]
    },
    "dermatology": {
      "display_name": "Dermatology",
      "subtopics": [
        "Inflammatory Conditions",
        "Infections",
        "Neoplasms",
        "Autoimmune Disease",
        "Drug Reactions",
        "Genetic Disorders"
      ],
      "step1_focus": ["Skin Histology", "Immunology", "Pathology"],
      "step2_focus": ["Visual Diagnosis", "Biopsy", "Treatment"]
    },
    "obstetrics_gynecology": {
      "display_name": "Obstetrics & Gynecology",
      "subtopics": [
        "Prenatal Care",
        "Labor and Delivery",
        "Postpartum",
        "Gynecologic Oncology",
        "Reproductive Endocrinology",
        "Family Planning"
      ],
      "step1_focus": ["Embryology", "Hormonal Cycles", "Physiology"],
      "step2_focus": ["Complications", "Screening", "Surgical Management"]
    },
    "pediatrics": {
      "display_name": "Pediatrics",
      "subtopics": [
        "Newborn Care",
        "Growth and Development",
        "Genetic Disorders",
        "Infectious Disease",
        "Behavioral Issues",
        "Adolescent Medicine"
      ],
      "step1_focus": ["Developmental Milestones", "Congenital Anomalies", "Immunology"],
      "step2_focus": ["Vaccination", "Screening", "Age-Appropriate Care"]
    },
    "surgery": {
      "display_name": "Surgery",
      "subtopics": [
        "General Surgery",
        "Trauma",
        "Surgical Complications",
        "Preoperative Evaluation",
        "Postoperative Care",
        "Surgical Emergencies"
      ],
      "step1_focus": ["Anatomy", "Wound Healing", "Immunology"],
      "step2_focus": ["Indications", "Contraindications", "Complications"]
    },
    "musculoskeletal": {
      "display_name": "Musculoskeletal",
      "subtopics": [
        "Fractures",
        "Arthritis",
        "Muscle Disorders",
        "Bone Disorders",
        "Soft Tissue Injury",
        "Rheumatologic Disease"
      ],
      "step1_focus": ["Anatomy", "Biomechanics", "Connective Tissue"],
      "step2_focus": ["Imaging", "Physical Exam", "Rehabilitation"]
    },
    "emergency_medicine": {
      "display_name": "Emergency Medicine",
      "subtopics": [
        "Cardiac Emergencies",
        "Respiratory Emergencies",
        "Trauma",
        "Toxicology",
        "Environmental",
        "Psychiatric Emergencies"
      ],
      "step1_focus": ["Pathophysiology", "Pharmacology"],
      "step2_focus": ["Triage", "Stabilization", "Procedures"]
    }
  },
  "case_complexity": {
    "easy": {
      "description": "Classic presentation, single system involvement",
      "vital_signs": 1-2 abnormal,
      "diagnostics": 1-2 key findings,
      "distractors": "Minimal"
    },
    "medium": {
      "description": "Typical presentation with complicating factors",
      "vital_signs": 2-3 abnormal,
      "diagnostics": 2-3 key findings,
      "distractors": "Moderate"
    },
    "hard": {
      "description": "Atypical presentation, multisystem, confounders",
      "vital_signs": "Multiple abnormalities",
      "diagnostics": "Multiple findings requiring synthesis",
      "distractors": "Challenging"
    }
  }
}

FILE:references/usmle_patterns.md
# USMLE Question Patterns and Strategies

## Step 1 Question Patterns

### Mechanism-Based Questions
**Format**: "What is the most likely mechanism/pathophysiology?"

**Key Features**:
- Focus on underlying disease processes
- Require understanding of cellular/molecular mechanisms
- Often test enzyme deficiencies or receptor abnormalities

**Example**:
> A patient with hemolytic anemia and dark urine after eating fava beans...
> Question: What enzyme deficiency is responsible?
> Answer: Glucose-6-phosphate dehydrogenase (G6PD) deficiency

### Pharmacology Questions
**Format**: "What is the mechanism of action?" or "What is the most likely adverse effect?"

**Key Features**:
- Drug mechanisms at molecular level
- Side effect profiles
- Drug interactions
- Antidote knowledge

**Example**:
> A patient on chronic therapy for tuberculosis develops orange discoloration of bodily fluids...
> Question: Which medication causes this effect?
> Answer: Rifampin

### Microbiology Questions
**Format**: "What is the most likely causative organism?"

**Key Features**:
- Organism identification by:
  - Gram stain characteristics
  - Culture requirements
  - Toxin production
  - Disease associations
  - Environmental exposures

**Example**:
> A patient with lung abscess following dental aspiration...
> Question: What organism is most likely?
> Answer: Anaerobes (Fusobacterium, Prevotella, Peptostreptococcus)

### Genetics Questions
**Format**: "What is the mode of inheritance?" or "What is the risk to offspring?"

**Key Features**:
- Mendelian inheritance patterns
- Chromosomal abnormalities
- Trinucleotide repeat disorders
- Imprinting disorders

### Biochemistry Questions
**Format**: "What is the deficient enzyme?" or "What metabolic pathway is affected?"

**Key Features**:
- Enzyme deficiencies
- Metabolic pathway disruptions
- Accumulation of substrates/deficiency of products

## Step 2 CK Question Patterns

### Diagnosis Questions
**Format**: "What is the most likely diagnosis?"

**Approach**:
1. Identify key clinical features
2. Recognize pattern/syndrome
3. Eliminate distractors based on:
   - Atypical features
   - Missing key findings
   - Wrong demographics

**Red Flags for Common Diagnoses**:
- STEMI: Crushing substernal pain, diaphoresis, ST elevation
- Appendicitis: Migration of pain, RLQ tenderness, anorexia
- Pancreatitis: Epigastric pain radiating to back, elevated amylase/lipase
- DKA: Hyperglycemia, anion gap metabolic acidosis, ketones

### Next Best Step Questions
**Format**: "What is the most appropriate next step in management?"

**Priority Hierarchy**:
1. **Stabilize patient first** (ABCs)
   - Airway protection
   - Breathing support
   - Circulation/IV access

2. **Life-threatening conditions** (don't delay)
   - STEMI → PCI
   - Stroke → tPA (if eligible)
   - Sepsis → Antibiotics + fluids

3. **Definitive diagnosis** (when needed)
   - Gold standard tests
   - Biopsy for tissue diagnosis

4. **Initial screening** (when appropriate)
   - Best initial test (often less invasive)
   - Risk stratification
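
The hierarchy above can be read as a sort key over candidate answer choices. A minimal sketch, assuming hypothetical tier labels (`stabilize`, `life_threatening`, `definitive`, `screening`) that are not part of this skill's code:

```python
# Hypothetical tier labels mirroring the four-level hierarchy above;
# a lower number means a higher priority.
PRIORITY = {
    "stabilize": 1,         # ABCs
    "life_threatening": 2,  # time-critical treatment
    "definitive": 3,        # gold-standard diagnosis
    "screening": 4,         # best initial test
}


def rank_actions(actions):
    """Sort (label, tier) pairs so the highest-priority action comes first."""
    return sorted(actions, key=lambda action: PRIORITY[action[1]])


candidates = [
    ("Obtain biopsy", "definitive"),
    ("Secure airway", "stabilize"),
    ("Order D-dimer", "screening"),
    ("Start antibiotics and fluids", "life_threatening"),
]
ranked = rank_actions(candidates)
print(ranked[0][0])  # Secure airway
```

Assigning a tier to each answer choice is still a clinical judgment; the sketch only encodes the ordering between tiers.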

### Best Initial Test Questions
**Format**: "What is the best initial test?" or "What is the best initial step in diagnosis?"

**Characteristics of Best Initial Test**:
- High sensitivity
- Non-invasive or minimally invasive
- Cost-effective
- Guides further workup

**Examples**:
- Suspected PE with normal vitals → D-dimer
- Suspected DVT → Compression ultrasound
- Suspected MI → ECG + troponins
- Suspected pneumonia → Chest X-ray

### Most Accurate Test Questions
**Format**: "What is the most accurate test?" or "What is the gold standard?"

**Characteristics**:
- Highest sensitivity AND specificity
- May be invasive
- Definitive diagnosis

**Examples**:
- PE → CTPA (CT pulmonary angiography)
- Coronary artery disease → Cardiac catheterization
- GERD → 24-hour pH monitoring
- H. pylori → Endoscopy with biopsy

### Treatment Questions
**Format**: "What is the most appropriate treatment?" or "What is the drug of choice?"

**Key Considerations**:
1. First-line therapy (usually correct answer)
2. Contraindications in vignette
3. Patient factors (pregnancy, renal function, allergies)
4. Severity of condition

**Common First-Line Treatments**:
- HTN (general): Thiazide diuretic
- Type 2 DM: Metformin
- Community-acquired pneumonia: Macrolide or doxycycline
- H. pylori: Triple therapy (PPI + clarithromycin + amoxicillin)

## Answer Choice Patterns

### The "Next Step" Traps
**Wrong answers often include**:
- Tests that are too invasive for initial screening
- Treatments before diagnosis is confirmed
- Watchful waiting for serious conditions
- Secondary prevention when primary treatment needed

### The "Most Likely" Distractors
**Common wrong answers**:
- Rare presentations of common diseases
- Common presentations of rare diseases (when classic presentation fits)
- Conditions with similar symptoms but wrong demographics
- Partial matches missing key features

### The "EXCEPT" Questions
**Strategy**:
- Three answers are true/correct
- One is false/incorrect
- Read carefully - "EXCEPT" is easy to miss
- Often tests contraindications or adverse effects

## Clinical Reasoning Strategies

### 1. Identify the Diagnosis First
Before answering, be confident in the diagnosis:
- What is the classic triad/presentation?
- Do all features fit?
- What doesn't fit (rule out)?

### 2. Know Your Vital Signs
Abnormal vitals guide urgency:
- **Hypotension**: Think shock, sepsis, hemorrhage
- **Fever + hypotension**: Sepsis until proven otherwise
- **Tachycardia + hypoxia**: PE, pneumothorax, severe pneumonia
- **Hypertensive emergency**: >180/120 with end-organ damage
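
These cues can be sketched as simple rules. The numeric cutoffs below (SBP < 90 mmHg, temperature >= 100.4°F, heart rate > 100 bpm, SpO2 < 90%) are assumptions chosen for illustration and do not come from this reference; `vital_sign_flags` is a hypothetical helper, not part of `scripts/main.py`.

```python
def vital_sign_flags(temp_f, hr, sbp, dbp, spo2, end_organ_damage=False):
    """Return the urgency cues from the list above that match a set of vitals."""
    # Assumed cutoffs (not from the source): SBP < 90 mmHg, temperature
    # >= 100.4 F, heart rate > 100 bpm, SpO2 < 90%.
    hypotensive = sbp < 90
    febrile = temp_f >= 100.4
    flags = []
    if hypotensive:
        flags.append("Hypotension: think shock, sepsis, hemorrhage")
    if febrile and hypotensive:
        flags.append("Fever + hypotension: sepsis until proven otherwise")
    if hr > 100 and spo2 < 90:
        flags.append("Tachycardia + hypoxia: PE, pneumothorax, severe pneumonia")
    # ">180/120" is read here as either component exceeding its limit (assumption).
    if (sbp > 180 or dbp > 120) and end_organ_damage:
        flags.append("Hypertensive emergency")
    return flags


flags = vital_sign_flags(temp_f=102.2, hr=118, sbp=84, dbp=50, spo2=93)
for flag in flags:
    print(flag)
```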

### 3. Laboratory Pattern Recognition
Common patterns to recognize:

| Pattern | Possible Causes |
|---------|-----------------|
| High BUN/Cr ratio (>20:1) | Prerenal azotemia |
| Low BUN/Cr ratio (<15:1) | ATN, low protein intake |
| Anion gap metabolic acidosis | DKA, lactic acidosis, renal failure, toxins |
| Respiratory alkalosis | Anxiety, PE, hypoxemia, early salicylate toxicity |
| Elevated amylase/lipase | Pancreatitis |
| Elevated troponin | MI, myocarditis, PE, renal failure |
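
Two of the patterns in the table are plain arithmetic. A minimal sketch, using the table's own BUN/Cr cutoffs and the standard anion gap formula (Na minus the sum of Cl and HCO3); the function names are hypothetical, and ratios between 15 and 20 are left unclassified because the table does not cover them:

```python
def bun_cr_ratio(bun_mg_dl, cr_mg_dl):
    """Return the BUN-to-creatinine ratio (both values in mg/dL)."""
    return bun_mg_dl / cr_mg_dl


def classify_bun_cr(ratio):
    """Apply the table's cutoffs; ratios from 15 to 20 stay unclassified."""
    if ratio > 20:
        return "prerenal pattern"
    if ratio < 15:
        return "intrinsic pattern (e.g. ATN)"
    return "indeterminate"


def anion_gap(na, cl, hco3):
    """Serum anion gap in mEq/L: Na - (Cl + HCO3)."""
    return na - (cl + hco3)


ratio = bun_cr_ratio(60, 2.5)
print(ratio, classify_bun_cr(ratio))  # 24.0 prerenal pattern
print(anion_gap(138, 100, 12))        # 26
```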

### 4. Imaging Findings
Key patterns:
- **Air bronchograms**: Consolidation (pneumonia, pulmonary edema)
- **Kerley B lines**: CHF
- **Ground glass opacities**: PCP, ARDS, viral pneumonia
- **Westermark sign**: Pulmonary embolism (oligemia)
- **Hampton's hump**: Pulmonary infarction (wedge-shaped)

## Special Populations

### Pediatric Considerations
- Different vital sign normal ranges
- Developmental milestones
- Vaccination schedules
- Common pediatric conditions (RSV, croup, intussusception)

### Geriatric Considerations
- Atypical disease presentations
- Polypharmacy interactions
- Falls and functional assessment
- Delirium vs dementia

### Pregnancy Considerations
- Physiologic changes (anemia, elevated WBC)
- Contraindicated medications
- Pregnancy-specific conditions (preeclampsia, gestational diabetes)
- Radiation exposure concerns

## Time-Sensitive Emergencies

### The "Golden Windows"
- **STEMI**: Door-to-balloon <90 min, door-to-needle <30 min
- **Stroke**: tPA within 4.5 hours of symptom onset
- **Sepsis**: Antibiotics within 1 hour
- **Meningitis**: Antibiotics within 30 minutes
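
The windows above lend themselves to a lookup table. A minimal sketch, assuming hypothetical dictionary keys and treating each limit as inclusive (the list does not say how to handle the exact boundary):

```python
from datetime import timedelta

# Durations copied from the list above; the keys are hypothetical labels.
GOLDEN_WINDOWS = {
    "stemi_door_to_balloon": timedelta(minutes=90),
    "stemi_door_to_needle": timedelta(minutes=30),
    "stroke_tpa": timedelta(hours=4, minutes=30),
    "sepsis_antibiotics": timedelta(hours=1),
    "meningitis_antibiotics": timedelta(minutes=30),
}


def within_window(intervention, elapsed):
    """Return True if `elapsed` has not exceeded the window for `intervention`."""
    return elapsed <= GOLDEN_WINDOWS[intervention]


print(within_window("stroke_tpa", timedelta(hours=3)))                # True
print(within_window("stemi_door_to_balloon", timedelta(minutes=120)))  # False
```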

### Prioritization Rules
When multiple actions seem reasonable:
1. ABCs always come first
2. Don't let the patient die waiting for a test (stabilize first)
3. Treat life-threatening before diagnostic
4. Definitive therapy over supportive care

## Common USMLE Associations

### Classic Triads
- **Beck's Triad** (cardiac tamponade): Hypotension, JVD, muffled heart sounds
- **Charcot's Triad** (cholangitis): Fever, RUQ pain, jaundice
- **Reynolds' Pentad**: Charcot's + shock + confusion
- **Virchow's Triad** (thrombosis): Stasis, hypercoagulability, endothelial injury

### Eponymous Signs
- **McBurney's point**: Appendicitis
- **Murphy's sign**: Cholecystitis
- **Rovsing's sign**: Appendicitis
- **Psoas sign**: Appendicitis or diverticulitis
- **Obturator sign**: Appendicitis or pelvic abscess
- **Kernig's/Brudzinski's**: Meningitis

### Drug Side Effect Patterns
- **ACE inhibitors**: Cough, hyperkalemia, angioedema
- **Statins**: Myopathy, hepatotoxicity
- **Amiodarone**: Pulmonary fibrosis, thyroid dysfunction, blue-gray skin
- **Lithium**: Tremor, diabetes insipidus, hypothyroidism
- **Corticosteroids**: Cushing syndrome, osteoporosis, hyperglycemia, immunosuppression

## Question Analysis Checklist

Before selecting an answer:

1. ✓ What is the diagnosis?
2. ✓ Is the patient stable or unstable?
3. ✓ Is this an emergency?
4. ✓ What is being asked (diagnosis, next step, best test, treatment)?
5. ✓ Are there contraindications in the vignette?
6. ✓ Does the answer make sense with the patient's presentation?
7. ✓ Have I ruled out the distractors?

## Practice Tips

1. **Read the last line first** - Know what is being asked
2. **Generate answer before looking** - Prevents bias from options
3. **Don't add information** - Work only with what's given
4. **Watch for distractors** - Common conditions that "sort of" fit
5. **Know normal values** - Essential for interpretation
6. **Trust your instincts** - First answer is often correct

## Score Interpretation Guide

### For Self-Assessment

| Performance | Percentage | Action |
|-------------|------------|--------|
| Excellent | >85% | Ready for exam |
| Good | 70-85% | Review weak areas |
| Fair | 60-70% | Significant study needed |
| Poor | <60% | Major content review required |
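
The bands above can be expressed as a small lookup function. A minimal sketch with a hypothetical `interpret_score` helper; the table lists 70% under both Good and Fair, so assigning 70% to Good and 85% to Good (since Excellent is ">85%") is an assumption:

```python
def interpret_score(percent):
    """Map a percentage score to a (performance, action) pair from the table."""
    if percent > 85:
        return "Excellent", "Ready for exam"
    if percent >= 70:  # assumption: the shared 70% boundary counts as Good
        return "Good", "Review weak areas"
    if percent >= 60:
        return "Fair", "Significant study needed"
    return "Poor", "Major content review required"


for score in (92, 70, 65, 40):
    print(score, interpret_score(score))
```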

### NBME Score Correlation
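- Step 1: reported as pass/fail only since January 2022 (the average was ~230-232 when three-digit scores were issued)
- Average Step 2 CK: ~245-247
- Target for competitive specialties: >250 on Step 2 CK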

---

*Note: This is for educational purposes. Always consult current USMLE content outlines and official resources for exam preparation.*

FILE:requirements.txt
main

FILE:scripts/main.py
#!/usr/bin/env python3
"""
USMLE Case Generator
Generates Step 1/2 style clinical cases for medical education.
"""

import argparse
import json
import random
import sys
from datetime import datetime
from pathlib import Path


class USMLECaseGenerator:
    """Generator for USMLE-style clinical cases."""
    
    def __init__(self):
        self.case_templates = self._load_case_templates()
        self.vital_signs = self._load_vital_signs()
        self.lab_ranges = self._load_lab_ranges()
        self.demographics = self._load_demographics()
        
    def _load_case_templates(self):
        """Load case structure templates."""
        return {
            "step1": {
                "focus": "pathophysiology, mechanism, basic science",
                "question_type": "What is the most likely mechanism/pathogen/findings?",
                "style": "basic_science_focused"
            },
            "step2": {
                "focus": "diagnosis, management, next best step",
                "question_type": "What is the most likely diagnosis/next step/best initial test?",
                "style": "clinical_focused"
            }
        }
    
    def _load_vital_signs(self):
        """Load vital sign abnormalities by condition."""
        return {
            "normal": {"temp": "98.6°F", "hr": "72 bpm", "bp": "120/80 mmHg", "rr": "16/min", "o2": "98%"},
            "fever": {"temp": "101.5°F", "hr": "95 bpm", "bp": "118/75 mmHg", "rr": "18/min", "o2": "97%"},
            "shock": {"temp": "97.2°F", "hr": "120 bpm", "bp": "85/50 mmHg", "rr": "24/min", "o2": "92%"},
            "hypertensive": {"temp": "98.8°F", "hr": "78 bpm", "bp": "175/105 mmHg", "rr": "16/min", "o2": "98%"},
            "tachycardic": {"temp": "99.1°F", "hr": "115 bpm", "bp": "125/82 mmHg", "rr": "20/min", "o2": "96%"},
        }
    
    def _load_lab_ranges(self):
        """Load normal and abnormal lab values."""
        return {
            "normal": {
                "wbc": "7,500/μL",
                "hgb": "14.0 g/dL",
                "plt": "250,000/μL",
                "na": "140 mEq/L",
                "k": "4.0 mEq/L",
                "cl": "100 mEq/L",
                "co2": "24 mEq/L",
                "bun": "15 mg/dL",
                "cr": "1.0 mg/dL",
                "glucose": "90 mg/dL",
                "ph": "7.40"
            },
            "leukocytosis": {"wbc": "15,000/μL"},
            "anemia": {"hgb": "8.5 g/dL"},
            "thrombocytopenia": {"plt": "80,000/μL"},
            "hyponatremia": {"na": "128 mEq/L"},
            "hyperkalemia": {"k": "5.8 mEq/L"},
            "hypokalemia": {"k": "2.8 mEq/L"},
            "azotemia": {"bun": "45 mg/dL", "cr": "2.5 mg/dL"},
            "hyperglycemia": {"glucose": "320 mg/dL"},
            "metabolic_acidosis": {"ph": "7.28", "co2": "18 mEq/L"},
        }
    
    def _load_demographics(self):
        """Load patient demographic options."""
        return {
            "adult_male": {"age_range": (25, 75), "gender": "male"},
            "adult_female": {"age_range": (25, 75), "gender": "female"},
            "elderly_male": {"age_range": (65, 85), "gender": "male"},
            "elderly_female": {"age_range": (65, 85), "gender": "female"},
            "young_adult_male": {"age_range": (18, 35), "gender": "male"},
            "young_adult_female": {"age_range": (18, 35), "gender": "female"},
            "child": {"age_range": (5, 12), "gender": "any"},
            "adolescent": {"age_range": (13, 17), "gender": "any"},
        }
    
    def get_condition_database(self):
        """Get database of medical conditions for case generation."""
        return {
            "cardiology": [
                {
                    "name": "Acute Myocardial Infarction",
                    "step": 2,
                    "demographics": "elderly_male",
                    "chief_complaint": "Chest pain",
                    "hpi": "A {age}-year-old {gender} presents with crushing substernal chest pain radiating to the left arm and jaw, onset 2 hours ago at rest. Associated with diaphoresis, nausea, and shortness of breath.",
                    "vitals": "tachycardic",
                    "pe_findings": "Diaphoretic, anxious. Heart: tachycardic, regular rhythm. Lungs: clear bilaterally.",
                    "ecg": "ST-segment elevation in leads II, III, and aVF",
                    "troponin": "Elevated",
                    "labs": "normal",
                    "question": "What is the most appropriate next step in management?",
                    "options": [
                        "A. Administer aspirin and obtain cardiac enzymes in 6 hours",
                        "B. Give nitroglycerin and discharge home if pain resolves",
                        "C. Start heparin and arrange for emergent cardiac catheterization",
                        "D. Perform CT pulmonary angiography to rule out PE",
                        "E. Order stress echocardiography as outpatient"
                    ],
                    "correct": "C",
                    "explanation": "This patient presents with STEMI (ST-elevation MI) based on ST elevations in inferior leads (II, III, aVF). The standard of care is emergent reperfusion therapy - primary PCI (percutaneous coronary intervention) is preferred if available within 90 minutes. Heparin anticoagulation is initiated while awaiting catheterization.",
                    "learning_objectives": ["STEMI diagnosis", "Emergent reperfusion therapy", "Primary PCI indications"]
                },
                {
                    "name": "Atrial Fibrillation",
                    "step": 2,
                    "demographics": "elderly_male",
                    "chief_complaint": "Palpitations",
                    "hpi": "A {age}-year-old {gender} with history of hypertension presents with sudden onset irregular heartbeat and palpitations for 3 days. Denies chest pain but reports mild dyspnea on exertion.",
                    "vitals": "tachycardic",
                    "pe_findings": "Irregularly irregular pulse. Heart rate approximately 110 bpm. No murmurs.",
                    "ecg": "Irregularly irregular rhythm, no discernible P waves, narrow QRS complexes",
                    "labs": "normal",
                    "question": "What is the most appropriate next step in management?",
                    "options": [
                        "A. Immediate synchronized cardioversion",
                        "B. Rate control with beta-blocker and anticoagulation assessment",
                        "C. Start aspirin daily",
                        "D. Schedule outpatient Holter monitor",
                        "E. Initiate amiodarone for rhythm control"
                    ],
                    "correct": "B",
                    "explanation": "For hemodynamically stable atrial fibrillation (>48 hours duration), the first step is rate control (beta-blocker or calcium channel blocker) and assessment for anticoagulation using CHA2DS2-VASc score. Cardioversion without anticoagulation or TEE increases stroke risk.",
                    "learning_objectives": ["AFib management", "Rate vs rhythm control", "CHA2DS2-VASc score"]
                },
                {
                    "name": "Heart Failure Exacerbation",
                    "step": 2,
                    "demographics": "elderly_female",
                    "chief_complaint": "Shortness of breath",
                    "hpi": "A {age}-year-old {gender} with known heart failure presents with worsening dyspnea over 3 days, orthopnea, and bilateral leg swelling. Non-compliant with sodium restriction and medications.",
                    "vitals": "tachycardic",
                    "pe_findings": "Jugular venous distention to jaw at 45 degrees. Bilateral crackles in lung bases. 2+ pitting edema in bilateral lower extremities.",
                    "cxr": "Cardiomegaly, pulmonary edema, pleural effusions",
                    "bnp": "Elevated",
                    "labs": "normal",
                    "question": "What is the most appropriate initial therapy?",
                    "options": [
                        "A. Start ACE inhibitor and beta-blocker immediately",
                        "B. IV loop diuretics and supplemental oxygen",
                        "C. Emergent hemodialysis",
                        "D. High-dose aspirin therapy",
                        "E. Immediate cardiac catheterization"
                    ],
                    "correct": "B",
                    "explanation": "This patient presents with acute decompensated heart failure with volume overload. The initial management focuses on diuresis (IV loop diuretics) to relieve congestion and supplemental oxygen for hypoxemia. ACE inhibitors and beta-blockers are important chronic therapies but not for acute decompensation.",
                    "learning_objectives": ["Acute heart failure management", "Volume overload treatment", "Diuretic therapy"]
                }
            ],
            "pulmonology": [
                {
                    "name": "Community Acquired Pneumonia",
                    "step": 2,
                    "demographics": "adult_male",
                    "chief_complaint": "Fever and cough",
                    "hpi": "A {age}-year-old {gender} presents with 4 days of productive cough with rust-colored sputum, fever to 102°F, and pleuritic chest pain. Recently had URI symptoms.",
                    "vitals": "fever",
                    "pe_findings": "Decreased breath sounds and dullness to percussion in right lower lobe. Bronchial breath sounds present. Tachycardic.",
                    "cxr": "Right lower lobe consolidation with air bronchograms",
                    "labs": "leukocytosis",
                    "question": "What is the most likely causative organism?",
                    "options": [
                        "A. Pseudomonas aeruginosa",
                        "B. Staphylococcus aureus",
                        "C. Streptococcus pneumoniae",
                        "D. Mycoplasma pneumoniae",
                        "E. Haemophilus influenzae"
                    ],
                    "correct": "C",
                    "explanation": "Streptococcus pneumoniae is the most common cause of community-acquired pneumonia in adults, especially with classic presentation of sudden onset, high fever, productive cough with rust-colored sputum, and lobar consolidation on CXR. CURB-65 score helps determine need for hospitalization.",
                    "learning_objectives": ["CAP microbiology", "S. pneumoniae presentation", "CURB-65 criteria"]
                },
                {
                    "name": "Pulmonary Embolism",
                    "step": 2,
                    "demographics": "adult_female",
                    "chief_complaint": "Sudden dyspnea",
                    "hpi": "A {age}-year-old {gender} presents with sudden onset shortness of breath and pleuritic chest pain. Recently returned from 8-hour flight 2 days ago. History of oral contraceptive use.",
                    "vitals": "tachycardic",
                    "pe_findings": "Tachypneic, using accessory muscles. Clear lung sounds. Right calf swollen and tender.",
                    "ecg": "Sinus tachycardia, S1Q3T3 pattern",
                    "d_dimer": "Elevated",
                    "ctpa": "Filling defect in right pulmonary artery",
                    "question": "What is the most appropriate next diagnostic step?",
                    "options": [
                        "A. Start empiric heparin and obtain CT pulmonary angiography",
                        "B. Order ventilation-perfusion scan",
                        "C. Perform lower extremity Doppler ultrasound",
                        "D. Start warfarin and discharge home",
                        "E. Obtain cardiac catheterization"
                    ],
                    "correct": "A",
                    "explanation": "This patient has high pretest probability for PE (Wells criteria: tachycardia, clinical signs of DVT, immobilization, hemoptysis). The diagnostic test of choice is CT pulmonary angiography (CTPA). Empiric anticoagulation should be started if high suspicion while awaiting imaging.",
                    "learning_objectives": ["PE diagnosis", "Wells criteria", "CTPA as gold standard"]
                },
                {
                    "name": "Asthma Exacerbation",
                    "step": 1,
                    "demographics": "young_adult_female",
                    "chief_complaint": "Wheezing",
                    "hpi": "A {age}-year-old {gender} with asthma presents with worsening wheezing and dyspnea for 2 days after exposure to cat dander. Has required rescue inhaler every 2 hours.",
                    "vitals": "normal",
                    "pe_findings": "Audible wheezing bilaterally. Prolonged expiratory phase. Using accessory muscles.",
                    "pfts": "FEV1/FVC ratio 0.65, increased by 15% after bronchodilator",
                    "question": "What is the mechanism of bronchial hyperresponsiveness in asthma?",
                    "options": [
                        "A. IgE-mediated mast cell degranulation and inflammation",
                        "B. Direct bacterial infection of bronchial epithelium",
                        "C. Autoimmune destruction of alveolar walls",
                        "D. Excessive parasympathetic innervation only",
                        "E. Genetic defect in surfactant production"
                    ],
                    "correct": "A",
                    "explanation": "Asthma is characterized by chronic airway inflammation mediated by type I hypersensitivity (IgE-mediated). Exposure to allergens triggers mast cell degranulation, releasing histamine, leukotrienes, and prostaglandins causing bronchoconstriction, mucus production, and airway edema.",
                    "learning_objectives": ["Asthma pathophysiology", "Type I hypersensitivity", "Mast cell mediators"]
                }
            ],
            "gastroenterology": [
                {
                    "name": "Acute Appendicitis",
                    "step": 2,
                    "demographics": "young_adult_male",
                    "chief_complaint": "Abdominal pain",
                    "hpi": "A {age}-year-old {gender} presents with periumbilical pain that migrated to RLQ over 12 hours. Nausea, vomiting, and anorexia. Low-grade fever.",
                    "vitals": "fever",
                    "pe_findings": "Tenderness at McBurney's point. Positive rebound tenderness and guarding. Rovsing's sign positive.",
                    "wbc": "13,500 with left shift",
                    "ct": "Dilated appendix with wall enhancement and periappendiceal fat stranding",
                    "question": "What is the most appropriate next step?",
                    "options": [
                        "A. Abdominal ultrasound",
                        "B. Emergent laparoscopic appendectomy",
                        "C. IV antibiotics and observation for 24 hours",
                        "D. Colonoscopy",
                        "E. CT chest/abdomen/pelvis with contrast"
                    ],
                    "correct": "B",
                    "explanation": "Acute appendicitis is a surgical emergency. Once diagnosed clinically or confirmed with imaging, the definitive treatment is appendectomy. Delay increases risk of perforation. CT imaging (if obtained) shows the characteristic findings described.",
                    "learning_objectives": ["Appendicitis diagnosis", "Migration of pain", "Surgical management"]
                },
                {
                    "name": "Acute Pancreatitis",
                    "step": 2,
                    "demographics": "adult_male",
                    "chief_complaint": "Epigastric pain",
                    "hpi": "A {age}-year-old {gender} presents with severe epigastric pain radiating to the back after heavy alcohol use and large meal. Nausea and vomiting. History of gallstones.",
                    "vitals": "fever",
                    "pe_findings": "Severe epigastric tenderness with guarding. Decreased bowel sounds. Grey Turner's sign negative.",
                    "amylase": "850 U/L",
                    "lipase": "1,200 U/L",
                    "ct": "Enlarged pancreas with peripancreatic fat stranding",
                    "question": "What is the most common cause of acute pancreatitis in the United States?",
                    "options": [
                        "A. Hypertriglyceridemia",
                        "B. Gallstones",
                        "C. Alcohol abuse",
                        "D. Abdominal trauma",
                        "E. Medications"
                    ],
                    "correct": "B",
                    "explanation": "In the United States, gallstones are the most common cause of acute pancreatitis (40-70%), followed by alcohol abuse (25-35%). The classic presentation is epigastric pain radiating to the back with elevated amylase and lipase (>3x upper limit of normal).",
                    "learning_objectives": ["Pancreatitis etiology", "Gallstone pancreatitis", "Diagnostic criteria"]
                },
                {
                    "name": "GERD Pathophysiology",
                    "step": 1,
                    "demographics": "adult_female",
                    "chief_complaint": "Heartburn",
                    "hpi": "A {age}-year-old {gender} presents with burning retrosternal discomfort after meals and when lying down, occurring 3-4 times per week for 2 months. Worse after spicy foods.",
                    "vitals": "normal",
                    "pe_findings": "Normal physical examination",
                    "question": "What is the primary mechanism of gastroesophageal reflux disease?",
                    "options": [
                        "A. Excessive gastric acid production only",
                        "B. Incompetent lower esophageal sphincter and transient relaxations",
                        "C. H. pylori infection of the esophageal mucosa",
                        "D. Autoimmune destruction of esophageal smooth muscle",
                        "E. Excessive histamine release from gastric G cells"
                    ],
                    "correct": "B",
                    "explanation": "GERD is primarily caused by dysfunction of the lower esophageal sphincter (LES), including transient LES relaxations and decreased basal LES tone. This allows gastric contents to reflux into the esophagus. Contributing factors include hiatal hernia, obesity, and certain foods.",
                    "learning_objectives": ["GERD pathophysiology", "LES dysfunction", "Transient LES relaxations"]
                }
            ],
            "nephrology": [
                {
                    "name": "Acute Kidney Injury",
                    "step": 2,
                    "demographics": "elderly_male",
                    "chief_complaint": "Decreased urine output",
                    "hpi": "A {age}-year-old {gender} underwent elective hip surgery 2 days ago. Now with decreased urine output (300 mL in 24h) and elevated creatinine from 1.0 to 2.8 mg/dL.",
                    "vitals": "normal",
                    "pe_findings": "Dry mucous membranes. Flat neck veins. No peripheral edema.",
                    "labs": "azotemia",
                    "bun_cr_ratio": ">20:1",
                    "fena": "<1%",
                    "question": "What is the most likely cause of this patient's acute kidney injury?",
                    "options": [
                        "A. Acute tubular necrosis",
                        "B. Prerenal azotemia from hypovolemia",
                        "C. Postrenal obstruction",
                        "D. Rhabdomyolysis",
                        "E. Acute interstitial nephritis"
                    ],
                    "correct": "B",
                    "explanation": "This patient has prerenal azotemia secondary to hypovolemia from surgery and poor intake. The BUN/Cr ratio >20:1 and FENa <1% are classic findings of prerenal azotemia. Physical exam findings of volume depletion (dry mucous membranes, flat JVP) support this diagnosis.",
                    "learning_objectives": ["AKI classification", "Prerenal vs intrinsic", "FENa interpretation"]
                },
                {
                    "name": "Nephrotic Syndrome",
                    "step": 2,
                    "demographics": "child",
                    "chief_complaint": "Swelling",
                    "hpi": "A {age}-year-old child presents with periorbital edema in the morning and leg swelling. Parents report foamy urine. No recent illness.",
                    "vitals": "normal",
                    "pe_findings": "Periorbital edema, 2+ pitting edema in lower extremities. Ascites present.",
                    "urinalysis": "4+ protein, no RBCs",
                    "protein_creatinine_ratio": ">3.5",
                    "albumin": "2.1 g/dL",
                    "lipids": "Elevated cholesterol and triglycerides",
                    "question": "What is the most likely diagnosis?",
                    "options": [
                        "A. Acute post-streptococcal glomerulonephritis",
                        "B. Minimal change disease",
                        "C. IgA nephropathy",
                        "D. Rapidly progressive glomerulonephritis",
                        "E. Lupus nephritis"
                    ],
                    "correct": "B",
                    "explanation": "Minimal change disease is the most common cause of nephrotic syndrome in children (80-90% of cases <10 years). The classic tetrad of nephrotic syndrome is heavy proteinuria (>3.5 g/day), hypoalbuminemia, edema, and hyperlipidemia. Minimal change disease typically has normal appearing glomeruli on light microscopy.",
                    "learning_objectives": ["Nephrotic syndrome", "Minimal change disease", "Nephrotic vs nephritic"]
                }
            ],
            "endocrinology": [
                {
                    "name": "Diabetic Ketoacidosis",
                    "step": 2,
                    "demographics": "young_adult_female",
                    "chief_complaint": "Nausea and confusion",
                    "hpi": "A {age}-year-old {gender} with type 1 diabetes presents with nausea, vomiting, and confusion for 24 hours. Ran out of insulin 3 days ago. Also reports polyuria and polydipsia.",
                    "vitals": "tachycardic",
                    "pe_findings": "Kussmaul respirations (deep, rapid breathing). Dry mucous membranes. Fruity breath odor. Altered mental status.",
                    "labs": "hyperglycemia",
                    "ph": "7.15",
                    "bicarb": "12 mEq/L",
                    "anion_gap": "22",
                    "ketones": "Positive serum and urine",
                    "question": "What is the most appropriate initial management?",
                    "options": [
                        "A. Subcutaneous regular insulin only",
                        "B. IV fluid resuscitation followed by IV insulin drip",
                        "C. Oral hypoglycemic agents",
                        "D. Sodium bicarbonate infusion",
                        "E. Glucagon injection"
                    ],
                    "correct": "B",
                    "explanation": "DKA management priorities: 1) Fluid resuscitation with isotonic saline (often 1-2L in first hour), 2) IV regular insulin (0.1 U/kg bolus then 0.1 U/kg/hr drip), 3) Potassium replacement when K+ <5.3, 4) Bicarbonate only if pH <6.9. Glucose monitoring with dextrose added when glucose <250.",
                    "learning_objectives": ["DKA management", "Fluid resuscitation priority", "Insulin therapy protocol"]
                },
                {
                    "name": "Hyperthyroidism",
                    "step": 2,
                    "demographics": "young_adult_female",
                    "chief_complaint": "Weight loss and tremor",
                    "hpi": "A {age}-year-old {gender} presents with 15-pound weight loss over 2 months despite increased appetite, heat intolerance, palpitations, and anxiety. Also notes tremor in hands.",
                    "vitals": "tachycardic",
                    "pe_findings": "Tremor in outstretched hands. Warm, moist skin. Lid lag present. Diffuse thyroid enlargement without nodules.",
                    "tsh": "0.01 mIU/L",
                    "free_t4": "Elevated",
                    "tsi": "Elevated",
                    "question": "What is the most likely diagnosis?",
                    "options": [
                        "A. Hashimoto's thyroiditis",
                        "B. Graves' disease",
                        "C. Toxic multinodular goiter",
                        "D. Subacute thyroiditis",
                        "E. Iatrogenic hyperthyroidism"
                    ],
                    "correct": "B",
                    "explanation": "Graves' disease is the most common cause of hyperthyroidism. The combination of hyperthyroid symptoms (weight loss, heat intolerance, palpitations) with diffuse goiter and elevated TSI (thyroid-stimulating immunoglobulins) is diagnostic. Eye findings (lid lag, exophthalmos) may also be present.",
                    "learning_objectives": ["Graves' disease diagnosis", "TSI antibodies", "Hyperthyroidism workup"]
                }
            ],
            "infectious_disease": [
                {
                    "name": "Meningitis",
                    "step": 2,
                    "demographics": "young_adult_male",
                    "chief_complaint": "Headache and fever",
                    "hpi": "A {age}-year-old {gender} presents with sudden onset severe headache, fever to 103°F, and neck stiffness. Photophobia and confusion noted by roommate.",
                    "vitals": "fever",
                    "pe_findings": "Nuchal rigidity. Positive Kernig's sign. Positive Brudzinski's sign. Altered mental status.",
                    "wbc": "15,000 with left shift",
                    "lp": "CSF: Elevated WBC, low glucose, high protein, gram-positive diplococci",
                    "question": "What is the most appropriate empiric treatment?",
                    "options": [
                        "A. Vancomycin + Ceftriaxone + Ampicillin",
                        "B. Azithromycin monotherapy",
                        "C. Metronidazole alone",
                        "D. Amoxicillin only",
                        "E. Ciprofloxacin"
                    ],
                    "correct": "A",
                    "explanation": "Empiric therapy for suspected bacterial meningitis in adults (18-50): Vancomycin (for possible resistant pneumococcus) + Ceftriaxone (or cefotaxime) + Ampicillin (for Listeria coverage in adults >50 or immunocompromised). Dexamethasone should be given before or with first antibiotic dose.",
                    "learning_objectives": ["Meningitis empiric therapy", "CSF analysis", "Kernig/Brudzinski signs"]
                },
                {
                    "name": "HIV/AIDS Opportunistic Infection",
                    "step": 2,
                    "demographics": "adult_male",
                    "chief_complaint": "Dyspnea",
                    "hpi": "A {age}-year-old {gender} with HIV (last CD4 unknown, not on HAART) presents with 2 weeks of progressive dyspnea, nonproductive cough, and low-grade fever.",
                    "vitals": "fever",
                    "pe_findings": "Tachypneic. Decreased oxygen saturation to 88% on room air. Clear lung sounds bilaterally.",
                    "cd4": "45 cells/μL",
                    "cxr": "Bilateral diffuse interstitial infiltrates",
                    "abg": "PaO2 58 mmHg, increased A-a gradient",
                    "question": "What is the most likely diagnosis?",
                    "options": [
                        "A. Tuberculosis",
                        "B. Pneumocystis jirovecii pneumonia (PCP)",
                        "C. Bacterial community-acquired pneumonia",
                        "D. Kaposi's sarcoma",
                        "E. CMV pneumonitis"
                    ],
                    "correct": "B",
                    "explanation": "PCP is the most common opportunistic infection in AIDS with CD4 <200. The classic presentation is subacute dyspnea, nonproductive cough, fever, hypoxemia with normal or minimal auscultatory findings, and diffuse interstitial infiltrates on CXR. Diagnosis confirmed by silver stain of induced sputum or BAL.",
                    "learning_objectives": ["PCP presentation", "AIDS opportunistic infections", "CD4 thresholds"]
                }
            ],
            "neurology": [
                {
                    "name": "Acute Ischemic Stroke",
                    "step": 2,
                    "demographics": "elderly_male",
                    "chief_complaint": "Weakness",
                    "hpi": "A {age}-year-old {gender} with atrial fibrillation presents with sudden onset right-sided weakness and difficulty speaking. Last known well 2 hours ago.",
                    "vitals": "hypertensive",
                    "pe_findings": "Right facial droop, right arm and leg weakness (3/5), expressive aphasia present. Left gaze preference.",
                    "nihss": "12",
                    "ct_head": "No hemorrhage, early ischemic changes in left MCA territory",
                    "question": "What is the most appropriate next step?",
                    "options": [
                        "A. Start aspirin immediately",
                        "B. Administer IV tPA (thrombolytics)",
                        "C. Anticoagulate with heparin",
                        "D. Perform CT angiography only and observe",
                        "E. Aspirin and clopidogrel dual therapy"
                    ],
                    "correct": "B",
                    "explanation": "For acute ischemic stroke presenting within 4.5 hours of symptom onset, IV tPA (alteplase) is indicated if no contraindications (normal CT ruling out hemorrhage, BP <185/110, no recent surgery/ bleeding). Mechanical thrombectomy may be considered for large vessel occlusion.",
                    "learning_objectives": ["Acute stroke management", "tPA indications", "Time window for thrombolysis"]
                },
                {
                    "name": "Multiple Sclerosis",
                    "step": 1,
                    "demographics": "young_adult_female",
                    "chief_complaint": "Vision loss",
                    "hpi": "A {age}-year-old {gender} presents with painful vision loss in right eye over 3 days. Had episode of leg weakness 6 months ago that resolved spontaneously. Lives in northern climate.",
                    "vitals": "normal",
                    "pe_findings": "Decreased visual acuity OD. Afferent pupillary defect (Marcus Gunn pupil) on right. Internuclear ophthalmoplegia noted on left gaze.",
                    "mri_brain": "Multiple periventricular white matter lesions, ovoid, perpendicular to ventricles (Dawson's fingers)",
                    "csf": "Oligoclonal bands present",
                    "question": "What is the pathophysiology of this disease?",
                    "options": [
                        "A. Autoimmune demyelination of CNS white matter",
                        "B. Direct viral infection of oligodendrocytes",
                        "C. Vasculitic occlusion of cerebral vessels",
                        "D. Autoantibody attack on neuromuscular junction",
                        "E. Progressive neuronal apoptosis"
                    ],
                    "correct": "A",
                    "explanation": "MS is an autoimmune demyelinating disease of the CNS characterized by T-cell mediated attack on myelin basic protein. Features include dissemination in time and space (multiple episodes, multiple locations), periventricular white matter lesions (Dawson's fingers), oligoclonal bands in CSF, and association with HLA-DR2 and vitamin D deficiency.",
                    "learning_objectives": ["MS pathophysiology", "Demyelination", "McDonald criteria"]
                }
            ]
        }
    
    def generate_case(self, step=2, topic=None, condition=None, difficulty="medium", include_answer=True):
        """Generate a USMLE-style clinical case."""
        conditions_db = self.get_condition_database()
        
        # Filter by topic if specified
        if topic and topic.lower() in conditions_db:
            available_conditions = conditions_db[topic.lower()]
        else:
            # Use all conditions
            available_conditions = []
            for topic_conditions in conditions_db.values():
                available_conditions.extend(topic_conditions)
        
        # Filter by step
        available_conditions = [c for c in available_conditions if c.get("step", 2) == step]
        
        if not available_conditions:
            return self._generate_generic_case(step, topic)
        
        # Select condition
        if condition:
            selected = next((c for c in available_conditions if condition.lower() in c["name"].lower()), None)
            if not selected:
                selected = random.choice(available_conditions)
        else:
            selected = random.choice(available_conditions)
        
        return self._format_case(selected, include_answer)
    
    def _format_case(self, condition, include_answer=True):
        """Format a case with demographics filled in."""
        # Get demographics
        demo_key = condition.get("demographics", "adult_male")
        demo = self.demographics.get(demo_key, self.demographics["adult_male"])
        
        age = random.randint(demo["age_range"][0], demo["age_range"][1])
        gender = demo["gender"] if demo["gender"] != "any" else random.choice(["male", "female"])
        
        # Fill in HPI
        hpi = condition.get("hpi", "").format(age=age, gender=gender)
        
        # Build case
        case = {
            "title": f"Case: {condition['name']}",
            "metadata": {
                "condition": condition['name'],
                "step": condition.get('step', 2),
                "topic": self._get_topic_for_condition(condition['name']),
                "difficulty": condition.get('difficulty', 'medium'),
                "generated_at": datetime.now().isoformat()
            },
            "case_content": {
                "chief_complaint": condition.get('chief_complaint', 'Unknown'),
                "history_of_present_illness": hpi,
                "vital_signs": self._get_vitals(condition.get('vitals', 'normal')),
                "physical_examination": condition.get('pe_findings', 'Normal'),
                "diagnostic_studies": self._get_diagnostics(condition),
                "laboratory_results": self._get_labs(condition.get('labs', 'normal'))
            },
            "question": {
                "text": condition.get('question', 'What is the diagnosis?'),
                "options": condition.get('options', ['A', 'B', 'C', 'D', 'E'])
            }
        }
        
        if include_answer:
            case["answer"] = {
                "correct": condition.get('correct', 'A'),
                "explanation": condition.get('explanation', ''),
                "learning_objectives": condition.get('learning_objectives', [])
            }
        
        return case
    
    def _get_vitals(self, vitals_key):
        """Get vital signs for a case."""
        return self.vital_signs.get(vitals_key, self.vital_signs["normal"])
    
    def _get_labs(self, labs_key):
        """Get lab values for a case."""
        base_labs = self.lab_ranges["normal"].copy()
        if labs_key in self.lab_ranges:
            base_labs.update(self.lab_ranges[labs_key])
        return base_labs
    
    def _get_diagnostics(self, condition):
        """Get diagnostic study results."""
        # Map condition keys to display labels. 'ct_head' and 'mri_brain' are
        # the keys the neurology entries use, so they are listed alongside the
        # generic 'ct' and 'mri' keys; otherwise those studies are never shown.
        study_labels = {
            'ecg': 'ECG',
            'cxr': 'Chest X-ray',
            'ct': 'CT',
            'ct_head': 'CT Head',
            'mri': 'MRI',
            'mri_brain': 'MRI Brain',
            'lp': 'Lumbar Puncture',
            'pfts': 'Pulmonary Function Tests',
        }
        return {label: condition[key] for key, label in study_labels.items() if key in condition}
    
    def _get_topic_for_condition(self, condition_name):
        """Get topic category for a condition."""
        db = self.get_condition_database()
        for topic, conditions in db.items():
            if any(c['name'] == condition_name for c in conditions):
                return topic
        return "general"
    
    def _generate_generic_case(self, step, topic):
        """Generate a generic case when specific one not available."""
        return {
            "title": "Generic Clinical Case",
            "metadata": {
                "step": step,
                "topic": topic or "general",
                "generated_at": datetime.now().isoformat()
            },
            "case_content": {
                "chief_complaint": "Patient presents with symptoms",
                "history_of_present_illness": "Please specify a condition or topic for a detailed case.",
                "vital_signs": self.vital_signs["normal"],
                "physical_examination": "Normal",
                "diagnostic_studies": {},
                "laboratory_results": self.lab_ranges["normal"]
            },
            "question": {
                "text": "What would you like to assess?",
                "options": ["A", "B", "C", "D", "E"]
            }
        }
    
    def format_output(self, case, format_type="text"):
        """Format case for output."""
        if format_type == "json":
            return json.dumps(case, indent=2)
        elif format_type == "markdown":
            return self._format_markdown(case)
        else:
            return self._format_text(case)
    
    def _format_text(self, case):
        """Format case as plain text."""
        content = case["case_content"]
        lines = [
            "=" * 60,
            case["title"],
            "=" * 60,
            "",
            f"CHIEF COMPLAINT: {content['chief_complaint']}",
            "",
            "HISTORY OF PRESENT ILLNESS:",
            content['history_of_present_illness'],
            "",
            "VITAL SIGNS:",
            f"  Temperature: {content['vital_signs']['temp']}",
            f"  Heart Rate: {content['vital_signs']['hr']}",
            f"  Blood Pressure: {content['vital_signs']['bp']}",
            f"  Respiratory Rate: {content['vital_signs']['rr']}",
            f"  O2 Saturation: {content['vital_signs']['o2']}",
            "",
            "PHYSICAL EXAMINATION:",
            content['physical_examination'],
            "",
            "DIAGNOSTIC STUDIES:"
        ]
        
        if content['diagnostic_studies']:
            for study, result in content['diagnostic_studies'].items():
                lines.append(f"  {study}: {result}")
        else:
            lines.append("  None")
        
        lines.extend([
            "",
            "LABORATORY RESULTS:",
        ])
        
        for lab, value in content['laboratory_results'].items():
            lines.append(f"  {lab.upper()}: {value}")
        
        lines.extend([
            "",
            "-" * 60,
            "QUESTION:",
            case["question"]["text"],
            "",
        ])
        
        for option in case["question"]["options"]:
            lines.append(option)
        
        if "answer" in case:
            lines.extend([
                "",
                "-" * 60,
                f"CORRECT ANSWER: {case['answer']['correct']}",
                "",
                "EXPLANATION:",
                case['answer']['explanation'],
                "",
                "LEARNING OBJECTIVES:",
            ])
            for obj in case['answer']['learning_objectives']:
                lines.append(f"  - {obj}")
        
        lines.append("=" * 60)
        
        return "\n".join(lines)
    
    def _format_markdown(self, case):
        """Format case as markdown."""
        content = case["case_content"]
        lines = [
            f"# {case['title']}",
            "",
            "## Case Information",
            f"- **Condition**: {case['metadata']['condition']}",
            f"- **USMLE Step**: {case['metadata']['step']}",
            f"- **Topic**: {case['metadata']['topic']}",
            "",
            "## Chief Complaint",
            content['chief_complaint'],
            "",
            "## History of Present Illness",
            content['history_of_present_illness'],
            "",
            "## Vital Signs",
            f"| Parameter | Value |",
            f"|-----------|-------|",
            f"| Temperature | {content['vital_signs']['temp']} |",
            f"| Heart Rate | {content['vital_signs']['hr']} |",
            f"| Blood Pressure | {content['vital_signs']['bp']} |",
            f"| Respiratory Rate | {content['vital_signs']['rr']} |",
            f"| O2 Saturation | {content['vital_signs']['o2']} |",
            "",
            "## Physical Examination",
            content['physical_examination'],
            "",
            "## Diagnostic Studies",
        ]
        
        if content['diagnostic_studies']:
            for study, result in content['diagnostic_studies'].items():
                lines.append(f"- **{study}**: {result}")
        else:
            lines.append("None")
        
        lines.extend([
            "",
            "## Laboratory Results",
        ])
        
        for lab, value in content['laboratory_results'].items():
            lines.append(f"- **{lab.upper()}**: {value}")
        
        lines.extend([
            "",
            "## Question",
            case["question"]["text"],
            "",
        ])
        
        for option in case["question"]["options"]:
            lines.append(option)
        
        if "answer" in case:
            lines.extend([
                "",
                "---",
                "",
                f"## Answer: **{case['answer']['correct']}**",
                "",
                "### Explanation",
                case['answer']['explanation'],
                "",
                "### Learning Objectives",
            ])
            for obj in case['answer']['learning_objectives']:
                lines.append(f"- {obj}")
        
        return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser(description="Generate USMLE-style clinical cases")
    parser.add_argument("--step", type=int, choices=[1, 2], default=2, help="USMLE Step level")
    parser.add_argument("--topic", type=str, help="Medical topic (cardiology, pulmonology, etc.)")
    parser.add_argument("--condition", type=str, help="Specific medical condition")
    parser.add_argument("--difficulty", type=str, choices=["easy", "medium", "hard"], default="medium",
                        help="Difficulty level (accepted but not currently used to filter cases)")
    parser.add_argument("--format", type=str, choices=["text", "json", "markdown"], default="text")
    parser.add_argument("--include-diagnosis", action="store_true", help="Include answer key")
    parser.add_argument("--count", type=int, default=1, help="Number of cases to generate")
    
    args = parser.parse_args()
    
    generator = USMLECaseGenerator()
    
    for i in range(args.count):
        case = generator.generate_case(
            step=args.step,
            topic=args.topic,
            condition=args.condition,
            difficulty=args.difficulty,
            include_answer=args.include_diagnosis
        )
        
        output = generator.format_output(case, args.format)
        print(output)
        
        if i < args.count - 1:
            print("\n" + "=" * 60 + "\n")
    
    return 0


if __name__ == "__main__":
    sys.exit(main())

Translational Gap Analyzer

Skill

Assess translational gaps between preclinical models and human diseases.

---
name: translational-gap-analyzer
description: Assess translational gaps between preclinical models and human diseases.
license: MIT
skill-author: AIPOCH
---
# Translational Gap Analyzer

**ID**: 209

## When to Use

- Use this skill when the task is to assess translational gaps between preclinical models and human diseases.
- Use this skill for evidence insight tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

- Scope-focused workflow for assessing translational gaps between preclinical models and human diseases.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

- Python 3.8+
- Built-in libraries: argparse, json, sys

## Example Usage

See `## Usage` below for related details.

```bash
cd "20260318/scientific-skills/Evidence Insight/translational-gap-analyzer"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
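
Step 4 asks for a result that keeps assumptions, deliverables, risks, and unresolved items apart. A minimal sketch of one way to hold that shape in Python follows. The `SkillResult` class, its field names, and the sample strings are hypothetical and are not taken from `scripts/main.py`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SkillResult:
    """Hypothetical container for the four parts named in workflow step 4."""
    assumptions: List[str] = field(default_factory=list)
    deliverables: List[str] = field(default_factory=list)
    risks: List[str] = field(default_factory=list)
    unresolved: List[str] = field(default_factory=list)
    blocked_by: str = ""  # filled in only on the fallback path (step 5)

    def to_json(self):
        """Serialize every field, including empty ones, so the shape is stable."""
        return json.dumps(asdict(self), indent=2)


# Illustrative values only.
result = SkillResult(
    assumptions=["Disease was given as free text, not a MeSH ID"],
    deliverables=["gap_report.json"],
    risks=["Scores were not reviewed by a domain expert"],
)
print(result.to_json())
```

Emitting empty lists rather than dropping them makes it obvious to a reviewer that a section was considered and found empty.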

## Description

Assesses the "translational gap" between basic research models (such as mice, zebrafish, cell lines) and human diseases, providing early warning of clinical translation failure risks. This system helps researchers identify potential translational barriers in preclinical research and improve clinical trial success rates through multi-dimensional analysis.

## Capabilities

- Evaluates anatomical/physiological differences between models and humans
- Analyzes pathological similarity of disease models
- Identifies interspecies differences in molecular pathways
- Evaluates pharmacokinetic differences
- Provides early warning of clinical trial failure risk factors
- Provides improvement recommendations to increase translation success rates

## Usage

```bash
# Full assessment report
python scripts/main.py --model <model_type> --disease <disease_name> --full

# Quick risk assessment
python scripts/main.py --model <model_type> --disease <disease_name> --quick

# Compare multiple models
python scripts/main.py --models mouse,rat,primate --disease <disease_name> --compare

# Specify focus areas
python scripts/main.py --model mouse --disease "Alzheimer's" --focus metabolism,immune
```

## Arguments

| Argument | Description | Required |
|----------|-------------|----------|
| `--model` | Model type (mouse, rat, zebrafish, cell_line, organoid, primate) | Yes (unless --models) |
| `--models` | Models to compare, comma-separated (used with `--compare`) | No |
| `--disease` | Disease name or MeSH ID | Yes |
| `--focus` | Focus areas, comma-separated (anatomy, physiology, metabolism, immune, genetics, behavior) | No |
| `--full` | Generate full assessment report | No |
| `--quick` | Quick risk assessment mode | No |
| `--compare` | Multi-model comparison mode | No |
| `--output` | Output file path | No |
| `--format` | Output format (json, markdown, table) | No |
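
The argument table can be expressed as an `argparse` parser. The sketch below is an illustration only: `scripts/main.py` is not reproduced in this document, so the helper `csv_list`, the constant names, the mutual exclusion between `--full`, `--quick`, and `--compare`, and the `json` default for `--format` are all assumptions.

```python
import argparse

MODEL_TYPES = ["mouse", "rat", "zebrafish", "cell_line", "organoid", "primate"]
FOCUS_AREAS = ["anatomy", "physiology", "metabolism", "immune", "genetics", "behavior"]


def csv_list(allowed):
    """Build an argparse type that splits a comma-separated value and validates each item."""
    def parse(value):
        items = [item.strip() for item in value.split(",") if item.strip()]
        unknown = [item for item in items if item not in allowed]
        if unknown:
            raise argparse.ArgumentTypeError(f"unknown value(s): {', '.join(unknown)}")
        return items
    return parse


def build_parser():
    parser = argparse.ArgumentParser(description="Assess translational gaps (illustrative sketch)")
    # --model and --models are alternatives; exactly one must be given.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--model", choices=MODEL_TYPES)
    target.add_argument("--models", type=csv_list(MODEL_TYPES))
    parser.add_argument("--disease", required=True, help="Disease name or MeSH ID")
    parser.add_argument("--focus", type=csv_list(FOCUS_AREAS), default=[])
    # Assumption: the three modes exclude each other; the source does not say so.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--full", action="store_true")
    mode.add_argument("--quick", action="store_true")
    mode.add_argument("--compare", action="store_true")
    parser.add_argument("--output", help="Output file path")
    parser.add_argument("--format", choices=["json", "markdown", "table"], default="json")
    return parser


args = build_parser().parse_args(
    ["--model", "mouse", "--disease", "Alzheimer's", "--focus", "metabolism,immune"]
)
print(args.model, args.disease, args.focus)
```

Validating `--focus` and `--models` at parse time means a typo such as `--focus imune` fails with a usage error before any analysis starts.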

## Example Output

```json
{
  "model": "mouse",
  "disease": "Alzheimer's Disease",
  "overall_gap_score": 6.8,
  "risk_level": "HIGH",
  "dimensions": {
    "genetics": {"score": 8.5, "concerns": ["APOE4 differences", "Different tau pathology patterns"]},
    "physiology": {"score": 7.0, "concerns": ["Brain structure differences", "Lifespan differences"]},
    "metabolism": {"score": 6.5, "concerns": ["Significant drug metabolism differences"]},
    "immune": {"score": 5.5, "concerns": ["Microglia functional differences", "Different neuroinflammation patterns"]},
    "behavior": {"score": 6.0, "concerns": ["Limitations in cognitive assessment methods"]}
  },
  "clinical_failure_predictors": [
    "Immune-related mechanism research may not translate",
    "Drug clearance rate differences may lead to inappropriate dosing"
  ],
  "recommendations": [
    "Consider using humanized mouse models",
    "Add non-human primate validation experiments",
    "Focus on peripheral immune and central immune interactions"
  ]
}
```
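
A report in this shape can be read back with the standard `json` module. The sketch below ranks the dimensions from largest gap to smallest. `rank_dimensions` is a hypothetical helper, and the embedded JSON is a trimmed copy of the example with the `concerns` lists omitted.

```python
import json

# Trimmed copy of the example report (scores only).
report_json = """
{
  "model": "mouse",
  "disease": "Alzheimer's Disease",
  "overall_gap_score": 6.8,
  "risk_level": "HIGH",
  "dimensions": {
    "genetics": {"score": 8.5},
    "physiology": {"score": 7.0},
    "metabolism": {"score": 6.5},
    "immune": {"score": 5.5},
    "behavior": {"score": 6.0}
  }
}
"""


def rank_dimensions(report):
    """Return (dimension, score) pairs sorted from largest gap to smallest."""
    pairs = ((name, entry["score"]) for name, entry in report["dimensions"].items())
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


report = json.loads(report_json)
ranking = rank_dimensions(report)
for name, score in ranking:
    print(f"{name:<12}{score:>5.1f}")
```

For this report the first entry is `genetics` (8.5) and the last is `immune` (5.5).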

## Model Types

### Common Models

| Model | Applicable Scenarios | Typical Gaps |
|------|----------|----------|
| mouse | Genetic manipulation, basic research | Immune, metabolism, brain structure |
| rat | Behavioral studies, cardiovascular | Cognition, drug metabolism |
| zebrafish | Development, high-throughput screening | Anatomy, physiology |
| cell_line | Molecular mechanisms | Microenvironment, systemic |
| organoid | Human-specific research | Maturity, vascularization |
| primate | Preclinical validation | Cost, ethics |

## Gap Scoring System

- **0-3**: Low gap, good translation prospects
- **4-6**: Moderate gap, requires additional validation
- **7-8**: High gap, significant translation risks exist
- **9-10**: Extremely high gap, low translation likelihood
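The bands above describe how to read a score. The `risk_level` label in the report is assigned separately by `determine_risk_level()` in `scripts/main.py`, whose cut-offs (8.0, 6.5, 4.0) do not line up exactly with these bands. The sketch below mirrors those cut-offs to show where a given score lands; the standalone function name `risk_level` is illustrative only.

```python
def risk_level(overall_score: float) -> str:
    """Mirror of the cut-offs used by determine_risk_level() in scripts/main.py."""
    if overall_score >= 8.0:
        return "CRITICAL"
    if overall_score >= 6.5:
        return "HIGH"
    if overall_score >= 4.0:
        return "MEDIUM"
    return "LOW"


# 6.8 is the overall_gap_score from the example output.
for score in (2.5, 5.0, 6.8, 8.4):
    print(score, risk_level(score))
```

With these cut-offs, the example's overall score of 6.8 is labeled HIGH even though it sits just below the 7-8 band.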

## Files

- `SKILL.md` - This file
- `scripts/main.py` - Main analysis script

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python script (`scripts/main.py`) executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
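The two path-related items in the checklist can be enforced with a small guard. This is a sketch under stated assumptions, not something the bundled script currently does: the helper name `resolve_in_workspace` and the workspace location are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_in_workspace(user_path: str, workspace: str) -> Path:
    """Resolve user_path inside workspace, rejecting anything that escapes it."""
    root = Path(workspace).resolve()
    candidate = (root / user_path).resolve()
    # After resolution, "../" segments and absolute paths point outside root.
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"Path escapes workspace: {user_path}")
    return candidate


# Hypothetical workspace directory, used only for illustration.
workspace = str(Path(tempfile.gettempdir()) / "gap-analyzer-workspace")
print(resolve_in_workspace("reports/mouse.json", workspace))

try:
    resolve_in_workspace("../outside.json", workspace)
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Resolving both paths before comparing them means symlinks and `..` segments are normalized first, so the check does not rely on string prefixes.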

## Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.

## Input Validation

This skill accepts requests that match the documented purpose of `translational-gap-analyzer` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `translational-gap-analyzer` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
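The validation rule above can be sketched as a pre-flight check. This is a minimal illustration: `missing_inputs` and `validate_request` are hypothetical names, and treating an unsupported model as a missing input is an assumption (the bundled script itself falls back to default scores for unknown models).

```python
from typing import List, Optional

# Model names taken from the Model Types table.
SUPPORTED_MODELS = {"mouse", "rat", "zebrafish", "cell_line", "organoid", "primate"}

REFUSAL = (
    "`translational-gap-analyzer` only handles its documented workflow. "
    "Please provide the missing required inputs or switch to a more suitable skill."
)


def missing_inputs(model: Optional[str], disease: Optional[str]) -> List[str]:
    """List the required fields that are absent or unusable."""
    missing = []
    if not model or model.strip().lower() not in SUPPORTED_MODELS:
        missing.append("model (one of: " + ", ".join(sorted(SUPPORTED_MODELS)) + ")")
    if not disease or not disease.strip():
        missing.append("disease")
    return missing


def validate_request(model: Optional[str], disease: Optional[str]) -> str:
    """Return "OK", or the refusal message naming exactly what is missing."""
    missing = missing_inputs(model, disease)
    if missing:
        return REFUSAL + " Missing: " + "; ".join(missing)
    return "OK"


print(validate_request("mouse", "Alzheimer's"))
print(validate_request("hamster", ""))
```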

## References

- [references/audit-reference.md](references/audit-reference.md) - Supported scope, audit commands, and fallback boundaries

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
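One way to keep responses in this fixed order is to render them from a mapping. The sketch below is illustrative only: `render_response` and the "Not provided" placeholder are assumptions, not part of the skill.

```python
SECTIONS = [
    "Objective",
    "Inputs Received",
    "Assumptions",
    "Workflow",
    "Deliverable",
    "Risks and Limits",
    "Next Checks",
]


def render_response(content: dict) -> str:
    """Render the seven sections in order, marking any that were not supplied."""
    lines = []
    for number, title in enumerate(SECTIONS, 1):
        body = content.get(title, "Not provided")
        lines.append(f"{number}. {title}: {body}")
    return "\n".join(lines)


print(render_response({
    "Objective": "Assess mouse-to-human translational gap for Alzheimer's disease",
    "Assumptions": "Default dimensions; no --focus filter",
}))
```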

FILE:references/audit-reference.md
# Audit Reference

## Scope

- Skill: `translational-gap-analyzer`
- Core purpose: Assess translational gaps between preclinical models and human diseases.
- Use only within the documented workflow and category boundary defined in `SKILL.md`

## Supported Audit Paths

- `python -m py_compile scripts/main.py`
- `python scripts/main.py --help`

## Fallback Boundary

If required inputs are incomplete, the skill should still return:

- the missing required inputs
- the steps that can still be completed safely
- assumptions that need confirmation before execution
- the next checks before accepting the final deliverable

FILE:requirements.txt
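# No third-party packages are required: dataclasses and enum are part of the
# Python 3.7+ standard library and should not be installed from PyPI.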

FILE:scripts/main.py
#!/usr/bin/env python3
"""Translational Gap Analyzer (ID: 209)
Assessing the translational gap between basic research models and human disease"""

import argparse
import json
import sys
from dataclasses import dataclass, asdict
from typing import Any, Dict, List, Optional
from enum import Enum


class RiskLevel(Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"
    CRITICAL = "CRITICAL"


@dataclass
class DimensionScore:
    """Score and concerns for a single dimension"""
    score: float  # 0-10; higher values mean a larger gap
    concerns: List[str]
    details: Dict[str, Any]


@dataclass
class GapAnalysisReport:
    """Translational gap analysis report"""
    model: str
    disease: str
    overall_gap_score: float
    risk_level: str
    dimensions: Dict[str, Dict[str, Any]]  # serialized DimensionScore: score, concerns, details
    clinical_failure_predictors: List[str]
    recommendations: List[str]


# Knowledge Base: Model-Disease Gap Data
TRANSLATIONAL_KNOWLEDGE = {
    "mouse": {
        "anatomy": {
            "brain_structure": 7.5,
            "organ_size_ratio": 6.0,
            "vascular_pattern": 7.0,
            "immune_organs": 5.5
        },
        "physiology": {
            "lifespan": 9.0,
            "heart_rate": 8.0,
            "body_temperature": 7.5,
            "reproductive_cycle": 8.5
        },
        "metabolism": {
            "drug_clearance": 8.0,
            "cytochrome_p450": 7.5,
            "glucose_metabolism": 6.5,
            "lipid_metabolism": 6.0
        },
        "immune": {
            "innate_immunity": 6.0,
            "adaptive_immunity": 5.5,
            "cytokine_profile": 7.0,
            "microglia_function": 8.5
        },
        "genetics": {
            "gene_conservation": 4.0,
            "regulatory_elements": 7.0,
            "chromosome_structure": 6.5,
            "splice_variants": 7.5
        },
        "behavior": {
            "cognitive_assessment": 8.0,
            "social_behavior": 7.5,
            "motor_function": 5.0,
            "pain_perception": 7.0
        }
    },
    "rat": {
        "anatomy": {
            "brain_structure": 6.5,
            "organ_size_ratio": 5.5,
            "vascular_pattern": 6.0,
            "immune_organs": 5.0
        },
        "physiology": {
            "lifespan": 8.5,
            "heart_rate": 7.5,
            "body_temperature": 7.0,
            "reproductive_cycle": 7.5
        },
        "metabolism": {
            "drug_clearance": 7.5,
            "cytochrome_p450": 7.0,
            "glucose_metabolism": 6.0,
            "lipid_metabolism": 5.5
        },
        "immune": {
            "innate_immunity": 5.5,
            "adaptive_immunity": 5.0,
            "cytokine_profile": 6.5,
            "microglia_function": 7.5
        },
        "genetics": {
            "gene_conservation": 4.5,
            "regulatory_elements": 6.5,
            "chromosome_structure": 6.0,
            "splice_variants": 7.0
        },
        "behavior": {
            "cognitive_assessment": 7.0,
            "social_behavior": 6.5,
            "motor_function": 4.5,
            "pain_perception": 6.5
        }
    },
    "zebrafish": {
        "anatomy": {
            "brain_structure": 8.5,
            "organ_structure": 8.0,
            "vascular_pattern": 7.0,
            "immune_organs": 8.5
        },
        "physiology": {
            "water_vs_air": 9.0,
            "temperature_regulation": 9.0,
            "reproduction": 8.5,
            "regeneration": 7.0
        },
        "metabolism": {
            "drug_metabolism": 8.5,
            "energy_metabolism": 7.5,
            "xenobiotic_metabolism": 8.0
        },
        "immune": {
            "adaptive_immunity": 9.0,
            "innate_immunity": 7.0,
            "inflammation": 7.5
        },
        "genetics": {
            "gene_duplication": 8.5,
            "conservation": 6.0,
            "regeneration_genes": 6.5
        },
        "behavior": {
            "cognitive_assessment": 8.5,
            "social_behavior": 8.0,
            "learning": 8.0
        }
    },
    "cell_line": {
        "anatomy": {
            "tissue_architecture": 9.5,
            "cell_interactions": 9.0,
            "extracellular_matrix": 8.5,
            "3d_structure": 9.0
        },
        "physiology": {
            "systemic_regulation": 10.0,
            "homeostasis": 9.5,
            "stress_response": 7.0
        },
        "metabolism": {
            "culture_medium_effects": 8.5,
            "oxygen_levels": 8.0,
            "nutrient_availability": 7.5
        },
        "immune": {
            "immune_interactions": 9.5,
            "inflammation_context": 9.0
        },
        "genetics": {
            "genetic_drift": 8.0,
            "passage_effects": 8.5,
            "mutation_accumulation": 8.0
        }
    },
    "organoid": {
        "anatomy": {
            "maturation": 7.5,
            "vascularization": 9.0,
            "size_limitations": 8.5,
            "cell_diversity": 6.5
        },
        "physiology": {
            "systemic_integration": 9.5,
            "maturity_level": 7.5,
            "functionality": 7.0
        },
        "metabolism": {
            "nutrient_diffusion": 8.0,
            "waste_removal": 8.5,
            "metabolic_activity": 6.5
        },
        "immune": {
            "immune_component": 8.5,
            "inflammation_modeling": 7.5
        },
        "genetics": {
            "genetic_stability": 6.0,
            "patient_variability": 5.5
        }
    },
    "primate": {
        "anatomy": {
            "brain_structure": 2.5,
            "organ_structure": 3.0,
            "vascular_pattern": 3.5,
            "immune_organs": 3.0
        },
        "physiology": {
            "lifespan": 4.5,
            "reproductive_cycle": 5.0,
            "metabolic_rate": 4.0,
            "body_temperature": 3.0
        },
        "metabolism": {
            "drug_metabolism": 4.5,
            "cytochrome_p450": 4.0,
            "glucose_metabolism": 3.5,
            "lipid_metabolism": 3.5
        },
        "immune": {
            "innate_immunity": 3.5,
            "adaptive_immunity": 3.0,
            "cytokine_profile": 4.0,
            "microglia_function": 3.5
        },
        "genetics": {
            "gene_similarity": 2.5,
            "regulatory_elements": 4.0,
            "chromosome_structure": 3.0
        },
        "behavior": {
            "cognitive_assessment": 4.0,
            "social_behavior": 3.5,
            "emotional_response": 4.5
        }
    }
}


# disease-specific risk factors
DISEASE_SPECIFIC_FACTORS = {
    "alzheimer": {
        "mouse": {
            "pathology_differences": ["Lack of natural tau pathology", "Different Aβ deposition patterns", "Differences in neurodegeneration rates"],
            "genetic_risks": ["APOE4 model limited", "TREM2 functional differences"],
            "immune_factors": ["Microglial reactivity differs", "Differences in neuroinflammation patterns"],
            "drug_targets": ["The amyloid hypothesis has failed many times", "Tau pathology is difficult to replicate"]
        },
        "rat": {
            "pathology_differences": ["tau pathology limited", "Aβ clearance differences"],
            "genetic_risks": ["There are fewer APOE models"],
            "behavioral_differences": ["Limitations of cognitive assessment methods"]
        }
    },
    "parkinson": {
        "mouse": {
            "pathology_differences": ["Lack of natural Lewy bodies", "Dopamine system differences"],
            "genetic_risks": ["LRRK2 mutations have different effects", "Limitations of SNCA overexpression model"],
            "drug_targets": ["Dopamine replacement therapy model is effective but clinically variable"]
        },
        "rat": {
            "toxin_models": ["MPTP model is very different from sporadic PD", "6-OHDA model limitations"],
            "genetic_models": ["Genetic modeling progress is slow"]
        }
    },
    "cancer": {
        "mouse": {
            "immune_factors": ["Differences in the immune system affect the effectiveness of immunotherapy", "Different tumor microenvironments"],
            "metabolic_factors": ["Drug metabolism varies significantly", "Dose conversion is difficult"],
            "genetic_risks": ["Driver mutations are conserved but contextually diverse"]
        },
        "cell_line": {
            "model_limitations": ["lack of microenvironment", "2D vs 3D differences", "cell line evolution"]
        }
    },
    "diabetes": {
        "mouse": {
            "metabolic_factors": ["Differences in insulin resistance mechanisms", "Different fat distribution"],
            "immune_factors": ["Type 1 diabetes autoimmune differences", "Limitations of NOD mice"],
            "drug_targets": ["GLP-1 analogues have been relatively successful", "SGLT2 inhibitor differences"]
        },
        "rat": {
            "metabolic_factors": ["Differences between Zucker diabetic obese rats and human T2D"]
        }
    },
    "autoimmune": {
        "mouse": {
            "immune_factors": ["Differences in immune tolerance mechanisms", "MHC systems are fundamentally different"],
            "drug_targets": ["TNF inhibitors are relatively successful", "B cell targeting differences"],
            "pathology_differences": ["Differences between disease-induced models and spontaneous diseases"]
        }
    },
    "cardiovascular": {
        "mouse": {
            "anatomical_differences": ["Different distribution of coronary arteries", "Differences in myocardial regeneration ability"],
            "physiological_differences": ["Extremely fast heart rate", "Differences in EKG Interpretation"],
            "drug_targets": ["Statins are effective", "Differences in anticoagulant drug metabolism"]
        },
        "rat": {
            "advantages": ["Cardiovascular surgery models are more mature"],
            "differences": ["Still significantly different from clinical"]
        }
    }
}


def normalize_disease_name(disease: str) -> str:
    """standardized disease names"""
    disease_lower = disease.lower().replace("'", "").replace(" ", "_")
    disease_mapping = {
        "alzheimer": "alzheimer",
        "alzheimers": "alzheimer",
        "alzheimers_disease": "alzheimer",
        "parkinson": "parkinson",
        "parkinsons": "parkinson",
        "parkinsons_disease": "parkinson",
        "cancer": "cancer",
        "tumor": "cancer",
        "diabetes": "diabetes",
        "autoimmune": "autoimmune",
        "cardiovascular": "cardiovascular",
        "heart_disease": "cardiovascular"
    }
    return disease_mapping.get(disease_lower, disease_lower)


def get_disease_specific_risks(model: str, disease: str) -> List[str]:
    """Get disease-specific risks"""
    normalized_disease = normalize_disease_name(disease)
    risks = []
    
    if normalized_disease in DISEASE_SPECIFIC_FACTORS:
        disease_data = DISEASE_SPECIFIC_FACTORS[normalized_disease]
        if model in disease_data:
            model_data = disease_data[model]
            for category, items in model_data.items():
                if isinstance(items, list):
                    risks.extend(items)
    
    return risks


def calculate_dimension_score(model: str, dimension: str, focus_areas: List[str]) -> DimensionScore:
    """Calculate the score for a specific dimension"""
    if model not in TRANSLATIONAL_KNOWLEDGE:
        return DimensionScore(score=5.0, concerns=["Unknown model type"], details={})
    
    model_data = TRANSLATIONAL_KNOWLEDGE[model]
    
    if dimension not in model_data:
        return DimensionScore(score=5.0, concerns=["The dimension data is not available"], details={})
    
    dim_data = model_data[dimension]
    scores = list(dim_data.values())
    avg_score = sum(scores) / len(scores) if scores else 5.0
    
    # Identify key concerns
    concerns = []
    for key, value in dim_data.items():
        if value >= 7.0:
            concerns.append(f"{key}: significant difference (score: {value})")
        elif value >= 5.0:
            concerns.append(f"{key}: medium difference (score: {value})")
    
    # If this dimension is the focus, list more detailed concerns
    if dimension in focus_areas:
        concerns = [f"{k}: {v}" for k, v in dim_data.items() if v >= 5.0]
    
    return DimensionScore(
        score=round(avg_score, 1),
        concerns=concerns[:5],  # cap the list at five concerns
        details=dim_data
    )


def determine_risk_level(overall_score: float) -> str:
    """Determine risk level"""
    if overall_score >= 8.0:
        return RiskLevel.CRITICAL.value
    elif overall_score >= 6.5:
        return RiskLevel.HIGH.value
    elif overall_score >= 4.0:
        return RiskLevel.MEDIUM.value
    else:
        return RiskLevel.LOW.value


def generate_failure_predictors(model: str, disease: str, dimensions: Dict[str, DimensionScore]) -> List[str]:
    """Generate clinical trial failure predictors"""
    predictors = []
    
    # Identify risks based on dimension scores
    high_gap_dims = [dim for dim, score in dimensions.items() if score.score >= 7.0]
    
    if "immune" in high_gap_dims:
        predictors.append("Research on immune-related mechanisms may be at risk of translation failure")
    if "metabolism" in high_gap_dims:
        predictors.append("Pharmacokinetic differences may lead to inappropriate clinical dosing")
    if "anatomy" in high_gap_dims and model in ["mouse", "rat"]:
        predictors.append("Differences in organ structure may affect drug distribution")
    if "behavior" in high_gap_dims and "alzheimer" in disease.lower():
        predictors.append("Interspecies differences in cognitive assessment methods may mask true efficacy")
    if "genetics" in high_gap_dims:
        predictors.append("Differences in gene regulatory networks may affect target effectiveness")
    
    # Add disease-specific risks
    disease_risks = get_disease_specific_risks(model, disease)
    predictors.extend(disease_risks[:3])  # Up to 3
    
    return predictors if predictors else ["No specific high-risk factors have been identified, but careful assessment is still required"]


def generate_recommendations(model: str, disease: str, dimensions: Dict[str, DimensionScore]) -> List[str]:
    """Generate improvement suggestions"""
    recommendations = []
    
    # Recommendations based on model type
    if model == "mouse":
        if dimensions.get("immune", DimensionScore(0, [], {})).score >= 7.0:
            recommendations.append("Consider using mouse models with humanized immune systems")
        if dimensions.get("metabolism", DimensionScore(0, [], {})).score >= 7.0:
            recommendations.append("Validate drug metabolism in vitro with human hepatocytes")
        if dimensions.get("behavior", DimensionScore(0, [], {})).score >= 7.0:
            recommendations.append("Add behavioral validation in non-human primates")
        recommendations.append("Consider using genetically humanized mice to increase translational relevance")
    
    elif model == "rat":
        recommendations.append("Leverage the rat's richer behavioral repertoire for cognitive assessment")
        if dimensions.get("metabolism", DimensionScore(0, [], {})).score >= 6.0:
            recommendations.append("Perform metabolic studies in human liver microsomes")
    
    elif model == "zebrafish":
        recommendations.append("Use mainly for high-throughput screening and developmental toxicity testing")
        recommendations.append("Verify positive results in mammalian models")
        recommendations.append("Focus on cardiovascular and neurodevelopmental toxicity assessment")
    
    elif model == "cell_line":
        recommendations.append("Improve physiological relevance using 3D culture or organoid models")
        recommendations.append("Combine with co-culture systems to simulate the tumor microenvironment")
        recommendations.append("Regularly verify cell line identity and genetic stability")
    
    elif model == "organoid":
        recommendations.append("Focus on organoid maturity and functional assessment")
        recommendations.append("Consider vascularization techniques to improve nutrient supply")
        recommendations.append("Combine with single-cell sequencing to verify cellular heterogeneity")
    
    elif model == "primate":
        recommendations.append("Use as the final preclinical validation model")
        recommendations.append("Strictly follow the 3R principles and design experiments rationally")
    
    # Disease-specific recommendations
    normalized_disease = normalize_disease_name(disease)
    if normalized_disease == "alzheimer":
        recommendations.append("Focus on pathological mechanisms beyond Aβ and tau")
        recommendations.append("Pay attention to neuroimmune and microglia research")
    elif normalized_disease == "cancer":
        recommendations.append("Pay attention to tumor microenvironment and immunotherapy evaluation")
        recommendations.append("Consider patient-derived xenograft models (PDX)")
    elif normalized_disease == "autoimmune":
        recommendations.append("Focus on human-specific immune targets")
        recommendations.append("Consider using humanized immune system models")
    
    return recommendations


def analyze_gap(model: str, disease: str, focus_areas: Optional[List[str]] = None) -> GapAnalysisReport:
    """Perform a translational gap analysis"""
    if focus_areas is None:
        focus_areas = []
    
    dimensions_to_analyze = focus_areas if focus_areas else [
        "anatomy", "physiology", "metabolism", "immune", "genetics", "behavior"
    ]
    
    dimensions = {}
    total_score = 0
    
    for dim in dimensions_to_analyze:
        dim_score = calculate_dimension_score(model, dim, focus_areas)
        dimensions[dim] = dim_score
        total_score += dim_score.score
    
    overall_score = total_score / len(dimensions) if dimensions else 5.0
    risk_level = determine_risk_level(overall_score)
    
    failure_predictors = generate_failure_predictors(model, disease, dimensions)
    recommendations = generate_recommendations(model, disease, dimensions)
    
    # Convert to serializable format
    serializable_dimensions = {}
    for dim, score in dimensions.items():
        serializable_dimensions[dim] = {
            "score": score.score,
            "concerns": score.concerns,
            "details": score.details
        }
    
    return GapAnalysisReport(
        model=model,
        disease=disease,
        overall_gap_score=round(overall_score, 1),
        risk_level=risk_level,
        dimensions=serializable_dimensions,
        clinical_failure_predictors=failure_predictors,
        recommendations=recommendations
    )


def format_report(report: GapAnalysisReport, output_format: str = "json") -> str:
    """Format report output"""
    if output_format == "json":
        return json.dumps(asdict(report), indent=2, ensure_ascii=False)
    
    elif output_format == "markdown":
        md = f"""# Translational Gap Analysis Report

## Basic Information
- **Model**: {report.model}
- **Disease**: {report.disease}
- **Overall gap score**: {report.overall_gap_score}/10
- **Risk level**: {report.risk_level}

## Assessment by Dimension
"""
        for dim, data in report.dimensions.items():
            md += f"\n### {dim.capitalize()}\n"
            md += f"- **Score**: {data['score']}\n"
            if data['concerns']:
                # End the label line so each concern renders as a nested bullet
                md += "- **Points of concern**:\n"
                for concern in data['concerns']:
                    md += f"  - {concern}\n"
        
        # Headings need surrounding newlines, otherwise they fuse with the lists
        md += "\n## Predictors of Clinical Trial Failure\n"
        for predictor in report.clinical_failure_predictors:
            md += f"- ⚠️ {predictor}\n"
        
        md += "\n## Improvement Suggestions\n"
        for rec in report.recommendations:
            md += f"- 💡 {rec}\n"
        
        return md
    
    else:  # table format
        lines = []
        lines.append("=" * 70)
        lines.append(f"Translational Gap Analysis Report - {report.model} vs {report.disease}")
        lines.append("=" * 70)
        lines.append(f"Overall gap score: {report.overall_gap_score}/10 | Risk level: {report.risk_level}")
        lines.append("-" * 70)
        lines.append(f"{'Dimension':<15} {'Score':<10} {'Main concerns'}")
        lines.append("-" * 70)
        
        for dim, data in report.dimensions.items():
            concerns = "; ".join(data['concerns'][:2]) if data['concerns'] else "none"
            lines.append(f"{dim.capitalize():<15} {data['score']:<10} {concerns}")
        
        lines.append("-" * 70)
        lines.append("[Risk warning of clinical trial failure]")
        for i, predictor in enumerate(report.clinical_failure_predictors[:3], 1):
            lines.append(f"  {i}. {predictor}")
        
        lines.append("[Improvement suggestions]")
        for i, rec in enumerate(report.recommendations[:4], 1):
            lines.append(f"  {i}. {rec}")
        
        return "\n".join(lines)


def compare_models(models: List[str], disease: str, focus_areas: Optional[List[str]] = None) -> str:
    """Compare multiple models"""
    reports = []
    for model in models:
        report = analyze_gap(model, disease, focus_areas)
        reports.append(report)
    
    # Sorting: The lower the score, the better (the smaller the gap)
    reports.sort(key=lambda x: x.overall_gap_score)
    
    output = ["=" * 80]
    output.append(f"Multi-model comparative analysis: {disease}")
    output.append("=" * 80)
    output.append(f"{'Model':<15} {'Gap score':<12} {'Risk level':<12} {'Rank'}")
    output.append("-" * 80)
    
    for i, report in enumerate(reports, 1):
        output.append(f"{report.model:<15} {report.overall_gap_score:<12} {report.risk_level:<12} #{i}")
    
    output.append("\n" + "=" * 80)
    output.append("Detailed analysis")
    output.append("=" * 80)
    
    for report in reports:
        output.append(f"\n[{report.model.upper()}]")
        output.append(f"Risk level: {report.risk_level} | Key concerns: {', '.join(report.clinical_failure_predictors[:2])}")
        output.append(f"Main recommendation: {report.recommendations[0] if report.recommendations else 'N/A'}")
    
    return "\n".join(output)


def main():
    parser = argparse.ArgumentParser(
        description="Translational Gap Analyzer - Assessing the translational gap between basic research models and human disease",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""Example:
  python main.py --model mouse --disease "Alzheimer's" --full
  python main.py --model mouse --disease "Alzheimer's" --quick
  python main.py --models mouse,rat,primate --disease cancer --compare
  python main.py --model mouse --disease diabetes --focus metabolism,immune"""
    )
    
    parser.add_argument("--model", type=str, help="Model type (mouse, rat, zebrafish, cell_line, organoid, primate)")
    parser.add_argument("--models", type=str, help="Multi-model comparison mode, comma separated")
    parser.add_argument("--disease", type=str, required=True, help="Disease name")
    parser.add_argument("--focus", type=str, help="Areas of concern, separated by commas (anatomy, physiology, metabolism, immune, genetics, behavior)")
    parser.add_argument("--full", action="store_true", help="Generate full assessment report")
    parser.add_argument("--quick", action="store_true", help="Rapid risk assessment mode")
    parser.add_argument("--compare", action="store_true", help="Multi-model comparison mode")
    parser.add_argument("--output", type=str, help="Output file path")
    parser.add_argument("--format", type=str, choices=["json", "markdown", "table"], default="table", help="Output format")
    
    args = parser.parse_args()
    
    # Parse the comma-separated focus areas
    focus_areas = []
    if args.focus:
        focus_areas = [f.strip() for f in args.focus.split(",")]
    
    # Multi-model comparison mode
    if args.compare or args.models:
        if not args.models:
            print("Error: Comparison mode requires the --models argument", file=sys.stderr)
            sys.exit(1)
        
        models = [m.strip() for m in args.models.split(",")]
        result = compare_models(models, args.disease, focus_areas)
        
        if args.output:
            with open(args.output, "w", encoding="utf-8") as f:
                f.write(result)
            print(f"Comparison report saved to: {args.output}")
        else:
            print(result)
        return
    
    # Single model analysis mode
    if not args.model:
        print("Error: Single-model analysis requires --model argument", file=sys.stderr)
        sys.exit(1)
    
    # Perform analysis
    report = analyze_gap(args.model, args.disease, focus_areas)
    
    # Determine output format
    output_format = args.format
    if args.full:
        output_format = "markdown"
    elif args.quick:
        output_format = "table"
    
    result = format_report(report, output_format)
    
    # Output results
    if args.output:
        with open(args.output, "w", encoding="utf-8") as f:
            f.write(result)
        print(f"Report saved to: {args.output}")
    else:
        print(result)


if __name__ == "__main__":
    main()

Toxicity Structure Alert

Skill

Analyze data with `toxicity-structure-alert` using a reproducible workflow, explicit validation, and structured outputs for review-ready interpretation.

---
name: toxicity-structure-alert
description: Analyze data with `toxicity-structure-alert` using a reproducible workflow, explicit validation, and structured outputs for review-ready interpretation.
license: MIT
skill-author: AIPOCH
---
# Toxicity Structure Alert (Skill ID: 141)

Identify potential toxic structural alerts in drug molecules.

## When to Use

- Use this skill when the task is to identify potential toxic structural alerts in drug molecules by scanning their structures.
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

See `## Features` below for related details.

- Scope-focused workflow aligned to: Analyze data with `toxicity-structure-alert` using a reproducible workflow, explicit validation, and structured outputs for review-ready interpretation.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

- Python 3.8+
- RDKit

## Example Usage

See `## Usage` above for related details.

```bash
cd "20260318/scientific-skills/Data Analytics/toxicity-structure-alert"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```
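
The same parse check can be run from Python, which is convenient when a wrapper needs a boolean result rather than a shell exit status. This is a minimal sketch using the standard-library `py_compile` module; the helper name `smoke_check` is hypothetical and not part of the packaged skill.

```python
import py_compile
import sys
from pathlib import Path


def smoke_check(script_path: str) -> bool:
    """Return True if the script byte-compiles, False on a syntax error."""
    try:
        # doraise=True turns compile problems into PyCompileError
        py_compile.compile(script_path, doraise=True)
        return True
    except py_compile.PyCompileError as exc:
        print(f"Smoke check failed: {exc.msg}", file=sys.stderr)
        return False


if __name__ == "__main__":
    target = Path("scripts/main.py")
    if target.exists():
        print("parse OK" if smoke_check(str(target)) else "parse FAILED")
    else:
        print(f"{target} not found; run this from the skill directory")
```

Like the shell command, this only confirms that the file parses; it does not import RDKit or execute any scan.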

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
python scripts/main.py --input "C1CCCCC1" --format json
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Features

- Scan molecular structures (SMILES/SMARTS)
- Identify known toxic structural alerts
- Assess potential toxicity risk levels
- Generate detailed reports

## Supported Alert Structures

| Alert Structure | Toxicity Type | Risk Level |
|---------|---------|---------|
| Aromatic Nitro | Mutagenicity | High |
| Aromatic Amine | Carcinogenicity | High |
| Epoxide | Alkylating Agent | High |
| Aldehyde | Reactive Toxicity | Medium |
| Acyl Chloride | Reactive Toxicity | Medium |
| Michael Acceptor | Electrophilic Toxicity | Medium |
| Hydrazine | Hepatotoxicity | High |
| Haloalkyl | Alkylating Agent | High |
| Quinone | Oxidative Stress | Medium |
| Thiol-Reactive Groups | Protein Binding | Low-Medium |

## Usage

```text
python -m py_compile scripts/main.py

# Example invocation: python scripts/main.py --input <smiles_string> [--format json|text]
```

### Parameters

- `--input, -i`: Input SMILES string (required)
- `--format, -f`: Output format, optional `json` or `text` (default: text)
- `--detail, -d`: Detail level, optional `basic`, `standard`, `full` (default: standard)

### Examples

```bash
# Basic text output
python scripts/main.py -i "O=[N+]([O-])c1ccccc1"

# JSON format output
python scripts/main.py -i "O=C1OC1c1ccccc1" -f json

# Detailed report
python scripts/main.py -i "c1ccc2c(c1)ccc1c3ccccc3ccc21" -d full
```
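
One detail worth knowing when scripting these commands: the packaged `scripts/main.py` exits with status 1 when the overall risk level is `HIGH` and 0 otherwise, so a caller that treats every non-zero status as a crash would misread a successful HIGH-risk scan. The sketch below shows one way a wrapper might separate the two cases; `run_scan` and `interpret` are hypothetical helper names, not part of the skill.

```python
import json
import subprocess
import sys
from typing import List, Optional, Tuple


def run_scan(command: List[str]) -> Tuple[int, Optional[dict]]:
    """Run a scan command and return (exit_code, parsed_json_or_None)."""
    proc = subprocess.run(command, capture_output=True, text=True)
    try:
        report = json.loads(proc.stdout)
    except json.JSONDecodeError:
        report = None  # no JSON on stdout: likely a usage error or a crash
    return proc.returncode, report


def interpret(exit_code: int, report: Optional[dict]) -> str:
    """Separate 'HIGH risk found' (exit 1 plus a report) from real failures."""
    if not isinstance(report, dict):
        return "failed"
    if exit_code == 1 and report.get("risk_level") == "HIGH":
        return "high-risk"
    return "ok" if exit_code == 0 else "failed"


if __name__ == "__main__":
    cmd = [sys.executable, "scripts/main.py", "-i", "C1CCCCC1", "-f", "json"]
    code, report = run_scan(cmd)
    print(interpret(code, report))
```

Requesting `-f json` matters here, because the wrapper relies on parseable output to tell a completed scan from a failure.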

### Python API

```python
from scripts.main import ToxicityAlertScanner

scanner = ToxicityAlertScanner()
result = scanner.scan("O=[N+]([O-])c1ccccc1")
print(result.alerts)
```

## Output Format

### JSON Output

```json
{
  "input": "O=[N+]([O-])c1ccccc1",
  "mol_weight": 123.11,
  "alert_count": 1,
  "risk_score": 0.85,
  "risk_level": "HIGH",
  "alerts": [
    {
      "name": "Aromatic Nitro",
      "type": "mutagenic",
      "smarts": "[N+](=O)[O-]",
      "risk_level": "HIGH",
      "description": "May cause DNA damage and mutagenicity",
      "match_count": 1
    }
  ],
  "recommendations": [
    "Recommend Ames test validation",
    "Consider structural optimization to reduce toxicity"
  ]
}
```
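
A downstream consumer will usually want to pull specific fields out of this report. The following sketch assumes only the field names shown in the sample above; the helper names `alerts_at_level` and `needs_review`, and the 0.5 threshold, are illustrative choices rather than part of the skill.

```python
import json
from typing import List

# Trimmed copy of the sample report, keeping only the fields used here
SAMPLE_REPORT = """
{
  "input": "O=[N+]([O-])c1ccccc1",
  "alert_count": 1,
  "risk_score": 0.85,
  "risk_level": "HIGH",
  "alerts": [
    {"name": "Aromatic Nitro", "type": "mutagenic", "risk_level": "HIGH"}
  ],
  "recommendations": ["Recommend Ames test validation"]
}
"""


def alerts_at_level(report_json: str, level: str) -> List[str]:
    """Return the names of alerts whose risk_level equals `level`."""
    report = json.loads(report_json)
    return [a["name"] for a in report.get("alerts", [])
            if a.get("risk_level") == level]


def needs_review(report_json: str, threshold: float = 0.5) -> bool:
    """Flag a report when its risk_score reaches a caller-chosen threshold."""
    return json.loads(report_json).get("risk_score", 0.0) >= threshold


print(alerts_at_level(SAMPLE_REPORT, "HIGH"))  # ['Aromatic Nitro']
print(needs_review(SAMPLE_REPORT))             # True
```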

## Risk Levels

- **HIGH**: Known significant toxicity, strongly recommended to avoid
- **MEDIUM**: Potential toxicity, further evaluation recommended
- **LOW**: Minor concern, can be considered based on specific circumstances
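
In the packaged `scripts/main.py`, the overall score is the sum of each matched alert's weight times its match count, divided by 2.0 and capped at 1.0, and the overall level is the highest level among the matched alerts. The standalone sketch below reproduces that arithmetic for three alerts whose weights are copied from the script; the function name `overall_risk` and the tuple-based input are simplifications for illustration.

```python
from typing import List, Tuple

# (weight, level) pairs copied from TOXIC_ALERTS in scripts/main.py
ALERT_WEIGHTS = {
    "Epoxide": (1.0, "HIGH"),
    "Aldehyde group": (0.6, "MEDIUM"),
    "Thiol": (0.3, "LOW"),
}
LEVEL_ORDER = {"LOW": 1, "MEDIUM": 2, "HIGH": 3}


def overall_risk(matches: List[Tuple[str, int]]) -> Tuple[float, str]:
    """matches is a list of (alert_name, match_count) pairs."""
    if not matches:
        return 0.0, "NONE"
    total = 0.0
    level = "LOW"
    for name, count in matches:
        # Unknown names fall back to a low weight, as in the script
        weight, alert_level = ALERT_WEIGHTS.get(name, (0.3, "LOW"))
        total += weight * count
        if LEVEL_ORDER[alert_level] > LEVEL_ORDER[level]:
            level = alert_level
    return round(min(total / 2.0, 1.0), 2), level


print(overall_risk([("Aldehyde group", 1)]))                  # (0.3, 'MEDIUM')
print(overall_risk([("Epoxide", 1), ("Thiol", 2)]))           # (0.8, 'HIGH')
print(overall_risk([("Epoxide", 2), ("Aldehyde group", 1)]))  # (1.0, 'HIGH')
```

Because of the division by 2.0, two matches of a weight-1.0 alert already saturate the score, so the score mainly distinguishes molecules with one or two alerts.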

## Notes

1. This tool is based on known alert structures and cannot replace comprehensive toxicological assessment
2. False positives and false negatives may both exist
3. Recommended to use with other ADMET prediction tools

## References

- Ashby J., Tennant R.W. (1988) Chemical structure, Salmonella mutagenicity...
- Kazius J., McGuire R., Bursi R. (2005) Derivation and validation of toxicophores...
- Enoch S.J., Cronin M.T.D. (2010) A review of the electrophilic reaction chemistry...

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python script executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
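
The two path-related items in this checklist can be enforced with a small guard. Note that the packaged script takes a SMILES string on the command line rather than a file path, so the sketch below only becomes relevant if a wrapper adds file input or output; `resolve_inside` is a hypothetical helper name.

```python
from pathlib import Path


def resolve_inside(workspace: str, user_path: str) -> Path:
    """Resolve user_path against workspace; raise ValueError if it escapes."""
    root = Path(workspace).resolve()
    # Joining with an absolute user_path discards root, which the check below catches
    candidate = (root / user_path).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"path escapes workspace: {user_path!r}") from None
    return candidate
```

Resolving before comparing is the important step: it collapses `..` segments and symlinks, so a path such as `out/../../x` is rejected even though it starts inside the workspace.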

## Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.

## Input Validation

This skill accepts requests that match the documented purpose of `toxicity-structure-alert` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `toxicity-structure-alert` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.

## Inputs to Collect

- Required inputs: the user goal, the primary data or source file, and the requested output format.
- Optional inputs: output directory, formatting preferences, and validation constraints.
- If a required input is unavailable, return a short clarification request before continuing.

## Output Contract

- Return a short summary, the main deliverables, and any assumptions that materially affect interpretation.
- If execution is partial, label what succeeded, what failed, and the next safe recovery step.
- Keep the final answer within the documented scope of the skill.

## Validation and Safety Rules

- Validate identifiers, file paths, and user-provided parameters before execution.
- Do not fabricate results, metrics, citations, or downstream conclusions.
- Use safe fallback behavior when dependencies, credentials, or required inputs are missing.
- Surface any execution failure with a concise diagnosis and recovery path.
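
For this skill, the user-provided parameter that matters most is the SMILES string. A cheap pre-filter can reject obviously wrong input (free text, unbalanced brackets) before the script is run. The sketch below is an assumption-laden heuristic, not a SMILES parser: the allowed character set is deliberately permissive, the name `precheck_smiles` is hypothetical, and a real parse (for example with RDKit) remains the authoritative check.

```python
import re
from typing import List

# Permissive set of characters that can appear in a SMILES string
_ALLOWED = re.compile(r"^[A-Za-z0-9@+\-\[\]\(\)=#$:/\\.%*]+$")
_PAIRS = {")": "(", "]": "["}


def precheck_smiles(text: str, max_len: int = 2000) -> List[str]:
    """Return a list of problems; an empty list means 'worth handing to a parser'."""
    if not text or not text.strip():
        return ["input is empty"]
    problems = []
    if len(text) > max_len:
        problems.append(f"input longer than {max_len} characters")
    if not _ALLOWED.match(text):
        problems.append("input contains characters not used in SMILES")
    stack = []
    for ch in text:
        if ch in "([":
            stack.append(ch)
        elif ch in _PAIRS:
            if not stack or stack.pop() != _PAIRS[ch]:
                problems.append("unbalanced brackets or parentheses")
                break
    else:
        # Loop finished without a mismatch; anything left open is unbalanced
        if stack:
            problems.append("unbalanced brackets or parentheses")
    return problems
```

An empty result does not mean the SMILES is chemically valid, only that it is not obviously something else.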

FILE:references/runtime_checklist.md
# Runtime Checklist

- Category: `Data Analysis`
- Validate the user goal, required inputs, and output format before taking action.
- Ask a targeted clarification question when a required input is missing.
- Keep the response scoped to the documented workflow and state assumptions explicitly.
- Run a non-destructive smoke check before any file-dependent or data-dependent command.
- Recommended smoke check: `python -m py_compile scripts/main.py`
- If execution fails, stop and return a concise recovery path instead of fabricating results.

FILE:requirements.txt
rdkit

FILE:scripts/main.py
#!/usr/bin/env python3
"""Toxicity Structure Alert (Skill ID: 141)
Scan the drug molecular structure to identify potential toxicity warning structures.

Usage:
    python main.py --input <smiles> [--format json|text] [--detail level]"""

import argparse
import json
import sys
from dataclasses import dataclass, asdict
from typing import List, Dict, Optional, Tuple
from enum import Enum


class RiskLevel(Enum):
    """risk level"""
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"


@dataclass
class ToxicAlert:
    """Toxicity warning structure data class"""
    name: str
    name_en: str
    type: str
    smarts: str
    risk_level: RiskLevel
    description: str
    weight: float = 1.0


@dataclass
class AlertMatch:
    """Matched alerts"""
    name: str
    type: str
    smarts: str
    risk_level: str
    description: str
    match_count: int = 1


@dataclass
class ScanResult:
    """Scan results"""
    input_smiles: str
    mol_weight: Optional[float]
    alert_count: int
    risk_score: float
    risk_level: str
    alerts: List[AlertMatch]
    recommendations: List[str]


class ToxicityAlertScanner:
    """Toxicity Alert Structural Scanner"""
    
    # Predefined toxicity warning structures
    TOXIC_ALERTS = [
        # High toxicity warning
        ToxicAlert(
            name="aromatic nitro",
            name_en="Aromatic Nitro",
            type="mutagenic",
            smarts="[N+](=O)[O-]c",
            risk_level=RiskLevel.HIGH,
            description="Nitroaromatic compounds may cause DNA damage, are mutagenic and potentially carcinogenic",
            weight=1.0
        ),
        ToxicAlert(
            name="Aromatic primary amines",
            name_en="Aromatic Primary Amine",
            type="carcinogenic",
            smarts="[NX3;H2]c1ccccc1",  # primary amine N only; a bare N would also match nitro and amide nitrogens
            risk_level=RiskLevel.HIGH,
            description="Aromatic amines can be metabolically activated to produce electrophilic substances and form adducts with DNA.",
            weight=0.9
        ),
        ToxicAlert(
            name="Epoxide",
            name_en="Epoxide",
            type="alkylating",
            smarts="C1OC1",
            risk_level=RiskLevel.HIGH,
            description="Epoxides are highly reactive three-membered cyclic ethers that can act as alkylating agents to damage DNA.",
            weight=1.0
        ),
        ToxicAlert(
            name="aziridine",
            name_en="Aziridine",
            type="alkylating",
            smarts="C1NC1",
            risk_level=RiskLevel.HIGH,
            description="The aziridine ring is highly reactive and can be used as an alkylating agent",
            weight=1.0
        ),
        ToxicAlert(
            name="Hydrazine/Hydrazine",
            name_en="Hydrazine",
            type="hepatotoxic",
            smarts="[NX3][NX3]",
            risk_level=RiskLevel.HIGH,
            description="Hydrazines are hepatotoxic and may cause liver damage",
            weight=0.9
        ),
        ToxicAlert(
            name="Haloalkyl",
            name_en="Haloalkyl",
            type="alkylating",
            smarts="[C][F,Cl,Br,I]",
            risk_level=RiskLevel.HIGH,
            description="Haloalkyl groups can serve as leaving groups, forming electrophilic centers leading to alkylation",
            weight=0.85
        ),
        ToxicAlert(
            name="polycyclic aromatic hydrocarbons",
            name_en="Polycyclic Aromatic Hydrocarbon",
            type="carcinogenic",
            smarts="c1ccc2c(c1)ccc1c3ccccc3ccc21",
            risk_level=RiskLevel.HIGH,
            description="Polycyclic aromatic hydrocarbons can be metabolically activated to form carcinogenic epoxides",
            weight=0.95
        ),
        
        # Moderate toxicity warning
        ToxicAlert(
            name="Aldehyde group",
            name_en="Aldehyde",
            type="reactive",
            smarts="[CX3H1](=O)",
            risk_level=RiskLevel.MEDIUM,
            description="The aldehyde group is electrophilic and can form Schiff base with protein",
            weight=0.6
        ),
        ToxicAlert(
            name="acid chloride",
            name_en="Acyl Chloride",
            type="reactive",
            smarts="C(=O)Cl",
            risk_level=RiskLevel.MEDIUM,
            description="Acid chlorides are highly reactive and can undergo acylation reactions with nucleophilic groups",
            weight=0.7
        ),
        ToxicAlert(
            name="Michael receptor",
            name_en="Michael Acceptor",
            type="electrophilic",
            smarts="C=CC(=O)",
            risk_level=RiskLevel.MEDIUM,
            description="α,β-unsaturated carbonyl group can act as a Michael acceptor to covalently bind to sulfhydryl group",
            weight=0.65
        ),
        ToxicAlert(
            name="Quinones",
            name_en="Quinone",
            type="oxidative",
            smarts="O=C1C=CC(=O)C=C1",
            risk_level=RiskLevel.MEDIUM,
            description="Quinones can produce reactive oxygen species through redox cycles, causing oxidative stress",
            weight=0.7
        ),
        ToxicAlert(
            name="Nitroso",
            name_en="Nitroso",
            type="carcinogenic",
            smarts="[NX2]=O",  # two-connected N only, so nitro groups are not also flagged as nitroso
            risk_level=RiskLevel.MEDIUM,
            description="N-nitroso compounds are potentially carcinogenic",
            weight=0.75
        ),
        ToxicAlert(
            name="thioester",
            name_en="Thioester",
            type="reactive",
            smarts="C(=O)S",
            risk_level=RiskLevel.MEDIUM,
            description="Thioesters can undergo exchange reactions with sulfhydryl groups",
            weight=0.55
        ),
        
        # Low toxicity warning
        ToxicAlert(
            name="Thiol",
            name_en="Thiol",
            type="reactive",
            smarts="[SX2H]",
            risk_level=RiskLevel.LOW,
            description="Thiols are reactive and can participate in redox and metal chelation",
            weight=0.3
        ),
        ToxicAlert(
            name="Sulfonyl chloride",
            name_en="Sulfonyl Chloride",
            type="reactive",
            smarts="S(=O)(=O)Cl",
            risk_level=RiskLevel.LOW,
            description="Sulfonyl chloride reacts with nucleophiles",
            weight=0.4
        ),
    ]
    
    def __init__(self):
        self.alerts = self.TOXIC_ALERTS
        self._compile_patterns()
    
    def _compile_patterns(self):
        """Compile the SMARTS patterns with RDKit, if it is installed"""
        try:
            from rdkit import Chem
            self.has_rdkit = True
            self._compiled_patterns = {}
            for alert in self.alerts:
                try:
                    pattern = Chem.MolFromSmarts(alert.smarts)
                    if pattern:
                        self._compiled_patterns[alert.name] = pattern
                except Exception:
                    pass
        except ImportError:
            self.has_rdkit = False
            print("Warning: RDKit not available. Falling back to basic pattern matching.", file=sys.stderr)
    
    def _parse_smiles(self, smiles: str) -> Tuple[Optional['Chem.Mol'], Optional[float]]:
        """Parse SMILES string"""
        if not self.has_rdkit:
            return None, None
        
        from rdkit import Chem
        from rdkit.Chem import Descriptors
        
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None, None
        
        mol_weight = Descriptors.MolWt(mol)
        return mol, mol_weight
    
    def _match_alerts(self, mol) -> List[AlertMatch]:
        """Match toxicity warning structure"""
        matches = []
        
        if self.has_rdkit and mol is not None:
            # Exact substructure matching using RDKit
            from rdkit import Chem
            for alert in self.alerts:
                pattern = self._compiled_patterns.get(alert.name)
                if pattern:
                    try:
                        match_list = mol.GetSubstructMatches(pattern)
                        if match_list:
                            matches.append(AlertMatch(
                                name=alert.name,
                                type=alert.type,
                                smarts=alert.smarts,
                                risk_level=alert.risk_level.value,
                                description=alert.description,
                                match_count=len(match_list)
                            ))
                    except Exception:
                        pass
        else:
            # Degraded mode: Use simple string matching
            smiles_upper = self._current_smiles.upper()
            matches = self._fallback_string_match(smiles_upper)
        
        return matches
    
    def _fallback_string_match(self, smiles: str) -> List[AlertMatch]:
        """Degraded mode: crude substring matching on the upper-cased SMILES string"""
        matches = []
        
        # Define a simplified pattern string
        fallback_patterns = {
            "aromatic nitro": ("O=[N+]([O-])", "NO2"),
            "Aromatic primary amines": ("NC", "NH2"),
            "Epoxide": ("C1OC1",),
            "aziridine": ("C1NC1",),
            "Hydrazine/Hydrazine": ("NN",),
            "Haloalkyl": ("CL", "BR", "F", "I"),
            "polycyclic aromatic hydrocarbons": ("C=CC=CC=CC=CC",),
            "Aldehyde group": ("C=O", "CHO"),
            "acid chloride": ("C(=O)CL", "COCL"),
            "Michael receptor": ("C=CC=O",),
            "Quinones": ("O=C1C=CC(=O)",),
            "Nitroso": ("N=O", "N-O"),
            "Thiol": ("SH",),
        }
        
        alert_map = {a.name: a for a in self.alerts}
        matched_names = set()
        
        for alert_name, patterns in fallback_patterns.items():
            if alert_name in matched_names:
                continue
            for pattern in patterns:
                if pattern in smiles:
                    alert = alert_map.get(alert_name)
                    if alert:
                        matches.append(AlertMatch(
                            name=alert.name,
                            type=alert.type,
                            smarts=alert.smarts,
                            risk_level=alert.risk_level.value,
                            description=alert.description,
                            match_count=smiles.count(pattern)
                        ))
                        matched_names.add(alert_name)
                        break
        
        return matches
    
    def _calculate_risk_score(self, matches: List[AlertMatch]) -> Tuple[float, str]:
        """Calculate risk scores and levels"""
        if not matches:
            return 0.0, "NONE"
        
        # Get the weight of matching alerts
        alert_weights = {a.name: (a.weight, a.risk_level) for a in self.alerts}
        
        total_weight = 0.0
        max_risk = RiskLevel.LOW
        
        for match in matches:
            weight, risk = alert_weights.get(match.name, (0.3, RiskLevel.LOW))
            total_weight += weight * match.match_count
            
            if risk == RiskLevel.HIGH:
                max_risk = RiskLevel.HIGH
            elif risk == RiskLevel.MEDIUM and max_risk != RiskLevel.HIGH:
                max_risk = RiskLevel.MEDIUM
        
        # Calculate risk score (0-1)
        risk_score = min(total_weight / 2.0, 1.0)
        
        return round(risk_score, 2), max_risk.value
    
    def _generate_recommendations(self, matches: List[AlertMatch], risk_level: str) -> List[str]:
        """Generate suggestions"""
        recommendations = []
        
        if risk_level == "HIGH":
            recommendations.append("⚠️ High-risk toxic structures are detected, and it is strongly recommended to conduct Ames test verification")
            recommendations.append("⚠️ Consider structural optimization to reduce potential toxicity risks")
        elif risk_level == "MEDIUM":
            recommendations.append("⚡ Medium risk structure detected, in vitro toxicity screening recommended")
            recommendations.append("⚡ Evaluate the impact of structural modifications on activity and toxicity")
        
        # Make recommendations based on specific alert types
        alert_types = set(m.type for m in matches)
        
        if "mutagenic" in alert_types or "carcinogenic" in alert_types:
            recommendations.append("🧬 It is recommended to conduct mutagenicity assessment (such as Ames, chromosomal aberration test)")
        if "hepatotoxic" in alert_types:
            recommendations.append("🫀 It is recommended to assess the risk of hepatotoxicity")
        if "alkylating" in alert_types:
            recommendations.append("⚗️ Alkylating agents are highly reactive and special attention needs to be paid to off-target effects")
        if "reactive" in alert_types:
            recommendations.append("🔄 Reactive groups may affect metabolic stability, it is recommended to evaluate plasma stability")
        
        if not recommendations:
            recommendations.append("✅ No significant toxicity warning structures were detected, but standard safety assessment is still recommended")
        
        return recommendations
    
    def scan(self, smiles: str) -> ScanResult:
        """Scan the SMILES string to identify toxicity warning structures
        
        Args:
            smiles: input SMILES string
            
        Returns:
            ScanResult: scan result"""
        self._current_smiles = smiles  # Save for downgrade mode
        mol, mol_weight = self._parse_smiles(smiles)
        matches = self._match_alerts(mol)
        risk_score, risk_level = self._calculate_risk_score(matches)
        recommendations = self._generate_recommendations(matches, risk_level)
        
        return ScanResult(
            input_smiles=smiles,
            mol_weight=round(mol_weight, 2) if mol_weight else None,
            alert_count=len(matches),
            risk_score=risk_score,
            risk_level=risk_level,
            alerts=matches,
            recommendations=recommendations
        )
    
    def format_text_output(self, result: ScanResult, detail: str = "standard") -> str:
        """Formatted text output"""
        lines = []
        lines.append("=" * 60)
        lines.append("Toxic Structure Alert Scan Report")
        lines.append("=" * 60)
        lines.append("")
        lines.append(f"Input SMILES: {result.input_smiles}")
        if result.mol_weight:
            lines.append(f"Molecular weight: {result.mol_weight} Da")
        lines.append("")
        
        # Risk level display
        risk_emojis = {"HIGH": "🔴", "MEDIUM": "🟡", "LOW": "🟢", "NONE": "✅"}
        risk_emoji = risk_emojis.get(result.risk_level, "⚪")
        lines.append(f"Risk level: {risk_emoji} {result.risk_level}")
        lines.append(f"Risk score: {result.risk_score:.2f} / 1.0")
        lines.append(f"Number of alerts: {result.alert_count}")
        lines.append("")
        
        if result.alerts:
            lines.append("-" * 60)
            lines.append("Detected warning structures:")
            lines.append("-" * 60)
            for i, alert in enumerate(result.alerts, 1):
                lines.append(f"\n{i}. {alert.name}")
                lines.append(f"   Type: {alert.type}")
                lines.append(f"   Risk: {alert.risk_level}")
                if detail in ("standard", "full"):
                    lines.append(f"   SMARTS: {alert.smarts}")
                if detail == "full":
                    lines.append(f"   Description: {alert.description}")
                if alert.match_count > 1:
                    lines.append(f"   Number of matches: {alert.match_count}")
        
        lines.append("")
        lines.append("-" * 60)
        lines.append("Recommendations:")
        lines.append("-" * 60)
        for rec in result.recommendations:
            lines.append(f"  {rec}")
        
        lines.append("")
        lines.append("=" * 60)
        lines.append("NOTE: This tool is based on known alert structures and is not a substitute for a comprehensive toxicological assessment")
        lines.append("=" * 60)
        
        return "\n".join(lines)
    
    def format_json_output(self, result: ScanResult) -> str:
        """Formatted JSON output"""
        data = {
            "input": result.input_smiles,
            "mol_weight": result.mol_weight,
            "alert_count": result.alert_count,
            "risk_score": result.risk_score,
            "risk_level": result.risk_level,
            "alerts": [
                {
                    "name": a.name,
                    "type": a.type,
                    "smarts": a.smarts,
                    "risk_level": a.risk_level,
                    "description": a.description,
                    "match_count": a.match_count
                }
                for a in result.alerts
            ],
            "recommendations": result.recommendations
        }
        return json.dumps(data, ensure_ascii=False, indent=2)


def main():
    """Parse command-line arguments, run the scan, and print the report"""
    parser = argparse.ArgumentParser(
        description="Toxicity Structure Alert - scans drug molecular structure for toxicity alerts",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""Example:
  python main.py -i "O=[N+]([O-])c1ccccc1"
  python main.py -i "C1CCCCC1" -f json
  python main.py -i "c1ccc2c(c1)ccc1c3ccccc3ccc21" -d full"""
    )
    
    parser.add_argument(
        "--input", "-i",
        required=True,
        help="Input SMILES string"
    )
    
    parser.add_argument(
        "--format", "-f",
        choices=["json", "text"],
        default="text",
        help="Output format (default: text)"
    )
    
    parser.add_argument(
        "--detail", "-d",
        choices=["basic", "standard", "full"],
        default="standard",
        help="Verbosity (default: standard)"
    )
    
    args = parser.parse_args()
    
    # Create a scanner and perform scans
    scanner = ToxicityAlertScanner()
    result = scanner.scan(args.input)
    
    # Output results
    if args.format == "json":
        print(scanner.format_json_output(result))
    else:
        print(scanner.format_text_output(result, args.detail))
    
    # Return a non-zero exit code if high risk is detected
    if result.risk_level == "HIGH":
        sys.exit(1)
    else:
        sys.exit(0)


if __name__ == "__main__":
    main()

Tone Adjuster

Skill

Use when converting medical text between academic and patient-friendly tones, translating medical jargon for patients, adapting research papers for public au...

---
name: tone-adjuster
description: Use when converting medical text between academic and patient-friendly tones, translating medical jargon for patients, adapting research papers for public audiences, or rewriting clinical notes for patient handouts. Maintains medical accuracy while adjusting readability level.
license: MIT
skill-author: AIPOCH
---
# Medical Tone Adjuster

Convert medical text between academic rigor and patient-friendly language while preserving clinical accuracy.

## When to Use

- Use this skill when converting medical text between academic and patient-friendly tones, translating medical jargon for patients, adapting research papers for public audiences, or rewriting clinical notes for patient handouts, while maintaining medical accuracy and adjusting the readability level.
- Use this skill for academic writing tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

- Scope-focused workflow aligned to: converting medical text between academic and patient-friendly tones, translating medical jargon for patients, adapting research papers for public audiences, and rewriting clinical notes for patient handouts, while maintaining medical accuracy and adjusting the readability level.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `Third-party packages`: `not explicitly version-pinned in this skill package`. Add pinned versions if this skill needs stricter environment control.

## Example Usage

```bash
cd "20260318/scientific-skills/Academic Writing/tone-adjuster"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py demo
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Quick Start

```python
from scripts.main import ToneAdjuster

adjuster = ToneAdjuster()

# Academic → Patient-friendly
# adjust() returns a dict; the rewritten text is under "converted_text"
patient_text = adjuster.adjust(
    text="The patient presents with acute myocardial infarction...",
    target_tone="patient_friendly"
)["converted_text"]

# Patient-friendly → Academic
academic_text = adjuster.adjust(
    text="I had a heart attack...",
    target_tone="academic"
)["converted_text"]
```

## Core Capabilities

### 1. Academic to Patient-Friendly

```python
adjuster = ToneAdjuster()
result = adjuster.to_patient_friendly(
    "The patient exhibits tachycardia with irregular rhythm "
    "consistent with atrial fibrillation",
    reading_level="8th_grade"
)
```

**Conversion Rules:**
- Replace medical terms with common equivalents
- Shorten sentence length (aim for <15 words)
- Use active voice
- Remove unnecessary qualifiers

**Examples:**

| Academic | Patient-Friendly |
|----------|------------------|
| Myocardial infarction | Heart attack |
| Tachycardia | Fast heartbeat |
| Hypertension | High blood pressure |
| Benign prostatic hyperplasia | Enlarged prostate (non-cancerous) |
| Idiopathic | Unknown cause |

### 2. Patient-Friendly to Academic

```python
result = adjuster.to_academic(
    "My stomach hurts after eating spicy food",
    add_citations=True
)

# Output: "The patient reports postprandial abdominal pain
#          exacerbated by capsaicin-containing foods"
```

### 3. Reading Level Assessment

```python
metrics = adjuster.assess_reading_level(text)
print(f"Grade level: {metrics.grade_level}")
print(f"Medical terms: {metrics.jargon_count}")
print(f"Recommendations: {metrics.suggestions}")
```

**Reading Levels:**
- **5th-6th Grade**: Young patients, general public
- **8th Grade**: Most adult patients
- **12th Grade**: Educated lay audiences
- **College**: Healthcare professionals
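
The grade bands above can be estimated with the Flesch-Kincaid grade formula listed in the references. The following is a minimal sketch, not the packaged implementation: `count_syllables` and `flesch_kincaid_grade` are hypothetical helper names, and the syllable counter is a rough vowel-group heuristic, so scores are approximate.

```python
import re


def count_syllables(word: str) -> int:
    """Rough estimate: count vowel groups and drop a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(1, count)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


plain = "You had a heart attack. Your heart needs rest."
clinical = ("The patient presents with acute myocardial infarction "
            "complicated by ventricular tachycardia.")
print(round(flesch_kincaid_grade(plain), 1), round(flesch_kincaid_grade(clinical), 1))
```

Short sentences with short words can score below zero; treat the number as a relative signal when comparing two drafts rather than as an exact school grade.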

### 4. Jargon Translation

```python
translations = adjuster.translate_jargon(
    text="Patient presents with dyspnea and orthopnea...",
    show_alternatives=True
)
```

**Common Medical Terms Dictionary:**

```json
{
  "dyspnea": {
    "patient_friendly": "shortness of breath",
    "explanation": "feeling like you can't get enough air"
  },
  "orthopnea": {
    "patient_friendly": "trouble breathing when lying down",
    "explanation": "need to prop up with pillows to breathe"
  }
}
```
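
A dictionary in this shape can drive a simple term-by-term rewrite. The sketch below assumes the JSON has already been loaded into a Python dict; `TERMS` and `annotate_jargon` are hypothetical names, not part of the packaged script. It keeps the original term in parentheses on first use and matches whole words only.

```python
import re

# Assumed in-memory copy of the dictionary shown above.
TERMS = {
    "dyspnea": {
        "patient_friendly": "shortness of breath",
        "explanation": "feeling like you can't get enough air",
    },
    "orthopnea": {
        "patient_friendly": "trouble breathing when lying down",
        "explanation": "need to prop up with pillows to breathe",
    },
}


def annotate_jargon(text: str, terms: dict) -> str:
    """Swap each known term for its plain equivalent; show the term once in parentheses."""
    if not terms:
        return text
    seen = set()

    def swap(match: re.Match) -> str:
        term = match.group(0)
        key = term.lower()
        plain = terms[key]["patient_friendly"]
        if key in seen:
            return plain
        seen.add(key)
        return f"{plain} ({term})"

    # Whole-word, case-insensitive match over all dictionary keys.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    return pattern.sub(swap, text)


print(annotate_jargon(
    "Patient presents with dyspnea and orthopnea; the dyspnea worsens at night.",
    TERMS,
))
```

The sketch does not restore capitalization when a term starts a sentence; a production version would need to handle that case.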

## CLI Usage

```bash
# Convert file
python scripts/tone_adjuster.py \
  --input clinical_note.txt \
  --direction academic-to-patient \
  --output patient_handout.txt

# Assess reading level
python scripts/tone_adjuster.py \
  --assess readme.txt \
  --target-grade 8
```

## Best Practices

**When Converting to Patient-Friendly:**
- ✅ Use "you" and "your" when appropriate
- ✅ Define terms in parentheses on first use
- ✅ Use analogies for complex concepts
- ✅ Keep paragraphs to 2-3 sentences

**When Converting to Academic:**
- ✅ Use precise medical terminology
- ✅ Include anatomical locations
- ✅ Specify temporal relationships
- ✅ Add objective measurements

## Common Pitfalls

❌ **Don't**: "Your heart has a problem"
✅ **Do**: "Your heart muscle shows signs of reduced blood flow"

❌ **Don't**: "The medicine might make you feel bad"
✅ **Do**: "This medication may cause nausea, dizziness, or fatigue"

## Quality Checklist

- [ ] Medical accuracy preserved
- [ ] No critical information lost
- [ ] Appropriate reading level achieved
- [ ] Tone matches intended audience
- [ ] All medical terms explained or translated

---

**Skill ID**: 202 | **Version**: 1.0 | **License**: MIT

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.

## Input Validation

This skill accepts requests that match the documented purpose of `tone-adjuster` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `tone-adjuster` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.

FILE:references/guidelines.md
# Tone Adjuster - References

## Health Literacy
- NIH Plain Language Guidelines
- AHRQ Health Literacy Resources
- AMA Health Literacy Toolkit

## Readability
- Flesch-Kincaid Readability Tests
- Gunning Fog Index
- SMOG Readability Formula

FILE:scripts/main.py
#!/usr/bin/env python3
"""Tone Adjuster - Bidirectional medical text tone converter."""

import json
import re
from typing import Dict, List

class ToneAdjuster:
    """Adjusts tone of medical text."""
    
    JARGON_DICTIONARY = {
        "myocardial infarction": "heart attack",
        "cerebrovascular accident": "stroke",
        "hypertension": "high blood pressure",
        "hyperlipidemia": "high cholesterol",
        "malignancy": "cancer",
        "benign": "not cancerous",
        "idiopathic": "unknown cause",
        "iatrogenic": "caused by medical treatment",
        "acute": "sudden/severe",
        "chronic": "long-term",
        "exacerbation": "flare-up",
        "remission": "symptoms improved",
        "etiology": "cause",
        "pathogenesis": "how disease develops",
        "morbidity": "illness/complications",
        "mortality": "death rate"
    }
    
    def adjust(self, text: str, target_tone: str, level: str = "moderate") -> Dict:
        """Adjust text tone."""
        
        original_tone = self._assess_tone(text)
        
        # Accept both "patient_friendly" and "patient-friendly" spellings
        if target_tone.replace("-", "_") == "patient_friendly":
            converted = self._to_patient_friendly(text, level)
        elif target_tone == "academic":
            converted = self._to_academic(text)
        else:
            converted = text
        
        changes = self._identify_changes(text, converted)
        
        return {
            "converted_text": converted,
            "original_tone": original_tone,
            "target_tone": target_tone,
            "adjustment_level": level,
            "readability_score": self._calc_readability(converted),
            "changes_made": changes
        }
    
    def _to_patient_friendly(self, text: str, level: str) -> str:
        """Convert to patient-friendly language."""
        result = text
        
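        # Replace jargon, matching whole words only so that a short term
        # such as "acute" is not rewritten inside a longer word ("subacute")
        for medical, plain in self.JARGON_DICTIONARY.items():
            pattern = re.compile(r'\b' + re.escape(medical) + r'\b', re.IGNORECASE)
            result = pattern.sub(plain, result)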
        
        # Simplify sentence structure
        result = result.replace("utilize", "use")
        result = result.replace("demonstrate", "show")
        result = result.replace("indicate", "suggest")
        result = result.replace("approximately", "about")
        
        return result
    
    def _to_academic(self, text: str) -> str:
        """Convert to academic tone."""
        # Reverse the dictionary
        reverse_dict = {v: k for k, v in self.JARGON_DICTIONARY.items()}
        
        result = text
        for plain, medical in reverse_dict.items():
            pattern = re.compile(r'\b' + re.escape(plain) + r'\b', re.IGNORECASE)
            result = pattern.sub(medical, result)
        
        return result
    
    def _assess_tone(self, text: str) -> str:
        """Assess current tone."""
        jargon_count = sum(1 for term in self.JARGON_DICTIONARY if term.lower() in text.lower())
        
        if jargon_count > 3:
            return "academic"
        elif jargon_count > 0:
            return "mixed"
        return "patient-friendly"
    
    def _identify_changes(self, original: str, converted: str) -> List[str]:
        """Identify changes made."""
        changes = []
        
        for medical, plain in self.JARGON_DICTIONARY.items():
            if medical.lower() in original.lower() and plain.lower() in converted.lower():
                changes.append(f"'{medical}' → '{plain}'")
        
        return changes[:5]  # Limit to 5 changes
    
    def _calc_readability(self, text: str) -> float:
        """Simple readability estimate (Flesch-inspired)."""
        words = len(text.split())
        sentences = max(1, text.count('.') + text.count('!') + text.count('?'))
        avg_words_per_sentence = words / sentences
        
        # Simple score in [0, 100]: higher is more readable (shorter sentences)
        score = max(0, min(100, 100 - (avg_words_per_sentence - 10) * 5))
        return round(score, 1)

def main():
    import sys
    adjuster = ToneAdjuster()
    
    text = sys.argv[1] if len(sys.argv) > 1 else "The patient suffered an acute myocardial infarction."
    tone = sys.argv[2] if len(sys.argv) > 2 else "patient_friendly"
    
    result = adjuster.adjust(text, tone)
    print(json.dumps(result, indent=2))

if __name__ == "__main__":
    main()


Time Zone Planner

Skill

Plan cross-time-zone meeting windows for distributed teams, providing region-by-region local time mappings and tradeoff analysis for scheduling decisions.

---
name: time-zone-planner
description: Plan cross-time-zone meeting windows for distributed teams, providing region-by-region local time mappings and tradeoff analysis for scheduling decisions.
license: MIT
skill-author: AIPOCH
---
# Time Zone Planner

Structured cross-time-zone meeting planning for distributed teams.

## Quick Check

```bash
python -m py_compile scripts/main.py
python scripts/main.py
```

## When to Use

- Use this skill when planning meeting windows across multiple regions or time zones.
- Use this skill when comparing candidate windows and tradeoffs for distributed team scheduling.
- Do not use this skill for calendar booking, live availability checking, or travel/legal decisions.

## Workflow

1. Confirm the participant regions, meeting duration, preferred local-hour ranges, and any hard constraints.
2. Check whether the request is for a quick overlap recommendation or a full tradeoff analysis.
3. Use the packaged script for baseline scheduling output; for complex requests, provide a manual comparison table with stated assumptions.
4. Return suggested meeting windows, region-by-region local times, and the tradeoffs behind the recommendation.
5. If timezone details or availability constraints are missing, stop and request the minimum missing fields.

## Usage

```text
python scripts/main.py
# Input: {"regions": ["US", "EU", "Asia"], "duration": 60}
```

## Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `regions` | list[string] | Yes | — | Region set, e.g. `["US/Eastern", "Europe/London", "Asia/Shanghai"]` |
| `duration` | integer | Yes | — | Meeting duration in minutes |
| `preferred_hours` | object | No | — | Per-region preferred local hour ranges, e.g. `{"US/Eastern": [9, 17]}` |

**Region format:** Use IANA timezone names (e.g., `US/Eastern`, `Europe/London`, `Asia/Shanghai`) for precise mapping. Short aliases like `"US"` or `"EU"` are accepted but will be mapped to a representative timezone with a note.

## Region Alias Mapping

| Alias | Mapped To | Note |
|-------|-----------|------|
| `US` | `US/Eastern` | Representative only; specify sub-region for accuracy |
| `EU` | `Europe/London` | Representative only; specify country for accuracy |
| `Asia` | `Asia/Shanghai` | Representative only; specify city for accuracy |
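
The alias table can be applied as a plain lookup that also returns the note to surface in the output. This is a sketch under the assumption that aliases are expanded before any scheduling logic runs; `REGION_ALIASES` and `resolve_region` are hypothetical names and are not taken from `scripts/main.py`.

```python
from typing import Optional, Tuple

# Mirrors the alias table above.
REGION_ALIASES = {
    "US": "US/Eastern",
    "EU": "Europe/London",
    "Asia": "Asia/Shanghai",
}


def resolve_region(region: str) -> Tuple[str, Optional[str]]:
    """Return (timezone_name, note); note is None when no alias was expanded."""
    if region in REGION_ALIASES:
        mapped = REGION_ALIASES[region]
        note = f"'{region}' mapped to representative timezone '{mapped}'"
        return mapped, note
    return region, None


for name in ["US", "Europe/Paris"]:
    print(resolve_region(name))
```

Names that are not aliases pass through unchanged, so a caller can still validate them against its own timezone database afterwards.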

## Output

- Suggested meeting windows by region
- Local-time mapping for each included region
- DST assumption notes when applicable
- Tradeoff summary for each candidate window

## Scope Boundaries

- This skill supports scheduling recommendations, not calendar booking.
- This skill does not validate current DST status from live internet sources.
- This skill does not decide business priority between teams without user-supplied rules.
- Manual confirmation is required before sending invites.

## Stress-Case Rules

For multi-constraint requests, always include these explicit blocks:

1. Assumptions
2. Hard Constraints
3. Recommended Window
4. Tradeoffs
5. Risks and Manual Checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate live calendar availability or confirmed participant agreement.

## Input Validation

This skill accepts: a list of participant regions and a meeting duration for cross-timezone scheduling recommendations.

If the request does not involve cross-timezone meeting planning — for example, asking to book calendar events, check live availability, make travel arrangements, or provide legal scheduling advice — do not proceed with the workflow. Instead respond:
> "time-zone-planner is designed to recommend meeting windows across time zones for distributed teams. Your request appears to be outside this scope. Please provide a region list and meeting duration, or use a more appropriate tool."

## References

- [references/guidelines.md](references/guidelines.md) — General time-zone planning notes
- [references/audit-reference.md](references/audit-reference.md) — Supported scope, audit commands, and fallback boundaries

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.

FILE:POLISH_CHANGELOG.md
# POLISH_CHANGELOG — time-zone-planner

**Original Score:** 84  
**Polish Date:** 2026-03-19

## Issues Addressed

### P0 / Veto Fixes
- None (no veto failures)

### P1 Fixes
- **Region alias ambiguity:** Added "Region Alias Mapping" section with a table showing how short aliases (`US`, `EU`, `Asia`) map to representative IANA timezones, with a note that sub-region specification is recommended for accuracy.
- **Parameters table enriched:** Added `preferred_hours` parameter and updated `regions` description to recommend IANA timezone names. Added format guidance for region input.

### P2 Fixes
- None beyond P1 fixes.

### QS-1 (Input Validation)
- Already present and well-formed.

### QS-2 (Progressive Disclosure)
- File is 115 lines — within 300-line limit. No content moved to references/.

### QS-3 (Canonical YAML Frontmatter)
- Already present with all four required fields.

FILE:references/audit-reference.md
# Audit Reference

## Supported Scope

- Recommend meeting windows across a fixed set of regions.
- Explain timezone tradeoffs and local-time conversions.
- Keep outputs bounded to scheduling support rather than live calendar coordination.

## Stable Audit Commands

```bash
python -m py_compile scripts/main.py
python scripts/main.py
```

## Fallback Boundaries

- If participant regions are missing, request them before proposing a final schedule.
- If the user asks for real-time DST validation or live availability checks, state that those require external confirmation.
- If conflicting constraints cannot all be satisfied, return the least-bad option and explain the tradeoff instead of inventing a perfect overlap.
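
One way to pick a "least-bad" option is to count, for every UTC start hour, how many regions fall outside their preferred local hours. The sketch below is illustrative only: `UTC_OFFSETS` holds assumed fixed offsets (no DST handling), `rank_start_hours` is a hypothetical helper, and only the meeting's start hour is checked.

```python
# Assumed fixed UTC offsets in hours; real planning must account for DST.
UTC_OFFSETS = {"US/Eastern": -5, "Europe/London": 0, "Asia/Shanghai": 8}


def rank_start_hours(preferred: dict, offsets: dict = UTC_OFFSETS) -> list:
    """Rank UTC start hours by how many regions miss their preferred local range.

    `preferred` maps region -> (start_hour, end_hour), end exclusive.
    Returns tuples of (miss_count, utc_hour, local_hours, missed_regions).
    """
    ranked = []
    for utc_hour in range(24):
        local = {r: (utc_hour + offsets[r]) % 24 for r in preferred}
        misses = [
            r for r, (start, end) in preferred.items()
            if not start <= local[r] < end
        ]
        ranked.append((len(misses), utc_hour, local, misses))
    # Fewest misses first; earlier UTC hour breaks ties.
    ranked.sort(key=lambda item: (item[0], item[1]))
    return ranked


preferred = {
    "US/Eastern": (9, 17),
    "Europe/London": (9, 17),
    "Asia/Shanghai": (9, 17),
}
miss_count, utc_hour, local, missed = rank_start_hours(preferred)[0]
print(miss_count, utc_hour, local, missed)
```

With these inputs no hour satisfies all three regions, so the top-ranked entry names the region that has to compromise, which is the tradeoff the response should state.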

## Output Guardrails

- Separate known inputs from assumed time-window preferences.
- Keep recommended windows explicit by region.
- Call out any manual confirmation still required before a meeting is booked.

FILE:references/guidelines.md
# Time Zone Planner - References

## Time Zone Management
- Global Clinical Trial Coordination
- Time Zone Best Practices

FILE:scripts/main.py
#!/usr/bin/env python3
"""Time Zone Planner - Global meeting scheduler."""

import json

class TimeZonePlanner:
    """Plans meetings across time zones."""
    
    def plan(self, regions: list, duration: int) -> dict:
        """Find optimal meeting times."""
        
        # Simplified timezone mapping
        timezones = {
            "US": "EST/EDT (UTC-5/-4)",
            "EU": "CET/CEST (UTC+1/+2)",
            "Asia": "CST (UTC+8)"
        }
        
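        # NOTE: "local_time" and "optimal_window" below are hard-coded
        # placeholders for the default US/EU/Asia set, not computed conversions.
        # Regions missing from `timezones` are silently skipped.
        suggested = []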
        for region in regions:
            if region in timezones:
                suggested.append({
                    "region": region,
                    "local_time": "09:00",
                    "timezone": timezones[region]
                })
        
        return {
            "suggested_times": suggested,
            "optimal_window": "08:00-10:00 EST / 14:00-16:00 CET / 21:00-23:00 CST",
            "duration_minutes": duration
        }

def main():
    planner = TimeZonePlanner()
    result = planner.plan(["US", "EU", "Asia"], 60)
    print(json.dumps(result, indent=2))

if __name__ == "__main__":
    main()


Table 1 Generator

Skill

Automated generation of baseline characteristics tables (Table 1) for clinical research papers.

---
name: table-1-generator
description: Automated generation of baseline characteristics tables (Table 1) for clinical research papers.
license: MIT
skill-author: AIPOCH
---
# Table 1 Generator

Automated generation of baseline characteristics tables (Table 1) for clinical research papers.

## When to Use

- Use this skill when the task needs automated generation of baseline characteristics tables (Table 1) for clinical research papers.
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

See `## Features` below for related details.

- Scope-focused workflow aligned to: Automated generation of baseline characteristics tables (Table 1) for clinical research papers.
- Packaged executable path(s): `scripts/main.py`.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

See `## Prerequisites` below for related details.

- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `numpy`: `unspecified`. Declared in `requirements.txt`.
- `pandas`: `unspecified`. Declared in `requirements.txt`.
- `scipy`: `unspecified`. Declared in `requirements.txt`.

## Example Usage

See `## Usage` below for related details.

```bash
cd "20260318/scientific-skills/Data Analytics/table-1-generator-advanced"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Usage

```text
python scripts/main.py --data patients.csv --group treatment --output table1.csv
```

## Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `--data` | str | Yes | - | Patient data CSV file path |
| `--group` | str | No | - | Grouping variable (e.g., treatment/control) |
| `--vars` | str | No | - | Comma-separated list of variables to include in the table |
| `--output` | str | No | - | Output CSV file path for Table 1; if omitted, the table is only printed |

## Features

- Automatic variable type detection
- Appropriate statistics (mean±SD, median[IQR], n(%))
- Group comparisons (t-test, chi-square)
- Missing data reporting
- APA formatting
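
The three summary formats named above (mean±SD, median[IQR], n(%)) can be reproduced with the standard library alone. This is a sketch for checking expected cell contents by hand, not the packaged pandas/scipy code; `mean_sd`, `median_iqr`, and `n_pct` are hypothetical helper names.

```python
import statistics


def mean_sd(values) -> str:
    """Mean with sample standard deviation, e.g. '70.0 ± 15.8'."""
    return f"{statistics.mean(values):.1f} ± {statistics.stdev(values):.1f}"


def median_iqr(values) -> str:
    """Median with first and third quartiles (linear interpolation)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return f"{statistics.median(values):.1f} [{q1:.1f}-{q3:.1f}]"


def n_pct(values, category) -> str:
    """Count and percentage of one category among all values."""
    n = sum(1 for v in values if v == category)
    return f"{n} ({100 * n / len(values):.1f}%)"


ages = [50, 60, 70, 80, 90]
sexes = ["M", "F", "M", "M"]
print(mean_sd(ages))       # 70.0 ± 15.8
print(median_iqr(ages))    # 70.0 [60.0-80.0]
print(n_pct(sexes, "M"))   # 3 (75.0%)
```

Which format to report for a continuous variable is a judgment call: mean±SD suits roughly symmetric data, while median[IQR] is the usual choice for skewed distributions.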

## Output

- Table 1 (CSV/Excel)
- Statistical test results
- Formatted for publication

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
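
The path-related items in this checklist can be enforced with a small guard before any file is read or written. The following is a minimal sketch, assuming a single workspace root; `resolve_inside` is a hypothetical helper and is not part of `scripts/main.py`.

```python
import tempfile
from pathlib import Path


def resolve_inside(workspace: str, user_path: str) -> Path:
    """Resolve user_path under workspace; raise ValueError if it escapes the root."""
    root = Path(workspace).resolve()
    candidate = (root / user_path).resolve()
    # Accept the root itself or anything whose ancestors include the root.
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes workspace: {user_path}")
    return candidate


with tempfile.TemporaryDirectory() as workspace:
    print(resolve_inside(workspace, "out/table1.csv").name)
    try:
        resolve_inside(workspace, "../secret.csv")
    except ValueError as exc:
        print("rejected:", exc)
```

Resolving before comparing matters: a naive string check for `../` misses absolute paths and symlinks, whereas comparing resolved paths covers both.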

## Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.

## Input Validation

This skill accepts requests that match the documented purpose of `table-1-generator` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `table-1-generator` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.

FILE:requirements.txt
numpy
pandas
scipy

FILE:scripts/main.py
#!/usr/bin/env python3
"""
Table 1 Generator
Automated baseline characteristics table for clinical papers.
"""

import argparse
import pandas as pd
import numpy as np
from scipy import stats


class Table1Generator:
    """Generate baseline characteristics table."""
    
    def __init__(self, data):
        self.data = data
    
    def detect_var_type(self, var):
        """Detect variable type."""
        if pd.api.types.is_numeric_dtype(self.data[var]):
            unique_count = self.data[var].nunique()
            if unique_count <= 5:
                return "categorical"
            return "continuous"
        return "categorical"
    
    def summarize_continuous(self, var, group_var=None):
        """Summarize continuous variable."""
        if group_var:
            results = []
            # Drop missing group labels so NaN does not form an empty group
            groups = self.data[group_var].dropna().unique()
            
            for group in sorted(groups):
                subset = self.data[self.data[group_var] == group][var].dropna()
                mean = subset.mean()
                std = subset.std()
                median = subset.median()
                q25 = subset.quantile(0.25)
                q75 = subset.quantile(0.75)
                n = len(subset)
                
                results.append({
                    "group": group,
                    "n": n,
                    "mean_sd": f"{mean:.1f} ± {std:.1f}",
                    "median_iqr": f"{median:.1f} [{q25:.1f}-{q75:.1f}]"
                })
            
            # Statistical test
            groups_data = [self.data[self.data[group_var] == g][var].dropna() 
                          for g in groups]
            if len(groups) == 2:
                stat, pvalue = stats.ttest_ind(groups_data[0], groups_data[1])
                test = f"t-test p={pvalue:.3f}"
            else:
                stat, pvalue = stats.f_oneway(*groups_data)
                test = f"ANOVA p={pvalue:.3f}"
            
            return results, test
        else:
            subset = self.data[var].dropna()
            mean = subset.mean()
            std = subset.std()
            median = subset.median()
            q25 = subset.quantile(0.25)
            q75 = subset.quantile(0.75)
            n = len(subset)
            
            return [{
                "n": n,
                "mean_sd": f"{mean:.1f} ± {std:.1f}",
                "median_iqr": f"{median:.1f} [{q25:.1f}-{q75:.1f}]"
            }], ""
    
    def summarize_categorical(self, var, group_var=None):
        """Summarize categorical variable."""
        if group_var:
            crosstab = pd.crosstab(self.data[var], self.data[group_var])
            percentages = pd.crosstab(self.data[var], self.data[group_var], 
                                     normalize='columns') * 100
            
            results = []
            for category in crosstab.index:
                row = {"category": category}
                for group in crosstab.columns:
                    n = crosstab.loc[category, group]
                    pct = percentages.loc[category, group]
                    row[f"group_{group}"] = f"{n} ({pct:.1f}%)"
                results.append(row)
            
            # Chi-square test
            chi2, pvalue, dof, expected = stats.chi2_contingency(crosstab)
            test = f"Chi-square p={pvalue:.3f}"
            
            return results, test
        else:
            counts = self.data[var].value_counts()
            percentages = self.data[var].value_counts(normalize=True) * 100
            
            results = []
            for category in counts.index:
                results.append({
                    "category": category,
                    "n_pct": f"{counts[category]} ({percentages[category]:.1f}%)"
                })
            
            return results, ""
    
    def generate_table(self, group_var=None, var_list=None):
        """Generate complete Table 1."""
        if var_list is None:
            var_list = [c for c in self.data.columns if c != group_var]
        
        table_rows = []
        
        for var in var_list:
            var_type = self.detect_var_type(var)
            
            if var_type == "continuous":
                results, test = self.summarize_continuous(var, group_var)
                table_rows.append({
                    "Variable": var,
                    "Type": "Continuous",
                    "Statistics": results,
                    "P-value": test
                })
            else:
                results, test = self.summarize_categorical(var, group_var)
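                for i, r in enumerate(results):
                    table_rows.append({
                        # Keep the variable name so category rows stay identifiable
                        "Variable": f"{var}: {r.get('category', var)}",
                        "Type": "Categorical",
                        "Statistics": r,
                        # Show the test result once, on the first category row
                        "P-value": test if i == 0 else ""
                    })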
        
        return table_rows
    
    def print_table(self, table_rows):
        """Print formatted table."""
        print("\n" + "=" * 80)
        print("TABLE 1: BASELINE CHARACTERISTICS")
        print("=" * 80)
        
        for row in table_rows:
            print(f"\n{row['Variable']} ({row['Type']})")
            print(f"  Stats: {row['Statistics']}")
            if row['P-value']:
                print(f"  Test: {row['P-value']}")
        
        print("=" * 80 + "\n")


def main():
    parser = argparse.ArgumentParser(description="Table 1 Generator")
    parser.add_argument("--data", "-d", required=True, help="Data CSV file")
    parser.add_argument("--group", "-g", help="Grouping variable")
    parser.add_argument("--vars", "-v", help="Comma-separated variables to include")
    parser.add_argument("--output", "-o", help="Output CSV file")
    
    args = parser.parse_args()
    
    # Load data
    data = pd.read_csv(args.data)
    
    # Select variables
    var_list = None
    if args.vars:
        var_list = [v.strip() for v in args.vars.split(",")]
    
    # Generate table
    generator = Table1Generator(data)
    table = generator.generate_table(args.group, var_list)
    
    # Print
    generator.print_table(table)
    
    # Save if requested
    if args.output:
        # Flatten for CSV output
        flat_rows = []
        for row in table:
            flat_row = {
                "Variable": row["Variable"],
                "Type": row["Type"],
                "P-value": row["P-value"]
            }
            if isinstance(row["Statistics"], list):
                for i, stat in enumerate(row["Statistics"]):
                    for key, val in stat.items():
                        flat_row[f"Group{i}_{key}"] = val
            else:
                for key, val in row["Statistics"].items():
                    flat_row[key] = val
            flat_rows.append(flat_row)
        
        df = pd.DataFrame(flat_rows)
        df.to_csv(args.output, index=False)
        print(f"Table saved to: {args.output}")


if __name__ == "__main__":
    main()


Systematic Review Screener

Skill

Automated abstract screening tool for systematic literature reviews with PRISMA workflow support.

---
name: systematic-review-screener
description: Automated abstract screening tool for systematic literature reviews with PRISMA workflow support.
license: MIT
skill-author: AIPOCH
---
# Systematic Review Screener

Automated abstract screening tool for systematic literature reviews with PRISMA workflow support.

## When to Use

- Use this skill when the task calls for automated abstract screening in a systematic literature review, with PRISMA workflow support.
- Use this skill for evidence insight tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

See `## Features` below for related details.

- Scope-focused workflow aligned to: Automated abstract screening tool for systematic literature reviews with PRISMA workflow support.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

See `## Prerequisites` below for related details.

- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `dataclasses`: standard library (Python 3.7+). No separate installation is needed.
- `yaml`: `unspecified`. Imported by `scripts/main.py`; the PyPI package that provides this module is `PyYAML`.

## Example Usage

See `## Usage` below for related details.

```bash
cd "20260318/scientific-skills/Evidence Insight/systematic-review-screener"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Overview

This skill screens academic abstracts against predefined inclusion/exclusion criteria, generating PRISMA-compliant outputs with decision rationale and confidence scores.

**Technical Difficulty: High** ⚠️ Manual verification recommended for final inclusion decisions.

## Features

- **Multi-format Input**: PubMed MEDLINE, EndNote XML, CSV/TSV
- **Criteria Matching**: Configurable inclusion/exclusion rules
- **Confidence Scoring**: Confidence score from 0.0 to 1.0 for each decision, compared against the configured threshold
- **Conflict Detection**: Flags abstracts requiring human review
- **PRISMA Export**: Flow diagram data and screening log
- **Batch Processing**: Handles large reference sets efficiently

## Usage

### Basic Screening

```bash
# Run with default settings
python scripts/main.py --input references.csv --criteria criteria.yaml
```

### With PRISMA Export

```bash
python scripts/main.py --input references.xml --criteria criteria.yaml \
  --output results/ --prisma --format excel
```

### Confidence Threshold

```bash
python scripts/main.py --input refs.txt --criteria criteria.yaml \
  --threshold 0.8 --conflict-only
```

## Input Formats

### 1. CSV/TSV
Required columns: `title`, `abstract` (optional: `authors`, `year`, `doi`, `pmid`)

```csv
title,abstract,authors,year
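"Case Report: Rare Complication of Diabetes Medication","We present a case of a 45-year-old patient who developed...","Brown K",2022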
```
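
The column requirement above can be checked before a screening run. The following is a minimal sketch, not part of the packaged script; the helper name `missing_columns` is hypothetical, and it only inspects the header row.

```python
import csv
import io

# Required columns as stated in the format description above.
REQUIRED_COLUMNS = {"title", "abstract"}


def missing_columns(csv_text: str, delimiter: str = ",") -> set:
    """Return the required columns that are absent from the header row."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    header = set(reader.fieldnames or [])
    return REQUIRED_COLUMNS - header


good = "title,abstract,authors,year\nA trial,Some text,Smith J,2023\n"
bad = "title,authors,year\nA trial,Smith J,2023\n"

print(missing_columns(good))  # no required column is missing
print(missing_columns(bad))   # 'abstract' is missing
```

For TSV input, pass `delimiter="\t"`.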

### 2. PubMed MEDLINE
Standard .txt export from PubMed search.

### 3. EndNote XML
Export from EndNote with abstracts included.

## Criteria File (YAML)

See `references/criteria_template.yaml` for complete example:

```yaml
study_type:
  include:
    - "randomized controlled trial"
    - "systematic review"
  exclude:
    - "case report"
    - "letter"
    - "editorial"

population:
  include_keywords:
    - "adults"
    - "elderly"
  exclude_keywords:
    - "pediatric"
    - "children"

intervention:
  required:
    - "drug therapy"
    - "medication"

language:
  allowed: ["English"]
  
year_range:
  min: 2010
  max: 2024

confidence_threshold: 0.75
```
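
Once the YAML file is loaded into a dictionary, a quick structural check can catch common mistakes such as a threshold written as a percentage. The sketch below is an assumption about what is worth validating, not the script's own behavior; `check_criteria` is a hypothetical helper and works on a plain `dict`, so it does not depend on any YAML library.

```python
def check_criteria(criteria: dict) -> list:
    """Return human-readable problems; an empty list means the structure looks sane."""
    problems = []

    threshold = criteria.get("confidence_threshold", 0.75)
    if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
        problems.append("confidence_threshold must be a number between 0.0 and 1.0")

    year_range = criteria.get("year_range", {})
    year_min, year_max = year_range.get("min"), year_range.get("max")
    if year_min is not None and year_max is not None and year_min > year_max:
        problems.append("year_range.min must not exceed year_range.max")

    # Keyword lists used by the screening criteria shown above.
    keyword_lists = [
        ("study_type", "include"), ("study_type", "exclude"),
        ("population", "include_keywords"), ("population", "exclude_keywords"),
        ("intervention", "required"),
    ]
    for section, key in keyword_lists:
        value = criteria.get(section, {}).get(key, [])
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{section}.{key} must be a list of strings")

    return problems


example = {
    "study_type": {"include": ["randomized controlled trial"], "exclude": ["case report"]},
    "population": {"include_keywords": ["adults"], "exclude_keywords": ["pediatric"]},
    "intervention": {"required": ["drug therapy"]},
    "year_range": {"min": 2010, "max": 2024},
    "confidence_threshold": 0.75,
}
print(check_criteria(example))
print(check_criteria({"confidence_threshold": 75}))
```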

## Output Files

| File | Description |
|------|-------------|
| `screened_included.csv` | Records passing all criteria |
| `screened_excluded.csv` | Records failing one or more criteria |
| `conflicts.csv` | Low-confidence decisions requiring review |
| `prisma_data.json` | PRISMA flow diagram counts |
| `screening_log.json` | Full decision trail with rationale |

## PRISMA Workflow Support

Generates structured data for PRISMA 2020 flow diagram:

```json
{
  "identification": {
    "database_results": 1250,
    "register_results": 45,
    "other_sources": 12
  },
  "screening": {
    "records_screened": 1307,
    "records_excluded": 1150,
    "full_text_assessed": 157,
    "full_text_excluded": 89
  },
  "included": {
    "qualitative_synthesis": 68,
    "quantitative_synthesis": 42
  }
}
```
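
The counts in a flow-diagram file should add up from one stage to the next. The sketch below checks that arithmetic on the example above. It assumes that no records are lost between stages other than through the listed exclusions, and that a missing `duplicates_removed` key means zero; `prisma_inconsistencies` is a hypothetical helper.

```python
import json


def prisma_inconsistencies(data: dict) -> list:
    """Return a list of stage-to-stage count mismatches (empty if consistent)."""
    ident, screen, incl = data["identification"], data["screening"], data["included"]
    problems = []

    identified = ident["database_results"] + ident["register_results"] + ident["other_sources"]
    expected_screened = identified - ident.get("duplicates_removed", 0)
    if screen["records_screened"] != expected_screened:
        problems.append(f"records_screened should be {expected_screened}")

    expected_full_text = screen["records_screened"] - screen["records_excluded"]
    if screen["full_text_assessed"] != expected_full_text:
        problems.append(f"full_text_assessed should be {expected_full_text}")

    expected_included = screen["full_text_assessed"] - screen["full_text_excluded"]
    if incl["qualitative_synthesis"] != expected_included:
        problems.append(f"qualitative_synthesis should be {expected_included}")

    if incl["quantitative_synthesis"] > incl["qualitative_synthesis"]:
        problems.append("quantitative_synthesis cannot exceed qualitative_synthesis")

    return problems


example = json.loads("""
{
  "identification": {"database_results": 1250, "register_results": 45, "other_sources": 12},
  "screening": {"records_screened": 1307, "records_excluded": 1150,
                "full_text_assessed": 157, "full_text_excluded": 89},
  "included": {"qualitative_synthesis": 68, "quantitative_synthesis": 42}
}
""")
print(prisma_inconsistencies(example))
```

For the example, 1250 + 45 + 12 = 1307 records screened, 1307 - 1150 = 157 assessed in full text, and 157 - 89 = 68 included, so the list of problems is empty.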

## Configuration

### Environment Variables
```bash
export SCREENING_THRESHOLD=0.75  # Default confidence threshold
export BATCH_SIZE=100             # Records per batch
export MAX_WORKERS=4              # Parallel processing workers
```
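
A minimal sketch of how these variables could be read with the documented defaults. This is an illustration only: the portion of `scripts/main.py` shown here does not confirm that the script reads them, and `load_settings` is a hypothetical helper.

```python
import os


def load_settings(env=None) -> dict:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "threshold": float(env.get("SCREENING_THRESHOLD", "0.75")),
        "batch_size": int(env.get("BATCH_SIZE", "100")),
        "max_workers": int(env.get("MAX_WORKERS", "4")),
    }


print(load_settings({}))
print(load_settings({"SCREENING_THRESHOLD": "0.8", "MAX_WORKERS": "2"}))
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test.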

### Command Line Options

| Option | Description | Default |
|--------|-------------|---------|
| `--input` | Input file path | Required |
| `--criteria` | Criteria YAML path | Required |
| `--output` | Output directory | `./output` |
| `--format` | Output format: csv/excel/json | csv |
| `--threshold` | Confidence threshold | 0.75 |
| `--prisma` | Generate PRISMA data | False |
| `--conflict-only` | Export only conflicts | False |
| `--batch-size` | Processing batch size | 100 |
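
The option table maps directly onto an `argparse` parser. The sketch below mirrors the table; it is not copied from `scripts/main.py`, whose actual parser may differ in help text or validation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Systematic review abstract screener (sketch)")
    parser.add_argument("--input", required=True, help="Input file path")
    parser.add_argument("--criteria", required=True, help="Criteria YAML path")
    parser.add_argument("--output", default="./output", help="Output directory")
    parser.add_argument("--format", choices=["csv", "excel", "json"], default="csv",
                        help="Output format")
    parser.add_argument("--threshold", type=float, default=0.75, help="Confidence threshold")
    parser.add_argument("--prisma", action="store_true", help="Generate PRISMA data")
    parser.add_argument("--conflict-only", action="store_true", help="Export only conflicts")
    parser.add_argument("--batch-size", type=int, default=100, help="Processing batch size")
    return parser


args = build_parser().parse_args(
    ["--input", "references.csv", "--criteria", "criteria.yaml",
     "--threshold", "0.8", "--conflict-only"]
)
print(args.threshold, args.conflict_only, args.output, args.batch_size)
```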

## Decision Algorithm

1. **Keyword Matching**: Exact and fuzzy keyword matching against title/abstract
2. **Inclusion Scoring**: Points for each inclusion criterion matched
3. **Exclusion Check**: Immediate exclusion if exclusion criterion detected
4. **Confidence Calculation**: Weighted score based on keyword presence and clarity
5. **Conflict Flagging**: Records with confidence < threshold flagged for manual review
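
The scoring steps above can be sketched as a short worked example. It is deliberately simplified relative to `scripts/main.py`: it counts whole-word matches only and averages over every category, whereas the script also credits partial matches and skips some unmatched categories. The function names are hypothetical.

```python
import re


def category_score(text: str, keywords: list) -> float:
    """Fraction of the category's keywords found as whole words or phrases."""
    if not keywords:
        return 0.0
    text = text.lower()
    hits = sum(
        1 for kw in keywords
        if re.search(r"\b" + re.escape(kw.lower()) + r"\b", text)
    )
    return hits / len(keywords)


def overall_confidence(text: str, categories: dict) -> float:
    """Average the per-category scores; 0.5 (neutral) if there are no categories."""
    scores = [category_score(text, kws) for kws in categories.values()]
    return sum(scores) / len(scores) if scores else 0.5


categories = {
    "study_type": ["randomized controlled trial", "systematic review"],
    "population": ["adults", "elderly"],
    "intervention": ["drug therapy", "medication"],
}
text = "A randomized controlled trial of drug therapy in adults"
threshold = 0.75

confidence = overall_confidence(text, categories)
print(confidence, "flag for review:", confidence < threshold)
```

Each category matches one of its two keywords, so every category scores 0.5 and the average is 0.5. That is below the 0.75 threshold, so the record would be flagged for manual review even though it looks relevant. Because the score is divided by the number of keywords in a list, longer keyword lists lower the confidence of a single match, which is worth keeping in mind when setting the threshold.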

## Limitations

- **Not for Final Decisions**: Tool provides recommendations; human review required for inclusion
- **Language Dependent**: Optimized for English abstracts
- **Structured Abstracts**: Performs better on structured abstracts (Background/Methods/Results/Conclusion)
- **Domain Specific**: Criteria must be tailored to research question

## References

- `references/criteria_template.yaml` - Complete criteria configuration example
- `references/prisma_2020_checklist.pdf` - PRISMA 2020 reporting guidelines
- `references/sample_references.csv` - Example input format

## Version

Version: 1.0.0  
Last Updated: 2026-02-05  
Classification: Research Tool - Requires Human Verification

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python script (`scripts/main.py`) executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
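
The two path-related items in the checklist can be verified with `pathlib`. This is a minimal sketch under the assumption that all inputs and outputs should live under a single workspace directory; `is_inside` is a hypothetical helper, not a function of the packaged script.

```python
import tempfile
from pathlib import Path


def is_inside(workspace: Path, candidate: str) -> bool:
    """True if candidate, resolved relative to workspace, stays within workspace."""
    root = workspace.resolve()
    # Resolving collapses '..' segments and follows symlinks before the comparison.
    target = (root / candidate).resolve()
    return target == root or root in target.parents


workspace = Path(tempfile.mkdtemp())
print(is_inside(workspace, "results/screened_included.csv"))  # stays inside
print(is_inside(workspace, "../outside.csv"))                 # escapes via '..'
print(is_inside(workspace, "/etc/passwd"))                    # absolute path elsewhere
```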

## Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.

## Input Validation

This skill accepts requests that match the documented purpose of `systematic-review-screener` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `systematic-review-screener` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.

FILE:references/criteria_template.yaml
# Systematic Review Screening Criteria Template
# Copy and customize this file for your review

# =============================================================================
# STUDY TYPE CRITERIA
# =============================================================================
study_type:
  # Study designs to INCLUDE
  include:
    - "randomized controlled trial"
    - "randomised controlled trial"
    - "RCT"
    - "systematic review"
    - "meta-analysis"
    - "clinical trial"
    - "controlled clinical trial"
    - "cluster randomized"
  
  # Study designs to EXCLUDE (immediate exclusion)
  exclude:
    - "case report"
    - "case series"
    - "letter"
    - "editorial"
    - "commentary"
    - "opinion"
    - "narrative review"
    - "scoping review"
    - "protocol"
    - "conference abstract"

# =============================================================================
# POPULATION CRITERIA
# =============================================================================
population:
  # Keywords indicating target population (inclusion)
  include_keywords:
    - "adults"
    - "adult"
    - "elderly"
    - "older adults"
    - "aged"
    - "middle-aged"
    - "geriatric"
  
  # Keywords indicating excluded populations
  exclude_keywords:
    - "pediatric"
    - "children"
    - "adolescent"
    - "infant"
    - "neonatal"
    - "animal"
    - "in vitro"
    - "cell line"

# =============================================================================
# INTERVENTION CRITERIA
# =============================================================================
intervention:
  # At least one of these must be present for inclusion
  required:
    - "drug therapy"
    - "medication"
    - "pharmacotherapy"
    - "treatment"
    - "intervention"
    - "therapy"

# =============================================================================
# OUTCOME CRITERIA (Optional - for information only)
# =============================================================================
outcome:
  # Outcomes of interest (informational, not exclusionary)
  interest:
    - "efficacy"
    - "effectiveness"
    - "safety"
    - "adverse events"
    - "quality of life"

# =============================================================================
# LANGUAGE CRITERIA
# =============================================================================
language:
  allowed:
    - "English"
  # Note: Language detection is heuristic-based

# =============================================================================
# PUBLICATION YEAR CRITERIA
# =============================================================================
year_range:
  min: 2010
  max: 2024

# =============================================================================
# SCREENING CONFIGURATION
# =============================================================================
# Confidence threshold (0.0 - 1.0)
# Records below this threshold are flagged for manual review
confidence_threshold: 0.75

# Minimum abstract length for reliable screening (characters)
min_abstract_length: 50

# =============================================================================
# EXAMPLE: SYSTEMATIC REVIEW ON DIABETES TREATMENT
# =============================================================================
# Uncomment and modify for your specific review:
#
# condition:
#   required:
#     - "diabetes"
#     - "type 2 diabetes"
#     - "T2DM"
#   exclude:
#     - "type 1 diabetes"
#     - "gestational diabetes"
#
# intervention:
#   required:
#     - "metformin"
#     - "SGLT2 inhibitor"
#     - "GLP-1 agonist"
#   exclude:
#     - "insulin monotherapy"
#
# comparator:
#   allowed:
#     - "placebo"
#     - "standard care"
#     - "active comparator"
#
# outcome:
#   primary:
#     - "HbA1c"
#     - "glycemic control"
#   secondary:
#     - "weight"
#     - "cardiovascular"
#     - "hypoglycemia"

FILE:references/sample_references.csv
title,abstract,authors,year,doi,pmid
"Effect of Exercise on Type 2 Diabetes: A Randomized Controlled Trial","Background: Physical activity is recommended for type 2 diabetes management. Methods: We conducted a randomized controlled trial of 200 adults with T2DM. Results: Exercise intervention reduced HbA1c by 0.8%. Conclusions: Exercise is effective for glycemic control.","Smith J; Jones A",2023,10.1000/example1,38000001
"Case Report: Rare Complication of Diabetes Medication","We present a case of a 45-year-old patient who developed...","Brown K",2022,10.1000/example2,38000002
"Systematic Review of SGLT2 Inhibitors in Elderly Patients","Objective: To review the safety and efficacy of SGLT2 inhibitors in adults over 65. Methods: Systematic review of RCTs. Results: 15 studies met inclusion criteria. Conclusions: SGLT2 inhibitors are safe and effective in elderly patients.","Davis M; Wilson R",2024,10.1000/example3,38000003
"Letter to Editor: Re: Diabetes Treatment Guidelines","I read with interest the recent article on guidelines...","Johnson T",2023,10.1000/example4,38000004
"Metformin vs Placebo in Pediatric Diabetes: A Randomized Trial","Background: Metformin is used off-label in children. Methods: RCT in 100 pediatric patients. Results: Significant improvement in glycemic control.","Lee S; Chen H",2021,10.1000/example5,38000005
"Pharmacotherapy for Type 2 Diabetes: A Meta-Analysis","Objective: To evaluate drug therapy options for T2DM. Methods: Meta-analysis of 50 RCTs. Results: Multiple agents showed efficacy. Conclusions: Individualized therapy is recommended.","Anderson P; Taylor L",2022,10.1000/example6,38000006
"Narrative Review of Diabetes Management Strategies","This review discusses various approaches to managing diabetes...","White G",2023,10.1000/example7,38000007
"Randomized Controlled Trial of GLP-1 Agonist in Adults with T2DM","Background: GLP-1 agonists show promise. Methods: Double-blind RCT in 300 adults. Results: Significant weight loss and HbA1c reduction.","Martinez C; Garcia R",2024,10.1000/example8,38000008
"Animal Study: Effects of Compound X on Diabetic Rats","Methods: 50 rats were given compound X...","Zhang Y",2020,10.1000/example9,38000009
"Cluster Randomized Trial of Diabetes Education Program","Background: Education improves outcomes. Methods: Cluster RCT in 20 clinics. Results: Improved self-management in intervention group.","Thompson E; Brown A",2023,10.1000/example10,38000010

FILE:requirements.txt
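# dataclasses is part of the standard library on Python 3.7+, so it is not listed here.
# The yaml module imported by scripts/main.py is provided by PyYAML.
PyYAML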

FILE:scripts/main.py
#!/usr/bin/env python3
"""
Systematic Review Abstract Screener
Automated screening of academic abstracts against inclusion/exclusion criteria.
Supports PRISMA workflow and generates screening reports.

Technical Difficulty: High - Manual verification required for final decisions.
"""

import argparse
import csv
import json
import re
import sys
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field, asdict
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Any, Tuple
import yaml


@dataclass
class ScreeningResult:
    """Result of screening a single reference."""
    record_id: str
    title: str
    abstract: str
    authors: str = ""
    year: str = ""
    doi: str = ""
    pmid: str = ""
    
    # Screening outcomes
    include: bool = False
    confidence: float = 0.0
    decision_rationale: List[str] = field(default_factory=list)
    matched_criteria: Dict[str, Any] = field(default_factory=dict)
    exclusion_reasons: List[str] = field(default_factory=list)
    requires_review: bool = False
    
    def to_dict(self) -> Dict:
        return asdict(self)


@dataclass
class PRISMAData:
    """PRISMA 2020 flow diagram data structure."""
    identification: Dict[str, int] = field(default_factory=lambda: {
        "database_results": 0,
        "register_results": 0,
        "other_sources": 0,
        "duplicates_removed": 0
    })
    screening: Dict[str, int] = field(default_factory=lambda: {
        "records_screened": 0,
        "records_excluded": 0,
        "full_text_assessed": 0,
        "full_text_excluded": 0
    })
    included: Dict[str, int] = field(default_factory=lambda: {
        "qualitative_synthesis": 0,
        "quantitative_synthesis": 0
    })
    
    def to_dict(self) -> Dict:
        return asdict(self)


class CriteriaMatcher:
    """Matches abstracts against inclusion/exclusion criteria."""
    
    def __init__(self, criteria: Dict[str, Any]):
        self.criteria = criteria
        self.confidence_threshold = criteria.get('confidence_threshold', 0.75)
    
    def _normalize_text(self, text: str) -> str:
        """Normalize text for matching."""
        return text.lower().strip()
    
    def _fuzzy_match(self, text: str, keywords: List[str]) -> Tuple[bool, List[str], float]:
        """
        Check if any keywords appear in text.
        Returns: (matched, matched_keywords, confidence_score)
        """
        text_lower = self._normalize_text(text)
        matched = []
        total_score = 0.0
        
        for keyword in keywords:
            keyword_lower = self._normalize_text(keyword)
            
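            # Whole-word (or whole-phrase) match: strongest evidence.
            # Checked first; a plain substring test would always shadow it.
            if re.search(r'\b' + re.escape(keyword_lower) + r'\b', text_lower):
                matched.append(keyword)
                total_score += 1.0
            # Substring match inside a longer word (e.g. "adult" in "adulthood")
            elif keyword_lower in text_lower:
                matched.append(keyword)
                total_score += 0.9
            # Loose match that ignores spaces (e.g. "cell line" vs "cellline")
            elif keyword_lower.replace(' ', '') in text_lower.replace(' ', ''):
                matched.append(keyword)
                total_score += 0.7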
        
        confidence = min(1.0, total_score / max(1, len(keywords))) if keywords else 0.0
        return len(matched) > 0, matched, confidence
    
    def screen(self, reference: Dict[str, str]) -> ScreeningResult:
        """
        Screen a single reference against criteria.
        
        Args:
            reference: Dictionary with title, abstract, and metadata
            
        Returns:
            ScreeningResult with decision and rationale
        """
        title = reference.get('title', '')
        abstract = reference.get('abstract', '')
        full_text = f"{title} {abstract}"
        
        result = ScreeningResult(
            record_id=reference.get('record_id', ''),
            title=title,
            abstract=abstract,
            authors=reference.get('authors', ''),
            year=reference.get('year', ''),
            doi=reference.get('doi', ''),
            pmid=reference.get('pmid', '')
        )
        
        exclusion_confidence = 0.0
        inclusion_confidence = 0.0
        
        # Check exclusion criteria first (hard exclusions)
        if 'study_type' in self.criteria:
            exclude_types = self.criteria['study_type'].get('exclude', [])
            matched, keywords, conf = self._fuzzy_match(full_text, exclude_types)
            if matched:
                result.exclude = True
                result.exclusion_reasons.append(f"Excluded study type: {', '.join(keywords)}")
                exclusion_confidence = max(exclusion_confidence, conf)
        
        # Check population exclusions
        if 'population' in self.criteria:
            exclude_pop = self.criteria['population'].get('exclude_keywords', [])
            matched, keywords, conf = self._fuzzy_match(full_text, exclude_pop)
            if matched:
                result.exclude = True
                result.exclusion_reasons.append(f"Excluded population: {', '.join(keywords)}")
                exclusion_confidence = max(exclusion_confidence, conf)
        
        # Check year range
        if 'year_range' in self.criteria and result.year:
            try:
                year = int(result.year)
                year_min = self.criteria['year_range'].get('min', 1900)
                year_max = self.criteria['year_range'].get('max', 2100)
                if year < year_min or year > year_max:
                    result.exclude = True
                    result.exclusion_reasons.append(f"Year {year} outside range [{year_min}-{year_max}]")
                    exclusion_confidence = 1.0
            except ValueError:
                pass
        
        # Check language
        if 'language' in self.criteria:
            allowed = self.criteria['language'].get('allowed', [])
            # Simple language detection from abstract
            lang_conf = self._detect_language_confidence(abstract)
            if allowed and lang_conf < 0.5:
                result.exclude = True
                result.exclusion_reasons.append(f"Language not in allowed list: {allowed}")
        
        # Check inclusion criteria (if not already excluded)
        if not result.exclusion_reasons:
            inclusion_scores = []
            
            # Study type inclusion
            if 'study_type' in self.criteria:
                include_types = self.criteria['study_type'].get('include', [])
                matched, keywords, conf = self._fuzzy_match(full_text, include_types)
                if matched:
                    result.matched_criteria['study_type'] = keywords
                    inclusion_scores.append(conf)
                    result.decision_rationale.append(f"Matched study types: {', '.join(keywords)}")
            
            # Population inclusion
            if 'population' in self.criteria:
                include_pop = self.criteria['population'].get('include_keywords', [])
                matched, keywords, conf = self._fuzzy_match(full_text, include_pop)
                if matched:
                    result.matched_criteria['population'] = keywords
                    inclusion_scores.append(conf)
                    result.decision_rationale.append(f"Matched population: {', '.join(keywords)}")
            
            # Intervention required
            if 'intervention' in self.criteria:
                required = self.criteria['intervention'].get('required', [])
                matched, keywords, conf = self._fuzzy_match(full_text, required)
                if matched:
                    result.matched_criteria['intervention'] = keywords
                    inclusion_scores.append(conf)
                    result.decision_rationale.append(f"Matched intervention: {', '.join(keywords)}")
                else:
                    # Missing required intervention
                    result.decision_rationale.append("No required intervention keywords found")
                    inclusion_scores.append(0.0)
            
            # Calculate overall confidence
            if inclusion_scores:
                inclusion_confidence = sum(inclusion_scores) / len(inclusion_scores)
            else:
                inclusion_confidence = 0.5  # Neutral if no criteria
            
            # Decision: include if reasonable confidence and no exclusions
            result.include = inclusion_confidence >= self.confidence_threshold
        
        # Final confidence calculation
        if result.exclusion_reasons:
            result.confidence = exclusion_confidence
            result.include = False
        else:
            result.confidence = inclusion_confidence
        
        # Flag for manual review if confidence below threshold
        if result.confidence < self.confidence_threshold:
            result.requires_review = True
        
        return result
    
    def _detect_language_confidence(self, text: str) -> float:
        """Simple heuristic for English language confidence."""
        if not text:
            return 0.0
        
        # Common English words check
        english_indicators = ['the', 'and', 'of', 'to', 'in', 'a', 'is', 'that', 'for', 'with']
        text_lower = text.lower()
        words = text_lower.split()
        
        if not words:
            return 0.0
        
        indicator_count = sum(1 for word in words if word in english_indicators)
        return min(1.0, indicator_count / max(1, len(words) * 0.1))


class ReferenceParser:
    """Parse references from various input formats."""
    
    @staticmethod
    def parse_csv(filepath: Path) -> List[Dict[str, str]]:
        """Parse CSV or TSV file."""
        records = []
        with open(filepath, 'r', encoding='utf-8-sig') as f:
            # Detect delimiter
            sample = f.read(1024)
            f.seek(0)
            delimiter = '\t' if '\t' in sample else ','
            
            reader = csv.DictReader(f, delimiter=delimiter)
            for idx, row in enumerate(reader):
                record = {
                    'record_id': str(idx),
                    'title': row.get('title', ''),
                    'abstract': row.get('abstract', ''),
                    'authors': row.get('authors', ''),
                    'year': row.get('year', ''),
                    'doi': row.get('doi', ''),
                    'pmid': row.get('pmid', '')
                }
                records.append(record)
        return records
    
    @staticmethod
    def parse_pubmed(filepath: Path) -> List[Dict[str, str]]:
        """Parse PubMed MEDLINE format."""
        records = []
        current_record = {}
        current_field = None  # Field that the most recent tagged line belonged to
        
        with open(filepath, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.rstrip()
                
                if line.startswith('      '):
                    # Continuation line: extend only the title or abstract it follows,
                    # so wrapped lines of other fields do not leak into the abstract
                    if current_field in ('title', 'abstract'):
                        current_record[current_field] += ' ' + line.strip()
                    continue
                
                current_field = None
                if line.startswith('PMID-'):
                    if current_record:
                        current_record['record_id'] = current_record.get('pmid', '')
                        records.append(current_record)
                    current_record = {'pmid': line[6:].strip()}
                elif line.startswith('TI  -'):
                    current_record['title'] = line[6:].strip()
                    current_field = 'title'
                elif line.startswith('AB  -'):
                    current_record['abstract'] = line[6:].strip()
                    current_field = 'abstract'
                elif line.startswith('AU  -'):
                    authors = current_record.get('authors', '')
                    name = line[6:].strip()
                    # Join author names with '; ' (no separator before the first one)
                    current_record['authors'] = f"{authors}; {name}" if authors else name
                elif line.startswith('DP  -'):
                    year_match = re.search(r'(\d{4})', line[6:])
                    if year_match:
                        current_record['year'] = year_match.group(1)
                elif line.startswith('LID -') and '[doi]' in line:
                    doi_match = re.search(r'(10\.\S+)', line)
                    if doi_match:
                        current_record['doi'] = doi_match.group(1)
            
            # Add last record
            if current_record:
                current_record['record_id'] = current_record.get('pmid', '')
                records.append(current_record)
        
        return records
    
    @staticmethod
    def parse_endnote(filepath: Path) -> List[Dict[str, str]]:
        """Parse EndNote XML format."""
        records = []
        tree = ET.parse(filepath)
        root = tree.getroot()
        
        for idx, record in enumerate(root.findall('.//record')):
            rec = {'record_id': str(idx)}
            
            # Title
            title_elem = record.find('.//titles/title/style')
            if title_elem is not None:
                rec['title'] = title_elem.text or ''
            
            # Abstract
            abstract_elem = record.find('.//abstract/style')
            if abstract_elem is not None:
                rec['abstract'] = abstract_elem.text or ''
            
            # Authors
            authors = []
            for author in record.findall('.//contributors/authors/author/style'):
                if author.text:
                    authors.append(author.text)
            rec['authors'] = '; '.join(authors)
            
            # Year
            year_elem = record.find('.//dates/year/style')
            if year_elem is not None and year_elem.text:
                rec['year'] = year_elem.text
            
            # DOI
            doi_elem = record.find('.//electronic-resource-num/style')
            if doi_elem is not None and doi_elem.text:
                rec['doi'] = doi_elem.text
            
            records.append(rec)
        
        return records


class ReportGenerator:
    """Generate screening reports in various formats."""
    
    @staticmethod
    def generate_prisma_data(results: List[ScreeningResult]) -> PRISMAData:
        """Generate PRISMA 2020 flow diagram data."""
        data = PRISMAData()
        
        total = len(results)
        included = sum(1 for r in results if r.include and not r.requires_review)
        excluded = sum(1 for r in results if not r.include and not r.requires_review)
        conflicts = sum(1 for r in results if r.requires_review)
        
        data.identification['database_results'] = total
        data.screening['records_screened'] = total
        data.screening['records_excluded'] = excluded
        data.included['qualitative_synthesis'] = included
        
        return data
    
    @staticmethod
    def save_csv(results: List[ScreeningResult], filepath: Path):
        """Save results to CSV."""
        if not results:
            return
        
        fieldnames = ['record_id', 'title', 'authors', 'year', 'doi', 'pmid', 
                     'include', 'confidence', 'requires_review', 
                     'decision_rationale', 'exclusion_reasons', 'matched_criteria']
        
        with open(filepath, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            for r in results:
                row = r.to_dict()
                # Convert lists/dicts to strings for CSV
                row['decision_rationale'] = '|'.join(row['decision_rationale'])
                row['exclusion_reasons'] = '|'.join(row['exclusion_reasons'])
                row['matched_criteria'] = json.dumps(row['matched_criteria'])
                writer.writerow(row)
    
    @staticmethod
    def save_json(results: List[ScreeningResult], filepath: Path):
        """Save results to JSON."""
        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump([r.to_dict() for r in results], f, indent=2)
    
    @staticmethod
    def save_screening_log(results: List[ScreeningResult], prisma_data: PRISMAData, filepath: Path):
        """Save complete screening log with PRISMA data."""
        log = {
            'timestamp': datetime.now().isoformat(),
            'total_records': len(results),
            'included': sum(1 for r in results if r.include and not r.requires_review),
            'excluded': sum(1 for r in results if not r.include and not r.requires_review),
            'conflicts': sum(1 for r in results if r.requires_review),
            'prisma_data': prisma_data.to_dict(),
            'records': [r.to_dict() for r in results]
        }
        
        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(log, f, indent=2)


def parse_arguments():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
        description='Systematic Review Abstract Screener',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s --input references.csv --criteria criteria.yaml
  %(prog)s --input refs.xml --criteria criteria.yaml --output results/ --prisma
  %(prog)s --input refs.txt --criteria criteria.yaml --threshold 0.8 --conflict-only
        """
    )
    
    parser.add_argument('--input', '-i', required=True,
                       help='Input file (CSV, TSV, PubMed .txt, EndNote .xml)')
    parser.add_argument('--criteria', '-c', required=True,
                       help='Criteria YAML configuration file')
    parser.add_argument('--output', '-o', default='./output',
                       help='Output directory (default: ./output)')
    # Only CSV and JSON are offered: ReportGenerator implements no Excel writer
    parser.add_argument('--format', '-f', choices=['csv', 'json'], default='csv',
                       help='Output format (default: csv)')
    parser.add_argument('--threshold', '-t', type=float, default=0.75,
                       help='Confidence threshold (default: 0.75)')
    parser.add_argument('--prisma', '-p', action='store_true',
                       help='Generate PRISMA flow diagram data')
    parser.add_argument('--conflict-only', action='store_true',
                       help='Export only conflicts requiring review')
    parser.add_argument('--batch-size', '-b', type=int, default=100,
                       help='Processing batch size (default: 100)')
    
    return parser.parse_args()


def load_criteria(filepath: Path) -> Dict[str, Any]:
    """Load criteria from YAML file."""
    with open(filepath, 'r', encoding='utf-8') as f:
        return yaml.safe_load(f)


def detect_file_format(filepath: Path) -> str:
    """Detect input file format from extension and content."""
    suffix = filepath.suffix.lower()
    
    if suffix in ['.csv']:
        return 'csv'
    elif suffix in ['.tsv']:
        return 'tsv'
    elif suffix in ['.xml']:
        return 'xml'
    elif suffix in ['.txt']:
        # Check if PubMed format
        with open(filepath, 'r', encoding='utf-8') as f:
            sample = f.read(1024)
            if 'PMID-' in sample:
                return 'pubmed'
    
    # Default to CSV
    return 'csv'


def main():
    """Main entry point."""
    args = parse_arguments()
    
    input_path = Path(args.input)
    criteria_path = Path(args.criteria)
    output_dir = Path(args.output)
    
    # Validate inputs
    if not input_path.exists():
        print(f"Error: Input file not found: {input_path}", file=sys.stderr)
        sys.exit(1)
    
    if not criteria_path.exists():
        print(f"Error: Criteria file not found: {criteria_path}", file=sys.stderr)
        sys.exit(1)
    
    # Create output directory
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Load criteria
    print(f"Loading criteria from {criteria_path}...")
    criteria = load_criteria(criteria_path)
    criteria['confidence_threshold'] = args.threshold
    
    # Detect and parse input format
    file_format = detect_file_format(input_path)
    print(f"Detected format: {file_format}")
    
    print(f"Parsing references from {input_path}...")
    if file_format == 'pubmed':
        references = ReferenceParser.parse_pubmed(input_path)
    elif file_format == 'xml':
        references = ReferenceParser.parse_endnote(input_path)
    else:
        references = ReferenceParser.parse_csv(input_path)
    
    print(f"Loaded {len(references)} references")
    
    if not references:
        print("No references found. Exiting.", file=sys.stderr)
        sys.exit(1)
    
    # Initialize screener
    matcher = CriteriaMatcher(criteria)
    
    # Process references
    print("Screening references...")
    results = []
    for i in range(0, len(references), args.batch_size):
        batch = references[i:i + args.batch_size]
        for ref in batch:
            result = matcher.screen(ref)
            results.append(result)
        
        if (i // args.batch_size + 1) % 10 == 0:
            print(f"  Processed {min(i + args.batch_size, len(references))}/{len(references)}...")
    
    print(f"Screening complete. {len(results)} records processed.")
    
    # Filter if conflict-only
    if args.conflict_only:
        results = [r for r in results if r.requires_review]
        print(f"Filtered to {len(results)} conflicts requiring review")
    
    # Generate PRISMA data if requested
    prisma_data = None
    if args.prisma:
        prisma_data = ReportGenerator.generate_prisma_data(results)
        prisma_path = output_dir / 'prisma_data.json'
        with open(prisma_path, 'w', encoding='utf-8') as f:
            json.dump(prisma_data.to_dict(), f, indent=2)
        print(f"PRISMA data saved to {prisma_path}")
    
    # Split results
    included = [r for r in results if r.include and not r.requires_review]
    excluded = [r for r in results if not r.include and not r.requires_review]
    conflicts = [r for r in results if r.requires_review]
    
    # Save results
    print("\nSaving results...")
    
    if args.format == 'csv':
        if not args.conflict_only:
            ReportGenerator.save_csv(included, output_dir / 'screened_included.csv')
            ReportGenerator.save_csv(excluded, output_dir / 'screened_excluded.csv')
        ReportGenerator.save_csv(conflicts, output_dir / 'conflicts.csv')
    else:
        if not args.conflict_only:
            ReportGenerator.save_json(included, output_dir / 'screened_included.json')
            ReportGenerator.save_json(excluded, output_dir / 'screened_excluded.json')
        ReportGenerator.save_json(conflicts, output_dir / 'conflicts.json')
    
    # Save screening log
    if prisma_data is None:
        prisma_data = ReportGenerator.generate_prisma_data(results)
    ReportGenerator.save_screening_log(results, prisma_data, output_dir / 'screening_log.json')
    
    # Print summary
    print("\n" + "=" * 50)
    print("SCREENING SUMMARY")
    print("=" * 50)
    # Guard against ZeroDivisionError when no records remain (e.g. --conflict-only with no conflicts)
    denom = len(results) or 1
    print(f"Total records:      {len(results)}")
    print(f"Included:           {len(included)} ({100*len(included)/denom:.1f}%)")
    print(f"Excluded:           {len(excluded)} ({100*len(excluded)/denom:.1f}%)")
    print(f"Conflicts (review): {len(conflicts)} ({100*len(conflicts)/denom:.1f}%)")
    print("=" * 50)
    print(f"\nResults saved to: {output_dir}")
    
    if conflicts:
        print(f"\n⚠️  {len(conflicts)} records require manual review.")
        print(f"   Check: {output_dir}/conflicts.{args.format}")
    
    print("\n⚠️  IMPORTANT: This tool provides screening recommendations.")
    print("   Final inclusion decisions must be verified by human reviewers.")


if __name__ == '__main__':
    main()


Synthetic Bio Circuit Designer

Skill

Design gene circuits for synthetic biology applications

---
name: synthetic-bio-circuit-designer
description: Design gene circuits for synthetic biology applications
version: 1.0.0
category: Bioinfo
tags: []
author: AIPOCH
license: MIT
status: Draft
risk_level: Medium
skill_type: Tool/Script
owner: AIPOCH
reviewer: ''
last_updated: '2026-02-06'
---

# Synthetic Bio Circuit Designer

Generates text-based designs for two gene circuit types used in synthetic biology: a toggle switch and a repressilator oscillator.

## Usage

```bash
python scripts/main.py --type toggle --p1 P1 --p2 P2
python scripts/main.py --type oscillator
```
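
The `--type toggle` design describes a bistable circuit in which two repressors inhibit each other. The script itself only prints a text design, so the following is a minimal sketch of how that bistability could be checked numerically. It assumes a commonly used dimensionless mutual-repression model; the function name `simulate_toggle`, the parameter values, and the mapping of `u` and `v` to Repressor 1 and Repressor 2 are illustrative assumptions, not part of the skill.

```python
from typing import Tuple


def simulate_toggle(u0: float, v0: float, alpha: float = 10.0, n: float = 2.0,
                    dt: float = 0.01, steps: int = 5000) -> Tuple[float, float]:
    """Integrate a two-repressor mutual-inhibition model with forward Euler.

    u and v are dimensionless repressor levels (hypothetical mapping:
    u = Repressor 1, v = Repressor 2):
        du/dt = alpha / (1 + v**n) - u
        dv/dt = alpha / (1 + u**n) - v
    """
    u, v = u0, v0
    for _ in range(steps):
        du = alpha / (1.0 + v ** n) - u   # synthesis repressed by v, minus decay
        dv = alpha / (1.0 + u ** n) - v   # synthesis repressed by u, minus decay
        u, v = u + dt * du, v + dt * dv
    return u, v


if __name__ == "__main__":
    # Two different starting points should settle into opposite states.
    print("start u-high:", simulate_toggle(5.0, 0.1))
    print("start v-high:", simulate_toggle(0.1, 5.0))
```

With these illustrative parameters, whichever repressor starts higher stays high and holds the other one low, which is the behavior the two states in the printed design describe.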

## Circuit Types

- Toggle switch: Bistable genetic switch
- Oscillator: Repressilator circuit

## Output

- Circuit design
- Component list
- Expected behavior

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python script executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited

## Prerequisites

No additional Python packages required.

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

FILE:scripts/main.py
#!/usr/bin/env python3
"""
Synthetic Bio Circuit Designer
Design gene circuits for synthetic biology applications.
"""

import argparse


class CircuitDesigner:
    """Design synthetic biology circuits."""
    
    def design_toggle_switch(self, promoter1, promoter2):
        """Design a genetic toggle switch."""
        design = f"""
Toggle Switch Circuit Design:

Components:
- Promoter 1: {promoter1} (drives Repressor 2)
- Promoter 2: {promoter2} (drives Repressor 1)
- Repressor 1: inhibits {promoter1}
- Repressor 2: inhibits {promoter2}

Operation:
State A: Repressor 1 OFF, Repressor 2 ON → {promoter1} active
State B: Repressor 1 ON, Repressor 2 OFF → {promoter2} active

Inducers:
- Inducer 1: inactivates Repressor 1
- Inducer 2: inactivates Repressor 2
"""
        return design
    
    def design_oscillator(self):
        """Design a repressilator circuit."""
        design = """
Repressilator Circuit Design:

Ring topology with 3 genes:
Gene A represses Gene B
Gene B represses Gene C
Gene C represses Gene A

Expected behavior: Oscillating gene expression
Period: Depends on protein degradation rates
"""
        return design


def main():
    parser = argparse.ArgumentParser(description="Synthetic Bio Circuit Designer")
    parser.add_argument("--type", "-t", choices=["toggle", "oscillator"],
                       required=True, help="Circuit type")
    parser.add_argument("--p1", default="P1", help="Promoter 1")
    parser.add_argument("--p2", default="P2", help="Promoter 2")
    
    args = parser.parse_args()
    
    designer = CircuitDesigner()
    
    if args.type == "toggle":
        design = designer.design_toggle_switch(args.p1, args.p2)
    else:
        design = designer.design_oscillator()
    
    print(design)


if __name__ == "__main__":
    main()


Symptom Checker Triage

Skill


---
name: symptom-checker-triage
description: Suggest triage levels (Emergency, Urgent, Outpatient) based on red flag symptoms using a rule-based engine. For AI-assisted decision support only — not a substitute for professional medical diagnosis.
license: MIT
skill-author: AIPOCH
---

# Symptom Checker Triage

Analyzes symptom descriptions and suggests triage levels (Emergency / Urgent / Outpatient) based on red flag identification. Provides rationale and recommended next steps. For AI-assisted decision support only.

## Quick Check

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
python scripts/main.py "Chest pain, difficulty breathing, lasting 30 minutes"
python scripts/main.py "Headache, fever 38.5 degrees, vomiting" --verbose
```

## When to Use

- Triage a patient symptom description to Emergency, Urgent, or Outpatient level
- Identify red flag symptoms in a clinical or research context
- Generate structured triage output for documentation or downstream processing

## Workflow

1. Confirm the symptom description is provided as natural language text.
2. Validate that the request is a symptom triage task; stop early if not.
3. Run `scripts/main.py` with the symptom string or use `--interactive` mode.
4. Return a structured result with triage level, red flags, rationale, and disclaimer.
5. If execution fails or inputs are incomplete, switch to the Fallback Template below.

## Fallback Template

If `scripts/main.py` fails or required fields are missing, respond with:

```
FALLBACK REPORT
───────────────────────────────────────
Objective        : <triage goal>
Inputs Available : <symptom description provided>
Missing Inputs   : <list exactly what is missing>
Partial Result   : <any triage assessment that can be made safely>
Blocked Steps    : <what could not be completed and why>
Disclaimer       : This is AI-assisted advice only. Seek professional medical care.
Next Steps       : <minimum info needed to complete>
───────────────────────────────────────
```

## Stress-Case Output Checklist

For complex multi-constraint requests, always include these sections explicitly:

- **Assumptions**: symptom keywords matched, confidence threshold applied
- **Constraints**: rule-based engine only; no differential diagnosis
- **Risks**: false positives and false negatives are possible; always defer to clinician
- **Unresolved Items**: ambiguous symptoms requiring clarification

## CLI Usage

```bash
# Direct symptom input
python scripts/main.py "Chest pain, radiating to left arm, sweating"

# Interactive mode
python scripts/main.py --interactive

# Verbose output
python scripts/main.py "Headache, fever" --verbose

# JSON output
python scripts/main.py "Abdominal pain, right lower quadrant tenderness" --json
```

## Output Format

```json
{
  "triage_level": "emergency|urgent|outpatient",
  "confidence": 0.85,
  "red_flags": ["Chest pain", "Difficulty breathing"],
  "reason": "Chest pain with difficulty breathing may indicate myocardial infarction or pulmonary embolism",
  "recommendation": "Go to emergency department immediately",
  "department": "Emergency/Cardiology",
  "warning": "This is AI-assisted advice and cannot replace professional medical diagnosis"
}
```
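
A downstream consumer may want to confirm that a result has the documented shape before using it. The following is a minimal sketch of such a check, assuming the field names shown in the JSON above; the helper `check_triage_payload` is hypothetical and is not part of `scripts/main.py`.

```python
from typing import Any, Dict, List

# Field names taken from the documented output format.
REQUIRED_KEYS = ("triage_level", "confidence", "red_flags", "reason",
                 "recommendation", "department", "warning")
VALID_LEVELS = ("emergency", "urgent", "outpatient")


def check_triage_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the payload looks well-formed."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in payload]
    if payload.get("triage_level") not in VALID_LEVELS:
        problems.append("triage_level must be one of " + ", ".join(VALID_LEVELS))
    confidence = payload.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be a number between 0 and 1")
    if not isinstance(payload.get("red_flags"), list):
        problems.append("red_flags must be a list")
    return problems


if __name__ == "__main__":
    sample = {
        "triage_level": "urgent",
        "confidence": 0.7,
        "red_flags": ["Palpitations"],
        "reason": "Palpitations may indicate cardiac arrhythmia",
        "recommendation": "Seek care within 2-4 hours",
        "department": "Cardiology",
        "warning": "This is AI-assisted advice and cannot replace professional medical diagnosis",
    }
    print(check_triage_payload(sample))
```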

## Triage Levels

| Level | Description | Action |
|---|---|---|
| emergency | Life-threatening red flags present | Call emergency services or go to ED immediately |
| urgent | Serious but not immediately life-threatening | Seek care within 2–4 hours |
| outpatient | Non-urgent | Schedule outpatient appointment |

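When a description matches several red flags at different levels, the table implies that the most severe level should win. A minimal sketch of that selection follows, assuming each match is a `(name, level, weight)` tuple; the `Level` enum, the `most_severe` helper, and the example weights are hypothetical and only illustrate ranking by severity first and weight second.

```python
from enum import Enum
from typing import List, Tuple


class Level(Enum):
    # Values encode severity so that levels can be compared numerically.
    OUTPATIENT = 0
    URGENT = 1
    EMERGENCY = 2


Match = Tuple[str, Level, float]


def most_severe(matches: List[Match]) -> Match:
    """Pick the match with the highest severity; weight only breaks ties."""
    return max(matches, key=lambda m: (m[1].value, m[2]))


if __name__ == "__main__":
    found = [
        ("Febrile Seizure", Level.URGENT, 0.85),
        ("Chest Tightness", Level.EMERGENCY, 0.80),
    ]
    # The emergency flag is chosen even though its weight is lower.
    print(most_severe(found))
```

Ranking on weight alone would pick the urgent flag in this example, which is why the sketch compares severity before weight.
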
## Red Flag Categories

→ Full red flags reference: [references/red_flags.md](references/red_flags.md)

Key categories: Cardiovascular, Respiratory, Neurological, Gastrointestinal, Trauma/Poisoning, Obstetric.

## Disclaimer

> **Important**: This tool provides AI-assisted triage suggestions only. It **cannot replace professional medical diagnosis**. If in doubt, seek medical care immediately. Call emergency services in life-threatening situations.

## Input Validation

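This skill accepts: natural language symptom descriptions in English for triage level suggestion. The keyword tables in `scripts/main.py` are English-only, so descriptions in other languages will not match any red flags.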

If the request does not involve symptom triage — for example, asking to diagnose a specific disease, prescribe medication, interpret lab results, or perform general medical Q&A — do not proceed. Instead respond:

> "`symptom-checker-triage` is designed to suggest triage levels based on symptom red flags. Your request appears to be outside this scope. Please provide a symptom description, or use a more appropriate tool. This tool does not provide diagnoses or treatment recommendations."

## Error Handling

- If no symptom description is provided, request it explicitly.
- If the task goes outside documented scope (diagnosis, prescription), stop immediately.
- If `scripts/main.py` fails, use the Fallback Template above.
- Do not fabricate triage levels, red flag matches, or medical advice.

## Output Requirements

Every final response must include:

1. **Objective** — what symptom set was triaged
2. **Inputs Received** — symptom description used
3. **Assumptions** — keyword matching applied, confidence threshold
4. **Triage Result** — level, red flags identified, rationale
5. **Risks and Limits** — AI-only, not a diagnosis, false positive/negative risk
6. **Next Checks** — always recommend professional medical evaluation

## Dependencies

- Python 3.8+
- No third-party dependencies (rule-based engine, standard library only)

FILE:requirements.txt
# No third-party packages required: dataclasses and enum ship with the Python 3.8+ standard library

FILE:scripts/main.py
#!/usr/bin/env python3
"""
Symptom Checker Triage (ID: 165)
Recommends a triage level (emergency, urgent, or outpatient) based on common symptom red flags
"""

import sys
import json
import argparse
import re
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass, asdict
from enum import Enum


class TriageLevel(Enum):
    EMERGENCY = "emergency"      # Life-threatening
    URGENT = "urgent"            # Urgent but not immediately life-threatening
    OUTPATIENT = "outpatient"    # Non-urgent


@dataclass
class TriageResult:
    triage_level: str
    confidence: float
    red_flags: List[str]
    reason: str
    recommendation: str
    department: str
    warning: str = "This is an AI-assisted recommendation and cannot replace professional medical diagnosis"


# Red flag definitions: symptom keywords -> (red flag name, triage level, weight, department suggestion, reason)
RED_FLAGS = {
    # Cardiovascular system - urgent
    "chest pain": ("Chest Pain", TriageLevel.EMERGENCY, 0.95, "Emergency/Cardiology", "Chest pain may indicate myocardial infarction, aortic dissection, or pulmonary embolism"),
    "chest tightness": ("Chest Tightness", TriageLevel.EMERGENCY, 0.85, "Emergency/Cardiology", "Chest tightness may be a presentation of angina or myocardial infarction"),
    "palpitations": ("Palpitations", TriageLevel.URGENT, 0.70, "Cardiology", "Palpitations may indicate cardiac arrhythmia"),
    "syncope": ("Syncope", TriageLevel.EMERGENCY, 0.90, "Emergency/Neurology", "Syncope may suggest serious arrhythmia or cerebrovascular accident"),
    "coma": ("Coma", TriageLevel.EMERGENCY, 1.0, "Emergency", "Coma is a life-threatening emergency"),

    # Respiratory system - urgent
    "dyspnea": ("Dyspnea", TriageLevel.EMERGENCY, 0.95, "Emergency/Pulmonology", "Severe dyspnea indicates risk of respiratory failure"),
    "rapid breathing": ("Rapid Breathing", TriageLevel.URGENT, 0.75, "Emergency", "Rapid breathing may indicate pulmonary or cardiac disease"),
    "hemoptysis": ("Hemoptysis", TriageLevel.EMERGENCY, 0.90, "Emergency/Pulmonology", "Hemoptysis may indicate serious pulmonary disease"),
    "asphyxia": ("Asphyxia", TriageLevel.EMERGENCY, 1.0, "Emergency", "Asphyxia is a life-threatening emergency"),

    # Nervous system - urgent
    "severe headache": ("Severe Headache", TriageLevel.EMERGENCY, 0.85, "Emergency/Neurology", "Sudden severe headache may indicate subarachnoid hemorrhage"),
    "worst headache of life": ("Worst Headache of Life", TriageLevel.EMERGENCY, 0.95, "Emergency/Neurology", "Suggests possible subarachnoid hemorrhage"),
    "slurred speech": ("Slurred Speech", TriageLevel.EMERGENCY, 0.90, "Emergency/Neurology", "May be a sign of stroke"),
    "hemiplegia": ("Hemiplegia", TriageLevel.EMERGENCY, 0.95, "Emergency/Neurology", "Classic symptom of stroke"),
    "hemiparesis": ("Hemiparesis", TriageLevel.EMERGENCY, 0.95, "Emergency/Neurology", "Classic symptom of stroke"),
    "convulsions": ("Convulsions", TriageLevel.URGENT, 0.80, "Emergency/Neurology", "May indicate epileptic seizure"),
    "epilepsy": ("Epilepsy", TriageLevel.URGENT, 0.80, "Neurology", "Epileptic seizures require prompt medical attention"),

    # Digestive system - urgent
    "hematemesis": ("Hematemesis", TriageLevel.EMERGENCY, 0.95, "Emergency/Gastroenterology", "Upper gastrointestinal major bleeding"),
    "melena": ("Melena", TriageLevel.URGENT, 0.80, "Gastroenterology", "May indicate upper gastrointestinal bleeding"),
    "hematochezia": ("Hematochezia", TriageLevel.URGENT, 0.75, "Gastroenterology/Colorectal Surgery", "Gastrointestinal bleeding"),
    "severe abdominal pain": ("Severe Abdominal Pain", TriageLevel.EMERGENCY, 0.85, "Emergency", "May be appendicitis, intestinal perforation, or other acute abdomen"),
    "board-like abdomen": ("Board-like Abdomen", TriageLevel.EMERGENCY, 0.95, "Emergency/Surgery", "Indicates peritonitis"),
    "absent bowel sounds": ("Absent Bowel Sounds", TriageLevel.URGENT, 0.75, "Surgery", "May indicate intestinal obstruction"),

    # Other systems - urgent
    "severe trauma": ("Severe Trauma", TriageLevel.EMERGENCY, 0.95, "Emergency/Surgery", "Severe trauma requires urgent management"),
    "major hemorrhage": ("Major Hemorrhage", TriageLevel.EMERGENCY, 1.0, "Emergency", "Risk of hemorrhagic shock"),
    "drug overdose": ("Drug Overdose", TriageLevel.EMERGENCY, 0.90, "Emergency", "Poisoning requires urgent gastric lavage or antidote"),
    "poisoning": ("Poisoning", TriageLevel.EMERGENCY, 0.95, "Emergency", "Poisoning requires urgent management"),

    # Pediatric/obstetric related
    "decreased fetal movement": ("Decreased Fetal Movement", TriageLevel.URGENT, 0.80, "Obstetrics/Gynecology", "May indicate fetal distress"),
    "vaginal bleeding": ("Vaginal Bleeding", TriageLevel.URGENT, 0.80, "Obstetrics/Gynecology", "Bleeding during pregnancy warrants concern for miscarriage or placental abruption"),
    "febrile seizure": ("Febrile Seizure", TriageLevel.URGENT, 0.85, "Pediatric Emergency", "Febrile seizures in children require urgent management"),
}

# Symptom synonym expansion
SYNONYMS = {
    "chest pain": ["chest ache", "precordial pain", "chest discomfort", "thoracic pain"],
    "dyspnea": ["shortness of breath", "breathlessness", "difficulty breathing", "can't breathe", "suffocation feeling"],
    # "worst headache of life" and "hemiparesis" have their own RED_FLAGS entries,
    # so they are not listed as synonyms (expansion would hide those entries)
    "severe headache": ["splitting headache", "explosive headache"],
    "hemiplegia": ["one-sided limb weakness", "arm and leg weakness"],
    "slurred speech": ["unclear speech", "difficulty speaking", "expressive difficulty"],
    "coma": ["loss of consciousness", "unconscious", "unresponsive"],
    "hematemesis": ["vomiting blood", "vomiting fresh blood"],
    "melena": ["tarry stool", "black stool"],
    "severe abdominal pain": ["severe stomach pain", "severe belly pain", "writhing in pain"],
    "major hemorrhage": ["uncontrolled bleeding", "massive bleeding"],
}

# Outpatient-level symptom definitions
OUTPATIENT_SYMPTOMS = {
    "mild headache": ("Mild Headache", TriageLevel.OUTPATIENT, 0.60, "Neurology", "Common symptom, recommend outpatient evaluation"),
    "low-grade fever": ("Low-grade Fever", TriageLevel.OUTPATIENT, 0.50, "Internal Medicine", "Temperature <38.5°C can be seen in outpatient setting"),
    "cough": ("Cough", TriageLevel.OUTPATIENT, 0.40, "Pulmonology", "Cough without dyspnea can be managed in outpatient setting"),
    "runny nose": ("Rhinorrhea", TriageLevel.OUTPATIENT, 0.30, "ENT", "Common symptom of upper respiratory tract infection"),
    "sore throat": ("Sore Throat", TriageLevel.OUTPATIENT, 0.40, "ENT", "Common in upper respiratory tract infection"),
    "mild abdominal pain": ("Mild Abdominal Pain", TriageLevel.OUTPATIENT, 0.50, "Gastroenterology", "Mild abdominal pain without red flags"),
    "diarrhea": ("Diarrhea", TriageLevel.OUTPATIENT, 0.50, "Gastroenterology", "Diarrhea without bloody stool or dehydration"),
    "rash": ("Rash", TriageLevel.OUTPATIENT, 0.40, "Dermatology", "Rash without fever"),
    "joint pain": ("Joint Pain", TriageLevel.OUTPATIENT, 0.40, "Rheumatology", "Chronic joint pain"),
    "insomnia": ("Insomnia", TriageLevel.OUTPATIENT, 0.30, "Neurology/Psychology", "Sleep disorder"),
    "fatigue": ("Fatigue", TriageLevel.OUTPATIENT, 0.40, "Internal Medicine", "Rule out anemia, hypothyroidism, etc."),
}


def expand_synonyms(text: str) -> str:
    """Expand synonyms, replacing all synonyms with standard symptom names"""
    expanded = text
    for standard, synonyms in SYNONYMS.items():
        for syn in synonyms:
            if syn in expanded:
                expanded = expanded.replace(syn, standard)
    return expanded


def extract_red_flags(text: str) -> List[Tuple[str, TriageLevel, float, str, str]]:
    """Extract red flags from symptom description"""
    # Keywords and synonyms are lowercase, so normalize case before matching
    expanded_text = expand_synonyms(text.lower())
    found_flags = []

    for keyword, (name, level, weight, dept, reason) in RED_FLAGS.items():
        if keyword in expanded_text:
            found_flags.append((name, level, weight, dept, reason))

    # Deduplicate and sort by weight
    seen = set()
    unique_flags = []
    for flag in sorted(found_flags, key=lambda x: x[2], reverse=True):
        if flag[0] not in seen:
            seen.add(flag[0])
            unique_flags.append(flag)

    return unique_flags


def calculate_temperature(text: str) -> Optional[float]:
    """Extract body temperature from text"""
    # Match various temperature formats
    patterns = [
        r'(\d+\.?\d*)\s*degrees?',
        r'(\d+\.?\d*)\s*°C',
        r'(\d+\.?\d*)\s*°',
        r'temperature\s*(\d+\.?\d*)',
        r'fever\s*(\d+\.?\d*)',
        r'temp\s*(\d+\.?\d*)',
    ]
    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            try:
                temp = float(match.group(1))
                # Determine if Celsius or Fahrenheit
                if temp > 50:  # Convert Fahrenheit to Celsius
                    temp = (temp - 32) * 5 / 9
                return temp
            except ValueError:
                continue
    return None


def extract_outpatient_symptoms(text: str) -> List[Tuple[str, TriageLevel, float, str, str]]:
    """Extract outpatient-level symptoms from symptom description"""
    found = []
    text_lower = text.lower()
    
    # Check temperature first
    temperature = calculate_temperature(text)
    
    for keyword, (name, level, weight, dept, reason) in OUTPATIENT_SYMPTOMS.items():
        if keyword in text_lower:
            found.append((name, level, weight, dept, reason))
    
    # If no specific symptoms found but temperature is present, assess based on temperature
    if not found and temperature:
        if temperature < 38:
            found.append(("Low-grade Fever", TriageLevel.OUTPATIENT, 0.50, "Internal Medicine", f"Temperature {temperature:.1f}°C, recommend outpatient visit"))
        elif temperature < 39:
            found.append(("Moderate Fever", TriageLevel.OUTPATIENT, 0.60, "Fever Clinic", f"Temperature {temperature:.1f}°C, can be managed in outpatient setting"))
    
    # Recognize common symptom keywords
    if not found:
        common_keywords = {
            "headache": ("Headache", TriageLevel.OUTPATIENT, 0.60, "Neurology", "Headache without red flags can be evaluated in outpatient setting"),
            "dizziness": ("Dizziness", TriageLevel.OUTPATIENT, 0.50, "Neurology/ENT", "Dizziness requires evaluation"),
            "nausea": ("Nausea", TriageLevel.OUTPATIENT, 0.50, "Gastroenterology", "Nausea without vomiting can be managed in outpatient setting"),
            "vomiting": ("Vomiting", TriageLevel.OUTPATIENT, 0.60, "Gastroenterology", "Non-severe vomiting can be managed in outpatient setting"),
            "cold": ("Cold Symptoms", TriageLevel.OUTPATIENT, 0.50, "Internal Medicine", "Common cold symptoms"),
            "fever": ("Fever", TriageLevel.OUTPATIENT, 0.60, "Fever Clinic", "Fever requires further evaluation"),
            "high temperature": ("Fever", TriageLevel.OUTPATIENT, 0.60, "Fever Clinic", "Fever requires further evaluation"),
        }
        for keyword, symptom_info in common_keywords.items():
            if keyword in text_lower:
                found.append(symptom_info)
    
    return found


def triage(symptom_text: str) -> TriageResult:
    """
    Main triage function

    Args:
        symptom_text: Symptom description text

    Returns:
        TriageResult: Triage result
    """
    if not symptom_text or not symptom_text.strip():
        return TriageResult(
            triage_level=TriageLevel.OUTPATIENT.value,
            confidence=0.0,
            red_flags=[],
            reason="No symptom description provided",
            recommendation="Please provide a symptom description for triage",
            department="Unknown"
        )

    # Extract red flags
    red_flags = extract_red_flags(symptom_text)

    # Extract temperature
    temperature = calculate_temperature(symptom_text)

    # Check high fever
    if temperature and temperature >= 40:
        red_flags.append(("High Fever (>=40°C)", TriageLevel.EMERGENCY, 0.90, "Emergency", "Temperature of 40°C or above requires urgent management"))
    elif temperature and temperature >= 39:
        red_flags.append(("High Fever (>=39°C)", TriageLevel.URGENT, 0.70, "Fever Clinic", "High fever requires prompt medical attention"))

    # If red flags are present
    if red_flags:
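        # Take the most severe level; confidence only breaks ties within a level
        severity_rank = {TriageLevel.OUTPATIENT: 0, TriageLevel.URGENT: 1, TriageLevel.EMERGENCY: 2}
        max_level = max(red_flags, key=lambda x: (severity_rank.get(x[1], 0), x[2]))
        level = max_level[1]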
        confidence = min(0.95, max(flag[2] for flag in red_flags))
        flag_names = [flag[0] for flag in red_flags]
        departments = list(dict.fromkeys(flag[3] for flag in red_flags))  # Deduplicate while preserving order
    
        if level == TriageLevel.EMERGENCY:
            recommendation = "Proceed to emergency immediately, call 911 if necessary"
        elif level == TriageLevel.URGENT:
            recommendation = "Seek medical attention within 2-4 hours, emergency or fever clinic"
        else:
            recommendation = "Schedule an outpatient appointment as soon as possible"

        # Combine reasons
        reasons = [f"{flag[0]}: {flag[4]}" for flag in red_flags[:3]]
        reason = "; ".join(reasons)

        return TriageResult(
            triage_level=level.value,
            confidence=confidence,
            red_flags=flag_names,
            reason=reason,
            recommendation=recommendation,
            department="/".join(departments[:2])
        )

    # Check outpatient-level symptoms
    outpatient = extract_outpatient_symptoms(symptom_text)
    if outpatient:
        max_symptom = max(outpatient, key=lambda x: x[2])
        return TriageResult(
            triage_level=TriageLevel.OUTPATIENT.value,
            confidence=max_symptom[2],
            red_flags=[],
            reason=f"Symptom '{max_symptom[0]}' has no red flags, suitable for outpatient management",
            recommendation="Schedule an outpatient appointment; seek urgent care if symptoms worsen",
            department=max_symptom[3]
        )

    # Unrecognized symptoms
    return TriageResult(
        triage_level=TriageLevel.OUTPATIENT.value,
        confidence=0.30,
        red_flags=[],
        reason="Unrecognized symptoms, recommend medical evaluation",
        recommendation="Schedule a general internal medicine outpatient appointment for initial assessment",
        department="Internal Medicine"
    )


def interactive_mode():
    """Interactive mode"""
    print("=" * 50)
    print("Symptom Checker Triage")
    print("=" * 50)
    print("Enter symptom description (e.g. 'chest pain, dyspnea'), type 'quit' to exit")
    print("-" * 50)

    while True:
        try:
            user_input = input("\nSymptom description: ").strip()
            if user_input.lower() in ['quit', 'exit', 'q']:
                print("Thank you for using Symptom Checker Triage. Goodbye!")
                break

            if not user_input:
                continue

            result = triage(user_input)
            print_result(result)

        except KeyboardInterrupt:
            print("\nThank you for using Symptom Checker Triage. Goodbye!")
            break
        except EOFError:
            break


def supports_color() -> bool:
    """Detect whether the terminal supports color"""
    import os
    # Treat an unset TERM like an empty one so color is only used on known terminals
    return sys.stdout.isatty() and os.environ.get('TERM', '') not in ('dumb', '')


def print_result(result: TriageResult, verbose: bool = False):
    """Print triage result"""
    use_color = supports_color()
    level_colors = {
        "emergency": "\033[91m" if use_color else "",  # Red
        "urgent": "\033[93m" if use_color else "",     # Yellow
        "outpatient": "\033[92m" if use_color else "", # Green
    }
    reset = "\033[0m" if use_color else ""

    level_display = {
        "emergency": "🔴 Emergency",
        "urgent": "🟠 Urgent",
        "outpatient": "🟢 Outpatient",
    }

    level = result.triage_level
    color = level_colors.get(level, "")

    print("\n" + "=" * 50)
    print(f"Triage Level: {color}{level_display.get(level, level)}{reset}")
    print(f"Confidence: {result.confidence:.0%}")

    if result.red_flags:
        print(f"\n🚨 Identified Red Flags:")
        for flag in result.red_flags:
            print(f"   - {flag}")

    print(f"\n📋 Triage Reason:")
    print(f"   {result.reason}")

    print(f"\n💡 Recommendation:")
    print(f"   {result.recommendation}")

    print(f"\n🏥 Suggested Department: {result.department}")

    if verbose:
        print(f"\n⚠️  Disclaimer: {result.warning}")

    print("=" * 50)


def main():
    parser = argparse.ArgumentParser(
        description="Symptom Checker Triage - recommends triage level based on red flags",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python main.py "chest pain, radiating to left arm, sweating"
  python main.py --interactive
  python main.py "headache, fever 38 degrees" --verbose
        """
    )
    parser.add_argument(
        "symptoms",
        nargs="?",
        help="Symptom description (e.g. 'chest pain, dyspnea')"
    )
    parser.add_argument(
        "-i", "--interactive",
        action="store_true",
        help="Enter interactive mode"
    )
    parser.add_argument(
        "-v", "--verbose",
        action="store_true",
        help="Show detailed information"
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="Output in JSON format"
    )

    args = parser.parse_args()

    if args.interactive:
        interactive_mode()
        return

    if not args.symptoms:
        parser.print_help()
        print("\nError: Please provide symptom description or use --interactive for interactive mode")
        sys.exit(1)

    result = triage(args.symptoms)

    if args.json:
        print(json.dumps(asdict(result), ensure_ascii=False, indent=2))
    else:
        print_result(result, verbose=args.verbose)


if __name__ == "__main__":
    main()

Survival Curve Risk Table

Skill

Analyze data with `survival-curve-risk-table` using a reproducible workflow, explicit validation, and structured outputs for review-ready interpretation.

---
name: survival-curve-risk-table
description: Analyze data with `survival-curve-risk-table` using a reproducible workflow, explicit validation, and structured outputs for review-ready interpretation.
license: MIT
skill-author: AIPOCH
---
# Survival Curve Risk Table Generator

## When to Use

- Use this skill when the task needs a "Number at risk" table automatically aligned and added below a Kaplan-Meier survival curve.
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

- Scope-focused workflow aligned to: Analyze data with `survival-curve-risk-table` using a reproducible workflow, explicit validation, and structured outputs for review-ready interpretation.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

See `## Prerequisites` below for related details.

- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `lifelines`: `unspecified`. Declared in `requirements.txt`.
- `matplotlib`: `unspecified`. Declared in `requirements.txt`.
- `numpy`: `unspecified`. Declared in `requirements.txt`.
- `pandas`: `unspecified`. Declared in `requirements.txt`.
- `pil`: `unspecified`. Declared in `requirements.txt`; the `PIL` module imported by the script is provided by `pillow`.
- `pillow`: `unspecified`. Declared in `requirements.txt`.
- `seaborn`: `unspecified`. Declared in `requirements.txt`.

## Example Usage

See `## Usage` below for related details.

```bash
cd "20260318/scientific-skills/Data Analytics/survival-curve-risk-table"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
python scripts/main.py --create-sample-data sample_data.csv
python scripts/main.py --input sample_data.csv --time-col time --event-col event --group-col treatment --output risk_table.png
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Function Overview

Adds "Number at risk" tables that meet clinical oncology journal standards to Kaplan-Meier survival curves, aligning time points automatically and generating publication-quality combined figures.

## Usage Trigger Conditions

- Need to add number at risk tables to KM survival curves
- Generate survival plots that meet journal requirements such as NEJM, Lancet, JCO
- Risk tables needed in clinical trial reports
- Medical paper chart standardization before submission
- Precise alignment of risk tables with survival curve time axis

## Core Functions

### 1. Automatic Number at Risk Calculation
- Automatically calculate number at risk at each time point from survival data
- Support right-censored data processing
- Count remaining observed subjects by group
- Automatic handling of censoring events

### 2. Journal Standard Formats
- **NEJM Standard**: Clean time axis, groups arranged horizontally
- **Lancet Standard**: Complete statistical information, vertical alignment
- **JCO Standard**: Censoring symbols marked, group comparison
- Support custom journal templates

### 3. Precise Alignment
- Time axis precisely aligned with curve X-axis
- Automatic adjustment of table spacing and font size
- Responsive layout adapts to different image sizes
- Support horizontal/vertical layouts
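
The alignment idea can be illustrated with a small, library-free sketch: each risk-table column is placed at the same fraction of the axis width as its time point on the curve. The function name `column_positions` and the axis limits are hypothetical and are not taken from `scripts/main.py`.

```python
from typing import List


def column_positions(time_points: List[float], x_min: float, x_max: float) -> List[float]:
    """Map time points to fractional x positions (0.0 = left edge, 1.0 = right edge).

    Placing a table column at the same fraction of the axis width as its tick
    on the Kaplan-Meier plot keeps each count directly under its tick.
    """
    if x_max <= x_min:
        raise ValueError("x_max must be greater than x_min")
    span = x_max - x_min
    return [(t - x_min) / span for t in time_points]


# Hypothetical x-axis running from 0 to 36 months
positions = column_positions([0, 12, 24, 36], x_min=0, x_max=36)
print([round(p, 3) for p in positions])  # [0.0, 0.333, 0.667, 1.0]
```

Reusing the curve's own axis limits for the table is what keeps the two panels in step when the figure size changes.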

### 4. Output Formats
- High-quality PNG/JPEG images
- PDF vector graphics
- SVG editable format
- PowerPoint embeddable format

## Usage

### Example 1: Basic Risk Table Generation

```bash
python scripts/main.py \
    --input survival_data.csv \
    --time-col time \
    --event-col event \
    --group-col treatment \
    --output risk_table.png
```

### Example 2: Specify Journal Style

```bash
python scripts/main.py \
    --input survival_data.csv \
    --time-col time \
    --event-col status \
    --group-col arm \
    --style NEJM \
    --time-points 0,6,12,18,24,30,36 \
    --output figure_1a.pdf
```

### Example 3: Combined Figure Generation (Curve + Risk Table)

```bash
python scripts/main.py \
    --input survival_data.csv \
    --time-col months \
    --event-col death \
    --group-col group \
    --km-plot km_curve.png \
    --combine \
    --output combined_figure.png
```

### Example 4: Batch Generate Multi-Timepoint Tables

```bash
python scripts/main.py \
    --input survival_data.csv \
    --time-col time \
    --event-col event \
    --group-col treatment \
    --time-points 0,12,24,36,48,60 \
    --format both \
    --output-dir ./output/
```

### Example 5: Using Existing Survival Data (Python API)

```python
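import pandas as pd

from scripts.main import RiskTableGenerator

# Load the survival data into a DataFrame (the file name is illustrative)
survival_df = pd.read_csv("survival_data.csv")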

# Initialize generator
generator = RiskTableGenerator(
    style="JCO",
    time_points=[0, 6, 12, 18, 24, 30],
    figure_size=(8, 6)
)

# Load survival data
generator.load_data(
    df=survival_df,
    time_col="time",
    event_col="event",
    group_col="treatment_arm"
)

# Generate risk table
generator.generate_risk_table(
    output_path="risk_table.png",
    show_censored=True
)

# Generate combined figure (KM curve + risk table)
generator.generate_combined_plot(
    km_plot_path="km_curve.png",
    output_path="combined_figure.pdf"
)
```

## Input Data Format

### CSV Format Example

```csv
time,event,treatment_arm
0,0,Experimental
3.2,1,Experimental
5.1,0,Experimental
12.3,1,Control
18.7,0,Control
24.0,1,Experimental
...
```

### Required Columns

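| Column Name | Description | Type / Values |
|------|------|------|
| time | Follow-up time (months) | Numeric |
| event | Event occurrence flag | Numeric: 0 = censored, 1 = event |
| group | Treatment group (optional) | Text/Categorical |

The column names are examples; pass the actual names with `--time-col`, `--event-col`, and `--group-col`.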
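
As a minimal sketch of checking a file header against these columns before loading, the snippet below uses only the standard library. The helper name `check_columns` and the in-memory sample are hypothetical; the packaged script performs its own validation in `load_data`.

```python
import csv
import io
from typing import List, Optional


def check_columns(header: List[str], time_col: str, event_col: str,
                  group_col: Optional[str] = None) -> List[str]:
    """Return the names of required columns that are missing from the header."""
    required = [time_col, event_col]
    if group_col is not None:
        required.append(group_col)
    return [name for name in required if name not in header]


# Tiny in-memory CSV standing in for a real survival data file
sample = io.StringIO("time,event,treatment_arm\n3.2,1,Experimental\n12.3,1,Control\n")
header = next(csv.reader(sample))

print(check_columns(header, "time", "event", "treatment_arm"))  # []
print(check_columns(header, "time", "status"))                  # ['status']
```

Reporting every missing column at once lets the caller correct the column options in one pass.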

### Supported Data Formats

- CSV (.csv)
- Excel (.xlsx, .xls)
- SAS (.sas7bdat)
- Python pickle (.pkl, .pickle)

## Journal Style Configuration

### NEJM Style

```json
{
  "style": "NEJM",
  "font_family": "Helvetica",
  "font_size": 8,
  "time_points": [0, 6, 12, 18, 24, 30, 36],
  "table_height": 0.15,
  "show_grid": false,
  "separator_lines": true
}
```

### Lancet Style

```json
{
  "style": "Lancet",
  "font_family": "Times New Roman",
  "font_size": 9,
  "time_points": [0, 12, 24, 36, 48, 60],
  "table_height": 0.18,
  "show_grid": true,
  "header_bold": true
}
```

### JCO Style

```json
{
  "style": "JCO",
  "font_family": "Arial",
  "font_size": 8,
  "time_points": [0, 6, 12, 18, 24, 30],
  "table_height": 0.16,
  "show_censored": true,
  "censor_symbol": "+"
}
```
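
A configuration like the ones above can be layered over a base style with a plain dictionary merge. The sketch below is illustrative only: `BASE_STYLE` and `merge_style` are hypothetical names, and whether the script reads such JSON files directly is an assumption.

```python
import json

# Hypothetical base style; the keys are a small illustrative subset
BASE_STYLE = {"font_family": "Arial", "font_size": 8, "show_grid": False}


def merge_style(base: dict, overrides_json: str) -> dict:
    """Return a copy of `base` with the keys of a JSON object applied on top."""
    overrides = json.loads(overrides_json)
    if not isinstance(overrides, dict):
        raise ValueError("style overrides must be a JSON object")
    merged = dict(base)       # copy so the base style is never mutated
    merged.update(overrides)  # override keys win over base keys
    return merged


custom = merge_style(BASE_STYLE, '{"font_size": 9, "show_grid": true}')
print(custom)  # {'font_family': 'Arial', 'font_size': 9, 'show_grid': True}
```

Copying before updating keeps the predefined styles reusable across several figures in one run.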

## Command Line Parameters

### Required Parameters

| Parameter | Description | Example |
|------|------|------|
| `--input` | Input data file path | `data.csv` |
| `--time-col` | Time column name | `time` |
| `--event-col` | Event column name | `event` |

### Optional Parameters

| Parameter | Description | Default Value |
|------|------|--------|
| `--group-col` | Group column name | `None` |
| `--output` | Output file path | `risk_table.png` |
| `--style` | Journal style | `NEJM` |
| `--time-points` | Time point list | Auto-calculated |
| `--format` | Output format | `png` |
| `--width` | Image width | `8` (inches) |
| `--height` | Image height | `6` (inches) |
| `--dpi` | Image resolution | `300` |
| `--font-size` | Font size | `8` |
| `--show-censored` | Show censored count | `False` |
| `--combine` | Combine with KM curve | `False` |
| `--km-plot` | KM curve image path | `None` |
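
To show how a comma-separated option such as `--time-points` might be parsed, here is a minimal `argparse` sketch covering a subset of the parameters above. The helper `parse_time_points` is hypothetical and the real parser in `scripts/main.py` may differ.

```python
import argparse
from typing import List


def parse_time_points(value: str) -> List[float]:
    """Convert a comma-separated string such as '0,6,12' into a list of floats."""
    try:
        return [float(item) for item in value.split(",")]
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid time point list: {value!r}") from None


parser = argparse.ArgumentParser(description="Illustrative subset of the options above")
parser.add_argument("--input", required=True)
parser.add_argument("--time-col", required=True)
parser.add_argument("--event-col", required=True)
parser.add_argument("--time-points", type=parse_time_points, default=None)
parser.add_argument("--dpi", type=int, default=300)

# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere
args = parser.parse_args([
    "--input", "data.csv", "--time-col", "time", "--event-col", "event",
    "--time-points", "0,6,12,18",
])
print(args.time_points, args.dpi)  # [0.0, 6.0, 12.0, 18.0] 300
```

Converting the list in a `type=` callable means malformed values are rejected by `argparse` with a usage message before any data is read.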

## Output Format

### Standalone Risk Table
```
┌─────────────────────────────────────────────────────────┐
│  Number at risk                                         │
├─────────┬─────┬─────┬─────┬─────┬─────┬─────┬───────────┤
│ Group   │  0  │ 12  │ 24  │ 36  │ 48  │ 60  │ 72 (mo)   │
├─────────┼─────┼─────┼─────┼─────┼─────┼─────┼───────────┤
│ Exp     │ 150 │ 142 │ 128 │ 105 │  89 │  72 │  58       │
│ Control │ 148 │ 135 │ 118 │  92 │  76 │  61 │  45       │
└─────────┴─────┴─────┴─────┴─────┴─────┴─────┴───────────┘
```

### Combined Figure Layout
```
┌─────────────────────────────────────┐
│                                     │
│     Kaplan-Meier Survival Curve     │
│                                     │
│    ━━━━━━━━━  Experimental         │
│    ─ ─ ─ ─ ─  Control              │
│                                     │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Number at risk                      │
│ Exp    150  142  128  105   89   72 │
│ Ctrl   148  135  118   92   76   61 │
│         0   12   24   36   48   60  │
└─────────────────────────────────────┘
```

## Algorithm Description

### Number at Risk Calculation

```
For each time point t:
    For each group g:
        N_at_risk(t, g) = N_total(g)
                          - Σ(patients with an event at time ≤ t)
                          - Σ(patients censored at time < t)
```
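
The rule above can be sketched in plain Python as follows. This is an illustration of the documented formula, not the packaged implementation; the `Record` layout and the function name `number_at_risk` are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Each record is (time, event, group); event is 1 for an event, 0 for censoring
Record = Tuple[float, int, str]


def number_at_risk(records: List[Record], time_points: List[float]) -> Dict[str, List[int]]:
    """Group total minus events at or before t minus censorings strictly before t."""
    groups = sorted({group for _, _, group in records})
    table: Dict[str, List[int]] = {}
    for g in groups:
        rows = [(t, e) for t, e, grp in records if grp == g]
        counts = []
        for tp in time_points:
            events = sum(1 for t, e in rows if e == 1 and t <= tp)
            censored = sum(1 for t, e in rows if e == 0 and t < tp)
            counts.append(max(0, len(rows) - events - censored))
        table[g] = counts
    return table


data = [
    (3.2, 1, "Experimental"), (5.1, 0, "Experimental"), (24.0, 1, "Experimental"),
    (12.3, 1, "Control"), (18.7, 0, "Control"),
]
print(number_at_risk(data, [0, 12, 24]))
# {'Control': [2, 2, 0], 'Experimental': [3, 1, 0]}
```

Under the alternative convention that counts every subject with time ≥ t as at risk, the Experimental count at t = 24 would be 1 rather than 0, because the event at exactly 24.0 would still be in the risk set. The two conventions differ only for events that fall exactly on a displayed time point.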

### Time Point Selection Strategy

1. **Auto Mode**: Automatically select equally spaced time points based on data distribution
2. **Fixed Interval**: Select at specified intervals (e.g., every 6 months)
3. **Custom**: User-specified specific time points
4. **Event-driven**: Select based on event occurrence density
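
The first two strategies can be sketched without any dependencies. The function names are hypothetical; the default of seven points mirrors `_auto_select_time_points` in `scripts/main.py`, while the fixed-interval variant is an assumption about how that mode could behave.

```python
import math
from typing import List


def auto_time_points(max_time: float, n_points: int = 7) -> List[float]:
    """Equally spaced points from 0 to max_time inclusive, rounded to one decimal."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    step = max_time / (n_points - 1)
    return [round(i * step, 1) for i in range(n_points)]


def fixed_interval_time_points(max_time: float, interval: float) -> List[float]:
    """Points at 0, interval, 2*interval, ... up to the last multiple not exceeding max_time."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    n_steps = math.floor(max_time / interval)
    return [i * interval for i in range(n_steps + 1)]


print(auto_time_points(36))                # [0.0, 6.0, 12.0, 18.0, 24.0, 30.0, 36.0]
print(fixed_interval_time_points(40, 12))  # [0, 12, 24, 36]
```

Auto mode always ends on the maximum follow-up time, whereas a fixed interval gives round tick values but may stop short of the last observation.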

## Quality Checklist

- [ ] Time axis precisely aligned with X-axis
- [ ] Number at risk calculations correct (can manually spot-check)
- [ ] Group labels clear and readable
- [ ] Font size meets journal requirements (≥8pt)
- [ ] Image resolution ≥300 DPI
- [ ] Color contrast meets accessibility standards
- [ ] Censoring marks (if present) clearly distinguishable
- [ ] Export format meets submission requirements

## Journal-Specific Notes

### NEJM
- Minimum font size 8pt
- Recommended time unit: months
- Fixed spacing between risk table and curve

### Lancet
- Minimum font size 9pt
- Support multi-group display (max 4 groups)
- Table grid lines optional

### JCO
- Need to label censoring information
- Support risk table + censoring table double-layer structure
- Recommend labeling median follow-up time

## FAQ

### Q: How are time points automatically determined?
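A: By default, `_auto_select_time_points` picks seven equally spaced time points from 0 to the maximum follow-up time, rounded to one decimal place. Pass `--time-points` to set them explicitly (e.g., every 12 months).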

### Q: How to handle multiple groups?
A: The group column is detected automatically and up to 6 groups are supported. With more groups, the table is paginated or rendered in a smaller font.

### Q: Can it work with KM curves generated by Python/R?
A: Yes, supports importing external KM curve images for combination

## Dependency Requirements

```
numpy >= 1.20.0
pandas >= 1.3.0
matplotlib >= 3.4.0
seaborn >= 0.11.0
lifelines >= 0.27.0  (optional, for survival analysis)
Pillow >= 8.0.0      (image processing)
```

## Related Tools

- lifelines: Python survival analysis library
- survminer: R survival curve visualization
- ggsurvplot: ggplot2 survival plot extension

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
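
The two path-related items in this checklist can be illustrated with a short `pathlib` sketch. The helper `is_inside_workspace` and the `/tmp/workspace` root are hypothetical; this is one possible check, not the validation shipped with the skill.

```python
from pathlib import Path


def is_inside_workspace(candidate: str, workspace: str) -> bool:
    """Return True when `candidate` resolves to a location inside `workspace`."""
    root = Path(workspace).resolve()
    # Joining first means relative paths are interpreted against the workspace;
    # an absolute candidate replaces the root and is rejected by the check below.
    target = (root / candidate).resolve()
    # Path.is_relative_to is available from Python 3.9
    return target.is_relative_to(root)


print(is_inside_workspace("output/risk_table.png", "/tmp/workspace"))  # True
print(is_inside_workspace("../secrets.csv", "/tmp/workspace"))         # False
```

Resolving before comparing collapses `..` segments, so a traversal attempt is judged by where it actually lands rather than by how the string looks.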

## Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.

## Input Validation

This skill accepts requests that match the documented purpose of `survival-curve-risk-table` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `survival-curve-risk-table` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.

## Inputs to Collect

- Required inputs: the user goal, the primary data or source file, and the requested output format.
- Optional inputs: output directory, formatting preferences, and validation constraints.
- If a required input is unavailable, return a short clarification request before continuing.

## Output Contract

- Return a short summary, the main deliverables, and any assumptions that materially affect interpretation.
- If execution is partial, label what succeeded, what failed, and the next safe recovery step.
- Keep the final answer within the documented scope of the skill.

## Validation and Safety Rules

- Validate identifiers, file paths, and user-provided parameters before execution.
- Do not fabricate results, metrics, citations, or downstream conclusions.
- Use safe fallback behavior when dependencies, credentials, or required inputs are missing.
- Surface any execution failure with a concise diagnosis and recovery path.

FILE:README.md
# Survival Curve Risk Table Generator

Automatically aligns and adds "Number at risk" tables below Kaplan-Meier survival curves, consistent with clinical oncology journal standards.

## Install
```bash
pip install -r requirements.txt
```

## Quick Start
### 1. Create sample data
```bash
python scripts/main.py --create-sample-data sample_data.csv
```

### 2. Generate risk table
```bash
python scripts/main.py \
    --input sample_data.csv \
    --time-col time \
    --event-col event \
    --group-col treatment \
    --style NEJM \
    --output risk_table.png
```

### 3. Generate combined figure (KM curve + risk table)
```bash
python scripts/main.py \
    --input sample_data.csv \
    --time-col time \
    --event-col event \
    --group-col treatment \
    --combine \
    --output combined_figure.png
```

## Command line parameters
| Parameter | Description | Example |
|------|------|------|
| `--input` | Input data file path | `data.csv` |
| `--time-col` | Time column name | `time` |
| `--event-col` | Event column name (1=event, 0=censored) | `event` |
| `--group-col` | Group column name | `treatment` |
| `--output` | Output file path | `risk_table.png` |
| `--style` | Journal style (NEJM/Lancet/JCO) | `NEJM` |
| `--time-points` | Custom time points | `0,6,12,18,24,30` |
| `--combine` | Generate combined figure | - |
| `--show-censored` | Show censored counts | - |

## Input data format
CSV format example:
```csv
time,event,treatment
12.5,1,Experimental
18.3,0,Experimental
24.0,1,Control
...
```

## Python API

```python
from scripts.main import RiskTableGenerator

# Initialize generator
generator = RiskTableGenerator(style="NEJM")

# Load data
generator.load_data_from_file(
    "data.csv",
    time_col="time",
    event_col="event",
    group_col="treatment"
)

# Generate risk table
generator.generate_risk_table("risk_table.png")

# Generate combined figure (KM curve + risk table)
generator.generate_combined_plot(output_path="combined.png")
```

## Journal style
- **NEJM**: New England Journal of Medicine style
- **Lancet**: The Lancet style
- **JCO**: Journal of Clinical Oncology style

## License
MIT License

FILE:references/runtime_checklist.md
# Runtime Checklist

- Category: `Data Analysis`
- Validate the user goal, required inputs, and output format before taking action.
- Ask a targeted clarification question when a required input is missing.
- Keep the response scoped to the documented workflow and state assumptions explicitly.
- Run a non-destructive smoke check before any file-dependent or data-dependent command.
- Recommended smoke check: `python -m py_compile scripts/main.py`
- If execution fails, stop and return a concise recovery path instead of fabricating results.

FILE:requirements.txt
lifelines
matplotlib
numpy
pandas
pil
pillow
seaborn

FILE:scripts/main.py
#!/usr/bin/env python3
"""Survival Curve Risk Table Generator
Automatically aligns and adds a "Number at risk" table below a Kaplan-Meier survival curve.
Complies with clinical oncology journal standards (NEJM, Lancet, JCO, etc.).

Author: OpenClaw
Version: 1.0.0"""

import argparse
import sys
import warnings
from pathlib import Path
from typing import List, Optional, Tuple, Dict, Union
import json

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.gridspec import GridSpec

# Try importing optional dependencies
try:
    from PIL import Image
    HAS_PIL = True
except ImportError:
    HAS_PIL = False
    warnings.warn("PIL not installed. Image combining features disabled.")

try:
    from lifelines import KaplanMeierFitter
    from lifelines.statistics import logrank_test
    HAS_LIFELINES = True
except ImportError:
    HAS_LIFELINES = False
    warnings.warn("lifelines not installed. Survival analysis features limited.")


class RiskTableGenerator:
    """Survival Curve Risk Table Generator
    
    Main functions:
    1. Calculate the number at risk at each time point from survival data
    2. Generate tables that comply with journal standards
    3. Combine with a KM curve to generate publication-quality figures"""
    
    # Predefined journal style configurations
    JOURNAL_STYLES = {
        "NEJM": {
            "font_family": "Helvetica",
            "font_size": 8,
            "table_height_ratio": 0.15,
            "show_grid": False,
            "separator_lines": True,
            "header_bold": True,
            "time_label": "mo",
        },
        "Lancet": {
            "font_family": "Times New Roman",
            "font_size": 9,
            "table_height_ratio": 0.18,
            "show_grid": True,
            "separator_lines": True,
            "header_bold": True,
            "time_label": "months",
        },
        "JCO": {
            "font_family": "Arial",
            "font_size": 8,
            "table_height_ratio": 0.16,
            "show_grid": False,
            "separator_lines": False,
            "header_bold": False,
            "time_label": "mo",
            "show_censored": True,
            "censor_symbol": "+",
        },
        "custom": {
            "font_family": "Arial",
            "font_size": 8,
            "table_height_ratio": 0.15,
            "show_grid": False,
            "separator_lines": True,
            "header_bold": True,
            "time_label": "mo",
        }
    }
    
    def __init__(
        self,
        style: str = "NEJM",
        time_points: Optional[List[float]] = None,
        figure_size: Tuple[float, float] = (8, 6),
        dpi: int = 300,
        custom_style: Optional[Dict] = None
    ):
        """Initialize risk table generator
        
        Args:
            style: journal style (NEJM, Lancet, JCO, custom)
            time_points: Custom time point list
            figure_size: image size (width, height) inches
            dpi: image resolution
            custom_style: Custom style configuration (used when style='custom')"""
        self.style_name = style
        self.style = self.JOURNAL_STYLES.get(style, self.JOURNAL_STYLES["NEJM"]).copy()
        
        if custom_style and style == "custom":
            self.style.update(custom_style)
        
        self.time_points = time_points
        self.figure_size = figure_size
        self.dpi = dpi
        
        self.data = None
        self.time_col = None
        self.event_col = None
        self.group_col = None
        self.groups = None
        
    def load_data(
        self,
        df: pd.DataFrame,
        time_col: str,
        event_col: str,
        group_col: Optional[str] = None
    ) -> None:
        """Load survival data
        
        Args:
            df: survival data DataFrame
            time_col: time column name
            event_col: event column name (1=event, 0=censored)
            group_col: grouping column name (optional)"""
        self.data = df.copy()
        self.time_col = time_col
        self.event_col = event_col
        self.group_col = group_col
        
        # Validate required columns
        required_cols = [time_col, event_col]
        missing_cols = [col for col in required_cols if col not in df.columns]
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")
        
        if group_col and group_col not in df.columns:
            raise ValueError(f"Group column '{group_col}' not found in data")
        
        # Extract group information
        if group_col:
            self.groups = df[group_col].unique().tolist()
        else:
            self.groups = ["All"]
            self.data["__group__"] = "All"
            self.group_col = "__group__"
        
        # Data type checking
        if not pd.api.types.is_numeric_dtype(self.data[time_col]):
            raise ValueError(f"Time column '{time_col}' must be numeric")
        
        # Make sure the event column is numeric
        self.data[event_col] = pd.to_numeric(self.data[event_col], errors='coerce')
        
        # Automatically determine time points (if not specified)
        if self.time_points is None:
            self._auto_select_time_points()
    
    def load_data_from_file(
        self,
        file_path: str,
        time_col: str,
        event_col: str,
        group_col: Optional[str] = None
    ) -> None:
        """Load survival data from file
        
        Supported formats: CSV, Excel, SAS, pickle"""
        path = Path(file_path)
        
        if not path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")
        
        suffix = path.suffix.lower()
        
        if suffix == '.csv':
            df = pd.read_csv(file_path)
        elif suffix in ['.xlsx', '.xls']:
            df = pd.read_excel(file_path)
        elif suffix == '.sas7bdat':
            df = pd.read_sas(file_path)
        elif suffix in ['.pkl', '.pickle']:
            df = pd.read_pickle(file_path)
        else:
            raise ValueError(f"Unsupported file format: {suffix}")
        
        self.load_data(df, time_col, event_col, group_col)
    
    def _auto_select_time_points(self, n_points: int = 7) -> None:
        """Automatically select time points
        
        Strategy: use equally spaced time points spanning 0 to the maximum follow-up time in the data"""
        if self.data is None:
            raise ValueError("No data loaded")
        
        max_time = self.data[self.time_col].max()
        
        # Generate equidistant time points (containing 0 and maximum values)
        self.time_points = np.linspace(0, max_time, n_points).tolist()
        self.time_points = [round(t, 1) for t in self.time_points]
    
    def calculate_number_at_risk(self) -> pd.DataFrame:
        """Calculate the number of people at risk at each time point
        
        Returns:
            DataFrame with one row per group and one column per time point; values are the numbers at risk"""
        if self.data is None:
            raise ValueError("No data loaded")
        
        results = []
        
        for group in self.groups:
            group_data = self.data[self.data[self.group_col] == group]
            n_total = len(group_data)
            
            row = {"Group": group}
            for t in self.time_points:
                # Number at risk at t = total - events before t - censored before t
                # (subjects with observed time >= t, given 0/1 event codes)
                events_before = len(group_data[
                    (group_data[self.time_col] < t) &
                    (group_data[self.event_col] == 1)
                ])
                censored_before = len(group_data[
                    (group_data[self.time_col] < t) & 
                    (group_data[self.event_col] == 0)
                ])
                
                n_at_risk = n_total - events_before - censored_before
                row[f"t_{t}"] = max(0, n_at_risk)
            
            results.append(row)
        
        return pd.DataFrame(results)
    
    def calculate_censored_counts(self) -> pd.DataFrame:
        """Calculate the censored number of people in each time interval (for JCO style)"""
        if self.data is None:
            raise ValueError("No data loaded")
        
        results = []
        
        for group in self.groups:
            group_data = self.data[self.data[self.group_col] == group]
            row = {"Group": group}
            
            for i, t in enumerate(self.time_points):
                if i == 0:
                    censored_in_interval = 0
                else:
                    t_prev = self.time_points[i - 1]
                    censored_in_interval = len(group_data[
                        (group_data[self.time_col] > t_prev) &
                        (group_data[self.time_col] <= t) &
                        (group_data[self.event_col] == 0)
                    ])
                row[f"t_{t}"] = censored_in_interval
            
            results.append(row)
        
        return pd.DataFrame(results)
    
    def generate_risk_table(
        self,
        output_path: str,
        show_censored: Optional[bool] = None,
        show_events: bool = False,
        title: Optional[str] = None
    ) -> None:
        """Generate independent risk table images
        
        Args:
            output_path: output file path
            show_censored: whether to display the censored number of people (default is read from the style configuration)
            show_events: whether to display the number of events
            title: table title"""
        if self.data is None:
            raise ValueError("No data loaded. Call load_data() first.")
        
        if show_censored is None:
            show_censored = self.style.get("show_censored", False)
        
        # Calculated data
        risk_df = self.calculate_number_at_risk()
        
        # Set font
        plt.rcParams['font.family'] = self.style.get("font_family", "Arial")
        
        # Create graphics
        fig, ax = plt.subplots(figsize=self.figure_size, dpi=self.dpi)
        ax.axis('off')
        
        # Prepare tabular data
        table_data = self._prepare_table_data(risk_df, show_censored)
        
        # draw table
        self._draw_risk_table(ax, table_data, title)
        
        # save
        plt.tight_layout()
        plt.savefig(output_path, dpi=self.dpi, bbox_inches='tight', 
                   facecolor='white', edgecolor='none')
        plt.close()
        
        print(f"Risk table saved to: {output_path}")
    
    def _prepare_table_data(
        self,
        risk_df: pd.DataFrame,
        show_censored: bool
    ) -> List[List[str]]:
        """Prepare tabular data"""
        # Header
        time_label = self.style.get("time_label", "mo")
        headers = ["Number at risk"]
        for t in self.time_points:
            if t == self.time_points[-1]:
                headers.append(f"{int(t)} ({time_label})")
            else:
                headers.append(str(int(t)))
        
        # data row
        rows = []
        for _, row in risk_df.iterrows():
            group_name = row["Group"]
            data_row = [group_name]
            for t in self.time_points:
                data_row.append(str(int(row[f"t_{t}"])))
            rows.append(data_row)
        
        # Optionally append one row of censored counts per group (JCO style)
        if show_censored:
            censored_df = self.calculate_censored_counts()
            for _, row in censored_df.iterrows():
                group_name = row["Group"]
                data_row = [f"  ({self.style.get('censor_symbol', '+')})"]
                for t in self.time_points:
                    count = int(row[f"t_{t}"])
                    data_row.append(str(count) if count > 0 else "")
                rows.append(data_row)
        
        return [headers] + rows
    
    def _draw_risk_table(
        self,
        ax,
        table_data: List[List[str]],
        title: Optional[str] = None
    ) -> None:
        """Draw a risk table"""
        font_size = self.style.get("font_size", 8)
        show_grid = self.style.get("show_grid", False)
        header_bold = self.style.get("header_bold", True)
        
        # Create table
        table = ax.table(
            cellText=table_data[1:],
            colLabels=table_data[0],
            loc='center',
            cellLoc='center'
        )
        
        table.auto_set_font_size(False)
        table.set_fontsize(font_size)
        table.scale(1, 2)
        
        # Set style
        for (row, col), cell in table.get_celld().items():
            
            # Header style
            if row == 0:
                cell.set_text_props(fontweight='bold' if header_bold else 'normal')
                cell.set_facecolor('#f0f0f0')
                cell.set_edgecolor('black' if show_grid else 'none')
            else:
                # Data row style
                cell.set_edgecolor('black' if show_grid else 'none')
                
                # The first column (group name) is left aligned
                if col == 0:
                    cell.set_text_props(ha='left')
                    cell._loc = 'left'
        
        # Add title
        if title:
            ax.set_title(title, fontsize=font_size + 2, fontweight='bold', pad=10)
    
    def generate_combined_plot(
        self,
        km_plot_path: Optional[str] = None,
        output_path: str = "combined_survival_plot.png",
        km_ax: Optional[matplotlib.axes.Axes] = None,
        show_km_plot: bool = True,
        km_title: Optional[str] = None
    ) -> None:
        """Generate a combined graph of KM curve and risk table
        
        Args:
            km_plot_path: external KM curve image path (optional)
            output_path: output file path
            km_ax: pre-generated KM curve axes (optional)
            show_km_plot: whether to display KM curve
            km_title: KM curve title"""
        if not HAS_LIFELINES and km_ax is None and km_plot_path is None:
            raise ImportError("lifelines is required for generating KM plots. "
                            "Install with: pip install lifelines")
        
        # Set font
        plt.rcParams['font.family'] = self.style.get("font_family", "Arial")
        
        # Create a graphic layout
        fig_height = self.figure_size[1]
        table_height_ratio = self.style.get("table_height_ratio", 0.15)
        
        if show_km_plot:
            fig = plt.figure(figsize=(self.figure_size[0], fig_height), dpi=self.dpi)
            gs = GridSpec(2, 1, height_ratios=[1 - table_height_ratio, table_height_ratio],
                         hspace=0.05)
            
            # Upper part: KM curve
            ax_km = fig.add_subplot(gs[0])
            
            if km_plot_path and Path(km_plot_path).exists():
                # Use external images
                img = plt.imread(km_plot_path)
                ax_km.imshow(img)
                ax_km.axis('off')
            elif km_ax is not None:
                # Use the provided axes (complicated, not implemented yet)
                pass
            else:
                # Automatically generate KM curve
                self._plot_km_curve(ax_km, km_title)
            
            # Lower part: risk table
            ax_table = fig.add_subplot(gs[1])
            ax_table.axis('off')
            
            # Prepare and draw tables
            risk_df = self.calculate_number_at_risk()
            show_censored = self.style.get("show_censored", False)
            table_data = self._prepare_table_data(risk_df, show_censored)
            self._draw_risk_table(ax_table, table_data)
            
            # Align X axis
            if hasattr(self, '_km_xlim'):
                ax_table.set_xlim(self._km_xlim)
        else:
            # Generate risk table only
            self.generate_risk_table(output_path)
            return
        
        plt.savefig(output_path, dpi=self.dpi, bbox_inches='tight',
                   facecolor='white', edgecolor='none')
        plt.close()
        
        print(f"Combined plot saved to: {output_path}")
    
    def _plot_km_curve(
        self,
        ax: matplotlib.axes.Axes,
        title: Optional[str] = None
    ) -> None:
        """Use lifelines to draw KM curve"""
        if not HAS_LIFELINES:
            raise ImportError("lifelines is required for generating KM plots")
        
        colors = plt.cm.tab10(np.linspace(0, 1, len(self.groups)))
        
        for i, group in enumerate(self.groups):
            group_data = self.data[self.data[self.group_col] == group]
            
            kmf = KaplanMeierFitter()
            kmf.fit(
                durations=group_data[self.time_col],
                event_observed=group_data[self.event_col],
                label=group
            )
            
            kmf.plot_survival_function(
                ax=ax,
                ci_show=False,
                color=colors[i],
                linewidth=2
            )
        
        ax.set_xlabel(f"Time ({self.style.get('time_label', 'months')})", fontsize=10)
        ax.set_ylabel("Survival Probability", fontsize=10)
        ax.set_ylim(0, 1.05)
        ax.legend(loc='lower left', frameon=False)
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        
        if title:
            ax.set_title(title, fontsize=12, fontweight='bold')
        
        # Save X-axis range for alignment
        self._km_xlim = ax.get_xlim()
    
    def export_risk_table_data(self, output_path: str) -> None:
        """Export risk table data to CSV"""
        risk_df = self.calculate_number_at_risk()
        
        # Rename columns to more friendly names
        time_label = self.style.get("time_label", "mo")
        col_map = {"Group": "Group"}
        for t in self.time_points:
            col_map[f"t_{t}"] = f"{int(t)} {time_label}"
        
        risk_df = risk_df.rename(columns=col_map)
        risk_df.to_csv(output_path, index=False)
        print(f"Risk table data exported to: {output_path}")


def create_sample_data(output_path: str, n_patients: int = 300, seed: int = 42) -> None:
    """Create sample survival data for testing"""
    np.random.seed(seed)
    
    # Generate two sets of data
    n_per_group = n_patients // 2
    
    # Experimental Group: Better Survival
    exp_times = np.random.exponential(scale=36, size=n_per_group)
    exp_events = np.random.binomial(1, 0.6, n_per_group)
    
    # control group
    ctrl_times = np.random.exponential(scale=24, size=n_per_group)
    ctrl_events = np.random.binomial(1, 0.7, n_per_group)
    
    # Combined data
    df = pd.DataFrame({
        'time': np.concatenate([exp_times, ctrl_times]),
        'event': np.concatenate([exp_events, ctrl_events]),
        'treatment': ['Experimental'] * n_per_group + ['Control'] * n_per_group
    })
    
    # Administratively censor follow-up at 60 months
    df.loc[df['time'] > 60, 'event'] = 0
    df.loc[df['time'] > 60, 'time'] = 60
    
    df.to_csv(output_path, index=False)
    print(f"Sample data created: {output_path}")


def main():
    """Command line entry"""
    parser = argparse.ArgumentParser(
        description="Generate Number at Risk table for Kaplan-Meier survival curves",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""Examples:
  # Basic usage
  python main.py --input data.csv --time-col time --event-col event --output risk_table.png
  
  # Specify journal style
  python main.py --input data.csv --time-col time --event-col event --style NEJM --output figure.pdf
  
  # Combined plot (automatically generate KM curve)
  python main.py --input data.csv --time-col time --event-col event --combine --output combined.png
  
  # Specify time points
  python main.py --input data.csv --time-col time --event-col event --time-points 0,6,12,18,24,30,36
  
  # Create sample data
  python main.py --create-sample-data sample.csv"""
    )
    
    # input parameters
    parser.add_argument('--input', '-i', type=str, help='Input data file path')
    parser.add_argument('--time-col', type=str, help='Time column name')
    parser.add_argument('--event-col', type=str, help='Event column name (1=event, 0=censored)')
    parser.add_argument('--group-col', type=str, help='Group column name (optional)')
    
    # Output parameters
    parser.add_argument('--output', '-o', type=str, default='risk_table.png',
                       help='Output file path (default: risk_table.png)')
    parser.add_argument('--output-dir', type=str, help='Output directory for batch processing')
    
    # style parameters
    parser.add_argument('--style', type=str, default='NEJM',
                       choices=['NEJM', 'Lancet', 'JCO', 'custom'],
                       help='Journal style (default: NEJM)')
    parser.add_argument('--time-points', type=str,
                       help='Comma-separated time points (e.g., 0,6,12,18,24)')
    
    # Image parameters
    parser.add_argument('--width', type=float, default=8, help='Figure width in inches (default: 8)')
    parser.add_argument('--height', type=float, default=6, help='Figure height in inches (default: 6)')
    parser.add_argument('--dpi', type=int, default=300, help='DPI resolution (default: 300)')
    parser.add_argument('--font-size', type=int, help='Font size (overrides style default)')
    
    # Function switch
    parser.add_argument('--combine', action='store_true',
                       help='Generate combined KM plot with risk table')
    parser.add_argument('--km-plot', type=str, help='Path to existing KM plot image')
    parser.add_argument('--show-censored', action='store_true',
                       help='Show censored counts in table')
    parser.add_argument('--show-events', action='store_true',
                       help='Show event counts in table')
    parser.add_argument('--export-data', action='store_true',
                       help='Export risk table data as CSV')
    
    # tool
    parser.add_argument('--create-sample-data', type=str, metavar='PATH',
                       help='Create sample survival data for testing')
    
    args = parser.parse_args()
    
    # Create sample data
    if args.create_sample_data:
        create_sample_data(args.create_sample_data)
        return
    
    # Validate required parameters
    if not args.input:
        parser.error("--input is required (or use --create-sample-data to generate test data)")
    
    if not args.time_col or not args.event_col:
        parser.error("--time-col and --event-col are required")
    
    # Parse time points
    time_points = None
    if args.time_points:
        time_points = [float(x.strip()) for x in args.time_points.split(',')]
    
    # Custom style
    custom_style = {}
    if args.font_size:
        custom_style['font_size'] = args.font_size
    
    # Initialize the generator
    generator = RiskTableGenerator(
        style=args.style,
        time_points=time_points,
        figure_size=(args.width, args.height),
        dpi=args.dpi,
        custom_style=custom_style if custom_style else None
    )
    
    # Load data
    print(f"Loading data from: {args.input}")
    generator.load_data_from_file(
        args.input,
        args.time_col,
        args.event_col,
        args.group_col
    )
    print(f"Loaded {len(generator.data)} records")
    print(f"Groups: {generator.groups}")
    print(f"Time points: {generator.time_points}")
    
    # Export data
    if args.export_data:
        # Swap whatever extension the image path has for .csv so the image is never overwritten
        data_output = str(Path(args.output).with_suffix('.csv'))
        generator.export_risk_table_data(data_output)
    
    # generate image
    if args.combine:
        print("Generating combined plot...")
        generator.generate_combined_plot(
            km_plot_path=args.km_plot,
            output_path=args.output
        )
    else:
        print("Generating risk table...")
        generator.generate_risk_table(
            output_path=args.output,
            show_censored=args.show_censored,
            show_events=args.show_events
        )
    
    print("Done!")


if __name__ == "__main__":
    main()

Study Limitations Drafter

Skill

Use study limitations drafter for academic writing workflows that need structured execution, explicit assumptions, and clear output boundaries.

---
name: study-limitations-drafter
description: Use study limitations drafter for academic writing workflows that need structured execution, explicit assumptions, and clear output boundaries.
license: MIT
skill-author: AIPOCH
---
# Study Limitations Drafter

Professional limitation statement generator.

## When to Use

- Use this skill when the task needs study limitation statements drafted for academic writing, with structured execution, explicit assumptions, and clear output boundaries.
- Use this skill for academic writing tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

- Scope-focused workflow for drafting study limitation statements with structured execution, explicit assumptions, and clear output boundaries.
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

See `## Prerequisites` below for related details.

- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `Third-party packages`: `not explicitly version-pinned in this skill package`. Add pinned versions if this skill needs stricter environment control.

## Example Usage

```bash
cd "20260318/scientific-skills/Academic Writing/study-limitations-drafter"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Choose the limitation types to pass via `--issues` (comma-separated keys such as `sample_size` or `single_center`) and, optionally, an `--output` file.
3. Run `python scripts/main.py --issues <types>` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Use Cases
- Manuscript limitation sections
- Grant proposal risk assessment
- Peer review responses

## Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `limitations` | list[str] | Yes | - | List of study design constraints |
| `severity` | str | No | "minor" | Impact level: "minor" or "major" |
| `mitigation` | list[str] | No | - | Steps taken to address each limitation |
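
The parameter contract above can be checked before any drafting starts. The following is a minimal sketch, assuming a hypothetical `LimitationRequest` dataclass (it is not part of the packaged `scripts/main.py`, which takes `--issues` and `--output`). It enforces only the documented types and severity values, plus an assumed one-to-one pairing of mitigations with limitations when mitigations are supplied.

```python
from dataclasses import dataclass, field


# Hypothetical container for the documented parameters; the class name and
# the validation rules are illustrative assumptions, not the skill's API.
@dataclass
class LimitationRequest:
    limitations: list[str]
    severity: str = "minor"
    mitigation: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the request is usable."""
        problems = []
        if not self.limitations:
            problems.append("limitations is required and must not be empty")
        if self.severity not in ("minor", "major"):
            problems.append(
                f"severity must be 'minor' or 'major', got {self.severity!r}"
            )
        # Assumption: when mitigations are given, there is one per limitation.
        if self.mitigation and len(self.mitigation) != len(self.limitations):
            problems.append("mitigation must have one entry per limitation")
        return problems


request = LimitationRequest(
    limitations=["single-center design"],
    mitigation=["sensitivity analysis"],
)
print(request.validate())  # prints [] because the request is valid
```

Returning a list of problems rather than raising on the first one makes it possible to report every missing or invalid field in a single reply.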

## Returns
- Professionally worded limitation paragraphs
- Balanced tone (honest but not self-defeating)
- Mitigation strategies

## Example
"While the single-center design limits generalizability..."

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited

## Prerequisites

No additional Python packages required.

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
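
The rule about reporting the failure point when the script fails can be sketched as a small wrapper. This is an illustrative sketch only: `run_script` and the keys of the dictionary it returns are assumptions, not part of the skill package.

```python
import subprocess
import sys


def run_script(args: list[str]) -> dict:
    """Run a Python command and return a structured result instead of raising.

    Hypothetical helper: the keys of the returned dict are an assumption.
    """
    completed = subprocess.run(
        [sys.executable, *args],
        capture_output=True,
        text=True,
    )
    if completed.returncode == 0:
        return {"ok": True, "output": completed.stdout}
    # Keep only the last stderr line as the failure point, so the report
    # stays short and does not dump a full stack trace.
    lines = completed.stderr.strip().splitlines()
    return {
        "ok": False,
        "exit_code": completed.returncode,
        "failure_point": lines[-1] if lines else "no error output",
    }


# Running without the required --issues flag (or without the script being
# present at this path) produces a non-zero exit status.
result = run_script(["scripts/main.py"])
if not result["ok"]:
    print(f"scripts/main.py failed: {result['failure_point']}")
```

Reporting only the final error line is one way to satisfy the checklist item about sanitized error messages while still naming the point of failure.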

## Input Validation

This skill accepts requests that match the documented purpose of `study-limitations-drafter` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `study-limitations-drafter` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## References

- [references/audit-reference.md](references/audit-reference.md) - Supported scope, audit commands, and fallback boundaries

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.
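
The fixed structure above can be rendered mechanically. The sketch below is illustrative: `render_response` and the "Not provided" placeholder are assumptions, and only the seven section names come from the template itself.

```python
# Section names copied from the Response Template, in order.
SECTIONS = [
    "Objective",
    "Inputs Received",
    "Assumptions",
    "Workflow",
    "Deliverable",
    "Risks and Limits",
    "Next Checks",
]


def render_response(content: dict[str, str]) -> str:
    """Render the seven sections in order; unfilled ones are marked explicitly."""
    lines = []
    for number, name in enumerate(SECTIONS, start=1):
        body = content.get(name, "").strip() or "Not provided"
        lines.append(f"{number}. {name}: {body}")
    return "\n".join(lines)


print(render_response({"Objective": "Draft a limitations paragraph"}))
```

Marking empty sections explicitly, rather than omitting them, keeps assumptions and limits visible even when a response is compressed.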

FILE:references/audit-reference.md
# Audit Reference

## Scope

- Skill: `study-limitations-drafter`
- Core purpose: Use study limitations drafter for academic writing workflows that need structured execution, explicit assumptions, and clear output boundaries.
- Use only within the documented workflow and category boundary defined in `SKILL.md`

## Supported Audit Paths

- `python -m py_compile scripts/main.py`
- `python scripts/main.py --help`

## Fallback Boundary

If required inputs are incomplete, the skill should still return:

- the missing required inputs
- the steps that can still be completed safely
- assumptions that need confirmation before execution
- the next checks before accepting the final deliverable
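
The four items in the fallback boundary can be carried in one small structure. This is a minimal sketch under stated assumptions: `FallbackReport` and `missing_fields` are hypothetical names, and the required field list in the example is illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class FallbackReport:
    """Hypothetical container for the four fallback items listed above."""

    missing_inputs: list[str] = field(default_factory=list)
    safe_steps: list[str] = field(default_factory=list)
    assumptions_to_confirm: list[str] = field(default_factory=list)
    next_checks: list[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Render each item as a titled bullet list, writing 'none' when empty."""
        blocks = [
            ("Missing required inputs", self.missing_inputs),
            ("Steps that can still be completed safely", self.safe_steps),
            ("Assumptions to confirm before execution", self.assumptions_to_confirm),
            ("Next checks before accepting the deliverable", self.next_checks),
        ]
        lines = []
        for title, items in blocks:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in (items or ["none"]))
        return "\n".join(lines)


def missing_fields(provided: dict, required: list[str]) -> list[str]:
    """Return the required field names that are absent or empty."""
    return [name for name in required if not provided.get(name)]


report = FallbackReport(
    missing_inputs=missing_fields({"output": "limitations.txt"}, ["issues", "output"]),
    safe_steps=["python -m py_compile scripts/main.py"],
)
print(report.to_text())
```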

FILE:scripts/main.py
#!/usr/bin/env python3
"""
Study Limitations Drafter
Transform study design flaws into professional limitation statements.
"""

import argparse


class LimitationsDrafter:
    """Draft study limitations sections."""
    
    LIMITATION_TEMPLATES = {
        "sample_size": {
            "issue": "Small sample size",
            "statement": "The sample size in this study was limited, which may have reduced statistical power to detect smaller effects."
        },
        "retrospective": {
            "issue": "Retrospective design",
            "statement": "The retrospective nature of this study introduces potential selection bias and limits causal inference."
        },
        "single_center": {
            "issue": "Single-center study",
            "statement": "Findings from this single-center study may not generalize to other populations or healthcare settings."
        },
        "short_followup": {
            "issue": "Short follow-up",
            "statement": "The relatively short follow-up period limits assessment of long-term outcomes and delayed effects."
        },
        "missing_data": {
            "issue": "Missing data",
            "statement": "Missing data for some variables may have introduced bias, though we employed appropriate imputation methods."
        }
    }
    
    def generate_limitations(self, issues):
        """Generate limitations section."""
        sections = []
        
        sections.append("STUDY LIMITATIONS")
        sections.append("-"*60)
        sections.append("")
        
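        for issue in issues:
            template = self.LIMITATION_TEMPLATES.get(issue)
            if template is None:
                # Flag unknown keys instead of dropping them silently
                sections.append(f"• [Unrecognized limitation type: {issue}]")
            else:
                sections.append(f"• {template['statement']}")
            sections.append("")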
        
        sections.append("Despite these limitations, this study provides valuable insights...")
        
        return "\n".join(sections)


def main():
    parser = argparse.ArgumentParser(description="Study Limitations Drafter")
    parser.add_argument("--issues", "-i", required=True, 
                       help="Comma-separated limitation types")
    parser.add_argument("--output", "-o", help="Output file")
    
    args = parser.parse_args()
    
    drafter = LimitationsDrafter()
    
    issues = [i.strip() for i in args.issues.split(",")]
    
    text = drafter.generate_limitations(issues)
    print(text)
    
    if args.output:
        # The text contains the non-ASCII bullet "•", so write it as UTF-8 explicitly
        with open(args.output, 'w', encoding='utf-8') as f:
            f.write(text)
        print(f"\nSaved to: {args.output}")


if __name__ == "__main__":
    main()

Statistical Analysis Advisor

Skill

Recommends appropriate statistical methods (T-test vs ANOVA, etc.) based.

---
name: statistical-analysis-advisor
description: Recommends appropriate statistical methods (T-test vs ANOVA, etc.) based.
license: MIT
skill-author: AIPOCH
---
# Statistical Analysis Advisor

Intelligent statistical test recommendation engine that guides users through selecting the right statistical methods for their data.

## When to Use

- Use this skill when the task needs a recommendation of appropriate statistical methods (T-test vs ANOVA, etc.).
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

- Scope-focused workflow for recommending appropriate statistical methods (T-test vs ANOVA, etc.).
- Packaged executable path(s): `scripts/main.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

See `## Prerequisites` below for related details.

- `Python`: `3.10+`. Repository baseline for current packaged skills.
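- `dataclasses`: `unspecified`. Declared in `requirements.txt`; part of the Python standard library, so no separate install is needed.
- `enum`: `unspecified`. Declared in `requirements.txt`; part of the Python standard library, so no separate install is needed.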

## Example Usage

See `## Usage` below for related details.

```bash
cd "20260318/scientific-skills/Data Analytics/statistical-analysis-advisor"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/main.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Capabilities

1. **Statistical Test Selection**
   - Compares and recommends between T-test, ANOVA, Chi-square, Mann-Whitney, Kruskal-Wallis, etc.
   - Considers data type, distribution, sample size, and research question
   - Provides decision tree logic for test selection

2. **Assumption Checking**
   - Normality tests (Shapiro-Wilk, Kolmogorov-Smirnov)
   - Homogeneity of variance (Levene's test, Bartlett's test)
   - Independence verification
   - Outlier detection guidance

3. **Power Analysis & Sample Size**
   - Effect size estimation (Cohen's d, eta-squared, Cramér's V)
   - Sample size calculations for desired power
   - Post-hoc power analysis

## Usage

```python
from scripts.main import StatisticalAdvisor

advisor = StatisticalAdvisor()

# Get test recommendation
recommendation = advisor.recommend_test(
    data_type="continuous",
    groups=2,
    independent=True,
    distribution="normal"
)

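# Check assumptions (group1 and group2 are placeholder samples)
group1 = [5.1, 4.9, 5.6, 5.8, 6.0]
group2 = [6.2, 6.8, 7.1, 6.5, 7.0]
assumptions = advisor.check_assumptions(
    data=[group1, group2],
    test_type="independent_ttest"
)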

# Power analysis
power = advisor.calculate_power(
    effect_size=0.5,
    alpha=0.05,
    sample_size=30
)
```

## Input Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| data_type | str | "continuous", "categorical", "ordinal" |
| groups | int | Number of groups/comparison levels |
| independent | bool | Independent or paired/related samples |
| distribution | str | "normal", "non-normal", "unknown" |
| sample_size | int | Current or planned sample size |
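
The allowed values in the table can be checked before any recommendation is requested. The sketch below is a hypothetical helper, not part of `scripts/main.py`; the name `validate_inputs` and the minimum `sample_size` of 2 are assumptions made for illustration.

```python
# Hypothetical helper (not part of scripts/main.py): checks raw parameter
# values against the allowed values listed in the Input Parameters table.
ALLOWED_DATA_TYPES = {"continuous", "categorical", "ordinal"}
ALLOWED_DISTRIBUTIONS = {"normal", "non-normal", "unknown"}


def validate_inputs(data_type, groups, independent, distribution, sample_size=None):
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    if data_type not in ALLOWED_DATA_TYPES:
        problems.append(f"data_type must be one of {sorted(ALLOWED_DATA_TYPES)}")
    # bool is a subclass of int, so exclude it explicitly
    if not isinstance(groups, int) or isinstance(groups, bool) or groups < 1:
        problems.append("groups must be a positive integer")
    if not isinstance(independent, bool):
        problems.append("independent must be True or False")
    if distribution not in ALLOWED_DISTRIBUTIONS:
        problems.append(f"distribution must be one of {sorted(ALLOWED_DISTRIBUTIONS)}")
    # Assumption: at least 2 observations are needed for any test
    if sample_size is not None and (not isinstance(sample_size, int) or sample_size < 2):
        problems.append("sample_size must be an integer of at least 2")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every missing or invalid field at once.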

## Technical Difficulty: High ⚠️

**Warning**: Statistical recommendations have significant implications for research validity. This skill requires human verification of all recommendations before application in published research.

## References

- See `references/statistical_tests_guide.md` for detailed test selection criteria
- See `references/assumption_tests.md` for assumption checking procedures
- See `references/power_analysis_guide.md` for power calculation methods

## Limitations

- Does not perform actual data analysis (recommendations only)
- Cannot access raw data directly
- Complex multivariate designs may require specialized consultation
- Bayesian alternatives not covered comprehensively

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited

## Prerequisites

```text

# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.

## Input Validation

This skill accepts requests that match the documented purpose of `statistical-analysis-advisor` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `statistical-analysis-advisor` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.

FILE:references/assumption_tests.md
# Assumption Checking Procedures

## Overview

Statistical tests rely on specific assumptions. This guide provides systematic procedures for checking these assumptions and actions to take when assumptions are violated.

## 1. Normality Assumption

### 1.1 Visual Methods

#### Histogram with Normal Overlay
```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

def plot_normality_check(data, title="Normality Check"):
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    # Histogram
    axes[0].hist(data, bins='auto', density=True, alpha=0.7)
    mu, sigma = np.mean(data), np.std(data)
    x = np.linspace(min(data), max(data), 100)
    axes[0].plot(x, stats.norm.pdf(x, mu, sigma), 'r-', lw=2)
    axes[0].set_title(f'Histogram (μ={mu:.2f}, σ={sigma:.2f})')
    
    # Q-Q Plot
    stats.probplot(data, dist="norm", plot=axes[1])
    axes[1].set_title('Q-Q Plot')
    
    plt.suptitle(title)
    plt.tight_layout()
    return fig
```

#### Q-Q Plot Interpretation
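- **Points on the diagonal line**: Approximately normal distribution
- **U-shape or inverted U (one consistent curve)**: Skewness present (with sample quantiles on the y-axis, as in `stats.probplot`, a U-shape indicates right skew and an inverted U indicates left skew)
- **S-shape, left end below and right end above the line**: Heavy-tailed distribution
- **S-shape, left end above and right end below the line**: Light-tailed distribution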
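
A numeric complement to the visual check is to compute sample skewness and excess kurtosis. The sketch below uses the plain moment formulas without bias correction; the function name is illustrative. Positive skewness suggests a longer right tail, and positive excess kurtosis suggests heavier tails than a normal distribution.

```python
import statistics


def skewness_and_excess_kurtosis(data):
    """Moment-based skewness and excess kurtosis (no small-sample bias correction)."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurtosis
```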

### 1.2 Statistical Tests

#### Shapiro-Wilk Test
**Best for:** n < 50

```python
from scipy import stats

statistic, p_value = stats.shapiro(data)
# H0: Data comes from normal distribution
# If p < 0.05, reject normality
```

**Properties:**
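- Among the most powerful normality tests across common alternatives
- Low power in very small samples, so a non-significant result does not establish normality
- In large samples it flags even trivial departures; SciPy notes that the p-value may be inaccurate for n > 5000
- Recommended for n < 50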

#### Kolmogorov-Smirnov Test
**Best for:** Larger samples, comparing to theoretical distribution

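```python
mean, std = np.mean(data), np.std(data, ddof=1)  # parameters estimated from the sample
statistic, p_value = stats.kstest(data, 'norm', args=(mean, std))
```

**Note:** Less powerful than Shapiro-Wilk for normality testing. When the mean and standard deviation are estimated from the same data, the standard K-S p-value is too conservative; the Lilliefors correction (`statsmodels.stats.diagnostic.lilliefors`) accounts for this.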

#### Anderson-Darling Test
**Best for:** Detecting tail deviations

```python
result = stats.anderson(data, dist='norm')
# Compare statistic to critical values at different significance levels
```

### 1.3 When to Care About Normality

| Test | Normality Importance | Sample Size Robustness |
|------|---------------------|----------------------|
| T-Test | Moderate | Robust with n > 30 per group |
| ANOVA | Moderate-High | Robust with balanced n > 30 |
| Regression | High (residuals) | Depends on design |

### 1.4 Actions if Normality Violated

1. **Transform data:**
   - Log transformation (right skew)
   - Square root transformation (count data)
   - Box-Cox transformation (find optimal λ)

2. **Use non-parametric alternative:**
   - Mann-Whitney U instead of t-test
   - Kruskal-Wallis instead of ANOVA

3. **Use robust methods:**
   - Trimmed means
   - Bootstrap confidence intervals
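
As an illustration of the bootstrap option, here is a minimal percentile-bootstrap sketch for the mean using only the standard library. The function name, the default of 2000 resamples, and the fixed seed are assumptions chosen for reproducibility, not recommendations from the packaged skill.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    n = len(data)
    # Resample with replacement and record the mean of each resample
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper
```

The percentile method is the simplest variant; bias-corrected intervals are often preferred for strongly skewed statistics.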

## 2. Homogeneity of Variance (Homoscedasticity)

### 2.1 Levene's Test
**Recommended for:** Non-normal distributions

```python
from scipy import stats

statistic, p_value = stats.levene(group1, group2, group3)
# H0: Equal variances across groups
# If p < 0.05, variances are unequal
```

### 2.2 Bartlett's Test
**Recommended for:** Normal distributions (more powerful)

```python
statistic, p_value = stats.bartlett(group1, group2, group3)
```

**Caution:** Sensitive to non-normality.

### 2.3 Fligner-Killeen Test
**Recommended for:** Non-normal data, very robust

```python
statistic, p_value = stats.fligner(group1, group2, group3)
```

### 2.4 Actions if Homogeneity Violated

| Test | Alternative |
|------|-------------|
| Independent T-Test | Welch's T-Test |
| One-Way ANOVA | Welch's ANOVA |
| Repeated Measures | Use mixed models |
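
To show what Welch's t-test changes, the sketch below computes the Welch statistic and the Welch-Satterthwaite degrees of freedom by hand. The function name `welch_t` is illustrative; in practice `stats.ttest_ind(..., equal_var=False)` performs the full test.

```python
import math
import statistics


def welch_t(group1, group2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    v1 = statistics.variance(group1)  # sample variance (n - 1 denominator)
    v2 = statistics.variance(group2)
    se1, se2 = v1 / n1, v2 / n2  # squared standard errors of each mean
    t = (statistics.fmean(group1) - statistics.fmean(group2)) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df
```

With equal group sizes and equal variances the degrees of freedom reduce to `n1 + n2 - 2`, matching the pooled test.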

## 3. Independence Assumption

### 3.1 Checking Independence

**Study Design Review:**
- Were subjects randomly assigned?
- Are measurements truly independent?
- Any clustering/nesting present?

### 3.2 Durbin-Watson Test (for regression/time series)
```python
from statsmodels.stats.stattools import durbin_watson

dw_stat = durbin_watson(residuals)
# ~2: No autocorrelation
# < 1.5 or > 2.5: Possible autocorrelation
```
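
For intuition, the Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by their sum of squares. A minimal pure-Python sketch (the name `durbin_watson_stat` is illustrative, not a statsmodels API):

```python
def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator
```

Values near 0 arise when neighbouring residuals are similar (positive autocorrelation), and values near 4 arise when they alternate in sign (negative autocorrelation).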

### 3.3 Actions if Independence Violated

- Use mixed-effects models
- Include cluster-robust standard errors
- Use generalized estimating equations (GEE)

## 4. Linearity (for Regression/Correlation)

### 4.1 Visual Check
```python
# Residuals vs Fitted plot
plt.scatter(fitted_values, residuals)
plt.axhline(y=0, color='r', linestyle='--')
```

**Patterns:**
- **Random scatter**: Linearity satisfied
- **Curved pattern**: Non-linearity present
- **Funnel shape**: Heteroscedasticity

### 4.2 Actions if Non-linear

- Polynomial terms
- Transform variables
- Use non-linear regression
- Splines/GAMs

## 5. Sphericity (Repeated Measures)

### 5.1 Mauchly's Test
```r
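# In R (mauchly.test is in the base stats package and expects an "mlm" fit)
# Assumes wide-format data: one row per subject, one column per time point.
# `wide_df` and the columns t1, t2, t3 are placeholder names.
mlm_fit <- lm(cbind(t1, t2, t3) ~ 1, data = wide_df)
mauchly.test(mlm_fit, X = ~1)  # sphericity of the within-subject contrasts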
```

### 5.2 Corrections
- Greenhouse-Geisser (ε < 0.75)
- Huynh-Feldt (ε > 0.75)

## 6. Outlier Detection

### 6.1 Methods

#### Z-Score Method
```python
z_scores = np.abs(stats.zscore(data))
outliers = np.where(z_scores > 3)[0]  # |z| > 3
```

#### IQR Method
```python
data = np.asarray(data)  # boolean-mask indexing below needs an array
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
outliers = data[(data < Q1 - 1.5*IQR) | (data > Q3 + 1.5*IQR)]
```

#### Modified Z-Score (using MAD)
```python
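data = np.asarray(data, dtype=float)
median = np.median(data)
mad = np.median(np.abs(data - median))  # median absolute deviation
# 0.6745 rescales the MAD to match the SD under normality; undefined if mad == 0
modified_z = 0.6745 * (data - median) / mad
outliers = np.where(np.abs(modified_z) > 3.5)[0]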
```

### 6.2 Actions for Outliers

1. **Investigate**: Data entry error?
2. **Report**: Analyze with and without outliers
3. **Transform**: Reduce outlier impact
4. **Robust methods**: Use resistant statistics
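
The "analyze with and without outliers" step can be sketched as follows, using the IQR rule from Section 6.1 and only the standard library. The function name and the returned keys are illustrative assumptions.

```python
import statistics


def summarize_with_and_without_outliers(data, k=1.5):
    """Compare summaries computed on all data and on data inside the IQR fences."""
    # method="inclusive" uses linear interpolation between order statistics
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [x for x in data if low <= x <= high]
    flagged = [x for x in data if x < low or x > high]
    return {
        "mean_all": statistics.fmean(data),
        "mean_without_outliers": statistics.fmean(kept),
        "median_all": statistics.median(data),
        "flagged": flagged,
    }
```

Reporting both means alongside the median makes the influence of the flagged values visible without silently discarding them.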

## 7. Multicollinearity (Regression)

### 7.1 Variance Inflation Factor (VIF)
```python
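import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X is a DataFrame of predictors; it should include a constant column
# (e.g. from statsmodels.api.add_constant), otherwise VIFs can be misleading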
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
```

**Interpretation:**
- VIF = 1: No correlation
- VIF 1-5: Moderate correlation
- VIF > 5: High correlation (problematic)
- VIF > 10: Severe multicollinearity
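
With exactly two predictors, the VIF of each reduces to `1 / (1 - r²)`, where `r` is their Pearson correlation. The sketch below uses this special case to show how the thresholds above relate to correlation strength; both function names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for the two-predictor special case: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)
```

For example, a VIF of 5 corresponds to `r² = 0.8` between the two predictors, and a VIF of 10 to `r² = 0.9`.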

### 7.2 Actions
- Remove correlated predictors
- Combine into composite variable
- Use PCA
- Ridge regression

## 8. Comprehensive Assumption Check Function

```python
import numpy as np
from scipy import stats

def comprehensive_assumption_check(data=None, groups=None, test_type="ttest"):
    """
    Comprehensive assumption checking for common statistical tests.
    
    Parameters:
    -----------
    data : array-like, optional
        Single sample or paired differences (ignored when groups is given)
    groups : list of array-like, optional
        Separate groups for group comparisons
    test_type : str
        Type of test planned ("ttest", "anova", "regression");
        currently informational only
    
    Returns:
    --------
    dict : Assumption check results
    """
    results = {}
    
    if groups is not None:
        # Multiple groups: convert once so the checks accept lists or arrays
        groups = [np.asarray(g, dtype=float) for g in groups]
        
        # Normality per group
        results['normality'] = {}
        for i, group in enumerate(groups):
            stat, pval = stats.shapiro(group)
            results['normality'][f'group_{i+1}'] = {
                'statistic': stat,
                'p_value': pval,
                'passed': pval > 0.05
            }
        
        # Homogeneity
        stat, pval = stats.levene(*groups)
        results['homogeneity'] = {
            'test': 'Levene',
            'statistic': stat,
            'p_value': pval,
            'passed': pval > 0.05
        }
        
        # Sample sizes
        results['sample_sizes'] = [len(g) for g in groups]
    
    else:
        # Single sample or paired differences
        stat, pval = stats.shapiro(data)
        results['normality'] = {
            'statistic': stat,
            'p_value': pval,
            'passed': pval > 0.05
        }
    
    return results
```

## 9. Quick Reference: Assumption Violation Actions

| Assumption | Test | If Violated |
|------------|------|-------------|
| Normality | Shapiro-Wilk | Non-parametric test or transformation |
| Homogeneity | Levene's | Welch's test or Brown-Forsythe |
| Independence | Design review | Mixed models or GEE |
| Linearity | Residual plots | Polynomial/spline terms |
| Sphericity | Mauchly's | GG/HF corrections |
| Multicollinearity | VIF | Remove/combine variables |

## References

1. Tabachnick, B. G., & Fidell, L. S. (2019). Using Multivariate Statistics (7th ed.).
2. Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.).
3. Wilcox, R. R. (2017). Introduction to Robust Estimation and Hypothesis Testing (4th ed.).

FILE:references/power_analysis_guide.md
# Power Analysis Guide

## Overview

Statistical power is the probability of correctly rejecting a false null hypothesis (1 - β). This guide covers power analysis for common statistical tests and sample size determination.

## 1. Key Concepts

### 1.1 Components of Power Analysis

| Component | Symbol | Description |
|-----------|--------|-------------|
| Significance Level | α | Probability of Type I error (typically 0.05) |
| Power | 1-β | Probability of detecting true effect (typically 0.80) |
| Effect Size | d, f, r | Standardized magnitude of effect |
| Sample Size | n | Number of observations |

### 1.2 Effect Size Conventions

#### Cohen's d (for t-tests)
```
d = (M1 - M2) / SD_pooled
```

| Effect Size | d | Interpretation |
|-------------|---|----------------|
| Small | 0.2 | Barely perceptible |
| Medium | 0.5 | Observable difference |
| Large | 0.8 | Obvious difference |

#### Cohen's f (for ANOVA)
```
f = σ_means / σ_within
```

| Effect Size | f | η² |
|-------------|---|-----|
| Small | 0.1 | 0.01 |
| Medium | 0.25 | 0.06 |
| Large | 0.4 | 0.14 |
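
The two columns of the table are linked by `η² = f² / (1 + f²)`. A short conversion sketch (function names are illustrative):

```python
import math


def f_to_eta_squared(f):
    """Convert Cohen's f to eta-squared."""
    return f ** 2 / (1.0 + f ** 2)


def eta_squared_to_f(eta_squared):
    """Convert eta-squared back to Cohen's f."""
    return math.sqrt(eta_squared / (1.0 - eta_squared))
```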

#### Cramér's V (for Chi-Square)
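```
V = sqrt(χ² / (n * (k - 1)))
```

Here `n` is the total sample size and `k` is the smaller of the number of rows and columns in the contingency table.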

| Effect Size | V |
|-------------|---|
| Small | 0.1 |
| Medium | 0.3 |
| Large | 0.5 |
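
As a worked illustration, Cramér's V can be computed directly from a table of observed counts. The sketch below assumes a list-of-rows table with no empty row or column; the function name is illustrative.

```python
import math


def cramers_v(table):
    """Cramér's V from a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals))  # smaller table dimension
    return math.sqrt(chi2 / (n * (k - 1)))
```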

## 2. Power Calculation Methods

### 2.1 Analytical Approach

#### Two-Sample T-Test
```python
import scipy.stats as stats
import numpy as np

def power_ttest(effect_size, n_per_group, alpha=0.05):
    """Calculate power for independent t-test."""
    # Non-centrality parameter
    ncp = effect_size * np.sqrt(n_per_group / 2)
    
    # Critical t-value
    df = 2 * n_per_group - 2
    t_crit = stats.t.ppf(1 - alpha/2, df)
    
    # Power calculation
    power = 1 - stats.nct.cdf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)
    return power
```

#### Sample Size Formula (Two-Sample T-Test)
```python
def sample_size_ttest(effect_size, power=0.8, alpha=0.05):
    """Required sample size per group (normal approximation; the exact t-based value is typically 1-2 larger)."""
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)
    
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return int(np.ceil(n))
```

### 2.2 Simulation Approach

```python
def power_by_simulation(effect_size, n_per_group, n_simulations=10000, alpha=0.05):
    """
    Estimate power through Monte Carlo simulation.
    More flexible for complex designs.
    """
    significant_results = 0
    
    for _ in range(n_simulations):
        # Simulate data under alternative hypothesis
        group1 = np.random.normal(0, 1, n_per_group)
        group2 = np.random.normal(effect_size, 1, n_per_group)
        
        # Perform test
        _, p_value = stats.ttest_ind(group1, group2)
        
        if p_value < alpha:
            significant_results += 1
    
    return significant_results / n_simulations
```

## 3. Practical Power Analysis Examples

### 3.1 Two-Sample T-Test

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# Calculate power
effect_size = 0.5  # Medium effect
alpha = 0.05
n_per_group = 30

power = power_analysis.power(effect_size=effect_size, 
                             nobs1=n_per_group, 
                             alpha=alpha)
print(f"Power: {power:.2%}")

# Calculate sample size
required_n = power_analysis.solve_power(effect_size=effect_size,
                                         power=0.8,
                                         alpha=alpha)
print(f"Required n per group: {int(np.ceil(required_n))}")
```

### 3.2 One-Way ANOVA

```python
from statsmodels.stats.power import FTestAnovaPower

power_analysis = FTestAnovaPower()

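# Calculate power for 3 groups
effect_size = 0.25  # f = 0.25 (medium)
alpha = 0.05
n_per_group = 20
k_groups = 3

# Note: statsmodels expects nobs to be the TOTAL sample size across all groups
power = power_analysis.power(effect_size=effect_size,
                             nobs=n_per_group * k_groups,
                             alpha=alpha,
                             k_groups=k_groups)

# Calculate sample size (solve_power also returns the total N)
required_total_n = power_analysis.solve_power(effect_size=effect_size,
                                              power=0.8,
                                              alpha=alpha,
                                              k_groups=k_groups)
required_n_per_group = required_total_n / k_groups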
```

### 3.3 Chi-Square Test

```python
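from scipy import stats

def power_chi_square(effect_size, n, df=1, alpha=0.05):
    """
    Approximate power for chi-square test.
    effect_size: Cohen's w (equals phi, and Cramér's V, for a 2x2 table;
    in general w = V * sqrt(k - 1), with k the smaller table dimension)
    """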
    # Non-centrality parameter
    ncp = n * effect_size ** 2
    
    # Critical value
    crit = stats.chi2.ppf(1 - alpha, df)
    
    # Power
    power = 1 - stats.ncx2.cdf(crit, df, ncp)
    return power
```

## 4. Sample Size Determination

### 4.1 Planning Table: T-Test

| Effect Size | Power = 0.80 | Power = 0.90 | Power = 0.95 |
|-------------|--------------|--------------|--------------|
| Small (d=0.2) | 393/group | 526/group | 651/group |
| Medium (d=0.5) | 64/group | 86/group | 105/group |
| Large (d=0.8) | 26/group | 34/group | 42/group |
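
The table can be sanity-checked with the normal-approximation formula from Section 2.1. The sketch below uses only the standard library (`statistics.NormalDist`); the function name is illustrative. Because it ignores the t-distribution correction, it typically returns values one or two participants below the exact figures.

```python
import math
from statistics import NormalDist


def approx_n_per_group(effect_size, power=0.80, alpha=0.05):
    """Normal-approximation sample size per group for a two-sided two-sample test."""
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):
    print(d, approx_n_per_group(d))
```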

### 4.2 Planning Table: ANOVA (3 groups)

| Effect Size | Power = 0.80 | Power = 0.90 |
|-------------|--------------|--------------|
| Small (f=0.1) | ≈323/group | ≈423/group |
| Medium (f=0.25) | ≈53/group | ≈69/group |
| Large (f=0.4) | ≈22/group | ≈28/group |

### 4.3 Accounting for Attrition

```python
def adjust_for_attrition(n_required, attrition_rate=0.20):
    """
    Adjust sample size for expected dropout.
    
    Parameters:
    -----------
    n_required : int
        Required completers
    attrition_rate : float
        Expected dropout rate (0.20 = 20%)
    
    Returns:
    --------
    int : Adjusted enrollment target
    """
    n_enroll = int(np.ceil(n_required / (1 - attrition_rate)))
    return n_enroll

# Example
n_needed = 64  # Completers needed
n_enroll = adjust_for_attrition(n_needed, attrition_rate=0.20)
print(f"Enroll {n_enroll} to get {n_needed} completers with 20% attrition")
```

## 5. Power Curves

```python
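import matplotlib.pyplot as plt
import numpy as np
from statsmodels.stats.power import TTestIndPower

def plot_power_curve(effect_sizes, sample_sizes, alpha=0.05):
    """Generate power curves for different sample sizes."""
    power_analysis = TTestIndPower()  # local instance; no reliance on a global
    fig, ax = plt.subplots(figsize=(10, 6))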
    
    for n in sample_sizes:
        powers = []
        for es in effect_sizes:
            power = power_analysis.power(effect_size=es, nobs1=n, alpha=alpha)
            powers.append(power)
        ax.plot(effect_sizes, powers, label=f'n={n}', marker='o')
    
    ax.axhline(y=0.8, color='r', linestyle='--', label='Target Power (80%)')
    ax.set_xlabel('Effect Size (Cohen\'s d)')
    ax.set_ylabel('Power')
    ax.set_title('Power Curves for Independent T-Test')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    return fig

# Generate curve
effect_sizes = np.arange(0.1, 1.0, 0.1)
sample_sizes = [20, 30, 50, 100]
plot_power_curve(effect_sizes, sample_sizes)
```

## 6. Post-Hoc Power Analysis

**Warning:** Post-hoc (observed) power is a one-to-one function of the p-value, so it adds no information beyond the test result. Better to report:
- Observed effect size
- Confidence interval
- p-value

If you must calculate observed power:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

def observed_power_from_pvalue(p_value, n_per_group, alpha=0.05):
    """
    Calculate observed power from obtained p-value.
    Not recommended for general use.
    """
    # Back-calculate effect size from p-value
    df = 2 * n_per_group - 2
    t_observed = stats.t.ppf(1 - p_value/2, df)
    
    # Approximate effect size
    d_observed = t_observed * np.sqrt(2/n_per_group)
    
    # Calculate power at observed effect (local instance; no reliance on a global)
    power = TTestIndPower().power(effect_size=d_observed, 
                                  nobs1=n_per_group, 
                                  alpha=alpha)
    return power, d_observed
```

## 7. Sensitivity Analysis

```python
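from statsmodels.stats.power import TTestIndPower

def sensitivity_analysis(n_per_group, min_power=0.8, alpha=0.05):
    """
    Determine minimum detectable effect size for given sample.
    """
    power_analysis = TTestIndPower()  # local instance; no reliance on a global
    # Binary search for effect size that achieves target power
    low, high = 0.01, 2.0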
    
    while high - low > 0.001:
        mid = (low + high) / 2
        power = power_analysis.power(effect_size=mid, 
                                     nobs1=n_per_group, 
                                     alpha=alpha)
        if power < min_power:
            low = mid
        else:
            high = mid
    
    return mid

# Example: What effect size can we detect with n=50?
mde = sensitivity_analysis(50, min_power=0.8)
print(f"Minimum detectable effect size (d): {mde:.2f}")
```

## 8. Multiplicity Corrections

When conducting multiple tests, adjust sample size calculations:

```python
def bonferroni_adjusted_alpha(alpha, n_tests):
    """Conservative correction for multiple comparisons."""
    return alpha / n_tests

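def fdr_adjusted_sample_size(n_original, n_tests, alpha=0.05):
    """
    Approximate adjustment for FDR control.
    Conservative approach: increase sample size by factor.
    Note: alpha is currently unused.
    """
    # Harmonic-sum factor c(m) = 1 + 1/2 + ... + 1/m. This is the correction
    # used by the Benjamini-Yekutieli procedure (not Benjamini-Hochberg);
    # applying it to n is a rough rule of thumb, not a derived result.
    adjustment = np.sum(1 / np.arange(1, n_tests + 1))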
    return int(np.ceil(n_original * adjustment))
```
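
To illustrate how a Bonferroni-adjusted α feeds into a sample-size calculation, the sketch below plugs `alpha / n_tests` into the normal-approximation formula for a two-sample comparison. It uses only the standard library, and the function name is an assumption made for this example.

```python
import math
from statistics import NormalDist


def n_per_group_bonferroni(effect_size, n_tests, power=0.80, alpha=0.05):
    """Normal-approximation n per group when each test uses alpha / n_tests."""
    adjusted_alpha = alpha / n_tests  # Bonferroni correction
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - adjusted_alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


print(n_per_group_bonferroni(0.5, n_tests=1))
print(n_per_group_bonferroni(0.5, n_tests=5))
```

Under this approximation, five planned tests at a medium effect size raise the requirement from 63 to 94 participants per group.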

## 9. Equivalence and Non-Inferiority

```python
def sample_size_equivalence(margin, sd, alpha=0.05, power=0.8):
    """
    Sample size for equivalence test (TOST).
    
    Parameters:
    -----------
    margin : float
        Equivalence margin
    sd : float
        Standard deviation
    """
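    # Two one-sided tests approach, assuming a true difference of zero:
    # the type II error is split across the two one-sided tests
    z_alpha = stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(1 - (1 - power) / 2)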
    
    n = 2 * ((z_alpha + z_beta) * sd / margin) ** 2
    return int(np.ceil(n))
```

## 10. Reporting Guidelines

Include in methods section:

```
Sample size was determined based on a power analysis for an 
independent samples t-test. With an expected medium effect 
size (Cohen's d = 0.5), α = 0.05, and desired power of 0.80, 
a minimum of 64 participants per group (128 total) was required. 
To account for an anticipated 15% attrition rate, we aimed to 
enroll 76 participants per group (152 total).
```

## 11. Software Comparison

| Software | Function/Package | Pros | Cons |
|----------|-----------------|------|------|
| Python | statsmodels | Free, flexible | Requires coding |
| R | pwr package | Comprehensive | Learning curve |
| G*Power | GUI application | User-friendly | Limited customization |
| Stata | power command | Integrated | Commercial |

## References

1. Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.).
2. Faul, F., et al. (2009). G*Power 3.1. Behavior Research Methods, 41(4), 1149-1160.
3. Perugini, M., et al. (2018). A Practical Primer To Power Analysis.
4. Lakens, D. (2021). Sample Size Justification. PsyArXiv.

FILE:references/statistical_tests_guide.md
# Statistical Tests Guide

## Overview

This guide provides detailed criteria for selecting appropriate statistical tests based on research design, data characteristics, and distributional assumptions.

## 1. Two-Group Comparisons

### 1.1 Independent Samples

#### T-Test (Independent)
**Use when:**
- Continuous outcome variable
- Two independent groups
- Approximately normal distribution
- Homogeneity of variances (can use Welch's if violated)

**Formula:**
```
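Pooled (equal variances):  t = (M1 - M2) / (s_p * sqrt(1/n1 + 1/n2))
                           s_p² = ((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2)
Welch (unequal variances): t = (M1 - M2) / sqrt(s1²/n1 + s2²/n2)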
```

**Assumptions:**
1. Independence of observations
2. Normality (robust to violations with n > 30 per group)
3. Homogeneity of variance (use Welch's t-test if violated)

#### Mann-Whitney U Test (Wilcoxon Rank-Sum)
**Use when:**
- Ordinal or continuous non-normal data
- Two independent groups
- Assumption of normality severely violated
- Outliers present

**Advantages:**
- Robust to outliers
- No normality assumption (interpreting the result as a difference in medians requires similarly shaped distributions)
- Works with ordinal data

**Disadvantages:**
- Less powerful than t-test when normality holds
- Tests different null hypothesis (distribution equality vs. mean equality)

### 1.2 Paired/Related Samples

#### Paired T-Test
**Use when:**
- Same subjects measured twice
- Matched pairs design
- Differences are approximately normal

**Formula:**
```
t = M_diff / (SD_diff / sqrt(n))
```
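
A minimal sketch of the paired formula above, computed from the per-subject differences with the standard library only. The function name and the `after - before` direction are illustrative choices.

```python
import math
import statistics


def paired_t(before, after):
    """Paired t statistic and degrees of freedom from per-subject differences."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)  # SD_diff / sqrt(n)
    t = statistics.fmean(diffs) / standard_error
    return t, n - 1
```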

#### Wilcoxon Signed-Rank Test
**Use when:**
- Paired design with non-normal differences
- Ordinal paired data
- Robust alternative to paired t-test

## 2. Three or More Groups (ANOVA Family)

### 2.1 One-Way ANOVA
**Use when:**
- Continuous outcome
- Three or more independent groups
- Normal distribution within each group
- Homogeneity of variances

**Formula:**
```
F = MS_between / MS_within
```
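
The F ratio can be written out from its sums of squares. The sketch below is a plain-Python illustration (the function name is an assumption); `stats.f_oneway` remains the practical choice because it also returns the p-value.

```python
import statistics


def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for independent groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group variability: group means around the grand mean
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group variability: observations around their own group mean
    ss_within = sum(
        sum((x - statistics.fmean(g)) ** 2 for x in g) for g in groups
    )
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within, k - 1, n_total - k
```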

**Post-hoc tests:**
- Tukey HSD (all pairwise comparisons)
- Bonferroni (conservative, planned comparisons)
- Scheffé (complex comparisons)

### 2.2 Welch's ANOVA
**Use when:**
- Heterogeneous variances across groups
- More robust than standard ANOVA

### 2.3 Kruskal-Wallis H Test
**Use when:**
- Non-normal distributions
- Ordinal data with 3+ groups
- Non-parametric alternative to one-way ANOVA

**Post-hoc:** Dunn's test with Bonferroni correction

### 2.4 Repeated Measures ANOVA
**Use when:**
- Same subjects measured at multiple time points
- Assesses within-subject effects

**Assumptions:**
1. Sphericity (Mauchly's test)
2. Normality of residuals
3. No significant outliers

**Correction for sphericity violation:**
- Greenhouse-Geisser (conservative)
- Huynh-Feldt (less conservative)

### 2.5 Friedman Test
**Use when:**
- Repeated measures with non-normal data
- Non-parametric alternative to repeated measures ANOVA

## 3. Categorical Data Analysis

### 3.1 Chi-Square Test of Independence
**Use when:**
- Two categorical variables
- Independent observations
- Expected frequencies ≥ 5

**Formula:**
```
χ² = Σ((O - E)² / E)
```

**Requirements:**
- No more than 20% of cells with expected count < 5
- No cells with expected count < 1
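
The requirements above can be checked mechanically from the expected counts. The sketch below assumes a list-of-rows table of observed counts; both function names are illustrative.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def chi_square_requirements_met(table):
    """True if at most 20% of expected counts are below 5 and none is below 1."""
    cells = [e for row in expected_counts(table) for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return share_below_5 <= 0.20 and min(cells) >= 1
```

When the check fails for a 2×2 table, Fisher's exact test (Section 3.2) is the usual alternative.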

### 3.2 Fisher's Exact Test
**Use when:**
- Small sample sizes
- Expected cell counts < 5
- 2×2 contingency tables

**Advantages:**
- Exact p-values (not approximate)
- Valid for any sample size

**Limitations:**
- Computationally intensive for large tables

### 3.3 McNemar's Test
**Use when:**
- Paired categorical data
- Before/after design with binary outcome
- Matched case-control studies
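
McNemar's test depends only on the discordant pairs. The sketch below computes the uncorrected statistic `(b - c)² / (b + c)` and its chi-square p-value with one degree of freedom, using the identity `P(χ²₁ > x) = erfc(sqrt(x / 2))`. The function name is illustrative, and no continuity correction or exact binomial variant is applied.

```python
import math


def mcnemar_test(b, c):
    """b, c: counts of the two kinds of discordant pairs (e.g. yes->no, no->yes)."""
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of change
    statistic = (b - c) ** 2 / (b + c)
    # Upper tail of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

With few discordant pairs, an exact binomial version of the test is generally preferred.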

## 4. Correlation Analysis

### 4.1 Pearson Correlation
**Use when:**
- Two continuous variables
- Linear relationship
- Bivariate normality

### 4.2 Spearman's Rank Correlation
**Use when:**
- Ordinal data
- Non-linear monotonic relationship
- Outliers present
- Non-normal distributions

### 4.3 Kendall's Tau
**Use when:**
- Small sample sizes
- Tied ranks present
- More robust than Spearman

## 5. Regression Analysis

### 5.1 Linear Regression
**Use when:**
- Continuous outcome
- Linear relationship with predictors
- Independent errors

**Assumptions:**
1. Linearity
2. Independence of errors
3. Homoscedasticity
4. Normality of residuals
5. No multicollinearity

### 5.2 Logistic Regression
**Use when:**
- Binary outcome variable
- Predicting probability of event

### 5.3 Poisson Regression
**Use when:**
- Count data as outcome
- Rate/ratio outcomes

## 6. Decision Tree Summary

```
Data Type?
├── Continuous
│   ├── 2 Groups
│   │   ├── Independent → T-Test / Mann-Whitney
│   │   └── Paired → Paired T-Test / Wilcoxon
│   └── 3+ Groups
│       ├── Independent → ANOVA / Kruskal-Wallis
│       └── Repeated → RM-ANOVA / Friedman
├── Categorical
│   ├── 2×2 Table → Chi-Square / Fisher's Exact
│   └── Larger Table → Chi-Square
└── Ordinal
    └── Non-parametric tests (Mann-Whitney, Kruskal-Wallis)
```

## 7. Effect Size Measures

| Test | Effect Size | Small | Medium | Large |
|------|-------------|-------|--------|-------|
| T-Test | Cohen's d | 0.2 | 0.5 | 0.8 |
| ANOVA | η² (eta-squared) | 0.01 | 0.06 | 0.14 |
| ANOVA | f | 0.1 | 0.25 | 0.4 |
| Chi-Square | Cramér's V | 0.1 | 0.3 | 0.5 |
| Correlation | r | 0.1 | 0.3 | 0.5 |

## 8. Software Implementation

### Python
```python
import scipy.stats as stats

# T-Test
stats.ttest_ind(group1, group2)
stats.ttest_ind(group1, group2, equal_var=False)  # Welch's

# ANOVA
stats.f_oneway(group1, group2, group3)

# Non-parametric
stats.mannwhitneyu(group1, group2)
stats.kruskal(group1, group2, group3)

# Chi-Square
stats.chi2_contingency(contingency_table)
```

### R
```r
# T-Test
t.test(outcome ~ group, data = df, var.equal = TRUE)  # Student's (pooled variance)
t.test(outcome ~ group, data = df)  # Welch's (R's default, var.equal = FALSE)

# ANOVA
aov(outcome ~ group, data = df)

# Non-parametric
wilcox.test(outcome ~ group, data = df)
kruskal.test(outcome ~ group, data = df)

# Chi-Square
chisq.test(table)
fisher.test(table)
```

## References

1. Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.). SAGE.
2. Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.).
3. Wilcox, R. R. (2012). Introduction to Robust Estimation and Hypothesis Testing.

FILE:requirements.txt
dataclasses
enum

FILE:scripts/main.py
#!/usr/bin/env python3
"""
Statistical Analysis Advisor
Recommends appropriate statistical methods based on dataset characteristics.
"""

from dataclasses import dataclass
from typing import List, Dict, Optional, Tuple, Any
from enum import Enum
import math


class DataType(Enum):
    CONTINUOUS = "continuous"
    CATEGORICAL = "categorical"
    ORDINAL = "ordinal"


class Distribution(Enum):
    NORMAL = "normal"
    NON_NORMAL = "non-normal"
    UNKNOWN = "unknown"


@dataclass
class TestRecommendation:
    """Result container for statistical test recommendations."""
    recommended_test: str
    alternative_tests: List[str]
    assumptions: List[str]
    warnings: List[str]
    rationale: str


@dataclass
class AssumptionCheck:
    """Result container for assumption checking."""
    assumption: str
    test_name: str
    satisfied: bool
    details: str
    recommendation: str


@dataclass
class PowerAnalysis:
    """Result container for power analysis."""
    effect_size: float
    sample_size: int
    alpha: float
    power: float
    interpretation: str


class StatisticalAdvisor:
    """
    Intelligent advisor for statistical test selection and experimental design.
    """
    
    def __init__(self):
        self.test_decision_tree = self._build_decision_tree()
        self.assumption_tests = self._load_assumption_tests()
    
    def _build_decision_tree(self) -> Dict:
        """Build the statistical test decision tree."""
        return {
            DataType.CONTINUOUS: {
                2: {  # Two groups
                    True: {  # Independent
                        Distribution.NORMAL: "Independent Samples T-Test",
                        Distribution.NON_NORMAL: "Mann-Whitney U Test",
                        Distribution.UNKNOWN: "Mann-Whitney U Test (robust)"
                    },
                    False: {  # Paired
                        Distribution.NORMAL: "Paired Samples T-Test",
                        Distribution.NON_NORMAL: "Wilcoxon Signed-Rank Test",
                        Distribution.UNKNOWN: "Wilcoxon Signed-Rank Test (robust)"
                    }
                },
                "3+": {  # Three or more groups
                    True: {  # Independent
                        Distribution.NORMAL: "One-Way ANOVA",
                        Distribution.NON_NORMAL: "Kruskal-Wallis H Test",
                        Distribution.UNKNOWN: "Kruskal-Wallis H Test (robust)"
                    },
                    False: {  # Repeated measures
                        Distribution.NORMAL: "Repeated Measures ANOVA",
                        Distribution.NON_NORMAL: "Friedman Test",
                        Distribution.UNKNOWN: "Friedman Test (robust)"
                    }
                }
            },
            DataType.CATEGORICAL: {
                2: {
                    True: {
                        "default": "Chi-Square Test of Independence",
                        "expected_low": "Fisher's Exact Test"
                    }
                },
                "3+": {
                    True: {
                        "default": "Chi-Square Test of Independence"
                    }
                }
            },
            DataType.ORDINAL: {
                2: {
                    True: {
                        "default": "Mann-Whitney U Test"
                    },
                    False: {
                        "default": "Wilcoxon Signed-Rank Test"
                    }
                },
                "3+": {
                    True: {
                        "default": "Kruskal-Wallis H Test"
                    },
                    False: {
                        "default": "Friedman Test"
                    }
                }
            }
        }
    
    def _load_assumption_tests(self) -> Dict:
        """Load assumption checking procedures."""
        return {
            "independent_ttest": {
                "normality": {
                    "test": "Shapiro-Wilk",
                    "alternative": "Kolmogorov-Smirnov",
                    "action_if_violated": "Use Mann-Whitney U Test"
                },
                "homogeneity": {
                    "test": "Levene's Test",
                    "alternative": "Bartlett's Test",
                    "action_if_violated": "Use Welch's T-Test"
                },
                "independence": {
                    "check": "Study design review",
                    "action_if_violated": "Non-independent data requires mixed models"
                }
            },
            "paired_ttest": {
                "normality": {
                    "test": "Shapiro-Wilk on differences",
                    "action_if_violated": "Use Wilcoxon Signed-Rank Test"
                },
                "independence_of_pairs": {
                    "check": "Study design review",
                    "action_if_violated": "Requires specialized paired analysis"
                }
            },
            "anova": {
                "normality": {
                    "test": "Shapiro-Wilk (per group)",
                    "action_if_violated": "Consider transformation or Kruskal-Wallis"
                },
                "homogeneity": {
                    "test": "Levene's Test",
                    "action_if_violated": "Use Welch's ANOVA or Brown-Forsythe"
                },
                "independence": {
                    "check": "Study design review",
                    "action_if_violated": "Requires mixed-effects models"
                }
            },
            "chi_square": {
                "expected_frequencies": {
                    "rule": "All expected counts ≥ 5",
                    "alternative_rule": "No more than 20% below 5",
                    "action_if_violated": "Use Fisher's Exact Test"
                },
                "independence": {
                    "check": "Study design review",
                    "action_if_violated": "Use McNemar's Test for paired data"
                }
            }
        }
    
    def recommend_test(
        self,
        data_type: str,
        groups: int,
        independent: bool,
        distribution: str = "unknown",
        paired: Optional[bool] = None,
        expected_cell_counts: Optional[int] = None
    ) -> TestRecommendation:
        """
        Recommend appropriate statistical test based on data characteristics.
        
        Args:
            data_type: "continuous", "categorical", or "ordinal"
            groups: Number of groups (2 or more)
            independent: Whether samples are independent
            distribution: "normal", "non-normal", or "unknown"
            paired: Whether samples are paired (overrides independent if set)
            expected_cell_counts: For chi-square, minimum expected count
        
        Returns:
            TestRecommendation with recommended test and alternatives
        """
        # Normalize inputs
        data_type_enum = DataType(data_type.lower())
        dist_enum = Distribution(distribution.lower())
        
        # Handle paired override
        if paired is not None:
            independent = not paired
        
        # Determine group key (a single group cannot be compared)
        if groups < 2:
            raise ValueError("groups must be 2 or more")
        group_key = 2 if groups == 2 else "3+"
        
        warnings = []
        alternative_tests = []
        
        # Special handling for categorical data
        if data_type_enum == DataType.CATEGORICAL:
            if expected_cell_counts is not None and expected_cell_counts < 5:
                recommended = "Fisher's Exact Test"
                alternative_tests = ["Chi-Square with Yates' correction", "Bootstrap methods"]
                warnings.append("Low expected cell counts - Fisher's Exact Test preferred")
            else:
                recommended = "Chi-Square Test of Independence"
                alternative_tests = ["Fisher's Exact Test (if sparse)", "G-test"]
            
            assumptions = [
                "Independence of observations",
                "Expected frequency ≥ 5 in at least 80% of cells",
                "Mutual exclusivity of categories"
            ]
            
            rationale = f"Chi-square tests are appropriate for categorical data with {groups} groups."
            
        else:
            # Lookup in decision tree
            try:
                tree = self.test_decision_tree[data_type_enum][group_key][independent]
                if isinstance(tree, dict) and Distribution.NORMAL in tree:
                    recommended = tree.get(dist_enum, tree[Distribution.UNKNOWN])
                else:
                    recommended = tree.get("default", "Consultation recommended")
                
                # Determine alternatives
                if data_type_enum == DataType.CONTINUOUS:
                    if groups == 2:
                        if independent:
                            alternative_tests = ["Welch's T-Test", "Yuen's Trimmed Mean T-Test"]
                        else:
                            alternative_tests = ["Permutation Test", "Bootstrap CI"]
                    else:  # 3+ groups
                        if independent:
                            alternative_tests = ["Welch's ANOVA", "Brown-Forsythe Test"]
                        else:
                            alternative_tests = ["Mixed-effects Model", "GEE"]
                
                assumptions = self._get_assumptions_for_test(recommended)
                rationale = self._get_rationale(recommended, data_type_enum, groups, independent, dist_enum)
                
            except KeyError:
                recommended = "Consultation recommended"
                alternative_tests = []
                assumptions = []
                warnings.append("Complex design - specialized consultation advised")
                rationale = "The specified design requires specialized statistical consultation."
        
        # Add sample size warnings
        if groups >= 2 and data_type_enum == DataType.CONTINUOUS:
            warnings.extend(self._check_sample_size_recommendations(groups))
        
        return TestRecommendation(
            recommended_test=recommended,
            alternative_tests=alternative_tests,
            assumptions=assumptions,
            warnings=warnings,
            rationale=rationale
        )
    
    def _get_assumptions_for_test(self, test_name: str) -> List[str]:
        """Get assumptions list for a specific test."""
        test_key = test_name.lower().replace(" ", "_").replace("-", "_")
        
        assumption_map = {
            "independent": "independent_ttest",
            "paired": "paired_ttest",
            "anova": "anova",
            "chi_square": "chi_square",
            "mann_whitney": "independent_ttest",  # Uses same assumptions check
            "wilcoxon": "paired_ttest",
            "kruskal_wallis": "anova"
        }
        
        for key, value in assumption_map.items():
            if key in test_key:
                test_info = self.assumption_tests.get(value, {})
                return [f"{k}: {v.get('test', v.get('check', 'Required'))}" 
                        for k, v in test_info.items()]
        
        return ["Consult references/assumption_tests.md for detailed requirements"]
    
    def _get_rationale(
        self,
        test_name: str,
        data_type: DataType,
        groups: int,
        independent: bool,
        distribution: Distribution
    ) -> str:
        """Generate rationale for test recommendation."""
        parts = []
        
        parts.append(f"Selected {test_name} based on:")
        parts.append(f"- Data type: {data_type.value}")
        parts.append(f"- Groups: {groups}")
        parts.append(f"- Sample relationship: {'Independent' if independent else 'Paired/Related'}")
        parts.append(f"- Distribution: {distribution.value}")
        
        if "mann-whitney" in test_name.lower() or "wilcoxon" in test_name.lower() or "kruskal" in test_name.lower():
            parts.append("\nNon-parametric test chosen due to non-normal distribution or ordinal data.")
            parts.append("These tests are robust to outliers and don't assume normality.")
        
        if "welch" in test_name.lower():
            parts.append("\nWelch's test used for unequal variances.")
        
        return "\n".join(parts)
    
    def _check_sample_size_recommendations(self, groups: int) -> List[str]:
        """Check and return sample size recommendations."""
        warnings = []
        
        if groups == 2:
            warnings.append("Minimum n=20 per group recommended for robust normality tests")
        else:
            warnings.append(f"Minimum n=30 per group recommended for ANOVA with {groups} groups")
        
        return warnings
    
    def check_assumptions(
        self,
        test_type: str,
        sample_sizes: Optional[List[int]] = None,
        normality_pvalues: Optional[List[float]] = None,
        levene_pvalue: Optional[float] = None
    ) -> List[AssumptionCheck]:
        """
        Check statistical assumptions for a given test.
        
        Args:
            test_type: Type of test being considered
            sample_sizes: List of sample sizes per group
            normality_pvalues: P-values from normality tests
            levene_pvalue: P-value from Levene's homogeneity test
        
        Returns:
            List of AssumptionCheck results
        """
        results = []
        
        test_key = test_type.lower().replace(" ", "_")
        
        # Check normality
        if normality_pvalues is not None:
            for i, pval in enumerate(normality_pvalues):
                satisfied = pval > 0.05
                results.append(AssumptionCheck(
                    assumption=f"Normality (Group {i+1})",
                    test_name="Shapiro-Wilk",
                    satisfied=satisfied,
                    details=f"p-value = {pval:.4f}",
                    recommendation="Use non-parametric alternative" if not satisfied else "Assumption satisfied"
                ))
        
        # Check homogeneity
        if levene_pvalue is not None:
            satisfied = levene_pvalue > 0.05
            results.append(AssumptionCheck(
                assumption="Homogeneity of Variance",
                test_name="Levene's Test",
                satisfied=satisfied,
                details=f"p-value = {levene_pvalue:.4f}",
                recommendation="Use Welch's test" if not satisfied else "Assumption satisfied"
            ))
        
        # Sample size check
        if sample_sizes is not None:
            for i, n in enumerate(sample_sizes):
                satisfied = n >= 30
                results.append(AssumptionCheck(
                    assumption=f"Adequate Sample Size (Group {i+1})",
                    test_name="Rule of Thumb",
                    satisfied=satisfied,
                    details=f"n = {n}",
                    recommendation="Consider power analysis" if not satisfied else "Sample size adequate"
                ))
        
        return results
    
    def calculate_power(
        self,
        effect_size: float,
        alpha: float = 0.05,
        sample_size: Optional[int] = None,
        power: Optional[float] = None,
        groups: int = 2
    ) -> PowerAnalysis:
        """
        Calculate statistical power or required sample size.
        
        Args:
            effect_size: Cohen's d (for t-tests) or f (for ANOVA)
            alpha: Significance level (default 0.05)
            sample_size: Sample size per group (if known)
            power: Desired power (if calculating sample size)
            groups: Number of groups
        
        Returns:
            PowerAnalysis with results
        
        Note:
            Either sample_size OR power must be provided (not both)
        """
        if sample_size is None and power is None:
            raise ValueError("Either sample_size or power must be provided")
        
        if sample_size is not None and power is not None:
            raise ValueError("Provide only sample_size OR power, not both")
        
        # Approximate power calculation using the normal approximation.
        # For more precise calculations, use a specialized library such as statsmodels.
        from statistics import NormalDist

        # Two-sided critical value derived from alpha (1.96 when alpha=0.05)
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)

        if power is not None:
            # Calculate required sample size
            # Simplified formula for two groups:
            # n = 2 * ((z_(1-alpha/2) + z_power) / effect_size)^2
            z_power = NormalDist().inv_cdf(power)

            # Round up so the achieved power does not fall below the target
            n_per_group = math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

            if groups > 2:
                # Rough adjustment for ANOVA
                n_per_group = math.ceil(n_per_group * (1 + 0.1 * (groups - 2)))

            calculated_power = power
            calculated_n = n_per_group
            interpretation = f"To achieve {power*100:.0f}% power with effect size {effect_size}, need n={n_per_group} per group."

        else:
            # Calculate achieved power
            ncp = effect_size * math.sqrt(sample_size / 2)  # non-centrality parameter

            # Approximate power using normal distribution
            calculated_power = 1 - NormalDist().cdf(z_alpha - ncp)
            calculated_power = min(calculated_power, 0.999)  # Cap at 99.9%
            
            calculated_n = sample_size
            
            if calculated_power >= 0.8:
                interpretation = f"Power ({calculated_power:.1%}) is adequate (≥80%)."
            elif calculated_power >= 0.6:
                interpretation = f"Power ({calculated_power:.1%}) is marginal. Consider increasing sample size."
            else:
                interpretation = f"Power ({calculated_power:.1%}) is insufficient. Increase sample size or effect size."
        
        return PowerAnalysis(
            effect_size=effect_size,
            sample_size=calculated_n,
            alpha=alpha,
            power=calculated_power,
            interpretation=interpretation
        )
    
    def get_effect_size_interpretation(self, effect_size: float, metric: str = "cohens_d") -> str:
        """
        Interpret effect size magnitude.
        
        Args:
            effect_size: Calculated effect size value
            metric: Type of effect size metric
        
        Returns:
            Interpretation string
        """
        interpretations = {
            "cohens_d": {
                (0, 0.2): "Negligible",
                (0.2, 0.5): "Small",
                (0.5, 0.8): "Medium",
                (0.8, float('inf')): "Large"
            },
            "eta_squared": {
                (0, 0.01): "Negligible",
                (0.01, 0.06): "Small",
                (0.06, 0.14): "Medium",
                (0.14, float('inf')): "Large"
            },
            "cramers_v": {
                (0, 0.1): "Negligible",
                (0.1, 0.3): "Small",
                (0.3, 0.5): "Medium",
                (0.5, float('inf')): "Large"
            }
        }
        
        metric_interp = interpretations.get(metric, interpretations["cohens_d"])
        
        for (lower, upper), label in metric_interp.items():
            if lower <= abs(effect_size) < upper:
                return f"{label} effect size ({metric} = {effect_size:.3f})"
        
        return f"Effect size = {effect_size:.3f} (interpretation not available)"
    
    def compare_tests(self, scenario: Dict[str, Any]) -> Dict[str, TestRecommendation]:
        """
        Compare multiple test options for a given scenario.
        
        Args:
            scenario: Dictionary with data characteristics
        
        Returns:
            Dictionary mapping test names to recommendations
        """
        recommendations = {}
        
        # Parametric option
        if scenario.get("distribution") == "normal":
            rec = self.recommend_test(**scenario)
            recommendations["parametric"] = rec
        
        # Non-parametric option
        non_normal_scenario = scenario.copy()
        non_normal_scenario["distribution"] = "non-normal"
        rec = self.recommend_test(**non_normal_scenario)
        recommendations["non_parametric"] = rec
        
        # Robust option
        robust_scenario = scenario.copy()
        robust_scenario["distribution"] = "unknown"
        rec = self.recommend_test(**robust_scenario)
        recommendations["robust"] = rec
        
        return recommendations


def main():
    """Example usage and testing."""
    advisor = StatisticalAdvisor()
    
    # Example 1: Two-group comparison with normal distribution
    print("=" * 60)
    print("Example 1: Two-group independent comparison")
    print("=" * 60)
    
    rec = advisor.recommend_test(
        data_type="continuous",
        groups=2,
        independent=True,
        distribution="normal"
    )
    
    print(f"Recommended: {rec.recommended_test}")
    print(f"Alternatives: {', '.join(rec.alternative_tests)}")
    print(f"Rationale:\n{rec.rationale}")
    print(f"Assumptions:")
    for assumption in rec.assumptions:
        print(f"  - {assumption}")
    
    # Example 2: Power analysis
    print("\n" + "=" * 60)
    print("Example 2: Power Analysis")
    print("=" * 60)
    
    power_result = advisor.calculate_power(
        effect_size=0.5,
        alpha=0.05,
        sample_size=30
    )
    
    print(f"Effect size: {power_result.effect_size}")
    print(f"Sample size: {power_result.sample_size}")
    print(f"Achieved power: {power_result.power:.1%}")
    print(f"Interpretation: {power_result.interpretation}")
    
    # Example 3: Sample size calculation
    print("\n" + "=" * 60)
    print("Example 3: Sample Size Calculation")
    print("=" * 60)
    
    size_result = advisor.calculate_power(
        effect_size=0.5,
        alpha=0.05,
        power=0.8
    )
    
    print(f"For 80% power with effect size 0.5:")
    print(f"Required sample size: {size_result.sample_size} per group")
    print(f"Total N: {size_result.sample_size * 2}")
    
    # Example 4: Effect size interpretation
    print("\n" + "=" * 60)
    print("Example 4: Effect Size Interpretation")
    print("=" * 60)
    
    for es in [0.1, 0.4, 0.7, 1.2]:
        interp = advisor.get_effect_size_interpretation(es, "cohens_d")
        print(f"Cohen's d = {es}: {interp}")


if __name__ == "__main__":
    main()


Spatial Transcriptomics Mapper

Skill

Map spatial transcriptomics data from 10x Genomics Visium/Xenium onto tissue section images.

---
name: spatial-transcriptomics-mapper
description: Map spatial transcriptomics data from 10x Genomics Visium/Xenium onto tissue section images.
license: MIT
skill-author: AIPOCH
---
# Spatial Transcriptomics Mapper (ID: 196)

## When to Use

- Use this skill when the task is to map spatial transcriptomics data from 10x Genomics Visium/Xenium onto tissue section images.
- Use this skill for data analysis tasks that require explicit assumptions, bounded scope, and a reproducible output format.
- Use this skill when you need a documented fallback path for missing inputs, execution errors, or partial evidence.

## Key Features

See `## Features` below for related details.

- Scope-focused workflow aligned to mapping spatial transcriptomics data from 10x Genomics Visium/Xenium onto tissue section images.
- Packaged executable path(s): `scripts/__init__.py` plus 1 additional script(s).
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

See `## Prerequisites` below for related details.

- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `h5py`: `unspecified`. Declared in `requirements.txt`.
- `matplotlib`: `unspecified`. Declared in `requirements.txt`.
- `numpy`: `unspecified`. Declared in `requirements.txt`.
- `pandas`: `unspecified`. Declared in `requirements.txt`.
- `pil`: `unspecified`. Declared in `requirements.txt`.
- `pyarrow`: `unspecified`. Declared in `requirements.txt`.
- `scanpy`: `unspecified`. Declared in `requirements.txt`.
- `seaborn`: `unspecified`. Declared in `requirements.txt`.
- `squidpy`: `unspecified`. Declared in `requirements.txt`.
- `tifffile`: `unspecified`. Declared in `requirements.txt`.

## Example Usage

See `## Usage` below for related details.

```bash
cd "20260318/scientific-skills/Data Analytics/spatial-transcriptomics-mapper"
python -m py_compile scripts/main.py
python scripts/main.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/main.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` below for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/__init__.py` with additional helper scripts under `scripts/`.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Quick Check

Use this command to verify that the packaged script entry point can be parsed before deeper execution.

```bash
python -m py_compile scripts/main.py
```

## Audit-Ready Commands

Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.

```bash
python -m py_compile scripts/main.py
python scripts/main.py --help
```

## Workflow

1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.

## Description

Spatial Transcriptomics analysis tool for processing 10x Genomics Visium or Xenium data, projecting gene expression data back onto tissue section images to draw "gene-space" distribution maps. Supports gene expression visualization, spatial clustering analysis, and morphological feature correlation.

## Features

- **Visium Data Processing**: Supports Space Ranger output (filtered_feature_bc_matrix.h5, spatial/tissue_positions_list.csv, spatial/tissue_lowres_image.png)
- **Xenium Data Processing**: Supports the Xenium output bundle (cell_feature_matrix.h5, transcripts.parquet, nucleus_boundaries.parquet)
- **Gene Expression Mapping**: Projects expression of specified genes onto tissue images
- **Spatial Clustering Visualization**: Displays spatial distribution of Seurat/Scanpy clustering results
- **Multi-gene Joint Analysis**: Supports combined visualization of multiple genes
- **High-resolution Output**: Supports high-resolution image export

## Installation

```bash
# Required dependencies
pip install scanpy squidpy matplotlib seaborn pillow numpy pandas h5py

# Optional: For Xenium data processing
pip install pyarrow dask

# Optional: For advanced image processing
pip install opencv-python scikit-image
```

## Quick Start - Test Data

Generate sample Visium data to test the tool:

```bash
# Generate test data
python scripts/generate_test_data.py \
  --platform visium \
  --output ./test_data/visium_sample \
  --n-spots 500 \
  --n-genes 1000

# Run analysis on test data
python scripts/main.py \
  --platform visium \
  --data-dir ./test_data/visium_sample \
  --gene GENE_0000 \
  --output ./test_output/
```

## Usage

### Basic - Visium Data

```bash
python scripts/main.py \
  --platform visium \
  --data-dir /path/to/spaceranger/outs/ \
  --gene PIK3CA \
  --output ./output/
```

### Basic - Xenium Data

```bash
python scripts/main.py \
  --platform xenium \
  --data-dir /path/to/xenium/outs/ \
  --gene PIK3CA \
  --output ./output/
```

### Multiple Genes

```bash
python scripts/main.py \
  --platform visium \
  --data-dir /path/to/data/ \
  --genes PIK3CA,PTEN,EGFR \
  --mode overlay \
  --output ./output/
```

### With Clustering Results

```bash
python scripts/main.py \
  --platform visium \
  --data-dir /path/to/data/ \
  --cluster-file ./clusters.csv \
  --output ./output/
```

## Input File Structure

### Visium (Space Ranger output)
```
outs/
├── filtered_feature_bc_matrix.h5    # Gene expression matrix
├── raw_feature_bc_matrix.h5         # Raw counts (optional)
├── spatial/
│   ├── tissue_positions_list.csv    # Spot positions
│   ├── tissue_lowres_image.png      # Low-res H&E image
│   ├── tissue_hires_image.png       # High-res H&E image
│   └── scalefactors_json.json       # Scale factors
└── web_summary.html
```
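
To place spots on the low-resolution image, the full-resolution pixel coordinates in `tissue_positions_list.csv` are multiplied by the low-res scale factor from `scalefactors_json.json`. The sketch below illustrates this under stated assumptions: the CSV is the headerless six-column file (barcode, in_tissue, array_row, array_col, pxl_row_in_fullres, pxl_col_in_fullres), the scale factor key is `tissue_lowres_scalef`, and `load_lowres_spots` is a hypothetical helper, not a function of the packaged script. Newer Space Ranger releases may write `tissue_positions.csv` with a header row instead, which this sketch does not handle.

```python
import csv
import json
from pathlib import Path


def load_lowres_spots(spatial_dir):
    """Return {barcode: (x, y)} in low-res image pixels for in-tissue spots."""
    spatial_dir = Path(spatial_dir)
    with open(spatial_dir / "scalefactors_json.json") as fh:
        scale = json.load(fh)["tissue_lowres_scalef"]

    spots = {}
    with open(spatial_dir / "tissue_positions_list.csv", newline="") as fh:
        # Assumed columns (no header row): barcode, in_tissue, array_row,
        # array_col, pxl_row_in_fullres, pxl_col_in_fullres
        for barcode, in_tissue, _row, _col, pxl_row, pxl_col in csv.reader(fh):
            if in_tissue != "1":
                continue  # skip spots that are not under tissue
            # Image x runs along columns, y along rows
            spots[barcode] = (float(pxl_col) * scale, float(pxl_row) * scale)
    return spots
```

Calling `load_lowres_spots("outs/spatial")` would give coordinates that can be passed directly to a scatter plot drawn over `tissue_lowres_image.png`.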

### Xenium
```
outs/
├── cell_feature_matrix.h5           # Cell x gene matrix
├── transcripts.parquet              # Transcript coordinates
├── nucleus_boundaries.parquet       # Cell boundaries
├── cell_boundaries.parquet
├── morphology_focus.ome.tif         # Morphology image
└── experiment.xenium
```
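
Before running the mapper, it can help to confirm that a data directory matches one of the two layouts above. The following is a minimal sketch of such a check; which files the script strictly requires is an assumption here, and `REQUIRED_FILES` and `missing_inputs` are hypothetical names.

```python
from pathlib import Path

# File names taken from the two layouts above; treating exactly these
# as required is an assumption of this sketch.
REQUIRED_FILES = {
    "visium": [
        "filtered_feature_bc_matrix.h5",
        "spatial/tissue_positions_list.csv",
        "spatial/tissue_lowres_image.png",
        "spatial/scalefactors_json.json",
    ],
    "xenium": [
        "cell_feature_matrix.h5",
        "transcripts.parquet",
        "nucleus_boundaries.parquet",
    ],
}


def missing_inputs(platform, data_dir):
    """List the expected files that are absent from data_dir."""
    if platform not in REQUIRED_FILES:
        raise ValueError(f"Unknown platform: {platform!r}")
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES[platform] if not (root / name).is_file()]
```

An empty list means every expected file was found; otherwise the returned names can be reported back to the user before any plotting starts.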

## Output Files

- `{gene}_spatial_map.png`: Single gene spatial expression map
- `{gene}_heatmap.png`: Gene expression heatmap
- `multi_gene_overlay.png`: Multi-gene overlay map (if using --mode overlay)
- `cluster_spatial_map.png`: Cluster spatial distribution map
- `combined_report.html`: Comprehensive HTML report

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--platform` | str | required | Platform type: visium or xenium |
| `--data-dir` | str | required | Data directory path |
| `--gene` | str | optional | Single gene name |
| `--genes` | list | optional | Multiple genes, comma-separated |
| `--mode` | str | single | Mode: single/overlay/multi |
| `--cluster-file` | str | optional | Clustering result CSV file path |
| `--output` | str | ./output | Output directory |
| `--dpi` | int | 300 | Output image DPI |
| `--cmap` | str | viridis | Color map scheme |
| `--spot-size` | float | 1.0 | Visium spot size factor |
| `--alpha` | float | 0.8 | Transparency (0-1) |
| `--min-count` | int | 0 | Minimum expression filter |
| `--crop` | str | optional | Crop region (x1,y1,x2,y2) |
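
The table can be read as an `argparse` definition. The sketch below is a hypothetical parser that mirrors the listed names, types, and defaults; the real `scripts/main.py` may be organized differently, and treating crop coordinates as integers is an assumption.

```python
import argparse


def parse_crop(value):
    """Parse 'x1,y1,x2,y2' into a tuple of four integers."""
    parts = value.split(",")
    if len(parts) != 4:
        raise argparse.ArgumentTypeError("crop must be x1,y1,x2,y2")
    try:
        x1, y1, x2, y2 = (int(p) for p in parts)
    except ValueError:
        raise argparse.ArgumentTypeError("crop values must be integers")
    if x2 <= x1 or y2 <= y1:
        raise argparse.ArgumentTypeError("crop needs x2 > x1 and y2 > y1")
    return x1, y1, x2, y2


def parse_genes(value):
    """Split a comma-separated gene list, dropping empty entries."""
    return [g.strip() for g in value.split(",") if g.strip()]


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical CLI mirroring the parameter table")
    parser.add_argument("--platform", required=True, choices=["visium", "xenium"])
    parser.add_argument("--data-dir", required=True)
    parser.add_argument("--gene")
    parser.add_argument("--genes", type=parse_genes)
    parser.add_argument("--mode", default="single", choices=["single", "overlay", "multi"])
    parser.add_argument("--cluster-file")
    parser.add_argument("--output", default="./output")
    parser.add_argument("--dpi", type=int, default=300)
    parser.add_argument("--cmap", default="viridis")
    parser.add_argument("--spot-size", type=float, default=1.0)
    parser.add_argument("--alpha", type=float, default=0.8)
    parser.add_argument("--min-count", type=int, default=0)
    parser.add_argument("--crop", type=parse_crop)
    return parser


args = build_parser().parse_args(
    ["--platform", "visium", "--data-dir", "./outs", "--genes", "PIK3CA,PTEN,EGFR", "--mode", "overlay"]
)
print(args.genes, args.mode, args.dpi)
```

Validating `--crop` in the parser means a malformed region is rejected before any data is loaded.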

## Examples

### Example 1: Single Gene Visualization
```bash
python scripts/main.py \
  --platform visium \
  --data-dir ./visium_sample/outs/ \
  --gene EPCAM \
  --cmap Reds \
  --output ./results/
```

### Example 2: Tumor Marker Combination
```bash
python scripts/main.py \
  --platform visium \
  --data-dir ./breast_cancer/outs/ \
  --genes PIK3CA,ERBB2,ESR1,PGR \
  --mode multi \
  --cmap plasma \
  --output ./tumor_markers/
```

### Example 3: Xenium Subcellular Resolution
```bash
python scripts/main.py \
  --platform xenium \
  --data-dir ./xenium_lung/outs/ \
  --genes SFTPB,SFTPC,SCGB1A1 \
  --dpi 600 \
  --output ./xenium_results/
```

### Example 4: Spatial Clustering Visualization
```bash
python scripts/main.py \
  --platform visium \
  --data-dir ./sample/outs/ \
  --cluster-file ./seurat_clusters.csv \
  --output ./clusters/
```

## API Usage

```python
from skills.spatial_transcriptomics_mapper.scripts.main import SpatialMapper

# Initialize
mapper = SpatialMapper(
    platform="visium",
    data_dir="/path/to/data",
    output_dir="./output"
)

# Load data
mapper.load_data()

# Plot single gene
mapper.plot_gene_spatial(
    gene="PIK3CA",
    cmap="viridis",
    save_path="./output/pik3ca.png"
)

# Plot multiple genes
mapper.plot_multi_genes(
    genes=["PIK3CA", "PTEN", "EGFR"],
    mode="grid",
    save_path="./output/multi.png"
)

# Get spatial statistics
stats = mapper.get_spatial_stats(gene="PIK3CA")
```
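
`get_spatial_stats` returns a dictionary with the keys `gene`, `total_counts`, `mean_expression`, `max_expression`, `expressed_spots`, `detection_rate`, and, when coordinates are available, `hotspot_location`. The following pure-Python sketch reproduces those keys on a toy input so the meaning of each field is explicit. The function `summarize_expression` and its inputs are hypothetical and stand in for the AnnData-backed computation.

```python
from typing import Dict, Sequence, Tuple


def summarize_expression(
    gene: str,
    expression: Sequence[float],
    coords: Sequence[Tuple[float, float]],
) -> Dict:
    """Mirror the keys returned by get_spatial_stats for a toy input."""
    if len(expression) == 0 or len(expression) != len(coords):
        raise ValueError("expression and coords must be non-empty and equal length")
    n = len(expression)
    expressed = sum(1 for v in expression if v > 0)
    # Hotspot: the first spot with the highest expression value.
    hotspot_idx = max(range(n), key=lambda i: expression[i])
    x, y = coords[hotspot_idx]
    return {
        "gene": gene,
        "total_counts": int(sum(expression)),
        "mean_expression": sum(expression) / n,
        "max_expression": float(max(expression)),
        "expressed_spots": expressed,
        "detection_rate": expressed / n,
        "hotspot_location": {"x": float(x), "y": float(y)},
    }


stats = summarize_expression(
    "PIK3CA",
    expression=[0, 2, 5, 0],
    coords=[(10, 10), (20, 10), (10, 20), (20, 20)],
)
print(stats)
```

Here `detection_rate` is the fraction of spots with non-zero expression, and `hotspot_location` is the coordinate of the single highest-expressing spot.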

## Notes

- Visium data uses the low-resolution tissue image by default to speed up processing; pass `--hires` to use the high-resolution image instead.
- For large Xenium datasets, use `--crop` to restrict plotting to a region of interest.
- Color map reference: https://matplotlib.org/stable/tutorials/colors/colormaps.html
- For large samples, consider `--downsample` to reduce resolution.
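
The region-of-interest idea behind `--crop` amounts to keeping only the points inside a bounding box. How the script applies the crop is not shown in this section, so the sketch below is an assumption: it treats the box as inclusive on all sides, and the helper `crop_points` is hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def crop_points(
    points: Sequence[Point], box: Tuple[float, float, float, float]
) -> List[int]:
    """Return indices of points that fall inside box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    x_lo, x_hi = min(x1, x2), max(x1, x2)
    y_lo, y_hi = min(y1, y2), max(y1, y2)
    return [
        i
        for i, (x, y) in enumerate(points)
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi
    ]


# Toy transcript coordinates; real values would come from transcripts.parquet.
transcripts = [(5.0, 5.0), (150.0, 80.0), (900.0, 900.0), (200.0, 100.0)]
kept = crop_points(transcripts, (100, 50, 200, 100))
print(kept)
```

Returning indices rather than the points themselves lets the same selection be applied to any per-transcript column, such as gene names or quality values.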

## References

- 10x Genomics Visium: https://www.10xgenomics.com/products/spatial-gene-expression
- 10x Genomics Xenium: https://www.10xgenomics.com/platforms/xenium
- Scanpy: https://scanpy.readthedocs.io/
- Squidpy: https://squidpy.readthedocs.io/

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
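
The path-related items in the checklist above can be checked mechanically. The following is a minimal sketch of one way to reject `../` traversal and keep output inside a workspace directory. The function name `is_inside_workspace` and the example paths are hypothetical, and the script itself may not perform this check.

```python
from pathlib import Path


def is_inside_workspace(candidate: str, workspace: str) -> bool:
    """True if candidate resolves to a location at or below workspace."""
    root = Path(workspace).resolve()
    # Joining with an absolute candidate discards root, which is then rejected.
    target = (root / candidate).resolve()
    try:
        # relative_to raises ValueError when target is outside root.
        target.relative_to(root)
    except ValueError:
        return False
    return True


print(is_inside_workspace("results/plots", "/tmp/workspace"))
print(is_inside_workspace("../secrets", "/tmp/workspace"))
```

Resolving both paths before comparing them means that `..` segments and symlinks are normalized first, so a simple string prefix check is not needed.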

## Prerequisites

```text
# Python dependencies
pip install -r requirements.txt
```
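
After installation, it can help to confirm which modules are importable, because `scripts/main.py` imports `numpy`, `pandas`, `matplotlib`, and `PIL` unconditionally and treats `scanpy`, `squidpy`, `seaborn`, `pyarrow`, and `tifffile` as optional. This is a small sketch using the standard library; the helper name `check_modules` is hypothetical.

```python
import importlib.util
from typing import Dict, Iterable


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names, which differ from some package names (e.g. Pillow -> PIL).
REQUIRED = ["numpy", "pandas", "matplotlib", "PIL"]
OPTIONAL = ["scanpy", "squidpy", "seaborn", "pyarrow", "tifffile"]

for name, ok in check_modules(REQUIRED + OPTIONAL).items():
    print(f"{name:<12} {'ok' if ok else 'MISSING'}")
```

Note that Visium loading requires `scanpy` even though the import itself is guarded, so a missing optional module can still block a given platform.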

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support

## Output Requirements

Every final response should make these items explicit when they are relevant:

- Objective or requested deliverable
- Inputs used and assumptions introduced
- Workflow or decision path
- Core result, recommendation, or artifact
- Constraints, risks, caveats, or validation needs
- Unresolved items and next-step checks

## Error Handling

- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If `scripts/main.py` fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
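
The first rule above, naming exactly which inputs are missing, can be sketched as a small pre-flight check. The field names mirror the required `--platform` and `--data-dir` options; the function `find_input_problems` and the request dictionary shape are hypothetical.

```python
from typing import Dict, List, Optional

REQUIRED_FIELDS = ("platform", "data_dir")  # mirrors the required CLI flags
SUPPORTED_PLATFORMS = ("visium", "xenium")


def find_input_problems(request: Dict[str, Optional[str]]) -> List[str]:
    """List human-readable problems; an empty list means the request can run."""
    problems = [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if not request.get(field)
    ]
    platform = request.get("platform")
    if platform and platform.lower() not in SUPPORTED_PLATFORMS:
        problems.append(f"unsupported platform: {platform}")
    return problems


print(find_input_problems({"platform": "visium"}))
print(find_input_problems({"platform": "merfish", "data_dir": "./outs"}))
```

Collecting every problem before responding allows a single request for the minimum additional information instead of several round trips.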

## Input Validation

This skill accepts requests that match the documented purpose of `spatial-transcriptomics-mapper` and include enough context to complete the workflow safely.

Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:

> `spatial-transcriptomics-mapper` only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.

## Response Template

Use the following fixed structure for non-trivial requests:

1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks

If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.

FILE:requirements.txt
h5py
matplotlib
numpy
pandas
pillow
pyarrow
scanpy
seaborn
squidpy
tifffile

FILE:scripts/main.py
#!/usr/bin/env python3
"""Spatial Transcriptomics Mapper
Process 10x Visium or Xenium data to project gene expression back to tissue section images

Author: OpenClaw
Version: 1.0.0"""

import os
import sys
import json
import argparse
import warnings
from pathlib import Path
from typing import List, Dict, Optional, Tuple, Union
import logging

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.colors import Normalize, LinearSegmentedColormap
from PIL import Image

# Optional imports with fallbacks
try:
    import scanpy as sc
    SCANPY_AVAILABLE = True
except ImportError:
    SCANPY_AVAILABLE = False
    warnings.warn("Scanpy not available. Some features may be limited.")

try:
    import squidpy as sq
    SQUIDPY_AVAILABLE = True
except ImportError:
    SQUIDPY_AVAILABLE = False

try:
    import seaborn as sns
    SEABORN_AVAILABLE = True
except ImportError:
    SEABORN_AVAILABLE = False

try:
    import pyarrow.parquet as pq
    PYARROW_AVAILABLE = True
except ImportError:
    PYARROW_AVAILABLE = False


# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


class SpatialMapper:
    """Spatial transcriptome data mapper, supports Visium and Xenium platforms"""
    
    def __init__(
        self,
        platform: str,
        data_dir: str,
        output_dir: str = "./output",
        dpi: int = 300,
        cmap: str = "viridis"
    ):
        """Initialize SpatialMapper
        
        Parameters
        ----------
        platform : str
            Platform type: 'visium' or 'xenium'
        data_dir : str
            Data directory path
        output_dir : str
            Output directory path
        dpi : int
            Output image DPI
        cmap : str
            Color map name"""
        self.platform = platform.lower()
        self.data_dir = Path(data_dir)
        self.output_dir = Path(output_dir)
        self.dpi = dpi
        self.cmap = cmap
        
        # Validate platform
        if self.platform not in ["visium", "xenium"]:
            raise ValueError(f"Unsupported platform: {platform}. Use 'visium' or 'xenium'.")
        
        # Create output directory
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        # Data containers
        self.adata = None  # AnnData object for Visium
        self.gene_names = []
        self.image = None
        self.scale_factors = {}
        self.positions = None
        
        # Xenium specific
        self.transcripts = None
        self.boundaries = None
        
        logger.info(f"Initialized SpatialMapper for {platform}")
        logger.info(f"Data directory: {data_dir}")
        logger.info(f"Output directory: {output_dir}")
    
    def load_data(self) -> None:
        """Load spatial transcriptome data"""
        if self.platform == "visium":
            self._load_visium_data()
        elif self.platform == "xenium":
            self._load_xenium_data()
    
    def _load_visium_data(self) -> None:
        """Load Visium data (Space Ranger output)"""
        if not SCANPY_AVAILABLE:
            raise ImportError("Scanpy is required for Visium data processing. Install with: pip install scanpy")
        
        logger.info("Loading Visium data...")
        
        # Check for required files
        matrix_file = self.data_dir / "filtered_feature_bc_matrix.h5"
        if not matrix_file.exists():
            # Try alternative path structure
            matrix_file = self.data_dir / "filtered_gene_bc_matrices" / "h5" / "filtered_feature_bc_matrix.h5"
        
        if not matrix_file.exists():
            raise FileNotFoundError(f"Matrix file not found: {matrix_file}")
        
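        # Load AnnData. count_file is resolved relative to data_dir, so pass
        # the relative path; this also covers the alternative nested layout
        # (passing None would fail when the path is joined).
        self.adata = sc.read_visium(
            self.data_dir,
            count_file=str(matrix_file.relative_to(self.data_dir)),
            load_images=True
        )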
        
        # Store gene names
        self.gene_names = self.adata.var_names.tolist()
        logger.info(f"Loaded {len(self.gene_names)} genes and {self.adata.n_obs} spots")
        
        # Load scale factors
        scalefactors_file = self.data_dir / "spatial" / "scalefactors_json.json"
        if scalefactors_file.exists():
            with open(scalefactors_file) as f:
                self.scale_factors = json.load(f)
        
        # Load tissue image
        img_path = self.data_dir / "spatial" / "tissue_lowres_image.png"
        if img_path.exists():
            self.image = np.array(Image.open(img_path))
            logger.info(f"Loaded tissue image: {self.image.shape}")
    
    def _load_xenium_data(self) -> None:
        """Load Xenium data"""
        logger.info("Loading Xenium data...")
        
        # Load cell feature matrix
        matrix_file = self.data_dir / "cell_feature_matrix.h5"
        if matrix_file.exists() and SCANPY_AVAILABLE:
            self.adata = sc.read_10x_h5(str(matrix_file))
            self.gene_names = self.adata.var_names.tolist()
            logger.info(f"Loaded {len(self.gene_names)} genes")
        
        # Load transcripts
        transcripts_file = self.data_dir / "transcripts.parquet"
        if transcripts_file.exists():
            if PYARROW_AVAILABLE:
                self.transcripts = pq.read_table(transcripts_file).to_pandas()
                logger.info(f"Loaded {len(self.transcripts)} transcripts")
            else:
                logger.warning("PyArrow not available, skipping transcripts loading")
        
        # Load nucleus boundaries
        boundaries_file = self.data_dir / "nucleus_boundaries.parquet"
        if boundaries_file.exists() and PYARROW_AVAILABLE:
            self.boundaries = pq.read_table(boundaries_file).to_pandas()
            logger.info(f"Loaded {len(self.boundaries)} boundary points")
        
        # Load morphology image
        img_file = self.data_dir / "morphology_focus.ome.tif"
        if img_file.exists():
            try:
                from tifffile import imread
                self.image = imread(str(img_file))
                logger.info(f"Loaded morphology image: {self.image.shape}")
            except ImportError:
                logger.warning("tifffile not available for OME-TIFF loading")
    
    def plot_gene_spatial(
        self,
        gene: str,
        save_path: Optional[str] = None,
        title: Optional[str] = None,
        spot_size: float = 1.0,
        alpha: float = 0.8,
        vmin: Optional[float] = None,
        vmax: Optional[float] = None,
        show_image: bool = True,
        figsize: Tuple[int, int] = (10, 10)
    ) -> plt.Figure:
        """Plot spatial expression of a single gene
        
        Parameters
        ----------
        gene : str
            Gene name
        save_path : str, optional
            Path to save the figure to
        title : str, optional
            Figure title
        spot_size : float
            Spot size factor
        alpha : float
            Transparency
        vmin, vmax : float, optional
            Color range limits
        show_image : bool
            Whether to show the tissue image as background
        figsize : tuple
            Figure size
            
        Returns
        -------
        fig : matplotlib.figure.Figure"""
        if gene not in self.gene_names:
            raise ValueError(f"Gene '{gene}' not found in dataset")
        
        if self.platform == "visium":
            return self._plot_visium_gene(
                gene, save_path, title, spot_size, alpha, 
                vmin, vmax, show_image, figsize
            )
        else:
            return self._plot_xenium_gene(
                gene, save_path, title, alpha, vmin, vmax, figsize
            )
    
    def _plot_visium_gene(
        self,
        gene: str,
        save_path: Optional[str],
        title: Optional[str],
        spot_size: float,
        alpha: float,
        vmin: Optional[float],
        vmax: Optional[float],
        show_image: bool,
        figsize: Tuple[int, int]
    ) -> plt.Figure:
        """Plot a Visium single-gene spatial map"""
        fig, ax = plt.subplots(figsize=figsize)
        
        # Get gene expression
        gene_idx = self.gene_names.index(gene)
        expression = self.adata.X[:, gene_idx].toarray().flatten() if hasattr(self.adata.X, 'toarray') else self.adata.X[:, gene_idx]
        
        # Get spatial coordinates
        spatial = self.adata.obsm['spatial']
        
        # Plot tissue image background
        if show_image and self.image is not None:
            ax.imshow(self.image)
            # Scale coordinates to image
            if 'tissue_lowres_scalef' in self.scale_factors:
                scale = self.scale_factors['tissue_lowres_scalef']
                spatial_scaled = spatial * scale
            else:
                spatial_scaled = spatial
        else:
            spatial_scaled = spatial
        
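        # Filter zero expression for better visualization
        nonzero_mask = expression > 0
        nonzero_expr = expression[nonzero_mask]
        
        # Default color limits: 1st/99th percentile of non-zero values.
        # Explicit None checks keep a caller-supplied 0 from being ignored,
        # and the size check avoids a percentile over an empty array.
        if nonzero_expr.size > 0:
            if vmin is None:
                vmin = np.percentile(nonzero_expr, 1)
            if vmax is None:
                vmax = np.percentile(nonzero_expr, 99)
        
        # Plot spots
        scatter = ax.scatter(
            spatial_scaled[nonzero_mask, 0],
            spatial_scaled[nonzero_mask, 1],
            c=nonzero_expr,
            cmap=self.cmap,
            s=spot_size * 50,
            alpha=alpha,
            edgecolors='none',
            vmin=vmin,
            vmax=vmax
        )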
        
        # Add colorbar
        cbar = plt.colorbar(scatter, ax=ax, shrink=0.5)
        cbar.set_label(f'{gene} Expression', fontsize=10)
        
        # Styling
        ax.set_title(title or f'{gene} Spatial Expression', fontsize=14, fontweight='bold')
        ax.set_xlabel('X coordinate', fontsize=10)
        ax.set_ylabel('Y coordinate', fontsize=10)
        ax.axis('off' if show_image else 'on')
        
        plt.tight_layout()
        
        # Save figure
        if save_path:
            plt.savefig(save_path, dpi=self.dpi, bbox_inches='tight', 
                       facecolor='white', edgecolor='none')
            logger.info(f"Saved figure to {save_path}")
        
        return fig
    
    def _plot_xenium_gene(
        self,
        gene: str,
        save_path: Optional[str],
        title: Optional[str],
        alpha: float,
        vmin: Optional[float],
        vmax: Optional[float],
        figsize: Tuple[int, int]
    ) -> plt.Figure:
        """Plot Xenium transcript locations for a single gene"""
        fig, ax = plt.subplots(figsize=figsize)
        
        if self.transcripts is not None:
            # Filter transcripts for this gene
            gene_transcripts = self.transcripts[self.transcripts['feature_name'] == gene]
            
            if len(gene_transcripts) == 0:
                logger.warning(f"No transcripts found for gene {gene}")
                ax.text(0.5, 0.5, f"No transcripts found for {gene}", 
                       ha='center', va='center', transform=ax.transAxes)
            else:
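                # Color by quality value when the column exists; otherwise
                # fall back to a constant array so scatter still gets one
                # value per point (a bare scalar is not a valid 'c').
                if 'qv' in gene_transcripts.columns:
                    quality = gene_transcripts['qv']
                else:
                    quality = np.full(len(gene_transcripts), 20.0)
                
                # Plot transcripts
                scatter = ax.scatter(
                    gene_transcripts['x_location'],
                    gene_transcripts['y_location'],
                    c=quality,
                    cmap=self.cmap,
                    s=1,
                    alpha=alpha,
                    vmin=0 if vmin is None else vmin,
                    vmax=40 if vmax is None else vmax
                )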
                cbar = plt.colorbar(scatter, ax=ax, shrink=0.5)
                cbar.set_label('Quality Value', fontsize=10)
                
                logger.info(f"Plotted {len(gene_transcripts)} transcripts for {gene}")
        else:
            ax.text(0.5, 0.5, "Transcript data not available", 
                   ha='center', va='center', transform=ax.transAxes)
        
        # Styling
        ax.set_title(title or f'{gene} Transcript Locations', fontsize=14, fontweight='bold')
        ax.set_aspect('equal')
        ax.axis('off')
        
        plt.tight_layout()
        
        if save_path:
            plt.savefig(save_path, dpi=self.dpi, bbox_inches='tight',
                       facecolor='white', edgecolor='none')
            logger.info(f"Saved figure to {save_path}")
        
        return fig
    
    def plot_multi_genes(
        self,
        genes: List[str],
        mode: str = "grid",
        save_path: Optional[str] = None,
        spot_size: float = 1.0,
        alpha: float = 0.8,
        figsize_per_gene: Tuple[int, int] = (5, 5)
    ) -> plt.Figure:
        """Plot spatial expression of multiple genes
        
        Parameters
        ----------
        genes : list
            Gene name list
        mode : str
            Layout mode: 'grid' or 'overlay'
        save_path: str, optional
            save path
        spot_size : float
            Spot size
        alpha : float
            Transparency
        figsize_per_gene : tuple
            Subplot size for each gene
            
        Returns
        -------
        fig : matplotlib.figure.Figure"""
        # Validate genes
        missing_genes = [g for g in genes if g not in self.gene_names]
        if missing_genes:
            logger.warning(f"Genes not found: {missing_genes}")
            genes = [g for g in genes if g in self.gene_names]
        
        if not genes:
            raise ValueError("No valid genes to plot")
        
        if mode == "grid":
            n_genes = len(genes)
            n_cols = min(3, n_genes)
            n_rows = (n_genes + n_cols - 1) // n_cols
            
            fig, axes = plt.subplots(
                n_rows, n_cols,
                figsize=(figsize_per_gene[0] * n_cols, figsize_per_gene[1] * n_rows)
            )
            axes = np.atleast_1d(axes).flatten()
            
            for i, gene in enumerate(genes):
                ax = axes[i]
                self._plot_gene_on_axis(ax, gene, spot_size, alpha)
            
            # Hide unused subplots
            for i in range(len(genes), len(axes)):
                axes[i].axis('off')
        
        elif mode == "overlay":
            fig, ax = plt.subplots(figsize=(10, 10))
            self._plot_genes_overlay(ax, genes, spot_size, alpha)
        
        else:
            raise ValueError(f"Unknown mode: {mode}")
        
        plt.tight_layout()
        
        if save_path:
            plt.savefig(save_path, dpi=self.dpi, bbox_inches='tight',
                       facecolor='white', edgecolor='none')
            logger.info(f"Saved multi-gene figure to {save_path}")
        
        return fig
    
    def _plot_gene_on_axis(
        self,
        ax: plt.Axes,
        gene: str,
        spot_size: float,
        alpha: float
    ) -> None:
        """Plot gene expression on specified axis"""
        if self.platform != "visium" or self.adata is None:
            return
        
        gene_idx = self.gene_names.index(gene)
        expression = self.adata.X[:, gene_idx].toarray().flatten() if hasattr(self.adata.X, 'toarray') else self.adata.X[:, gene_idx]
        spatial = self.adata.obsm['spatial']
        
        # Scale coordinates
        if 'tissue_lowres_scalef' in self.scale_factors:
            scale = self.scale_factors['tissue_lowres_scalef']
            spatial_scaled = spatial * scale
        else:
            spatial_scaled = spatial
        
        # Plot
        nonzero_mask = expression > 0
        scatter = ax.scatter(
            spatial_scaled[nonzero_mask, 0],
            spatial_scaled[nonzero_mask, 1],
            c=expression[nonzero_mask],
            cmap=self.cmap,
            s=spot_size * 30,
            alpha=alpha,
            edgecolors='none'
        )
        
        # Add tissue image background
        if self.image is not None:
            ax.imshow(self.image, alpha=0.3)
        
        ax.set_title(gene, fontsize=12, fontweight='bold')
        ax.axis('off')
        plt.colorbar(scatter, ax=ax, shrink=0.5)
    
    def _plot_genes_overlay(
        self,
        ax: plt.Axes,
        genes: List[str],
        spot_size: float,
        alpha: float
    ) -> None:
        """Plot multiple genes overlay (using different colors)"""
        if self.platform != "visium" or self.adata is None:
            return
        
        # Use distinct colors for each gene
        colors = plt.cm.Set1(np.linspace(0, 1, len(genes)))
        
        spatial = self.adata.obsm['spatial']
        if 'tissue_lowres_scalef' in self.scale_factors:
            scale = self.scale_factors['tissue_lowres_scalef']
            spatial_scaled = spatial * scale
        else:
            spatial_scaled = spatial
        
        for i, gene in enumerate(genes):
            gene_idx = self.gene_names.index(gene)
            expression = self.adata.X[:, gene_idx].toarray().flatten() if hasattr(self.adata.X, 'toarray') else self.adata.X[:, gene_idx]
            
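            nonzero_mask = expression > 0
            if not nonzero_mask.any():
                # Nothing to draw, and max() of an empty array would raise
                logger.warning(f"No expression detected for gene {gene}; skipping")
                continue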
            ax.scatter(
                spatial_scaled[nonzero_mask, 0],
                spatial_scaled[nonzero_mask, 1],
                c=[colors[i]],
                s=spot_size * 30 * (expression[nonzero_mask] / expression[nonzero_mask].max()),
                alpha=alpha,
                edgecolors='none',
                label=gene
            )
        
        if self.image is not None:
            ax.imshow(self.image, alpha=0.3)
        
        ax.legend(loc='upper right', title='Genes')
        ax.set_title('Multi-gene Overlay', fontsize=14, fontweight='bold')
        ax.axis('off')
    
    def plot_cluster_spatial(
        self,
        cluster_file: Optional[str] = None,
        cluster_key: str = "cluster",
        save_path: Optional[str] = None,
        spot_size: float = 1.5,
        figsize: Tuple[int, int] = (10, 10)
    ) -> plt.Figure:
        """Plot spatial clustering
        
        Parameters
        ----------
        cluster_file: str, optional
            Clustering result CSV file path
        cluster_key : str
            Clustering obs key in AnnData
        save_path: str, optional
            save path
        spot_size : float
            Spot size
        figsize : tuple
            Image size
            
        Returns
        -------
        fig : matplotlib.figure.Figure"""
        if self.platform != "visium" or self.adata is None:
            raise ValueError("Cluster spatial plot only supported for Visium data")
        
        # Load cluster assignments if provided
        if cluster_file:
            clusters = pd.read_csv(cluster_file, index_col=0)
            cluster_col = clusters.columns[0]
            self.adata.obs['cluster'] = clusters[cluster_col].astype('category')
        elif cluster_key in self.adata.obs:
            self.adata.obs['cluster'] = self.adata.obs[cluster_key].astype('category')
        else:
            raise ValueError(f"No cluster information found. Provide cluster_file or ensure '{cluster_key}' exists in adata.obs")
        
        fig, ax = plt.subplots(figsize=figsize)
        
        # Get coordinates
        spatial = self.adata.obsm['spatial']
        if 'tissue_lowres_scalef' in self.scale_factors:
            scale = self.scale_factors['tissue_lowres_scalef']
            spatial_scaled = spatial * scale
        else:
            spatial_scaled = spatial
        
        # Get unique clusters
        clusters = self.adata.obs['cluster'].cat.categories
        n_clusters = len(clusters)
        
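        # Use distinct colors: tab20 has exactly 20 entries, so index its
        # lookup table directly; clusters beyond 20 cycle through the palette
        colors = plt.cm.tab20(np.arange(20))
        
        # Plot each cluster
        for i, cluster in enumerate(clusters):
            mask = (self.adata.obs['cluster'] == cluster).to_numpy()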
            ax.scatter(
                spatial_scaled[mask, 0],
                spatial_scaled[mask, 1],
                c=[colors[i % 20]],
                s=spot_size * 50,
                alpha=0.8,
                edgecolors='black',
                linewidths=0.3,
                label=f'Cluster {cluster}'
            )
        
        # Add tissue image
        if self.image is not None:
            ax.imshow(self.image, alpha=0.3, zorder=0)
        
        ax.legend(loc='upper right', bbox_to_anchor=(1.2, 1.0), title='Clusters')
        ax.set_title('Spatial Cluster Distribution', fontsize=14, fontweight='bold')
        ax.axis('off')
        
        plt.tight_layout()
        
        if save_path:
            plt.savefig(save_path, dpi=self.dpi, bbox_inches='tight',
                       facecolor='white', edgecolor='none')
            logger.info(f"Saved cluster plot to {save_path}")
        
        return fig
    
    def get_spatial_stats(self, gene: str) -> Dict:
        """Get spatial statistics for a gene
        
        Parameters
        ----------
        gene : str
            Gene name
            
        Returns
        -------
        dict
            Statistics dictionary"""
        if gene not in self.gene_names:
            raise ValueError(f"Gene '{gene}' not found")
        
        gene_idx = self.gene_names.index(gene)
        
        if self.adata is not None:
            expression = self.adata.X[:, gene_idx].toarray().flatten() if hasattr(self.adata.X, 'toarray') else self.adata.X[:, gene_idx]
        else:
            return {}
        
        spatial = self.adata.obsm.get('spatial', None)
        
        stats = {
            'gene': gene,
            'total_counts': int(expression.sum()),
            'mean_expression': float(expression.mean()),
            'max_expression': float(expression.max()),
            'expressed_spots': int((expression > 0).sum()),
            'detection_rate': float((expression > 0).mean()),
        }
        
        if spatial is not None:
            # Find hotspot (location with highest expression)
            hotspot_idx = expression.argmax()
            stats['hotspot_location'] = {
                'x': float(spatial[hotspot_idx, 0]),
                'y': float(spatial[hotspot_idx, 1])
            }
        
        return stats
    
    def generate_report(
        self,
        genes: List[str],
        output_file: str = "spatial_report.html"
    ) -> str:
        """Generate a comprehensive report in HTML format
        
        Parameters
        ----------
        genes : list
            List of genes to include in the report
        output_file : str
            Output HTML file name
            
        Returns
        -------
        str
            Output file path"""
        report_path = self.output_dir / output_file
        
        # Generate plots for each gene
        plot_paths = []
        for gene in genes[:6]:  # Limit to 6 genes for report
            if gene in self.gene_names:
                plot_path = self.output_dir / f"{gene}_spatial_map.png"
                self.plot_gene_spatial(gene, save_path=str(plot_path))
                plot_paths.append((gene, f"{gene}_spatial_map.png"))
        
        # Generate cluster plot if available
        cluster_plot = None
        if self.adata is not None and 'cluster' in self.adata.obs:
            cluster_plot_path = self.output_dir / "cluster_spatial_map.png"
            self.plot_cluster_spatial(save_path=str(cluster_plot_path))
            cluster_plot = "cluster_spatial_map.png"
        
        # Build HTML
        html_content = f"""
<!DOCTYPE html>
<html>
<head>
    <title>Spatial Transcriptomics Report</title>
    <style>
        body {{ font-family: Arial, sans-serif; margin: 40px; background-color: #f5f5f5; }}
        h1 {{ color: #333; border-bottom: 3px solid #4CAF50; padding-bottom: 10px; }}
        h2 {{ color: #555; margin-top: 30px; }}
        .gene-grid {{ display: grid; grid-template-columns: repeat(auto-fit, minmax(400px, 1fr)); gap: 20px; margin-top: 20px; }}
        .gene-card {{ background: white; border-radius: 8px; padding: 15px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); }}
        .gene-card img {{ width: 100%; height: auto; border-radius: 4px; }}
        .gene-name {{ font-size: 18px; font-weight: bold; color: #4CAF50; margin-bottom: 10px; }}
        .stats {{ background: white; padding: 20px; border-radius: 8px; margin-top: 20px; }}
        table {{ width: 100%; border-collapse: collapse; }}
        th, td {{ padding: 12px; text-align: left; border-bottom: 1px solid #ddd; }}
        th {{ background-color: #4CAF50; color: white; }}
        .info {{ background: #e3f2fd; padding: 15px; border-radius: 8px; margin-bottom: 20px; }}
    </style>
</head>
<body>
    <h1>🧬 Spatial Transcriptomics Analysis Report</h1>
    
    <div class="info">
        <strong>Platform:</strong> {self.platform.upper()}<br>
        <strong>Sample:</strong> {self.data_dir.name}<br>
        <strong>Total Genes:</strong> {len(self.gene_names)}<br>
        {f"<strong>Total Spots:</strong> {self.adata.n_obs}" if self.adata is not None else ""}
    </div>
    
    <h2>📊 Gene Expression Maps</h2>
    <div class="gene-grid">
"""
        
        for gene, plot_file in plot_paths:
            stats = self.get_spatial_stats(gene)
            html_content += f"""
        <div class="gene-card">
            <div class="gene-name">{gene}</div>
            <img src="{plot_file}" alt="{gene} expression">
            <div style="margin-top: 10px; font-size: 12px; color: #666;">
                Expression: {stats.get('total_counts', 'N/A')} counts | 
                Detection rate: {stats.get('detection_rate', 0):.1%}
            </div>
        </div>
"""
        
        html_content += """
    </div>
"""
        
        if cluster_plot:
            html_content += f"""
    <h2>🔬 Spatial Clusters</h2>
    <div class="gene-card">
        <img src="{cluster_plot}" alt="Cluster map" style="max-width: 600px;">
    </div>
"""
        
        # Add statistics table
        html_content += """
    <h2>📈 Expression Statistics</h2>
    <div class="stats">
        <table>
            <tr>
                <th>Gene</th>
                <th>Total Counts</th>
                <th>Mean Expression</th>
                <th>Max Expression</th>
                <th>Detection Rate</th>
            </tr>
"""
        
        for gene in genes[:10]:
            if gene in self.gene_names:
                stats = self.get_spatial_stats(gene)
                html_content += f"""
            <tr>
                <td>{gene}</td>
                <td>{stats.get('total_counts', 0):,}</td>
                <td>{stats.get('mean_expression', 0):.4f}</td>
                <td>{stats.get('max_expression', 0):.1f}</td>
                <td>{stats.get('detection_rate', 0):.1%}</td>
            </tr>
"""
        
        html_content += """
        </table>
    </div>
    
    <footer style="margin-top: 40px; text-align: center; color: #999; font-size: 12px;">
        Generated by Spatial Transcriptomics Mapper v1.0
    </footer>
</body>
</html>
"""
        
        with open(report_path, 'w', encoding='utf-8') as f:
            f.write(html_content)
        
        logger.info(f"Generated HTML report: {report_path}")
        return str(report_path)


def parse_args() -> argparse.Namespace:
    """Parse command line parameters"""
    parser = argparse.ArgumentParser(
        description="Spatial Transcriptomics Mapper - Project gene expression back to tissue section images",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""Examples:
  # Visium single gene analysis
  python main.py --platform visium --data-dir ./visium/outs --gene PIK3CA --output ./results/

  # Xenium multi-gene analysis
  python main.py --platform xenium --data-dir ./xenium/outs --genes SFTPB,SFTPC --mode multi

  # Spatial clustering visualization
  python main.py --platform visium --data-dir ./data/outs --cluster-file ./clusters.csv"""
    )
    
    # Required arguments
    parser.add_argument(
        "--platform",
        type=str,
        required=True,
        choices=["visium", "xenium"],
        help="Spatial transcriptomics platform"
    )
    parser.add_argument(
        "--data-dir",
        type=str,
        required=True,
        help="Data directory path"
    )
    
    # Gene selection
    gene_group = parser.add_mutually_exclusive_group()
    gene_group.add_argument(
        "--gene",
        type=str,
        help="Single gene name"
    )
    gene_group.add_argument(
        "--genes",
        type=str,
        help="Comma-separated list of genes (e.g. PIK3CA,PTEN,EGFR)"
    )
    
    # Analysis options
    parser.add_argument(
        "--mode",
        type=str,
        default="single",
        choices=["single", "overlay", "multi", "cluster"],
        help="Analysis mode (default: single)"
    )
    parser.add_argument(
        "--cluster-file",
        type=str,
        help="Path to clustering results CSV (columns: barcode,cluster)"
    )
    
    # Output options
    parser.add_argument(
        "--output",
        type=str,
        default="./output",
        help="Output directory (default: ./output)"
    )
    parser.add_argument(
        "--dpi",
        type=int,
        default=300,
        help="Output image DPI (default: 300)"
    )
    parser.add_argument(
        "--cmap",
        type=str,
        default="viridis",
        help="Colormap name (default: viridis)"
    )
    
    # Visualization options
    parser.add_argument(
        "--spot-size",
        type=float,
        default=1.0,
        help="Visium spot size factor (default: 1.0)"
    )
    parser.add_argument(
        "--alpha",
        type=float,
        default=0.8,
        help="Transparency 0-1 (default: 0.8)"
    )
    parser.add_argument(
        "--min-count",
        type=int,
        default=0,
        help="Minimum expression filter (default: 0)"
    )
    
    # Advanced options
    parser.add_argument(
        "--crop",
        type=str,
        help="Cropping area (format: x1,y1,x2,y2)"
    )
    parser.add_argument(
        "--report",
        action="store_true",
        help="Generate HTML report"
    )
    parser.add_argument(
        "-v", "--verbose",
        action="store_true",
        help="Verbose output"
    )
    
    return parser.parse_args()


def main():
    """Entry point: parse arguments, load the data, and run the selected analysis."""
    args = parse_args()
    
    # Set log level
    if args.verbose:
        logging.getLogger().setLevel(logging.DEBUG)
    
    # Validate data directory
    if not os.path.isdir(args.data_dir):
        logger.error(f"Data directory not found: {args.data_dir}")
        sys.exit(1)
    
    # Initialize mapper
    try:
        mapper = SpatialMapper(
            platform=args.platform,
            data_dir=args.data_dir,
            output_dir=args.output,
            dpi=args.dpi,
            cmap=args.cmap
        )
        
        # Load data
        logger.info("Loading data...")
        mapper.load_data()
        
        # Determine genes to analyze
        genes_to_analyze = []
        if args.gene:
            genes_to_analyze = [args.gene]
        elif args.genes:
            # Drop empty entries produced by stray or trailing commas
            genes_to_analyze = [g.strip() for g in args.genes.split(",") if g.strip()]
        
        # Execute based on mode
        if args.mode == "cluster" or args.cluster_file:
            # Cluster visualization mode
            logger.info("Generating cluster spatial map...")
            output_path = Path(args.output) / "cluster_spatial_map.png"
            mapper.plot_cluster_spatial(
                cluster_file=args.cluster_file,
                save_path=str(output_path),
                spot_size=args.spot_size
            )
        
        elif args.mode == "multi" and len(genes_to_analyze) > 1:
            # Multi-gene grid mode
            logger.info(f"Generating multi-gene plot for: {genes_to_analyze}")
            output_path = Path(args.output) / "multi_gene_grid.png"
            mapper.plot_multi_genes(
                genes=genes_to_analyze,
                mode="grid",
                save_path=str(output_path),
                spot_size=args.spot_size,
                alpha=args.alpha
            )
        
        elif args.mode == "overlay" and len(genes_to_analyze) > 1:
            # Overlay mode
            logger.info(f"Generating overlay plot for: {genes_to_analyze}")
            output_path = Path(args.output) / "multi_gene_overlay.png"
            mapper.plot_multi_genes(
                genes=genes_to_analyze,
                mode="overlay",
                save_path=str(output_path),
                spot_size=args.spot_size,
                alpha=args.alpha
            )
        
        elif genes_to_analyze:
            # Single gene mode
            for gene in genes_to_analyze:
                if gene in mapper.gene_names:
                    logger.info(f"Processing gene: {gene}")
                    output_path = Path(args.output) / f"{gene}_spatial_map.png"
                    mapper.plot_gene_spatial(
                        gene=gene,
                        save_path=str(output_path),
                        spot_size=args.spot_size,
                        alpha=args.alpha
                    )
                    
                    # Print stats
                    stats = mapper.get_spatial_stats(gene)
                    logger.info(f"  Total counts: {stats.get('total_counts', 'N/A')}")
                    logger.info(f"  Detection rate: {stats.get('detection_rate', 0):.1%}")
                else:
                    logger.warning(f"Gene '{gene}' not found in dataset")
                    logger.info(f"Available genes (first 10): {mapper.gene_names[:10]}")
        
        else:
            logger.info("No genes specified. Use --gene or --genes to specify genes to analyze.")
            logger.info(f"Dataset contains {len(mapper.gene_names)} genes")
        
        # Generate HTML report if requested
        if args.report and genes_to_analyze:
            logger.info("Generating HTML report...")
            report_path = mapper.generate_report(genes=genes_to_analyze)
            logger.info(f"Report saved to: {report_path}")
        
        logger.info("Analysis complete!")
        
    except Exception as e:
        logger.error(f"Error during analysis: {e}")
        if args.verbose:
            import traceback
            traceback.print_exc()
        sys.exit(1)


if __name__ == "__main__":
    main()

FILE:scripts/__init__.py
# Spatial Transcriptomics Mapper

from .main import SpatialMapper

__version__ = "1.0.0"
__all__ = ["SpatialMapper"]
