@clawhub-xf280230439-netizen-d2584fb7d5
---
name: workwork
description: Academic review writer and formatting assistant for Chinese academic papers. Use this skill when users need to format, check, and refine academic literature reviews, reference lists, or generate compliant Word documents according to Chinese academic standards.
---
# Workwork - Academic Review Writing & Formatting Assistant
## Purpose
This skill provides comprehensive tools for academic review writing and formatting, specifically optimized for Chinese academic standards. It automates reference checking, format validation, document generation, and quality control processes to help researchers produce compliant, professional academic papers efficiently.
## When to Use This Skill
Use this skill when:
- Writing or formatting academic literature reviews
- Checking reference list formats and citations
- Validating document structure according to Chinese academic standards
- Generating Word documents with proper formatting
- Filtering references by journal quality (e.g., removing non-core journals)
- Fixing reference numbering and citation consistency
- Proofreading for typos and grammar errors
## Core Capabilities
This skill provides **eleven** functional modules accessible through scripts in the `scripts/` directory:
### 1. Literature Integrity Checker ⭐ NEW (CRITICAL)
- **Script**: `literature_integrity_checker.py`
- **Purpose**: Comprehensive validation of reference integrity and citation consistency
- **What it checks**:
- Reference completeness (authors, title, journal, year, pages)
- Citation consistency (all citations have corresponding references)
- Reference numbering continuity
- Duplicate references
- Citation format compliance
- Invalid/orphan citations
- **Consecutive duplicate citations** 🆕 NEW: Detects when the same reference is cited in 3+ consecutive sentences
- **Critical Feature**: Returns non-zero exit code on critical errors, preventing submission with errors
- **Auto-Open Feature** 🆕 NEW: Automatically opens the final Word document (`your_review_上标版.docx`) after successful check (no critical issues)
- **Usage**:
```bash
# Standard check with auto-open enabled (default)
python scripts/literature_integrity_checker.py your_review.md
# With deep verification
python scripts/literature_integrity_checker.py your_review.md --deep-check
# Disable auto-open
python scripts/literature_integrity_checker.py your_review.md --no-auto-open
```
- **Output**: `your_review_literature_integrity_report.md`
- **Note**: When check passes (no critical issues), the script automatically searches for and opens `your_review_上标版.docx` or `your_review.docx`
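Two of the checks listed above, numbering continuity and duplicate detection, can be sketched as follows. This is a minimal illustration, not the actual logic of `literature_integrity_checker.py`: the helper name `check_reference_list` and the assumption that each reference is a line of the form `[n] text` are hypothetical.

```python
import re

# Assumed reference-line format: "[n] reference text"
REF_LINE = re.compile(r"^\[(\d+)\]\s*(.+)$")

def check_reference_list(lines):
    """Return problem messages for a reference list (hypothetical helper)."""
    problems = []
    seen_text = {}   # reference text -> first number it appeared under
    expected = 1
    for line in lines:
        match = REF_LINE.match(line.strip())
        if not match:
            continue  # not a reference entry
        number, text = int(match.group(1)), match.group(2).strip()
        if number != expected:
            problems.append(f"numbering gap: expected [{expected}], found [{number}]")
        expected = number + 1
        if text in seen_text:
            problems.append(f"duplicate: [{number}] repeats [{seen_text[text]}]")
        else:
            seen_text[text] = number
    return problems
```

An empty return list corresponds to a clean reference list; a real checker would likely also normalize whitespace and punctuation before comparing entries.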
### 2. Unified Checker (RECOMMENDED - Most Efficient)
- **Script**: `unified_checker.py`
- **Purpose**: Run all checks at once in a single operation
- **What it does**:
- Reference format validation
- Citation accuracy checking
- Typo and grammar detection
- Document format compliance checking
- **Usage**:
```bash
python scripts/unified_checker.py your_review.md
```
- **Output**: `your_review_unified_report.md`
### 3. Reference Formatter
- **Script**: `reference_formatter.py`
- **Purpose**: Validate and format reference citations and bibliography
- **What it checks**:
- Citation marker positioning (before punctuation)
- Author name formatting (adding "等" when exceeding limit)
- Reference numbering continuity
- Duplicate reference numbers
- Figure/table citation markers
- **Configuration**: Edit `templates/ref_format_default.yml` for custom rules
- **Usage**:
```bash
python scripts/reference_formatter.py your_review.md
python scripts/reference_formatter.py your_review.md custom_config.yml
```
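The author-truncation and marker-position rules can be illustrated with a short sketch. The function names, the default limit of three authors, and the regular expression are assumptions chosen for illustration; the real behavior is controlled by `templates/ref_format_default.yml` and may differ.

```python
import re

def format_authors(authors, max_authors=3, suffix="等"):
    """Join author names, appending the suffix when the list exceeds the limit."""
    if len(authors) <= max_authors:
        return ", ".join(authors)
    return ", ".join(authors[:max_authors]) + ", " + suffix

# A citation marker that directly follows Chinese sentence punctuation is
# treated as misplaced, because markers are expected before the punctuation.
MISPLACED = re.compile(r"[。,;!?]\s*(\[\d+(?:[-,]\d+)*\])")

def find_misplaced_markers(text):
    """Return citation markers that appear after punctuation instead of before it."""
    return [match.group(1) for match in MISPLACED.finditer(text)]
```

For example, `find_misplaced_markers` would flag the marker in `结果显著。[3]` but not the one in `另一项研究[4]。`.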
### 4. Typo & Grammar Checker
- **Script**: `typo_grammar_checker.py`
- **Purpose**: Detect common typos, grammar errors, punctuation issues
- **What it checks**:
- Common Chinese typos
- Mismatched quotes/brackets
- Punctuation usage
- Number formatting
- **Usage**:
```bash
python scripts/typo_grammar_checker.py your_review.md
```
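A mismatched quote/bracket check is commonly implemented with a stack. The sketch below shows that idea under stated assumptions: the pair table and the function name `find_unbalanced` are hypothetical and not taken from `typo_grammar_checker.py`.

```python
# Opening character -> expected closing character (assumed pair table)
PAIRS = {"(": ")", "[": "]", "(": ")", "《": "》", "“": "”", "【": "】"}

def find_unbalanced(text):
    """Return sorted (position, character) tuples for unmatched brackets or quotes."""
    closers = set(PAIRS.values())
    stack, problems = [], []
    for pos, ch in enumerate(text):
        if ch in PAIRS:
            stack.append((pos, ch))           # remember the opener
        elif ch in closers:
            if stack and PAIRS[stack[-1][1]] == ch:
                stack.pop()                   # matched the most recent opener
            else:
                problems.append((pos, ch))    # closer without a matching opener
    problems.extend(stack)                    # openers that were never closed
    return sorted(problems)
```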
### 5. Reference Accuracy Checker
- **Script**: `reference_accuracy_checker.py`
- **Purpose**: Verify citation validity and reference integrity
- **What it checks**:
- Invalid citations (referencing non-existent references)
- Orphaned references (not cited in text)
- Reference numbering continuity
- Citation numbering in text
- Citation marker positioning
- **Usage**:
```bash
python scripts/reference_accuracy_checker.py your_review.md
```
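The invalid-citation and orphaned-reference checks amount to comparing two sets of numbers. The following is a minimal sketch, assuming citations look like `[1]`, `[2-3]`, or `[1,4]` and that `body_text` excludes the reference list itself; `expand_citation` and `cross_check` are hypothetical names.

```python
import re

# Matches [1], [2-3], [1,4] and combinations such as [1,3-5]
CITATION = re.compile(r"\[(\d+(?:\s*[-,]\s*\d+)*)\]")

def expand_citation(body):
    """Expand a marker body such as '2-4,7' into {2, 3, 4, 7}."""
    numbers = set()
    for part in body.split(","):
        bounds = [int(p) for p in part.split("-")]
        numbers.update(range(bounds[0], bounds[-1] + 1))
    return numbers

def cross_check(body_text, reference_count):
    """Return (invalid citation numbers, reference numbers never cited)."""
    cited = set()
    for match in CITATION.finditer(body_text):
        cited |= expand_citation(match.group(1))
    valid = set(range(1, reference_count + 1))
    return sorted(cited - valid), sorted(valid - cited)
```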
### 6. Document Format Checker
- **Script**: `document_format_checker.py`
- **Purpose**: Validate document structure and formatting
- **What it checks**:
- Main title presence and formatting
- Abstract and keywords sections
- Chapter structure and numbering
- References section
- Paragraph formatting
- Heading hierarchy
- List formatting
- **Usage**:
```bash
python scripts/document_format_checker.py your_review.md
```
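As one example of a structural check, a heading-hierarchy validation can be sketched as below. It assumes ATX-style Markdown headings (`#`, `##`, ...) and flags any heading that skips a level; the function name is hypothetical, and the sketch does not skip fenced code blocks, so shell comments inside fences would need to be filtered out first.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def find_heading_jumps(markdown_text):
    """Return (line_number, title) for headings that skip a level, e.g. # then ###."""
    jumps = []
    previous_level = 0
    for line_number, line in enumerate(markdown_text.splitlines(), start=1):
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        if level > previous_level + 1:
            jumps.append((line_number, match.group(2)))
        previous_level = level
    return jumps
```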
### 7. Reference Filter
- **Script**: `filter_references.py`
- **Purpose**: Filter references based on journal quality lists
- **What it does**: Identifies and removes references from non-core journals
- **Setup required**: Edit the script's journal lists before running
- **Usage**:
```bash
python scripts/filter_references.py
```
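The filtering step can be thought of as a membership test against a journal list. A minimal sketch, assuming each reference has already been parsed into a dict with a `journal` key (the dict shape and the function name are assumptions, and the journal names in any real list must be supplied by the user):

```python
def filter_by_journal(references, core_journals):
    """Split references into (kept, removed) by membership in the core journal set."""
    kept, removed = [], []
    for ref in references:
        target = kept if ref["journal"] in core_journals else removed
        target.append(ref)
    return kept, removed
```

Returning the removed entries alongside the kept ones makes it easy to review what was dropped before renumbering.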
### 8. Reference Number Fixer
- **Script**: `extract_and_fix_references.py`
- **Purpose**: Renumber references and update all citations in text
- **What it does**:
- Extracts all references
- Renumbers them sequentially
- Updates all citation markers in text
- Removes invalid citations
- **Warning**: Directly modifies the source file—backup first
- **Usage**:
```bash
python scripts/extract_and_fix_references.py
```
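The renumbering logic described above can be sketched as an old-to-new mapping applied to every citation marker. This is an illustration under stated assumptions, not the script's actual code: `renumber` is a hypothetical name, references are passed as a dict keyed by their old numbers, and only single and comma-separated markers (`[3]`, `[3,5]`) are handled, not ranges.

```python
import re

def renumber(body_text, references):
    """references: {old_number: entry}. Returns (new_body, new_reference_lines)."""
    # Surviving references are renumbered sequentially in their old order.
    mapping = {old: new for new, old in enumerate(sorted(references), start=1)}

    def rewrite(match):
        numbers = [int(n) for n in match.group(1).split(",")]
        kept = [str(mapping[n]) for n in numbers if n in mapping]
        # A marker whose references were all removed is dropped entirely.
        return "[" + ",".join(kept) + "]" if kept else ""

    new_body = re.sub(r"\[(\d+(?:,\d+)*)\]", rewrite, body_text)
    new_refs = [f"[{mapping[old]}] {references[old]}" for old in sorted(references)]
    return new_body, new_refs
```

Because this sketch returns new text instead of writing to disk, it can be tried safely on a copy before running the real script on the source file.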
### 9. Word Document Generator
- **Script**: `create_word_doc_v3.js` (Node.js)
- **Purpose**: Generate Word documents with Chinese academic formatting
- **What it produces**:
- Font: SimSun (宋体) for Chinese, Times New Roman for English
- Font size: 小四 for body, 三号 for headings
- Line spacing: 1.5倍
- Margins: 2.54cm
- **Requirements**: Node.js 14+, `docx` npm package
- **Usage**:
```bash
node scripts/create_word_doc_v3.js
```
### 10. Word Document with Superscript Citations
- **Script**: `create_word_with_superscript.js` (Node.js)
- **Purpose**: Generate Word documents with **superscript reference citations** (右上标格式)
- **Important**: Citation format is `[n]` (with brackets) in superscript - e.g., `[1]`, `[2-3]`
- **Features**:
- **Smart Font Handling**: Chinese text uses SimSun (宋体), English/numbers/punctuation use Times New Roman
- **Superscript Citations**: Body text citations appear as superscript: `[¹]`, `[²⁻³]`
- **Italic Scientific Names**: Automatically detects and italicizes species Latin names (e.g., *Rhinopithecus bieti*)
- **Font Size**: Body text and references use 小四 (12pt), headings use appropriate sizes
- **Auto-Open**: Automatically opens the generated document after creation
- **What it produces**:
- Body text citations appear as superscript: `[¹]`, `[²⁻³]`
- Reference list entries remain normal format (not superscript)
- All other formatting same as standard generator
- **Requirements**: Node.js 14+, `docx` npm package
- **Usage**:
```bash
node scripts/create_word_with_superscript.js your_review.md
```
- **Output**: `your_review_上标版.docx` (auto-opens after generation)
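The mixed-font handling can be pictured as splitting text into runs and assigning a font to each run. The real generator is a Node.js script; the Python sketch below only illustrates the splitting idea, and it assumes that "Chinese text" means characters in the CJK Unified Ideographs range (U+4E00 to U+9FFF), with everything else set in Times New Roman. The function name is hypothetical.

```python
import re

CJK = r"\u4e00-\u9fff"  # assumed range for "Chinese text"
RUNS = re.compile(f"[{CJK}]+|[^{CJK}]+")

def split_font_runs(text):
    """Split text into (font_name, chunk) runs of Chinese and non-Chinese characters."""
    runs = []
    for match in RUNS.finditer(text):
        chunk = match.group(0)
        is_chinese = "\u4e00" <= chunk[0] <= "\u9fff"
        runs.append(("SimSun" if is_chinese else "Times New Roman", chunk))
    return runs
```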
### 11. Citation Pattern Analyzer ⭐ NEW
- **Script**: `analyze_citation_pattern.py`
- **Purpose**: Detect "whole-paragraph citation split into per-sentence annotations" pattern
- **What it checks**:
- Same reference cited in 2+ consecutive sentences within a paragraph
- "Serious" flag for 3+ consecutive sentences
- "Warning" flag for 2 consecutive sentences
- Distinguishes valid total-then-detail structures from unnecessary repetition
- **Academic rule**: When an entire paragraph derives from one source, cite only at the first or last sentence, not every sentence
- **Usage**:
```bash
python scripts/analyze_citation_pattern.py your_review.md
```
- **Output**: `your_review_citation_pattern_report.md`
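The consecutive-citation rule can be sketched as a run-length count over the sentences of one paragraph. This is a simplified illustration, not the analyzer's actual code: it assumes sentences end with `。`, `!`, or `?`, that markers are single numbers such as `[1]`, and it does not attempt to recognize valid total-then-detail structures. The function name is hypothetical.

```python
import re

SENTENCE_END = re.compile(r"(?<=[。!?])")   # split just after sentence punctuation
CITATION = re.compile(r"\[(\d+)\]")

def find_citation_runs(paragraph):
    """Return sorted (reference_number, run_length, level) for repeated consecutive citations."""
    sentences = [s for s in SENTENCE_END.split(paragraph) if s.strip()]
    findings = []
    runs = {}  # reference number -> length of the current consecutive run
    for sentence in sentences + [""]:       # empty sentinel flushes open runs
        cited = set(CITATION.findall(sentence))
        for ref in list(runs):
            if ref not in cited:
                length = runs.pop(ref)
                if length >= 2:
                    level = "serious" if length >= 3 else "warning"
                    findings.append((int(ref), length, level))
        for ref in cited:
            runs[ref] = runs.get(ref, 0) + 1
    return sorted(findings)
```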
## Recommended Workflow
For most use cases, follow this 9-step workflow:
### Step 1: Write Review
Create your academic review in Markdown format: `your_review.md`
### Step 2: Run Literature Integrity Check ⭐ CRITICAL
**This is the most important check.** Run the literature integrity checker to confirm that every citation and reference is correct:
```bash
python scripts/literature_integrity_checker.py your_review.md
```
This takes ~2 minutes and generates `your_review_literature_integrity_report.md`
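**⚠️ Important**: This check detects the following critical errors:
- Invalid citations (citation numbers beyond the range of the reference list)
- Orphan citations (cited in the body but missing from the reference list)
- Non-continuous reference numbering
- Duplicate references
- Incomplete reference information (missing authors, title, journal, year, etc.)
**If critical issues are found, the script returns a non-zero exit code to block further steps.**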
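Because the checker signals critical errors through its exit code, a wrapper can use that code as a gate before later steps. A minimal sketch, assuming only that a non-zero exit code means failure; `run_gate` is a hypothetical helper and the commented usage shows how it might wrap the checker.

```python
import subprocess
import sys

def run_gate(command):
    """Run a checker command; return True only when it exits with code 0."""
    result = subprocess.run(command)
    return result.returncode == 0

# Hypothetical usage: stop the pipeline when the integrity check fails.
# if not run_gate([sys.executable, "scripts/literature_integrity_checker.py", "your_review.md"]):
#     sys.exit("Integrity check failed; fix the critical issues in the report first.")
```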
### Step 3: Run Unified Check
Execute unified checker to run all validations at once:
```bash
python scripts/unified_checker.py your_review.md
```
This takes ~5 minutes and generates `your_review_unified_report.md`
### Step 4: Review Report
Read the unified report:
```bash
cat your_review_unified_report.md
```
The report contains:
- Problem statistics (total/critical/warnings/info)
- Detailed issues by category
- Fix recommendations
### Step 5: Fix Issues
Edit `your_review.md` based on report findings:
- Fix critical issues (🔴 must fix) first
- Then warnings (🟡 should fix)
- Finally info items (🔵 optional)
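The fix order above is a simple priority sort. As a sketch, assuming each issue is a dict with a `severity` key set to `critical`, `warning`, or `info` (this data shape and both function names are assumptions, not the report's actual format):

```python
from collections import Counter

SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}

def prioritize(issues):
    """Sort issues so critical ones come first; ties keep their original order."""
    return sorted(issues, key=lambda issue: SEVERITY_ORDER[issue["severity"]])

def summarize(issues):
    """Count issues per severity level, including levels with zero issues."""
    counts = Counter(issue["severity"] for issue in issues)
    return {level: counts.get(level, 0) for level in SEVERITY_ORDER}
```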
### Step 6: Filter References (Optional)
If you need to remove non-core journals:
1. Edit `scripts/filter_references.py` to set journal lists
2. Run: `python scripts/filter_references.py`
### Step 7: Fix Reference Numbers
If you removed references, renumber them:
```bash
python scripts/extract_and_fix_references.py
```
**Warning**: This modifies the source file directly—backup first!
### Step 8: Final Verification
Quick verification:
```bash
python scripts/simple_verify.py
```
**Run the literature integrity check again (recommended)**:
```bash
python scripts/literature_integrity_checker.py your_review.md
```
✨ **New**: If the check passes (no critical issues), the script automatically opens `your_review_上标版.docx`.
### Step 9: Generate Word Document
Create the final formatted Word document:
**Standard format** (inline citations):
```bash
node scripts/create_word_doc_v3.js
```
**Superscript format** (右上标引用) - RECOMMENDED for Chinese academic papers:
```bash
node scripts/create_word_with_superscript.js your_review.md
```
⚠️ **Important**:
- Superscript citations include brackets: `[1]`, `[2-3]` appear as superscript
- Automatically handles mixed Chinese/English fonts (宋体 for Chinese, Times New Roman for English/numbers)
- Auto-italicizes scientific names (e.g., *Rhinopithecus bieti*)
- Document auto-opens after generation
Output: `your_review_上标版.docx`
---
### 🚀 Quick Workflow (Recommended)
For a new literature review, use the following quick workflow:
```bash
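# 1. Run the literature integrity check (opens the Word document automatically)
python scripts/literature_integrity_checker.py your_review.md
# Once the check passes, the Word document opens automatically; no manual step is needed
# If issues are found, fix them and run the check again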
```
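The auto-open behavior, which the version notes describe as supporting Windows, macOS, and Linux, is typically implemented by dispatching on the platform. The sketch below is one common way to do this and is not taken from the script itself; the helper names are hypothetical, while `os.startfile`, `open`, and `xdg-open` are the standard mechanisms on each platform.

```python
import os
import subprocess
import sys
from pathlib import Path

def candidate_documents(markdown_path):
    """Return the output names to look for, superscript version first."""
    stem = Path(markdown_path).with_suffix("")
    return [Path(f"{stem}_上标版.docx"), Path(f"{stem}.docx")]

def open_command(path, platform=sys.platform):
    """Return the opener command for macOS/Linux, or None on Windows."""
    if platform.startswith("win"):
        return None  # Windows uses os.startfile rather than a command
    return ["open", path] if platform == "darwin" else ["xdg-open", path]

def open_document(path):
    """Open a document with the platform's default application."""
    command = open_command(str(path))
    if command is None:
        os.startfile(str(path))  # available on Windows only
    else:
        subprocess.run(command, check=False)
```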
## Alternative Workflows
### Complete Workflow (Step-by-Step)
Use when you need detailed information about each check type. Run each checker individually:
1. `reference_formatter.py`
2. `typo_grammar_checker.py`
3. `reference_accuracy_checker.py`
4. `document_format_checker.py`
View each report separately for detailed analysis.
### Quick Fix Workflow
Use when you've only made minor edits after previous checks. Run only the relevant checker:
- Fixed citations? → `reference_accuracy_checker.py`
- Fixed typos? → `typo_grammar_checker.py`
- Fixed formatting? → `document_format_checker.py`
- Fixed reference format? → `reference_formatter.py`
## Important Notes
### Avoid Redundant Checks
- ❌ **Wrong**: Run unified check → fix → run unified check again → fix again...
- ✅ **Right**: Run unified check → fix → targeted recheck → generate Word
The unified checker is designed to be run once per major editing session. After fixing issues, only run the specific checkers relevant to what you changed.
### File Backup
- Scripts that modify source files: `extract_and_fix_references.py`, `filter_references.py`
- Always backup before running:
```bash
cp your_review.md your_review_backup.md
```
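For repeated runs, a timestamped backup avoids overwriting an earlier copy. A minimal Python equivalent of the `cp` command above, where the `_backup_<timestamp>` naming scheme and the function name are assumptions:

```python
import shutil
import time
from pathlib import Path

def backup(path):
    """Copy a file next to itself with a timestamped name and return the new path."""
    source = Path(path)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    target = source.with_name(f"{source.stem}_backup_{stamp}{source.suffix}")
    shutil.copy2(source, target)  # copy2 also preserves file metadata
    return target
```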
### Configuration Files
- Main config: `templates/ref_format_default.yml`
- Edit this file to customize:
- Citation style (superscript vs inline)
- Max authors before adding "等"
- Journal filtering rules
- Quality check options
### Dependencies
- Python 3.7+
- Node.js 14+
- Python packages: `pyyaml`, `python-docx` (optional)
- Node.js package: `docx`
## Example Use Cases
### Case 1: First Draft - Full Check
```bash
# Step 1: Run literature integrity check (CRITICAL)
python scripts/literature_integrity_checker.py my_review.md
# Step 2: Run unified check on new draft
python scripts/unified_checker.py my_review.md
# Review and fix issues (open editor)
# ...
# Step 3: Generate Word document
node scripts/create_word_with_superscript.js my_review.md
# Total time: ~12 minutes
```
### Case 2: Reference Numbering Fix
```bash
# Backup first
cp my_review.md my_review_backup.md
# Remove non-core journals
python scripts/filter_references.py
# Renumber and update citations
python scripts/extract_and_fix_references.py
# Verify
python scripts/simple_verify.py
# Generate Word
node scripts/create_word_doc_v3.js
```
### Case 3: Minor Edits - Quick Fix
```bash
# After fixing a few typos manually, just recheck
python scripts/typo_grammar_checker.py my_review.md
# Regenerate Word
node scripts/create_word_doc_v3.js
# Total time: ~3 minutes
```
## Troubleshooting
### Python encoding issues on Windows
If you encounter GBK vs UTF-8 encoding errors:
- Ensure scripts are saved with UTF-8 encoding
- Use `encoding='utf-8'` in file operations
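On Windows, Python may default to a locale encoding such as GBK when no encoding is given, which is the usual source of these errors. Passing the encoding explicitly avoids the problem; the helper names below are illustrative only.

```python
from pathlib import Path

def read_review(path):
    """Read a Markdown file as UTF-8 regardless of the platform default encoding."""
    return Path(path).read_text(encoding="utf-8")

def write_report(path, text):
    """Write a report as UTF-8 so Chinese text round-trips on every platform."""
    Path(path).write_text(text, encoding="utf-8")
```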
### Node.js docx package not found
```bash
npm install docx
```
### Missing YAML configuration
Copy `templates/ref_format_default.yml` to your working directory and edit it.
## Additional Resources
See `references/` directory for:
- Detailed workflow guides
- Configuration examples
- Troubleshooting tips
- Feature checklists
See scripts directory for:
- All executable checkers and formatters
- Helper utilities for document analysis
- Validation and repair tools
## Version History
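- **v3.3.0** (2026-03-20): **Feature update: citation pattern analyzer and in-depth citation review**
- **Added**: `analyze_citation_pattern.py`, a dedicated check for whole-paragraph citations that were split into per-sentence annotations
- Detects the same citation repeated across consecutive sentences
- Flags 3 or more sentences as serious and 2 sentences as a warning
- Generates a detailed report with the original sentences and suggested fixes
- **Fixed (typical cases)**: Serious citation error types found in academic literature reviews:
- Citation number pointing to the wrong reference (e.g., a mussel-experiment citation number attached to a primate study) → verify and correct
- Citation numbers beyond the range of the reference list → remove them or replace them with valid numbers
- Cited reference content inconsistent with the surrounding text → check and correct
- Repeated annotations in the middle of a paragraph → merge into citations at the start and end of the paragraph
- **Best practice**: Mismatched citations are more dangerous than repeated citations; verify the actual content of each reference first
- **v3.2.0** (2026-03-19): **Feature update: auto-open the final document**
- **Added**: The literature integrity checker can open the Word document automatically after a passing check
- Automatically looks for the `_上标版.docx` or `.docx` file
- Opens the document only when there are no critical issues
- Cross-platform support (Windows/macOS/Linux)
- Can be disabled with the `--no-auto-open` flag
- **Updated**: SKILL.md documentation, with a description of the quick workflow
- **Improved**: Better user experience and a simpler workflow
- **v3.1.0** (2026-03-19): **Feature update: consecutive identical citation check**
- **Added**: Detection of consecutive identical citations
- Automatically detects three or more consecutive sentences in the same paragraph that cite the same reference
- Follows academic convention: when consecutive sentences in a paragraph cite the same reference, keep the citation only on the last sentence
- Reported at warning level; does not block document generation
- **Improved**: Detection logic now flags only repeated citations in adjacent sentences, avoiding false positives on the normal pattern of citing at the start and end of a paragraph
- **Updated**: Literature integrity checker feature description and version log
- **v3.0.0** (2026-03-19): **Major update: literature integrity checking system**
- **Added**: `literature_integrity_checker.py`, a comprehensive reference integrity validator
- Checks reference completeness (authors, title, journal, year, pages)
- Verifies consistency between citations and the reference list
- Detects invalid and orphan citations
- Checks reference numbering continuity
- Identifies duplicate references
- **Key feature**: Returns a non-zero exit code on critical errors, preventing submission with errors
- **Updated**: Recommended workflow now treats the literature integrity check as a critical step
- **Improved**: Stricter error detection to eliminate citation errors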
- **v2.3.0** (2026-03-19): Enhanced superscript Word generator with:
- Smart mixed font handling (Chinese SimSun + English Times New Roman)
- Auto-italicization of scientific names (Latin species names)
- Auto-open document after generation
- Body text and references use 小四 (12pt) font size
- **v2.2.0** (2026-03-19): Added superscript citation Word generator (`create_word_with_superscript.js`). **Important**: Superscript citations retain brackets `[n]` format.
- **v2.1.0** (2026-03-19): Added unified checker, optimized workflows, eliminated redundant checks
- **v2.0.0** (2026-03-19): Added typo checking, reference accuracy, document format validation
- **v1.0.0** (2026-03-19): Initial version with basic reference formatting and Word generation
FILE:references/features_checklist.md
# Workwork Features Checklist
## Core Features (9 Modules)
### Module 1: Reference Format Checker ✅
**Script:** `reference_formatter.py`
**Capabilities:**
- ✅ Check citation marker positioning
- ✅ Detect author list formatting (add "等" when exceeding limit)
- ✅ Validate reference numbering continuity
- ✅ Check duplicate reference numbers
- ✅ Verify figure/table citation markers
- ✅ Generate detailed format check reports
- ✅ Support user-defined format configuration
**Configuration:** `templates/ref_format_default.yml`
**Usage:**
```bash
python scripts/reference_formatter.py your_review.md
python scripts/reference_formatter.py your_review.md custom_config.yml
```
**Output:** `your_review_format_report.md`
---
### Module 2: Reference Manager & Filter ✅
**Script:** `filter_references.py`
**Capabilities:**
- ✅ Identify Chinese core journals
- ✅ Automatically remove non-core journal references
- ✅ Support custom journal blacklists/whitelists
- ✅ Check reference format completeness
**Usage:**
```bash
# Edit filter_references.py journal list first
python scripts/filter_references.py
```
**Note:** Requires editing journal lists before use
---
### Module 3: Reference Numbering System ✅
**Scripts:**
- `extract_and_fix_references.py` (main)
- `fix_references_precise.py` (precise fix)
**Capabilities:**
- ✅ Automatically renumber references
- ✅ Update citation numbers in text
- ✅ Detect and delete invalid citations
- ✅ Validate citation-reference correspondence
**Usage:**
```bash
python scripts/extract_and_fix_references.py
```
**Warning:** Directly modifies source file—backup first!
---
### Module 4: Format Normalization ✅
**Script:** `create_word_doc_v3.js`
**Capabilities:**
- ✅ Generate Word documents compliant with Chinese academic standards
- ✅ Set fonts (宋体/Times New Roman)
- ✅ Set font sizes (小四 for body, 三号 for headings)
- ✅ Set line spacing (1.5x)
- ✅ Set margins (2.54cm)
- ✅ Add headers/footers (optional)
**Dependencies:** Node.js 14+, `docx` npm package
**Usage:**
```bash
node scripts/create_word_doc_v3.js
```
**Output:** `your_review.docx`
---
### Module 5: Quality Control & Validation ✅
**Scripts:**
- `verify_review_comprehensive.py` (comprehensive validation)
- `simple_verify.py` (quick validation)
**Capabilities:**
- ✅ Check chapter structure completeness
- ✅ Validate citation numbering continuity
- ✅ Detect duplicate numbering
- ✅ Generate validation reports
**Usage:**
```bash
# Comprehensive validation
python scripts/verify_review_comprehensive.py
# Quick validation
python scripts/simple_verify.py
```
---
### Module 6: Typo & Grammar Checker ✅
**Script:** `typo_grammar_checker.py`
**Capabilities:**
- ✅ Check common typos
- ✅ Check grammar errors (mismatched quotes, brackets, etc.)
- ✅ Check punctuation usage
- ✅ Check number formatting
- ✅ Generate detailed check reports
**Usage:**
```bash
python scripts/typo_grammar_checker.py your_review.md
```
**Output:** `your_review_typo_grammar_report.md`
---
### Module 7: Reference Format & Citation Accuracy Checker ✅
**Script:** `reference_accuracy_checker.py`
**Capabilities:**
- ✅ Validate reference list format compliance
- ✅ Check citation accuracy (invalid citations, orphaned references)
- ✅ Validate reference numbering continuity
- ✅ Check citation numbering in text
- ✅ Check citation marker positioning
- ✅ Generate detailed check reports
**Usage:**
```bash
python scripts/reference_accuracy_checker.py your_review.md
```
**Output:** `your_review_reference_accuracy_report.md`
---
### Module 8: Document Format Checker ✅
**Script:** `document_format_checker.py`
**Capabilities:**
- ✅ Check main title
- ✅ Check abstract and keywords
- ✅ Check chapter structure and numbering
- ✅ Check references section
- ✅ Check paragraph formatting
- ✅ Check heading hierarchy
- ✅ Check list formatting
- ✅ Generate detailed check reports
**Usage:**
```bash
python scripts/document_format_checker.py your_review.md
```
**Output:** `your_review_document_format_report.md`
---
### Module 9: Unified Checker ✅ (v2.1.0 NEW)
**Script:** `unified_checker.py`
**Capabilities:**
- ✅ Integrate all check functions into single script
- ✅ Complete all checks in one operation
- ✅ Generate unified report
- ✅ Avoid redundant checks
- ✅ Improve efficiency
**Usage:**
```bash
python scripts/unified_checker.py your_review.md
```
**Output:** `your_review_unified_report.md`
---
## Configuration Templates
### Reference Format Configuration
**File:** `templates/ref_format_default.yml`
**Configuration Options:**
#### Citation Marker Style
```yaml
citation:
  style: "superscript"  # superscript or inline
  position_rules:
    before_punctuation: true
    after_quote_in_sentence: true
    before_period: true
```
#### Same Source Multi-Page Citation
```yaml
same_source_format: "[{ref}:{pages}]" # Example: [1:12-15]
```
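The `same_source_format` template uses `{ref}` and `{pages}` placeholders, which map directly onto Python's `str.format`. A minimal sketch of how such a template might be applied (the function name is hypothetical, and whether the scripts use `str.format` internally is an assumption):

```python
SAME_SOURCE_FORMAT = "[{ref}:{pages}]"  # value taken from the configuration above

def format_same_source(ref, pages, template=SAME_SOURCE_FORMAT):
    """Render a citation to specific pages of an already-cited source."""
    return template.format(ref=ref, pages=pages)
```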
#### Figure/Table References
```yaml
figure_table:
  move_to_caption: true
  inline_notes_format: "superscript_number"  # ①② / dagger
```
#### Reference List
```yaml
reference_list:
  sort_order: "order_of_appearance"
  authors:
    max_authors: 3
    beyond_limit_suffix: "等"
```
#### Quality Check
```yaml
quality_check:
  check_continuity: true
  check_duplicates: true
  check_invalid_refs: true
  check_position: true
```
#### Processing Mode
```yaml
processing:
  mode: "semi_auto"  # semi_auto or auto
  generate_report: true
  backup_original: true
```
---
## File Structure
```
workwork/
├── SKILL.md # Main skill definition
├── scripts/ # Executable scripts
│ ├── unified_checker.py # Unified checker ✨ NEW
│ ├── reference_formatter.py # Reference format check ✅
│ ├── typo_grammar_checker.py # Typo & grammar check ✅
│ ├── reference_accuracy_checker.py # Reference accuracy check ✅
│ ├── document_format_checker.py # Document format check ✅
│ ├── filter_references.py # Reference filter ✅
│ ├── extract_and_fix_references.py # Reference numbering fix ✅
│ ├── fix_references_precise.py # Precise fix ✅
│ ├── create_word_doc_v3.js # Word generation ✅
│ └── ... (other utility scripts)
├── templates/ # Configuration templates
│ └── ref_format_default.yml # Default format config ✅
└── references/ # Reference documentation
    ├── workflow_guide.md # Workflow guide ✅
    └── features_checklist.md # This file ✅
```
---
## Usage Scenarios
### Scenario 1: First Draft - Comprehensive Check
1. Run unified checker: `python scripts/unified_checker.py your_review.md`
2. Review report
3. Fix all issues
4. Filter references (optional)
5. Fix numbering
6. Final verification
7. Generate Word document
### Scenario 2: Reference Numbering Fix
1. Backup file
2. Filter references: `python scripts/filter_references.py`
3. Fix numbering: `python scripts/extract_and_fix_references.py`
4. Verify: `python scripts/simple_verify.py`
5. Generate Word: `node scripts/create_word_doc_v3.js`
### Scenario 3: Minor Edits - Quick Fix
1. Run targeted checker based on what changed
2. Review report
3. Generate Word document
---
## Output Format
### Unified Report Structure
```markdown
# Unified Check Report for [filename]
## Document Statistics
- Character count: [total]
- Reference count: [total]
- Citation count: [total]
## Problem Statistics
- Total issues: [total]
- Critical issues: [count] (must fix)
- Warning issues: [count] (should fix)
- Info issues: [count] (optional)
## Check Results
### 1. Reference Format Check
Status: ✅ / 🟡 / 🔴
Issues found: [count]
[Detailed issues...]
### 2. Citation Accuracy Check
Status: ✅ / 🟡 / 🔴
Issues found: [count]
[Detailed issues...]
### 3. Typo & Grammar Check
Status: ✅ / 🟡 / 🔴
Issues found: [count]
[Detailed issues...]
### 4. Document Format Check
Status: ✅ / 🟡 / 🔴
Issues found: [count]
[Detailed issues...]
## Recommendations
[Fix suggestions...]
```
---
## Best Practices
1. **Always Backup**: Backup important files before running scripts that modify them
2. **Stepwise Checking**: Follow check → review → fix workflow
3. **Customize Configuration**: Adjust config files for specific journal/school requirements
4. **Verify Results**: Run validation after each modification
5. **Version Control**: Use Git or similar for document version management
---
## Common Issues & Solutions
### Issue: Encoding Problems (GBK vs UTF-8)
**Solution:** Ensure scripts use `encoding='utf-8'` in file operations
### Issue: Node.js docx Package Not Found
**Solution:** Run `npm install docx`
### Issue: Missing Configuration File
**Solution:** Copy `templates/ref_format_default.yml` to working directory
---
## Dependencies
### Python
- Python 3.7+
- PyYAML (for config parsing)
- python-docx (optional, for Word processing)
### Node.js
- Node.js 14+
- docx (npm package)
### Optional Tools
- pandoc (for document conversion)
---
## Actual Use Case: Yunnan Snub-Nosed Monkey Review
| Item | Result |
|-------|--------|
| Original references | 98 |
| After filtering | 94 |
| Citation markers | 160 (all correct) |
| Numbering status | Continuous (1-94), no duplicates |
| Document length | 32,486 characters |
| Word document | 33KB, academic standard compliant |
| Quality score | ⭐⭐⭐⭐⭐ (5/5) |
| Check time | ~10 minutes (unified check) |
---
## Feature Version History
### v2.1.0 (2026-03-19) - Current Version
**New Features:**
- ✨ Added **unified checker function** (`unified_checker.py`)
- ✨ Optimized workflows with three workflow options
- ✨ Added usage guidelines and warnings
- ✨ Improved efficiency, eliminated redundant checks
**Improvements:**
- 🔧 Optimized document structure
- 🔧 Added loop check analysis report
- 🔧 Improved usage guide
**Bug Fixes:**
- 🐛 Fixed encoding issues (GBK vs UTF-8)
- 🐛 Fixed special character display issues
### v2.0.0 (2026-03-19)
**New Features:**
- ✨ Added typo and grammar checking
- ✨ Added reference format and citation accuracy checking
- ✨ Added document format compliance checking
- ✨ Enhanced all check report detail levels
- ✨ Optimized typo dictionary, reduced false positives
- ✨ Added comprehensive quality check summary report
### v1.0.0 (2026-03-19)
**Initial Version:**
- ✅ Implemented reference format checking
- ✅ Implemented reference filtering
- ✅ Implemented reference numbering repair
- ✅ Implemented Word document generation
- ✅ Implemented quality control checking
- ✅ Supported user-defined format configuration
---
## Skill Target Audience
- Academic researchers
- Graduate students, PhD candidates
- Paper writers
- Scientific researchers
## Skill Use Cases
- Writing academic review papers
- Standardizing reference formats
- Filtering and optimizing reference quality
- Checking paper formatting compliance
- Generating Word documents compliant with academic standards
---
**Document Version**: v2.1.0
**Last Updated**: 2026-03-19
**Skill Version**: v2.1.0
**Status**: ✅ Complete and ready to use
**Total Files**: 30+
**Total Code Lines**: 5000+
**Author**: WorkBuddy AI
**License**: MIT
FILE:references/workflow_guide.md
# Workwork Workflow Guide
## Complete Workflow Comparison
| Workflow | Steps | Time | Best For | Rating |
|----------|--------|------|-----------|---------|
| 🎯 Unified Check (Recommended) | 8 steps | ~10 min | First draft, comprehensive check | ⭐⭐⭐⭐⭐ |
| 📝 Step-by-Step | 11 steps | ~20 min | Detailed problem analysis | ⭐⭐⭐ |
| 🔄 Quick Fix | 3 steps | ~3 min | Minor edits | ⭐⭐⭐⭐⭐ |
## Workflow 1: Unified Check (Recommended)
**Best for:** First draft completion, full validation needed
### Process Flow
```
Step 1: Write Review
Format: Markdown
Output: your_review.md
↓
Step 2: Run Unified Check
Command: python scripts/unified_checker.py your_review.md
Automatically executes:
✓ Reference format check
✓ Citation accuracy check
✓ Typo and grammar check
✓ Document format compliance
Output: your_review_unified_report.md
Time: ~5 minutes
↓
Step 3: Review Unified Report
Command: cat your_review_unified_report.md
Report includes:
• Problem statistics (total/critical/warnings/info)
• Reference format issues details
• Citation accuracy issues details
• Typo and grammar issues details
• Document format issues details
• Fix recommendations
Time: ~2 minutes
↓
Step 4: Fix Issues
Fix order:
1. Critical issues (🔴 must fix)
2. Warning issues (🟡 should fix)
3. Info issues (🔵 optional)
Time: Depends on issue count
↓
Step 5: Filter References (Optional)
If removing non-core journals needed:
1. Edit scripts/filter_references.py journal list
2. Run: python scripts/filter_references.py
Time: ~3 minutes
↓
Step 6: Fix Reference Numbers
If references removed:
Command: python scripts/extract_and_fix_references.py
Note: Directly modifies source file—backup first
Time: ~2 minutes
↓
Step 7: Final Verification
Command: python scripts/simple_verify.py
Quick validation of fixes:
• Reference numbering continuity
• Reference completeness
Time: ~1 minute
↓
Step 8: Generate Word Document
Command: node scripts/create_word_doc_v3.js
Generates Word with academic formatting:
• Font: 宋体/Times New Roman
• Size: Body 小四, Headings 三号
• Line spacing: 1.5x
• Margins: 2.54cm
Input: your_review.md
Output: your_review.docx
Time: ~2 minutes
↓
✅ Complete!
Document ready for submission
```
### Command Example
```bash
# Enter workwork directory
cd ~/.workbuddy/skills/workwork
# Steps 1-2: Run unified check
python scripts/unified_checker.py your_review.md
# Step 3: Review report
cat your_review_unified_report.md
# Step 4: Manually fix issues (in text editor)
# Step 5: Filter references (optional)
# Edit scripts/filter_references.py then run
python scripts/filter_references.py
# Step 6: Fix numbering
python scripts/extract_and_fix_references.py
# Step 7: Final verification
python scripts/simple_verify.py
# Step 8: Generate Word document
node scripts/create_word_doc_v3.js
# Complete!
```
## Workflow 2: Step-by-Step
**Best for:** Detailed analysis of each specific issue type
### Process Flow
```
Step 1: Write Review
Format: Markdown
Input: Manual writing
Output: your_review.md
↓
Step 2: Check Reference Format
Command: python scripts/reference_formatter.py your_review.md
Output: your_review_format_report.md
Checks: Citation position, author format, numbering, duplicates
Time: ~3 minutes
↓
Step 3: Typo and Grammar Check
Command: python scripts/typo_grammar_checker.py your_review.md
Output: your_review_typo_grammar_report.md
Checks: Typos, grammar, punctuation, number format
Time: ~3 minutes
↓
Step 4: Reference Accuracy Check
Command: python scripts/reference_accuracy_checker.py your_review.md
Output: your_review_reference_accuracy_report.md
Checks: Reference format, citation validity, numbering
Time: ~3 minutes
↓
Step 5: Document Format Check
Command: python scripts/document_format_checker.py your_review.md
Output: your_review_document_format_report.md
Checks: Title, abstract, chapters, references, paragraphs
Time: ~3 minutes
↓
Step 6: Review All Check Reports
cat your_review_format_report.md
cat your_review_typo_grammar_report.md
cat your_review_reference_accuracy_report.md
cat your_review_document_format_report.md
Time: ~5 minutes
↓
Step 7: Confirm Fixes
Identify issues to fix based on reports
Time: Depends on issue count
↓
Step 8: Filter References (Optional)
Edit scripts/filter_references.py journal list
Run: python scripts/filter_references.py
Time: ~3 minutes
↓
Step 9: Fix Reference Numbers
Command: python scripts/extract_and_fix_references.py
Time: ~2 minutes
↓
Step 10: Comprehensive Quality Check
Command: python scripts/verify_review_comprehensive.py
Time: ~2 minutes
↓
Step 11: Generate Word Document
Command: node scripts/create_word_doc_v3.js
Time: ~2 minutes
↓
✅ Complete!
```
### Command Example
```bash
# Enter workwork directory
cd ~/.workbuddy/skills/workwork
# Step 2: Check reference format
python scripts/reference_formatter.py your_review.md
cat your_review_format_report.md
# Step 3: Typo and grammar check
python scripts/typo_grammar_checker.py your_review.md
cat your_review_typo_grammar_report.md
# Step 4: Reference accuracy check
python scripts/reference_accuracy_checker.py your_review.md
cat your_review_reference_accuracy_report.md
# Step 5: Document format check
python scripts/document_format_checker.py your_review.md
cat your_review_document_format_report.md
# Steps 6-7: Review reports and fix issues
# Step 8: Filter references (optional)
# Edit scripts/filter_references.py then run
python scripts/filter_references.py
# Step 9: Fix numbering
python scripts/extract_and_fix_references.py
# Step 10: Comprehensive check
python scripts/verify_review_comprehensive.py
# Step 11: Generate Word document
node scripts/create_word_doc_v3.js your_review.md
# Complete!
```
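Steps 2 to 5 run four checkers on the same file, so they can be driven from one small wrapper. The sketch below is an illustration under assumptions, not a script shipped with the skill: `build_commands` and `run_checkers` are hypothetical helpers, and the meaning of each checker's exit code is not documented here, so the wrapper only records the codes instead of interpreting them.

```python
import subprocess
import sys

# Checker scripts in the order used by Steps 2-5 of the step-by-step workflow.
CHECKERS = [
    "reference_formatter.py",
    "typo_grammar_checker.py",
    "reference_accuracy_checker.py",
    "document_format_checker.py",
]


def build_commands(review_file: str, scripts_dir: str = "scripts") -> list[list[str]]:
    """Build one argv list per checker; nothing is executed here."""
    return [[sys.executable, f"{scripts_dir}/{name}", review_file] for name in CHECKERS]


def run_checkers(review_file: str) -> dict[str, int]:
    """Run every checker in order and collect its exit code.

    check=False keeps going after a non-zero exit, so one failing
    checker does not hide the reports of the others.
    """
    results = {}
    for argv in build_commands(review_file):
        completed = subprocess.run(argv, check=False)
        results[argv[1]] = completed.returncode
    return results
```

Calling `run_checkers("your_review.md")` from the skill directory would run the same four commands as the shell example, after which the reports can be reviewed as in Step 6.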
## Workflow 3: Quick Fix
**Best for:** Minor edits made after a comprehensive check has already been run
### Process Flow
```
Step 1: Fix Issues
Based on previous errors
↓
├─ Fixed citation numbers
│ ↓
│ Run: python scripts/reference_accuracy_checker.py your_review.md
│ Time: ~2 minutes
│
├─ Fixed typos
│ ↓
│ Run: python scripts/typo_grammar_checker.py your_review.md
│ Time: ~2 minutes
│
├─ Fixed formatting
│ ↓
│ Run: python scripts/document_format_checker.py your_review.md
│ Time: ~2 minutes
│
└─ Fixed reference format
↓
Run: python scripts/reference_formatter.py your_review.md
Time: ~2 minutes
↓
Step 2: Review Check Report
Confirm fixes successful
↓
Step 3: Generate Word Document
Command: node scripts/create_word_doc_v3.js your_review.md
Time: ~2 minutes
↓
✅ Complete!
```
### Command Examples
```bash
# Enter workwork directory
cd ~/.workbuddy/skills/workwork
# Case 1: Only fixed citation numbers
python scripts/reference_accuracy_checker.py your_review.md
node scripts/create_word_doc_v3.js your_review.md
# Case 2: Only fixed typos
python scripts/typo_grammar_checker.py your_review.md
node scripts/create_word_doc_v3.js your_review.md
# Case 3: Only fixed document format
python scripts/document_format_checker.py your_review.md
node scripts/create_word_doc_v3.js your_review.md
# Case 4: Only fixed reference format
python scripts/reference_formatter.py your_review.md
node scripts/create_word_doc_v3.js your_review.md
```
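The four cases above amount to a lookup from the kind of fix to the checker that should be re-run. A minimal sketch of that lookup follows. The keys of `RECHECK_SCRIPT` and the helper `recheck_commands` are hypothetical names chosen for illustration; only the script names come from this document.

```python
import shlex

# Which checker to re-run after each kind of fix (Cases 1-4 above).
RECHECK_SCRIPT = {
    "citation_numbers": "reference_accuracy_checker.py",
    "typos": "typo_grammar_checker.py",
    "document_format": "document_format_checker.py",
    "reference_format": "reference_formatter.py",
}


def recheck_commands(fix_kinds, review_file):
    """Return shell command strings for the targeted rechecks.

    Duplicate fix kinds are collapsed and the input order is preserved,
    so each checker runs at most once.
    """
    scripts = []
    for kind in fix_kinds:
        script = RECHECK_SCRIPT[kind]
        if script not in scripts:
            scripts.append(script)
    # shlex.quote protects file names that contain spaces.
    return [f"python scripts/{s} {shlex.quote(review_file)}" for s in scripts]
```

When several kinds of fixes were made in one editing pass, this keeps the recheck targeted while still covering every affected checker.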
## Workflow Comparison
### Time Comparison
| Workflow | Steps | Check Time | Fix Time | Total Time |
|-----------|--------|-----------|-----------|------------|
| Unified Check | 8 steps | ~5 min | Variable | ~10 min |
| Step-by-Step | 11 steps | ~15 min | Variable | ~20 min |
| Quick Fix | 3 steps | ~2 min | Variable | ~3 min |
### Feature Comparison
| Feature | Unified | Step-by-Step | Quick Fix |
|---------|----------|--------------|-----------|
| One-pass check | ✅ | ❌ | ❌ |
| Detailed reports | ✅ | ✅ | ✅ |
| Flexibility | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Efficiency | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Avoids redundancy | ✅ | ❌ | N/A |
### Scenario Comparison
| Scenario | Unified | Step-by-Step | Quick Fix |
|----------|----------|--------------|-----------|
| First draft | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ❌ |
| Detailed analysis | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ❌ |
| Minor edits | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ |
| Time-critical | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ |
## Workflow Selection Guidance
### Choose Unified Check When:
- ✅ First draft just completed
- ✅ Need comprehensive validation
- ✅ Want efficient completion
- ✅ First time using the skill
### Choose Step-by-Step When:
- ✅ Need detailed information on each specific issue
- ✅ Want to understand document condition step by step
- ✅ Learning each checker's functionality
### Choose Quick Fix When:
- ✅ Only made minor edits
- ✅ Already ran comprehensive check
- ✅ Time is limited
- ✅ Only need to verify specific sections
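The selection guidance above can be read as a small decision rule. The sketch below is one possible encoding, not an official part of the skill. The function name, its parameters, and the order in which the conditions are tested are all assumptions made for illustration.

```python
def choose_workflow(
    first_draft: bool,
    full_check_done: bool,
    minor_edits_only: bool,
    want_per_checker_detail: bool = False,
) -> str:
    """Pick a workflow name from the situation described by the flags."""
    # Assumption: an explicit wish for per-checker detail wins over everything else.
    if want_per_checker_detail:
        return "Step-by-Step"
    # Quick Fix needs an earlier comprehensive check and only small changes since.
    if full_check_done and minor_edits_only and not first_draft:
        return "Quick Fix"
    # Default: first drafts and anything unchecked get the full one-pass check.
    return "Unified Check"
```

For instance, a freshly finished draft with no prior check maps to "Unified Check", while a checked document with a few corrected typos maps to "Quick Fix".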
## Important Warnings
### Avoid Redundant Checks
```
❌ Wrong approach:
Run unified check → fix issues → run unified check again → fix again...
✅ Correct approach:
Run unified check → fix issues → targeted recheck → generate Word
```
### File Backup
```
⚠️ Caution:
- extract_and_fix_references.py modifies the source file in place
- filter_references.py may modify the source file
✅ Backup command:
cp your_review.md your_review_backup.md
```
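A fixed backup name is overwritten by the next backup, which matters when the numbering and filtering scripts are run more than once. A minimal sketch of a timestamped alternative is shown below. `backup_file` is a hypothetical helper and the `_backup_<timestamp>` naming is an assumption, not a convention the skill defines.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_file(source: str) -> Path:
    """Copy source next to itself with a timestamp suffix and return the copy's path."""
    src = Path(source)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = src.with_name(f"{src.stem}_backup_{stamp}{src.suffix}")
    # copy2 copies the content and also preserves the modification time.
    shutil.copy2(src, target)
    return target
```

Running `backup_file("your_review.md")` before Step 8 or Step 9 leaves a dated copy that later runs do not replace, as long as they happen at least a second apart.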
## Check Frequency Recommendations
- **First draft complete:** Use Unified Check workflow
- **After fixes:** Use Quick Fix workflow
- **Before submission:** Use Unified Check workflow
## Example Use Case: First Draft, Full Check
```bash
# 1. Run unified check
python scripts/unified_checker.py my_review.md
# 2. Review report
cat my_review_unified_report.md
# Report shows:
# - Total issues: 5
# - Critical: 2 (must fix)
# - Warnings: 2 (should fix)
# 3. Fix issues
# Manually fix all issues in text editor
# 4. Filter references (optional)
python scripts/filter_references.py
# 5. Fix numbering
python scripts/extract_and_fix_references.py
# 6. Final verification
python scripts/simple_verify.py
# 7. Generate Word document
node scripts/create_word_doc_v3.js my_review.md
# Complete! Time: ~10 minutes
```
## Example Use Case: Fixed Citation Numbers
```bash
# 1. Run targeted check
python scripts/reference_accuracy_checker.py my_review.md
# 2. Review report
cat my_review_reference_accuracy_report.md
# Report shows: Citation numbers correct ✅
# 3. Generate Word document
node scripts/create_word_doc_v3.js my_review.md
# Complete! Time: ~3 minutes
```
---
**Document Version**: v1.0
**Last Updated**: 2026-03-19
**Skill Version**: v2.1.0
**Status**: ✅ Complete and ready to use
FILE:scripts/analyze_citation_pattern.py
# -*- coding: utf-8 -*-
"""
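Targeted check: a paragraph drawn from one reference whose citation is split up and repeated sentence by sentence.
Rule: within one paragraph, a reference number that repeats across consecutive sentences should be merged into a single citation at the end of the paragraph.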
"""
import re
import sys
def extract_paragraphs(text):
"""提取正文段落(非标题、非参考文献列表)"""
lines = text.split('\n')
paragraphs = []
in_refs = False
current_para = []
for line in lines:
line = line.rstrip()
# 检测是否进入参考文献区
if re.match(r'^#{1,3}\s*[一二三四五六七八九十参考文献]', line):
if '参考文献' in line:
in_refs = True
if current_para:
paragraphs.append('\n'.join(current_para))
current_para = []
continue
if in_refs:
continue
# 跳过标题、空行
if line.startswith('#') or not line.strip():
if current_para:
paragraphs.append('\n'.join(current_para))
current_para = []
continue
current_para.append(line)
if current_para:
paragraphs.append('\n'.join(current_para))
return paragraphs
def split_sentences(paragraph):
    """Split a paragraph into sentences at Chinese or ASCII terminators (。!?!?)."""
    # The lookbehind splits after the terminator, so the punctuation stays with its sentence
    sentences = re.split(r'(?<=[。!?!?])', paragraph)
    return [s.strip() for s in sentences if s.strip()]


def find_citations(sentence):
    """Extract every reference number cited in a sentence."""
    # Matches [1], [1-3], [1,2,3], [46]; accepts ASCII or full-width commas and optional spaces
    pattern = r'\[(\d+(?:\s*[,,-]\s*\d+)*)\]'
    cited = set()
    for m in re.findall(pattern, sentence):
        for part in re.split(r'[,,]', m):
            if '-' in part:
                # Expand a range such as 1-3 into {1, 2, 3}
                a, b = part.split('-', 1)
                try:
                    cited.update(range(int(a), int(b) + 1))
                except ValueError:
                    pass  # malformed range such as 1-2-3; skip it
            else:
                cited.add(int(part))
    return cited
def analyze_paragraph_citations(paragraph, para_idx):
"""分析一个段落中的引用重复模式"""
sentences = split_sentences(paragraph)
if len(sentences) < 2:
return []
issues = []
# 统计每个引用在哪些句子中出现
citation_positions = {} # ref_num -> [sentence_indices]
for i, sent in enumerate(sentences):
cited = find_citations(sent)
for ref in cited:
if ref not in citation_positions:
citation_positions[ref] = []
citation_positions[ref].append(i)
# 找出在多个连续句子中重复的引用
for ref, positions in citation_positions.items():
if len(positions) < 2:
continue
# 检查是否有连续位置
consecutive_groups = []
current_group = [positions[0]]
for i in range(1, len(positions)):
if positions[i] == positions[i-1] + 1:
current_group.append(positions[i])
else:
if len(current_group) >= 2:
consecutive_groups.append(current_group)
current_group = [positions[i]]
if len(current_group) >= 2:
consecutive_groups.append(current_group)
for group in consecutive_groups:
n = len(group)
severity = "严重" if n >= 3 else "警告"
# 提取相关句子
relevant_sents = [sentences[idx] for idx in group]
issue = {
'para_idx': para_idx,
'ref': ref,
'positions': group,
'count': n,
'severity': severity,
'sentences': relevant_sents
}
issues.append(issue)
return issues
def main():
if len(sys.argv) < 2:
print("用法: python analyze_citation_pattern.py <your_review.md>")
sys.exit(1)
input_file = sys.argv[1]
try:
with open(input_file, 'r', encoding='utf-8') as f:
text = f.read()
except FileNotFoundError:
print(f"文件未找到: {input_file}")
sys.exit(1)
paragraphs = extract_paragraphs(text)
print("=" * 70)
print("专项分析:整段引用被拆分逐句标注的问题")
print("=" * 70)
print(f"共分析 {len(paragraphs)} 个段落\n")
all_issues = []
for para_idx, para in enumerate(paragraphs):
issues = analyze_paragraph_citations(para, para_idx)
all_issues.extend(issues)
if not all_issues:
print("未发现整段引用被拆分逐句标注的问题!")
else:
print(f"共发现 {len(all_issues)} 个问题:\n")
# 按严重程度排序
serious = [i for i in all_issues if i['severity'] == '严重']
warnings = [i for i in all_issues if i['severity'] == '警告']
print(f" 严重(连续3句以上): {len(serious)} 个")
print(f" 警告(连续2句): {len(warnings)} 个")
print()
for idx, issue in enumerate(all_issues, 1):
print(f"【问题 {idx}】{issue['severity']} - 参考文献 [{issue['ref']}]")
print(f" 连续 {issue['count']} 个句子中重复引用")
print(f" 句子位置: {[i+1 for i in issue['positions']]}(段落中第几句)")
print(f" 原文句子:")
for s_idx, sent in enumerate(issue['sentences']):
# 截断过长的句子
display = sent[:100] + '...' if len(sent) > 100 else sent
print(f" [{s_idx+1}] {display}")
print()
# 生成修复建议
print("=" * 70)
print("修复建议")
print("=" * 70)
print()
print("对于[连续多个句子引用同一文献]的情况,学术规范建议:")
print(" 1. 如果整段内容均来自同一文献,只需在段末最后一句保留引用")
print(" 2. 如果段落首句引入文献,之后均是对该文献内容的描述,")
print(" 可在首句保留引用,后续句子中删除重复引用")
print(" 3. 避免每句话都重复标注同一引用编号")
print()
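# Save the report; build the name from the stem so an input without .md is never overwritten
import os
output_file = os.path.splitext(input_file)[0] + '_citation_pattern_report.md'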
with open(output_file, 'w', encoding='utf-8') as f:
f.write("# 引用模式分析报告\n\n")
f.write(f"**分析文件**: {input_file}\n\n")
f.write(f"**发现问题总数**: {len(all_issues)}\n\n")
if all_issues:
f.write("## 问题详情\n\n")
for idx, issue in enumerate(all_issues, 1):
f.write(f"### 问题 {idx}:{issue['severity']} - 参考文献 [{issue['ref']}]\n\n")
f.write(f"- **连续句子数**: {issue['count']}\n")
f.write(f"- **句子位置**: {[i+1 for i in issue['positions']]}(段落中第几句)\n\n")
f.write("**原文句子**:\n\n")
for s_idx, sent in enumerate(issue['sentences']):
f.write(f"> {s_idx+1}. {sent}\n\n")
f.write("**修复建议**: 删除中间句子中的重复引用,只保留首句或末句的引用\n\n")
f.write("---\n\n")
f.write("## 修复原则\n\n")
f.write("1. 整段内容来自同一文献 → 只在段末保留一次引用\n")
f.write("2. 段落首句引入文献 → 首句保留,后续删除重复\n")
f.write("3. 避免每句重复标注同一引用编号\n")
print(f"报告已保存至: {output_file}")
return len([i for i in all_issues if i['severity'] == '严重'])
if __name__ == '__main__':
exit_code = main()
sys.exit(exit_code)
FILE:scripts/check_duplicate_citations.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
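Check a document for repeated citations within the same paragraph.
Function: scan the whole document and report citation groups that appear more than once in one paragraph.
Usage:
python check_duplicate_citations.py <input_file.md>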
"""
import re
import sys
def normalize_citation(citation_text):
"""标准化引用组合,用于比较"""
numbers = re.findall(r'\d+', citation_text)
return tuple(sorted(numbers))
def find_paragraphs(text):
"""将文本分割成段落"""
paragraphs = re.split(r'\n\s*\n', text)
return [p.strip() for p in paragraphs if p.strip()]
def check_paragraph(paragraph, para_index):
"""检查单个段落中的重复引用"""
citation_pattern = re.compile(r'\[(\d+(?:,\d+)*(?:-\d+)?)\]')
citations = list(citation_pattern.finditer(paragraph))
if not citations:
return []
# 统计每个引用组合出现的次数和位置
citation_stats = {}
for match in citations:
citation_key = normalize_citation(match.group(1))
original_text = match.group(0)
if citation_key not in citation_stats:
citation_stats[citation_key] = {
'count': 0,
'original': original_text,
'positions': []
}
citation_stats[citation_key]['count'] += 1
citation_stats[citation_key]['positions'].append((match.start(), match.end()))
# 找出重复的引用
duplicates = []
for key, stats in citation_stats.items():
if stats['count'] > 1:
duplicates.append({
'citation': stats['original'],
'count': stats['count'],
'positions': stats['positions']
})
return duplicates
def main():
if len(sys.argv) < 2:
print("用法: python check_duplicate_citations.py <输入文件.md>")
sys.exit(1)
input_file = sys.argv[1]
try:
with open(input_file, 'r', encoding='utf-8') as f:
content = f.read()
except FileNotFoundError:
print(f"错误: 找不到文件 '{input_file}'")
sys.exit(1)
except Exception as e:
print(f"错误: 读取文件失败 - {e}")
sys.exit(1)
print(f"正在检查: {input_file}")
print("=" * 60)
# 分离正文和参考文献部分
ref_section_match = re.search(r'\n## 参考文献\s*\n', content)
if ref_section_match:
main_text = content[:ref_section_match.start()]
else:
main_text = content
# 按段落分割
paragraphs = find_paragraphs(main_text)
# 检查每个段落
total_duplicates = 0
paragraphs_with_duplicates = 0
for i, para in enumerate(paragraphs):
duplicates = check_paragraph(para, i)
if duplicates:
paragraphs_with_duplicates += 1
total_duplicates += len(duplicates)
print(f"\n[段落 {i+1}] 发现重复引用:")
# 显示段落前100字符作为上下文
context = para[:100].replace('\n', ' ')
if len(para) > 100:
context += "..."
print(f" 上下文: {context}")
for dup in duplicates:
print(f" - 引用 {dup['citation']} 出现 {dup['count']} 次")
# 提取每个位置的上下文
for start, end in dup['positions']:
# 获取引用前后的文本
before = para[max(0, start-30):start]
after = para[end:min(len(para), end+30)]
print(f" 位置: ...{before}{dup['citation']}{after}...")
print("\n" + "=" * 60)
print("[检查统计]")
print(f" 总段落数: {len(paragraphs)}")
print(f" 有重复引用的段落: {paragraphs_with_duplicates}")
print(f" 重复引用组合数: {total_duplicates}")
if total_duplicates == 0:
print("\n[OK] 未发现同一段落中的重复引用问题!")
else:
print(f"\n[WARNING] 发现 {total_duplicates} 处重复引用需要处理")
print(" 建议运行: python merge_duplicate_citations_in_paragraphs.py")
if __name__ == '__main__':
main()
FILE:scripts/create_word_doc_v3.js
const { Document, Packer, Paragraph, TextRun, PageBreak, AlignmentType, HeadingLevel, convertInchesToTwip } = require('docx');
const fs = require('fs');
const path = require('path');
// 读取markdown文件
const inputFile = process.argv[2];
if (!inputFile) {
console.error('用法: node create_word_doc_v3.js <your_review.md>');
process.exit(1);
}
const markdown = fs.readFileSync(inputFile, 'utf-8');
// 按行分割
const lines = markdown.split('\n');
const children = [];
let currentSection = '';
for (let i = 0; i < lines.length; i++) {
let line = lines[i].trim();
if (!line) {
// 空行
continue;
}
// Level-1 heading (#)
if (line.startsWith('# ')) {
const text = line.replace('# ', '');
children.push(new Paragraph({
children: [new TextRun({
text: text,
font: '黑体',
size: 32, // 小三 (16pt)
bold: true
})],
alignment: AlignmentType.CENTER,
spacing: {
before: 240,
after: 240
}
}));
}
// 二级标题 (##)
else if (line.startsWith('## ')) {
const text = line.replace('## ', '');
children.push(new Paragraph({
children: [new TextRun({
text: text,
font: '黑体',
size: 28, // 四号 (14pt)
bold: true
})],
spacing: {
before: 160,
after: 120
}
}));
}
// 参考文献部分标题
else if (line === '## 参考文献' || line === '### 参考文献') {
children.push(new Paragraph({
children: [new TextRun({
text: '参考文献',
font: '黑体',
size: 28,
bold: true
})],
spacing: {
before: 160,
after: 120
}
}));
}
// 参考文献
else if (line.startsWith('[1]') || (line.startsWith('[') && line.includes(']'))) {
children.push(new Paragraph({
children: [new TextRun({
text: line.trim(),
font: 'Times New Roman',
size: 21 // 五号 (10.5pt); docx sizes are in half-points, so 10.5pt = 21
})],
spacing: {
line: 300, // 1.25x line spacing (240 = single spacing)
before: 60,
after: 60
},
indent: {
firstLine: 0
}
}));
}
// 关键词行
else if (line.startsWith('关键词:') || line.startsWith('关键词:')) {
children.push(new Paragraph({
children: [new TextRun({
text: line,
font: '宋体',
size: 24, // 小四 (12pt)
bold: true
})],
spacing: {
line: 360, // 1.5倍行距
before: 120,
after: 120
},
indent: {
firstLine: convertInchesToTwip(0.3) // 首行缩进约2字符
}
}));
}
// 摘要行
else if (line.startsWith('摘要:') || line.startsWith('摘要:')) {
children.push(new Paragraph({
children: [new TextRun({
text: line,
font: '宋体',
size: 24 // 小四 (12pt)
})],
spacing: {
line: 360, // 1.5倍行距
before: 120,
after: 120
}
}));
}
// 正文段落
else {
children.push(new Paragraph({
children: [new TextRun({
text: line,
font: '宋体',
size: 24 // 小四 (12pt)
})],
spacing: {
line: 360, // 1.5倍行距
before: 60,
after: 60
},
indent: {
firstLine: convertInchesToTwip(0.3) // 首行缩进约2字符
}
}));
}
}
// 创建文档
const doc = new Document({
sections: [{
properties: {
page: {
margin: {
top: convertInchesToTwip(1), // 2.54cm (1英寸)
right: convertInchesToTwip(1), // 2.54cm (1英寸)
bottom: convertInchesToTwip(1), // 2.54cm (1英寸)
left: convertInchesToTwip(1) // 2.54cm (1英寸)
},
size: {
width: convertInchesToTwip(8.27), // A4 宽度
height: convertInchesToTwip(11.69) // A4 高度
}
}
},
children: children
}]
});
// 保存文档
const outputFile = path.basename(inputFile, path.extname(inputFile)) + '_筛选后.docx';
Packer.toBuffer(doc).then((buffer) => {
fs.writeFileSync(outputFile, buffer);
console.log(`Word文档已生成: ${outputFile}`);
}).catch((err) => {
console.error('生成Word文档时出错:', err);
});
FILE:scripts/create_word_with_superscript.js
const { Document, Packer, Paragraph, TextRun, AlignmentType, convertInchesToTwip } = require('docx');
const fs = require('fs');
const path = require('path');
// 获取输入文件路径
const inputFile = process.argv[2] || 'your_review.md';
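// Build the output name from the stem so an input without .md is never overwritten
const outputFile = path.join(path.dirname(inputFile), path.basename(inputFile, path.extname(inputFile)) + '_上标版.docx');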
const outputPath = path.resolve(outputFile);
// 读取markdown文件
const markdown = fs.readFileSync(inputFile, 'utf-8');
// 按行分割
const lines = markdown.split('\n');
const children = [];
let inReferencesSection = false;
// 字号定义
const FONT_SIZE_BODY = 24; // 小四 = 12pt = 24 half-points(正文和参考文献)
const FONT_SIZE_SMALL = 20; // 上标引用(略小)
const FONT_SIZE_H1 = 32; // 一级标题(小三 = 16pt)
const FONT_SIZE_H2 = 28; // 二级标题(四号 = 14pt)
// 判断字符是否为中文
function isChinese(char) {
return /[\u4e00-\u9fa5]/.test(char);
}
// 匹配物种拉丁文名
const SCIENTIFIC_NAME_REGEX = /[A-Z][a-z]+\s+[a-z]+/g;
// 将文本分割为不同片段(处理中文、英文、拉丁文名)
function parseTextWithFormatting(text) {
const runs = [];
// 匹配引用标记
const citationRegex = /\[(\d+(?:-\d+)?(?:,\s*\d+(?:-\d+)?)*)\]/g;
// 找出所有特殊标记的位置
const markers = [];
// 查找引用标记
let match;
while ((match = citationRegex.exec(text)) !== null) {
markers.push({
start: match.index,
end: match.index + match[0].length,
type: 'citation',
text: match[0]
});
}
// 查找拉丁文名(但不在引用标记内)
SCIENTIFIC_NAME_REGEX.lastIndex = 0;
while ((match = SCIENTIFIC_NAME_REGEX.exec(text)) !== null) {
// 检查是否与引用标记重叠
const isOverlapping = markers.some(m =>
(match.index >= m.start && match.index < m.end) ||
(match.index + match[0].length > m.start && match.index + match[0].length <= m.end)
);
if (!isOverlapping) {
markers.push({
start: match.index,
end: match.index + match[0].length,
type: 'latin',
text: match[0]
});
}
}
// 按位置排序
markers.sort((a, b) => a.start - b.start);
// 处理文本
let lastIndex = 0;
for (const marker of markers) {
// 处理标记前的普通文本
if (marker.start > lastIndex) {
const normalText = text.substring(lastIndex, marker.start);
const segments = splitByLanguage(normalText);
segments.forEach(segment => {
runs.push(new TextRun({
text: segment.text,
font: segment.isChinese ? '宋体' : 'Times New Roman',
size: FONT_SIZE_BODY // 正文小四
}));
});
}
// 处理标记
if (marker.type === 'citation') {
runs.push(new TextRun({
text: marker.text,
font: 'Times New Roman',
size: FONT_SIZE_SMALL,
superScript: true
}));
} else if (marker.type === 'latin') {
runs.push(new TextRun({
text: marker.text,
font: 'Times New Roman',
size: FONT_SIZE_BODY, // 正文小四
italics: true
}));
}
lastIndex = marker.end;
}
// 处理剩余文本
if (lastIndex < text.length) {
const normalText = text.substring(lastIndex);
const segments = splitByLanguage(normalText);
segments.forEach(segment => {
runs.push(new TextRun({
text: segment.text,
font: segment.isChinese ? '宋体' : 'Times New Roman',
size: FONT_SIZE_BODY // 正文小四
}));
});
}
return runs.length > 0 ? runs : [new TextRun({
text: text,
font: '宋体',
size: FONT_SIZE_BODY
})];
}
// 将文本按语言分割(中文 vs 非中文)
function splitByLanguage(text) {
const segments = [];
let currentSegment = '';
let currentIsChinese = null;
for (let i = 0; i < text.length; i++) {
const char = text[i];
const charIsChinese = isChinese(char);
if (currentIsChinese === null) {
currentIsChinese = charIsChinese;
currentSegment = char;
} else if (currentIsChinese === charIsChinese) {
currentSegment += char;
} else {
segments.push({
text: currentSegment,
isChinese: currentIsChinese
});
currentIsChinese = charIsChinese;
currentSegment = char;
}
}
if (currentSegment) {
segments.push({
text: currentSegment,
isChinese: currentIsChinese
});
}
return segments;
}
for (let i = 0; i < lines.length; i++) {
let line = lines[i].trim();
if (!line) {
continue;
}
// 检测参考文献部分
if (line === '## 参考文献' || line === '### 参考文献') {
inReferencesSection = true;
children.push(new Paragraph({
children: [new TextRun({
text: '参考文献',
font: '黑体',
size: FONT_SIZE_H2, // 二级标题字号
bold: true
})],
spacing: {
before: 160,
after: 120
}
}));
continue;
}
// 一级标题
if (line.startsWith('# ') && !line.startsWith('## ')) {
const text = line.replace('# ', '');
children.push(new Paragraph({
children: [new TextRun({
text: text,
font: '黑体',
size: FONT_SIZE_H1, // 一级标题字号
bold: true
})],
alignment: AlignmentType.CENTER,
spacing: {
before: 240,
after: 240
}
}));
}
// 二级标题
else if (line.startsWith('## ')) {
const text = line.replace('## ', '');
children.push(new Paragraph({
children: [new TextRun({
text: text,
font: '黑体',
size: FONT_SIZE_H2, // 二级标题字号
bold: true
})],
spacing: {
before: 160,
after: 120
}
}));
}
// 参考文献条目(小四字号)
else if (inReferencesSection && line.startsWith('[') && /^\[\d+\]/.test(line)) {
children.push(new Paragraph({
children: [new TextRun({
text: line.trim(),
font: 'Times New Roman',
size: FONT_SIZE_BODY // 参考文献小四
})],
spacing: {
line: 280,
before: 60,
after: 60
},
indent: {
firstLine: 0
}
}));
}
// 关键词行
else if (line.startsWith('关键词:') || line.startsWith('关键词:')) {
const runs = parseTextWithFormatting(line);
runs.forEach(run => {
if (run.font) {
run.bold = true;
}
});
children.push(new Paragraph({
children: runs,
spacing: {
line: 360,
before: 120,
after: 120
},
indent: {
firstLine: convertInchesToTwip(0.3)
}
}));
}
// 摘要行
else if (line.startsWith('摘要:') || line.startsWith('摘要:')) {
const runs = parseTextWithFormatting(line);
children.push(new Paragraph({
children: runs,
spacing: {
line: 360,
before: 120,
after: 120
}
}));
}
// 正文段落(小四字号)
else {
const runs = parseTextWithFormatting(line);
children.push(new Paragraph({
children: runs,
spacing: {
line: 360,
before: 60,
after: 60
},
indent: {
firstLine: convertInchesToTwip(0.3)
}
}));
}
}
// 创建文档
const doc = new Document({
sections: [{
properties: {
page: {
margin: {
top: convertInchesToTwip(1),
right: convertInchesToTwip(1),
bottom: convertInchesToTwip(1),
left: convertInchesToTwip(1)
},
size: {
width: convertInchesToTwip(8.27),
height: convertInchesToTwip(11.69)
}
}
},
children: children
}]
});
// 保存文档
Packer.toBuffer(doc).then((buffer) => {
fs.writeFileSync(outputPath, buffer);
console.log('='.repeat(60));
console.log('✅ Word文档生成成功!');
console.log('='.repeat(60));
console.log(`📄 文件路径: ${outputPath}`);
console.log('');
console.log('📋 格式说明:');
console.log(' • 正文和参考文献: 小四 (12pt)');
console.log(' • 中文字体: 宋体');
console.log(' • 英文/数字/标点: Times New Roman');
console.log(' • 物种拉丁文名: Times New Roman 斜体');
console.log(' • 参考文献引用: 右上标 [n]');
console.log(' • 一级标题: 小三 (16pt) 黑体');
console.log(' • 二级标题: 四号 (14pt) 黑体');
console.log('='.repeat(60));
console.log('');
console.log('📝 正在打开文档...');
console.log('='.repeat(60));
// 自动打开文档
const { exec } = require('child_process');
const platform = process.platform;
let command;
if (platform === 'win32') {
command = `start "" "${outputPath}"`;
} else if (platform === 'darwin') {
command = `open "${outputPath}"`;
} else {
command = `xdg-open "${outputPath}"`;
}
exec(command, (error) => {
if (error) {
console.log('⚠️ 自动打开失败,请手动打开文档');
} else {
console.log('✅ 文档已打开');
}
});
}).catch((err) => {
console.error('❌ 生成Word文档时出错:', err);
});
FILE:scripts/document_format_checker.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
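Document Format Checker Script
Features:
1. Check that the document structure is complete
2. Check that chapter numbering follows the convention
3. Check that the heading hierarchy follows the convention
4. Generate a check report
Usage: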
python document_format_checker.py <input_file>
"""
import re
import sys
import os
from datetime import datetime
from typing import List, Dict, Optional
class DocumentFormatChecker:
"""文档格式规范检查器"""
def __init__(self):
"""初始化检查器"""
self.content = ""
self.issues = []
def load_content(self, file_path: str) -> None:
"""加载文档内容"""
with open(file_path, 'r', encoding='utf-8') as f:
self.content = f.read()
print(f"[加载] 文档: {file_path}")
print(f"[信息] 文档长度: {len(self.content)} 字符")
def check_main_title(self) -> List[Dict]:
"""检查主标题"""
issues = []
print(" [检查] 主标题...")
# 检查是否有主标题(一级标题 #)
main_titles = re.findall(r'^#\s+(.+)$', self.content, re.MULTILINE)
if not main_titles:
issues.append({
'type': 'missing_title',
'message': '文档缺少主标题(一级标题 # 标题)'
})
elif len(main_titles) > 1:
issues.append({
'type': 'multiple_titles',
'message': f'文档有 {len(main_titles)} 个主标题,建议只有一个',
'titles': main_titles[:3] # 只显示前3个
})
else:
print(f" [信息] 主标题: {main_titles[0]}")
return issues
def check_abstract(self) -> List[Dict]:
"""检查摘要"""
issues = []
print(" [检查] 摘要...")
# 检查是否有摘要
if '摘要' not in self.content and 'Abstract' not in self.content:
issues.append({
'type': 'missing_abstract',
'message': '文档缺少摘要部分'
})
else:
# 检查摘要位置(应该在前部)
lines = self.content.split('\n')
abstract_line = -1
for i, line in enumerate(lines):
if '摘要' in line or 'Abstract' in line:
abstract_line = i
break
if abstract_line > 50: # 摘要在50行之后
issues.append({
'type': 'abstract_position',
'message': f'摘要位于第 {abstract_line + 1} 行,建议放在文档开头'
})
else:
print(f" [信息] 摘要位置: 第 {abstract_line + 1} 行")
# 检查关键词
if '关键词' not in self.content and 'Key words' not in self.content:
issues.append({
'type': 'missing_keywords',
'message': '文档缺少关键词部分'
})
return issues
def check_section_structure(self) -> List[Dict]:
"""检查章节结构"""
issues = []
print(" [检查] 章节结构...")
# 检查是否有二级标题
second_level_titles = re.findall(r'^##\s+(.+)$', self.content, re.MULTILINE)
if not second_level_titles:
issues.append({
'type': 'missing_sections',
'message': '文档缺少二级标题(## 标题)'
})
else:
print(f" [信息] 二级标题数量: {len(second_level_titles)}")
# 检查是否有三级标题
third_level_titles = re.findall(r'^###\s+(.+)$', self.content, re.MULTILINE)
print(f" [信息] 三级标题数量: {len(third_level_titles)}")
return issues
def check_chapter_numbering(self) -> List[Dict]:
"""检查章节编号(中文编号)"""
issues = []
print(" [检查] 章节编号...")
# 检查中文章节编号
chapter_pattern = r'^##\s*([一二三四五六七八九十]+)\、(.+)$'
chapters = re.findall(chapter_pattern, self.content, re.MULTILINE)
if chapters:
print(f" [信息] 中文章节: {len(chapters)} 个")
# 检查编号连续性
chapter_nums = [ch[0] for ch in chapters]
expected_chapters = ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十',
'十一', '十二', '十三', '十四', '十五', '十六', '十七', '十八']
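# Find gaps only up to the highest chapter number actually present
present_set = set(chapter_nums)
last_index = max((expected_chapters.index(ch) for ch in present_set if ch in expected_chapters), default=-1)
missing_chapters = [ch for ch in expected_chapters[:last_index + 1] if ch not in present_set]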
if missing_chapters:
issues.append({
'type': 'missing_chapters',
'message': f'章节编号不连续,缺失: {", ".join(missing_chapters)}',
'missing': missing_chapters
})
# 检查重复章节
chapter_counts = {}
for num in chapter_nums:
chapter_counts[num] = chapter_counts.get(num, 0) + 1
duplicate_chapters = {num: count for num, count in chapter_counts.items() if count > 1}
if duplicate_chapters:
issues.append({
'type': 'duplicate_chapters',
'message': f'发现重复的章节编号: {duplicate_chapters}'
})
else:
print(" [信息] 未使用中文章节编号")
return issues
def check_reference_section(self) -> List[Dict]:
"""检查参考文献部分"""
issues = []
print(" [检查] 参考文献部分...")
# 检查是否有参考文献部分
if '## 参考文献' not in self.content:  # substring test, so '### 参考文献' is accepted too
issues.append({
'type': 'missing_references',
'message': '文档缺少参考文献部分'
})
else:
# 检查参考文献格式
ref_section = re.search(r'## 参考文献\n([\s\S]+)$', self.content)
if ref_section:
ref_content = ref_section.group(1)
ref_lines = [line.strip() for line in ref_content.split('\n') if line.strip()]
# 检查文献数量
ref_pattern = r'^\[\d+\]\s+'
valid_refs = [line for line in ref_lines if re.match(ref_pattern, line)]
print(f" [信息] 参考文献数量: {len(valid_refs)}")
if len(valid_refs) < 5:
issues.append({
'type': 'too_few_references',
'message': f'参考文献数量较少({len(valid_refs)} 篇),建议增加文献引用'
})
else:
issues.append({
'type': 'invalid_reference_format',
'message': '参考文献部分格式不规范'
})
return issues
def check_paragraph_format(self) -> List[Dict]:
"""检查段落格式"""
issues = []
print(" [检查] 段落格式...")
lines = self.content.split('\n')
# 检查空行(段落之间应该有空行)
for i in range(len(lines) - 1):
current = lines[i].strip()
next_line = lines[i + 1].strip()
# 如果两行都不是标题,且都没有空行分隔,可能需要添加空行
if current and next_line and not current.startswith('#') and not next_line.startswith('#'):
# 检查是否是标题后的第一段(不需要空行)
if i > 0 and not lines[i - 1].strip().startswith('#'):
# 如果当前行以句号结尾,下一行不是标题,建议添加空行
if current.endswith('。') and not next_line.startswith('#'):
# 这只是建议,不是错误
pass
# 检查段落长度(过长的段落)
for i, line in enumerate(lines):
if line.strip() and not line.strip().startswith('#'):
# 计算段落长度(包含后续行直到空行或标题)
para_length = len(line.strip())
j = i + 1
while j < len(lines) and lines[j].strip() and not lines[j].strip().startswith('#'):
para_length += len(lines[j].strip())
j += 1
# 如果段落超过500字,建议分段
if para_length > 500:
issues.append({
'type': 'long_paragraph',
'message': f'第 {i + 1} 行的段落过长(约{para_length}字),建议适当分段',
'length': para_length
})
return issues
def check_heading_hierarchy(self) -> List[Dict]:
"""检查标题层级"""
issues = []
print(" [检查] 标题层级...")
lines = self.content.split('\n')
# 记录标题层级
previous_level = 0
for i, line in enumerate(lines):
if line.startswith('#'):
# 提取标题级别
level = 0
while level < len(line) and line[level] == '#':
level += 1
# 检查标题层级是否跳跃
if previous_level > 0 and level > previous_level + 1:
issues.append({
'type': 'heading_jump',
'message': f'第 {i + 1} 行: 标题层级从 {previous_level} 跳跃到 {level},建议逐级递进',
'from': previous_level,
'to': level
})
previous_level = level
return issues
def check_list_format(self) -> List[Dict]:
"""检查列表格式"""
issues = []
print(" [检查] 列表格式...")
# 检查无序列表
unordered_lists = re.findall(r'^[-*+]\s+', self.content, re.MULTILINE)
print(f" [信息] 无序列表项: {len(unordered_lists)}")
# 检查有序列表
ordered_lists = re.findall(r'^\d+\.\s+', self.content, re.MULTILINE)
print(f" [信息] 有序列表项: {len(ordered_lists)}")
# 检查列表缩进一致性(简单检查)
lines = self.content.split('\n')
in_list = False
list_indent = None
for i, line in enumerate(lines):
if re.match(r'^\s*[-*+]\s+', line) or re.match(r'^\s*\d+\.\s+', line):
current_indent = len(line) - len(line.lstrip())
if not in_list:
in_list = True
list_indent = current_indent
elif list_indent != current_indent:
issues.append({
'type': 'list_indent_inconsistent',
'message': f'第 {i + 1} 行: 列表缩进不一致',
'line': line.strip()
})
else:
if line.strip():
in_list = False
list_indent = None
return issues
def check_document_structure(self) -> List[Dict]:
"""检查文档整体结构"""
issues = []
print(" [检查] 文档整体结构...")
# 检查是否包含主要部分
required_sections = ['摘要', '关键词', '参考文献']
missing_sections = []
for section in required_sections:
if section not in self.content:
missing_sections.append(section)
if missing_sections:
issues.append({
'type': 'missing_sections',
'message': f'文档缺少必要部分: {", ".join(missing_sections)}',
'sections': missing_sections
})
# 检查文档长度
char_count = len(self.content)
word_count = len(self.content.replace('\n', '').replace(' ', ''))
print(f" [信息] 字符数: {char_count}")
print(f" [信息] 中文字数: {word_count}")
if word_count < 1000:
issues.append({
'type': 'document_too_short',
'message': f'文档较短({word_count}字),建议增加内容'
})
return issues
def run_all_checks(self) -> List[Dict]:
"""运行所有检查"""
print("\n[开始] 文档格式规范检查...")
print("="*60)
# 1. 检查主标题
self.issues.extend(self.check_main_title())
# 2. 检查摘要
self.issues.extend(self.check_abstract())
# 3. 检查章节结构
self.issues.extend(self.check_section_structure())
# 4. 检查章节编号
self.issues.extend(self.check_chapter_numbering())
# 5. 检查参考文献部分
self.issues.extend(self.check_reference_section())
# 6. 检查段落格式
self.issues.extend(self.check_paragraph_format())
# 7. 检查标题层级
self.issues.extend(self.check_heading_hierarchy())
# 8. 检查列表格式
self.issues.extend(self.check_list_format())
# 9. 检查文档整体结构
self.issues.extend(self.check_document_structure())
print("="*60)
print(f"\n[完成] 共发现 {len(self.issues)} 个问题\n")
return self.issues
def generate_report(self) -> str:
"""生成检查报告"""
report = []
report.append("# 文档格式规范检查报告\n")
report.append(f"生成时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
# 统计问题类型
issue_types = {}
for issue in self.issues:
issue_type = issue['type']
issue_types[issue_type] = issue_types.get(issue_type, 0) + 1
report.append("## 问题统计\n")
type_names = {
'missing_title': '缺少主标题',
'multiple_titles': '多个主标题',
'missing_abstract': '缺少摘要',
'abstract_position': '摘要位置',
'missing_keywords': '缺少关键词',
'missing_sections': '缺少章节',
'missing_chapters': '缺失章节',
'duplicate_chapters': '重复章节',
'missing_references': '缺少参考文献',
'too_few_references': '参考文献过少',
'invalid_reference_format': '参考文献格式不规范',
'long_paragraph': '段落过长',
'heading_jump': '标题层级跳跃',
'list_indent_inconsistent': '列表缩进不一致',
'document_too_short': '文档过短'
}
for issue_type, count in issue_types.items():
name = type_names.get(issue_type, issue_type)
report.append(f"- {name}: {count} 个\n")
report.append("\n## 详细问题列表\n")
# 按类型分组显示问题
grouped_issues = {}
for issue in self.issues:
issue_type = issue['type']
if issue_type not in grouped_issues:
grouped_issues[issue_type] = []
grouped_issues[issue_type].append(issue)
type_order = ['missing_title', 'multiple_titles', 'missing_abstract', 'abstract_position',
'missing_keywords', 'missing_sections', 'missing_chapters', 'duplicate_chapters',
'missing_references', 'too_few_references', 'invalid_reference_format',
'long_paragraph', 'heading_jump', 'list_indent_inconsistent', 'document_too_short']
        # Section heading for each issue type; several types share one heading.
        # Kept under its own name so it no longer shadows the type_names dict above.
        section_headings = {
            'missing_title': '### 主标题问题',
            'multiple_titles': '### 主标题问题',
            'missing_abstract': '### 摘要问题',
            'abstract_position': '### 摘要问题',
            'missing_keywords': '### 关键词问题',
            'missing_sections': '### 章节结构问题',
            'missing_chapters': '### 章节编号问题',
            'duplicate_chapters': '### 章节编号问题',
            'missing_references': '### 参考文献问题',
            'too_few_references': '### 参考文献问题',
            'invalid_reference_format': '### 参考文献问题',
            'long_paragraph': '### 段落格式问题',
            'heading_jump': '### 标题层级问题',
            'list_indent_inconsistent': '### 列表格式问题',
            'document_too_short': '### 文档结构问题'
        }
        has_issues = False
        last_heading = None
        for issue_type in type_order:
            if issue_type in grouped_issues:
                has_issues = True
                heading = section_headings.get(issue_type, f'### {issue_type}')
                # Emit a shared heading only once
                if heading != last_heading:
                    report.append(f"{heading}\n\n")
                    last_heading = heading
                for i, issue in enumerate(grouped_issues[issue_type], 1):
                    report.append(f"{i}. {issue['message']}\n")
                    if 'titles' in issue:
                        report.append(f"   - 标题列表: {issue['titles']}\n")
                    if 'missing' in issue:
                        report.append(f"   - 缺失章节: {issue['missing']}\n")
                    if 'length' in issue:
                        report.append(f"   - 段落长度: {issue['length']} 字\n")
                    if 'from' in issue and 'to' in issue:
                        report.append(f"   - 层级: {issue['from']} → {issue['to']}\n")
                    if 'line' in issue:
                        report.append(f"   - 内容: {issue['line']}\n")
                report.append("\n")
# 检查结果总结
if not has_issues:
report.append("\n## 检查结果\n")
report.append("✅ 文档格式规范符合要求!\n")
else:
report.append("\n## 检查结果\n")
report.append(f"⚠️ 发现 {len(self.issues)} 个问题,建议进行修正。\n")
return ''.join(report)
def save_report(self, report_path: str) -> None:
"""保存检查报告"""
report = self.generate_report()
with open(report_path, 'w', encoding='utf-8') as f:
f.write(report)
print(f"[报告] 已生成: {report_path}\n")
    def process(self, input_file: str, output_report: Optional[str] = None) -> None:
        """Run the full workflow: load the document, run the checks, write the report."""
        print("="*60)
        print("文档格式规范检查工具")
        print("="*60)
        # Load the document
        self.load_content(input_file)
        # Run all checks
        issues = self.run_all_checks()
        # Write the report. splitext strips only a real extension, whereas
        # rsplit('.') would also cut at a dot in a directory name.
        if output_report is None:
            output_report = os.path.splitext(input_file)[0] + '_document_format_report.md'
        self.save_report(output_report)
        # Print the summary
        print("="*60)
        print(f"检查完成! 共发现 {len(issues)} 个问题")
        print(f"报告文件: {output_report}")
        print("="*60)
def main():
"""主函数"""
if len(sys.argv) < 2:
print("使用方法:")
print(" python document_format_checker.py <input_file>")
print("\n示例:")
print(" python document_format_checker.py review.md")
sys.exit(1)
input_file = sys.argv[1]
if not os.path.exists(input_file):
print(f"[错误] 输入文件不存在: {input_file}")
sys.exit(1)
# 创建检查器并处理
checker = DocumentFormatChecker()
checker.process(input_file)
if __name__ == '__main__':
main()
FILE:scripts/extract_and_fix_references.py
# -*- coding: utf-8 -*-
"""
Remove references from non-core journals, renumber the remaining references,
and update the citations in the body text.
Usage: python extract_and_fix_references.py <your_review.md>
"""
import re
import sys
import os

if len(sys.argv) < 2:
    print("用法: python extract_and_fix_references.py <your_review.md>")
    sys.exit(1)
input_file = sys.argv[1]

# Read the original file
with open(input_file, 'r', encoding='utf-8') as f:
    content = f.read()

# Non-core journals whose references should be removed
journals_to_remove = [
    '森林与人类',
    '中南林业调查规划',
    '河南林业',
    '云南林业'
]

# Citation numbers that are always stripped from the body text. These values come
# from the manuscript this script was written for; adjust or empty the set as needed.
invalid_citations = {99, 107}

# Split the body text from the reference section ("## 参考文献" or "### 参考文献").
# Stop if the heading is missing: slicing with -1 would silently cut the last character.
heading_match = re.search(r'\n\n#{2,3} 参考文献', content)
if heading_match is None:
    print("[错误] 未找到参考文献章节")
    sys.exit(1)
main_text = content[:heading_match.start()]
refs_text = content[heading_match.end():]

# Extract the references, keeping the full text of each entry
refs = []
for line in refs_text.strip().split('\n'):
    match = re.match(r'\[(\d+)\]\s+(.*)', line.strip())
    if match:
        refs.append((int(match.group(1)), match.group(2).strip()))
print(f"提取到参考文献: {len(refs)} 条")

# Mark the references to delete
deleted_indices = set()
for i, (num, text) in enumerate(refs):
    if any(journal in text for journal in journals_to_remove):
        deleted_indices.add(i)
        print(f"标记删除 [{num}]: {text[:60]}...")

# References to keep
kept_refs = [(num, text) for i, (num, text) in enumerate(refs) if i not in deleted_indices]
print(f"\n保留文献: {len(kept_refs)} 条")

# Renumber. The same old number may appear on several entries, so group by old
# number first; citations of that number map to the first surviving entry.
refs_by_old_num = {}
for old_num, text in kept_refs:
    refs_by_old_num.setdefault(old_num, []).append(text)

final_refs = []  # (new number, text)
old_to_new = {}  # old number -> new number of its first entry
new_num = 1
for old_num in sorted(refs_by_old_num):
    for text in refs_by_old_num[old_num]:
        final_refs.append((new_num, text))
        old_to_new.setdefault(old_num, new_num)
        new_num += 1
print(f"新编号范围: 1 - {new_num - 1}")

# Old numbers whose every entry was removed. Their citations are dropped as well;
# left in place, the stale number would point at a different reference after renumbering.
removed_nums = {num for num, _ in refs} - set(old_to_new)


# Rewrite the citations in the body text
def replace_citation(match):
    old_num = int(match.group(1))
    if old_num in invalid_citations or old_num in removed_nums:
        print(f"删除无效引用 [{old_num}]")
        return ''
    if old_num in old_to_new:
        return f"[{old_to_new[old_num]}]"
    # Unknown number: leave the marker unchanged
    return match.group(0)


new_main_text = re.sub(r'\[(\d+)\]', replace_citation, main_text)

# Build the new reference section
new_refs_section = "\n\n## 参考文献\n\n"
for new_num, text in final_refs:
    new_refs_section += f"[{new_num}] {text}\n\n"

# Combine and save
final_content = new_main_text + new_refs_section
base_name = os.path.splitext(input_file)[0]
output_file = f'{base_name}_最终修复.md'
with open(output_file, 'w', encoding='utf-8') as f:
    f.write(final_content)
print(f"\n完成! 输出文件: {output_file}")
FILE:scripts/extract_ref_format.py
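# -*- coding: utf-8 -*-
"""
Open a Word document (.docx) as a ZIP archive and print its text content.
Usage: python extract_ref_format.py [file.docx]  (defaults to 参考文献格式.docx)
"""
import sys
import zipfile
import xml.etree.ElementTree as ET

# Path of the document; the default is the file name the script was written for
docx_path = sys.argv[1] if len(sys.argv) > 1 else '参考文献格式.docx'

# A .docx file is a ZIP archive; the body lives in word/document.xml
with zipfile.ZipFile(docx_path, 'r') as docx:
    with docx.open('word/document.xml') as f:
        xml_content = f.read()

# Parse the XML
root = ET.fromstring(xml_content)

# Collect all text; Word stores text runs in w:t elements
ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
texts = [elem.text for elem in root.iter('{' + ns['w'] + '}t') if elem.text]

# Print the text
full_text = ''.join(texts)
print("=" * 80)
print("参考文献格式文档内容:")
print("=" * 80)
print(full_text)
print("=" * 80)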
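
# --- Illustrative sketch, not part of the original script ---
# The code above joins every w:t run into one string, so paragraph breaks are lost.
# A minimal variant that keeps one string per paragraph might look like the helper
# below. The name `docx_paragraphs` is hypothetical, and the sketch assumes plain
# w:p paragraphs (a paragraph nested in a text box would be reported twice).
import zipfile
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'


def docx_paragraphs(source):
    """Return the non-empty paragraphs of a .docx, given a path or a binary file object."""
    with zipfile.ZipFile(source, 'r') as docx:
        root = ET.fromstring(docx.read('word/document.xml'))
    paragraphs = []
    for para in root.iter(f'{{{W_NS}}}p'):
        # Concatenate the text runs that belong to this paragraph
        text = ''.join(t.text or '' for t in para.iter(f'{{{W_NS}}}t'))
        if text:
            paragraphs.append(text)
    return paragraphs

# Possible use: print('\n'.join(docx_paragraphs('参考文献格式.docx')))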
FILE:scripts/filter_references.py
# -*- coding: utf-8 -*-
"""
Filter the reference list: remove references from the listed non-core journals,
renumber the rest, and update the citation numbers in the body text.
Usage: python filter_references.py <your_review.md>
"""
import re
import sys
import os

if len(sys.argv) < 2:
    print("用法: python filter_references.py <your_review.md>")
    sys.exit(1)
input_file = sys.argv[1]

# Read the Markdown file
with open(input_file, 'r', encoding='utf-8') as f:
    content = f.read()

# Journals to remove
journals_to_remove = [
    '森林与人类',
    '中南林业调查规划',
    '河南林业',
    '云南林业'
]

# Locate the reference section; everything before it is body text
ref_section_start = content.find('\n\n## 参考文献')
if ref_section_start == -1:
    ref_section_start = content.find('\n\n### 参考文献')
if ref_section_start == -1:
    ref_section_start = len(content)
main_text = content[:ref_section_start]

# Extract entries from the reference section only. Searching the whole file
# would also pick up citations in the body text as if they were entries.
ref_pattern = r'\[(\d+)\]\s+(.*?)(?=\n\[|\n\n|$)'
matches = re.findall(ref_pattern, content[ref_section_start:], re.DOTALL)

# Kept references and the old-to-new number mapping
kept_refs = []
old_to_new_mapping = {}
for num, ref_text in matches:
    num = int(num)
    ref_text = ref_text.strip()
    # Does the entry mention a journal that should be removed?
    should_remove = False
    for journal in journals_to_remove:
        if journal in ref_text:
            print(f"删除 [{num}]: 包含期刊 '{journal}'")
            try:
                print(f"   内容: {ref_text[:80]}...")
            except UnicodeEncodeError:
                print("   内容: (包含特殊字符)")
            should_remove = True
            break
    if not should_remove:
        print(f"保留 [{num}]")
        kept_refs.append((num, ref_text))

# Map old numbers to new numbers
new_num = 1
for old_num, ref_text in kept_refs:
    old_to_new_mapping[old_num] = new_num
    new_num += 1
print(f"\n保留 {len(kept_refs)} 篇文献")
print(f"删除 {len(matches) - len(kept_refs)} 篇文献")

# Update the citation numbers in the body text
print("\n更新正文中的引用编号...")


def replace_reference(match):
    """Replace [old number] with [new number]; unknown numbers are left unchanged."""
    old_num = int(match.group(1))
    if old_num in old_to_new_mapping:
        return f"[{old_to_new_mapping[old_num]}]"
    return match.group(0)


new_main_text = re.sub(r'\[(\d+)\]', replace_reference, main_text)

# Count the replacements
replacement_count = 0
for old_num in old_to_new_mapping:
    count = main_text.count(f"[{old_num}]")
    if count > 0:
        replacement_count += count
        print(f"  [{old_num}] -> [{old_to_new_mapping[old_num]}] (出现{count}次)")
print(f"\n共替换了 {replacement_count} 处引用编号")

# Citations of a removed reference keep their old number, which may now collide
# with a renumbered entry; report them so they can be fixed by hand
removed_nums = {int(num) for num, _ in matches} - set(old_to_new_mapping)
for old_num in sorted(removed_nums):
    count = main_text.count(f"[{old_num}]")
    if count > 0:
        print(f"  [警告] 文献 [{old_num}] 已删除, 正文中仍有 {count} 处引用")

# Build the new reference section
print("\n生成新的参考文献列表...")
new_refs_section = "\n\n## 参考文献\n\n"
for old_num, ref_text in kept_refs:
    new_num = old_to_new_mapping[old_num]
    new_refs_section += f"[{new_num}] {ref_text}\n"

# Combine and write the new file
new_content = new_main_text + new_refs_section
base_name = os.path.splitext(input_file)[0]
output_filtered = f'{base_name}_筛选后.md'
with open(output_filtered, 'w', encoding='utf-8') as f:
    f.write(new_content)
print("\n[完成]!")
print(f"新文件: {output_filtered}")
print(f"参考文献数量: {len(kept_refs)} 篇")
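
# --- Illustrative sketch, not part of the original script ---
# The filter-and-renumber step above runs at module level, which makes it hard to
# test without a file. One way to isolate the logic is a pair of pure functions.
# The names `filter_and_renumber` and `renumber_citations` are hypothetical; the
# sketch assumes each old number appears on a single entry.
import re


def filter_and_renumber(refs, blocked_journals):
    """Drop entries that mention a blocked journal and renumber the rest from 1.

    refs is a list of (old_number, text); returns (kept entries, old -> new mapping).
    """
    kept = []
    mapping = {}
    for old_num, text in refs:
        if any(journal in text for journal in blocked_journals):
            continue
        new_num = len(kept) + 1
        mapping[old_num] = new_num
        kept.append((new_num, text))
    return kept, mapping


def renumber_citations(body, mapping):
    """Rewrite [n] markers through the mapping; unmapped numbers stay unchanged."""
    def swap(match):
        old_num = int(match.group(1))
        return f"[{mapping[old_num]}]" if old_num in mapping else match.group(0)
    return re.sub(r'\[(\d+)\]', swap, body)

# Note that a citation of a removed entry keeps its old number here as well, so it
# can end up sharing a number with a renumbered entry and should be reviewed.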
FILE:scripts/literature_integrity_checker.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
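Literature Integrity Checker v3.2.0

Features:
1. Verify that each reference is traceable (optional Crossref lookup with --deep-check)
2. Check that the metadata of each reference is complete
3. Check that citations and the reference list are consistent
4. Detect duplicate references
5. Detect the same reference cited on consecutive lines (new)
6. Generate a detailed report
7. NEW: open the final Word document automatically when the checks pass

Usage:
    python literature_integrity_checker.py <input_file> [--deep-check] [--no-auto-open]

Note: only --deep-check needs a network connection; it queries Crossref to check that a reference exists.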
"""
import re
import sys
import os
import json
import subprocess
import time
import urllib.request
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional
from urllib.parse import quote
@dataclass
class Reference:
"""文献数据结构"""
number: int
raw_text: str
authors: List[str]
title: str
journal: Optional[str] = None
year: Optional[int] = None
volume: Optional[str] = None
pages: Optional[str] = None
doi: Optional[str] = None
url: Optional[str] = None
@dataclass
class ValidationResult:
"""验证结果数据结构"""
reference_number: int
is_valid: bool
issues: List[str]
warnings: List[str]
suggestions: List[str]
search_results: Optional[List[Dict]] = None
class LiteratureIntegrityChecker:
"""文献完整性检查器"""
def __init__(self, enable_deep_check: bool = False, auto_open: bool = True):
"""初始化检查器"""
self.content = ""
self.file_path = ""
self.references: Dict[int, Reference] = {}
self.citations: List[int] = []
self.validation_results: List[ValidationResult] = []
self.enable_deep_check = enable_deep_check
self.auto_open = auto_open
self.issues_summary = {
'critical': [],
'warning': [],
'info': []
}
def load_content(self, file_path: str) -> None:
"""加载文档内容"""
self.file_path = file_path
with open(file_path, 'r', encoding='utf-8') as f:
self.content = f.read()
print(f"[加载] 文档: {file_path}")
print(f"[统计] 字符数: {len(self.content)}\n")
    def parse_reference(self, ref_text: str, ref_num: int) -> Reference:
        """Parse one entry, assuming the layout "Authors. Title[J]. Journal, Year, Volume(Issue): Pages."."""
        ref = Reference(number=ref_num, raw_text=ref_text, authors=[], title="")
        # Authors: everything before the first period
        author_match = re.match(r'^([^\.]+)\.\s*', ref_text)
        if author_match:
            authors_str = author_match.group(1).strip()
            # Split on ASCII or full-width (U+FF0C) commas
            ref.authors = [a.strip() for a in re.split(r'[,\uFF0C]', authors_str) if a.strip()]
        # Year: the first standalone 19xx or 20xx
        year_match = re.search(r'\b(?:19|20)\d{2}\b', ref_text)
        if year_match:
            ref.year = int(year_match.group(0))
            # Volume(issue) and pages follow the year, e.g. ", 35(2): 1-10."
            # Every part is optional, so this match always succeeds.
            tail = re.match(r'[\s,\uFF0C]*([\d()]+)?\s*(?:[:\uFF1A]\s*([\d\-\u2013]+))?',
                            ref_text[year_match.end():])
            ref.volume = tail.group(1)
            ref.pages = tail.group(2)
        # Title and journal sit between the authors and the year: "Title[J]. Journal, "
        # (the journal comes before the year, not after it)
        if author_match and year_match and year_match.start() >= author_match.end():
            middle = ref_text[author_match.end():year_match.start()].strip()
            middle = middle.rstrip(',\uFF0C').strip()
            if '.' in middle:
                title, journal = middle.rsplit('.', 1)
                ref.title = title.strip()
                ref.journal = journal.strip() or None
            else:
                ref.title = middle
        return ref
def extract_references(self) -> None:
"""提取所有参考文献"""
print("[步骤 1/6] 提取参考文献...")
# 查找参考文献章节
ref_section_pattern = r'##\s*参考文献|##\s*References'
ref_section_match = re.search(ref_section_pattern, self.content, re.IGNORECASE)
if not ref_section_match:
self.issues_summary['critical'].append("未找到参考文献章节")
return
# 提取参考文献部分的内容
ref_content = self.content[ref_section_match.end():]
# 匹配参考文献条目(格式: [数字] 开头)
ref_pattern = r'^\[(\d+)\]\s*(.+?)$'
references = re.findall(ref_pattern, ref_content, re.MULTILINE)
for ref_num, ref_text in references:
num = int(ref_num)
ref = self.parse_reference(ref_text, num)
self.references[num] = ref
print(f"[提取] 找到 {len(self.references)} 篇参考文献\n")
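    def extract_citations(self) -> None:
        """Extract every citation marker from the body text."""
        print("[步骤 2/6] 提取正文引用...")
        # Scan only the text before the reference section: each entry of the
        # reference list also starts with [n] and must not count as a citation
        ref_section_match = re.search(r'##\s*参考文献|##\s*References', self.content, re.IGNORECASE)
        body = self.content[:ref_section_match.start()] if ref_section_match else self.content
        # Accept [1], [1,3,5] and [2-4]; a range is expanded to every number it covers
        self.citations = []
        for group in re.findall(r'\[(\d+(?:\s*[,\-]\s*\d+)*)\]', body):
            for part in group.split(','):
                bounds = [int(n) for n in part.split('-')]
                self.citations.extend(range(min(bounds), max(bounds) + 1))
        # Count the unique numbers
        unique_citations = sorted(set(self.citations))
        print(f"[提取] 找到 {len(self.citations)} 处引用({len(unique_citations)} 个唯一编号)\n")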
def check_reference_completeness(self) -> None:
"""检查文献信息完整性"""
print("[步骤 3/6] 检查文献信息完整性...")
for ref_num, ref in self.references.items():
issues = []
warnings = []
# 检查必要字段
if not ref.authors:
issues.append("缺少作者信息")
if not ref.title or len(ref.title) < 5:
issues.append("标题可能不完整或缺失")
if not ref.journal:
warnings.append("期刊未识别")
if not ref.year:
issues.append("缺少年份信息")
if not ref.volume and not ref.pages:
warnings.append("缺少卷期或页码信息")
# 记录问题
if issues or warnings:
result = ValidationResult(
reference_number=ref_num,
is_valid=len(issues) == 0,
issues=issues,
warnings=warnings,
suggestions=[]
)
self.validation_results.append(result)
# 分类
for issue in issues:
self.issues_summary['critical'].append(f"[{ref_num}] {issue}")
for warning in warnings:
self.issues_summary['warning'].append(f"[{ref_num}] {warning}")
print(f"[检查] 发现 {len(self.validation_results)} 篇文献需要关注\n")
    def check_citation_consistency(self) -> None:
        """Check that every citation resolves to an entry in the reference list."""
        print("[步骤 4/6] 检查引用一致性...")
        max_ref_num = max(self.references.keys()) if self.references else 0
        cited = set(self.citations)
        # Invalid citations: numbers outside the range of the reference list
        invalid_citations = sorted(c for c in cited if c > max_ref_num or c <= 0)
        for invalid in invalid_citations:
            self.issues_summary['critical'].append(f"无效引用 [{invalid}](超出参考文献范围)")
        # Orphan citations: within range, but no entry with that number exists.
        # Numbers already reported as invalid are excluded so nothing is listed twice.
        missing_refs = sorted(cited - set(self.references.keys()) - set(invalid_citations))
        for missing in missing_refs:
            self.issues_summary['critical'].append(f"孤立引用 [{missing}](参考文献列表中不存在)")
        print(f"[检查] 无效引用: {len(invalid_citations)} | 孤立引用: {len(missing_refs)}\n")
    def check_duplicate_references(self) -> None:
        """Detect duplicate references by title and first author."""
        print("[步骤 5/6] 检查重复文献...")
        ref_signatures = {}
        duplicates = []
        for ref_num, ref in self.references.items():
            # Entries whose title could not be parsed are skipped; an empty title
            # would make unrelated entries look identical
            if not ref.title:
                continue
            # Signature: title + first author
            signature = f"{ref.title.lower()}_{ref.authors[0] if ref.authors else ''}"
            if signature in ref_signatures:
                duplicates.append((ref_signatures[signature], ref_num))
            else:
                ref_signatures[signature] = ref_num
        for dup1, dup2 in duplicates:
            self.issues_summary['warning'].append(f"疑似重复文献: [{dup1}] 和 [{dup2}]")
        print(f"[检查] 发现 {len(duplicates)} 对疑似重复文献\n")
    def check_citation_format(self) -> None:
        """Check citation formatting: repeated citations on a line and on consecutive lines."""
        print("[步骤 6/6] 检查引用格式...")
        lines = self.content.split('\n')
        # Citation numbers found on each line
        cited_per_line = [re.findall(r'\[(\d+)\]', line) for line in lines]
        # The same reference cited more than once on one line
        for i, citations in enumerate(cited_per_line):
            if len(citations) > len(set(citations)):
                self.issues_summary['warning'].append(f"第 {i+1} 行: 重复引用同一文献")
        # The same reference cited on each of three consecutive lines. The test is
        # per line: three citations on a single line do not count as a run.
        for i in range(len(lines) - 2):
            common = set(cited_per_line[i]) & set(cited_per_line[i + 1]) & set(cited_per_line[i + 2])
            for ref_num in sorted(common, key=int):
                self.issues_summary['warning'].append(
                    f"第 {i+1}-{i+3} 行: 连续3句引用同一文献 [{ref_num}],建议只在最后一句保留"
                )
        print("[检查] 完成\n")
    def run_deep_verification(self) -> None:
        """Deep verification (optional, needs a network connection)."""
        if not self.enable_deep_check:
            print("[深度验证] 跳过(未启用 --deep-check)\n")
            return
        print("[深度验证] 启动文献真实性验证...")
        print("注意: 此步骤需要网络连接,可能需要较长时间\n")
        verified_count = 0
        for ref_num, ref in self.references.items():
            if ref.title and ref.authors:
                try:
                    # Build the Crossref search query; one row is enough to get the hit count
                    query = f"{ref.authors[0]} {ref.title}"
                    encoded_query = quote(query)
                    search_url = f"https://api.crossref.org/works?query={encoded_query}&rows=1"
                    # Send the request
                    req = urllib.request.Request(search_url)
                    req.add_header('User-Agent', 'LiteratureIntegrityChecker/1.0')
                    with urllib.request.urlopen(req, timeout=10) as response:
                        data = json.loads(response.read().decode('utf-8'))
                    # Crossref reports the hit count in message["total-results"].
                    # Any hit counts as verified, which is a weak test: the search is
                    # fuzzy, so a hit does not prove that the entry itself is correct.
                    if data.get('message', {}).get('total-results', 0) > 0:
                        verified_count += 1
                        print(f"  [{ref_num}] ✓ 验证通过")
                    else:
                        self.issues_summary['warning'].append(f"[{ref_num}] 未找到匹配的记录")
                        print(f"  [{ref_num}] ⚠ 未找到匹配记录")
                    # Pause between requests
                    time.sleep(0.5)
                except Exception as e:
                    self.issues_summary['info'].append(f"[{ref_num}] 验证失败: {str(e)}")
                    print(f"  [{ref_num}] ✗ 验证失败: {str(e)}")
        print(f"\n[深度验证] 完成,验证了 {verified_count}/{len(self.references)} 篇文献\n")
def generate_report(self) -> str:
"""生成检查报告"""
report = []
report.append("# 文献完整性检查报告\n")
report.append(f"**生成时间**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
report.append(f"**检查文件**: {os.path.basename(self.file_path)}\n")
# 统计信息
report.append("## 检查统计\n")
report.append(f"- **参考文献总数**: {len(self.references)}\n")
report.append(f"- **正文引用总数**: {len(self.citations)}\n")
report.append(f"- **唯一引用数**: {len(set(self.citations))}\n")
report.append(f"- **严重问题**: {len(self.issues_summary['critical'])}\n")
report.append(f"- **警告问题**: {len(self.issues_summary['warning'])}\n")
report.append(f"- **提示信息**: {len(self.issues_summary['info'])}\n")
# 问题列表
if self.issues_summary['critical']:
report.append("\n## 严重问题\n")
for issue in self.issues_summary['critical']:
report.append(f"- {issue}\n")
if self.issues_summary['warning']:
report.append("\n## 警告\n")
for issue in self.issues_summary['warning']:
report.append(f"- {issue}\n")
if self.issues_summary['info']:
report.append("\n## 提示\n")
for issue in self.issues_summary['info']:
report.append(f"- {issue}\n")
# 检查清单
report.append("\n## 文献检查清单\n")
report.append("在提交前,请确保以下事项:\n")
report.append("- [ ] 所有引用都有对应的参考文献\n")
report.append("- [ ] 参考文献编号连续(1, 2, 3...)\n")
report.append("- [ ] 每篇文献都有完整的作者、标题、期刊、年份信息\n")
report.append("- [ ] 没有重复的参考文献\n")
report.append("- [ ] 引用标记位置正确(在标点符号之前)\n")
report.append("- [ ] 文献格式符合学术规范\n")
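        # Conclusion: "no issues" is claimed only when there are no warnings either
        report.append("\n## 检查结论\n")
        if self.issues_summary['critical']:
            report.append("⚠️ **发现严重问题,请在提交前修复!**\n")
        elif self.issues_summary['warning']:
            report.append("✅ **未发现严重问题,但存在警告,建议检查。**\n")
        else:
            report.append("✅ **恭喜!未发现任何问题,文献完整性良好!**\n")
        return '\n'.join(report)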
def save_report(self, output_path: str) -> None:
"""保存报告到文件"""
report = self.generate_report()
with open(output_path, 'w', encoding='utf-8') as f:
f.write(report)
print(f"[保存] 报告已生成: {output_path}\n")
def open_final_document(self) -> None:
"""打开最终的 Word 文档"""
if not self.auto_open:
print("[自动打开] 已禁用\n")
return
# 检查是否有严重问题
if len(self.issues_summary['critical']) > 0:
print("[自动打开] 检测到严重问题,跳过自动打开\n")
return
# 查找对应的 Word 文档
base_path = os.path.splitext(self.file_path)[0]
# 优先查找 _上标版.docx
possible_files = [
base_path + '_上标版.docx',
base_path + '.docx'
]
target_file = None
for file_path in possible_files:
if os.path.exists(file_path):
target_file = file_path
break
if target_file:
print(f"[自动打开] 正在打开文档: {os.path.basename(target_file)}")
try:
# 获取操作系统平台
platform = sys.platform
if platform == 'win32':
# Windows
os.startfile(target_file)
elif platform == 'darwin':
# macOS
subprocess.run(['open', target_file], check=True)
else:
# Linux
subprocess.run(['xdg-open', target_file], check=True)
print("OK 文档已打开\n")
except Exception as e:
print(f"WARNING 自动打开失败: {str(e)}")
print(f"请手动打开: {target_file}\n")
else:
print("[自动打开] 未找到对应的 Word 文档")
print(f" 查找路径: {base_path}_上标版.docx 或 {base_path}.docx\n")
def run_all_checks(self, input_file: str) -> None:
"""运行所有检查"""
print("\n" + "=" * 80)
print("文献完整性验证检查器 v3.2.0")
print("=" * 80 + "\n")
self.load_content(input_file)
# 1. 提取参考文献
self.extract_references()
# 2. 提取正文引用
self.extract_citations()
# 3. 检查文献完整性
self.check_reference_completeness()
# 4. 检查引用一致性
self.check_citation_consistency()
# 5. 检查重复文献
self.check_duplicate_references()
# 6. 检查引用格式
self.check_citation_format()
# 7. 深度验证(可选)
self.run_deep_verification()
# 生成报告
output_file = input_file.rsplit('.', 1)[0] + '_literature_integrity_report.md'
self.save_report(output_file)
# 打印总结
print("=" * 80)
print("检查完成!")
print("=" * 80)
critical = len(self.issues_summary['critical'])
warning = len(self.issues_summary['warning'])
info = len(self.issues_summary['info'])
print(f"严重问题: {critical} | 警告: {warning} | 提示: {info}")
print(f"报告路径: {output_file}\n")
# 自动打开最终文档
if self.auto_open:
self.open_final_document()
if critical > 0:
print("[!] 发现严重问题,请在提交前修复!")
sys.exit(1)
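
# --- Illustrative sketch, not part of the original checker ---
# check_citation_format works on lines, while its warning text speaks of consecutive
# sentences. A sentence-level variant could look like the helper below. The name
# `find_consecutive_citation_runs` is hypothetical, and the sketch assumes sentences
# end with a Chinese full stop, "!" or "?" (ASCII or full-width), with each citation
# placed before that punctuation.
import re


def find_consecutive_citation_runs(paragraph, min_run=3):
    """Return (reference number, run length) for each reference cited in at
    least min_run consecutive sentences of the paragraph."""
    # One match per sentence, including its closing punctuation
    sentences = re.findall(r'[^\u3002\uFF01\uFF1F!?]+[\u3002\uFF01\uFF1F!?]?', paragraph)
    runs = []
    streak = {}  # reference number -> length of the run that is still open
    for sentence in sentences:
        cited = set()
        for group in re.findall(r'\[(\d+(?:\s*[,\-]\s*\d+)*)\]', sentence):
            for part in group.split(','):
                bounds = [int(n) for n in part.split('-')]
                cited.update(range(min(bounds), max(bounds) + 1))
        # Close the runs of references that this sentence does not cite
        for num, length in sorted(streak.items()):
            if num not in cited and length >= min_run:
                runs.append((num, length))
        streak = {num: streak.get(num, 0) + 1 for num in cited}
    # Runs that are still open at the end of the paragraph
    runs.extend((num, length) for num, length in sorted(streak.items()) if length >= min_run)
    return runs

# Calling it once per paragraph keeps a run from spanning a paragraph break.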
def main():
"""主函数"""
if len(sys.argv) < 2:
print("文献完整性验证检查工具 v3.2.0")
print("=" * 60)
print("\n使用方法:")
print(" python literature_integrity_checker.py <input_file> [--deep-check] [--no-auto-open]")
print("\n参数:")
print(" input_file 要检查的Markdown文件路径")
print(" --deep-check 启用深度验证(需要网络连接)")
print(" --no-auto-open 禁用自动打开最终文档")
print("\n示例:")
print(" python literature_integrity_checker.py review.md")
print(" python literature_integrity_checker.py review.md --deep-check")
print(" python literature_integrity_checker.py review.md --no-auto-open")
print("\n说明:")
print(" 此工具会检查文献的完整性、一致性和规范性")
print(" 发现严重问题时返回非零退出码")
print(" 检查通过后自动打开对应的 Word 文档(_上标版.docx)")
sys.exit(1)
input_file = sys.argv[1]
enable_deep_check = '--deep-check' in sys.argv
auto_open = '--no-auto-open' not in sys.argv
if not os.path.exists(input_file):
print(f"[错误] 文件不存在: {input_file}")
sys.exit(1)
checker = LiteratureIntegrityChecker(enable_deep_check=enable_deep_check, auto_open=auto_open)
checker.run_all_checks(input_file)
if __name__ == "__main__":
main()
FILE:scripts/literature_integrity_checker_auto_open.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
文献完整性检查器 - 增强版
添加检查完成后打开最终文档的功能
"""
import re
import sys
import os
import subprocess
from datetime import datetime
from typing import List, Dict
class LiteratureIntegrityChecker:
"""文献完整性检查器"""
def __init__(self, enable_deep_check=False):
"""初始化检查器"""
self.content = ""
self.references = []
self.citations = []
self.issues_summary = {
'critical': [],
'warning': [],
'info': []
}
self.enable_deep_check = enable_deep_check
def load_content(self, file_path: str) -> None:
"""加载文档内容"""
with open(file_path, 'r', encoding='utf-8') as f:
self.content = f.read()
print(f"[加载] 文档: {file_path}")
print(f"[统计] 字符数: {len(self.content)}")
def extract_references(self) -> None:
"""提取参考文献"""
print("\n" + "=" * 80)
print("【1】提取参考文献")
print("=" * 80)
# 匹配参考文献格式: [1] 作者. 标题. 期刊, 年份, 卷(期): 页码.
ref_pattern = re.compile(
r'^\[(\d+)\]\s*([^\.]+)\.\s*([^\.]+)\.\s*([^,]+),\s*(\d{4}),\s*(.+)\.\s*$',
re.MULTILINE
)
matches = ref_pattern.findall(self.content)
for match in matches:
ref_num = int(match[0])
author = match[1].strip()
title = match[2].strip()
journal = match[3].strip()
year = match[4].strip()
pages = match[5].strip()
self.references.append({
'number': ref_num,
'author': author,
'title': title,
'journal': journal,
'year': year,
'pages': pages
})
self.references.sort(key=lambda x: x['number'])
print(f"[完成] 提取到 {len(self.references)} 篇参考文献")
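    def extract_citations(self) -> None:
        """Extract citations from the body text."""
        print("\n" + "=" * 80)
        print("【2】提取正文引用")
        print("=" * 80)
        # Scan only the text before the reference section, so that the [n] that
        # starts each reference entry is not counted as a citation
        ref_heading = re.search(r'^#{2,3}\s*参考文献', self.content, re.MULTILINE)
        body = self.content[:ref_heading.start()] if ref_heading else self.content
        # Match citation markers in the body: [1], [2-3], [1,3,5]
        citation_pattern = re.compile(r'\[(\d+(?:\s*[,\-]\s*\d+)*)\]')
        matches = citation_pattern.findall(body)
        # Collect every number in each marker, not only the first one;
        # a range is expanded to every number it covers
        unique_citations = set()
        for group in matches:
            for part in group.split(','):
                bounds = [int(n) for n in part.split('-')]
                unique_citations.update(range(min(bounds), max(bounds) + 1))
        self.citations = sorted(unique_citations)
        print(f"[完成] 提取到 {len(matches)} 处引用 ({len(self.citations)} 个唯一引用)")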
def check_reference_completeness(self) -> None:
"""检查文献信息完整性"""
print("\n" + "=" * 80)
print("【3】检查文献信息完整性")
print("=" * 80)
incomplete_refs = []
for ref in self.references:
issues = []
if not ref['journal'] or ref['journal'] == '未识别':
issues.append('缺少期刊名')
if not ref['pages'] or ref['pages'] == '未识别':
issues.append('缺少页码信息')
if not ref['title'] or len(ref['title']) < 10:
issues.append('标题可能不完整或缺失')
if issues:
incomplete_refs.append({
'number': ref['number'],
'issues': issues
})
print(f"[完成] 发现 {len(incomplete_refs)} 篇文献信息不完整")
for ref in incomplete_refs:
for issue in ref['issues']:
self.issues_summary['warning'].append({
'type': 'incomplete_reference',
'reference': ref['number'],
'message': f"文献[{ref['number']}]: {issue}"
})
def check_citation_consistency(self) -> None:
"""检查引用一致性"""
print("\n" + "=" * 80)
print("【4】检查引用一致性")
print("=" * 80)
critical_issues = []
warnings = []
ref_numbers = set(ref['number'] for ref in self.references)
# 检查无效引用
invalid_citations = [c for c in self.citations if c not in ref_numbers]
for citation in invalid_citations:
critical_issues.append(f"引用 [{citation}] 超出参考文献范围")
self.issues_summary['critical'].append({
'type': 'invalid_citation',
'citation': citation,
'message': f"引用 [{citation}] 超出参考文献范围"
})
# 检查孤立文献
orphan_refs = [ref['number'] for ref in self.references if ref['number'] not in self.citations]
for ref_num in orphan_refs:
warnings.append(f"文献 [{ref_num}] 未在正文中被引用")
self.issues_summary['warning'].append({
'type': 'orphan_reference',
'reference': ref_num,
'message': f"文献 [{ref_num}] 未在正文中被引用"
})
print(f"[完成] 发现 {len(critical_issues)} 个严重问题")
print(f"[完成] 发现 {len(warnings)} 个提示")
def check_duplicate_references(self) -> None:
"""检查重复文献"""
print("\n" + "=" * 80)
print("【5】检查重复文献")
print("=" * 80)
# 基于标题和作者检查重复
seen = {}
duplicates = []
for ref in self.references:
key = (ref['title'].lower(), ref['author'].lower())
if key in seen:
duplicates.append({
'original': seen[key],
'duplicate': ref['number']
})
self.issues_summary['warning'].append({
'type': 'duplicate_reference',
'reference': ref['number'],
'message': f"文献 [{ref['number']}] 与文献 [{seen[key]}] 可能重复"
})
else:
seen[key] = ref['number']
print(f"[完成] 发现 {len(duplicates)} 篇重复文献")
def check_citation_format(self) -> None:
"""检查引用格式"""
print("\n" + "=" * 80)
print("【6】检查引用格式")
print("=" * 80)
format_issues = []
# 检查引用位置是否在标点符号前
lines = self.content.split('\n')
for i, line in enumerate(lines, 1):
# 匹配引用在标点符号后的情况
if re.search(r'[。!?\.,;:]\s*\[\d+', line):
format_issues.append({
'line': i,
'message': f'第{i}行: 引用可能位于标点符号后'
})
print(f"[完成] 发现 {len(format_issues)} 个格式问题")
for issue in format_issues:
self.issues_summary['info'].append({
'type': 'citation_position',
'line': issue['line'],
'message': issue['message']
})
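    def run_deep_verification(self) -> None:
        """Deep verification (needs a network connection; not implemented yet)."""
        print("\n" + "=" * 80)
        if not self.enable_deep_check:
            print("【7】深度验证(已跳过,使用 --deep-check 启用)")
            print("=" * 80)
            print("[提示] 深度验证需要网络连接,用于在线验证文献真实性")
            return
        print("【7】深度验证")
        print("=" * 80)
        # TODO: implement the online verification
        print("[提示] 深度验证功能待实现")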
def save_report(self, output_file: str) -> None:
"""保存检查报告"""
with open(output_file, 'w', encoding='utf-8') as f:
f.write(f"# 文献完整性验证报告\n\n")
f.write(f"**检查时间**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
f.write(f"**检查文件**: {os.path.basename(self.content_file_path) if hasattr(self, 'content_file_path') else 'unknown'}\n\n")
f.write(f"**深度验证**: {'已启用' if self.enable_deep_check else '未启用'}\n\n")
            f.write("## 统计信息\n\n")
            f.write(f"- **参考文献总数**: {len(self.references)}\n")
            # self.citations holds unique numbers only, so it is reported once
            f.write(f"- **唯一引用数**: {len(self.citations)}\n\n")
            f.write("## 问题汇总\n\n")
            f.write(f"- 🔴 **严重问题**: {len(self.issues_summary['critical'])} 个\n")
            f.write(f"- 🟡 **警告问题**: {len(self.issues_summary['warning'])} 个\n")
            f.write(f"- 🔵 **提示信息**: {len(self.issues_summary['info'])} 个\n\n")
            # List the issues of every level, most severe first
            # (critical and info issues were previously counted but never listed)
            sections = [
                ('critical', "### 🔴 严重问题"),
                ('warning', "### 🟡 警告问题(建议修复)"),
                ('info', "### 🔵 提示信息"),
            ]
            for level, heading in sections:
                if self.issues_summary[level]:
                    f.write(f"{heading}\n\n")
                    for i, issue in enumerate(self.issues_summary[level], 1):
                        f.write(f"{i}. {issue['message']}\n")
                    f.write("\n")
f.write("## 文献详情\n\n")
for ref in self.references:
title_preview = ref['title'][:50] + '...' if len(ref['title']) > 50 else ref['title']
f.write(f"### [{ref['number']}] {title_preview}\n\n")
f.write(f"- **作者**: {ref['author']}\n")
f.write(f"- **期刊**: {ref['journal']}\n")
f.write(f"- **年份**: {ref['year']}\n")
f.write(f"- **卷期**: {ref.get('volume', '未识别')}\n")
f.write(f"- **页码**: {ref['pages']}\n")
f.write(f"- **DOI**: {ref.get('doi', '未识别')}\n\n")
def run_all_checks(self, input_file: str) -> None:
"""运行所有检查"""
print("\n" + "=" * 80)
print("开始文献完整性验证")
print("=" * 80 + "\n")
self.content_file_path = input_file
self.load_content(input_file)
# 1. 提取参考文献
self.extract_references()
# 2. 提取正文引用
self.extract_citations()
# 3. 检查文献完整性
self.check_reference_completeness()
# 4. 检查引用一致性
self.check_citation_consistency()
# 5. 检查重复文献
self.check_duplicate_references()
# 6. 检查引用格式
self.check_citation_format()
# 7. 深度验证(可选)
self.run_deep_verification()
# 生成报告
output_file = input_file.rsplit('.', 1)[0] + '_literature_integrity_report.md'
self.save_report(output_file)
# 打印总结
print("=" * 80)
print("检查完成!")
print("=" * 80)
critical = len(self.issues_summary['critical'])
warning = len(self.issues_summary['warning'])
info = len(self.issues_summary['info'])
print(f"严重问题: {critical} | 警告: {warning} | 提示: {info}")
print(f"报告路径: {output_file}\n")
if critical == 0:
print("=" * 80)
print("检查通过!正在打开最终文档...")
print("=" * 80)
# 查找最终的 Word 文档
word_doc = input_file.rsplit('.', 1)[0] + '_上标版.docx'
if os.path.exists(word_doc):
# 尝试打开 Word 文档
try:
if os.name == 'nt': # Windows
os.startfile(word_doc)
elif os.name == 'posix': # macOS/Linux
subprocess.call(['open', word_doc] if sys.platform == 'darwin' else ['xdg-open', word_doc])
print(f"✓ 已打开文档: {word_doc}")
except Exception as e:
print(f"✗ 自动打开失败: {e}")
print(f" 请手动打开文档: {word_doc}")
else:
print(f"⚠ 未找到 Word 文档: {word_doc}")
print(f" 请先生成 Word 文档:")
print(f" node scripts/create_word_with_superscript.js {input_file}")
else:
print("[!] 发现严重问题,请在提交前修复!")
return critical > 0
def main():
    """Command-line entry point."""
    if len(sys.argv) < 2:
        print("文献完整性验证检查工具")
        print("=" * 60)
        print("\n使用方法:")
        print("  python literature_integrity_checker_auto_open.py <input_file> [--deep-check]")
        print("\n参数:")
        print("  input_file      要检查的Markdown文件路径")
        print("  --deep-check    启用深度验证(需要网络连接)")
        print("\n示例:")
        print("  python literature_integrity_checker_auto_open.py review.md")
        print("  python literature_integrity_checker_auto_open.py review.md --deep-check")
        print("\n说明:")
        print("  此工具会检查文献的完整性、一致性和规范性")
        print("  发现严重问题时返回非零退出码")
        print("  检查通过后会自动打开最终生成的 Word 文档")
        sys.exit(1)
    input_file = sys.argv[1]
    enable_deep_check = '--deep-check' in sys.argv
    if not os.path.exists(input_file):
        print(f"[错误] 文件不存在: {input_file}")
        sys.exit(1)
    checker = LiteratureIntegrityChecker(enable_deep_check=enable_deep_check)
    has_errors = checker.run_all_checks(input_file)
    if has_errors:
        sys.exit(1)


if __name__ == "__main__":
    main()
FILE:scripts/merge_duplicate_citations_in_paragraphs.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
合并同一段落中的重复引用组合
功能:在同一段落中,如果相同的引用编号组合多次出现,
只保留最后一个出现的引用组合,删除前面重复的。
例如:
原文:...非人灵长类动物之一[34,37]。某物种体长...15-17 kg[34,36]。
其毛色较为单调...黑色毛冠[34,36]。
合并后:...非人灵长类动物之一[34,37]。某物种体长...15-17 kg[34,36]。
其毛色较为单调...黑色毛冠。
用法:
python merge_duplicate_citations_in_paragraphs.py <输入文件.md> [输出文件.md]
如果不指定输出文件,默认生成:<原文件名>_merged.md
"""
import re
import sys
def find_paragraphs(text):
"""
将文本分割成段落
段落以空行分隔
"""
# 按空行分割段落
paragraphs = re.split(r'\n\s*\n', text)
return [p.strip() for p in paragraphs if p.strip()]
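def normalize_citation(citation_text):
    """
    Normalize a citation group so that equivalent groups compare equal.
    [36,34] and [34,36] are the same group. A range such as 34-36 is expanded
    to 34, 35, 36, so it is not confused with [34,36].
    """
    numbers = set()
    for part in citation_text.split(','):
        bounds = [int(n) for n in re.findall(r'\d+', part)]
        if bounds:
            numbers.update(range(min(bounds), max(bounds) + 1))
    # The sorted tuple is the identity of the group
    return tuple(sorted(numbers))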
def merge_citations_in_paragraph(paragraph):
    """
    Merge repeated citation groups within a single paragraph.
    The last occurrence of each group is kept; earlier occurrences are removed.

    Handles:
    - single citations: [34]
    - groups: [34,36], [34-36] or [34,35,36]

    Example:
        Before: ...kg[34,36]。其毛色...[34,36]。
        After:  ...kg。其毛色...[34,36]。
    """
    # Match citation markers, including groups [34,36] and ranges [34-36]
    citation_pattern = re.compile(r'\[(\d+(?:,\d+)*(?:-\d+)?)\]')
    citations = list(citation_pattern.finditer(paragraph))
    if not citations:
        return paragraph
    # Record the span of the last occurrence of each group
    last_occurrence = {}
    for match in citations:
        last_occurrence[normalize_citation(match.group(1))] = match.span()
    # Every occurrence that is not the last one is removed
    to_remove = [
        match.span() for match in citations
        if match.span() != last_occurrence[normalize_citation(match.group(1))]
    ]
    if not to_remove:
        return paragraph  # no repeated citations
    # Delete from the end so that earlier offsets stay valid
    result = paragraph
    for start, end in sorted(to_remove, reverse=True):
        result = result[:start] + result[end:]
    return result
def process_document(content):
    """
    Process the whole document: merge repeated citations paragraph by paragraph.
    Returns (new content, number of citations removed).
    """
    # Split the body text from the reference section
    ref_section_match = re.search(r'\n## 参考文献\s*\n', content)
    if ref_section_match:
        main_text = content[:ref_section_match.start()]
        ref_section = content[ref_section_match.start():]
    else:
        main_text = content
        ref_section = ""
    # Process the body paragraph by paragraph
    citation_regex = re.compile(r'\[\d+(?:,\d+)*(?:-\d+)?\]')
    merged_count = 0
    processed_paragraphs = []
    for para in find_paragraphs(main_text):
        processed_para = merge_citations_in_paragraph(para)
        if processed_para != para:
            # Count the citations that were removed
            merged_count += len(citation_regex.findall(para)) - len(citation_regex.findall(processed_para))
        processed_paragraphs.append(processed_para)
    # Reassemble the body
    processed_main_text = '\n\n'.join(processed_paragraphs)
    # Paragraphs are stripped, so restore the blank line before the reference heading
    if ref_section:
        result = processed_main_text + '\n' + ref_section
    else:
        result = processed_main_text
    return result, merged_count
def main():
    if len(sys.argv) < 2:
        print("用法: python merge_duplicate_citations_in_paragraphs.py <输入文件.md> [输出文件.md]")
        print("\n功能:合并同一段落中的重复引用,只保留最后一次出现")
        print("\n示例:")
        print(' 原文:研究显示[34]有显著效果[36],进一步分析[34]表明...')
        print(' 合并后:研究显示有显著效果[36],进一步分析[34]表明...')
        sys.exit(1)
    input_file = sys.argv[1]
    if len(sys.argv) >= 3:
        output_file = sys.argv[2]
    elif input_file.endswith('.md'):
        # Derive the output name: foo.md -> foo_merged.md
        output_file = input_file[:-3] + '_merged.md'
    else:
        output_file = input_file + '_merged.md'
    # Read the input file.
    try:
        with open(input_file, 'r', encoding='utf-8') as f:
            content = f.read()
    except FileNotFoundError:
        print(f"错误: 找不到文件 '{input_file}'")
        sys.exit(1)
    except Exception as e:
        print(f"错误: 读取文件失败 - {e}")
        sys.exit(1)
    print(f"正在处理: {input_file}")
    # Process the document.
    result, merged_count = process_document(content)
    # Write the output file.
    try:
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(result)
    except Exception as e:
        print(f"错误: 写入文件失败 - {e}")
        sys.exit(1)
    # Print a summary.
    print("\n处理完成!")
    print(f" 输出文件: {output_file}")
    print(f" 合并重复引用: {merged_count} 处")
    if merged_count > 0:
        print("\n✓ 已删除同一段落中的重复引用,保留最后一次出现")
    else:
        print("\n✓ 未发现需要合并的重复引用")
if __name__ == '__main__':
    main()
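The keep-last rule used by this script can be sketched in isolation. The snippet below is a minimal illustration, not the script's own code: `expand_key` and `keep_last_citation` are hypothetical helper names, and the pattern additionally tolerates spaces after commas, which is an assumption about the input.

```python
import re

# One citation marker: numbers or ranges separated by commas, e.g. [34], [34,36], [34-36].
CITATION = re.compile(r'\[(\d+(?:-\d+)?(?:\s*,\s*\d+(?:-\d+)?)*)\]')

def expand_key(body):
    """Expand '34-36' to {34, 35, 36} so ranges and lists compare by content."""
    numbers = set()
    for part in body.split(','):
        bounds = [int(x) for x in part.strip().split('-')]
        numbers.update(range(min(bounds), max(bounds) + 1))
    return frozenset(numbers)

def keep_last_citation(paragraph):
    """Drop every citation group that reappears later in the same paragraph."""
    matches = list(CITATION.finditer(paragraph))
    # For each distinct group, remember the index of its final occurrence.
    last_index = {expand_key(m.group(1)): i for i, m in enumerate(matches)}
    pieces, cursor = [], 0
    for i, m in enumerate(matches):
        if last_index[expand_key(m.group(1))] != i:
            pieces.append(paragraph[cursor:m.start()])  # text before the dropped marker
            cursor = m.end()                            # skip the marker itself
    pieces.append(paragraph[cursor:])
    return ''.join(pieces)
```

Here `[34,36]` and `[36,34]` count as the same group, while `[34-36]` is a different one because it also covers reference 35.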
FILE:scripts/reference_accuracy_checker.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
文献格式和引用准确性检查脚本
Reference Format and Citation Accuracy Checker Script
功能:
1. 验证文献格式规范性
2. 检查引用准确性
3. 生成检查报告
使用方法:
python reference_accuracy_checker.py <input_file>
"""
import re
import sys
import os
from datetime import datetime
from typing import List, Dict, Tuple, Set, Optional
class ReferenceAccuracyChecker:
"""文献格式和引用准确性检查器"""
def __init__(self):
"""初始化检查器"""
self.content = ""
self.references = {}
self.issues = []
def load_content(self, file_path: str) -> None:
"""加载文档内容"""
with open(file_path, 'r', encoding='utf-8') as f:
self.content = f.read()
print(f"[加载] 文档: {file_path}")
def extract_references(self) -> Dict[int, str]:
"""提取参考文献列表"""
refs = {}
# 匹配参考文献格式: [数字] 文献内容
ref_pattern = r'^\[(\d+)\]\s+(.+)$'
ref_section = re.search(r'## 参考文献\n([\s\S]+)$', self.content)
if ref_section:
ref_list = ref_section.group(1)
for line in ref_list.split('\n'):
line = line.strip()
match = re.match(ref_pattern, line)
if match:
num = int(match.group(1))
content = match.group(2)
refs[num] = content
self.references = refs
print(f"[提取] 参考文献 {len(refs)} 篇")
return refs
def check_reference_format(self) -> List[Dict]:
"""检查文献格式"""
issues = []
print(" [检查] 文献格式...")
for num, ref_content in self.references.items():
# 检查基本格式
if not ref_content:
issues.append({
'type': 'ref_format',
'reference': num,
'message': f'文献 [{num}] 内容为空',
'content': ref_content
})
continue
# 检查作者格式
author_match = re.match(r'^([^\.]+)\.', ref_content)
if not author_match:
issues.append({
'type': 'ref_format',
'reference': num,
'message': f'文献 [{num}] 作者格式不正确,应为: 作者. 标题',
'content': ref_content[:80]
})
continue
authors = author_match.group(1)
# Check the author separator: a full-width comma (,) should be an ASCII comma.
if ',' in authors:
issues.append({
'type': 'ref_format',
'reference': num,
'message': f'文献 [{num}] 作者分隔符应使用英文逗号","',
'content': ref_content[:80]
})
# 检查期刊名格式(应包含期刊名)
# 常见期刊格式模式
journal_patterns = [
r'[A-Z][a-z]+[A-Z][a-z]+', # 英文期刊名(大写开头)
r'[A-Z]{2,}', # 缩写期刊名
]
has_journal = False
for pattern in journal_patterns:
if re.search(pattern, ref_content):
has_journal = True
break
if not has_journal:
issues.append({
'type': 'ref_format',
'reference': num,
'message': f'文献 [{num}] 似乎缺少期刊名或期刊名格式不规范',
'content': ref_content[:80]
})
# 检查年份(应包含4位数字年份)
year_match = re.search(r'\b(19|20)\d{2}\b', ref_content)
if not year_match:
issues.append({
'type': 'ref_format',
'reference': num,
'message': f'文献 [{num}] 似乎缺少年份或年份格式不正确',
'content': ref_content[:80]
})
# 检查卷期号(可选但建议有)
volume_pattern = r'\d+\(?[A-Za-z]?\)?'
if not re.search(volume_pattern, ref_content):
# 卷期号不是必需的,所以这只是警告
pass
# 检查页码(可选但建议有)
page_pattern = r'\d+-\d+|\d+$'
if not re.search(page_pattern, ref_content):
# 页码不是必需的,所以这只是警告
pass
print(f" [完成] 发现 {len(issues)} 个文献格式问题")
return issues
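    def extract_citations(self) -> Set[int]:
        """Extract every reference number cited in the body text."""
        # Scan only the text before the reference list; otherwise the list's
        # own [n] labels count as citations and hide unused references.
        body = re.split(r'## 参考文献\n', self.content, maxsplit=1)[0]
        citation_nums = set()
        # Match [n] as well as groups such as [3,5] and ranges such as [7-9].
        for group in re.findall(r'\[(\d+(?:[,-]\d+)*)\]', body):
            for part in group.split(','):
                bounds = [int(n) for n in part.split('-')]
                citation_nums.update(range(min(bounds), max(bounds) + 1))
        print(f" [提取] 正文引用: {len(citation_nums)} 处")
        return citation_nums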
def check_citation_accuracy(self) -> List[Dict]:
"""检查引用准确性"""
issues = []
print(" [检查] 引用准确性...")
# 提取正文中的所有引用
citation_nums = self.extract_citations()
# 检查1: 无效引用(超出参考文献范围)
max_ref = max(self.references.keys()) if self.references else 0
invalid_citations = {num for num in citation_nums if num > max_ref}
if invalid_citations:
for num in sorted(invalid_citations):
issues.append({
'type': 'invalid_citation',
'number': num,
'message': f'引用 [{num}] 超出参考文献范围 (1-{max_ref})'
})
# 检查2: 孤立引用(正文中有引用但参考文献中没有)
if self.references:
ref_nums = set(self.references.keys())
orphan_citations = citation_nums - ref_nums
if orphan_citations:
for num in sorted(orphan_citations):
issues.append({
'type': 'orphan_citation',
'number': num,
'message': f'引用 [{num}] 在正文中出现但不在参考文献列表中'
})
# 检查3: 未引用的文献(参考文献中有但正文中未引用)
if self.references and citation_nums:
ref_nums = set(self.references.keys())
unused_refs = ref_nums - citation_nums
if unused_refs:
issues.append({
'type': 'unused_reference',
'count': len(unused_refs),
'message': f'发现 {len(unused_refs)} 篇文献在正文中未被引用(这是正常的)',
'numbers': sorted(list(unused_refs))
})
print(f" [完成] 发现 {len(issues)} 个引用准确性问题")
return issues
def check_reference_continuity(self) -> List[Dict]:
"""检查参考文献编号连续性"""
issues = []
print(" [检查] 参考文献编号连续性...")
if not self.references:
return issues
ref_nums = sorted(self.references.keys())
# 检查是否有重复编号
num_counts = {}
for num in ref_nums:
num_counts[num] = num_counts.get(num, 0) + 1
duplicates = {num: count for num, count in num_counts.items() if count > 1}
if duplicates:
for num, count in duplicates.items():
issues.append({
'type': 'duplicate_reference',
'number': num,
'message': f'参考文献编号 [{num}] 重复 {count} 次'
})
# 检查是否有缺失编号
if ref_nums:
expected_start = 1
expected_end = max(ref_nums)
expected_nums = set(range(expected_start, expected_end + 1))
actual_nums = set(ref_nums)
missing_nums = expected_nums - actual_nums
if missing_nums:
issues.append({
'type': 'missing_reference',
'count': len(missing_nums),
'message': f'参考文献编号不连续,缺失: {sorted(list(missing_nums))}'
})
print(f" [完成] 发现 {len(issues)} 个编号连续性问题")
return issues
def check_citation_continuity(self) -> List[Dict]:
"""检查正文中引用编号的连续性"""
issues = []
print(" [检查] 正文中引用编号连续性...")
# 提取正文中的所有引用
citation_pattern = r'\[(\d+)\]'
citations = re.findall(citation_pattern, self.content)
citation_nums = [int(num) for num in citations]
if not citation_nums:
return issues
# 检查引用编号是否连续(可选,不是必需的)
# 这里我们检查是否有较大的跳跃
sorted_citations = sorted(set(citation_nums))
gaps = []
for i in range(len(sorted_citations) - 1):
if sorted_citations[i+1] - sorted_citations[i] > 5: # 跳跃超过5
gaps.append((sorted_citations[i], sorted_citations[i+1]))
if gaps:
issues.append({
'type': 'citation_gap',
'message': f'正文中引用编号存在较大跳跃: {gaps}'
})
print(f" [完成] 发现 {len(issues)} 个引用连续性问题")
return issues
def check_citation_position(self) -> List[Dict]:
"""检查引用标注位置"""
issues = []
print(" [检查] 引用标注位置...")
# 检查引用标注是否在标点符号之后(不符合规范)
# 规范: 引用应在标点符号之前
bad_patterns = [
(r'[,;]\s*\[(\d+)\]', '逗号或分号之后的引用应在其之前'),
(r':[a-zA-Z0-9\s]*\s*\[(\d+)\]', '冒号后的引用通常应在冒号之前'),
]
for pattern, message in bad_patterns:
matches = re.finditer(pattern, self.content)
for match in matches:
issues.append({
'type': 'citation_position',
'location': match.start(),
'message': message,
'content': match.group(0)
})
print(f" [完成] 发现 {len(issues)} 个引用位置问题")
return issues
def run_all_checks(self) -> List[Dict]:
"""运行所有检查"""
print("\n[开始] 文献格式和引用准确性检查...")
print("="*60)
# 1. 检查文献格式
self.issues.extend(self.check_reference_format())
# 2. 检查引用准确性
self.issues.extend(self.check_citation_accuracy())
# 3. 检查参考文献编号连续性
self.issues.extend(self.check_reference_continuity())
# 4. 检查正文中引用编号连续性
self.issues.extend(self.check_citation_continuity())
# 5. 检查引用标注位置
self.issues.extend(self.check_citation_position())
print("="*60)
print(f"\n[完成] 共发现 {len(self.issues)} 个问题\n")
return self.issues
def generate_report(self) -> str:
"""生成检查报告"""
report = []
report.append("# 文献格式和引用准确性检查报告\n")
report.append(f"生成时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
# 统计问题类型
issue_types = {}
for issue in self.issues:
issue_type = issue['type']
issue_types[issue_type] = issue_types.get(issue_type, 0) + 1
report.append("## 问题统计\n")
type_names = {
'ref_format': '文献格式',
'invalid_citation': '无效引用',
'orphan_citation': '孤立引用',
'unused_reference': '未引用文献',
'duplicate_reference': '重复编号',
'missing_reference': '缺失编号',
'citation_gap': '引用编号跳跃',
'citation_position': '引用位置'
}
for issue_type, count in issue_types.items():
name = type_names.get(issue_type, issue_type)
report.append(f"- {name}: {count} 个\n")
report.append("\n## 详细问题列表\n")
# 按类型分组显示问题
grouped_issues = {}
for issue in self.issues:
issue_type = issue['type']
if issue_type not in grouped_issues:
grouped_issues[issue_type] = []
grouped_issues[issue_type].append(issue)
type_order = ['ref_format', 'invalid_citation', 'orphan_citation',
'unused_reference', 'duplicate_reference', 'missing_reference',
'citation_gap', 'citation_position']
has_issues = False
for issue_type in type_order:
if issue_type in grouped_issues:
has_issues = True
type_names = {
'ref_format': '### 文献格式问题',
'invalid_citation': '### 无效引用',
'orphan_citation': '### 孤立引用',
'unused_reference': '### 未引用文献',
'duplicate_reference': '### 重复编号',
'missing_reference': '### 缺失编号',
'citation_gap': '### 引用编号跳跃',
'citation_position': '### 引用位置问题'
}
report.append(f"{type_names.get(issue_type, f'### {issue_type}')}\n\n")
for i, issue in enumerate(grouped_issues[issue_type], 1):
report.append(f"{i}. {issue['message']}\n")
if 'content' in issue:
report.append(f" - 内容: {issue['content']}\n")
if 'number' in issue:
report.append(f" - 编号: [{issue['number']}]\n")
if 'reference' in issue:
report.append(f" - 参考文献: [{issue['reference']}]\n")
if 'numbers' in issue:
report.append(f" - 编号列表: {issue['numbers']}\n")
report.append("\n")
# 检查结果总结
if not has_issues:
report.append("\n## 检查结果\n")
report.append("✅ 未发现文献格式和引用准确性问题,文档质量良好!\n")
else:
report.append("\n## 检查结果\n")
report.append(f"⚠️ 发现 {len(self.issues)} 个问题,建议进行修正。\n")
return ''.join(report)
def save_report(self, report_path: str) -> None:
"""保存检查报告"""
report = self.generate_report()
with open(report_path, 'w', encoding='utf-8') as f:
f.write(report)
print(f"[报告] 已生成: {report_path}\n")
def process(self, input_file: str, output_report: Optional[str] = None) -> None:
"""处理主流程"""
print("="*60)
print("文献格式和引用准确性检查工具")
print("="*60)
# 加载内容
self.load_content(input_file)
# 提取参考文献
self.extract_references()
# 运行所有检查
issues = self.run_all_checks()
# 生成报告
if output_report is None:
output_report = input_file.rsplit('.', 1)[0] + '_reference_accuracy_report.md'
self.save_report(output_report)
# 显示总结
print("="*60)
print(f"检查完成! 共发现 {len(issues)} 个问题")
print(f"报告文件: {output_report}")
print("="*60)
def main():
"""主函数"""
if len(sys.argv) < 2:
print("使用方法:")
print(" python reference_accuracy_checker.py <input_file>")
print("\n示例:")
print(" python reference_accuracy_checker.py review.md")
sys.exit(1)
input_file = sys.argv[1]
if not os.path.exists(input_file):
print(f"[错误] 输入文件不存在: {input_file}")
sys.exit(1)
# 创建检查器并处理
checker = ReferenceAccuracyChecker()
checker.process(input_file)
if __name__ == '__main__':
main()
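The cross-check between body citations and the reference list can be sketched on its own. This is a minimal illustration under stated assumptions, not the checker's implementation: the function names are hypothetical, and reference numbers are collected in a list rather than a dict so that duplicate numbers remain visible.

```python
import re
from collections import Counter

REF_HEADING = '## 参考文献'  # heading that opens the reference list

def split_document(content):
    """Return (main_text, reference_section); the latter is '' without a heading."""
    index = content.find(REF_HEADING)
    if index == -1:
        return content, ''
    return content[:index], content[index + len(REF_HEADING):]

def reference_numbers(ref_section):
    """All leading [n] labels in list order, duplicates preserved."""
    return [int(n) for n in re.findall(r'^\[(\d+)\]', ref_section, re.MULTILINE)]

def cited_numbers(main_text):
    """Numbers cited in the body, expanding groups such as [3,5] and [7-9]."""
    cited = set()
    pattern = r'\[(\d+(?:-\d+)?(?:\s*,\s*\d+(?:-\d+)?)*)\]'
    for body in re.findall(pattern, main_text):
        for part in body.split(','):
            bounds = [int(x) for x in part.split('-')]
            cited.update(range(min(bounds), max(bounds) + 1))
    return cited

def cross_check(content):
    """Report duplicate labels, citations without an entry, and uncited entries."""
    main_text, ref_section = split_document(content)
    refs = reference_numbers(ref_section)
    cited = cited_numbers(main_text)
    return {
        'duplicates': sorted(n for n, c in Counter(refs).items() if c > 1),
        'orphans': sorted(cited - set(refs)),
        'unused': sorted(set(refs) - cited),
    }
```

Splitting the document first keeps the `[n]` labels of the reference list from being counted as citations.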
FILE:scripts/reference_formatter.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
参考文献格式规范化脚本
Reference Format Normalization Script
功能:
1. 检查参考文献引用标注格式
2. 检查参考文献列表格式
3. 生成格式检查报告
4. 根据配置文件执行格式规范化
使用方法:
python reference_formatter.py <input_file> [config_file]
"""
import re
import sys
import yaml
import os
from datetime import datetime
from typing import List, Dict, Tuple, Optional
class ReferenceFormatter:
"""参考文献格式规范化器"""
def __init__(self, config_path: str = "ref_format_default.yml"):
"""初始化格式化器,加载配置文件"""
self.config = self._load_config(config_path)
self.issues = []
self.content = ""
self.references = {}
def _load_config(self, config_path: str) -> Dict:
"""加载配置文件"""
if not os.path.exists(config_path):
print(f"[警告] 配置文件不存在: {config_path}, 使用默认配置")
return self._get_default_config()
with open(config_path, 'r', encoding='utf-8') as f:
return yaml.safe_load(f)
def _get_default_config(self) -> Dict:
"""获取默认配置"""
return {
'citation': {
'style': 'superscript',
'position_rules': {
'before_punctuation': True,
'after_quote_in_sentence': True,
'before_period': True
}
},
'same_source_format': '[{ref}:{pages}]',
'figure_table': {
'move_to_caption': True,
'inline_notes_format': 'superscript_number'
},
'reference_list': {
'sort_order': 'order_of_appearance',
'numbering': {
'format': '[{num}]',
'use_brackets': True
},
'authors': {
'max_authors': 3,
'beyond_limit_suffix': '等',
'separator': ', ',
'last_separator': ', '
}
},
'format_standard': 'chinese_academic',
'quality_check': {
'check_continuity': True,
'check_duplicates': True,
'check_invalid_refs': True,
'check_position': True
},
'processing': {
'mode': 'semi_auto',
'generate_report': True,
'backup_original': True
}
}
def load_content(self, file_path: str) -> None:
"""加载文档内容"""
with open(file_path, 'r', encoding='utf-8') as f:
self.content = f.read()
print(f"[加载] 文档: {file_path}")
def extract_references(self) -> Dict[int, str]:
"""提取参考文献列表"""
refs = {}
# 匹配参考文献格式: [数字] 文献内容
ref_pattern = r'^\[(\d+)\]\s+(.+)$'
ref_section = re.search(r'## 参考文献\n([\s\S]+)$', self.content)
if ref_section:
ref_list = ref_section.group(1)
for line in ref_list.split('\n'):
line = line.strip()
match = re.match(ref_pattern, line)
if match:
num = int(match.group(1))
content = match.group(2)
refs[num] = content
self.references = refs
print(f"[提取] 参考文献 {len(refs)} 篇")
return refs
def check_citation_positions(self) -> List[Dict]:
"""检查引用标注位置"""
issues = []
citation_pattern = r'\[(\d+)(?::(\d+-\d+))?\]'
# 检查标点符号之前的引用
if self.config['citation']['position_rules']['before_punctuation']:
# 查找标点符号后的引用(不符合规范)
bad_patterns = [
(r'[,;]\s*\[(\d+)\]', '逗号或分号之后的引用应在其之前'),
(r'[,:;]\s*"[^"]*"\s*\[(\d+)\]', '引文后的引用应在标点之前')
]
for pattern, message in bad_patterns:
matches = re.finditer(pattern, self.content)
for match in matches:
issues.append({
'type': 'position',
'location': match.start(),
'message': message,
'content': match.group(0)
})
return issues
def check_citation_continuity(self) -> List[Dict]:
"""检查引用编号连续性"""
issues = []
# 提取正文中的所有引用编号
citations = re.findall(r'\[(\d+)(?::\d+-\d+)?\]', self.content)
citation_nums = [int(num) for num in citations]
if not citation_nums:
return issues
# 检查是否有超出范围的引用
max_ref = max(self.references.keys()) if self.references else 0
for num in citation_nums:
if num > max_ref:
issues.append({
'type': 'invalid_ref',
'number': num,
'message': f'引用编号 [{num}] 超出参考文献范围 (1-{max_ref})'
})
# 检查连续性(可选)
if self.config['quality_check']['check_continuity']:
sorted_nums = sorted(set(citation_nums))
if sorted_nums:
expected = list(range(1, max(sorted_nums) + 1))
missing = set(expected) - set(sorted_nums)
if missing:
issues.append({
'type': 'missing_ref',
'numbers': sorted(list(missing)),
'message': f'引用编号不连续,缺失: {sorted(list(missing))}'
})
return issues
def check_duplicate_references(self) -> List[Dict]:
"""检查重复的参考文献编号"""
issues = []
# 检查参考文献列表中的重复编号
ref_nums = list(self.references.keys())
duplicates = [num for num in ref_nums if ref_nums.count(num) > 1]
for dup in set(duplicates):
issues.append({
'type': 'duplicate_ref',
'number': dup,
'message': f'参考文献编号 [{dup}] 重复'
})
return issues
def check_author_format(self) -> List[Dict]:
"""检查作者列表格式"""
issues = []
max_authors = self.config['reference_list']['authors']['max_authors']
suffix = self.config['reference_list']['authors']['beyond_limit_suffix']
for num, content in self.references.items():
# 检查作者数量
author_match = re.match(r'^(.+?)\.', content)
if author_match:
authors_str = author_match.group(1)
authors = [a.strip() for a in authors_str.split(',')]
# 如果作者数量超过限制,检查是否有"等"
if len(authors) > max_authors and suffix not in content:
issues.append({
'type': 'author_format',
'reference': num,
'message': f'文献 [{num}] 作者数量({len(authors)})超过限制({max_authors}),应添加"{suffix}"'
})
return issues
def check_figure_table_citations(self) -> List[Dict]:
"""检查图表中的引用标注"""
issues = []
# 检查图表标题中的引用(应该在标题中,不在图表内)
if self.config['figure_table']['move_to_caption']:
# 查找图表块
figure_pattern = r'!\[([^\]]*\[?\d+\]?\][^\]]*)\]\([^\)]+\)'
matches = re.finditer(figure_pattern, self.content)
for match in matches:
caption = match.group(1)
# 如果图表标题中没有引用但图表描述中有,则报告问题
if '[' not in caption:
issues.append({
'type': 'figure_citation',
'location': match.start(),
'message': '图表引用标注应在图名/表名中,不在图表内',
'content': match.group(0)[:50] + '...'
})
return issues
def run_checks(self) -> List[Dict]:
"""运行所有检查"""
print("\n[开始] 格式检查...")
# 检查引用位置
if self.config['quality_check']['check_position']:
print(" [检查] 引用标注位置...")
self.issues.extend(self.check_citation_positions())
# 检查引用连续性
if self.config['quality_check']['check_continuity']:
print(" [检查] 引用编号连续性...")
self.issues.extend(self.check_citation_continuity())
# 检查重复编号
if self.config['quality_check']['check_duplicates']:
print(" [检查] 重复编号...")
self.issues.extend(self.check_duplicate_references())
# 检查作者格式
print(" [检查] 作者列表格式...")
self.issues.extend(self.check_author_format())
# 检查图表引用
print(" [检查] 图表引用...")
self.issues.extend(self.check_figure_table_citations())
print(f"\n[完成] 发现 {len(self.issues)} 个问题\n")
return self.issues
def generate_report(self) -> str:
"""生成格式检查报告"""
report = []
report.append("# 参考文献格式检查报告\n")
report.append(f"生成时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
# 统计问题类型
issue_types = {}
for issue in self.issues:
issue_type = issue['type']
issue_types[issue_type] = issue_types.get(issue_type, 0) + 1
report.append("## 问题统计\n")
for issue_type, count in issue_types.items():
type_names = {
'position': '引用位置',
'invalid_ref': '无效引用',
'missing_ref': '缺失引用',
'duplicate_ref': '重复编号',
'author_format': '作者格式',
'figure_citation': '图表引用'
}
name = type_names.get(issue_type, issue_type)
report.append(f"- {name}: {count} 个\n")
report.append("\n## 详细问题列表\n")
# 按类型分组显示问题
grouped_issues = {}
for issue in self.issues:
issue_type = issue['type']
if issue_type not in grouped_issues:
grouped_issues[issue_type] = []
grouped_issues[issue_type].append(issue)
type_order = ['position', 'invalid_ref', 'missing_ref', 'duplicate_ref',
'author_format', 'figure_citation']
for issue_type in type_order:
if issue_type in grouped_issues:
type_names = {
'position': '### 引用位置问题',
'invalid_ref': '### 无效引用',
'missing_ref': '### 缺失引用',
'duplicate_ref': '### 重复编号',
'author_format': '### 作者格式问题',
'figure_citation': '### 图表引用问题'
}
report.append(f"{type_names.get(issue_type, f'### {issue_type}')}\n\n")
for i, issue in enumerate(grouped_issues[issue_type], 1):
report.append(f"{i}. {issue['message']}\n")
if 'content' in issue:
report.append(f" - 内容: {issue['content']}\n")
if 'number' in issue:
report.append(f" - 编号: [{issue['number']}]\n")
if 'reference' in issue:
report.append(f" - 参考文献: [{issue['reference']}]\n")
report.append("\n")
# 检查结果总结
if len(self.issues) == 0:
report.append("\n## 检查结果\n")
report.append("✅ 未发现格式问题,文档符合规范!\n")
else:
report.append("\n## 检查结果\n")
report.append(f"⚠️ 发现 {len(self.issues)} 个问题,建议进行修正。\n")
return ''.join(report)
    def fix_citation_positions(self) -> str:
        """Move citation markers that follow punctuation to before it."""
        content = self.content
        fixed_count = 0
        rules = self.config['citation']['position_rules']
        if rules['before_punctuation']:
            # ", [1]" -> "[1]," and "; [1]" -> "[1];"
            # subn returns the substitution count, so the count is taken at
            # replacement time instead of re-scanning the already fixed text.
            content, n = re.subn(r'([,;])\s*\[(\d+)\]', r'[\2]\1', content)
            fixed_count += n
        if rules['after_quote_in_sentence']:
            # Drop whitespace between a quoted sentence-final period and its
            # citation, keeping the original citation number (\2).
            content, n = re.subn(r'"。"\s+\[(\d+)\]', r'"。"[\1]', content)
            fixed_count += n
        self.content = content
        print(f"[修复] 引用位置: {fixed_count} 处")
        return self.content
def format_reference_authors(self) -> Dict[int, str]:
"""格式化参考文献作者列表"""
formatted_refs = {}
max_authors = self.config['reference_list']['authors']['max_authors']
suffix = self.config['reference_list']['authors']['beyond_limit_suffix']
for num, content in self.references.items():
author_match = re.match(r'^(.+?)\.', content)
if author_match:
authors_str = author_match.group(1)
rest = content[author_match.end():]
authors = [a.strip() for a in authors_str.split(',')]
# 如果作者数量超过限制,截取前N个并添加后缀
if len(authors) > max_authors:
authors = authors[:max_authors]
new_authors_str = ', '.join(authors) + suffix + '.'
else:
new_authors_str = authors_str + '.'
formatted_refs[num] = new_authors_str + rest
else:
formatted_refs[num] = content
self.references = formatted_refs
print(f"[格式化] 作者列表: {len(formatted_refs)} 篇")
return formatted_refs
def apply_fixes(self, confirmed_issues: List[Dict]) -> None:
"""应用确认的修复"""
print("\n[开始] 应用修复...")
fix_count = 0
for issue in confirmed_issues:
if issue['type'] == 'position':
# 修复引用位置
self.fix_citation_positions()
fix_count += 1
elif issue['type'] == 'author_format':
# 修复作者格式
self.format_reference_authors()
fix_count += 1
print(f"[完成] 已修复 {fix_count} 类问题\n")
def save_content(self, output_path: str) -> None:
"""保存处理后的内容"""
with open(output_path, 'w', encoding='utf-8') as f:
f.write(self.content)
print(f"[保存] 输出文件: {output_path}")
def process(self, input_file: str, output_file: Optional[str] = None) -> None:
"""处理主流程"""
print("="*60)
print("参考文献格式规范化工具")
print("="*60)
# 加载内容
self.load_content(input_file)
# 提取参考文献
self.extract_references()
# 运行检查
issues = self.run_checks()
# 生成报告
report = self.generate_report()
if output_file:
report_path = output_file.rsplit('.', 1)[0] + '_format_report.md'
else:
report_path = input_file.rsplit('.', 1)[0] + '_format_report.md'
with open(report_path, 'w', encoding='utf-8') as f:
f.write(report)
print(f"[报告] 已生成: {report_path}\n")
# 半自动模式
if self.config['processing']['mode'] == 'semi_auto':
if len(issues) > 0:
print("="*60)
print("请查看格式检查报告,确认是否需要修复\n")
print(f"报告文件: {report_path}")
print("="*60)
else:
print("✅ 文档格式检查通过!\n")
def main():
"""主函数"""
if len(sys.argv) < 2:
print("使用方法:")
print(" python reference_formatter.py <input_file> [config_file]")
print("\n示例:")
print(" python reference_formatter.py review.md")
print(" python reference_formatter.py review.md custom_config.yml")
sys.exit(1)
input_file = sys.argv[1]
config_file = sys.argv[2] if len(sys.argv) > 2 else "ref_format_default.yml"
if not os.path.exists(input_file):
print(f"[错误] 输入文件不存在: {input_file}")
sys.exit(1)
# 创建格式化器并处理
formatter = ReferenceFormatter(config_file)
formatter.process(input_file)
if __name__ == '__main__':
main()
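The position fix (citation marker before the comma or semicolon, not after it) can be sketched with `re.subn`, which returns the rewritten text together with the number of substitutions. This is a minimal illustration: `move_citations_before_punctuation` is a hypothetical name, and accepting full-width punctuation and grouped markers is an assumption beyond what the formatter handles.

```python
import re

# A comma or semicolon (ASCII or full-width), optional spaces, then a citation marker.
PUNCT_THEN_CITATION = re.compile(r'([,;,;])\s*(\[\d+(?:[,-]\d+)*\])')

def move_citations_before_punctuation(text):
    """Return (new_text, number_of_moves)."""
    # \2 is the citation marker, \1 the punctuation mark: swap their order.
    return PUNCT_THEN_CITATION.subn(r'\2\1', text)
```

Taking the count from `subn` avoids re-scanning the text after it has been changed, at which point the pattern no longer matches the fixed spots.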
FILE:scripts/simple_verify.py
# -*- coding: utf-8 -*-
"""
Simplified verification script: checks only the most critical problems.
Usage: python simple_verify.py <your_review.md>
"""
import re
import sys
if len(sys.argv) < 2:
    print("用法: python simple_verify.py <your_review.md>")
    sys.exit(1)
with open(sys.argv[1], 'r', encoding='utf-8') as f:
    content = f.read()
print("=" * 80)
print("综述验证报告(简化版)")
print("=" * 80)
# Split the body from the reference list. If the heading is missing, treat
# the whole file as body instead of slicing with find()'s -1 result.
REF_HEADING = '\n\n## 参考文献'
ref_start = content.find(REF_HEADING)
if ref_start == -1:
    main_text, refs_text = content, ''
else:
    main_text = content[:ref_start]
    refs_text = content[ref_start + len(REF_HEADING):]
# Collect the reference entries first, so the citation check can use the
# actual reference numbers instead of a hard-coded count.
valid_refs = [line.strip() for line in refs_text.split('\n')
              if line.strip().startswith('[')]
ref_nums = []
for ref in valid_refs:
    match = re.match(r'\[(\d+)\]', ref)
    if match:
        ref_nums.append(int(match.group(1)))
max_ref = max(ref_nums) if ref_nums else 0
# 1. Check citations in the body
print("\n[1] 正文引用检查")
citation_nums = [int(c) for c in re.findall(r'\[(\d+)\]', main_text)]
if citation_nums:
    print(f"正文引用总数: {len(citation_nums)}")
    print(f"引用编号范围: {min(citation_nums)} - {max(citation_nums)}")
    # Citations beyond the last reference number
    invalid = [n for n in citation_nums if n > max_ref]
    if invalid:
        print(f"警告: 发现超范围引用 {len(set(invalid))} 个: {sorted(set(invalid))}")
    else:
        print("OK: 所有引用编号都在范围内")
    # References that are never cited
    all_cited = set(citation_nums)
    unused = [n for n in sorted(set(ref_nums)) if n not in all_cited]
    if unused:
        print(f"未引用的文献: {len(unused)} 篇")
    else:
        print("OK: 所有文献都有被引用")
# 2. Check the reference list
print("\n[2] 参考文献列表检查")
print(f"参考文献数量: {len(valid_refs)}")
if ref_nums:
    print(f"参考文献编号范围: {min(ref_nums)} - {max(ref_nums)}")
    # Numbers must be unique and form an unbroken run
    if len(set(ref_nums)) == len(ref_nums) and max(ref_nums) - min(ref_nums) + 1 == len(ref_nums):
        print("OK: 参考文献编号连续且无重复")
    elif len(set(ref_nums)) != len(ref_nums):
        print("警告: 参考文献编号有重复")
    else:
        print("警告: 参考文献编号不连续")
# 3. Check the heading structure
print("\n[3] 章节结构检查")
h1 = re.findall(r'^# (.+)$', content, re.MULTILINE)
h2 = re.findall(r'^## (.+)$', content, re.MULTILINE)
print(f"一级标题: {len(h1)} 个")
print(f"二级标题: {len(h2)} 个")
# Numbered sections (一、 through 八、)
numbered = re.findall(r'^## (一|二|三|四|五|六|七|八)、', content, re.MULTILINE)
print(f"编号章节: {len(numbered)} 个")
print("\n" + "=" * 80)
print("验证完成")
print("=" * 80)
FILE:scripts/typo_grammar_checker.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
文字错别字和语法错误检查脚本
Typo and Grammar Checker Script
功能:
1. 检查常见错别字
2. 检查语法错误
3. 生成检查报告
使用方法:
python typo_grammar_checker.py <input_file>
"""
import re
import sys
import os
from datetime import datetime
from typing import List, Dict, Tuple, Optional
class TypoGrammarChecker:
"""错别字和语法检查器"""
def __init__(self):
"""初始化检查器"""
self.content = ""
self.issues = []
# 常见错别字词典
self.typo_dict = self._load_typo_dict()
# 语法规则
self.grammar_rules = self._load_grammar_rules()
def _load_typo_dict(self) -> Dict[str, str]:
"""加载常见错别字词典"""
return {
# 常见错别字
'做为': '作为',
'做用': '作用',
'成份': '成分',
'其它': '其他',
'唯一': '唯一',
'帐号': '账号',
'帐户': '账户',
'登陆': '登录',
'粘稠': '黏稠',
'成像': '成像',
'必需': '必须',
'报导': '报道',
'反应': '反应', # 根据上下文判断
'截止': '截至',
'包含': '包含',
'包含': '包括',
'以致于': '以至于',
'以至': '以致',
'以至': '甚至',
'以至': '以至于',
'必须': '必须',
'必须': '必需',
'反应': '反映',
'包含': '包括',
'截止': '截至',
'帐号': '账号',
'登陆': '登录',
'帐户': '账户',
# 学术论文常见错误
'研究显示': '研究显示',
'结果表明': '结果表明',
'结果提示': '结果提示',
'研究指出': '研究指出',
'研究发现': '研究发现',
'数据显示': '数据显示',
'统计显示': '统计显示',
# 英文拼写错误
'analys': 'analyze',
'analyse': 'analyze',
'analyses': 'analyzes',
'studys': 'studies',
'researchs': 'researches',
'datas': 'data',
'methodolog': 'methodology',
'referenc': 'reference',
'conclusion': 'conclusion',
'discusion': 'discussion',
'reslut': 'result',
'resluts': 'results',
'mircroplastic': 'microplastic',
'mircroplastics': 'microplastics',
'microplastic': 'microplastics', # 复数形式
'microbe': 'microbe',
'microbiome': 'microbiome',
'microbiota': 'microbiota',
}
def _load_grammar_rules(self) -> List[Tuple[str, str]]:
"""加载语法规则"""
return [
# 标点符号问题
(r'\.{3,}', '发现连续句号,建议使用省略号'),
(r'\s{3,}', '发现连续空格'),
# 括号不匹配
(r'\([^)]*$', '左括号缺少对应的右括号(在同一行)'),
(r'\[[^\]]*$', '左方括号缺少对应的右方括号'),
# 引号问题
(r'["\"][^"\"]*$', '左引号缺少对应的右引号'),
# 中文和英文之间缺少空格(建议,不是强制)
# (r'[\u4e00-\u9fff][a-zA-Z]', '中文和英文之间建议添加空格'),
# (r'[a-zA-Z][\u4e00-\u9fff]', '英文和中文之间建议添加空格'),
# 常见语法错误
(r'因为。所以', '"因为...所以"之间不应使用句号'),
(r'虽然。但是', '"虽然...但是"之间不应使用句号'),
(r'不仅。而且', '"不仅...而且"之间不应使用句号'),
# 重复词语
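# No \b here: CJK characters count as word characters in Python's re,
# so \b never matches between two of them in running Chinese text.
(r'的的', '发现重复的"的"'),
(r'了了', '发现重复的"了"'),
(r'是是', '发现重复的"是"'),
(r'在在', '发现重复的"在"'),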
# 数字格式问题
(r'[一二三四五六七八九十]+万', '建议使用阿拉伯数字表示数量'),
]
def load_content(self, file_path: str) -> None:
"""加载文档内容"""
with open(file_path, 'r', encoding='utf-8') as f:
self.content = f.read()
print(f"[加载] 文档: {file_path}")
print(f"[信息] 文档长度: {len(self.content)} 字符")
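    def check_typos(self) -> List[Dict]:
        """Check for the common typos listed in the typo dictionary."""
        issues = []
        print(" [检查] 错别字...")
        for wrong, correct in self.typo_dict.items():
            # Entries that map a word to itself are not typos; skip them so
            # that correct text is not reported.
            if wrong == correct:
                continue
            pattern = re.escape(wrong)
            if wrong.isascii():
                # Match English entries as whole words only, so that e.g.
                # "analys" is not reported inside "analysis".
                pattern = rf'(?<![A-Za-z]){pattern}(?![A-Za-z])'
            for match in re.finditer(pattern, self.content):
                issues.append({
                    'type': 'typo',
                    'location': match.start(),
                    'message': f'发现疑似错别字: "{wrong}" → "{correct}"',
                    'content': self._get_context(match.start(), 30),
                    'suggestion': correct
                })
        print(f" [完成] 发现 {len(issues)} 个疑似错别字")
        return issues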
def check_grammar(self) -> List[Dict]:
"""检查语法错误"""
issues = []
print(" [检查] 语法错误...")
# 检查常见语法问题
for pattern, message in self.grammar_rules:
matches = re.finditer(pattern, self.content, re.MULTILINE)
for match in matches:
issues.append({
'type': 'grammar',
'location': match.start(),
'message': message,
'content': self._get_context(match.start(), 30)
})
print(f" [完成] 发现 {len(issues)} 个语法问题")
return issues
def check_punctuation(self) -> List[Dict]:
"""检查标点符号使用"""
issues = []
print(" [检查] 标点符号...")
# 检查标点符号使用规范
# 1. 检查混用中英文标点
patterns = [
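# Digits are excluded so decimals (3.5), thousands separators (1,000)
# and citation groups ([34,36]) are not reported.
(r'[a-zA-Z],[a-zA-Z]', '英文逗号后应加空格'),
(r'[a-zA-Z]\.[a-zA-Z]', '英文句号后应加空格'),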
(r'[\u4e00-\u9fff],[a-zA-Z0-9]', '中文后使用英文逗号,建议使用中文逗号'),
(r'[a-zA-Z0-9],[\u4e00-\u9fff]', '中文前使用英文逗号,建议使用中文逗号'),
]
for pattern, message in patterns:
matches = re.finditer(pattern, self.content)
for match in matches:
issues.append({
'type': 'punctuation',
'location': match.start(),
'message': message,
'content': self._get_context(match.start(), 20)
})
print(f" [完成] 发现 {len(issues)} 个标点符号问题")
return issues
    def check_number_format(self) -> List[Dict]:
        """Check number formatting."""
        issues = []
        print(" [检查] 数字格式...")
        # Dates written as YYYY年M月D日 are already in the accepted form, so
        # nothing is reported for them.
        # Flag a number written directly against its unit (e.g. "15kg").
        # Longer units come first so "5min" and "5mm" are not read as "5m",
        # and the lookahead keeps words such as "3samples" from matching "3s".
        number_unit_pattern = (
            r'(\d+(?:\.\d+)?)'
            r'(cm³|min|m³|kg|mg|ug|ml|cm|mm|km|g|l|m|h|s)(?![A-Za-z])'
        )
        for match in re.finditer(number_unit_pattern, self.content, re.IGNORECASE):
            issues.append({
                'type': 'number_format',
                'location': match.start(),
                'message': '数字与单位之间建议添加空格',
                'content': match.group(0),
                'suggestion': f"{match.group(1)} {match.group(2)}"
            })
        print(f" [完成] 发现 {len(issues)} 个数字格式问题")
        return issues
def _get_context(self, position: int, length: int = 30) -> str:
"""获取文本上下文"""
start = max(0, position - length)
end = min(len(self.content), position + length)
return self.content[start:end]
def run_all_checks(self) -> List[Dict]:
"""运行所有检查"""
print("\n[开始] 文字错别字和语法检查...")
print("="*60)
# 1. 检查错别字
self.issues.extend(self.check_typos())
# 2. 检查语法错误
self.issues.extend(self.check_grammar())
# 3. 检查标点符号
self.issues.extend(self.check_punctuation())
# 4. 检查数字格式
self.issues.extend(self.check_number_format())
print("="*60)
print(f"\n[完成] 共发现 {len(self.issues)} 个问题\n")
return self.issues
    def generate_report(self) -> str:
        """Build the Markdown check report."""
        report = []
        report.append("# 文字错别字和语法检查报告\n")
        report.append(f"生成时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")

        # Group the issues by type once; both the counts and the detail list use this
        grouped_issues = {}
        for issue in self.issues:
            grouped_issues.setdefault(issue['type'], []).append(issue)

        report.append("## 问题统计\n")
        type_names = {
            'typo': '错别字',
            'grammar': '语法错误',
            'punctuation': '标点符号',
            'number_format': '数字格式'
        }
        for issue_type, group in grouped_issues.items():
            name = type_names.get(issue_type, issue_type)
            report.append(f"- {name}: {len(group)} 个\n")

        report.append("\n## 详细问题列表\n")
        # Section headings, kept separate from type_names so neither dict shadows the other
        section_titles = {
            'typo': '### 错别字问题',
            'grammar': '### 语法错误',
            'punctuation': '### 标点符号问题',
            'number_format': '### 数字格式问题'
        }
        type_order = ['typo', 'grammar', 'punctuation', 'number_format']
        for issue_type in type_order:
            if issue_type not in grouped_issues:
                continue
            report.append(f"{section_titles[issue_type]}\n\n")
            for i, issue in enumerate(grouped_issues[issue_type], 1):
                report.append(f"{i}. {issue['message']}\n")
                if 'content' in issue:
                    report.append(f"   - 上下文: ...{issue['content']}...\n")
                if 'suggestion' in issue:
                    report.append(f"   - 建议: {issue['suggestion']}\n")
                if 'location' in issue:
                    # Convert the character offset into a 1-based line number
                    line_num = self.content[:issue['location']].count('\n') + 1
                    report.append(f"   - 位置: 第 {line_num} 行\n")
                report.append("\n")

        # Summary: decided by the full issue list, so issues of an unlisted type still count
        report.append("\n## 检查结果\n")
        if not self.issues:
            report.append("✅ 未发现明显的错别字和语法错误,文档质量良好!\n")
        else:
            report.append(f"⚠️ 发现 {len(self.issues)} 个问题,建议进行修正。\n")
        return ''.join(report)
    def save_report(self, report_path: str) -> None:
        """Write the report to `report_path`."""
        report = self.generate_report()
        with open(report_path, 'w', encoding='utf-8') as f:
            f.write(report)
        print(f"[报告] 已生成: {report_path}\n")

    def process(self, input_file: str, output_report: Optional[str] = None) -> None:
        """Main flow: load the file, run the checks, save the report."""
        print("="*60)
        print("文字错别字和语法检查工具")
        print("="*60)
        # Load the content
        self.load_content(input_file)
        # Run all checks
        issues = self.run_all_checks()
        # Generate the report
        if output_report is None:
            # splitext strips only a real file extension, so a dot in a
            # directory name (e.g. "../review") does not truncate the path
            output_report = os.path.splitext(input_file)[0] + '_typo_grammar_report.md'
        self.save_report(output_report)
        # Print the summary
        print("="*60)
        print(f"检查完成! 共发现 {len(issues)} 个问题")
        print(f"报告文件: {output_report}")
        print("="*60)


def main():
    """Command-line entry point."""
    if len(sys.argv) < 2:
        print("使用方法:")
        print("  python typo_grammar_checker.py <input_file>")
        print("\n示例:")
        print("  python typo_grammar_checker.py review.md")
        sys.exit(1)
    input_file = sys.argv[1]
    if not os.path.exists(input_file):
        print(f"[错误] 输入文件不存在: {input_file}")
        sys.exit(1)
    # Create the checker and process the file
    checker = TypoGrammarChecker()
    checker.process(input_file)


if __name__ == '__main__':
    main()
FILE:scripts/unified_checker.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Unified Checker Script - runs every check in a single pass

Features:
1. Reference format check
2. Citation accuracy check
3. Typo and grammar check
4. Document format check
5. Unified report generation

Usage:
    python unified_checker.py <input_file>
"""

import re
import sys
import os
from datetime import datetime
from typing import List, Dict, Tuple, Set


class UnifiedChecker:
    """Unified checker that combines all of the individual checks."""

    def __init__(self):
        """Initialize the checker state."""
        self.content = ""
        self.file_path = ""
        self.issues = {
            'reference_format': [],
            'citation_accuracy': [],
            'typo_grammar': [],
            'document_format': []
        }
        self.references = {}
        self.citations = []

    def load_content(self, file_path: str) -> None:
        """Load the document text."""
        self.file_path = file_path
        with open(file_path, 'r', encoding='utf-8') as f:
            self.content = f.read()
        print(f"[加载] 文档: {file_path}")
        print(f"[统计] 字符数: {len(self.content)}\n")
    def check_reference_format(self) -> List[Dict]:
        """Check the format of the reference list."""
        print("=" * 80)
        print("【1】参考文献格式检查")
        print("=" * 80)
        issues = []

        # Extract the reference list (everything after the "## 参考文献" heading)
        ref_pattern = r'^\[(\d+)\]\s+(.+)$'
        ref_section = re.search(r'## 参考文献\n([\s\S]+)$', self.content)
        if not ref_section:
            issues.append({
                'type': 'reference_format',
                'severity': 'critical',
                'message': '未找到参考文献部分',
                'location': '文档末尾'
            })
            # Store the issue before returning so it still reaches the report
            self.issues['reference_format'] = issues
            print(f"[完成] 发现 {len(issues)} 个格式问题\n")
            return issues

        ref_list = ref_section.group(1)
        for line in ref_list.split('\n'):
            line = line.strip()
            match = re.match(ref_pattern, line)
            if match:
                num = int(match.group(1))
                ref_content = match.group(2)
                self.references[num] = ref_content
                # Author format: the author list is the text before the first ". ".
                # A full-width comma there means the authors are not separated
                # by the ASCII comma that the message below asks for.
                if '. ' in ref_content:
                    authors_part = ref_content.split('. ')[0]
                    if '，' in authors_part:
                        issues.append({
                            'type': 'reference_format',
                            'severity': 'warning',
                            'message': f'文献 [{num}] 作者分隔符应使用英文逗号","',
                            'location': f'[{num}]',
                            'content': ref_content[:80]
                        })

        # Numbering continuity
        if self.references:
            max_num = max(self.references.keys())
            missing_nums = [i for i in range(1, max_num + 1) if i not in self.references]
            if missing_nums:
                issues.append({
                    'type': 'reference_format',
                    'severity': 'critical',
                    'message': f'参考文献编号不连续,缺失: {missing_nums}',
                    'location': '参考文献列表'
                })

        # Duplicate numbers
        num_count = {}
        for line in ref_list.split('\n'):
            match = re.match(r'^\[(\d+)\]', line.strip())
            if match:
                num = int(match.group(1))
                num_count[num] = num_count.get(num, 0) + 1
        duplicates = {k: v for k, v in num_count.items() if v > 1}
        if duplicates:
            issues.append({
                'type': 'reference_format',
                'severity': 'critical',
                'message': f'发现重复编号: {duplicates}',
                'location': '参考文献列表'
            })

        self.issues['reference_format'] = issues
        print(f"[完成] 发现 {len(issues)} 个格式问题\n")
        return issues
    def check_citation_accuracy(self) -> List[Dict]:
        """Check that in-text citations and the reference list agree."""
        print("=" * 80)
        print("【2】引用准确性检查")
        print("=" * 80)
        issues = []

        # Collect citations from the body only (everything before the reference list)
        main_content = self.content.split('## 参考文献')[0]
        # Matches [1] as well as grouped markers such as [2,3] and [2-5]
        citation_pattern = r'\[(\d+(?:\s*[,，\-]\s*\d+)*)\]'
        citations = []
        for group in re.findall(citation_pattern, main_content):
            for part in re.split(r'\s*[,，]\s*', group):
                bounds = [int(n) for n in part.split('-')]
                if len(bounds) == 2 and bounds[0] <= bounds[1]:
                    # Expand a range such as 2-5 into 2, 3, 4, 5
                    citations.extend(range(bounds[0], bounds[1] + 1))
                else:
                    citations.extend(bounds)
        self.citations = citations
        cited = set(citations)
        max_ref_num = max(self.references.keys()) if self.references else 0

        def preview(nums: List[int]) -> str:
            """Show at most ten numbers, with "..." only when the list was cut."""
            return f"{nums[:10]}{'...' if len(nums) > 10 else ''}"

        # Invalid citations: numbers with no matching entry in the reference list.
        # This also catches a citation that falls into a gap in the numbering.
        invalid_refs = sorted(num for num in cited if num not in self.references)
        if invalid_refs:
            issues.append({
                'type': 'citation_accuracy',
                'severity': 'critical',
                'message': f'发现 {len(invalid_refs)} 个无对应文献的引用: {invalid_refs}',
                'location': '正文'
            })

        # Unused references: listed but never cited
        if self.references:
            isolated_refs = sorted(num for num in self.references if num not in cited)
            if isolated_refs:
                issues.append({
                    'type': 'citation_accuracy',
                    'severity': 'warning',
                    'message': f'发现 {len(isolated_refs)} 个未被引用的文献: {preview(isolated_refs)}',
                    'location': '参考文献列表'
                })

        # Gaps in the sequence of cited numbers
        if citations:
            missing_in_citations = [
                i for i in range(1, max(cited) + 1)
                if i not in cited and i <= max_ref_num
            ]
            if missing_in_citations:
                issues.append({
                    'type': 'citation_accuracy',
                    'severity': 'info',
                    'message': f'正文中存在跳跃引用: {preview(missing_in_citations)}',
                    'location': '正文'
                })

        self.issues['citation_accuracy'] = issues
        print(f"[完成] 发现 {len(issues)} 个引用问题\n")
        return issues
    def check_typo_grammar(self) -> List[Dict]:
        """Check for repeated characters and unbalanced quotes or brackets."""
        print("=" * 80)
        print("【3】错别字和语法检查")
        print("=" * 80)
        issues = []

        # Common repeated-character typos. '的的的' is not listed separately:
        # it is already caught by '的的'. Punctuation is listed in both its
        # full-width and ASCII forms.
        typos = {
            '的的': '的',
            '是是': '是',
            '不不': '不',
            '在在': '在',
            '和和': '和',
            '，，': '，',
            '。。': '。',
            '？？': '？',
            '！！': '！',
            ',,': ',',
            '??': '?',
            '!!': '!',
        }
        for typo, correction in typos.items():
            if typo in self.content:
                count = self.content.count(typo)
                issues.append({
                    'type': 'typo_grammar',
                    'severity': 'warning',
                    'message': f'发现重复字符"{typo}"(应为"{correction}")',
                    'location': f'出现{count}次',
                    'count': count
                })

        # Unbalanced Chinese quotation marks and book-title marks.
        # The left and right marks are distinct characters, so their counts can differ.
        chinese_left_quotes = self.content.count('“')
        chinese_right_quotes = self.content.count('”')
        book_left_quotes = self.content.count('《')
        book_right_quotes = self.content.count('》')
        if chinese_left_quotes != chinese_right_quotes:
            issues.append({
                'type': 'typo_grammar',
                'severity': 'critical',
                'message': f'中文引号不匹配: 左{chinese_left_quotes}个, 右{chinese_right_quotes}个',
                'location': '全文'
            })
        if book_left_quotes != book_right_quotes:
            issues.append({
                'type': 'typo_grammar',
                'severity': 'critical',
                'message': f'书名号不匹配: 左{book_left_quotes}个, 右{book_right_quotes}个',
                'location': '全文'
            })

        # Unbalanced brackets (ASCII and full-width). Book-title marks are
        # already checked above, so they are not repeated here.
        brackets = {
            '(': ')',
            '[': ']',
            '{': '}',
            '（': '）',
            '【': '】',
        }
        for left, right in brackets.items():
            left_count = self.content.count(left)
            right_count = self.content.count(right)
            if left_count != right_count:
                issues.append({
                    'type': 'typo_grammar',
                    'severity': 'critical',
                    'message': f'括号{left}{right}不匹配: 左{left_count}个, 右{right_count}个',
                    'location': '全文'
                })

        self.issues['typo_grammar'] = issues
        print(f"[完成] 发现 {len(issues)} 个错别字/语法问题\n")
        return issues
    def check_document_format(self) -> List[Dict]:
        """Check the document structure."""
        print("=" * 80)
        print("【4】文档格式规范检查")
        print("=" * 80)
        issues = []
        lines = self.content.split('\n')

        # Title
        has_title = any(line.strip().startswith('# ') for line in lines)
        if not has_title:
            issues.append({
                'type': 'document_format',
                'severity': 'critical',
                'message': '文档缺少一级标题',
                'location': '文档开头'
            })

        # Abstract
        has_abstract = '## 摘要' in self.content or '### 摘要' in self.content
        if not has_abstract:
            issues.append({
                'type': 'document_format',
                'severity': 'warning',
                'message': '文档缺少摘要部分',
                'location': '文档开头'
            })

        # Reference section
        has_references = '## 参考文献' in self.content
        if not has_references:
            issues.append({
                'type': 'document_format',
                'severity': 'critical',
                'message': '文档缺少参考文献部分',
                'location': '文档末尾'
            })

        # Section numbering: only top-level "## N." headings are checked.
        # Sub-section numbers such as 2.1 or 2.1.1 are skipped; they are not
        # part of the 1, 2, 3... sequence, and float("2.1.1") would raise.
        sections = re.findall(r'^##\s+(\d+)\.?(?![\d.])', self.content, re.MULTILINE)
        section_nums = {int(s) for s in sections}
        if section_nums:
            missing_sections = [
                n for n in range(1, max(section_nums) + 1) if n not in section_nums
            ]
            if missing_sections:
                issues.append({
                    'type': 'document_format',
                    'severity': 'warning',
                    'message': f'章节编号可能不连续,缺失: {missing_sections}',
                    'location': '章节标题'
                })

        # Blank lines (the first and last lines are ignored)
        empty_lines = 0
        for i, line in enumerate(lines):
            if line.strip() == '' and i > 0 and i < len(lines) - 1:
                empty_lines += 1
        if empty_lines > len(lines) * 0.1:
            issues.append({
                'type': 'document_format',
                'severity': 'info',
                'message': f'空段落过多({empty_lines}个),可能影响可读性',
                'location': '全文'
            })

        self.issues['document_format'] = issues
        print(f"[完成] 发现 {len(issues)} 个格式问题\n")
        return issues
    def generate_report(self) -> str:
        """Build the unified Markdown report."""
        report = []
        report.append("# 统一检查报告\n")
        report.append(f"**检查时间**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        report.append(f"**检查文件**: {self.file_path}\n")
        report.append(f"**字符总数**: {len(self.content)}\n")
        report.append(f"**参考文献数**: {len(self.references)}\n")
        report.append(f"**引用总数**: {len(self.citations)}\n\n")

        # Issue statistics
        all_issues = [issue for group in self.issues.values() for issue in group]
        total_issues = len(all_issues)
        critical_issues = sum(1 for issue in all_issues if issue.get('severity') == 'critical')
        warning_issues = sum(1 for issue in all_issues if issue.get('severity') == 'warning')
        report.append("## 问题统计\n")
        report.append(f"- **总问题数**: {total_issues}")
        report.append(f"- **严重问题**: {critical_issues} (必须修复)")
        report.append(f"- **警告问题**: {warning_issues} (建议修复)\n\n")

        # Detailed issues, one section per check
        type_names = {
            'reference_format': '参考文献格式',
            'citation_accuracy': '引用准确性',
            'typo_grammar': '错别字和语法',
            'document_format': '文档格式'
        }
        severity_icons = {'critical': '🔴', 'warning': '🟡', 'info': '🔵'}
        for check_type, issues in self.issues.items():
            if not issues:
                continue
            type_name = type_names.get(check_type, check_type)
            report.append(f"## {type_name} ({len(issues)}个)\n")
            for i, issue in enumerate(issues, 1):
                severity = issue.get('severity', 'info')
                severity_icon = severity_icons.get(severity, '⚪')
                report.append(f"\n### {severity_icon} 问题 {i}\n")
                report.append(f"**类型**: {severity}\n")
                report.append(f"**位置**: {issue.get('location', '未知')}\n")
                report.append(f"**描述**: {issue.get('message', '')}\n")
                if 'content' in issue:
                    report.append(f"**内容**: `{issue['content']}`\n")
            report.append("\n---\n")

        # Recommendations, numbered from 1 whichever of them apply
        report.append("\n## 修复建议\n")
        step = 1
        if critical_issues > 0:
            report.append(f"{step}. **优先修复严重问题** 🔴\n")
            report.append("   - 这些问题可能导致文档无法正常使用\n")
            step += 1
        if warning_issues > 0:
            report.append(f"{step}. **建议修复警告问题** 🟡\n")
            report.append("   - 这些问题影响文档质量\n")
        if total_issues == 0:
            report.append("[OK] **恭喜!未发现任何问题!** 文档质量良好!\n")
        return '\n'.join(report)
    def save_report(self, output_path: str) -> None:
        """Write the report to `output_path`."""
        report = self.generate_report()
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(report)
        print(f"[保存] 报告已生成: {output_path}")

    def run_all_checks(self, input_file: str) -> None:
        """Load the file, run every check, and save the unified report."""
        print("\n" + "=" * 80)
        print("开始统一检查")
        print("=" * 80 + "\n")
        self.load_content(input_file)
        # Run all checks. The reference check runs first because the
        # citation check relies on self.references.
        self.check_reference_format()
        self.check_citation_accuracy()
        self.check_typo_grammar()
        self.check_document_format()
        # Generate the report; splitext strips only a real file extension
        output_file = os.path.splitext(input_file)[0] + '_unified_report.md'
        self.save_report(output_file)
        # Print the summary
        print("\n" + "=" * 80)
        print("检查完成!")
        print("=" * 80)
        print(f"总问题数: {sum(len(v) for v in self.issues.values())}")
        print(f"报告路径: {output_file}\n")


def main():
    """Command-line entry point."""
    if len(sys.argv) < 2:
        print("使用方法: python unified_checker.py <input_file>")
        print("示例: python unified_checker.py review.md")
        sys.exit(1)
    input_file = sys.argv[1]
    if not os.path.exists(input_file):
        print(f"错误: 文件不存在: {input_file}")
        sys.exit(1)
    checker = UnifiedChecker()
    checker.run_all_checks(input_file)


if __name__ == "__main__":
    main()
FILE:templates/ref_format_default.yml
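# Reference Format Configuration
# This file defines the formatting rules for in-text citations and the reference list

# In-text citation style
citation:
  # Citation style: superscript, or inline (the citation reads as part of the sentence)
  style: "superscript"

  # Placement rules
  position_rules:
    # Place the citation before commas, semicolons, and other punctuation
    before_punctuation: true
    # Quotation inside a sentence: the full stop stays inside the quotation marks
    # and the citation goes outside them
    # Example: "……。"[1]
    after_quote_in_sentence: true
    # Quotation at the end of a sentence: the citation goes before the full stop
    # Example: "……"[1]
    before_period: true

  # Citing several pages of the same source
  # No page range is marked; the superscript number alone points to the reference
  # Placeholder: {ref} = reference number
  same_source_format: "[{ref}]"  # uniform format: [1]

  # Citations inside figures and tables
  figure_table:
    # Move citation markers from inside a figure or table to its caption
    move_to_caption: true
    # Format of note markers inside figures and tables
    # options: "superscript_number"(①②), "parenthesis_number"(1)2)), "dagger"(†)
    inline_notes_format: "superscript_number"

# Reference list
reference_list:
  # Sort order: "order_of_appearance" or "alphabetical"
  sort_order: "order_of_appearance"

  # Numbering format
  numbering:
    format: "[{num}]"  # e.g. [1]
    use_brackets: true

  # Author list
  authors:
    # Maximum number of authors listed (a suffix is added beyond this)
    max_authors: 3
    # Suffix used when the limit is exceeded
    beyond_limit_suffix: "等"
    # Separator between authors
    separator: ", "
    # Separator before the last author
    last_separator: ", "

  # Format standard
  # Supported: "chinese_academic" (Chinese academic journals), "apa", "mla", "chicago"
  format_standard: "chinese_academic"

# Quality check rules
quality_check:
  # Check citation numbering continuity
  check_continuity: true
  # Check for duplicate numbers
  check_duplicates: true
  # Check for invalid citations (numbers outside the reference range)
  check_invalid_refs: true
  # Check that each citation sits next to the text it supports
  check_position: true

# Processing
processing:
  # Mode: "auto" (fully automatic) or "semi_auto" (requires user confirmation)
  mode: "semi_auto"
  # Generate a report
  generate_report: true
  # Back up the original file
  backup_original: true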
FILE:CHANGELOG.md
# Workwork Skill Changelog
## v3.0.0 (2026-03-19) - MAJOR UPDATE
### 🛡️ Literature Integrity Checker - CRITICAL NEW FEATURE
**All-new literature integrity verification system** - designed to catch reference and citation errors before submission
#### Core Features
**Comprehensive Reference Validation**
- Parses and validates each reference's completeness
- Checks for: authors, title, journal, year, volume, pages, DOI
- Identifies missing or incomplete bibliographic information
- Validates year format and reasonableness (1900-current year)
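As a rough illustration of the completeness and year checks listed above, the sketch below parses one reference line in the `Authors. Title. Journal, Year, Volume(Issue): Pages.` shape used by the sample files. The function name `check_reference` and the regular expression are assumptions made for this example; they are not taken from `literature_integrity_checker.py`.

```python
import re
from datetime import date

# Hypothetical pattern for "Authors. Title. Journal, Year, Volume(Issue): Pages."
REF_RE = re.compile(
    r'^(?P<authors>[^.]+)\.\s+'
    r'(?P<title>.+?)\.\s+'
    r'(?P<journal>[^,]+),\s*'
    r'(?P<year>\d{4}),\s*'
    r'(?P<volume>[^:]+):\s*'
    r'(?P<pages>.+?)\.?$'
)


def check_reference(text: str) -> list:
    """Return a list of problems found in one reference entry."""
    match = REF_RE.match(text.strip())
    if not match:
        # At least one of the expected fields is missing or malformed
        return ['unparseable: expected "Authors. Title. Journal, Year, Volume: Pages."']
    problems = []
    year = int(match.group('year'))
    current_year = date.today().year
    if not 1900 <= year <= current_year:
        problems.append(f'year {year} outside 1900-{current_year}')
    return problems
```

An entry that passes returns an empty list, so a caller can report only the entries whose list is non-empty.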
**Citation Consistency Verification**
- Ensures all citations in text have corresponding references
- Detects invalid citations (numbers exceeding reference range)
- Identifies orphan citations (cited but not in reference list)
- Finds unused references (in list but never cited)
**Reference Number Management**
- Validates sequential numbering (1, 2, 3...)
- Detects missing numbers in sequence
- Identifies duplicate reference numbers
- Reports gaps in citation numbering
**Duplicate Detection**
- Identifies duplicate references by title and first author
- Prevents accidental inclusion of same source multiple times
- Suggests merging or removing duplicates
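One way to detect duplicates by title and first author is sketched below. It assumes references are held as a `{number: text}` dictionary and that authors and title are the first two `". "`-separated parts of each entry; the helper names are hypothetical.

```python
import re
from collections import defaultdict


def _normalize(value: str) -> str:
    """Lower-case and collapse whitespace so small variations still match."""
    return re.sub(r'\s+', ' ', value).strip().lower()


def duplicate_key(ref: str) -> tuple:
    """Build a (first author, title) key from one reference entry."""
    parts = [p.strip() for p in ref.split('. ')]
    first_author = parts[0].split(',')[0]
    title = parts[1] if len(parts) > 1 else ''
    return _normalize(first_author), _normalize(title)


def find_duplicates(references: dict) -> list:
    """Return groups of reference numbers that share the same key."""
    groups = defaultdict(list)
    for num, text in sorted(references.items()):
        groups[duplicate_key(text)].append(num)
    return [nums for nums in groups.values() if len(nums) > 1]
```

Matching on a normalized key rather than the full text lets two entries with different journal abbreviations still be grouped together.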
**Citation Format Compliance**
- Checks citation position (should be before punctuation)
- Identifies consecutive citation formatting issues
- Validates bracket usage and spacing
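The position rule (citation before punctuation) can be approximated with a single regular expression, as in the sketch below. The pattern and the function name are assumptions for illustration; the sketch expects body text only, with the reference list already removed.

```python
import re

# A citation marker such as [1], [2,3] or [2-5] that directly follows punctuation.
# Only spaces and tabs may sit in between, so a marker at the start of a new
# line is not flagged.
MISPLACED_RE = re.compile(r'([，。；,.;])[ \t]*(\[\d+(?:[,\-]\d+)*\])')


def find_misplaced_citations(text: str) -> list:
    """Return (offset, marker, suggestion) for citations placed after punctuation."""
    results = []
    for m in MISPLACED_RE.finditer(text):
        punct, marker = m.group(1), m.group(2)
        # Suggested fix: move the marker in front of the punctuation mark
        results.append((m.start(), marker, f'{marker}{punct}'))
    return results
```

A marker that follows a closing quotation mark, as in `"……。"[1]`, is not flagged, because the character directly before it is the quotation mark rather than the punctuation.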
#### Critical Error Prevention
**Exit Code Behavior**
- Returns non-zero exit code on critical errors
- Prevents proceeding with document generation if issues found
- Forces user to fix problems before submission
- Integrates with CI/CD pipelines for automated checks
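The exit-code behavior described above can be sketched as follows. The sketch assumes each issue is a dictionary with a `severity` key, as in `unified_checker.py`; the function name `exit_code_for` and the specific value `1` are illustrative choices, since the changelog only states that the code is non-zero.

```python
import sys


def exit_code_for(issues: list) -> int:
    """Return 1 if any issue is critical, otherwise 0."""
    return 1 if any(issue.get('severity') == 'critical' for issue in issues) else 0


if __name__ == '__main__':
    # Hypothetical issue list; a real checker would collect these from its checks
    found = [{'severity': 'warning', 'message': 'incomplete pages'}]
    sys.exit(exit_code_for(found))
```

With this convention, a shell or CI step can chain the checker and the document generator with `&&`, so the Word document is only built when the check exits with 0.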
**Error Categories**
- 🔴 **Critical**: Invalid citations, missing references, numbering gaps
- 🟡 **Warning**: Incomplete information, formatting issues
- 🔵 **Info**: Unused references, style suggestions
#### Usage
```bash
# Basic check
python scripts/literature_integrity_checker.py your_review.md
# With deep verification (requires network)
python scripts/literature_integrity_checker.py your_review.md --deep-check
```
#### Output
Generates detailed markdown report with:
- Summary statistics
- Categorized issues list
- Reference-by-reference breakdown
- Fix recommendations
- Pre-submission checklist
### 📋 Updated Workflow
**New Recommended 9-Step Process:**
1. Write review
2. **Run Literature Integrity Check** ⭐ NEW CRITICAL STEP
3. Run unified check
4. Review reports
5. Fix issues
6. Filter references (optional)
7. Fix numbering
8. Final verification
9. Generate Word document
### 🔧 Improvements
- Enhanced error detection and reporting
- Stricter validation rules
- Better integration with existing tools
- Comprehensive documentation updates
---
## v2.3.0 (2026-03-19)
### ✨ New Features
#### Enhanced Superscript Word Generator (`create_word_with_superscript.js`)
**Smart Font Handling**
- Chinese characters use **SimSun (宋体)**
- English letters, numbers, and punctuation use **Times New Roman**
- Ensures proper academic formatting for mixed-language documents
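The mixed-font rule amounts to splitting each paragraph into runs and assigning a font per run. The Python sketch below shows the idea only; the generator itself is a JavaScript script, and the function name and the choice to treat every character outside the CJK Unified Ideographs block as Times New Roman are assumptions.

```python
import re

# Alternating runs of CJK ideographs and everything else
RUN_RE = re.compile(r'[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+')


def split_font_runs(text: str) -> list:
    """Split text into (font, run) pairs: SimSun for Chinese, Times New Roman otherwise."""
    runs = []
    for m in RUN_RE.finditer(text):
        run = m.group(0)
        # A run is homogeneous, so its first character decides the font
        font = 'SimSun' if '\u4e00' <= run[0] <= '\u9fff' else 'Times New Roman'
        runs.append((font, run))
    return runs
```

Each pair would then map to one text run in the generated document, with the font set on that run.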
**Auto-Italicization of Scientific Names**
- Automatically detects species Latin names (e.g., `Rhinopithecus bieti`, `Cetorhinus maximus`)
- Applies italic formatting to scientific names
- Pattern: Capitalized genus + lowercase species epithet
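The "capitalized genus + lowercase epithet" pattern can be written as a short regular expression. The Python sketch below is illustrative only (the generator is a JavaScript script, and the names here are hypothetical). A pattern this simple also matches ordinary English phrases such as "Deep learning", so a real implementation would likely need a stop list or a genus dictionary.

```python
import re

# Capitalized genus followed by a lowercase species epithet
BINOMIAL_RE = re.compile(r'\b[A-Z][a-z]+ [a-z]{2,}\b')


def split_italic_segments(text: str) -> list:
    """Split text into (segment, is_italic) pairs around candidate scientific names."""
    segments, pos = [], 0
    for m in BINOMIAL_RE.finditer(text):
        if m.start() > pos:
            segments.append((text[pos:m.start()], False))
        segments.append((m.group(0), True))
        pos = m.end()
    if pos < len(text):
        segments.append((text[pos:], False))
    return segments
```

The word boundaries mean a name is found only when it is delimited by spaces or punctuation, which holds for the usual form of a Chinese name followed by the Latin name in parentheses.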
**Font Size Standardization**
- Body text: **小四 (12pt)**
- References: **小四 (12pt)**
- Level 1 headings: **小三 (16pt)** 黑体
- Level 2 headings: **四号 (14pt)** 黑体
- Citation superscripts: Slightly smaller for visual distinction
**Auto-Open Feature**
- Automatically opens the generated Word document after creation
- Cross-platform support (Windows, macOS, Linux)
- Provides clear console feedback
### 🔧 Improvements
- Better handling of mixed Chinese/English text
- Improved citation superscript formatting with brackets `[n]`
- Enhanced console output with formatting summary
### 📚 Documentation Updates
- Updated `SKILL.md` with detailed feature descriptions
- Added usage examples for superscript format
- Updated version history
---
## v2.2.0 (2026-03-19)
### ✨ New Features
- Added `create_word_with_superscript.js` for superscript citation format
- Superscript citations retain brackets: `[1]`, `[2-3]` appear as superscript
---
## v2.1.0 (2026-03-19)
### ✨ New Features
- Added unified checker (`unified_checker.py`) for comprehensive validation
- Optimized workflows to eliminate redundant checks
- Improved reporting format
---
## v2.0.0 (2026-03-19)
### ✨ New Features
- Added typo and grammar checker
- Added reference accuracy checker
- Added document format validation
---
## v1.0.0 (2026-03-19)
### 🎉 Initial Release
- Basic reference formatting
- Word document generation
- Reference filtering capabilities
FILE:package.json
{
  "name": "workwork-skill",
  "version": "3.1.0",
  "description": "Academic review writing and formatting toolkit - end-to-end support from format checking to document generation (includes consecutive duplicate citation detection)",
  "main": "SKILL.md",
  "keywords": [
    "academic",
    "review",
    "research",
    "references",
    "format-checker",
    "word-document",
    "superscript-citations",
    "chinese-academic"
  ],
  "author": "WorkBuddy",
  "license": "MIT",
  "dependencies": {
    "docx": "^9.6.1"
  },
  "scripts": {
    "format-check": "cd scripts && python reference_formatter.py",
    "typo-check": "cd scripts && python typo_grammar_checker.py",
    "reference-check": "cd scripts && python reference_accuracy_checker.py",
    "doc-check": "cd scripts && python document_format_checker.py",
    "unified-check": "cd scripts && python unified_checker.py",
    "filter": "cd scripts && python filter_references.py",
    "fix-refs": "cd scripts && python extract_and_fix_references.py",
    "verify": "cd scripts && python simple_verify.py",
    "word": "cd scripts && node create_word_doc_v3.js",
    "word-superscript": "cd scripts && node create_word_with_superscript.js"
  },
  "workbuddy": {
    "skill": {
      "name": "workwork",
      "version": "3.1.0",
      "category": "academic-research",
      "description": "Academic review writing and formatting toolkit with superscript citation format and consecutive duplicate citation detection",
      "features": [
        "Literature integrity checker",
        "Unified checker",
        "Reference format check",
        "Reference filtering",
        "Reference numbering fix",
        "Typo check",
        "Citation accuracy check",
        "Document format check",
        "Word document generation",
        "Superscript-citation Word generation",
        "Consecutive duplicate citation detection"
      ]
    }
  }
}
FILE:README.md
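# Workwork - Academic Review Writing and Formatting Toolkit
An automation toolkit dedicated to writing academic literature reviews, covering the whole workflow from format checking to document generation.
## 📋 Feature Overview
### Core Modules (9)
1. **Unified checker** - runs every check in one pass (recommended)
2. **Reference format check** - checks citation placement, author format, and numbering continuity
3. **Reference management and filtering** - identifies and removes references from non-core journals
4. **Reference numbering** - renumbers references automatically and fixes duplicate numbers
5. **Typo and grammar check** - checks common typos, grammar errors, and punctuation
6. **Reference format and citation accuracy check** - verifies reference format, citation accuracy, and numbering continuity
7. **Document format check** - checks document structure, section numbering, and paragraph format
8. **Word document generator** - generates a Word document in the standard format
9. **Superscript-citation Word generator** - generates a Word document with superscript citations (recommended)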
## 🚀 Quick Start
### 🎯 Recommended: run the unified check (all checks in one pass)
```bash
cd scripts
python unified_checker.py ../../your_review.md
# View the unified report
cat ../../your_review_unified_report.md
```
### 📝 Step-by-step checks (as needed)
To run a single check on its own:
```bash
cd scripts
# 1. Reference format check
python reference_formatter.py ../../your_review.md
# 2. Typo and grammar check
python typo_grammar_checker.py ../../your_review.md
# 3. Reference format and citation accuracy check
python reference_accuracy_checker.py ../../your_review.md
# 4. Document format check
python document_format_checker.py ../../your_review.md
```
### 📊 Filtering and fixing
```bash
cd scripts
# 5. Filter references (optional)
# Edit the journal list in filter_references.py first
python filter_references.py
# 6. Fix reference numbering
python extract_and_fix_references.py
# 7. Generate the Word document (standard format)
node create_word_doc_v3.js
# Or generate the superscript format (recommended)
node create_word_with_superscript.js ../../your_review.md
```
## 📁 Directory Structure
```
workwork/
├── README.md                            # This file
├── example_usage.md                     # Usage examples
├── package.json                         # Node.js dependencies
├── scripts/                             # Script directory
│   ├── unified_checker.py               # Unified checker ✅
│   ├── reference_formatter.py           # Reference format check ✅
│   ├── typo_grammar_checker.py          # Typo check ✅
│   ├── reference_accuracy_checker.py    # Citation accuracy check ✅
│   ├── document_format_checker.py       # Document format check ✅
│   ├── filter_references.py             # Reference filtering ✅
│   ├── extract_and_fix_references.py    # Reference numbering fix ✅
│   ├── create_word_doc_v3.js            # Word document generation ✅
│   ├── create_word_with_superscript.js  # Superscript-citation Word generation ✅
│   └── simple_verify.py                 # Simple verification ✅
├── templates/                           # Configuration directory
│   └── ref_format_default.yml           # Default format configuration ✅
├── examples/                            # Example files (empty)
└── reports/                             # Example reports (empty)
```
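## 💡 Workflow
### ⚠️ Important
**Avoid redundant checks**: for efficiency, use the **unified check**.
### 🎯 Recommended workflow
1. Write the review (Markdown)
2. Run the unified check with `unified_checker.py`
3. Read the unified report
4. Confirm and apply the fixes
5. Filter references (optional)
6. Fix the numbering
7. Run the final verification
8. Generate the Word document (superscript format recommended)
### 📝 Step-by-step workflow
1. Write the review (Markdown)
2. Run each check script in turn
3. Read each check report
4. Confirm and apply the fixes
5. Filter references (optional)
6. Fix the numbering
7. Run a full verification
8. Generate the Word document
### 🔄 Quick-fix workflow (minor edits)
After a small fix there is no need to rerun every check:
- Run only the check scripts affected by the change
- Generate the Word document
## ✨ Superscript-Citation Word Generator Highlights
`create_word_with_superscript.js` provides the following enhancements:
- **Smart font handling**: SimSun for Chinese; Times New Roman for English letters, digits, and punctuation
- **Automatic italics for Latin species names**: detects scientific names and italicizes them
- **Standardized font size**: body text and references both use 小四 (12pt)
- **Auto-open after generation**: opens the Word document once it has been generated
- **Superscript citation format**: `[1]` and `[2-3]` are rendered as superscripts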
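To show what the superscript rule involves, the Python sketch below splits a sentence into ordinary runs and citation runs that keep their brackets. It is an illustration under assumptions, not the logic of `create_word_with_superscript.js`; the function name and the marker pattern (single numbers, comma lists, and ranges) are hypothetical.

```python
import re

# Citation markers such as [1], [2,3] and [2-3]; the brackets are part of the marker
CITATION_RE = re.compile(r'\[\d+(?:[,\-]\d+)*\]')


def split_superscript_runs(text: str) -> list:
    """Split text into (run, is_superscript) pairs, keeping brackets on the markers."""
    # A capturing group makes re.split keep the markers in the result
    parts = re.split(f'({CITATION_RE.pattern})', text)
    return [(part, bool(CITATION_RE.fullmatch(part))) for part in parts if part]
```

Each pair corresponds to one text run; the runs flagged `True` would be written with superscript formatting.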
## 🔧 Configuration
You can customize the format rules:
```yaml
# Edit templates/ref_format_default.yml
citation:
  style: "superscript"
  position_rules:
    before_punctuation: true
  same_source_format: "[{ref}]"
reference_list:
  authors:
    max_authors: 3
```
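## 📖 Documentation
- **example_usage.md** - detailed usage examples
## 📝 Requirements
- Python 3.7+
- Node.js 14+
- Python dependency: `pyyaml` (optional)
- Node.js dependency: `docx`
Install the Node.js dependency:
```bash
npm install docx
```
## 🎯 Highlights
1. **Complete** - covers the whole workflow of writing an academic review
2. **Configurable** - supports a YAML configuration file
3. **Modular** - each function can be used on its own
4. **Superscript citations** - supports the superscript citation format required by Chinese academic standards
## 📄 License
This tool is for study and research use only.
## 🤝 Contributing
Issues and suggestions for improvement are welcome!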
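## 📅 Version History
### v2.3.0 (2026-03-19)
- ✅ **Enhanced superscript Word generator** - smart font handling, automatic italics for species names, auto-open after generation
- ✅ **Standardized font size** - body text and references both use 小四 (12pt)
### v2.2.0 (2026-03-19)
- ✅ **New superscript-citation Word generator** - `create_word_with_superscript.js`
- ✅ **Superscript citations keep their brackets** - `[1]` and `[2-3]` are rendered as superscripts
### v2.1.0 (2026-03-19)
- ✅ **New unified check** - `unified_checker.py` runs every check in one pass
- ✅ **Streamlined workflows** - added a recommended workflow and a quick-fix workflow
- ✅ **Fewer redundant checks** - added usage notes to prevent running the same check twice
### v2.0.0 (2026-03-19)
- ✅ Added the typo and grammar check
- ✅ Added the reference format and citation accuracy check
- ✅ Added the document format check
### v1.0.0 (2026-03-19)
- ✅ Reference format check
- ✅ Reference filtering
- ✅ Reference numbering fix
- ✅ Word document generation
- ✅ Quality control checks
---
**Created**: 2026-03-19
**Current version**: v2.3.0
**Status**: ✅ Complete and ready for use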
FILE:test_sample.md
# 人工智能在医疗领域的应用研究综述
## 摘要
本文综述了人工智能技术在医疗领域的最新应用进展,包括医学影像分析、疾病诊断、药物研发等方面。通过对近年来相关文献的分析,总结了当前研究的主要成果和存在的挑战。
**关键词**: 人工智能;医疗;深度学习;医学影像
## 1. 引言
人工智能(Artificial Intelligence, AI)技术的快速发展为医疗领域带来了革命性的变化[1]。深度学习技术在医学影像分析中展现出巨大潜力[2,3]。近年来,越来越多的研究开始探索AI在临床诊断中的应用[4]。
## 2. 医学影像分析
### 2.1 影像识别技术
卷积神经网络(CNN)在医学影像识别中取得了显著成果[5]。研究表明,AI系统在特定任务上可以达到甚至超越专家水平[6]。
### 2.2 应用案例
在肺结节检测方面,深度学习算法表现出色[7]。多项研究证实了其在提高诊断准确率方面的价值[8,9]。
## 3. 疾病诊断
AI辅助诊断系统已经在多个科室得到应用[10]。自然语言处理技术也被用于电子病历分析[11]。
## 4. 挑战与展望
尽管取得了显著进展,但AI在医疗领域的应用仍面临数据隐私、算法透明度等挑战[12]。未来研究需要解决这些问题[13]。
## 参考文献
[1] Smith J, Johnson A. Artificial intelligence in healthcare: A comprehensive review. Nature Medicine, 2023, 29(3): 456-467.
[2] Wang L, Zhang Y. Deep learning for medical image analysis. IEEE Transactions on Medical Imaging, 2022, 41(5): 1123-1138.
[3] Chen X, Liu M, Wu B. Convolutional neural networks for radiology image classification. Medical Image Analysis, 2023, 78: 102456.
[4] Anderson K, Brown P. Clinical applications of machine learning. The Lancet Digital Health, 2022, 4(8): e567-e578.
[5] Li H, Zhao Q. Advanced CNN architectures for medical imaging. Nature Biomedical Engineering, 2023, 7(2): 145-158.
[6] Taylor R, Martinez S. AI versus human experts in diagnostic accuracy. JAMA, 2022, 328(15): 1523-1535.
[7] Zhang F, Wang C. Pulmonary nodule detection using deep learning. Radiology, 2023, 307(1): e222145.
[8] Liu Y, Chen H. Performance evaluation of AI diagnostic systems. Nature Medicine, 2022, 28(9): 1789-1801.
[9] Kim J, Park S. Multi-center validation of AI algorithms. Medical Physics, 2023, 50(4): 2341-2355.
[10] Brown M, Davis L. AI-assisted diagnosis in clinical practice. NEJM AI, 2023, 1(2): 100-115.
[11] Wilson T, Garcia R. Natural language processing for electronic health records. JAMIA, 2022, 29(8): 1423-1438.
[12] Johnson P, Lee K. Ethical challenges in medical AI. Nature Biotechnology, 2023, 41(3): 312-325.
[13] Davis M, White S. Future directions for AI in healthcare. Science Translational Medicine, 2022, 14(648): eabm4567.
FILE:test_sample_with_errors.md
# 测试文档 - 包含错误的文献
## 1. 引言
这是一个测试文档,包含一些文献引用错误[1]。
有些引用指向不存在的文献[15]。还有孤立引用[20]。
## 2. 正文
这里引用了一些文献[2,3]。还有跳跃引用[5,10]。
## 参考文献
[1] Smith J. First reference. Nature, 2023, 1: 1-10.
[2] Johnson A. Second reference. Science, 2022, 2: 20-30.
[3] Wang L. Third reference. Cell, 2023, 3: 100-110.
[5] Brown M. Fifth reference. NEJM, 2022, 5: 500-510.
[10] Davis K. Tenth reference. Lancet, 2023, 10: 1000-1010.