@clawhub-aitanjp-c011beca2f
---
name: invoice-extractor
description: Extract invoice information from images and PDF files using Baidu OCR API, export to Excel. Supports single file, multiple files, or entire directory processing. Use when the user mentions invoices, invoice recognition, extracting invoice data, processing receipts, converting invoices to Excel, or batch processing invoice files.
---
# Invoice Extractor
Extract invoice information from images (PNG, JPG, JPEG, BMP, TIFF) and PDF files, then export the results to Excel.
## Capabilities
- **Multi-format support**: PNG, JPG, JPEG, BMP, TIFF, PDF
- **High accuracy**: Uses Baidu OCR API specialized for invoice recognition
- **Complete fields**: Extracts all invoice fields including buyer/seller info, amounts, items
- **Excel export**: Formatted Excel output with summary and detail sheets
- **Flexible input**: Single file, multiple files, or entire directory processing
- **Batch processing**: Process hundreds of invoices in one command
- **Preview mode**: List files before processing
## Prerequisites
1. Baidu Cloud OCR API credentials (free tier: 50,000 requests/day)
2. Python environment with required packages
## Quick Start
### 1. Setup Baidu OCR
Get API credentials from https://cloud.baidu.com/product/ocr:
1. Register/login to Baidu Cloud
2. Create an application
3. Get API Key and Secret Key
### 2. Configure
Create `config.txt` in the project root:
```
BAIDU_API_KEY=your_api_key_here
BAIDU_SECRET_KEY=your_secret_key_here
```
Or run the setup wizard:
```bash
python src/main_baidu.py --setup
```
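For reference, the `KEY=VALUE` format of `config.txt` can be read with a few lines of Python. This is a minimal sketch, not the tool's actual loader in `src/config.py`; the `parse_config` helper and its handling of `#` comments are assumptions made for illustration.

```python
from typing import Dict


def parse_config(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and '#' comments (hypothetical helper)."""
    settings: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Ignore empty lines, comments, and lines without a key/value separator
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


sample = """
# Invoice Extractor Configuration
BAIDU_API_KEY=your_api_key_here
BAIDU_SECRET_KEY=your_secret_key_here
"""
config = parse_config(sample)
print(sorted(config))
```

Splitting on the first `=` only (via `partition`) keeps values intact even if a key happens to contain `=` characters.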
### 3. Run
**Process a single file:**
```bash
python src/main_baidu.py -f invoice.pdf
```
**Process multiple files:**
```bash
python src/main_baidu.py -f invoice1.pdf -f invoice2.png
```
**Process entire directory:**
```bash
python src/main_baidu.py -i ./fp
```
**Mixed mode (directory + extra files):**
```bash
python src/main_baidu.py -i ./fp -f extra_invoice.pdf
```
Output is saved as an Excel file in the `output/` directory.
## Workflow
```
Task Progress:
- [ ] Check prerequisites (Baidu API credentials)
- [ ] Choose input method (single file / multiple files / directory)
- [ ] Scan and collect invoice files
- [ ] Preview files (optional with --list)
- [ ] Process each file with Baidu OCR
- [ ] Parse invoice fields
- [ ] Export to Excel
- [ ] Verify output
```
## Input Methods
### Single File
Process one specific invoice file:
```bash
python src/main_baidu.py -f invoice.pdf
python src/main_baidu.py -f "path/to/invoice.png"
```
### Multiple Files
Process several specific files:
```bash
python src/main_baidu.py -f file1.pdf -f file2.png -f file3.jpg
```
### Entire Directory
Process all invoice files in a directory (subdirectories are scanned recursively):
```bash
python src/main_baidu.py -i ./my_invoices
python src/main_baidu.py -i "/path/to/invoice/folder"
```
### Mixed Mode
Combine directory and individual files:
```bash
python src/main_baidu.py -i ./fp -f ./extra/invoice.pdf
```
### Preview Mode
List files without processing:
```bash
python src/main_baidu.py -i ./fp --list
```
## Extracted Fields
### Basic Information
- Invoice code (发票代码)
- Invoice number (发票号码)
- Invoice date (开票日期)
- Invoice type (发票类型)
### Buyer Information
- Name (购买方名称)
- Tax number (纳税人识别号)
- Address and phone (地址电话)
- Bank account (开户行及账号)
### Seller Information
- Name (销售方名称)
- Tax number (纳税人识别号)
- Address and phone (地址电话)
- Bank account (开户行及账号)
### Amounts
- Total amount (合计金额)
- Total tax (合计税额)
- Amount with tax (价税合计)
### Items
- Product name (货物名称)
- Specification (规格型号)
- Unit (单位)
- Quantity (数量)
- Unit price (单价)
- Amount (金额)
- Tax rate (税率)
- Tax amount (税额)
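The field groups above map naturally onto a pair of data classes. The sketch below is illustrative only: most attribute names follow those used in `src/baidu_ocr_extractor.py`, but the real definitions live in `src/invoice_model.py` and may differ. The class names, the `specification` and `unit` attributes, and all default values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InvoiceItemSketch:
    """One line item on an invoice (hypothetical stand-in for InvoiceItem)."""
    name: str = ""
    specification: str = ""  # assumed attribute name
    unit: str = ""           # assumed attribute name
    quantity: float = 0.0
    unit_price: float = 0.0
    amount: float = 0.0
    tax_rate: str = ""
    tax_amount: float = 0.0


@dataclass
class InvoiceInfoSketch:
    """Header-level invoice data (hypothetical stand-in for InvoiceInfo)."""
    invoice_code: str = ""
    invoice_number: str = ""
    invoice_date: str = ""
    invoice_type: str = ""
    buyer_name: str = ""
    buyer_tax_number: str = ""
    seller_name: str = ""
    seller_tax_number: str = ""
    total_amount: float = 0.0
    total_tax_amount: float = 0.0
    total_amount_with_tax: float = 0.0
    # default_factory gives every invoice its own list of items
    items: List[InvoiceItemSketch] = field(default_factory=list)


invoice = InvoiceInfoSketch(
    invoice_number="12345678",
    total_amount=100.0,
    total_tax_amount=13.0,
    total_amount_with_tax=113.0,
)
invoice.items.append(
    InvoiceItemSketch(name="Office supplies", quantity=2, unit_price=50.0,
                      amount=100.0, tax_rate="13%", tax_amount=13.0)
)
print(invoice.invoice_number, len(invoice.items))
```

Keeping the line items in a separate class mirrors the two-sheet Excel layout: one row per invoice on the summary sheet, one row per item on the detail sheet.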
## Command Line Options
```bash
python src/main_baidu.py [options]
Input Options:
  -f FILE, --file FILE   Specify invoice file (can be used multiple times)
  -i DIR, --input DIR    Input directory (default: fp)
Output Options:
  -o DIR, --output DIR   Output directory (default: output)
  -n NAME, --name NAME   Output filename prefix (default: 发票信息)
Authentication Options:
  --api-key KEY          Baidu API Key
  --secret-key KEY       Baidu Secret Key
Other Options:
  --setup                Run configuration wizard
  --list                 List files to be processed without processing
  -h, --help             Show help
```
## Usage Examples
### Example 1: Single File
```bash
python src/main_baidu.py -f "invoice.pdf"
```
### Example 2: Multiple Files
```bash
python src/main_baidu.py -f "1.pdf" -f "2.png" -f "3.jpg"
```
### Example 3: Entire Directory
```bash
python src/main_baidu.py -i "./2024_invoices"
```
### Example 4: Preview Before Processing
```bash
python src/main_baidu.py -i ./fp --list
# Then process:
python src/main_baidu.py -i ./fp
```
### Example 5: Mixed Input
```bash
python src/main_baidu.py -i ./fp -f ./urgent/invoice.pdf -o ./output -n "March_2024"
```
### Example 6: Custom Output
```bash
python src/main_baidu.py -i ./fp -o ./reports -n "Q1_Invoice_Summary"
```
## Project Structure
```
.
├── fp/ # Place invoice files here
├── output/ # Excel output directory
├── src/
│ ├── main_baidu.py # Main entry point
│ ├── baidu_ocr_extractor.py # Baidu OCR wrapper
│ ├── invoice_model.py # Data models
│ ├── excel_exporter.py # Excel export
│ └── config.py # Configuration
├── scripts/ # Utility scripts
│ ├── batch_process.py # Batch processing helper
│ └── verify_export.py # Verify Excel export
├── config.txt # API credentials
├── requirements.txt # Dependencies
├── SKILL.md # This file
├── setup.md # Detailed setup guide
└── examples.md # Usage examples
```
## Utility Scripts
### Batch Processing Helper
```bash
python scripts/batch_process.py /path/to/invoices
```
### Verify Export
```bash
python scripts/verify_export.py output/invoice_info.xlsx
```
## Error Handling
Common issues and solutions:
**"Baidu OCR authentication failed"**
- Check API Key and Secret Key in config.txt
- Verify credentials are correct in Baidu Cloud console
**"No invoice files found"**
- Ensure files are in the specified directory
- Check file formats (supported: png, jpg, jpeg, bmp, tiff, pdf)
- Use `--list` to see what files are detected
**"Image format error"**
- PDF files are automatically converted to images
- Ensure PDF is not corrupted or password-protected
**"File not found"**
- Check file path is correct
- Use quotes for paths with spaces: `"path/to/file name.pdf"`
## Advanced Usage
### Environment Variables
Set credentials via environment:
```bash
export BAIDU_API_KEY="your_key"
export BAIDU_SECRET_KEY="your_secret"
```
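When credentials can come from several places, the lookup order matters. The sketch below assumes one plausible precedence (command-line flag, then environment variable, then `config.txt`); the tool's actual order is not stated here, and `resolve_credential` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def resolve_credential(
    name: str,
    cli_value: Optional[str],
    file_values: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the first non-empty value: CLI flag, environment, then config file.

    The precedence is an assumption for illustration, not the tool's documented behavior.
    """
    for candidate in (cli_value, env.get(name), file_values.get(name)):
        if candidate:
            return candidate
    return ""


file_values = {"BAIDU_API_KEY": "key_from_file"}
fake_env = {"BAIDU_API_KEY": "key_from_env"}
print(resolve_credential("BAIDU_API_KEY", None, file_values, fake_env))  # key_from_env
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.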
### Batch Processing Script
Create a script for monthly processing:
```bash
#!/bin/bash
MONTH=$(date +%Y%m)
python src/main_baidu.py \
  -i "/invoices/$MONTH" \
  -o "/reports/$MONTH" \
  -n "Invoice_Report_$MONTH"
```
## Additional Resources
- For detailed setup instructions, see [setup.md](setup.md)
- For more examples, see [examples.md](examples.md)
- For API documentation, visit https://cloud.baidu.com/doc/OCR/index.html
FILE:config.template.txt
# Invoice Extractor Configuration
# Get your API credentials from https://cloud.baidu.com/product/ocr
BAIDU_API_KEY=your_api_key_here
BAIDU_SECRET_KEY=your_secret_key_here
OCR_ENGINE=baidu
FILE:examples.md
# Invoice Extractor Examples
## Table of Contents
1. [Basic Examples](#basic-examples)
2. [Advanced Usage](#advanced-usage)
3. [Batch Processing](#batch-processing)
4. [Utility Scripts](#utility-scripts)
5. [Integration Examples](#integration-examples)
---
## Basic Examples
### Example 1: Single File Processing
Process one specific invoice file:
```bash
# PDF invoice
python src/main_baidu.py -f invoice.pdf
# Image invoice
python src/main_baidu.py -f "invoice.png"
# With spaces in filename
python src/main_baidu.py -f "path/to/my invoice.pdf"
```
**Output:**
```
============================================================
Invoice Extractor (Baidu OCR)
============================================================
Time: 2026-03-19 10:30:00
============================================================
Scanning files...
Found 1 files to process
[OK] Baidu OCR authentication successful
[1/1] Processing: invoice.pdf
Converting PDF to image...
[OK] Successfully extracted: 26437000000033943419
[OK] Successfully extracted 1 invoices
Extraction Summary:
------------------------------------------------------------
1. Invoice Number: 26437000000033943419
Date: 2026-01-02
Seller: China Mobile Communications Group...
Amount: 169.00
Starting Excel export...
[OK] Successfully exported 1 records to: output/invoice_info_20260319.xlsx
```
---
### Example 2: Multiple Files
Process several specific files:
```bash
python src/main_baidu.py -f invoice1.pdf -f invoice2.png -f invoice3.jpg
```
---
### Example 3: Directory Processing
Process all invoices in a directory:
```bash
# Default fp/ directory
python src/main_baidu.py
# Custom directory
python src/main_baidu.py -i ./my_invoices
# Absolute path
python src/main_baidu.py -i "/Users/accounting/2024/Q1"
```
**Recursive processing:** The tool automatically scans subdirectories.
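A recursive scan of this kind can be sketched with `pathlib`. This is an illustration rather than the tool's own scanner: `collect_invoice_files` and `SUPPORTED_SUFFIXES` are hypothetical names, and the suffix list is taken from the supported formats documented for the tool.

```python
import tempfile
from pathlib import Path
from typing import List

# Suffixes taken from the documented supported formats (compared case-insensitively)
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif", ".pdf"}


def collect_invoice_files(directory: str) -> List[Path]:
    """Return all supported files under `directory`, including subdirectories."""
    root = Path(directory)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


# Small demonstration on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    for name in ("a.pdf", "notes.txt", "sub/b.PNG"):
        (Path(tmp) / name).touch()
    found = [p.relative_to(tmp).as_posix() for p in collect_invoice_files(tmp)]
print(found)
```

Lower-casing the suffix before the comparison means files such as `b.PNG` are picked up alongside `b.png`.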
---
### Example 4: Preview Mode
List files without processing:
```bash
python src/main_baidu.py -i ./fp --list
```
**Output:**
```
Scanning files...
Found 5 files:
1. fp\invoice_001.pdf
2. fp\invoice_002.png
3. fp\invoice_003.jpg
4. fp\subfolder\invoice_004.pdf
5. fp\subfolder\invoice_005.png
```
Use this to verify what will be processed before running the actual extraction.
---
## Advanced Usage
### Example 5: Mixed Input Mode
Combine directory and individual files:
```bash
# Process directory plus extra files
python src/main_baidu.py -i ./fp -f ./urgent/invoice.pdf -f ./special/case.png
```
**Use case:** You have most invoices organized in `fp/` but received a few urgent ones elsewhere.
---
### Example 6: Custom Output
Specify output directory and filename:
```bash
python src/main_baidu.py \
-i ./fp \
-o ./reports/2024 \
-n "March_Invoice_Summary"
```
**Result:** `./reports/2024/March_Invoice_Summary_20260319_103000.xlsx`
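The result name above suggests the pattern `<prefix>_<YYYYMMDD>_<HHMMSS>.xlsx`. A minimal sketch of building such a path follows; the pattern is inferred from this one example, and `build_output_path` is a hypothetical helper, not the exporter's real API.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(output_dir: str, prefix: str, now: datetime) -> Path:
    """Join the output directory, the prefix, and a timestamp into an .xlsx path."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"{prefix}_{stamp}.xlsx"


path = build_output_path(
    "./reports/2024",
    "March_Invoice_Summary",
    datetime(2026, 3, 19, 10, 30, 0),
)
print(path.name)  # March_Invoice_Summary_20260319_103000.xlsx
```

Including the time as well as the date keeps repeated runs on the same day from overwriting each other.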
---
### Example 7: Command Line Authentication
Override config file with command line credentials:
```bash
python src/main_baidu.py \
-i ./fp \
--api-key "your_api_key" \
--secret-key "your_secret_key"
```
**Use case:** Temporary use or testing with different credentials.
---
## Batch Processing
### Example 8: Monthly Report Generation
Create a script for monthly processing:
```bash
#!/bin/bash
# monthly_invoice_report.sh
MONTH=$(date +%Y%m)
INPUT_DIR="/shared/invoices/$MONTH"
OUTPUT_DIR="/reports/$MONTH"
cd /opt/invoice-extractor
echo "Processing invoices for $MONTH..."
python src/main_baidu.py \
-i "$INPUT_DIR" \
-o "$OUTPUT_DIR" \
-n "Invoice_Report_$MONTH"
echo "Report generated in: $OUTPUT_DIR (file name starts with Invoice_Report_$MONTH)"
```
**Usage:**
```bash
chmod +x monthly_invoice_report.sh
./monthly_invoice_report.sh
```
---
### Example 9: Using Batch Process Helper
Use the provided batch helper script:
```bash
# Simple usage
python scripts/batch_process.py ./invoices
# With custom output
python scripts/batch_process.py ./invoices -o ./output -n "Q1_2024"
# Without item details (faster)
python scripts/batch_process.py ./invoices --no-items
```
---
### Example 10: Process by Date Range
Process invoices from a specific date range:
```bash
# Find invoices from March 2024 and process them
find ./invoices -name "*202403*.pdf" -o -name "*2024-03*.pdf" | \
xargs -I {} python src/main_baidu.py -f {}
```
---
## Utility Scripts
### Example 11: Verify Export Quality
Check the exported Excel file:
```bash
python scripts/verify_export.py output/invoice_info_20260319.xlsx
```
**Output:**
```
============================================================
Excel File Verification Report
============================================================
File: output/invoice_info_20260319.xlsx
------------------------------------------------------------
Number of sheets: 2
Sheets: 发票信息, 商品明细
【Invoice Information Sheet】
Records: 50
Field Completeness:
Invoice Number: 50/50 (100.0%)
Invoice Date: 50/50 (100.0%)
Seller Name: 48/50 (96.0%)
Total Amount: 50/50 (100.0%)
Amount Statistics:
Total: 125,680.50
Average: 2,513.61
Maximum: 15,800.00
Minimum: 120.00
Date Range:
Earliest: 2024-03-01
Latest: 2024-03-31
【Item Details Sheet】
Records: 156
Involved Invoices: 50
============================================================
[OK] Verification Complete
============================================================
```
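The "Field Completeness" figures in the report are the share of records in which a field is non-empty. The sketch below shows that calculation on plain dictionaries; it is an assumption about how such a metric can be computed, not the code of `scripts/verify_export.py`, and `field_completeness` is a hypothetical helper.

```python
from typing import Dict, Iterable, List, Optional


def field_completeness(
    records: List[Dict[str, Optional[str]]], fields: Iterable[str]
) -> Dict[str, float]:
    """Return, per field, the percentage of records with a non-empty value."""
    total = len(records)
    result: Dict[str, float] = {}
    for name in fields:
        # Treat missing keys, None, and whitespace-only strings as empty
        filled = sum(1 for record in records if str(record.get(name) or "").strip())
        result[name] = round(100.0 * filled / total, 1) if total else 0.0
    return result


records = [
    {"Invoice Number": "001", "Seller Name": "Vendor A"},
    {"Invoice Number": "002", "Seller Name": ""},
    {"Invoice Number": "003", "Seller Name": "Vendor B"},
    {"Invoice Number": "004", "Seller Name": "Vendor C"},
]
stats = field_completeness(records, ["Invoice Number", "Seller Name"])
print(stats)
```

A field that scores well below 100% is a useful signal to recheck the corresponding source invoices by hand.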
---
### Example 12: Compare Multiple Exports
Compare reports from different months:
```bash
# Generate reports for each month
for month in 01 02 03; do
python src/main_baidu.py \
-i "./2024/$month" \
-o "./reports" \
-n "2024${month}_Invoices"
done
# Verify all reports
for file in ./reports/2024*_Invoices_*.xlsx; do
echo "Checking: $file"
python scripts/verify_export.py "$file"
done
```
---
## Integration Examples
### Example 13: Python Script Integration
Use the tool in your Python script:
```python
import sys
from pathlib import Path
# Add src to path
sys.path.insert(0, str(Path(__file__).parent / "src"))
from config import Config
from baidu_ocr_extractor import BaiduInvoiceExtractor
from excel_exporter import ExcelExporter
# Load config
Config.load_from_file()
# Initialize extractor
extractor = BaiduInvoiceExtractor(
api_key=Config.BAIDU_API_KEY,
secret_key=Config.BAIDU_SECRET_KEY
)
# Process single file
invoice = extractor.extract_from_file("path/to/invoice.pdf")
if invoice:
    print(f"Invoice Number: {invoice.invoice_number}")
    print(f"Amount: {invoice.total_amount_with_tax}")
    print(f"Items: {len(invoice.items)}")
```
---
### Example 14: Automated Email Report
Send report via email after processing:
```bash
#!/bin/bash
# process_and_email.sh
RECIPIENT="accounting@example.com"
INPUT_DIR="./invoices"
OUTPUT_DIR="./output"
# Process invoices
python src/main_baidu.py -i "$INPUT_DIR" -o "$OUTPUT_DIR" -n "Daily_Report"
# Find latest report
LATEST_REPORT=$(ls -t "$OUTPUT_DIR"/*.xlsx | head -1)
# Send email with attachment (using mutt or mail)
echo "Invoice processing complete. See attachment." | \
mutt -s "Daily Invoice Report" -a "$LATEST_REPORT" -- "$RECIPIENT"
echo "Report sent to $RECIPIENT"
```
---
### Example 15: Database Integration
Import extracted data to database:
```python
import pandas as pd
import sqlite3
# Read Excel
df = pd.read_excel('output/invoice_info.xlsx')
# Connect to database
conn = sqlite3.connect('invoices.db')
# Save to database
df.to_sql('invoices', conn, if_exists='append', index=False)
# Query example
result = pd.read_sql_query("""
SELECT 销售方名称, SUM(价税合计) as total
FROM invoices
GROUP BY 销售方名称
ORDER BY total DESC
""", conn)
print(result)
```
---
## Common Use Cases
### Monthly Expense Report
1. Collect all invoices for the month
2. Place in `fp/` directory
3. Run: `python src/main_baidu.py -n "2024_03_Expenses"`
4. Import Excel into accounting software
### Tax Preparation
1. Gather all invoices for tax year
2. Batch process with custom output name
3. Filter and categorize in Excel
4. Submit to tax authority
### Audit Documentation
1. Scan all paper invoices to PDF
2. Process with invoice extractor
3. Generate complete Excel records
4. Archive digital and physical copies
### Vendor Analysis
```python
import pandas as pd
# Read exported Excel
df = pd.read_excel('output/invoice_info.xlsx')
# Group by vendor
vendor_stats = df.groupby('销售方名称').agg({
'价税合计': ['count', 'sum', 'mean']
}).round(2)
print(vendor_stats)
```
---
## Tips and Best Practices
### File Organization
```
invoices/
├── 2024/
│ ├── Q1/
│ │ ├── 01/
│ │ ├── 02/
│ │ └── 03/
│ └── Q2/
└── 2023/
└── ...
```
Process by quarter:
```bash
python src/main_baidu.py -i "./2024/Q1" -n "2024_Q1_Invoices"
```
### Naming Conventions
Use descriptive names for output files:
- `2024_03_Marketing_Invoices`
- `Q1_Travel_Expenses`
- `Project_X_Vendor_Payments`
### Verification Workflow
1. Preview files: `python src/main_baidu.py -i ./fp --list`
2. Process: `python src/main_baidu.py -i ./fp`
3. Verify: `python scripts/verify_export.py output/*.xlsx`
4. Review Excel output before submitting
---
## Troubleshooting Examples
### Handle Failed Extractions
```bash
# Process and save all output to a log file
python src/main_baidu.py -i ./fp 2>&1 | tee processing.log
# Check for failures
grep "FAIL" processing.log
# Re-process failed files only
python src/main_baidu.py -f failed_invoice1.pdf -f failed_invoice2.png
```
### Large Batch Processing
For very large batches (1000+ invoices), process in chunks:
```bash
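# Move the PDFs into subdirectories of 100 files each
n=0
for f in *.pdf; do
  dir="batch_$(( n / 100 + 1 ))"
  mkdir -p "$dir"
  mv "$f" "$dir/"
  n=$(( n + 1 ))
done
# Process each batch (the digits in the directory name become the report suffix)
for dir in batch_*/; do
  python src/main_baidu.py -i "$dir" -o "./output" -n "Batch_${dir//[!0-9]/}"
done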
# Merge results (using Python pandas)
python -c "
import pandas as pd
import glob
files = glob.glob('output/Batch_*.xlsx')
dfs = [pd.read_excel(f) for f in files]
combined = pd.concat(dfs, ignore_index=True)
combined.to_excel('output/Combined_Results.xlsx', index=False)
"
```
FILE:install.sh
#!/bin/bash
# Invoice Extractor Skill Installation Script
echo "=================================="
echo "Invoice Extractor Skill Installer"
echo "=================================="
# Check Python version
echo "Checking Python version..."
python_version=$(python3 --version 2>&1 | awk '{print $2}')
echo "Found Python $python_version"
# Install dependencies
echo ""
echo "Installing dependencies..."
pip install -r requirements.txt
# Create directories
echo ""
echo "Creating directories..."
mkdir -p fp output .temp
# Copy config template
echo ""
echo "Setting up configuration..."
if [ ! -f "config.txt" ]; then
cp config.template.txt config.txt
echo "Created config.txt from template"
echo "Please edit config.txt with your Baidu OCR credentials"
else
echo "config.txt already exists, skipping"
fi
echo ""
echo "=================================="
echo "Installation complete!"
echo "=================================="
echo ""
echo "Next steps:"
echo "1. Get Baidu OCR API credentials from https://cloud.baidu.com/product/ocr"
echo "2. Edit config.txt with your API Key and Secret Key"
echo "3. Place invoice files in fp/ directory"
echo "4. Run: python src/main_baidu.py"
echo ""
FILE:requirements.txt
requests>=2.28.0
pandas>=2.0.0
openpyxl>=3.1.0
PyMuPDF>=1.23.0
Pillow>=10.0.0
FILE:setup.md
# Invoice Extractor Setup Guide
## Prerequisites
- Python 3.8 or higher
- Internet connection (for Baidu OCR API)
- Baidu Cloud account
## Quick Setup (5 minutes)
### Step 1: Install Dependencies
```bash
pip install -r requirements.txt
```
Required packages:
- requests (for API calls)
- pandas (for data handling)
- openpyxl (for Excel export)
- PyMuPDF (for PDF processing)
- Pillow (for image processing)
### Step 2: Get Baidu OCR Credentials
1. Visit https://cloud.baidu.com/product/ocr
2. Register/login with your phone number
3. Complete real-name verification (upload ID card, instant approval)
4. Create application and select "VAT Invoice Recognition"
5. Copy your **API Key** and **Secret Key**
### Step 3: Configure
Run the setup wizard:
```bash
python src/main_baidu.py --setup
```
Or manually create `config.txt`:
```
BAIDU_API_KEY=your_api_key_here
BAIDU_SECRET_KEY=your_secret_key_here
```
### Step 4: Test
```bash
# Preview files
python src/main_baidu.py -i ./fp --list
# Process invoices
python src/main_baidu.py -i ./fp
```
---
## Detailed Setup Instructions
### Step 1: Install Dependencies
#### Option A: Using pip
```bash
# Create virtual environment (recommended)
python -m venv venv
# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
```
#### Option B: Using conda
```bash
# Create conda environment
conda create -n invoice-extractor python=3.10
# Activate environment
conda activate invoice-extractor
# Install dependencies
pip install -r requirements.txt
```
#### Option C: Using poetry
```bash
# Install poetry first: https://python-poetry.org/docs/#installation
# Install dependencies
poetry install
# Run with poetry
poetry run python src/main_baidu.py
```
### Step 2: Get Baidu OCR Credentials
#### 2.1 Register Baidu Cloud Account
1. Visit https://cloud.baidu.com/
2. Click "Register" or "Login"
3. Complete registration with phone number
4. Verify email (optional but recommended)
#### 2.2 Real-name Verification
**Required for free tier access**
1. Go to console after login
2. Click "Account" → "Real-name Verification"
3. Choose "Personal Verification"
4. Upload ID card photos (front and back)
5. Wait for automatic approval (usually instant)
**Note:** Enterprise verification is also available for business accounts.
#### 2.3 Create OCR Application
1. Go to https://cloud.baidu.com/product/ocr
2. Click "Get Started" or "Console"
3. Select "Create Application"
4. Fill in application details:
- **Application Name**: Invoice Extractor (or any name)
- **Application Description**: Invoice recognition tool
- **Interface Selection**: Check "VAT Invoice Recognition"
5. Click "Create"
#### 2.4 Get API Keys
After creating the application, you'll see:
- **AppID**: Application ID
- **API Key**: For API authentication
- **Secret Key**: For API authentication
**Important:** Copy the **API Key** and **Secret Key** immediately. You'll need them for configuration.
### Step 3: Configure the Tool
#### Option A: Configuration Wizard (Recommended)
```bash
cd src
python main_baidu.py --setup
```
Follow the prompts:
```
Invoice Extractor - Setup Wizard
================================
Configure Baidu OCR now? (y/n): y
Enter Baidu API Key: [paste your API key]
Enter Baidu Secret Key: [paste your secret key]
[OK] Configuration saved to: config.txt
```
#### Option B: Manual Configuration
Create `config.txt` in project root:
```
# Invoice Extractor Configuration
# Get your API credentials from https://cloud.baidu.com/product/ocr
BAIDU_API_KEY=your_api_key_here
BAIDU_SECRET_KEY=your_secret_key_here
OCR_ENGINE=baidu
```
**Security Tips:**
- Never commit `config.txt` to version control
- Add `config.txt` to `.gitignore`
- Use environment variables for CI/CD
#### Option C: Environment Variables
```bash
# Windows (Command Prompt)
set BAIDU_API_KEY=your_api_key
set BAIDU_SECRET_KEY=your_secret_key
# Windows (PowerShell)
$env:BAIDU_API_KEY="your_api_key"
$env:BAIDU_SECRET_KEY="your_secret_key"
# macOS/Linux
export BAIDU_API_KEY="your_api_key"
export BAIDU_SECRET_KEY="your_secret_key"
```
### Step 4: Test Installation
#### 4.1 Verify Configuration
```bash
python -c "
from src.config import Config
Config.load_from_file()
print('API Key:', Config.BAIDU_API_KEY[:10] + '...' if Config.BAIDU_API_KEY else 'Not set')
print('Secret Key:', 'Set' if Config.BAIDU_SECRET_KEY else 'Not set')
"
```
#### 4.2 Test with Sample Invoice
1. Place a test invoice (image or PDF) in `fp/` directory
2. Preview files:
```bash
python src/main_baidu.py -i ./fp --list
```
3. Process:
```bash
python src/main_baidu.py -i ./fp
```
4. Check `output/` directory for Excel file
---
## Usage Examples
### Basic Usage
```bash
# Process all invoices in fp/ directory
python src/main_baidu.py
# Process specific directory
python src/main_baidu.py -i ./my_invoices
# Process single file
python src/main_baidu.py -f invoice.pdf
# Process multiple files
python src/main_baidu.py -f 1.pdf -f 2.png -f 3.jpg
# Custom output
python src/main_baidu.py -i ./fp -o ./reports -n "March_2024"
```
### Advanced Usage
```bash
# Preview before processing
python src/main_baidu.py -i ./fp --list
# Process with custom credentials
python src/main_baidu.py -i ./fp --api-key "xxx" --secret-key "yyy"
# Batch process with helper script
python scripts/batch_process.py ./invoices -o ./output -n "Q1_2024"
# Verify export quality
python scripts/verify_export.py output/invoice_info.xlsx
```
---
## Troubleshooting
### Authentication Failed
**Symptom**: "Baidu OCR authentication failed"
**Solutions**:
1. Check API Key and Secret Key in `config.txt`
2. Verify no extra spaces or newlines in the values
3. Ensure credentials are from OCR service (not other Baidu services)
4. Check if account has real-name verification
5. Try regenerating keys in Baidu Cloud console
### No Invoice Files Found
**Symptom**: "No invoice files found"
**Solutions**:
1. Check files are in correct directory
2. Verify file extensions (.pdf, .png, .jpg, etc.)
3. Ensure files are not corrupted
4. Use `--list` flag to see what files are detected
### Image Format Error
**Symptom**: "image format error"
**Solutions**:
1. For PDF files: Ensure PDF is not password-protected
2. For images: Check image is not corrupted
3. Try converting file to different format
4. Check if file is actually an image/PDF (not renamed)
### Network Issues
**Symptom**: Connection timeout or API errors
**Solutions**:
1. Check internet connection
2. Verify firewall allows HTTPS to aip.baidubce.com
3. Try again later (Baidu API may have temporary issues)
4. Check if you're behind a corporate proxy
### Module Not Found
**Symptom**: "ModuleNotFoundError: No module named 'xxx'"
**Solutions**:
```bash
# Reinstall dependencies
pip install -r requirements.txt --force-reinstall
# Or install specific package
pip install pandas openpyxl requests PyMuPDF Pillow
```
---
## Free Tier Limits
- **Daily quota**: 50,000 requests/day
- **QPS limit**: 2 requests/second (free tier)
- **Sufficient for**: Personal/small business use
**Rate Limiting:**
If you hit the QPS limit, the tool will automatically retry with exponential backoff.
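Exponential backoff means waiting twice as long after each failed attempt before retrying. The sketch below shows the general technique only; it is not the tool's actual retry code, and `call_with_backoff`, `RateLimitError`, and the default delays are hypothetical.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Raised by the request function when the API reports a QPS limit (hypothetical)."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, doubling the wait after each rate-limit failure."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise RuntimeError("max_attempts must be at least 1")


# Demonstration: the first two calls are rate limited, the third succeeds.
delays: List[float] = []
attempts: Dict[str, int] = {"count": 0}


def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("QPS limit reached")
    return "ok"


result = call_with_backoff(flaky_request, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

Injecting the `sleep` function lets the demonstration record the delays instead of actually waiting.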
**Upgrading:**
If you need higher limits, upgrade in Baidu Cloud console. Paid tiers offer:
- Higher QPS (10-100+)
- Priority support
- SLA guarantees
---
## Directory Structure
After setup, your project should look like:
```
invoice-extractor/
├── fp/ # Place invoice files here
├── output/ # Excel output directory
├── src/
│ ├── main_baidu.py # Main entry point
│ ├── baidu_ocr_extractor.py
│ ├── invoice_model.py
│ ├── excel_exporter.py
│ └── config.py
├── scripts/
│ ├── batch_process.py # Batch processing helper
│ └── verify_export.py # Export verification
├── config.txt # Your API credentials (gitignored)
├── config.template.txt # Template for new users
├── requirements.txt # Dependencies
├── SKILL.md # Skill documentation
├── setup.md # This file
└── examples.md # Usage examples
```
---
## Next Steps
1. Read [SKILL.md](SKILL.md) for detailed usage
2. Check [examples.md](examples.md) for common use cases
3. Explore [scripts/](scripts/) for utility scripts
4. Visit https://cloud.baidu.com/doc/OCR/index.html for API documentation
## Getting Help
- **Issues**: Check troubleshooting section above
- **Baidu OCR**: https://cloud.baidu.com/doc/OCR/index.html
- **Examples**: See [examples.md](examples.md)
FILE:src/baidu_ocr_extractor.py
"""
Baidu OCR invoice extractor.
Uses the Baidu AI Cloud OCR API for invoice recognition.
"""
import re
import base64
import requests
from pathlib import Path
from typing import List, Optional, Dict
from datetime import datetime
from invoice_model import InvoiceInfo, InvoiceItem
class BaiduOCRConfig:
    """Baidu OCR configuration."""
    # Defaults target the free tier (QPS = 2)
    API_KEY = ""  # Must be supplied by the user
    SECRET_KEY = ""  # Must be supplied by the user
    # VAT invoice recognition endpoint
    INVOICE_URL = "https://aip.baidubce.com/rest/2.0/ocr/v1/vat_invoice"
    # General text recognition (high-accuracy) endpoint
    GENERAL_URL = "https://aip.baidubce.com/rest/2.0/ocr/v1/accurate_basic"
class BaiduInvoiceExtractor:
    """Invoice extractor backed by Baidu OCR."""
    def __init__(self, api_key: Optional[str] = None, secret_key: Optional[str] = None):
        """
        Initialize the Baidu OCR extractor.
        Args:
            api_key: Baidu AI Cloud API Key
            secret_key: Baidu AI Cloud Secret Key
        """
        self.api_key = api_key or BaiduOCRConfig.API_KEY
        self.secret_key = secret_key or BaiduOCRConfig.SECRET_KEY
        self.access_token = None
        if self.api_key and self.secret_key:
            self._get_access_token()
    def _get_access_token(self) -> bool:
        """Fetch a Baidu OCR access token."""
        try:
            url = "https://aip.baidubce.com/oauth/2.0/token"
            params = {
                "grant_type": "client_credentials",
                "client_id": self.api_key,
                "client_secret": self.secret_key
            }
            response = requests.post(url, params=params, timeout=10)
            result = response.json()
            if "access_token" in result:
                self.access_token = result["access_token"]
                print("[OK] Baidu OCR authentication successful")
                return True
            else:
                print(f"[FAIL] Baidu OCR authentication failed: {result.get('error_description', 'unknown error')}")
                return False
        except Exception as e:
            print(f"[FAIL] Failed to get access token: {e}")
            return False
    def extract_from_file(self, file_path: str) -> Optional[InvoiceInfo]:
        """
        Extract invoice information from a file.
        Args:
            file_path: Path to the file (image or PDF)
        Returns:
            An InvoiceInfo object, or None if extraction fails
        """
        file_path = Path(file_path)
        if not file_path.exists():
            print(f"[FAIL] File not found: {file_path}")
            return None
        if not self.access_token:
            print("[FAIL] Baidu OCR is not authenticated; configure the API Key and Secret Key")
            return None
        # Check the file type
        suffix = file_path.suffix.lower()
        if suffix == '.pdf':
            # PDF files must be converted to an image first
            return self._extract_from_pdf(file_path)
        elif suffix in ['.png', '.jpg', '.jpeg', '.bmp', '.tiff', '.tif']:
            # Image files are processed directly
            return self._extract_from_image(file_path)
        else:
            print(f"[FAIL] Unsupported file format: {suffix}")
            return None
    def _extract_from_image(self, image_path: Path) -> Optional[InvoiceInfo]:
        """Extract invoice information from an image."""
        try:
            with open(image_path, 'rb') as f:
                image_data = base64.b64encode(f.read()).decode('utf-8')
        except Exception as e:
            print(f"[FAIL] Failed to read image: {e}")
            return None
        # Try the VAT invoice recognition endpoint first
        invoice_info = self._recognize_vat_invoice(image_data)
        if invoice_info:
            invoice_info.source_file = str(image_path)
            invoice_info.extraction_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            return invoice_info
        # If invoice recognition fails, fall back to general text recognition
        print("Invoice recognition failed, trying general text recognition...")
        return self._recognize_general(image_data, image_path)
    def _extract_from_pdf(self, pdf_path: Path) -> Optional[InvoiceInfo]:
        """Extract invoice information from a PDF (converted to an image first)."""
        try:
            import fitz  # PyMuPDF
            from PIL import Image
            import io
            print("Converting PDF to image...")
            # Open the PDF
            doc = fitz.open(str(pdf_path))
            # Take the first page only
            page = doc[0]
            # Render the page to an image (2x zoom for better legibility)
            pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
            # Convert to a PIL Image
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            # Save to an in-memory buffer
            img_buffer = io.BytesIO()
            img.save(img_buffer, format='PNG')
            img_buffer.seek(0)
            # Encode as base64
            image_data = base64.b64encode(img_buffer.read()).decode('utf-8')
            doc.close()
            # Try the VAT invoice recognition endpoint first
            invoice_info = self._recognize_vat_invoice(image_data)
            if invoice_info:
                invoice_info.source_file = str(pdf_path)
                invoice_info.extraction_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                return invoice_info
            # If invoice recognition fails, fall back to general text recognition
            print("Invoice recognition failed, trying general text recognition...")
            return self._recognize_general(image_data, pdf_path)
        except Exception as e:
            print(f"[FAIL] PDF processing failed: {e}")
            return None
    def _recognize_vat_invoice(self, image_data: str) -> Optional[InvoiceInfo]:
        """Call the VAT invoice recognition endpoint."""
        try:
            url = f"{BaiduOCRConfig.INVOICE_URL}?access_token={self.access_token}"
            headers = {'Content-Type': 'application/x-www-form-urlencoded'}
            params = {"image": image_data}
            response = requests.post(url, data=params, headers=headers, timeout=30)
            result = response.json()
            if "words_result" in result:
                return self._parse_vat_invoice_result(result["words_result"])
            elif "error_code" in result:
                print(f"[WARN] Invoice recognition endpoint error: {result.get('error_msg', 'unknown error')}")
                return None
            else:
                return None
        except Exception as e:
            print(f"[FAIL] Invoice recognition request failed: {e}")
            return None
    def _parse_vat_invoice_result(self, result: Dict) -> Optional[InvoiceInfo]:
        """Parse the VAT invoice recognition result."""
        invoice = InvoiceInfo()
        # Basic information
        invoice.invoice_code = result.get("InvoiceCode", "")
        invoice.invoice_number = result.get("InvoiceNum", "")
        invoice.invoice_date = result.get("InvoiceDate", "")
        invoice.invoice_type = result.get("InvoiceType", "")
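        # Buyer information
        # Baidu's vat_invoice documentation lists the buyer fields as Purchaser*;
        # the original Buyer* keys are kept as a fallback.
        buyer_info = result.get("PurchaserName", result.get("BuyerName", {}))
        if isinstance(buyer_info, dict):
            invoice.buyer_name = buyer_info.get("word", "")
        else:
            invoice.buyer_name = str(buyer_info)
        invoice.buyer_tax_number = result.get("PurchaserRegisterNum", result.get("BuyerRegisterNum", ""))
        invoice.buyer_address = result.get("PurchaserAddress", result.get("BuyerAddress", ""))
        invoice.buyer_bank = result.get("PurchaserBank", result.get("BuyerBank", ""))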
# 销售方信息
seller_info = result.get("SellerName", {})
if isinstance(seller_info, dict):
invoice.seller_name = seller_info.get("word", "")
else:
invoice.seller_name = str(seller_info)
invoice.seller_tax_number = result.get("SellerRegisterNum", "")
invoice.seller_address = result.get("SellerAddress", "")
invoice.seller_bank = result.get("SellerBank", "")
# 金额信息
invoice.total_amount = self._parse_amount(result.get("TotalAmount", "0"))
invoice.total_tax_amount = self._parse_amount(result.get("TotalTax", "0"))
invoice.total_amount_with_tax = self._parse_amount(result.get("AmountInFiguers", "0"))
        # Other fields
invoice.remarks = result.get("Remarks", "")
invoice.checker = result.get("Checker", "")
invoice.payee = result.get("Payee", "")
invoice.issuer = result.get("NoteDrawer", "")
        # Line items: each Commodity* field is a list, aligned by row index
commodity_names = result.get("CommodityName", [])
commodity_nums = result.get("CommodityNum", [])
commodity_prices = result.get("CommodityPrice", [])
commodity_amounts = result.get("CommodityAmount", [])
commodity_tax_rates = result.get("CommodityTaxRate", [])
commodity_tax_amounts = result.get("CommodityTax", [])
for i in range(len(commodity_names)):
item = InvoiceItem()
name_info = commodity_names[i]
item.name = name_info.get("word", "") if isinstance(name_info, dict) else str(name_info)
if i < len(commodity_nums):
num_info = commodity_nums[i]
item.quantity = self._parse_amount(num_info.get("word", "0") if isinstance(num_info, dict) else str(num_info))
if i < len(commodity_prices):
price_info = commodity_prices[i]
item.unit_price = self._parse_amount(price_info.get("word", "0") if isinstance(price_info, dict) else str(price_info))
if i < len(commodity_amounts):
amount_info = commodity_amounts[i]
item.amount = self._parse_amount(amount_info.get("word", "0") if isinstance(amount_info, dict) else str(amount_info))
if i < len(commodity_tax_rates):
rate_info = commodity_tax_rates[i]
item.tax_rate = rate_info.get("word", "") if isinstance(rate_info, dict) else str(rate_info)
if i < len(commodity_tax_amounts):
tax_info = commodity_tax_amounts[i]
item.tax_amount = self._parse_amount(tax_info.get("word", "0") if isinstance(tax_info, dict) else str(tax_info))
if item.name:
invoice.items.append(item)
return invoice
    def _recognize_general(self, image_data: str, file_path: Path) -> Optional[InvoiceInfo]:
        """Fallback: run general text recognition (high-accuracy version) and parse the text locally."""
        try:
            url = f"{BaiduOCRConfig.GENERAL_URL}?access_token={self.access_token}"
            headers = {'Content-Type': 'application/x-www-form-urlencoded'}
            params = {
                "image": image_data,
                "detect_direction": "true",
                "probability": "false"
            }
            response = requests.post(url, data=params, headers=headers, timeout=30)
            result = response.json()
            if "words_result" in result:
                text_lines = [item["words"] for item in result["words_result"]]
                # Reuse the local rule-based parser. __new__ creates the object without
                # running __init__, so the local PaddleOCR engine is never initialized here.
                from invoice_extractor import InvoiceExtractor
                local_extractor = InvoiceExtractor.__new__(InvoiceExtractor)
                invoice = local_extractor._parse_invoice_text(text_lines)
                invoice.source_file = str(file_path)
                invoice.extraction_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                return invoice
            else:
                print(f"[FAIL] General recognition failed: {result.get('error_msg', 'unknown error')}")
                return None
        except Exception as e:
            print(f"[FAIL] General recognition request failed: {e}")
            return None
    def _parse_amount(self, value) -> float:
        """Parse an amount such as "¥1,234.50" into a float; return 0.0 if it cannot be parsed."""
        if isinstance(value, (int, float)):
            return float(value)
        if isinstance(value, str):
            # Strip currency symbols (¥, full-width ¥, 元) and ASCII or full-width commas
            for symbol in ('¥', '¥', '元', ',', ','):
                value = value.replace(symbol, '')
            try:
                return float(value.strip())
            except ValueError:
                return 0.0
        return 0.0
def extract_invoices_with_baidu(directory: str, api_key: Optional[str] = None, secret_key: Optional[str] = None) -> List[InvoiceInfo]:
    """
    Extract invoice information from every supported file in a directory using Baidu OCR.
    Args:
        directory: Path of the directory that contains the invoice files
        api_key: Baidu Cloud API Key
        secret_key: Baidu Cloud Secret Key
    Returns:
        List of InvoiceInfo objects
    """
    directory = Path(directory)
    if not directory.exists():
        print(f"[FAIL] Directory does not exist: {directory}")
        return []
    # Supported file extensions
    extensions = {'.png', '.jpg', '.jpeg', '.bmp', '.tiff', '.tif', '.pdf'}
    # Collect supported files (top level only; subdirectories are not scanned)
    files = [f for f in directory.iterdir() if f.is_file() and f.suffix.lower() in extensions]
    if not files:
        print(f"[FAIL] No invoice files found in directory: {directory}")
        return []
    print(f"Found {len(files)} file(s) to process")
    # Initialize the Baidu OCR extractor
    extractor = BaiduInvoiceExtractor(api_key, secret_key)
    if not extractor.access_token:
        print("[FAIL] Baidu OCR authentication failed; check the API Key and Secret Key")
        return []
    invoices = []
    for i, file_path in enumerate(files, 1):
        print(f"\n[{i}/{len(files)}] Processing: {file_path.name}")
        invoice = extractor.extract_from_file(str(file_path))
        if invoice:
            invoices.append(invoice)
            print(f"[OK] Extracted: {invoice.invoice_number or 'number not recognized'}")
        else:
            print("[FAIL] Extraction failed")
    return invoices
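
The line-item parsing in `_parse_vat_invoice_result` can be illustrated in isolation. The following is a minimal sketch, not the project's implementation: it assumes each `Commodity*` field is a list whose entries are either `{"word": ...}` dicts or plain values, as the parser above handles. The helper names (`unwrap_word`, `to_float`, `align_line_items`) and the sample payload are hypothetical.

```python
from itertools import zip_longest
from typing import Any, Dict, List


def unwrap_word(entry: Any, default: str = "") -> str:
    """Return the text of one entry, which may be a {"word": ...} dict or a plain value."""
    if isinstance(entry, dict):
        return str(entry.get("word", default))
    return default if entry is None else str(entry)


def to_float(text: str) -> float:
    """Parse a number, tolerating currency symbols and thousands separators."""
    cleaned = text
    for symbol in ("¥", "¥", "元", ",", ","):
        cleaned = cleaned.replace(symbol, "")
    try:
        return float(cleaned.strip())
    except ValueError:
        return 0.0


def align_line_items(words_result: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Zip the per-column Commodity* lists into one dict per invoice row."""
    columns = {
        "name": words_result.get("CommodityName", []),
        "quantity": words_result.get("CommodityNum", []),
        "amount": words_result.get("CommodityAmount", []),
    }
    rows = []
    # zip_longest pads shorter columns with None, so a missing cell becomes 0.0
    for name, quantity, amount in zip_longest(*columns.values()):
        item_name = unwrap_word(name)
        if not item_name:
            continue  # rows without a name are skipped, as in the parser above
        rows.append({
            "name": item_name,
            "quantity": to_float(unwrap_word(quantity, "0")),
            "amount": to_float(unwrap_word(amount, "0")),
        })
    return rows


# Hypothetical response fragment, for illustration only
sample = {
    "CommodityName": [{"row": "1", "word": "办公用品"}, {"row": "2", "word": "打印纸"}],
    "CommodityNum": [{"row": "1", "word": "2"}],
    "CommodityAmount": [{"row": "1", "word": "1,200.00"}, {"row": "2", "word": "¥35.50"}],
}
items = align_line_items(sample)
```

Padding with `zip_longest` replaces the repeated `if i < len(...)` checks; whether that trade is worthwhile depends on how many columns are aligned.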
FILE:src/config.py
"""
Configuration
Stores the API keys and other settings
"""
import os
from pathlib import Path
class Config:
    """Application settings"""
    # Baidu OCR credentials
    # Get a free API key from https://cloud.baidu.com/product/ocr
    BAIDU_API_KEY = os.getenv("BAIDU_API_KEY", "")
    BAIDU_SECRET_KEY = os.getenv("BAIDU_SECRET_KEY", "")
    # Input and output directories
    INPUT_DIR = "fp"
    OUTPUT_DIR = "output"
    # OCR engine: "baidu" or "local"
    OCR_ENGINE = "baidu"
@classmethod
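    def load_from_file(cls, config_file: str = None):
        """Load settings from a KEY=VALUE config file; values in the file override environment variables."""
        if config_file is None:
            # Try several candidate locations and use the first one that exists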
possible_paths = [
Path("config.txt"),
Path(__file__).parent.parent / "config.txt",
Path.cwd() / "config.txt",
]
for path in possible_paths:
if path.exists():
config_path = path
break
else:
config_path = Path("config.txt")
else:
config_path = Path(config_file)
if config_path.exists():
with open(config_path, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip()
if line and '=' in line and not line.startswith('#'):
key, value = line.split('=', 1)
key = key.strip()
value = value.strip()
if key == "BAIDU_API_KEY":
cls.BAIDU_API_KEY = value
elif key == "BAIDU_SECRET_KEY":
cls.BAIDU_SECRET_KEY = value
elif key == "OCR_ENGINE":
cls.OCR_ENGINE = value
@classmethod
    def save_to_file(cls, config_file: str = "config.txt"):
        """Save the current settings to a config file."""
        config_path = Path(config_file)
        with open(config_path, 'w', encoding='utf-8') as f:
            f.write("# Invoice extractor configuration\n")
            f.write("# Get a free API key from https://cloud.baidu.com/product/ocr\n\n")
f.write(f"BAIDU_API_KEY={cls.BAIDU_API_KEY}\n")
f.write(f"BAIDU_SECRET_KEY={cls.BAIDU_SECRET_KEY}\n")
f.write(f"OCR_ENGINE={cls.OCR_ENGINE}\n")
def setup_config():
    """Interactive setup wizard"""
    print("\n" + "=" * 60)
    print("Invoice Extractor - Setup Wizard")
    print("=" * 60)
    print("\nThis tool supports two OCR engines:")
    print("1. Baidu OCR API (recommended) - higher accuracy, with a free quota")
    print("2. Local OCR - fully offline, but less accurate")
    print("\nTo use the Baidu OCR API:")
    print("   1. Visit https://cloud.baidu.com/product/ocr")
    print("   2. Register and log in to a Baidu Cloud account")
    print("   3. Create an application to get an API Key and a Secret Key")
    print("   4. Free quota: 50000 requests/day (enough for personal use)")
    choice = input("\nConfigure Baidu OCR now? (y/n): ").strip().lower()
    if choice == 'y':
        api_key = input("Enter the Baidu API Key: ").strip()
        secret_key = input("Enter the Baidu Secret Key: ").strip()
        Config.BAIDU_API_KEY = api_key
        Config.BAIDU_SECRET_KEY = secret_key
        Config.OCR_ENGINE = "baidu"
        # Save the settings
        Config.save_to_file()
        print("\n[OK] Settings saved to: config.txt")
        return True
    else:
        print("\nThe local OCR engine will be used (lower accuracy)")
        Config.OCR_ENGINE = "local"
        Config.save_to_file()
        return False
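
The `KEY=VALUE` parsing in `Config.load_from_file` can be tried without touching the file system. The following is a minimal sketch under the same rules (skip blank lines and `#` comments, split on the first `=`, strip whitespace). The function name `parse_config_text` and the sample values are hypothetical.

```python
from typing import Dict


def parse_config_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, ignoring blank lines and lines that start with '#'."""
    settings: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Placeholder values, for illustration only
example = """
# Invoice extractor configuration
BAIDU_API_KEY = abc123
BAIDU_SECRET_KEY=s3cr=et
OCR_ENGINE=baidu
"""
settings = parse_config_text(example)
```

Returning a plain dict keeps the parsing step separate from the decision about which keys the application accepts.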
FILE:src/excel_exporter.py
"""
Excel export module
Exports invoice information to an Excel file
"""
import pandas as pd
from pathlib import Path
from typing import List, Optional
from datetime import datetime
from openpyxl import load_workbook
from openpyxl.styles import Font, Alignment, Border, Side, PatternFill
from openpyxl.utils.dataframe import dataframe_to_rows
from invoice_model import InvoiceInfo
class ExcelExporter:
"""发票信息Excel导出器"""
def __init__(self):
self.default_columns = [
"发票代码",
"发票号码",
"开票日期",
"发票类型",
"购买方名称",
"购买方税号",
"购买方地址电话",
"购买方开户行",
"销售方名称",
"销售方税号",
"销售方地址电话",
"销售方开户行",
"合计金额",
"合计税额",
"价税合计",
"商品明细摘要",
"备注",
"复核人",
"收款人",
"开票人",
"源文件",
"提取时间",
]
def export_to_excel(
self,
invoices: List[InvoiceInfo],
output_path: str,
sheet_name: str = "发票信息",
include_items: bool = False
) -> bool:
"""
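        Export invoice information to Excel.
        Args:
            invoices: List of invoice records
            output_path: Path of the Excel file to write
            sheet_name: Name of the main worksheet
            include_items: Whether to add a worksheet with the line items
        Returns:
            True if the export succeeded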
"""
if not invoices:
print("[FAIL] 没有发票信息可导出")
return False
try:
output_path = Path(output_path)
output_path.parent.mkdir(parents=True, exist_ok=True)
# 准备数据
data = []
for invoice in invoices:
row = invoice.to_dict()
# 添加商品明细摘要
row["商品明细摘要"] = invoice.get_items_summary()
data.append(row)
# 创建DataFrame
df = pd.DataFrame(data)
# 确保列顺序
columns = [col for col in self.default_columns if col in df.columns]
df = df[columns]
# 导出到Excel
with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
df.to_excel(writer, sheet_name=sheet_name, index=False)
# 获取工作簿和工作表
workbook = writer.book
worksheet = writer.sheets[sheet_name]
# 格式化Excel
self._format_worksheet(worksheet, len(df))
# 如果包含商品明细,添加明细工作表
if include_items:
self._add_items_sheet(writer, invoices)
print(f"[OK] 成功导出 {len(invoices)} 条发票信息到: {output_path}")
return True
except Exception as e:
print(f"[FAIL] Excel导出失败: {e}")
return False
def _format_worksheet(self, worksheet, row_count: int):
"""格式化工作表"""
# 定义样式
header_font = Font(name='微软雅黑', size=11, bold=True, color='FFFFFF')
header_fill = PatternFill(start_color='4472C4', end_color='4472C4', fill_type='solid')
header_alignment = Alignment(horizontal='center', vertical='center', wrap_text=True)
cell_font = Font(name='微软雅黑', size=10)
cell_alignment = Alignment(horizontal='left', vertical='center', wrap_text=False)
number_alignment = Alignment(horizontal='right', vertical='center')
border = Border(
left=Side(style='thin', color='000000'),
right=Side(style='thin', color='000000'),
top=Side(style='thin', color='000000'),
bottom=Side(style='thin', color='000000')
)
# 格式化表头
for cell in worksheet[1]:
cell.font = header_font
cell.fill = header_fill
cell.alignment = header_alignment
cell.border = border
# 格式化数据行
for row in worksheet.iter_rows(min_row=2, max_row=row_count + 1):
for cell in row:
cell.font = cell_font
cell.border = border
# 根据列名设置对齐方式
col_name = worksheet.cell(1, cell.column).value
if col_name in ['合计金额', '合计税额', '价税合计']:
cell.alignment = number_alignment
                    # Thousands-separated number with two decimals (no currency symbol)
if isinstance(cell.value, (int, float)):
cell.number_format = '#,##0.00'
else:
cell.alignment = cell_alignment
# 调整列宽
self._adjust_column_widths(worksheet)
# 冻结首行
worksheet.freeze_panes = 'A2'
def _adjust_column_widths(self, worksheet):
"""自动调整列宽"""
column_widths = {
"发票代码": 15,
"发票号码": 15,
"开票日期": 12,
"发票类型": 12,
"购买方名称": 30,
"购买方税号": 20,
"购买方地址电话": 35,
"购买方开户行": 35,
"销售方名称": 30,
"销售方税号": 20,
"销售方地址电话": 35,
"销售方开户行": 35,
"合计金额": 12,
"合计税额": 12,
"价税合计": 12,
"商品明细摘要": 40,
"备注": 30,
"复核人": 10,
"收款人": 10,
"开票人": 10,
"源文件": 40,
"提取时间": 18,
}
for col_idx, cell in enumerate(worksheet[1], 1):
col_name = cell.value
if col_name in column_widths:
worksheet.column_dimensions[cell.column_letter].width = column_widths[col_name]
else:
# 默认宽度
worksheet.column_dimensions[cell.column_letter].width = 15
def _add_items_sheet(self, writer, invoices: List[InvoiceInfo]):
"""添加商品明细工作表"""
items_data = []
for invoice in invoices:
for item in invoice.items:
items_data.append({
"发票代码": invoice.invoice_code,
"发票号码": invoice.invoice_number,
"开票日期": invoice.invoice_date,
"商品名称": item.name,
"规格型号": item.specification,
"单位": item.unit,
"数量": item.quantity,
"单价": item.unit_price,
"金额": item.amount,
"税率": item.tax_rate,
"税额": item.tax_amount,
})
if items_data:
df_items = pd.DataFrame(items_data)
df_items.to_excel(writer, sheet_name="商品明细", index=False)
# 格式化商品明细表
worksheet = writer.sheets["商品明细"]
self._format_worksheet(worksheet, len(df_items))
def export_summary(
self,
invoices: List[InvoiceInfo],
output_path: str
) -> bool:
"""
        Export summary statistics for the invoices.
        Args:
            invoices: List of invoice records
            output_path: Path of the output file
        Returns:
            True if the export succeeded
"""
try:
# 计算统计数据
total_count = len(invoices)
total_amount = sum(inv.total_amount for inv in invoices)
total_tax = sum(inv.total_tax_amount for inv in invoices)
total_with_tax = sum(inv.total_amount_with_tax for inv in invoices)
# 按销售方统计
seller_stats = {}
for inv in invoices:
seller = inv.seller_name or "未知"
if seller not in seller_stats:
seller_stats[seller] = {
"发票数量": 0,
"合计金额": 0.0,
"合计税额": 0.0,
"价税合计": 0.0,
}
seller_stats[seller]["发票数量"] += 1
seller_stats[seller]["合计金额"] += inv.total_amount
seller_stats[seller]["合计税额"] += inv.total_tax_amount
seller_stats[seller]["价税合计"] += inv.total_amount_with_tax
# 创建汇总数据
summary_data = [
["统计项目", "数值"],
["发票总数", total_count],
["合计金额", total_amount],
["合计税额", total_tax],
["价税合计", total_with_tax],
["", ""],
["销售方", "发票数量", "合计金额", "合计税额", "价税合计"],
]
for seller, stats in seller_stats.items():
summary_data.append([
seller,
stats["发票数量"],
stats["合计金额"],
stats["合计税额"],
stats["价税合计"],
])
# 导出到Excel
df_summary = pd.DataFrame(summary_data)
df_summary.to_excel(output_path, sheet_name="汇总统计", index=False, header=False)
# 格式化
workbook = load_workbook(output_path)
worksheet = workbook["汇总统计"]
# 设置标题样式
title_font = Font(name='微软雅黑', size=12, bold=True)
title_fill = PatternFill(start_color='70AD47', end_color='70AD47', fill_type='solid')
for cell in worksheet[1]:
cell.font = title_font
cell.fill = title_fill
cell.alignment = Alignment(horizontal='center', vertical='center')
            # Number format for the three amount rows (rows 3-5); row 2 holds the invoice count
            for row in worksheet.iter_rows(min_row=3, max_row=5):
                if len(row) > 1:
                    row[1].number_format = '#,##0.00'
workbook.save(output_path)
print(f"[OK] 成功导出汇总统计到: {output_path}")
return True
except Exception as e:
print(f"[FAIL] 汇总统计导出失败: {e}")
return False
def export_invoices(
invoices: List[InvoiceInfo],
output_dir: str = "output",
filename_prefix: str = "发票信息"
) -> bool:
"""
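    Convenience function: export invoice information to a timestamped Excel file.
    Args:
        invoices: List of invoice records
        output_dir: Output directory
        filename_prefix: Prefix of the output file name
    Returns:
        True if the export succeeded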
"""
output_dir = Path(output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
# 导出主表
main_file = output_dir / f"{filename_prefix}_{timestamp}.xlsx"
exporter = ExcelExporter()
success = exporter.export_to_excel(
invoices,
str(main_file),
include_items=True
)
if success:
print(f"\n导出文件列表:")
print(f" - 主表: {main_file}")
return success
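
The per-seller totals that `export_summary` builds can be sketched without pandas or openpyxl. This is an illustration under stated assumptions, not the exporter itself: `InvoiceTotals` is a hypothetical stand-in for `InvoiceInfo` that carries only the four fields used here, and the English result keys are chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class InvoiceTotals:
    """Hypothetical stand-in for InvoiceInfo with only the fields used here."""
    seller_name: str = ""
    total_amount: float = 0.0
    total_tax_amount: float = 0.0
    total_amount_with_tax: float = 0.0


def summarize_by_seller(invoices: Iterable[InvoiceTotals]) -> Dict[str, Dict[str, float]]:
    """Count invoices and sum the three amount fields per seller."""
    stats: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"count": 0, "amount": 0.0, "tax": 0.0, "with_tax": 0.0}
    )
    for inv in invoices:
        entry = stats[inv.seller_name or "未知"]  # same fallback label as export_summary
        entry["count"] += 1
        entry["amount"] += inv.total_amount
        entry["tax"] += inv.total_tax_amount
        entry["with_tax"] += inv.total_amount_with_tax
    return dict(stats)


summary = summarize_by_seller([
    InvoiceTotals("Acme Ltd", 100.0, 13.0, 113.0),
    InvoiceTotals("Acme Ltd", 50.0, 6.5, 56.5),
    InvoiceTotals("", 10.0, 1.0, 11.0),
])
```

Invoices with an empty seller name are grouped under one fallback label, so a high count there usually signals recognition failures rather than a real seller.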
FILE:src/invoice_extractor.py
"""
Invoice information extractor
Uses OCR to extract invoice information from images and PDF files
"""
import re
import os
from pathlib import Path
from typing import List, Optional, Tuple
from datetime import datetime
import fitz # PyMuPDF
from PIL import Image
import numpy as np
from invoice_model import InvoiceInfo, InvoiceItem
class InvoiceExtractor:
"""发票信息提取器"""
def __init__(self):
self.ocr = None
self._init_ocr()
def _init_ocr(self):
"""初始化OCR引擎"""
try:
from paddleocr import PaddleOCR
# 使用中文模型
self.ocr = PaddleOCR(
lang='ch'
)
print("[OK] OCR引擎初始化成功")
except Exception as e:
print(f"[FAIL] OCR引擎初始化失败: {e}")
raise
def extract_from_file(self, file_path: str) -> Optional[InvoiceInfo]:
"""
        Extract invoice information from a file.
        Args:
            file_path: Path of the file (image or PDF)
        Returns:
            An InvoiceInfo object, or None if extraction failed
"""
file_path = Path(file_path)
if not file_path.exists():
print(f"[FAIL] 文件不存在: {file_path}")
return None
# 根据文件类型选择提取方式
suffix = file_path.suffix.lower()
if suffix in ['.png', '.jpg', '.jpeg', '.bmp', '.tiff', '.tif']:
return self._extract_from_image(file_path)
elif suffix == '.pdf':
return self._extract_from_pdf(file_path)
else:
print(f"[FAIL] 不支持的文件格式: {suffix}")
return None
def _extract_from_image(self, image_path: Path) -> Optional[InvoiceInfo]:
"""从图片中提取发票信息"""
try:
# 使用OCR识别文字
result = self.ocr.ocr(str(image_path))
if not result or not result[0]:
print(f"[FAIL] 未能从图片中识别到文字: {image_path}")
return None
# 提取所有文本行
text_lines = []
for line in result[0]:
if line:
text = line[1][0] # 获取识别的文本
confidence = line[1][1] # 获取置信度
if confidence > 0.5: # 过滤低置信度结果
text_lines.append(text)
# 解析发票信息
invoice_info = self._parse_invoice_text(text_lines)
invoice_info.source_file = str(image_path)
invoice_info.extraction_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
return invoice_info
except Exception as e:
print(f"[FAIL] 图片OCR处理失败: {e}")
# 尝试使用PDF方式处理图片(将图片转为PDF再处理)
return self._extract_image_as_pdf(image_path)
def _extract_from_pdf(self, pdf_path: Path) -> Optional[InvoiceInfo]:
"""从PDF中提取发票信息"""
try:
# 打开PDF文件
doc = fitz.open(str(pdf_path))
all_text_lines = []
# 遍历每一页
for page_num in range(len(doc)):
page = doc[page_num]
# 首先尝试直接提取文本
text = page.get_text()
if text.strip():
text_lines = [line.strip() for line in text.split('\n') if line.strip()]
all_text_lines.extend(text_lines)
else:
# 如果没有文本,将页面转为图片进行OCR
pix = page.get_pixmap(matrix=fitz.Matrix(2, 2)) # 2倍分辨率
img_data = pix.tobytes("png")
# 保存临时图片
temp_img_path = Path(".temp/cache") / f"pdf_page_{page_num}.png"
temp_img_path.parent.mkdir(parents=True, exist_ok=True)
with open(temp_img_path, 'wb') as f:
f.write(img_data)
# OCR识别
result = self.ocr.ocr(str(temp_img_path))
if result and result[0]:
for line in result[0]:
if line:
text = line[1][0]
confidence = line[1][1]
if confidence > 0.5:
all_text_lines.append(text)
# 清理临时文件
temp_img_path.unlink(missing_ok=True)
doc.close()
if not all_text_lines:
print(f"[FAIL] 未能从PDF中识别到文字: {pdf_path}")
return None
# 解析发票信息
invoice_info = self._parse_invoice_text(all_text_lines)
invoice_info.source_file = str(pdf_path)
invoice_info.extraction_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
return invoice_info
except Exception as e:
print(f"[FAIL] PDF处理失败: {e}")
return None
def _parse_invoice_text(self, text_lines: List[str]) -> InvoiceInfo:
"""
        Parse invoice information from OCR text.
        Args:
            text_lines: List of recognized text lines
        Returns:
            An InvoiceInfo object
"""
invoice = InvoiceInfo()
full_text = '\n'.join(text_lines)
# 提取发票代码(10-12位数字)
invoice.invoice_code = self._extract_invoice_code(full_text)
# 提取发票号码(8-20位数字)
invoice.invoice_number = self._extract_invoice_number(full_text)
# 提取开票日期
invoice.invoice_date = self._extract_invoice_date(full_text)
# 提取购买方信息
invoice.buyer_name = self._extract_buyer_name(text_lines, full_text)
invoice.buyer_tax_number = self._extract_buyer_tax_number(text_lines, full_text)
invoice.buyer_address = self._extract_buyer_address(text_lines, full_text)
invoice.buyer_bank = self._extract_buyer_bank(text_lines, full_text)
# 提取销售方信息
invoice.seller_name = self._extract_seller_name(text_lines, full_text)
invoice.seller_tax_number = self._extract_seller_tax_number(text_lines, full_text)
invoice.seller_address = self._extract_seller_address(text_lines, full_text)
invoice.seller_bank = self._extract_seller_bank(text_lines, full_text)
# 提取金额信息
invoice.total_amount, invoice.total_tax_amount, invoice.total_amount_with_tax = \
self._extract_amounts(text_lines, full_text)
# 提取其他信息
invoice.remarks = self._extract_remarks(text_lines, full_text)
invoice.checker = self._extract_person(text_lines, "复核")
invoice.payee = self._extract_person(text_lines, "收款")
invoice.issuer = self._extract_person(text_lines, "开票")
# 提取商品明细
invoice.items = self._extract_items(text_lines)
return invoice
    def _extract_invoice_code(self, text: str) -> str:
        """Extract the invoice code (10 to 12 digits)."""
        # Prefer an explicitly labelled code; the separator may be an ASCII or a full-width colon
        patterns = [
            r'发票代码[::\s]*(\d{10,12})',
            r'代码[::\s]*(\d{10,12})',
        ]
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                return match.group(1)
        return ""
    def _extract_invoice_number(self, text: str) -> str:
        """Extract the invoice number (8 to 20 digits)."""
        patterns = [
            r'发票号码[::\s]*(\d{8,20})',
            r'号码[::\s]*(\d{8,20})',
            r'No[.::\s]*(\d{8,20})',
        ]
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                return match.group(1)
        return ""
    def _extract_invoice_date(self, text: str) -> str:
        """Extract the issue date, preferring a labelled date over the first date found in the text."""
        date = r'(\d{4}[年/-]\d{1,2}[月/-]\d{1,2}日?)'
        patterns = [
            r'开票日期[::\s]*' + date,
            r'日期[::\s]*' + date,
            date,  # unlabelled fallback, tried last so it cannot shadow the labelled patterns
        ]
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                return match.group(1)
        return ""
def _extract_buyer_name(self, text_lines: List[str], full_text: str) -> str:
"""提取购买方名称"""
# 查找购买方信息区域
buyer_section_start = -1
buyer_section_end = -1
for i, line in enumerate(text_lines):
if '购买方' in line or '购方' in line:
buyer_section_start = i
if buyer_section_start >= 0 and ('销售方' in line or '销方' in line):
buyer_section_end = i
break
if buyer_section_start >= 0:
end = buyer_section_end if buyer_section_end > 0 else min(buyer_section_start + 5, len(text_lines))
for i in range(buyer_section_start, end):
name = self._extract_company_name(text_lines[i])
if name:
return name
return ""
def _extract_seller_name(self, text_lines: List[str], full_text: str) -> str:
"""提取销售方名称"""
# 查找销售方信息区域
seller_section_start = -1
for i, line in enumerate(text_lines):
if '销售方' in line or '销方' in line:
seller_section_start = i
break
if seller_section_start >= 0:
for i in range(seller_section_start, min(seller_section_start + 5, len(text_lines))):
name = self._extract_company_name(text_lines[i])
if name:
return name
return ""
def _extract_company_name(self, text: str) -> str:
"""从文本中提取公司名称"""
# 匹配公司名称模式
patterns = [
r'([^\d\s]{2,}(?:公司|企业|集团|厂|店|部|中心|研究院|事务所|商行|经营部))',
]
for pattern in patterns:
match = re.search(pattern, text)
if match:
name = match.group(1).strip()
# 过滤掉太短的或包含特定关键词的
if len(name) >= 4 and not any(kw in name for kw in ['购买方', '销售方', '名称', '纳税人']):
return name
return ""
def _extract_buyer_tax_number(self, text_lines: List[str], full_text: str) -> str:
"""提取购买方纳税人识别号"""
return self._extract_tax_number_in_section(text_lines, ['购买方', '购方'], ['销售方', '销方'])
def _extract_seller_tax_number(self, text_lines: List[str], full_text: str) -> str:
"""提取销售方纳税人识别号"""
return self._extract_tax_number_in_section(text_lines, ['销售方', '销方'], ['合计', '价税合计'])
def _extract_tax_number_in_section(self, text_lines: List[str], start_markers: List[str], end_markers: List[str]) -> str:
"""在指定区域内提取纳税人识别号"""
collecting = False
for line in text_lines:
# 检查是否进入目标区域
if any(marker in line for marker in start_markers):
collecting = True
continue
# 检查是否离开目标区域
if collecting and any(marker in line for marker in end_markers):
break
if collecting:
# 纳税人识别号通常是18位数字字母组合,或15-20位
match = re.search(r'[A-Z0-9]{15,20}', line.replace(' ', ''))
if match:
code = match.group(0)
# 验证是否包含字母和数字
if any(c.isalpha() for c in code) or len(code) >= 15:
return code
return ""
def _extract_buyer_address(self, text_lines: List[str], full_text: str) -> str:
"""提取购买方地址电话"""
return self._extract_address_in_section(text_lines, ['购买方', '购方'], ['销售方', '销方'])
def _extract_seller_address(self, text_lines: List[str], full_text: str) -> str:
"""提取销售方地址电话"""
return self._extract_address_in_section(text_lines, ['销售方', '销方'], ['合计', '价税合计'])
def _extract_address_in_section(self, text_lines: List[str], start_markers: List[str], end_markers: List[str]) -> str:
"""在指定区域内提取地址电话"""
collecting = False
for line in text_lines:
if any(marker in line for marker in start_markers):
collecting = True
continue
if collecting and any(marker in line for marker in end_markers):
break
if collecting:
# 匹配地址+电话的模式
# 地址通常包含省市区、路街等
match = re.search(r'([^\d]{3,}.*?\d{3,4}-?\d{7,8})', line)
if match:
return match.group(1).strip()
# 也可能只有地址
match = re.search(r'([^\d]{5,}(?:省|市|区|县|路|街|号))', line)
if match:
return match.group(1).strip()
return ""
def _extract_buyer_bank(self, text_lines: List[str], full_text: str) -> str:
"""提取购买方开户行及账号"""
return self._extract_bank_in_section(text_lines, ['购买方', '购方'], ['销售方', '销方'])
def _extract_seller_bank(self, text_lines: List[str], full_text: str) -> str:
"""提取销售方开户行及账号"""
return self._extract_bank_in_section(text_lines, ['销售方', '销方'], ['合计', '价税合计'])
def _extract_bank_in_section(self, text_lines: List[str], start_markers: List[str], end_markers: List[str]) -> str:
"""在指定区域内提取开户行及账号"""
collecting = False
for line in text_lines:
if any(marker in line for marker in start_markers):
collecting = True
continue
if collecting and any(marker in line for marker in end_markers):
break
if collecting:
# 匹配银行+账号的模式
match = re.search(r'((?:中国|工商|农业|建设|交通|招商|光大|中信|民生|浦发|平安|华夏|兴业|广发|北京|上海|广州|深圳)?(?:银行|支行)[^\d]*\d{10,})', line)
if match:
return match.group(1).strip()
return ""
    def _extract_amounts(self, text_lines: List[str], full_text: str) -> Tuple[float, float, float]:
        """Extract the pre-tax total, the total tax, and the tax-inclusive total."""
        total_amount = 0.0
        total_tax = 0.0
        total_with_tax = 0.0
        for line in text_lines:
            # Drop ASCII and full-width thousands separators before matching
            line = line.replace(',', '').replace(',', '')
            # Pre-tax total
            if '合计金额' in line or ('合计' in line and '税额' not in line and '价税' not in line):
                match = re.search(r'[¥¥]?\s*(\d+\.\d{2})', line)
                if match:
                    total_amount = float(match.group(1))
            # Total tax ('合计税额' also contains '税额', so one test covers both)
            if '税额' in line:
                match = re.search(r'[¥¥]?\s*(\d+\.\d{2})', line)
                if match:
                    val = float(match.group(1))
                    if val != total_amount:  # avoid storing the pre-tax total twice
                        total_tax = val
            # Tax-inclusive total
            if '价税合计' in line or '小写' in line:
                # Prefer an amount that directly follows a ¥ sign
                match = re.search(r'[¥¥]\s*(\d+\.\d{2})', line)
                if match:
                    total_with_tax = float(match.group(1))
                else:
                    match = re.search(r'(\d+\.\d{2})', line)
                    if match:
                        total_with_tax = float(match.group(1))
        return total_amount, total_tax, total_with_tax
def _extract_remarks(self, text_lines: List[str], full_text: str) -> str:
"""提取备注"""
for i, line in enumerate(text_lines):
if '备注' in line:
# 如果备注在同一行
if len(line) > 5 and not line.strip() == '备注':
                    remark = re.sub(r'^.*?备注[::\s]*', '', line).strip()
if remark:
return remark
# 如果备注在下一行
elif i + 1 < len(text_lines):
next_line = text_lines[i + 1].strip()
if next_line and not any(kw in next_line for kw in ['开票人', '复核人', '收款人', '销售方']):
return next_line
return ""
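    def _extract_person(self, text_lines: List[str], role: str) -> str:
        """Extract a person's name for a role (checker, payee, issuer)."""
        for line in text_lines:
            if role in line:
                # Names are usually 2 to 4 Chinese characters. The optional 人 and the colon
                # class keep labels such as "复核人:" out of the captured name.
                match = re.search(rf'{role}人?[::\s]*([^\d\s]{{2,4}})', line)
                if match:
                    return match.group(1).strip()
                # Otherwise take up to four characters that follow the keyword
                parts = line.split(role)
                if len(parts) > 1:
                    name = parts[-1].strip()[:4]
                    if name and all('\u4e00' <= c <= '\u9fff' for c in name):
                        return name
        return ""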
def _extract_items(self, text_lines: List[str]) -> List[InvoiceItem]:
"""提取商品明细"""
items = []
in_items_section = False
for line in text_lines:
line = line.strip()
if not line:
continue
# 检测是否进入商品明细区域
if any(kw in line for kw in ['货物或应税劳务', '项目名称', '规格型号', '单位', '数量', '单价']):
in_items_section = True
continue
# 检测是否离开商品明细区域
if any(kw in line for kw in ['合计', '价税合计', '销售方']) and in_items_section:
in_items_section = False
continue
if in_items_section:
item = self._parse_item_line(line)
if item and item.name:
items.append(item)
return items
def _parse_item_line(self, line: str) -> Optional[InvoiceItem]:
"""解析单行商品明细"""
item = InvoiceItem()
# 尝试提取商品名称(通常是中文开头)
# 过滤掉纯数字行和太短的内容
        name_match = re.search(r'^([\u4e00-\u9fa5][\u4e00-\u9fa5a-zA-Z0-9*]{1,}(?:[、,,]?[\u4e00-\u9fa5]+)*)', line)
if name_match:
name = name_match.group(1).strip()
# 过滤掉非商品名称的内容
if len(name) >= 2 and not any(kw in name for kw in ['税率', '税额', '规格', '型号', '单位']):
item.name = name
# 提取金额(通常是两位小数)
amounts = re.findall(r'(\d+\.\d{2})', line)
if len(amounts) >= 1:
item.amount = float(amounts[0])
if len(amounts) >= 2:
item.tax_amount = float(amounts[-1])
# 提取税率
tax_rate_match = re.search(r'(\d+)%', line)
if tax_rate_match:
item.tax_rate = tax_rate_match.group(1) + '%'
# 提取规格型号
spec_match = re.search(r'([\u4e00-\u9fa5a-zA-Z0-9-]{2,})', line)
if spec_match and not item.name:
item.specification = spec_match.group(1)
return item if item.name else None
def _extract_image_as_pdf(self, image_path: Path) -> Optional[InvoiceInfo]:
"""将图片转为PDF后提取信息(备用方案)"""
try:
from PIL import Image
import io
print("尝试使用备用方案处理图片...")
# 打开图片
img = Image.open(image_path)
# 转换为RGB模式(如果是RGBA或其他模式)
if img.mode != 'RGB':
img = img.convert('RGB')
# 创建PDF内存文件
pdf_bytes = io.BytesIO()
img.save(pdf_bytes, format='PDF', resolution=100.0)
pdf_bytes.seek(0)
# 保存临时PDF文件
temp_pdf_path = Path(".temp/cache") / f"{image_path.stem}.pdf"
temp_pdf_path.parent.mkdir(parents=True, exist_ok=True)
with open(temp_pdf_path, 'wb') as f:
f.write(pdf_bytes.getvalue())
# 使用PDF提取方法
invoice = self._extract_from_pdf(temp_pdf_path)
# 更新源文件路径
if invoice:
invoice.source_file = str(image_path)
# 清理临时文件
temp_pdf_path.unlink(missing_ok=True)
return invoice
except Exception as e:
print(f"[FAIL] 备用方案处理失败: {e}")
return None
def extract_invoices_from_directory(directory: str) -> List[InvoiceInfo]:
"""
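    Extract invoice information from every supported file in a directory.
    Args:
        directory: Path of the directory that contains the invoice files
    Returns:
        List of InvoiceInfo objects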
"""
directory = Path(directory)
if not directory.exists():
print(f"[FAIL] 目录不存在: {directory}")
return []
# 支持的文件扩展名
extensions = {'.png', '.jpg', '.jpeg', '.bmp', '.tiff', '.tif', '.pdf'}
# 获取所有支持的文件
files = [f for f in directory.iterdir() if f.suffix.lower() in extensions]
if not files:
print(f"[FAIL] 目录中没有找到发票文件: {directory}")
return []
print(f"发现 {len(files)} 个待处理文件")
extractor = InvoiceExtractor()
invoices = []
for i, file_path in enumerate(files, 1):
print(f"\n[{i}/{len(files)}] 处理: {file_path.name}")
invoice = extractor.extract_from_file(str(file_path))
if invoice:
invoices.append(invoice)
print(f"[OK] 成功提取: {invoice.invoice_number or '未识别号码'}")
else:
print(f"[FAIL] 提取失败")
return invoices
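
The labelled-field patterns used by `_extract_invoice_code`, `_extract_invoice_number` and `_extract_invoice_date` can be exercised on a short piece of text. The following is a minimal sketch, assuming the label may be followed by an ASCII colon, a full-width colon, or whitespace. The function name `extract_header_fields` and the sample text are hypothetical.

```python
import re
from typing import Dict

# Separator after a label: ASCII colon, full-width colon, or whitespace
LABEL_SEP = r"[::\s]*"


def extract_header_fields(text: str) -> Dict[str, str]:
    """Return the invoice code, number and date found after their labels ('' if absent)."""
    patterns = {
        "invoice_code": rf"发票代码{LABEL_SEP}(\d{{10,12}})",
        "invoice_number": rf"发票号码{LABEL_SEP}(\d{{8,20}})",
        "invoice_date": rf"开票日期{LABEL_SEP}(\d{{4}}[年/-]\d{{1,2}}[月/-]\d{{1,2}}日?)",
    }
    fields: Dict[str, str] = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1) if match else ""
    return fields


# Made-up OCR text, for illustration only
sample_text = "发票代码:044001900111\n发票号码: 12345678\n开票日期:2024年03月15日"
fields = extract_header_fields(sample_text)
```

Matching only labelled values is stricter than the fallback patterns in the extractor, which trades some recall for fewer false matches on unrelated digit runs.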
FILE:src/invoice_model.py
"""
Invoice data model
Defines the structured data classes for invoice information
"""
from dataclasses import dataclass, field
from typing import List, Optional
from datetime import datetime
@dataclass
class InvoiceItem:
    """One line item on an invoice"""
    name: str = ""  # Name of the goods or taxable service
    specification: str = ""  # Specification / model
    unit: str = ""  # Unit
    quantity: float = 0.0  # Quantity
    unit_price: float = 0.0  # Unit price
    amount: float = 0.0  # Amount
    tax_rate: str = ""  # Tax rate
    tax_amount: float = 0.0  # Tax amount
@dataclass
class InvoiceInfo:
    """Invoice information"""
    # Basic information
    invoice_code: str = ""  # Invoice code
    invoice_number: str = ""  # Invoice number
    invoice_date: str = ""  # Issue date
    invoice_type: str = ""  # Invoice type
    # Buyer information
    buyer_name: str = ""  # Buyer name
    buyer_tax_number: str = ""  # Buyer taxpayer identification number
    buyer_address: str = ""  # Buyer address and phone
    buyer_bank: str = ""  # Buyer bank and account number
    # Seller information
    seller_name: str = ""  # Seller name
    seller_tax_number: str = ""  # Seller taxpayer identification number
    seller_address: str = ""  # Seller address and phone
    seller_bank: str = ""  # Seller bank and account number
    # Amounts
    total_amount: float = 0.0  # Total amount (pre-tax)
    total_tax_amount: float = 0.0  # Total tax
    total_amount_with_tax: float = 0.0  # Total amount including tax
    # Line items
    items: List[InvoiceItem] = field(default_factory=list)
    # Other fields
    remarks: str = ""  # Remarks
    machine_number: str = ""  # Machine number
    checker: str = ""  # Checker
    payee: str = ""  # Payee
    issuer: str = ""  # Issuer
    # Source file information
    source_file: str = ""  # Source file path
    extraction_time: str = ""  # Extraction time
def to_dict(self) -> dict:
"""转换为字典格式"""
return {
"发票代码": self.invoice_code,
"发票号码": self.invoice_number,
"开票日期": self.invoice_date,
"发票类型": self.invoice_type,
"购买方名称": self.buyer_name,
"购买方税号": self.buyer_tax_number,
"购买方地址电话": self.buyer_address,
"购买方开户行": self.buyer_bank,
"销售方名称": self.seller_name,
"销售方税号": self.seller_tax_number,
"销售方地址电话": self.seller_address,
"销售方开户行": self.seller_bank,
"合计金额": self.total_amount,
"合计税额": self.total_tax_amount,
"价税合计": self.total_amount_with_tax,
"备注": self.remarks,
"机器编号": self.machine_number,
"复核人": self.checker,
"收款人": self.payee,
"开票人": self.issuer,
"源文件": self.source_file,
"提取时间": self.extraction_time,
}
def get_items_summary(self) -> str:
"""获取商品明细摘要"""
if not self.items:
return ""
item_names = [item.name for item in self.items if item.name]
return "、".join(item_names[:3]) + ("..." if len(item_names) > 3 else "")
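
The truncation rule in `get_items_summary` (join the first three non-empty names and append `...` when more exist) can be checked on plain lists. This is a minimal sketch; the function name `summarize_names` and its `limit` and `sep` parameters are hypothetical generalizations of the method above.

```python
from typing import List


def summarize_names(names: List[str], limit: int = 3, sep: str = "、") -> str:
    """Join up to `limit` non-empty names and mark truncation with '...'."""
    kept = [name for name in names if name]
    summary = sep.join(kept[:limit])
    return summary + "..." if len(kept) > limit else summary


short = summarize_names(["A", "B"])
long = summarize_names(["A", "", "B", "C", "D"])
```

Empty names are dropped before counting, so the ellipsis reflects how many usable names were left out.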
FILE:src/main.py
#!/usr/bin/env python3
"""
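Main program of the invoice extraction tool
Extracts information from the invoice images and PDF files in the fp directory and exports it to Excel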
"""
import sys
import argparse
from pathlib import Path
from datetime import datetime
# 添加src目录到路径
sys.path.insert(0, str(Path(__file__).parent))
from invoice_extractor import extract_invoices_from_directory
from excel_exporter import export_invoices, ExcelExporter
def main():
"""主函数"""
parser = argparse.ArgumentParser(
description='发票信息提取工具 - 从图片和PDF中提取发票信息到Excel',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Usage examples:
  python main.py                          # use the defaults and read from the fp directory
  python main.py -i ./fp -o ./output      # set the input and output directories
  python main.py -v                       # show detailed processing information
"""
)
parser.add_argument(
'-i', '--input',
default='fp',
help='输入目录路径,包含发票图片和PDF文件 (默认: fp)'
)
parser.add_argument(
'-o', '--output',
default='output',
help='输出目录路径 (默认: output)'
)
parser.add_argument(
'-n', '--name',
default='发票信息',
help='输出文件名前缀 (默认: 发票信息)'
)
parser.add_argument(
'--no-items',
action='store_true',
help='不包含商品明细工作表'
)
parser.add_argument(
'-v', '--verbose',
action='store_true',
help='显示详细处理信息'
)
args = parser.parse_args()
# 打印欢迎信息
print("=" * 60)
print("发票信息提取工具")
print("=" * 60)
print(f"输入目录: {args.input}")
print(f"输出目录: {args.output}")
print(f"时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 60)
# 检查输入目录
input_dir = Path(args.input)
if not input_dir.exists():
print(f"\n[FAIL] 错误: 输入目录不存在: {input_dir}")
print("请确保fp目录存在,并将发票文件放入该目录")
return 1
# 提取发票信息
print("\n开始提取发票信息...")
invoices = extract_invoices_from_directory(str(input_dir))
if not invoices:
print("\n[FAIL] 未能提取到任何发票信息")
return 1
print(f"\n[OK] 成功提取 {len(invoices)} 张发票信息")
# 显示提取结果摘要
print("\n提取结果摘要:")
print("-" * 60)
for i, inv in enumerate(invoices, 1):
print(f"{i}. 发票号码: {inv.invoice_number or '未识别'}")
print(f" 开票日期: {inv.invoice_date or '未识别'}")
print(f" 销售方: {inv.seller_name or '未识别'}")
        print(f" 金额: ¥{inv.total_amount_with_tax:.2f}" if inv.total_amount_with_tax else " 金额: 未识别")
print()
# 导出到Excel
print("\n开始导出到Excel...")
output_dir = Path(args.output)
output_dir.mkdir(parents=True, exist_ok=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = output_dir / f"{args.name}_{timestamp}.xlsx"
exporter = ExcelExporter()
success = exporter.export_to_excel(
invoices,
str(output_file),
include_items=not args.no_items
)
if success:
print("\n" + "=" * 60)
print("处理完成!")
print(f"输出文件: {output_file}")
print("=" * 60)
return 0
else:
print("\n[FAIL] 导出失败")
return 1
if __name__ == '__main__':
sys.exit(main())
FILE:src/main_baidu.py
#!/usr/bin/env python3
"""
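Invoice extraction tool main program (Baidu OCR version)
Extracts data from invoice images and PDF files with the Baidu OCR API and exports it to Excel
Supported input modes:
- Single file: python main_baidu.py -f invoice.pdf
- Multiple files: python main_baidu.py -f invoice1.pdf -f invoice2.png
- Whole directory: python main_baidu.py -i ./fp
- Mixed mode: python main_baidu.py -i ./fp -f extra_invoice.pdf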
"""
import sys
import argparse
from pathlib import Path
from datetime import datetime
from typing import List
# Add the src directory to the import path
sys.path.insert(0, str(Path(__file__).parent))
from config import Config, setup_config
from baidu_ocr_extractor import BaiduInvoiceExtractor, extract_invoices_with_baidu
from excel_exporter import ExcelExporter
def collect_files(inputs: List[str]) -> List[Path]:
    """
    Collect every file to be processed.
    Args:
        inputs: Input paths (files or directories)
    Returns:
        Paths of all supported invoice files, de-duplicated
    """
    supported_extensions = {'.png', '.jpg', '.jpeg', '.bmp', '.tiff', '.tif', '.pdf'}
    files = []
    for input_path in inputs:
        path = Path(input_path)
        if not path.exists():
            print(f"[WARN] Path does not exist, skipping: {path}")
            continue
        if path.is_file():
            # Single file
            if path.suffix.lower() in supported_extensions:
                files.append(path)
            else:
                print(f"[WARN] Unsupported file format, skipping: {path}")
        elif path.is_dir():
            # Directory: collect supported files recursively. Comparing the
            # lowercased suffix also matches mixed-case names such as "a.Jpg",
            # and sorting keeps the processing order stable between runs.
            files.extend(sorted(
                p for p in path.rglob('*')
                if p.is_file() and p.suffix.lower() in supported_extensions
            ))
    # De-duplicate while preserving order
    seen = set()
    unique_files = []
    for f in files:
        resolved = f.resolve()
        if resolved not in seen:
            seen.add(resolved)
            unique_files.append(f)
    return unique_files
def process_files(files: List[Path], api_key: str, secret_key: str, output_dir: Path, output_name: str) -> bool:
    """
    Process the files and export the results to Excel.
    Args:
        files: Files to process
        api_key: Baidu API Key
        secret_key: Baidu Secret Key
        output_dir: Output directory
        output_name: Output file name prefix
    Returns:
        True if processing and export succeeded
    """
    if not files:
        print("[FAIL] No invoice files found to process")
        return False
    print(f"\nFound {len(files)} file(s) to process")
    print("-" * 60)
    for i, f in enumerate(files, 1):
        print(f"  {i}. {f.name}")
    print("-" * 60)
    # Initialize the extractor
    extractor = BaiduInvoiceExtractor(api_key, secret_key)
    if not extractor.access_token:
        print("[FAIL] Baidu OCR authentication failed")
        return False
    # Process every file
    invoices = []
    for i, file_path in enumerate(files, 1):
        print(f"\n[{i}/{len(files)}] Processing: {file_path.name}")
        invoice = extractor.extract_from_file(str(file_path))
        if invoice:
            invoices.append(invoice)
            print(f"[OK] Extracted: {invoice.invoice_number or 'number not recognized'}")
        else:
            print("[FAIL] Extraction failed")
    if not invoices:
        print("\n[FAIL] No invoice data could be extracted")
        return False
    print(f"\n[OK] Extracted {len(invoices)}/{len(files)} invoice(s)")
    # Show a summary of the results
    print("\nExtraction summary:")
    print("-" * 60)
    for i, inv in enumerate(invoices, 1):
        print(f"{i}. Invoice number: {inv.invoice_number or 'not recognized'}")
        print(f"   Invoice date: {inv.invoice_date or 'not recognized'}")
        print(f"   Buyer: {inv.buyer_name or 'not recognized'}")
        print(f"   Seller: {inv.seller_name or 'not recognized'}")
        if inv.total_amount_with_tax:
            print(f"   Amount: {inv.total_amount_with_tax:.2f}")
        else:
            print("   Amount: not recognized")
        print()
    # Export to Excel
    print("\nExporting to Excel...")
    output_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = output_dir / f"{output_name}_{timestamp}.xlsx"
    exporter = ExcelExporter()
    success = exporter.export_to_excel(
        invoices,
        str(output_file),
        include_items=True
    )
    if success:
        print("\n" + "=" * 60)
        print("Done!")
        print(f"Output file: {output_file}")
        print("=" * 60)
        return True
    else:
        print("\n[FAIL] Export failed")
        return False
def main():
    """Main entry point."""
    parser = argparse.ArgumentParser(
        description='Invoice extraction tool (Baidu OCR version) - extract invoice data from images and PDFs into Excel',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Process a single file
  python main_baidu.py -f invoice.pdf
  # Process multiple files
  python main_baidu.py -f invoice1.pdf -f invoice2.png
  # Process an entire directory
  python main_baidu.py -i ./fp
  # Directory plus extra files
  python main_baidu.py -i ./fp -f extra_invoice.pdf
  # Set the output directory and file name prefix
  python main_baidu.py -i ./fp -o ./output -n "invoices_2024-03"
  # Run the setup wizard
  python main_baidu.py --setup
"""
    )
    # Input options
    input_group = parser.add_argument_group('Input options')
    input_group.add_argument(
        '-f', '--file',
        action='append',
        dest='files',
        help='Invoice file to process (may be given more than once)'
    )
    # No default here: a default of 'fp' would make "-f invoice.pdf" scan the
    # fp directory as well. The fallback to fp is applied further down.
    input_group.add_argument(
        '-i', '--input',
        default=None,
        help='Input directory (default: fp when no -f file is given)'
    )
    # Output options
    output_group = parser.add_argument_group('Output options')
    output_group.add_argument(
        '-o', '--output',
        default='output',
        help='Output directory (default: output)'
    )
    output_group.add_argument(
        '-n', '--name',
        default='发票信息',
        help='Output file name prefix (default: 发票信息)'
    )
    # Authentication options
    auth_group = parser.add_argument_group('Authentication options')
    auth_group.add_argument(
        '--api-key',
        help='Baidu API Key (overrides the config file)'
    )
    auth_group.add_argument(
        '--secret-key',
        help='Baidu Secret Key (overrides the config file)'
    )
    # Other options
    parser.add_argument(
        '--setup',
        action='store_true',
        help='Run the setup wizard'
    )
    parser.add_argument(
        '--list',
        action='store_true',
        help='Only list the files that would be processed, without running OCR'
    )
    args = parser.parse_args()
    # Run the setup wizard
    if args.setup:
        setup_config()
        return 0
    # Load the configuration
    Config.load_from_file()
    # Check that credentials are available
    if not Config.BAIDU_API_KEY or not Config.BAIDU_SECRET_KEY:
        if not args.api_key or not args.secret_key:
            print("[WARN] Baidu OCR configuration not found")
            choice = input("Configure it now? (y/n): ").strip().lower()
            if choice == 'y':
                setup_config()
                Config.load_from_file()
            else:
                print("[FAIL] Cannot continue; configure Baidu OCR or pass an API Key")
                return 1
    # Command-line arguments take precedence over the config file
    api_key = args.api_key or Config.BAIDU_API_KEY
    secret_key = args.secret_key or Config.BAIDU_SECRET_KEY
    # Collect the input paths
    inputs = []
    # Add the directory
    if args.input:
        inputs.append(args.input)
    # Add the individually specified files
    if args.files:
        inputs.extend(args.files)
    # Fall back to the default directory when no input was given
    if not inputs:
        inputs.append('fp')
    # Print the banner
    print("=" * 60)
    print("Invoice Extraction Tool (Baidu OCR version)")
    print("=" * 60)
    print(f"Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("=" * 60)
    # Collect all files
    print("\nScanning for files...")
    files = collect_files(inputs)
    if not files:
        print("[FAIL] No invoice files found to process")
        print("Supported formats: PDF, PNG, JPG, JPEG, BMP, TIFF")
        return 1
    # With --list, only show the files and stop
    if args.list:
        print(f"\nFound {len(files)} file(s):")
        for i, f in enumerate(files, 1):
            print(f"  {i}. {f}")
        return 0
    # Process the files
    output_dir = Path(args.output)
    success = process_files(files, api_key, secret_key, output_dir, args.name)
    return 0 if success else 1
if __name__ == '__main__':
    sys.exit(main())
FILE:scripts/batch_process.py
#!/usr/bin/env python3
"""
Batch processing helper
Simplifies the batch invoice processing workflow
"""
import sys
import argparse
from pathlib import Path
from datetime import datetime
# Add the src directory to the import path
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
from config import Config
from baidu_ocr_extractor import extract_invoices_with_baidu
from excel_exporter import ExcelExporter
def main():
    parser = argparse.ArgumentParser(
        description='Batch invoice processing helper',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Process every invoice in a directory
  python batch_process.py ./invoices
  # Process and set the output directory and file name prefix
  python batch_process.py ./invoices -o ./reports -n "invoices_march"
"""
    )
    parser.add_argument(
        'directory',
        help='Directory containing the invoice files'
    )
    parser.add_argument(
        '-o', '--output',
        default='output',
        help='Output directory (default: output)'
    )
    parser.add_argument(
        '-n', '--name',
        default=None,
        help='Output file name prefix (default: generated from the directory name)'
    )
    parser.add_argument(
        '--no-items',
        action='store_true',
        help='Omit the line-item details'
    )
    args = parser.parse_args()
    # Load the configuration
    Config.load_from_file()
    if not Config.BAIDU_API_KEY or not Config.BAIDU_SECRET_KEY:
        print("[FAIL] Baidu OCR is not configured; run first: python src/main_baidu.py --setup")
        return 1
    # Check the directory
    input_dir = Path(args.directory)
    if not input_dir.exists():
        print(f"[FAIL] Directory does not exist: {input_dir}")
        return 1
    # Generate the file name prefix automatically
    if args.name is None:
        args.name = f"发票信息_{input_dir.name}"
    print("=" * 60)
    print("Batch Invoice Processing")
    print("=" * 60)
    print(f"Input directory: {input_dir}")
    print(f"Output directory: {args.output}")
    print(f"File name prefix: {args.name}")
    print("=" * 60)
    # Extract the invoices
    print("\nProcessing...")
    invoices = extract_invoices_with_baidu(
        str(input_dir),
        api_key=Config.BAIDU_API_KEY,
        secret_key=Config.BAIDU_SECRET_KEY
    )
    if not invoices:
        print("[FAIL] No invoices could be extracted")
        return 1
    # Export
    output_dir = Path(args.output)
    output_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = output_dir / f"{args.name}_{timestamp}.xlsx"
    exporter = ExcelExporter()
    success = exporter.export_to_excel(
        invoices,
        str(output_file),
        include_items=not args.no_items
    )
    if success:
        print(f"\n[OK] Processed {len(invoices)} invoice(s)")
        print(f"[OK] Output file: {output_file}")
        return 0
    else:
        print("[FAIL] Export failed")
        return 1
if __name__ == '__main__':
    sys.exit(main())
FILE:scripts/verify_export.py
#!/usr/bin/env python3
"""
Validate an exported Excel file
Checks data completeness and accuracy
"""
import sys
import argparse
from pathlib import Path
try:
    import pandas as pd
except ImportError:
    print("[FAIL] pandas is required: pip install pandas")
    sys.exit(1)
def verify_excel(file_path: str):
    """Validate an exported Excel file."""
    file_path = Path(file_path)
    if not file_path.exists():
        print(f"[FAIL] File does not exist: {file_path}")
        return False
    print("=" * 60)
    print("Excel Validation Report")
    print("=" * 60)
    print(f"File: {file_path}")
    print("-" * 60)
    try:
        # Read the workbook. Sheet and column names stay in Chinese because
        # they are looked up verbatim in the exported file.
        xls = pd.ExcelFile(file_path)
        print(f"\nWorksheet count: {len(xls.sheet_names)}")
        print(f"Worksheets: {', '.join(xls.sheet_names)}")
        # Validate the main sheet
        if "发票信息" in xls.sheet_names:
            df_main = pd.read_excel(file_path, sheet_name="发票信息")
            total_rows = len(df_main)
            print("\n[Invoice sheet: 发票信息]")
            print(f"  Records: {total_rows}")
            # Check the key fields
            required_fields = ["发票号码", "开票日期", "销售方名称", "价税合计"]
            print("\n  Field completeness:")
            for field in required_fields:
                if field in df_main.columns:
                    non_null = df_main[field].notna().sum()
                    # Guard against an empty sheet to avoid dividing by zero
                    pct = non_null / total_rows * 100 if total_rows else 0.0
                    print(f"    {field}: {non_null}/{total_rows} ({pct:.1f}%)")
                else:
                    print(f"    {field}: [missing]")
            # Amount statistics
            if "价税合计" in df_main.columns:
                # Coerce to numbers so that text cells do not break formatting
                amounts = pd.to_numeric(df_main["价税合计"], errors='coerce')
                print("\n  Amount statistics:")
                print(f"    Total: {amounts.sum():,.2f}")
                print(f"    Mean: {amounts.mean():,.2f}")
                print(f"    Max: {amounts.max():,.2f}")
                print(f"    Min: {amounts.min():,.2f}")
            # Date range
            if "开票日期" in df_main.columns:
                dates = pd.to_datetime(df_main["开票日期"], errors='coerce')
                valid_dates = dates.dropna()
                if len(valid_dates) > 0:
                    print("\n  Date range:")
                    print(f"    Earliest: {valid_dates.min().strftime('%Y-%m-%d')}")
                    print(f"    Latest: {valid_dates.max().strftime('%Y-%m-%d')}")
        # Validate the line-item sheet
        if "商品明细" in xls.sheet_names:
            df_items = pd.read_excel(file_path, sheet_name="商品明细")
            print("\n[Line-item sheet: 商品明细]")
            print(f"  Records: {len(df_items)}")
            if "发票号码" in df_items.columns:
                unique_invoices = df_items["发票号码"].nunique()
                print(f"  Invoices covered: {unique_invoices}")
        print("\n" + "=" * 60)
        print("[OK] Validation complete")
        print("=" * 60)
        return True
    except Exception as e:
        print(f"[FAIL] Validation failed: {e}")
        return False
def main():
    parser = argparse.ArgumentParser(
        description='Validate an exported Excel file',
        formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument(
        'file',
        help='Path to the Excel file'
    )
    args = parser.parse_args()
    success = verify_excel(args.file)
    return 0 if success else 1
if __name__ == '__main__':
    sys.exit(main())