whmmy

@clawhub-whmmy-92b89df595
1prompts
0upvotes received
0contributions
Joined 3 months ago
1 contribution in the last year
Aug
Sep
Oct
Nov
Dec
Jan
Feb
Mar
Apr
May
Jun
Jul
Less
小米 MiMo TTS
Skill
Text-to-speech using Xiaomi MiMo TTS API. Generates WAV audio files. Triggers when user says "send voice message", "voice reply", "read to me", "use clip voi...
---
name: mimo-tts
description: Text-to-speech using Xiaomi MiMo TTS API. Generates WAV audio files. Triggers when user says "send voice message", "voice reply", "read to me", "use clip voice", or any TTS-related request. Supports style control and fine-grained audio tags.
---

# Xiaomi MiMo TTS

## Quick Usage

Just say "send voice" + what you want me to say, or describe the voice style you want.

## Default Config

- **Default Voice**: `default_zh` (Chinese female)
- **Default Style**: `<style>夹子音</style>` (cute/clip voice, used when no style specified)

## Available Voices

| Voice Name | voice parameter |
|------------|-----------------|
| MiMo-Default | `mimo_default` |
| MiMo-Chinese-Female | `default_zh` |
| MiMo-English-Female | `default_eh` |

## Style Control

### Overall Style (at the beginning of text)

| Style Type | Examples |
|------------|----------|
| Speed Control | 变快 (faster) / 变慢 (slower) |
| Emotion | 开心 (happy) / 悲伤 (sad) / 生气 (angry) |
| Character | 孙悟空 (Wukong) / 林黛玉 (Lin Daiyu) |
| Style Variations | 悄悄话 (whisper) / 夹子音 (clip voice) / 台湾腔 (Taiwanese accent) |
| Dialect | 东北话 (Northeast) / 四川话 (Sichuan) / 河南话 (Henan) / 粤语 (Cantonese) |

**Format**: `<style>style1 style2</style>text to synthesize`

### Audio Tags (Fine-grained Control)

Use () to annotate emotion, speed, pauses, breathing, etc:

| Tag | Description | Example |
|-----|-------------|---------|
| （紧张，深呼吸） | Multi-emotion combo | （紧张，深呼吸）呼……冷静，冷静 |
| （语速加快） | Speed change | （语速加快，碎碎念） |
| （小声） | Volume control | （小声）哎呀，领带歪没歪？ |
| （长叹一口气） | Sigh | （长叹一口气） |
| （咳嗽） | Cough | （咳嗽）简直能把人骨头冻透了 |
| （沉默片刻） | Pause | （沉默片刻） |
| （苦笑） | Bitter smile | （苦笑）呵，没如果了 |
| （提高音量喊话） | Loud shout | （提高音量喊话）大姐！这鱼新鲜着呢！ |
| （极其疲惫，有气无力） | Exhausted | 师傅……到地方了叫我一声…… |
| （寒冷导致的急促呼吸） | Environmental | 呼——呼——这、这大兴安岭的雪…… |

**Synthesis Example**:

```python
import os
import base64
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("MIMO_API_KEY"),
    base_url="https://api.xiaomimimo.com/v1"
)

# Clip voice style
text = "<style>夹子音</style>主人～我来啦！今天有什么需要帮忙的吗～"

completion = client.chat.completions.create(
    model="mimo-v2-tts",
    messages=[
        {"role": "user", "content": "你好"},
        {"role": "assistant", "content": text}
    ],
    audio={"format": "wav", "voice": "default_zh"}
)

audio_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("output.wav", "wb") as f:
    f.write(audio_bytes)
```

## Notes

- Target text must be in the `assistant` role message, not in `user`
- `<style>` tag must be at the beginning of target text
- For singing: `<style>唱歌</style>target text`
- Returns base64-encoded WAV audio

## Script Usage

Use `scripts/mimo_tts.py` for speech synthesis:

```bash
MIMO_API_KEY=your_api_key python3 scripts/mimo_tts.py "text to synthesize" --voice default_zh --style "夹子音" --output output.wav
```

**Note**: Set `MIMO_API_KEY` environment variable or configure in OpenClaw settings.

FILE:scripts/mimo_tts.py
#!/usr/bin/env python3
"""
小米 MiMo TTS 语音合成脚本
用法: python3 mimo_tts.py "要合成的话" [--voice default_zh] [--style "夹子音"] [--output output.wav]
"""

import os
import sys
import base64
import argparse
import json
from openai import OpenAI

DEFAULT_VOICE = "default_zh"  # 默认中文女声
DEFAULT_STYLE = "夹子音"       # 默认夹子音风格

def synthesize_speech(text: str, voice: str = DEFAULT_VOICE, style: str = None, output_path: str = None):
    """调用 MiMo TTS API 进行语音合成"""
    
    api_key = os.environ.get("MIMO_API_KEY")
    if not api_key:
        # 尝试从配置文件读取
        config_path = os.path.expanduser("~/.openclaw/config.json")
        if os.path.exists(config_path):
            try:
                with open(config_path) as f:
                    config = json.load(f)
                    api_key = config.get("mimo_api_key") or config.get("MIMO_API_KEY")
            except:
                pass
        
        if not api_key:
            print("错误: 请设置 MIMO_API_KEY 环境变量", file=sys.stderr)
            print("  export MIMO_API_KEY='你的API密钥'", file=sys.stderr)
            sys.exit(1)
    
    # 构建完整的style标签
    if style:
        full_text = f"<style>{style}</style>{text}"
    else:
        full_text = text
    
    client = OpenAI(
        api_key=api_key,
        base_url="https://api.xiaomimimo.com/v1"
    )
    
    try:
        completion = client.chat.completions.create(
            model="mimo-v2-tts",
            messages=[
                {"role": "user", "content": "你好"},  # user消息可选，但可影响语气风格
                {"role": "assistant", "content": full_text}
            ],
            audio={"format": "wav", "voice": voice}
        )
        
        audio_data = completion.choices[0].message.audio.data
        audio_bytes = base64.b64decode(audio_data)
        
        if output_path:
            with open(output_path, "wb") as f:
                f.write(audio_bytes)
            print(f"音频已保存到: {output_path}")
            return output_path
        else:
            # 返回字节数据
            return audio_bytes
            
    except Exception as e:
        print(f"错误: {e}", file=sys.stderr)
        sys.exit(1)


def main():
    parser = argparse.ArgumentParser(description="小米 MiMo TTS 语音合成")
    parser.add_argument("text", help="要合成的文本")
    parser.add_argument("--voice", "-v", default=DEFAULT_VOICE, 
                        help=f"音色 (默认: {DEFAULT_VOICE})")
    parser.add_argument("--style", "-s", default=None,
                        help=f"风格标签，如: 夹子音、开心、东北话 等")
    parser.add_argument("--output", "-o", default=None,
                        help="输出文件路径 (默认: stdout)")
    parser.add_argument("--api-key", "-k", default=None,
                        help="API密钥 (也可通过 MIMO_API_KEY 环境变量设置)")
    
    args = parser.parse_args()
    
    if args.api_key:
        os.environ["MIMO_API_KEY"] = args.api_key
    
    # 如果没有指定style但用户提到夹子音，使用夹子音
    if not args.style:
        # 简单检测：如果文本里有～或者语气词多，用夹子音
        if "～" in args.text or "呀" in args.text or "啦" in args.text:
            args.style = DEFAULT_STYLE
    
    result = synthesize_speech(args.text, args.voice, args.style, args.output)
    
    if not args.output:
        # 输出到 stdout 作为二进制
        sys.stdout.buffer.write(result)


if __name__ == "__main__":
    main()
ClawHub Coding Backend+2
W@clawhub-whmmy-92b89df595