Rui

@clawhub-mitsuha-m-4eec4b4f1b
1prompts
0upvotes received
0contributions
Joined 3 months ago
1 contribution in the last year
Aug
Sep
Oct
Nov
Dec
Jan
Feb
Mar
Apr
May
Jun
Jul
Less
Nccl Optimizer
Skill
Detect the optimal NCCL configuration for distributed GPU training on this machine. Checks GPU topology (NVLink/PCIe), whether RDMA (InfiniBand / RoCE) is av...
---
name: nccl_optimizer
description: >
  Detect the optimal NCCL configuration for distributed GPU training on this machine.
  Checks GPU topology (NVLink/PCIe), whether RDMA (InfiniBand / RoCE) is available,
  benchmarks intra-node collective bandwidth and peer-to-peer GPU bandwidth,
  and optionally benchmarks inter-node bandwidth via MPI.
  Use when: setting up multi-GPU or multi-node training, diagnosing slow collective
  communication, or tuning NCCL for a new cluster node.
---

# NCCL Optimizer

Finds the best NCCL communication configuration for distributed training with clear
separation of **intra-node** and **inter-node** bandwidth metrics.

## What it does

1. **GPU topology** — `nvidia-smi topo -m` to detect NVLink vs PCIe.
2. **RDMA check** — `ibv_devinfo` PORT_ACTIVE state for InfiniBand/RoCE.
   - ✅ RDMA → emit recommended `NCCL_IB_*` env-vars.
   - ❌ No RDMA → socket benchmark sweep.
3. **Intra-node all-reduce** — sweeps `NCCL_SOCKET_IFNAME` × `NCCL_NET_GDR_LEVEL` ×
   `NCCL_IB_TIMEOUT`, runs `all_reduce_perf -g <N>`, picks best bus bandwidth.
4. **Intra-node P2P** — `p2p_bw` for GPU↔GPU pair bandwidth (if available).
5. **Inter-node benchmark** — if `nodes=` passed, runs MPI `all_reduce_perf` across nodes;
   otherwise emits a ready-to-run command.

## Prerequisites

| Tool | Purpose | Install |
|------|---------|---------|
| `nvidia-smi` | GPU info + topology | NVIDIA driver |
| `ibv_devinfo` | RDMA detection | `apt install ibverbs-utils` |
| `all_reduce_perf` | Collective benchmark | See below |
| `p2p_bw` | Peer-to-peer benchmark | Same nccl-tests build |
| `mpirun` | Inter-node benchmark | `apt install openmpi-bin` |

### Build nccl-tests

```bash
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
# For V100 (sm_70), A100 (sm_80), A800 (sm_80), H100 (sm_90):
make -j$(nproc) CUDA_HOME=/usr/local/cuda \
  NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80"
export PATH=$PWD/build:$PATH
```

## Usage

```bash
# Intra-node only
openclaw skill run nccl_optimizer

# Include inter-node benchmark (requires passwordless SSH + MPI)
openclaw skill run nccl_optimizer "nodes=10.0.0.1,10.0.0.2"
```

## Metrics explained

| Metric | What it measures |
|--------|-----------------|
| All-reduce bus BW (intra) | Collective throughput across local GPUs — relevant for single-node training |
| P2P bandwidth | GPU↔GPU direct copy speed (NVLink ≫ PCIe) |
| All-reduce bus BW (inter) | Collective throughput across nodes — bottleneck for multi-node training |

## Notes

- Bus bandwidth normalises for GPU count: `(N-1)/N × data / time`. Compare at same N.
- Multi-node training is almost always bottlenecked by inter-node bandwidth, not intra-node.
- RDMA (InfiniBand/RoCE) typically gives 10-100× better inter-node bandwidth than TCP.

FILE:__init__.py
"""
nccl_optimizer — Detect the best NCCL configuration for distributed training.

**Linux only.** NCCL itself only supports Linux; this skill will exit cleanly
on other platforms with an explanatory message.

Workflow:
  1. OS check — abort early on non-Linux.
  2. Detect GPU info + intra-node topology (NVLink / PCIe).
  3. Check RDMA (InfiniBand / RoCE) availability via ibv_devinfo (ACTIVE state).
  4. Intra-node benchmark:
       - all_reduce_perf -g <N>  → collective bus bandwidth across all local GPUs
       - p2p_bw (if available)   → peer-to-peer bandwidth between GPU pairs
  5. Inter-node benchmark:
       - If nodes= provided in message: run mpirun all_reduce_perf across nodes
       - Otherwise: emit ready-to-run inter-node command
  6. Emit final recommended env-var block.

Usage:
  main()                           # intra-node only
  main("nodes=10.0.0.1,10.0.0.2") # also benchmarks inter-node (requires SSH + MPI)
"""

import platform
import subprocess
import re
import os
import shutil
import socket as _socket


# ---------------------------------------------------------------------------
# Platform guard — NCCL is Linux-only
# ---------------------------------------------------------------------------

def _check_platform() -> str | None:
    """Return an error string if not running on Linux, else None."""
    system = platform.system()
    if system != "Linux":
        return (
            f"⛔ **Platform not supported: {system}**\n\n"
            "NCCL only runs on **Linux**. This skill requires Linux with NVIDIA GPUs.\n\n"
            "See: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html"
        )
    return None


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _run_cmd(cmd: str, timeout: int = 60) -> str:
    """Run *cmd* via shell, return stdout+stderr stripped; empty string on any failure."""
    try:
        return subprocess.check_output(
            cmd, shell=True, stderr=subprocess.STDOUT, text=True, timeout=timeout
        ).strip()
    except Exception:
        return ""


def _in_container() -> bool:
    """Heuristic: detect Docker / Kubernetes / LXC container environment."""
    # /.dockerenv file is injected by Docker runtime
    if os.path.exists("/.dockerenv"):
        return True
    # /proc/1/cgroup contains 'docker' or 'kubepods' in containers
    cgroup = _run_cmd("cat /proc/1/cgroup 2>/dev/null | head -5")
    if re.search(r"docker|kubepods|containerd|lxc", cgroup, re.I):
        return True
    return False


# ---------------------------------------------------------------------------
# GPU detection
# ---------------------------------------------------------------------------

def _detect_gpu():
    """Return (gpu_names, driver, cuda, num_gpus)."""
    out = _run_cmd("nvidia-smi --query-gpu=name,driver_version --format=csv,noheader")
    gpu_names, driver = [], ""
    if out:
        for i, line in enumerate(out.splitlines()):
            parts = [p.strip() for p in line.split(",")]
            if parts:
                gpu_names.append(parts[0])
            if i == 0 and len(parts) > 1:
                driver = parts[1]

    # CUDA version: try nvcc first, fall back to nvidia-smi header
    cuda = ""
    nvcc_out = _run_cmd("nvcc --version 2>/dev/null")
    if nvcc_out:
        m = re.search(r"release\s+([0-9.]+)", nvcc_out)
        cuda = m.group(1) if m else ""
    if not cuda:
        smi_top = _run_cmd("nvidia-smi 2>/dev/null")
        m = re.search(r"CUDA Version:\s*([0-9.]+)", smi_top)
        cuda = m.group(1) if m else "unknown"

    return gpu_names, driver, cuda, len(gpu_names)


def _detect_topology(num_gpus: int):
    """Return (has_nvlink: bool, topo_summary: str)."""
    out = _run_cmd("nvidia-smi topo -m 2>/dev/null")
    if not out:
        return False, "topology info unavailable (`nvidia-smi topo` not supported)"
    has_nvlink = bool(re.search(r"\bNV\d+\b", out))
    lines = [l for l in out.splitlines() if l.strip()]
    summary = "\n".join(lines[:min(len(lines), num_gpus + 3)])
    return has_nvlink, summary


# ---------------------------------------------------------------------------
# RDMA detection
# ---------------------------------------------------------------------------

def _rdma_available():
    """Return (available: bool, active_devices: list[str]).
    Considers a device active only when its port state is PORT_ACTIVE.
    """
    out = _run_cmd("ibv_devinfo 2>/dev/null")
    if not out:
        return False, []
    active_devs, current_dev = [], None
    for line in out.splitlines():
        dev_m = re.match(r"^hca_id:\s*(\S+)", line)
        if dev_m:
            current_dev = dev_m.group(1)
        if re.search(r"state:\s*(PORT_ACTIVE|4\b)", line) and current_dev:
            if current_dev not in active_devs:
                active_devs.append(current_dev)
    return bool(active_devs), active_devs


def _rdma_gid_index():
    """Detect RoCEv2 GID index via show_gids; default to 3."""
    out = _run_cmd("show_gids 2>/dev/null")
    if out:
        m = re.search(r"(\d+)\s+\S+\s+\S+\s+ROCEv2", out)
        if m:
            return m.group(1)
    # Fallback: parse ibv_devinfo -v for GID entries
    out2 = _run_cmd("ibv_devinfo -v 2>/dev/null")
    if out2:
        m = re.search(r"GID\[(\d+)\].*:.*:.*:.*fe80", out2, re.I)
        if m:
            return m.group(1)
    return "3"


# ---------------------------------------------------------------------------
# nccl-tests binary discovery
# ---------------------------------------------------------------------------

_NCCL_SEARCH_DIRS = [
    "/usr/local/nccl-tests/build",
    os.path.expanduser("~/nccl-tests/build"),
    "./nccl-tests/build",
    "/opt/nccl-tests/build",
]


def _find_binary(name: str) -> str | None:
    """Find an nccl-tests binary in PATH or common install locations."""
    found = shutil.which(name)
    if found:
        return found
    for d in _NCCL_SEARCH_DIRS:
        p = os.path.join(d, name)
        if os.path.isfile(p) and os.access(p, os.X_OK):
            return p
    return None


# ---------------------------------------------------------------------------
# Bandwidth parsing
# ---------------------------------------------------------------------------

def _parse_bandwidth(output: str) -> float:
    """Extract average bus bandwidth (GB/s) from nccl-tests output."""
    # Primary: summary line "# Avg bus bandwidth    : 127.9"
    m = re.search(r"Avg bus bandwidth\s*[:\-]+\s*([0-9.]+)", output)
    if m:
        return float(m.group(1))
    # Fallback: last busbw column in data rows
    # Format: size count type redop root time algbw busbw #wrong ...
    values = re.findall(r"^\s*\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)", output, re.M)
    return float(values[-1]) if values else 0.0


def _parse_p2p_bandwidth(output: str) -> float:
    """Extract max unidirectional bandwidth from p2p_bw output (GB/s)."""
    values = re.findall(r"^\s*\d+\s+\d+\s+([0-9.]+)\s*$", output, re.M)
    if values:
        return max(float(v) for v in values)
    m = re.search(r"([0-9.]+)\s*GB/s", output)
    return float(m.group(1)) if m else 0.0


# ---------------------------------------------------------------------------
# Network interface detection (Linux-aware, container-safe)
# ---------------------------------------------------------------------------

def _socket_interfaces() -> list:
    """Return non-loopback, non-virtual interfaces with carrier (link-up).

    Uses /sys/class/net when available (standard Linux), falls back to
    `ip link show` (works in most containers), then `ifconfig`.
    Excludes loopback, docker bridges, virbr*, and veth interfaces.
    """
    _EXCLUDE = re.compile(r"^(lo|docker|virbr|veth|br-|tunl|dummy)")

    ifaces = []

    # Method 1: sysfs (standard Linux, may not be mounted in all containers)
    if os.path.isdir("/sys/class/net"):
        try:
            ifaces = [i for i in os.listdir("/sys/class/net") if not _EXCLUDE.match(i)]
        except OSError:
            pass

    # Method 2: ip link show (works in containers without sysfs)
    if not ifaces:
        ip_out = _run_cmd("ip link show 2>/dev/null")
        ifaces = re.findall(r"^\d+:\s+(\S+):", ip_out, re.M)
        ifaces = [i.rstrip(":@") for i in ifaces if not _EXCLUDE.match(i)]

    # Method 3: ifconfig (older systems)
    if not ifaces:
        ifc_out = _run_cmd("ifconfig -a 2>/dev/null")
        ifaces = re.findall(r"^(\S+)\s+Link|^(\S+):\s+flags", ifc_out, re.M)
        ifaces = [a or b for a, b in ifaces if not _EXCLUDE.match(a or b)]

    if not ifaces:
        return ["eth0"]  # last-resort default

    # Filter to link-up interfaces
    active = []
    for iface in ifaces:
        # sysfs carrier file
        carrier_path = f"/sys/class/net/{iface}/carrier"
        if os.path.exists(carrier_path):
            carrier = _run_cmd(f"cat {carrier_path} 2>/dev/null")
            if carrier == "1":
                active.append(iface)
        else:
            # ip link operstate
            state = _run_cmd(f"cat /sys/class/net/{iface}/operstate 2>/dev/null")
            if state in ("up", "unknown"):
                active.append(iface)

    return (active or ifaces)[:3]


# ---------------------------------------------------------------------------
# Intra-node benchmarks
# ---------------------------------------------------------------------------

def _run_intranode_allreduce(binary: str, env: dict, num_gpus: int) -> tuple:
    """Run all_reduce_perf across all local GPUs. Returns (bw, raw_output)."""
    env_str = "NCCL_DEBUG=WARN " + " ".join(f"{k}={v}" for k, v in env.items())
    cmd = f"{env_str} {binary} -b 8M -e 4G -f 2 -g {max(1, num_gpus)} 2>&1"
    out = _run_cmd(cmd, timeout=180)
    return _parse_bandwidth(out), out


def _run_p2p_bw(p2p_binary: str, num_gpus: int) -> float:
    """Run p2p_bw between all GPU pairs. Returns max observed GB/s."""
    cmd = f"NCCL_DEBUG=WARN {p2p_binary} -b 1G -e 4G -f 2 -g {max(1, num_gpus)} 2>&1"
    out = _run_cmd(cmd, timeout=120)
    return _parse_p2p_bandwidth(out)


def _optimize_socket_intranode(binary: str, num_gpus: int) -> tuple:
    """Sweep socket configs, return (best_env, best_bw).

    Search space:
      NCCL_SOCKET_IFNAME  : top-3 active interfaces
      NCCL_NET_GDR_LEVEL  : 0, 2
      NCCL_IB_TIMEOUT     : 22, 30
    """
    interfaces = _socket_interfaces()
    best_env, best_bw = None, -1.0
    for iface in interfaces:
        for gdr_level in ("0", "2"):
            for ib_timeout in ("22", "30"):
                env = {
                    "NCCL_SOCKET_IFNAME": iface,
                    "NCCL_IB_DISABLE": "1",
                    "NCCL_NET_GDR_LEVEL": gdr_level,
                    "NCCL_IB_TIMEOUT": ib_timeout,
                }
                bw, _ = _run_intranode_allreduce(binary, env, num_gpus)
                if bw > best_bw:
                    best_bw = bw
                    best_env = dict(env)
    return best_env, best_bw


# ---------------------------------------------------------------------------
# Inter-node benchmarks
# ---------------------------------------------------------------------------

def _find_mpirun() -> str | None:
    for name in ("mpirun", "orterun", "mpiexec"):
        p = shutil.which(name)
        if p:
            return p
    return None


def _run_internode_allreduce(binary: str, mpirun: str, nodes: list,
                              gpus_per_node: int, best_env: dict) -> tuple:
    """Run all_reduce_perf across multiple nodes via MPI. Returns (bw, raw_output)."""
    host_str = ",".join(f"{n}:{gpus_per_node}" for n in nodes)
    total_ranks = len(nodes) * gpus_per_node
    env_args = " ".join(f"-x {k}={v}" for k, v in best_env.items())
    cmd = (
        f"{mpirun} -np {total_ranks} -H {host_str} "
        f"-x NCCL_DEBUG=WARN {env_args} "
        f"{binary} -b 8M -e 4G -f 2 -g 1 2>&1"
    )
    out = _run_cmd(cmd, timeout=300)
    return _parse_bandwidth(out), out


def _parse_nodes(message: str) -> list:
    """Extract node list from 'nodes=10.0.0.1,10.0.0.2' or 'hosts=a,b'."""
    m = re.search(r"(?:nodes?|hosts?)=([^\s]+)", message, re.I)
    if m:
        return [n.strip() for n in m.group(1).split(",") if n.strip()]
    return []


# ---------------------------------------------------------------------------
# Report generation
# ---------------------------------------------------------------------------

def _generate_report(message: str = "") -> str:
    # ── Platform guard ────────────────────────────────────────────────────
    err = _check_platform()
    if err:
        return err

    lines = ["# NCCL Optimizer Report\n"]
    local_hostname = _socket.gethostname()
    is_container = _in_container()

    # ── Environment info ──────────────────────────────────────────────────
    lines.append(f"**Host:** `{local_hostname}`")
    if is_container:
        lines.append("⚠️  **Container environment detected** — interface names may differ from host; "
                      "ensure NCCL can reach peer containers/nodes via the expected interface.\n")

    # ── GPU info ──────────────────────────────────────────────────────────
    gpu_names, driver, cuda, num_gpus = _detect_gpu()
    if gpu_names:
        lines.append(f"**GPUs detected ({num_gpus}):** {', '.join(set(gpu_names))}")
        lines.append(f"**Driver:** {driver}  |  **CUDA:** {cuda}\n")
    else:
        lines.append("⚠️  No NVIDIA GPU info found (`nvidia-smi` missing or no GPU present).\n")
        num_gpus = 1

    # ── Topology ──────────────────────────────────────────────────────────
    has_nvlink, topo_summary = _detect_topology(num_gpus)
    lines.append("## Intra-node GPU Topology\n")
    if has_nvlink:
        lines.append("✅ **NVLink detected** — GPUs can communicate at NVLink speeds.\n")
    else:
        lines.append("ℹ️  **No NVLink** — GPU↔GPU traffic goes over PCIe.\n")
    lines.append("```")
    lines.append(topo_summary)
    lines.append("```\n")

    # ── RDMA check ────────────────────────────────────────────────────────
    rdma_ok, active_devs = _rdma_available()
    best_env: dict = {}

    if rdma_ok:
        lines.append("## ✅ RDMA Available\n")
        lines.append(f"Active InfiniBand/RoCE devices: **{', '.join(active_devs)}**\n")
        gid_idx = _rdma_gid_index()
        hca = active_devs[0]
        best_env = {
            "NCCL_IB_HCA": hca,
            "NCCL_IB_GID_INDEX": gid_idx,
            "NCCL_IB_DISABLE": "0",
            "NCCL_IB_TC": "106",
            "NCCL_IB_RETRY_CNT": "7",
            "NCCL_IB_TIMEOUT": "22",
            "NCCL_NET_GDR_LEVEL": "2",
        }
        lines.append("### Recommended NCCL env-vars (RDMA mode)\n")
        lines.append("```bash")
        for k, v in best_env.items():
            lines.append(f"export {k}={v}")
        lines.append("export NCCL_DEBUG=WARN")
        lines.append("```\n")
        lines.append("> Verify GID index with `show_gids` (RoCEv2 → usually index 3).")

    else:
        lines.append("## ℹ️  No RDMA — Socket mode\n")
        lines.append("No active InfiniBand/RoCE device detected. Falling back to TCP sockets.\n")

        binary = _find_binary("all_reduce_perf")
        if binary is None:
            lines.append("⚠️  `all_reduce_perf` not found. Build nccl-tests first:\n")
            lines.append("```bash")
            lines.append("git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests")
            lines.append("# Adjust arch flag: V100=sm_70, A100/A800=sm_80, H100=sm_90")
            lines.append("make -j$(nproc) CUDA_HOME=/usr/local/cuda \\")
            lines.append("  NVCC_GENCODE=\"-gencode=arch=compute_80,code=sm_80\"")
            lines.append("export PATH=$PWD/build:$PATH")
            lines.append("```\n")
        else:
            lines.append(f"Benchmark binary: `{binary}`\n")

            # ── Intra-node sweep ──────────────────────────────────────────
            lines.append("## Intra-node Benchmark\n")
            lines.append(f"> Measures **all-reduce collective bus bandwidth** across all {num_gpus} local GPUs.")
            lines.append("> Relevant for **single-node multi-GPU** training.\n")
            lines.append("Running parameter sweep (may take several minutes)…\n")

            env, bw = _optimize_socket_intranode(binary, num_gpus)

            if env and bw > 0:
                best_env = env
                lines.append(f"### ✅ Best intra-node config — **{bw:.1f} GB/s bus bandwidth**\n")
                lines.append("```bash")
                for k, v in env.items():
                    lines.append(f"export {k}={v}")
                lines.append("export NCCL_SOCKET_NTHREADS=4")
                lines.append("export NCCL_NSOCKS_PERTHREAD=4")
                lines.append("export NCCL_BUFFSIZE=8388608")
                lines.append("export NCCL_DEBUG=WARN")
                lines.append("```\n")
            else:
                # Benchmark failed — diagnose and provide fallback
                iface = (_socket_interfaces() or ["eth0"])[0]
                best_env = {
                    "NCCL_SOCKET_IFNAME": iface,
                    "NCCL_IB_DISABLE": "1",
                    "NCCL_NET_GDR_LEVEL": "2",
                    "NCCL_IB_TIMEOUT": "22",
                }
                lines.append("⚠️  Benchmark produced no results. Common causes on Linux:\n")
                lines.append("- **CUDA arch mismatch** — rebuild with correct `NVCC_GENCODE` for your GPU:")
                lines.append("  ```bash")
                lines.append("  # V100: sm_70 | A100/A800: sm_80 | H100: sm_90")
                lines.append("  make -j$(nproc) CUDA_HOME=/usr/local/cuda \\")
                lines.append("    NVCC_GENCODE=\"-gencode=arch=compute_80,code=sm_80\"")
                lines.append("  ```")
                lines.append("- **GPU not accessible** — check `nvidia-smi` and driver status")
                lines.append("- **Container without GPU passthrough** — verify `--gpus all` flag\n")
                lines.append("**Fallback config (untuned):**")
                lines.append("```bash")
                for k, v in best_env.items():
                    lines.append(f"export {k}={v}")
                lines.append("export NCCL_SOCKET_NTHREADS=4")
                lines.append("export NCCL_NSOCKS_PERTHREAD=4")
                lines.append("export NCCL_BUFFSIZE=8388608")
                lines.append("export NCCL_DEBUG=WARN")
                lines.append("```\n")

            # ── P2P bandwidth ─────────────────────────────────────────────
            p2p_binary = _find_binary("p2p_bw")
            if p2p_binary and num_gpus >= 2:
                p2p_bw = _run_p2p_bw(p2p_binary, num_gpus)
                if p2p_bw > 0:
                    lines.append(f"**Intra-node GPU peer-to-peer bandwidth:** {p2p_bw:.1f} GB/s  ")
                    lines.append("_(point-to-point, not collective; NVLink typically 600+ GB/s, PCIe ~32 GB/s)_\n")

    # ── Inter-node benchmark ──────────────────────────────────────────────
    lines.append("## Inter-node Benchmark\n")
    lines.append("> Measures **network bandwidth between nodes**. The primary bottleneck for multi-node training.\n")

    peer_nodes = _parse_nodes(message)
    mpirun = _find_mpirun()
    binary = _find_binary("all_reduce_perf")

    if peer_nodes and binary and mpirun:
        all_nodes = [local_hostname] + [n for n in peer_nodes if n != local_hostname]
        lines.append(f"Running inter-node benchmark across: `{', '.join(all_nodes)}`\n")
        internode_bw, internode_out = _run_internode_allreduce(
            binary, mpirun, all_nodes, num_gpus, best_env
        )
        if internode_bw > 0:
            lines.append(f"### ✅ Inter-node all-reduce bus bandwidth: **{internode_bw:.1f} GB/s**")
            lines.append(f"_{len(all_nodes)} nodes × {num_gpus} GPUs = {len(all_nodes)*num_gpus} total GPUs_\n")
        else:
            lines.append("⚠️  Inter-node benchmark produced no readings.\n")
            lines.append("Checklist:")
            lines.append("- Passwordless SSH between all nodes (same user)")
            lines.append("- `all_reduce_perf` at same path on all nodes")
            lines.append("- MPI hostfile / firewall allows traffic on MPI ports\n")
            lines.append("```")
            lines.append(internode_out[:800] if internode_out else "(no output)")
            lines.append("```\n")
    else:
        env_x_args = " \\\n  ".join(f"-x {k}={v}" for k, v in best_env.items())
        lines.append("To benchmark inter-node bandwidth, run on one node:\n")
        lines.append("```bash")
        lines.append("# Replace NODE1, NODE2 etc. with actual hostnames/IPs")
        if not mpirun:
            lines.append("# Install MPI first:  apt install openmpi-bin  (Debian/Ubuntu)")
            lines.append("#                     yum install openmpi       (RHEL/CentOS)")
        lines.append(f"mpirun -np $((NNODES * {num_gpus})) -H NODE1:{num_gpus},NODE2:{num_gpus} \\")
        if env_x_args:
            lines.append(f"  {env_x_args} \\")
        lines.append("  -x NCCL_DEBUG=WARN \\")
        binary_path = binary or "/path/to/nccl-tests/build/all_reduce_perf"
        lines.append(f"  {binary_path} -b 8M -e 4G -f 2 -g 1")
        lines.append("```\n")
        lines.append("Or let this skill run it automatically:")
        lines.append("```")
        lines.append('openclaw skill run nccl_optimizer "nodes=10.0.0.1,10.0.0.2"')
        lines.append("```\n")

    # ── Metric explanation ────────────────────────────────────────────────
    lines.append("---\n")
    lines.append("## Understanding the metrics\n")
    lines.append("| Metric | What it measures | Relevant for |")
    lines.append("|--------|-----------------|--------------|")
    lines.append("| **All-reduce bus BW** (intra-node) | Collective throughput across local GPUs | Single-node multi-GPU |")
    lines.append("| **P2P bandwidth** | Direct GPU↔GPU copy (NVLink / PCIe) | Pipeline parallelism, ZeRO |")
    lines.append("| **All-reduce bus BW** (inter-node) | Collective throughput across nodes | Multi-node training |")
    lines.append("")
    lines.append("> **Bus BW vs raw link speed:** all-reduce with N GPUs transfers `(N-1)/N × data` per GPU.")
    lines.append("> Bus bandwidth normalises for this — valid for comparing across different GPU counts.\n")
    lines.append("## General tuning tips\n")
    lines.append("- `NCCL_DEBUG=INFO` — diagnose hangs / connection errors")
    lines.append("- `NCCL_BUFFSIZE=8388608` — larger buffers improve large-message throughput")
    lines.append("- `NCCL_SOCKET_NTHREADS=4` + `NCCL_NSOCKS_PERTHREAD=4` — more threads for TCP mode")
    lines.append("- Multi-node: set `MASTER_ADDR` / `MASTER_PORT`, or use `torchrun --rdzv_backend=c10d`")
    lines.append("- In containers: mount `/dev/infiniband` and run with `--privileged` or `--cap-add IPC_LOCK` for RDMA")

    return "\n".join(lines)


def main(message: str = "") -> str:
    """Skill entry point. Returns a Markdown-formatted report."""
    return _generate_report(message)


if __name__ == "__main__":
    import sys
    print(main(" ".join(sys.argv[1:])))
ClawHub DevOps Documentation+2
R@clawhub-mitsuha-m-4eec4b4f1b