hfi1stats Highlight
Syntax highlighting and analysis tools for Intel Omni-Path (OPA) and CN5000 HFI statistics output.
Features
- Syntax highlighting for .hfi1stats / .hfi1 files: errors, warnings, and traffic counters are colour-coded for quick scanning
- Hover tooltips — hover over any counter name to see its description
- Right-click commands (available when the file language is hfi1stats):

| Command | What it does |
|---|---|
| Delete All Zero-Value Lines | Removes clutter by keeping only non-zero counters |
| Calculate Average Packet Size | (TxFlitVL0 × 8 bytes) / TxPktVL0 |
| Congestion Report (TxWait / RnrNak / PktDrop) | Auto-detects OPA100/CN5000, shows stall time, MB lost, stall % for TxWait and TxFlowStall separately, combined health indicator |
| GBytes Transferred (TX + RX) | Converts DcXmitFlits / DcRcvFlits to GB / TB |
| Save as CSV | Exports all counters to CSV (name, raw value, numeric value) |
| Diff with second file → CSV | Picks a second file and saves only changed counters to CSV |
| Drop Port 1 Counters (CN5000 pseudo port) | Removes the Port0,1: section; CN5000 Port 1 is a pseudo internal port with no real traffic |
| Analyze — Full Sanity Check Report | Normalizes counter units, runs 6 sanity checks, writes .analyzed.txt next to the source file |
| Collect from Remote Node(s) via SSH | SSHes to one or more nodes, pulls hfi1stats, computes TX/RX totals and avg speed, opens files in editor |
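
To make the "Diff with second file → CSV" behaviour concrete, here is a minimal Python sketch of the idea, not the extension's actual code. It assumes each counter line has the shape `Name Value`, that section headers are a single token ending in a colon (for example `Port0:` or `Port0,1:`), and that values are compared as raw text. The function names `read_counters` and `diff_to_csv` and the CSV column names are hypothetical.

```python
import csv
import io
import re

# Assumed line shape: optional indentation, counter name, whitespace, value.
COUNTER_RE = re.compile(r"^\s*([A-Za-z_]\w*)\s+(\S+)\s*$")
# Assumed section header shape, e.g. "Port0:" or "Port0,1:".
SECTION_RE = re.compile(r"^\s*(\S+):\s*$")


def read_counters(text):
    """Map (section, counter name) -> raw value string."""
    counters = {}
    section = ""
    for line in text.splitlines():
        header = SECTION_RE.match(line)
        if header:
            section = header.group(1)
            continue
        match = COUNTER_RE.match(line)
        if match:
            counters[(section, match.group(1))] = match.group(2)
    return counters


def diff_to_csv(old_text, new_text):
    """Return CSV text listing only counters whose raw value changed."""
    old, new = read_counters(old_text), read_counters(new_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["section", "counter", "old", "new"])
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key, ""), new.get(key, "")
        if before != after:
            writer.writerow([key[0], key[1], before, after])
    return out.getvalue()
```

Keying on the (section, name) pair keeps counters from different ports apart, since the same counter name can appear under more than one section header.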
Usage
Open any hfi1stats output file. VS Code will auto-detect the language by extension (.hfi1stats, .hfi1) or filename (hfi1stats). Right-click in the editor to access commands.
Counter Descriptions
All counter descriptions are stored in counters_compiled.json — edit that file to add or update descriptions without touching extension code.
Supported Hardware
Both OPA100 and CN5000 are supported. Hardware is auto-detected from the file format:

| Hardware | Detection | T_flit | Link speed |
|---|---|---|---|
| Intel OPA100 | Port0: section headers | 0.8 ns | 100 Gbps (4×25 Gb/s) |
| Cornelis CN5000 | Port0,1: / Port0,2: section headers | 0.4 ns | 400 Gbps (4×100 Gb/s) |

CN5000 note: Port 1 (Port0,1:) is a pseudo internal port that never transfers data — all counters are always zero. Use Drop Port 1 Counters to remove it before analysis.
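
The header-based detection described above can be sketched in a few lines of Python. This is an illustration, not the extension's implementation: the function name `detect_hardware` and the `HARDWARE` table layout are hypothetical, and matching any digit after `Port` (rather than only `Port0`) is an assumption.

```python
import re

# Flit time and link speed per hardware type, copied from the table above.
HARDWARE = {
    "OPA100": {"t_flit_ns": 0.8, "link_gbps": 100},
    "CN5000": {"t_flit_ns": 0.4, "link_gbps": 400},
}

# CN5000 headers carry a port index after a comma, e.g. "Port0,1:".
CN5000_HEADER = re.compile(r"^\s*Port\d+,\d+:", re.MULTILINE)
# OPA100 headers have no comma part, e.g. "Port0:".
OPA100_HEADER = re.compile(r"^\s*Port\d+:", re.MULTILINE)


def detect_hardware(text):
    """Return "CN5000", "OPA100", or None when no port header is found."""
    if CN5000_HEADER.search(text):
        return "CN5000"
    if OPA100_HEADER.search(text):
        return "OPA100"
    return None


sample = "Port0,1:\n  TxPkt 0\nPort0,2:\n  TxPkt 12\n"
kind = detect_hardware(sample)
print(kind, HARDWARE[kind]["t_flit_ns"], "ns/flit")
```

Checking the CN5000 pattern first is a design convenience only; the two patterns cannot match the same header because the OPA100 pattern requires the colon directly after the digits.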
Collecting from Remote Nodes
Ctrl+Shift+P (Windows/Linux) or Shift+Command+P (Mac) → hfi1stats: Collect from Remote Node(s) via SSH
Requires passwordless SSH as root from your workstation to the target nodes.
The command will ask for:
- Hostname(s), in any of these formats:
  - Single node: node6
  - Comma list: d001,d005,d012,d042
  - Range: d001-d010
- SSH user: defaults to root
- Save folder: where to store the .hfi1stats files
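
As an illustration of the three hostname formats, the following Python sketch expands a host specification into a list of names. It is a guess at reasonable behaviour, not the extension's parser: it assumes a range shares the same prefix on both sides, that zero padding follows the width of the first number, and that anything not matching the range shape is a literal hostname. The name `expand_hosts` is hypothetical.

```python
import re

# prefix + number, a hyphen, then prefix + number (e.g. "d001-d010").
RANGE_RE = re.compile(r"^(.*?)(\d+)-(.*?)(\d+)$")


def expand_hosts(spec):
    """Expand "node6", "d001,d005" or "d001-d010" into a list of hostnames."""
    hosts = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        m = RANGE_RE.match(part)
        # Treat the part as a range only when both sides share a prefix
        # and the numbers run upward; otherwise keep it as a literal name.
        if m and m.group(1) == m.group(3) and int(m.group(2)) <= int(m.group(4)):
            prefix, width = m.group(1), len(m.group(2))
            for n in range(int(m.group(2)), int(m.group(4)) + 1):
                hosts.append(f"{prefix}{n:0{width}d}")
        else:
            hosts.append(part)
    return hosts


print(expand_hosts("d001-d003,node6"))
```

A hostname that itself contains a hyphen, such as `node-6`, falls through to the literal branch because no digits precede the hyphen.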
Each collected file is opened in the editor automatically. The header contains TX/RX totals and average speed:
# host : node6.cluster.local
# collected : 2026-04-26 09:15:03 UTC
# last reboot : 2026-04-01 08:42
# uptime : 2198400 s
#
# TX transferred : 1.847 TB (avg 0.852 GB/s since last reboot)
# RX transferred : 0.412 TB (avg 0.190 GB/s since last reboot)
After collection, right-click → Analyze to run the full sanity check report.
Analysis Report
Analyze — Full Sanity Check Report writes a .analyzed.txt file next to the source. For multi-unit files (multiple HFI cards) each unit gets its own section.
Example output:
══════════════════════════════════════════════════════════════
hfi1stats Analysis Report
Source : node6.hfi1stats
Hardware : OPA100 (0.8 ns/flit, 100G)
Generated: 2026-04-27 11:03:14
══════════════════════════════════════════════════════════════
╔════════════════════════════════════════════════════════════╗
Unit 0
╚════════════════════════════════════════════════════════════╝
── 1. Normalized Units ────────────────────────────────────────────
Counter Raw Actual Value
──────────────────────────────────────────────────────────────
TxPkt 352679662K 352,679,662,000
TxPktVL0 352673233K 352,673,233,000
TxWords 132779278447K 1.328e+14
TxWait 1062315987864K 1.062e+15
TxFlowStall 3051086K 3,051,086,000
Note: K suffix = ×1,000 (auto-applied by hfi1stats when value overflows column)
── 2. Sanity Checks ───────────────────────────────────────────────
① Avg Packet Size (TxWords × 8 bytes / TxPkt)
Avg size = 3008 bytes/packet
✅ Reasonable — typical RDMA mix
② VL0 Traffic Share (TxPktVL0 / TxPkt)
VL0 share: 99.998% ✅ Single-SL workload, normal
VL15 share: 0.000% (management traffic)
③ Wait Attribution (TxWaitVL0 / TxWait)
VL0 accounts for: 99.977% ✅ Consistent with VL traffic distribution
④ Credit Starvation (TxFlowStall / TxWait)
Credit starvation: 0.0029%
✅ Link is idle-waiting, not credit-starved
⑤ Traffic Asymmetry (TX vs RX)
TxPkt / RxPkt = 1.37 — sender-heavy node
TxWords / RxWords = 20707× — large outbound payloads, tiny inbound (producer node)
⑥ Error Rates
RxICrcErr = 381 / RxPkt = 256937321K → BER ≈ 1.48e-12 ✅ Healthy link
⑦ RC Transport Health (RcQacks NAK analysis)
RcQacks = 33638 | RcAcks = 0
NAK rate = RcQacks / (RcAcks + RcQacks) = 100.0000% ✗ High NAK rate — investigate
Dominant: RcSeqNak=33638 → packets lost or reordered — check link errors, switch drops
⑧ PIO & Driver Send Path (PioWait / PioDrain · KmemWait · TidWait · SendSched)
PioWait = 0 PioDrain = 467 — no PIO traffic
KmemWait = 0 ✅
TidWait = 0 ✅
SendSched = 0 ✅
⑨ Link-Layer Health (DC / LTP counters)
DcAccLTP = 339547394174121 DcGoodLTP = 339547394174123 → ✅ Zero bad LTPs — perfect PHY
DcTxReplay = 0 DcRxReplay = 0 ✅ No LTP replays
CRC / multi-lane errors: all zero ✅
PHY / DC error counters: all zero ✅
FECN / BECN / mark: all zero — fabric uncongested ✅
⑩ RC Key Ratios (coalescing · NAK rate · retransmission · SDMA stall)
RcAcks / TxPkt = 9.53e-7 → 1 ACK per ~1050118.0 pkts (typical RC: 2–5)
RcQacks / (RcAcks + RcQacks) = 100.0000% (typical: < 0.01% clean, < 1% elevated)
RcResend / TxPkt = 0.00e+0 (typical: < 1e-6 clean, < 1e-4 monitor)
RcCrWait / TxPkt = 0.00e+0 (typical: < 1e-6 negligible)
DmaWait / TxWait = 0.000% (typical: < 1% negligible)
── 3. Summary ─────────────────────────────────────────────────────
✅ Traffic concentrated on VL0 (99.998%), VL15 management only
✅ Wait dominated by idle (TxFlowStall 0.003% of TxWait) — no credit pressure
✅ Per-VL wait decomposition consistent (TxWaitVL0 = 99.977% of TxWait)
✅ Asymmetric TX/RX word ratio (20707×) — producer-side node, normal
✅ Error counters negligible relative to traffic (BER ≈ 1.48e-12)
⚠ RC NAK rate 100.0000% — RcQacks=33638, investigate dominant cause
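
The unit normalization in section 1 of the report and the VL0 share in check ② can be reproduced by hand. The Python sketch below assumes that `K` is the only suffix that needs handling (the report's note mentions only `K` = ×1,000) and uses the hypothetical helper names `parse_counter` and `share_percent`; it is not taken from the extension's source.

```python
def parse_counter(raw):
    """Convert a raw counter string such as "352679662K" to an integer."""
    raw = raw.strip()
    if raw.endswith("K"):
        # K suffix = x1,000, per the note in the report.
        return int(raw[:-1]) * 1000
    return int(raw)


def share_percent(part, total):
    """Return part / total as a percentage, or None when total is zero."""
    if total == 0:
        return None
    return 100.0 * part / total


tx_pkt = parse_counter("352679662K")
tx_pkt_vl0 = parse_counter("352673233K")

print(f"TxPkt = {tx_pkt:,}")  # TxPkt = 352,679,662,000
print(f"VL0 share: {share_percent(tx_pkt_vl0, tx_pkt):.3f}%")  # VL0 share: 99.998%
```

Returning None for a zero denominator lets a caller print "n/a" for idle ports instead of raising a division error.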