hfi1stats Highlight

Pavel Dobrinskiy

Syntax highlighting for hfi1stats output — errors, warnings, traffic counters
Installation
Launch VS Code Quick Open (Ctrl+P), paste the following command, and press enter.

hfi1stats Highlight

Syntax highlighting and analysis tools for Intel Omni-Path (OPA) and CN5000 HFI statistics output.

Features

  • Syntax highlighting for .hfi1stats / .hfi1 files — errors, warnings, and traffic counters are colour-coded for quick scanning
  • Hover tooltips — hover over any counter name to see its description
  • Right-click commands (available when file language is hfi1stats):
| Command | What it does |
| --- | --- |
| Delete All Zero-Value Lines | Removes clutter; keeps only non-zero counters |
| Calculate Average Packet Size | (TxFlitVL0 × 8 bytes) / TxPktVL0 |
| Congestion Report (TxWait / RnrNak / PktDrop) | Auto-detects OPA100/CN5000; shows stall time, MB lost, and stall % for TxWait and TxFlowStall separately, plus a combined health indicator |
| GBytes Transferred (TX + RX) | Converts DcXmitFlits / DcRcvFlits to GB / TB |
| Save as CSV | Exports all counters to CSV (name, raw value, numeric value) |
| Diff with second file → CSV | Picks a second file and saves only changed counters to CSV |
| Drop Port 1 Counters (CN5000 pseudo port) | Removes the Port0,1: section (CN5000 Port 1 is a pseudo internal port with no real traffic) |
| Analyze — Full Sanity Check Report | Normalizes counter units, runs 6 sanity checks, writes .analyzed.txt next to the source file |
| Collect from Remote Node(s) via SSH | SSHes to one or more nodes, pulls hfi1stats, computes TX/RX totals and avg speed, opens files in editor |
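
To make the first two commands concrete, here is a minimal Python sketch of what they compute. It is not the extension's code. It assumes each counter line has the shape `<CounterName> <integer>[K]`, separated by whitespace, and that a trailing K means ×1,000. The function names and the sample text are hypothetical.

```python
import re

# Assumed line shape: "<CounterName> <integer>[K]"; anything else (for example
# a "Port0:" section header) is treated as a non-counter line.
COUNTER_RE = re.compile(r"^\s*([A-Za-z_]\w*)\s+(\d+)(K?)\s*$")

def parse_counters(text):
    """Return {name: value}, expanding a trailing K suffix to x1,000."""
    counters = {}
    for line in text.splitlines():
        m = COUNTER_RE.match(line)
        if m:
            name, digits, suffix = m.groups()
            counters[name] = int(digits) * (1000 if suffix else 1)
    return counters

def drop_zero_lines(text):
    """Keep headers and non-zero counters; drop counter lines whose value is 0."""
    kept = []
    for line in text.splitlines():
        m = COUNTER_RE.match(line)
        if m and int(m.group(2)) == 0:
            continue
        kept.append(line)
    return "\n".join(kept)

def avg_packet_size(counters, flit_bytes=8):
    """(TxFlitVL0 x 8 bytes) / TxPktVL0, or None when no packets were sent."""
    pkts = counters.get("TxPktVL0", 0)
    if pkts == 0:
        return None
    return counters.get("TxFlitVL0", 0) * flit_bytes / pkts

# Hypothetical sample, not real hfi1stats output.
sample = """Port0:
TxPktVL0      2000
TxFlitVL0     512K
RxPktVL0      0
"""
counters = parse_counters(sample)
```

With this sample, `avg_packet_size(counters)` evaluates 512,000 flits × 8 bytes / 2,000 packets, and `drop_zero_lines(sample)` keeps the header and the two non-zero lines.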

Usage

Open any hfi1stats output file. VS Code will auto-detect the language by extension (.hfi1stats, .hfi1) or filename (hfi1stats). Right-click in the editor to access commands.

Counter Descriptions

All counter descriptions are stored in counters_compiled.json — edit that file to add or update descriptions without touching extension code.

Supported Hardware

Both OPA100 and CN5000 are supported. Hardware is auto-detected from the file format:

| Hardware | Detection | T_flit | Link speed |
| --- | --- | --- | --- |
| Intel OPA100 | Port0: section headers | 0.8 ns | 100 Gbps (4×25 Gb/s) |
| Cornelis CN5000 | Port0,1: / Port0,2: section headers | 0.4 ns | 400 Gbps (4×100 Gb/s) |

CN5000 note: Port 1 (Port0,1:) is a pseudo internal port that never transfers data — all counters are always zero. Use Drop Port 1 Counters to remove it before analysis.
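
The header-based detection and the Port 1 removal described above could look roughly like the Python sketch below. It is an illustration, not the extension's implementation. It assumes a section header is a single token ending in a colon on its own line, and that a section runs until the next such header. The names `detect_hardware`, `drop_port1`, and `T_FLIT_NS` are hypothetical.

```python
import re

# Flit times taken from the table above, in nanoseconds.
T_FLIT_NS = {"OPA100": 0.8, "CN5000": 0.4}

def detect_hardware(text):
    """Return "CN5000", "OPA100", or None based on section header style."""
    # CN5000 headers carry a second index ("Port0,1:"); OPA100 headers do not.
    if re.search(r"^\s*Port\d+,\d+:", text, re.MULTILINE):
        return "CN5000"
    if re.search(r"^\s*Port\d+:", text, re.MULTILINE):
        return "OPA100"
    return None

def drop_port1(text):
    """Remove the Port0,1: section (header plus lines up to the next header)."""
    out, skipping = [], False
    for line in text.splitlines():
        # Assumed header shape: one token ending in ":" alone on the line.
        if re.match(r"^\s*\S+:\s*$", line):
            skipping = line.strip() == "Port0,1:"
        if not skipping:
            out.append(line)
    return "\n".join(out)

# Hypothetical sample text, not real hfi1stats output.
cn_sample = "Port0,1:\nTxPkt 0\nPort0,2:\nTxPkt 5\n"
opa_sample = "Port0:\nTxPkt 5\n"
```

Checking the two-index pattern first keeps the decision unambiguous, since only CN5000 files contain headers of the form `Port0,1:`.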

Collecting from Remote Nodes

Open the Command Palette with Ctrl+Shift+P (Windows/Linux) or Shift+Command+P (Mac) and run hfi1stats: Collect from Remote Node(s) via SSH.

Requires passwordless SSH from your workstation to the target nodes (as root, unless you choose a different SSH user).

The command will ask for:

  1. Hostname(s) — several formats accepted:
    • Single node: node6
    • Comma list: d001,d005,d012,d042
    • Range: d001-d010
  2. SSH user — defaults to root
  3. Save folder — where to store the .hfi1stats files
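
The three hostname formats can be expanded into a flat host list along the lines of the Python sketch below. It is an illustration under assumptions: that range endpoints share a prefix and are zero-padded to the width of the start number, and that comma lists and ranges may be mixed (the original does not say). The function name `expand_hosts` is hypothetical.

```python
import re

# prefix + start number, "-", optional repeated prefix, end number.
RANGE_RE = re.compile(r"([A-Za-z]+)(\d+)-(?:\1)?(\d+)")

def expand_hosts(spec):
    """Expand "node6", "d001,d005" or "d001-d010" into a list of hostnames."""
    hosts = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        m = RANGE_RE.fullmatch(part)
        if m:
            prefix, start, end = m.groups()
            lo, hi = int(start), int(end)
            if hi < lo:
                raise ValueError(f"descending range: {part}")
            width = len(start)  # keep the zero padding of the start number
            hosts.extend(f"{prefix}{n:0{width}d}" for n in range(lo, hi + 1))
        else:
            hosts.append(part)  # plain hostname, used as-is
    return hosts
```

Anything that does not match the range pattern, such as `node6` or a fully qualified name, is passed through unchanged.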

Each collected file is opened in the editor automatically. The header contains TX/RX totals and average speed:

# host        : node6.cluster.local
# collected   : 2026-04-26 09:15:03 UTC
# last reboot : 2026-04-01 08:42
# uptime      : 2198400 s
#
# TX transferred : 1.847 TB  (avg 0.852 GB/s since last reboot)
# RX transferred : 0.412 TB  (avg 0.190 GB/s since last reboot)

After collection, right-click → Analyze to run the full sanity check report.

Analysis Report

Analyze — Full Sanity Check Report writes a .analyzed.txt file next to the source. For multi-unit files (multiple HFI cards), each unit gets its own section.

Example output:

══════════════════════════════════════════════════════════════
  hfi1stats Analysis Report
  Source   : node6.hfi1stats
  Hardware : OPA100 (0.8 ns/flit, 100G)
  Generated: 2026-04-27 11:03:14
══════════════════════════════════════════════════════════════

╔════════════════════════════════════════════════════════════╗
  Unit 0
╚════════════════════════════════════════════════════════════╝

── 1. Normalized Units ────────────────────────────────────────────
Counter                Raw                   Actual Value
──────────────────────────────────────────────────────────────
TxPkt                  352679662K            352,679,662,000
TxPktVL0               352673233K            352,673,233,000
TxWords                132779278447K         1.328e+17
TxWait                 1062315987864K        1.062e+18
TxFlowStall            3051086K              3,051,086,000

Note: K suffix = ×1,000 (auto-applied by hfi1stats when value overflows column)

── 2. Sanity Checks ───────────────────────────────────────────────

① Avg Packet Size  (TxWords × 8 bytes / TxPkt)
   Avg size = 3008 bytes/packet
   ✅ Reasonable — typical RDMA mix

② VL0 Traffic Share  (TxPktVL0 / TxPkt)
   VL0  share: 99.998%  ✅ Single-SL workload, normal
   VL15 share: 0.000%  (management traffic)

③ Wait Attribution  (TxWaitVL0 / TxWait)
   VL0  accounts for: 99.977%  ✅ Consistent with VL traffic distribution

④ Credit Starvation  (TxFlowStall / TxWait)
   Credit starvation: 0.0029%
   ✅ Link is idle-waiting, not credit-starved

⑤ Traffic Asymmetry  (TX vs RX)
   TxPkt / RxPkt    = 1.37  — sender-heavy node
   TxWords / RxWords = 20707×  — large outbound payloads, tiny inbound (producer node)

⑥ Error Rates
   RxICrcErr = 381  /  RxPkt = 256937321K  →  BER ≈ 1.48e-12  ✅ Healthy link

⑦ RC Transport Health  (RcQacks NAK analysis)
   RcQacks = 33638  |  RcAcks = 0
   NAK rate = RcQacks / (RcAcks + RcQacks) = 100.0000%  ✗ High NAK rate — investigate
   Dominant: RcSeqNak=33638 → packets lost or reordered — check link errors, switch drops

⑧ PIO & Driver Send Path  (PioWait / PioDrain · KmemWait · TidWait · SendSched)
   PioWait = 0  PioDrain = 467 — no PIO traffic
   KmemWait = 0  ✅
   TidWait = 0  ✅
   SendSched = 0  ✅

⑨ Link-Layer Health  (DC / LTP counters)
   DcAccLTP = 339547394174121  DcGoodLTP = 339547394174123  →  ✅ Zero bad LTPs — perfect PHY
   DcTxReplay = 0  DcRxReplay = 0  ✅ No LTP replays
   CRC / multi-lane errors: all zero  ✅
   PHY / DC error counters: all zero  ✅
   FECN / BECN / mark: all zero — fabric uncongested  ✅

⑩ RC Key Ratios  (coalescing · NAK rate · retransmission · SDMA stall)
   RcAcks / TxPkt = 9.53e-7  →  1 ACK per ~1050118.0 pkts  (typical RC: 2–5)
   RcQacks / (RcAcks + RcQacks) = 100.0000%  (typical: < 0.01% clean, < 1% elevated)
   RcResend / TxPkt = 0.00e+0  (typical: < 1e-6 clean, < 1e-4 monitor)
   RcCrWait / TxPkt = 0.00e+0  (typical: < 1e-6 negligible)
   DmaWait / TxWait = 0.000%  (typical: < 1% negligible)

── 3. Summary ─────────────────────────────────────────────────────

✅ Traffic concentrated on VL0 (99.998%), VL15 management only
✅ Wait dominated by idle (TxFlowStall 0.003% of TxWait) — no credit pressure
✅ Per-VL wait decomposition consistent (TxWaitVL0 = 99.977% of TxWait)
✅ Asymmetric TX/RX word ratio (20707×) — producer-side node, normal
✅ Error counters negligible relative to traffic (BER ≈ 1.48e-12)
⚠ RC NAK rate 100.0000% — RcQacks=33638, investigate dominant cause
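
Three of the ratio checks in the example report (VL0 traffic share, TX/RX packet asymmetry, and RC NAK rate) can be reproduced with the Python sketch below. It uses the formulas printed in the report headings and the normalized counter values from the example. The function names and the dictionary layout are hypothetical and are not taken from the extension.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return None if denominator == 0 else numerator / denominator

def sanity_ratios(c):
    """Compute three report ratios from a {counter name: value} mapping."""
    return {
        # Check 2: TxPktVL0 / TxPkt
        "vl0_share": ratio(c["TxPktVL0"], c["TxPkt"]),
        # Check 5: TxPkt / RxPkt
        "tx_rx_pkt_ratio": ratio(c["TxPkt"], c["RxPkt"]),
        # Check 7: RcQacks / (RcAcks + RcQacks)
        "nak_rate": ratio(c["RcQacks"], c["RcAcks"] + c["RcQacks"]),
    }

# Values from the example report, with the K suffix already expanded.
example = {
    "TxPkt": 352_679_662_000,
    "TxPktVL0": 352_673_233_000,
    "RxPkt": 256_937_321_000,
    "RcQacks": 33638,
    "RcAcks": 0,
}
result = sanity_ratios(example)
print(f"VL0 share : {result['vl0_share'] * 100:.3f}%")
print(f"TxPkt/RxPkt: {result['tx_rx_pkt_ratio']:.2f}")
print(f"NAK rate  : {result['nak_rate'] * 100:.4f}%")
```

Returning None for a zero denominator lets a caller report "not applicable" instead of failing on files where a counter pair is entirely zero.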