PaxZas: Semantic GPU Performance Analysis Engine

PaxZas is a semantic GPU performance analysis engine: it turns NVIDIA PTX and SASS into architecture-aware performance signals (occupancy limits, memory posture, control and tensor patterns, stall-oriented diagnosis, and instruction mix), not just raw opcode tallies. Analysis is static (no kernel execution), runs inside VS Code, and needs no Python, no network, and no GPU. Point the extension at a

The published VS Code extension appears in the marketplace as PaxZas Engine; command IDs still use the

Analyzing a
| Preset | Architecture | SM |
|---|---|---|
| `a100` | Ampere A100 | `sm_80` |
| `ampere-like-default` | Ampere generic | `sm_80` |
| `rtx-4090` | Ada Lovelace | `sm_89` |
| `h100-sxm` | Hopper H100 SXM | `sm_90` |
| `h100-pcie` | Hopper H100 PCIe | `sm_90` |
| `h200` | Hopper H200 | `sm_90` |
| `b200` | Blackwell B200 | `sm_100` |
| `gb200` | Blackwell GB200 NVL | `sm_100` |
| `blackwell-consumer-default` | Blackwell consumer | `sm_120` |
When direct SM-count telemetry is unavailable, auto-mode also uses curated GPU-name heuristics (including RTX 4080 SUPER and RTX 4070 Ti SUPER) for better occupancy estimates.
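A name-based fallback of this kind can be sketched as a small lookup table. The SM counts below are the public spec values for those cards, but the table layout, the matching order, and the function name `guessSmCount` are assumptions for illustration, not the extension's actual code.

```typescript
// Hypothetical lookup table: more specific names are listed first so that
// a "SUPER" variant is not shadowed by a shorter pattern added later.
const SM_COUNT_BY_NAME: ReadonlyArray<[RegExp, number]> = [
  [/RTX 4080 SUPER/i, 80],
  [/RTX 4070 Ti SUPER/i, 66],
  [/RTX 4090/i, 128],
];

// Returns an SM count guessed from the marketing name, or undefined when
// the name is not in the table (the caller then keeps its generic default).
function guessSmCount(gpuName: string): number | undefined {
  for (const [pattern, smCount] of SM_COUNT_BY_NAME) {
    if (pattern.test(gpuName)) return smCount;
  }
  return undefined;
}

// Example: a name string as a driver tool might report it.
const guessedSmCount = guessSmCount("NVIDIA GeForce RTX 4080 SUPER");
```

Returning `undefined` for unknown names keeps the heuristic from silently inventing a value for hardware it has never seen.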
Occupancy Model
- Computes warp occupancy, blocks per SM, and the limiting factor (registers, shared memory, threads, or block limit) for any block size
- Sweep charts — occupancy, blocks/SM, and per-constraint resource limits plotted across all valid block sizes
- Waste metrics — unused threads, warps, registers, and shared bytes per SM at the current configuration
- Register what-if: how many fewer registers to reach the next occupancy tier
- Register margin recommendations are aligned to real per-warp register allocation granularity (actionable `__launch_bounds__` targets)
- Shared-memory what-if: when shared memory is the limiting factor, an actionable "shed N bytes/block to reach next tier" recommendation is offered (parallel to the register-side margin)
- Launch parameter inference from PTX hints and optional SASS register index; optional `grid=` hint caps the device-level "SMs active" string when `smCount` is known (tiny launches cannot occupy more SMs than blocks)
- SASS-only files accept the same launch overrides (`threads`, `shared`, `regs`, `grid`) instead of silently fixing only threads/shared defaults
- Low-level reference: Occupancy Model Synthesis
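The occupancy calculation and the register what-if described above can be sketched roughly as follows. This is a simplified model, not the extension's implementation: the type names, the function names, and the `A100_LIKE` limits are assumptions, and it ignores details such as per-block shared-memory reservation.

```typescript
// Illustrative per-SM limits (assumed, in the style of an sm_80 part).
interface SmLimits {
  maxWarps: number;     // resident warps per SM
  maxBlocks: number;    // resident blocks per SM
  registers: number;    // 32-bit registers per SM
  sharedBytes: number;  // shared memory per SM
  regAllocUnit: number; // registers are allocated per warp in this unit
  warpSize: number;
}
interface Launch { threadsPerBlock: number; regsPerThread: number; sharedPerBlock: number; }
type Limiter = "registers" | "shared memory" | "threads" | "block limit";

const A100_LIKE: SmLimits = {
  maxWarps: 64, maxBlocks: 32, registers: 65536,
  sharedBytes: 167936, regAllocUnit: 256, warpSize: 32,
};

function estimateOccupancy(launch: Launch, sm: SmLimits) {
  const warpsPerBlock = Math.ceil(launch.threadsPerBlock / sm.warpSize);
  // Round the per-warp register footprint up to the allocation unit.
  const regsPerWarp =
    Math.ceil((launch.regsPerThread * sm.warpSize) / sm.regAllocUnit) * sm.regAllocUnit;
  const limits: Array<[Limiter, number]> = [
    ["registers", regsPerWarp > 0
      ? Math.floor(Math.floor(sm.registers / regsPerWarp) / warpsPerBlock) : Infinity],
    ["shared memory", launch.sharedPerBlock > 0
      ? Math.floor(sm.sharedBytes / launch.sharedPerBlock) : Infinity],
    ["threads", Math.floor(sm.maxWarps / warpsPerBlock)],
    ["block limit", sm.maxBlocks],
  ];
  // The smallest block count wins; on a tie the earlier entry is reported.
  let [limiter, blocksPerSm] = limits[0];
  for (const [name, blocks] of limits) {
    if (blocks < blocksPerSm) { limiter = name; blocksPerSm = blocks; }
  }
  const occupancy = (blocksPerSm * warpsPerBlock) / sm.maxWarps;
  return { blocksPerSm, limiter, occupancy };
}

// Register what-if: how many registers per thread must be shed before one
// more block fits on the SM. Returns undefined if no reduction helps.
function registersToNextTier(launch: Launch, sm: SmLimits): number | undefined {
  const current = estimateOccupancy(launch, sm).blocksPerSm;
  for (let regs = launch.regsPerThread - 1; regs >= 1; regs--) {
    const trial = estimateOccupancy({ ...launch, regsPerThread: regs }, sm);
    if (trial.blocksPerSm > current) return launch.regsPerThread - regs;
  }
  return undefined;
}

const example = estimateOccupancy(
  { threadsPerBlock: 256, regsPerThread: 64, sharedPerBlock: 0 }, A100_LIKE);
```

Under these assumed limits, a 256-thread block at 64 registers per thread is register-limited to 4 blocks per SM (50% warp occupancy), and the what-if has to step down to 48 registers before a fifth block fits, because intermediate values round up to the same per-warp allocation.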
Bottleneck Diagnosis
- The legacy `heuristicBottleneck` label now consumes optional SASS features (same FLOP/byte weighting as the memory model) so it does not disagree with memory class on SASS-only or supplemental-SASS runs
- Fuses memory posture, stall profile, and pattern class into a primary and secondary bottleneck with the firing rule that triggered it
- Four stall profile flags: memory dependency, memory throttle, local memory (register spill), sync overhead
- Per-bottleneck optimization suggestions ranked by impact
- Low-level reference: Diagnosis Layer (`diagnoseKernel`)
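One way to picture "primary and secondary bottleneck with the firing rule" is an ordered rule list where the first two rules that fire are reported. The sketch below is purely illustrative: the rule names, their order, the `Signals` shape, and `sketchDiagnosis` are all hypothetical and are not the actual `diagnoseKernel` logic. Only the four stall flags and the pattern classes are taken from this README.

```typescript
// Hypothetical input shape fusing the three signal sources.
interface Signals {
  memoryBound: boolean; // memory posture from the memory model
  stalls: {
    memoryDependency: boolean;
    memoryThrottle: boolean;
    localMemory: boolean; // register spill
    syncOverhead: boolean;
  };
  patternClass: "tiled" | "streaming" | "reduction" | "compute_heavy" | "mixed";
}

interface Rule {
  name: string;       // reported as the firing rule
  bottleneck: string; // human-readable label
  fires: (s: Signals) => boolean;
}

// Hypothetical rules, ordered from most to least specific.
const RULES: Rule[] = [
  { name: "spill-traffic", bottleneck: "register spill",
    fires: s => s.stalls.localMemory },
  { name: "memory-bound-streaming", bottleneck: "memory bandwidth",
    fires: s => s.memoryBound && s.patternClass === "streaming" },
  { name: "memory-latency", bottleneck: "memory latency",
    fires: s => s.stalls.memoryDependency },
  { name: "sync-heavy", bottleneck: "synchronization",
    fires: s => s.stalls.syncOverhead },
];

// The first rule that fires is the primary bottleneck, the second one the
// secondary; either may be undefined when too few rules fire.
function sketchDiagnosis(s: Signals): { primary?: Rule; secondary?: Rule } {
  const fired = RULES.filter(r => r.fires(s));
  return { primary: fired[0], secondary: fired[1] };
}

const diagnosis = sketchDiagnosis({
  memoryBound: true,
  stalls: { memoryDependency: true, memoryThrottle: false, localMemory: false, syncOverhead: false },
  patternClass: "streaming",
});
```

Keeping the rule name next to the label is what lets a report say why a bottleneck was chosen, not only which one.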
Memory Model
- Integer-only SASS ALU (I9): when weighted FLOPs from `arithmetic_ops`/tensor/SFU/FP64 are all zero but `integer_ops > 0` (e.g. `ISCADD`-heavy code), the SASS FLOP proxy falls back to `integer_ops×2` so empty-PTX + SASS paths are not mis-read as zero compute
- Classifies the kernel as memory-bound or compute-friendly
- Arithmetic intensity (ops/byte), reuse ratio, cache policy, load/store balance and vectorization score
- Sub-32-bit precision aware: `LDG.E.U16` / `LDG.E.U8` (FP16, BF16, INT8 / FP8) and their store counterparts are costed at their true 2-byte / 1-byte widths, not the legacy 4-byte fallback
- FP64-aware FLOP proxy: `DFMA` / `DADD` / `DMUL` are counted as FP64 arithmetic and given an extra weighting (FP64 throughput is ~1/32 of FP32 on consumer GPUs), so HPC kernels are no longer mis-classified as memory-bound
- Cache-policy precedence: when both `.CG` and `.CS` loads exist, the dominant policy wins (streaming/L2/mixed); previously a single `.CG` would mask hundreds of streaming loads
- Distinguishes global, shared, and local (spill) traffic from SASS; falls back to PTX heuristics when SASS is absent
- Low-level reference: Memory Model Synthesis and Feature Fusion
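The FLOP proxy, the integer-only fallback, and width-aware byte costing can be sketched together. The counter names `arithmetic_ops`, `integer_ops`, and `fp64_arith_ops` appear in this README. `tensor_ops`, `sfu_ops`, the `globalAccesses` shape, and every weight in `WEIGHTS` are assumptions chosen only to make the example concrete.

```typescript
interface SassCounts {
  arithmetic_ops: number;
  tensor_ops: number;     // hypothetical counter name
  sfu_ops: number;        // hypothetical counter name
  fp64_arith_ops: number;
  integer_ops: number;
  // Global loads/stores grouped by access width in bits (hypothetical shape).
  globalAccesses: Array<{ bits: number; count: number }>;
}

// Hypothetical weights; the real engine's values are not stated in the README.
const WEIGHTS = { arithmetic: 2, tensor: 2, sfu: 1, fp64: 4 };

function flopProxy(c: SassCounts): number {
  const weighted =
    c.arithmetic_ops * WEIGHTS.arithmetic +
    c.tensor_ops * WEIGHTS.tensor +
    c.sfu_ops * WEIGHTS.sfu +
    c.fp64_arith_ops * WEIGHTS.fp64;
  // Integer-only fallback: avoid reporting zero compute for ALU-only code.
  if (weighted === 0 && c.integer_ops > 0) return c.integer_ops * 2;
  return weighted;
}

// Each access is costed at its true width, e.g. 16 bits -> 2 bytes.
function globalBytes(c: SassCounts): number {
  return c.globalAccesses.reduce((sum, a) => sum + (a.bits / 8) * a.count, 0);
}

function arithmeticIntensity(c: SassCounts): number {
  const bytes = globalBytes(c);
  return bytes === 0 ? Infinity : flopProxy(c) / bytes;
}

// Integer-only kernel with four 16-bit loads.
const integerOnly: SassCounts = {
  arithmetic_ops: 0, tensor_ops: 0, sfu_ops: 0, fp64_arith_ops: 0,
  integer_ops: 10, globalAccesses: [{ bits: 16, count: 4 }],
};
const intensity = arithmeticIntensity(integerOnly);
```

With these inputs the proxy is 10 × 2 = 20 ops over 4 × 2 = 8 bytes, an intensity of 2.5. Costing the same loads at the 4-byte fallback width would have halved it.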
Pattern Model
- Ratio semantics (I1): `shared_to_global` and `compute_to_memory` treat "divide by zero global ops" as unbounded reuse / compute (exported as a large finite sentinel for JSON safety), instead of collapsing to `0`
- Classifies as `tiled`, `streaming`, `reduction`, `compute_heavy`, or mixed
- Detects archetypes: `GEMM`, `CONV`, `ELEMENTWISE`, `STENCIL`, and others
- 15 micro-flags: register spill, uncoalesced loads/stores, atomic contention, missing tensor cores, SFU-heavy, over-synchronized, FP16 scalar, warp divergence, and more
- `over_synchronized` now fires on fully-unrolled outer loops: when PTX shows `loops = 0` because the compiler unrolled them, SASS back-edges (`BRA <addr>` whose target precedes the issuing PC) are used as a per-iteration substitute
- Hopper / Blackwell aware: `BARRIER` / `BMOV` / `BSSY` / `WARPSYNC` / `ARRIVE` / `WAIT` are recognised as barriers; `UTMALDG` / `UTMASTG` (TMA bulk copies), `LDGSTS` and `CP.ASYNC` (async global→shared) are counted as global memory traffic
- PTX-only utilisation fields are honest: `tensor_utilization_fraction` and `productive_instruction_fraction` are reported as undefined (rendered as `—`) when no SASS is available, instead of a misleading hard `0`
- Warp primitive detection: shuffle, vote, and reduction patterns
- Low-level reference: Pattern Model Synthesis and Feature Fusion
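Two of the ideas above are easy to show in isolation: back-edge detection and the finite sentinel for unbounded ratios. The sketch assumes cuobjdump-style SASS lines of the form `/*0040*/ @P0 BRA 0x20 ;`. The function names, the sentinel value, and the choice to return `0` when both operands are zero are assumptions, not the extension's code.

```typescript
// Matches "/*<pc>*/ [@pred] BRA[.mod] <0xtarget>" on one line.
const BRA_LINE = /\/\*([0-9a-f]+)\*\/\s+(?:@!?\w+\s+)?BRA\S*\s+(0x[0-9a-f]+)/i;

// Counts BRA instructions whose target address precedes the issuing PC.
function countBackEdges(sass: string): number {
  let count = 0;
  for (const line of sass.split("\n")) {
    const m = BRA_LINE.exec(line);
    if (!m) continue;
    const pc = parseInt(m[1], 16);
    const target = parseInt(m[2], 16);
    if (target < pc) count++;
  }
  return count;
}

// Hypothetical sentinel: large but finite, so it survives JSON export.
const UNBOUNDED_SENTINEL = 1e9;

// A zero denominator with a positive numerator means "unbounded", not 0.
function safeRatio(numerator: number, denominator: number): number {
  if (denominator !== 0) return numerator / denominator;
  return numerator > 0 ? UNBOUNDED_SENTINEL : 0;
}

const listing = [
  "/*0020*/ LDG.E R2, [R4.64] ;",
  "/*0030*/ IADD3 R0, R0, 0x1, RZ ;",
  "/*0040*/ @P0 BRA 0x20 ;",
  "/*0050*/ BRA 0x70 ;",
  "/*0060*/ EXIT ;",
].join("\n");
const backEdges = countBackEdges(listing);
```

The sentinel matters because `JSON.stringify` serialises `Infinity` as `null`, which a downstream consumer could easily misread as "no data".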
Instruction Mix (SASS)
- Category breakdown: arithmetic, tensor, SFU, global mem, shared mem, local mem, control/sync
- Full load/store width spectrum — LDG.128 / .64 / .32 / .16 (half / bf16) / .8 (int8 / fp8); STG equivalents
- Modern tensor / async / TMA paths: `LDGSTS`, `CP.ASYNC`, `UTMALDG`, `UTMASTG` are tracked as global ops with sub-counters (`async_global_loads`, `tma_ops`); `LDC` (constant bank) is counted separately so global byte accounting stays honest
- Tensor core op counts (WMMA / HMMA / WGMMA / IMMA / BMMA)
- Precision-specific counters: `fp16_arith_ops` (HFMA/HADD/HMUL) and `fp64_arith_ops` (DFMA/DADD/DMUL/DMNMX/DSETP)
- Atomic and warp-primitive counts (shuffle / vote / match)
- `branch` and `kernel_exit` are now distinct: RET / EXIT no longer inflate branch density on small kernels
- `back_edges`: backward `BRA` jumps that close a logical loop (used by `over_synchronized` when the compiler unrolled the visible loop)
- Productive instruction fraction and tensor utilization fraction (when SASS is present)
- Low-level reference: SASS Features (especially sections 2.1-2.9)
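A category breakdown of this kind amounts to mapping each instruction's base opcode to a bucket. The sketch below covers only a handful of opcodes and assumes cuobjdump-style lines. The `Category` names, the `OPCODE_CATEGORY` table, and both function names are illustrative, not the extension's internal feature schema.

```typescript
type Category =
  | "arithmetic" | "tensor" | "sfu" | "global_mem" | "shared_mem"
  | "local_mem" | "branch" | "kernel_exit" | "sync" | "other";

// Partial, hypothetical table keyed by base opcode (text before the first ".").
const OPCODE_CATEGORY: Record<string, Category> = {
  FFMA: "arithmetic", FADD: "arithmetic", FMUL: "arithmetic",
  HMMA: "tensor", IMMA: "tensor", BMMA: "tensor",
  MUFU: "sfu",
  LDG: "global_mem", STG: "global_mem", LDGSTS: "global_mem",
  LDS: "shared_mem", STS: "shared_mem",
  LDL: "local_mem", STL: "local_mem",
  BRA: "branch",
  EXIT: "kernel_exit", RET: "kernel_exit",
  BAR: "sync",
};

// Extracts the base opcode, skipping the "/*addr*/" prefix and any predicate.
function opcodeOf(line: string): string | undefined {
  const m = /^\s*(?:\/\*[0-9a-f]+\*\/)?\s*(?:@!?\w+\s+)?([A-Z][A-Z0-9_]*)/i.exec(line);
  return m ? m[1].toUpperCase() : undefined;
}

function countCategories(sass: string): Record<Category, number> {
  const counts: Record<Category, number> = {
    arithmetic: 0, tensor: 0, sfu: 0, global_mem: 0, shared_mem: 0,
    local_mem: 0, branch: 0, kernel_exit: 0, sync: 0, other: 0,
  };
  for (const line of sass.split("\n")) {
    const op = opcodeOf(line);
    if (op === undefined) continue; // blank or non-instruction line
    const category = OPCODE_CATEGORY[op];
    counts[category !== undefined ? category : "other"]++;
  }
  return counts;
}

const mix = countCategories([
  "/*0000*/ LDG.E.U16 R2, [R4.64] ;",
  "/*0010*/ FFMA R0, R2, R3, R0 ;",
  "/*0020*/ STG.E [R6.64], R0 ;",
  "/*0030*/ @P0 BRA 0x0 ;",
  "/*0040*/ EXIT ;",
].join("\n"));
```

Keeping `EXIT` / `RET` in their own bucket mirrors the distinction drawn above: on a five-instruction kernel, folding the exit into `branch` would double the apparent branch density.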
Roofline Chart
- Plots the kernel's arithmetic intensity against the FP32 roof and bandwidth slope for the selected GPU
- Region classification: memory-bound or compute-bound with a ridge-point marker
- Low-level reference: Memory Model Synthesis (bytes/flops and intensity derivation)
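The roofline classification reduces to comparing arithmetic intensity with the ridge point, peak throughput divided by memory bandwidth. A minimal sketch follows. The `GpuRoof` shape and `roofline` function are hypothetical, and the A100-style figures (19.5 TFLOP/s FP32, 1555 GB/s) are used only as an illustrative input, not as the values the extension ships.

```typescript
interface GpuRoof {
  peakGflops: number;   // FP32 roof, in GFLOP/s
  bandwidthGBs: number; // memory bandwidth slope, in GB/s
}

function roofline(intensity: number, gpu: GpuRoof) {
  // Ridge point: the intensity at which the bandwidth slope meets the roof.
  const ridge = gpu.peakGflops / gpu.bandwidthGBs;
  // Attainable throughput is capped by whichever limit is lower.
  const attainable = Math.min(gpu.peakGflops, intensity * gpu.bandwidthGBs);
  const region = intensity < ridge ? "memory-bound" : "compute-bound";
  return { ridge, attainable, region };
}

// Illustrative input only.
const a100Like: GpuRoof = { peakGflops: 19500, bandwidthGBs: 1555 };
const lowIntensity = roofline(2, a100Like);
const highIntensity = roofline(20, a100Like);
```

For this input the ridge sits near 12.5 FLOP/byte, so a kernel at intensity 2 falls on the bandwidth slope (3110 GFLOP/s attainable) while one at intensity 20 sits under the flat FP32 roof.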
Raw Feature Inspection
- Side-by-side PTX vs SASS instruction counts for every extracted feature
- Rows with differing values highlighted; column headers adapt to the available data source
- Low-level reference: PTX Features, SASS Features, and Feature Fusion
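The side-by-side view can be thought of as a join over feature names with a "differs" flag per row. The sketch below is an assumption about shape only: `FeatureMap`, `Row`, and `compareFeatures` are hypothetical names, and it chooses to flag a row only when both sources report a value and the values disagree.

```typescript
type FeatureMap = Record<string, number>;

interface Row {
  feature: string;
  ptx?: number;     // undefined when PTX did not report this feature
  sass?: number;    // undefined when SASS did not report this feature
  differs: boolean; // true only when both sides exist and disagree
}

function compareFeatures(ptx: FeatureMap, sass: FeatureMap): Row[] {
  // Union of feature names from both sources, de-duplicated and sorted.
  const names = Object.keys(ptx)
    .concat(Object.keys(sass))
    .filter((name, index, all) => all.indexOf(name) === index)
    .sort();
  return names.map(feature => {
    const p = feature in ptx ? ptx[feature] : undefined;
    const s = feature in sass ? sass[feature] : undefined;
    return { feature, ptx: p, sass: s, differs: p !== undefined && s !== undefined && p !== s };
  });
}

const rows = compareFeatures(
  { arithmetic_ops: 12, integer_ops: 4 },
  { arithmetic_ops: 16, integer_ops: 4, back_edges: 1 },
);
```

A row that exists on only one side is left unflagged here so that a SASS-only counter such as `back_edges` does not light up as a disagreement.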
Commands
| Command | Description |
|---|---|
| Paxzas: Kernel Analysis | Opens the full 10-tab analysis panel |
| Paxzas: Analyze CUDA File with Launch Spec | Same analysis with optional threads=…,shared=…,regs=…,grid=… overrides (grid caps the SM-util estimate when device SM count is known) |
Both commands are available from the Command Palette, editor title bar, editor right-click, and Explorer right-click on `.ptx` and `.sass` files.
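A launch spec in the `threads=…,shared=…,regs=…,grid=…` form can be parsed with a few lines of string handling. This is a sketch under stated assumptions, not the extension's parser: `parseLaunchSpec` and `activeSms` are hypothetical names, and silently dropping unknown keys and non-integer values is a design choice made for the example.

```typescript
interface LaunchSpec { threads?: number; shared?: number; regs?: number; grid?: number; }

const LAUNCH_KEYS = ["threads", "shared", "regs", "grid"] as const;

// Parses "key=value" pairs separated by commas; every key is optional.
function parseLaunchSpec(text: string): LaunchSpec {
  const spec: LaunchSpec = {};
  for (const part of text.split(",")) {
    const [rawKey, rawValue] = part.split("=").map(s => s.trim());
    const key = LAUNCH_KEYS.find(k => k === rawKey);
    const value = Number(rawValue);
    // Keep only known keys with a non-empty, non-negative integer value.
    if (key && rawValue !== undefined && rawValue !== "" &&
        Number.isInteger(value) && value >= 0) {
      spec[key] = value;
    }
  }
  return spec;
}

// A launch with fewer blocks than SMs cannot keep every SM busy.
function activeSms(gridBlocks: number, smCount: number): number {
  return Math.min(gridBlocks, smCount);
}

const parsedSpec = parseLaunchSpec("threads=256, shared=4096,regs=64,grid=8");
```

The `activeSms` helper reflects the cap described in the table: with `grid=8` on a device with far more SMs, at most 8 SMs can hold a block.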
Settings
`paxzas.gpuPreset`: default GPU for the capability dropdown. `auto` detects the local GPU via `nvidia-smi` when available; named presets force a specific architecture. Regardless of this setting, all presets are always shown in the panel dropdown.
Requirements
- VS Code ≥ 1.85
- Optional: `nvidia-smi` on PATH for automatic GPU detection in `auto` mode
Development
npm install
npm run compile # or: npm run watch
npm test
F5 in VS Code (with this folder open) launches an Extension Development Host.
Build a .vsix
./build-vsix.sh
# or
npm run vsix
Install locally: Extensions → ··· → Install from VSIX…
