CatalystOps — PySpark Optimizer

CatalystOps catches PySpark performance issues before they reach production. It runs 40+ anti-pattern checks instantly in the editor, validates schemas at edit time, analyzes Catalyst execution plans via Databricks dry run or SSH, inspects historical job runs from Spark event logs, and tracks Databricks spending in a built-in billing dashboard — without touching your data.

Install from the VS Code Marketplace


CatalystOps Demo


Features

Local Analysis — No Cluster Required

Runs instantly on every file open and save. Detects:

| Severity | Examples |
| --- | --- |
| Critical | collect(), crossJoin(), SQL injection via f-strings, for row in df.collect(), DLT CDC ordering bugs |
| Warning | toPandas(), coalesce(1), global window functions, withColumn in loops, streaming without watermark, UDFs in .filter(), .repartition(N) before write, repeated actions without .cache() |
| Info | Schema inference, SELECT *, missing write mode, ZORDER (use Liquid Clustering), unnamed streaming queries, MERGE without Deletion Vectors |

Streaming, Delta, and DLT pipelines have dedicated rule sets. Add # noqa: catalystops to suppress a line.

Hover cards — every issue shows a one-sentence explanation and a quick-fix code snippet on hover.
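To make the mechanics concrete, here is a minimal sketch of how an editor-local, line-based check of this kind can work. The rule patterns and messages below are hypothetical illustrations, not CatalystOps's actual rule set or implementation; it only shows the general shape: match known anti-patterns per line and honor the # noqa: catalystops suppression comment.

```python
import re

# Hypothetical rules for illustration only — not CatalystOps's real checks.
RULES = [
    (re.compile(r"for\s+\w+\s+in\s+\w+\.collect\(\)"), "critical",
     "driver-side row loop; use DataFrame operations instead"),
    (re.compile(r"\.coalesce\(1\)"), "warning",
     "forces a single partition; reconsider before writing"),
]

def scan(source: str):
    """Return (line_number, severity, message) tuples for matched lines."""
    issues = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if "# noqa: catalystops" in line:  # per-line suppression
            continue
        for pattern, severity, message in RULES:
            if pattern.search(line):
                issues.append((lineno, severity, message))
    return issues

sample = """\
for row in df.collect():
    process(row)
df.coalesce(1).write.parquet(path)  # noqa: catalystops
"""
print(scan(sample))  # line 3 is suppressed by the noqa comment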


Schema Validation — No Cluster Required

When a StructType or DDL schema is defined in the same file, CatalystOps validates column references and types at edit time:

  • Unknown column names with "did you mean?" suggestions
  • Type mismatches (numeric/string/date/array functions on wrong column types)
  • union(), intersect(), except() schema alignment — catches silent wrong-result bugs from column order mismatches
  • Join condition type mismatches

Schemas propagate through .filter(), .select(), .drop(), .withColumn(), .withColumnRenamed(). External sources (spark.table(), spark.read without .schema()) are skipped — no false positives.
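The "did you mean?" suggestions can be pictured as fuzzy matching of a referenced column name against the known schema. This is an assumed sketch using Python's standard-library difflib (the column names are made up; the extension's actual matching logic may differ):

```python
import difflib

# Hypothetical schema columns for illustration.
schema_columns = ["user_id", "event_time", "event_type", "payload"]

def suggest(column, known):
    """Return the closest known column name, or None if nothing is close."""
    matches = difflib.get_close_matches(column, known, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(suggest("usr_id", schema_columns))   # close typo -> suggestion
print(suggest("color", schema_columns))    # nothing similar -> None
```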


Static Cost Estimation — No Cluster Required

Annotate your file to get an instant dollar estimate via CodeLens:

```python
# @compute: nodes=4, cores=2, memory=16GB, rate=0.25

events = spark.read.parquet("s3://bucket/events")  # @size: 50GB
lookup = spark.read.csv("s3://bucket/lookup")      # @size: 200MB
```
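As a back-of-envelope illustration of what a static estimate from those annotations could look like: divide input size by assumed cluster throughput to get a runtime, then multiply by the node rate. The throughput figure and the per-node reading of rate are assumptions for this sketch, not CatalystOps's actual cost model.

```python
# Values from the @compute / @size annotations above.
nodes, cores_per_node, rate_per_node_hour = 4, 2, 0.25
input_gb = 50 + 0.2                # 50GB events + 200MB lookup

# Assumed effective scan throughput — illustrative only.
scan_gb_per_core_hour = 100.0

runtime_hours = input_gb / (nodes * cores_per_node * scan_gb_per_core_hour)
cost = runtime_hours * nodes * rate_per_node_hour
print(f"~{runtime_hours * 60:.1f} min, ~${cost:.4f}")
```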

Dry Run — Catalyst Plan Inspection

Submits a neutralized version of your script to a Databricks cluster, serverless compute, or SSH tunnel. All writes and actions are replaced with explain("formatted") captures — no data is modified.

Plan issues detected: broadcast vs sort-merge vs cartesian joins, shuffle exchanges, too-few partitions, repeated table scans (including cross-DataFrame), cache spill, missing partition filters, missing table statistics, SinglePartitionBottleneck, SortAggregate, GlobalWindow. Fully Photon-aware.
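A toy sketch of plan-text inspection (an assumed approach for illustration; the extension analyzes real explain("formatted") output and is more sophisticated than substring matching): scan the plan for operators that commonly signal trouble.

```python
# Hypothetical operator -> note mapping, for illustration only.
PLAN_FLAGS = {
    "CartesianProduct": "cartesian join — usually accidental",
    "SortMergeJoin": "sort-merge join — a broadcast hint may help small sides",
    "Exchange": "shuffle exchange — data movement across the cluster",
}

def flag_operators(plan_text):
    """Return notes for every flagged operator present in the plan text."""
    return [note for op, note in PLAN_FLAGS.items() if op in plan_text]

sample_plan = """== Physical Plan ==
*(5) SortMergeJoin [id], [id], Inner
+- Exchange hashpartitioning(id, 200)
"""
for note in flag_operators(sample_plan):
    print(note)
```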

Execution modes:

| Mode | How to enable |
| --- | --- |
| Interactive cluster | Set catalystops.databricks.clusterId |
| Serverless | Leave cluster ID blank; set executionMode: serverless |
| SSH tunnel | Set connection.sshTunnel.enabled: true + connectionName |

Job Run Analysis — Historical Runs

Analyze a past run directly from the Jobs sidebar — no re-execution needed. CatalystOps reads the Spark event log from DBFS, extracts physical plans, and opens an interactive DAG view showing operator trees, filter conditions in plain English, issue badges, and a View Source button.

Prerequisites: Cluster log delivery to DBFS must be enabled on the cluster. Serverless runs are not supported (no DBFS event log).
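For context on what the extension reads: a Spark event log is a JSON-lines file, one listener event per line. The minimal sketch below parses two hand-written sample events to recover a job's wall-clock time; real logs contain many more event types, and the extension's parsing is of course more involved.

```python
import json

# Two fabricated sample events in Spark's event-log shape (JSON lines).
log_lines = [
    '{"Event": "SparkListenerJobStart", "Job ID": 0, "Submission Time": 1700000000000}',
    '{"Event": "SparkListenerJobEnd", "Job ID": 0, "Completion Time": 1700000042000}',
]

events = [json.loads(line) for line in log_lines]
starts = {e["Job ID"]: e["Submission Time"]
          for e in events if e["Event"] == "SparkListenerJobStart"}
for e in events:
    if e["Event"] == "SparkListenerJobEnd":
        secs = (e["Completion Time"] - starts[e["Job ID"]]) / 1000
        print(f"job {e['Job ID']} ran {secs:.0f}s")
```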


Clusters Sidebar — One-Click SSH

The Clusters panel lists all interactive clusters with their current state. Click the SSH icon on any cluster to:

  1. Auto-start it if stopped (polls every 5 s)
  2. Run databricks ssh setup automatically
  3. Fix access mode / upgrade Spark if needed (one-click modal)
  4. Open VS Code Remote SSH

Right-click a cluster for Stop Cluster and Reset SSH Host (clears cached alias).

Requirements: Databricks CLI ≥ 0.269, VS Code Remote SSH extension, DBR 17+.


Billing Dashboard

Queries system.billing.usage via Databricks SQL and shows spend by user, job, and workload type. Supports Day / Week / Month / Custom ranges. Results are cached for 1 hour; click Refresh to force a live query.
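The aggregation the dashboard performs is essentially spend = DBUs × rate, grouped by user, job, or workload. The sketch below shows that grouping over fabricated sample rows; the real dashboard runs SQL against the system.billing.usage Unity Catalog table rather than aggregating in Python.

```python
from collections import defaultdict

# Fabricated usage rows for illustration only.
rows = [
    {"user": "alice", "workload": "JOBS", "dbus": 12.5, "rate": 0.40},
    {"user": "alice", "workload": "SQL",  "dbus": 3.0,  "rate": 0.70},
    {"user": "bob",   "workload": "JOBS", "dbus": 8.0,  "rate": 0.40},
]

spend_by_user = defaultdict(float)
for r in rows:
    spend_by_user[r["user"]] += r["dbus"] * r["rate"]  # DBUs x $/DBU

print(dict(spend_by_user))  # alice: 12.5*0.40 + 3.0*0.70; bob: 8.0*0.40
```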

Requirement: Unity Catalog System Tables must be enabled on your workspace.


MCP Server — AI Assistant Integration

CatalystOps exposes live analysis data to Claude, GitHub Copilot (VS Code 1.99+), Cursor, and any MCP-compatible client via a Streamable HTTP server that starts automatically on a dynamic port.

Tools:

| Tool | Description |
| --- | --- |
| analyze_pyspark | Static analysis on any code snippet |
| get_active_file_issues | Issues for the currently open file |
| get_plan_analysis | Plan issues from the last dry run |
| run_dry_run | Trigger a dry run and return results |
| get_billing_summary | Cached billing data (day/week/month) |
| refresh_billing | Force a live billing query |
| list_clusters | Workspace clusters with state and Spark version |
| list_job_runs | Jobs and their most recent run status |
| get_job_run_plan | Fetch plan issues from a historical run by ID |
| get_last_job_run_analysis | Plan issues from the last job analyzed in VS Code |

Resources: catalystops://issues/current, catalystops://plans/last, catalystops://billing/summary

Prompts: pyspark_code_review, optimize_spark_plan

VS Code Copilot discovers the server automatically. For other clients, add the URL shown in the CatalystOps Output panel:

```json
{ "servers": { "catalystops": { "url": "http://127.0.0.1:<port>/mcp" } } }
```

Getting Started

1. Install — search for CatalystOps in the Extensions panel or:

ext install CatalystOps.catalystops

Local analysis works immediately — no configuration needed.

2. Connect to Databricks (for dry-run plan analysis) — run CatalystOps: Configure Databricks Connection from the Command Palette.

The wizard auto-detects available auth methods and shows only what applies to you:

| Method | Requirement |
| --- | --- |
| Azure CLI | Signed into az login — picks up workspaces automatically |
| GCP ADC | Signed into gcloud auth application-default login |
| ~/.databrickscfg | Databricks CLI already configured |
| OAuth / Browser Login | Any workspace — log in via browser, no token needed |
| Personal Access Token | Workspace URL + token |

Cluster ID is prompted only when you run a dry run — not during connection setup.

3. Analyze — open a .py file. Local checks run automatically. Press ⌘⇧K / Ctrl+Shift+K for a full dry-run plan analysis.


Commands

| Command | Shortcut | Description |
| --- | --- | --- |
| Analyze Cost (Dry Run) | ⌘⇧K | Run local + cluster/serverless/SSH analysis |
| Analyze Selected Code | — | Analyze only the highlighted selection |
| Configure Databricks Connection | — | Interactive setup wizard |
| Show Report | — | Open a shareable HTML report |
| Show Billing Dashboard | — | Open the billing dashboard |
| Refresh Billing Data | — | Force a fresh billing query |
| Refresh Jobs List | — | Reload the Jobs sidebar |
| Refresh Clusters | — | Reload the Clusters sidebar |
| Connect via SSH | — | One-click SSH into a cluster |
| Stop Cluster | — | Stop a running cluster |
| Reset SSH Host | — | Clear cached SSH alias for a cluster |

Settings Reference

| Setting | Default | Description |
| --- | --- | --- |
| catalystops.databricks.host | "" | Databricks workspace URL |
| catalystops.databricks.token | "" | Personal access token |
| catalystops.databricks.clusterId | "" | Interactive cluster ID (blank = serverless) |
| catalystops.databricks.configPath | ~/.databrickscfg | Databricks CLI config file |
| catalystops.databricks.profile | DEFAULT | Config profile name |
| catalystops.databricks.executionMode | cluster | cluster \| serverless |
| catalystops.connection.sshTunnel.enabled | false | Use SSH tunnel for script execution |
| catalystops.connection.sshTunnel.connectionName | "" | SSH alias from databricks ssh setup |
| catalystops.ssh.shutdownDelay | 30m | Cluster idle time before SSH tunnel closes |
| catalystops.jobs.enabled | true | Show the Jobs sidebar panel |
| catalystops.analysis.enableLocalCodeAnalysis | true | Enable local anti-pattern detection |
| catalystops.analysis.enableRepeatedScanDetection | false | Warn when a source DataFrame is reused without .cache() |
| catalystops.analysis.autoAnalyzeOnSave | false | Auto-run dry run on save |
| catalystops.cost.dbuRatePerHour | 0.4 | DBU rate ($/hr) for cluster cost estimation |
| catalystops.cost.serverlessRatePerHour | 0.7 | Effective hourly cost for serverless estimation |
| catalystops.cost.queryBillingUsage | false | Query system.billing.usage after each serverless run for actual cost |
| catalystops.billing.warehouseId | "" | SQL warehouse ID for billing queries (auto-discovers if blank) |
| catalystops.mcp.enabled | true | Enable the MCP server |
| catalystops.mcp.port | 49152 | Port for the MCP server (restart VS Code to apply) |
| catalystops.debug | false | Log diagnostic details to the Output panel |

Development

Prerequisites: Node.js v16+, VS Code v1.85+

```shell
git clone https://github.com/lezwon/CatalystOps
cd CatalystOps
npm install

npm run compile   # TypeScript compilation
npm run watch     # Rebuild on save
npm run build     # Production bundle (esbuild)
npm test          # Run test suite
```

Press F5 in VS Code to launch an Extension Development Host.


License

Elastic License 2.0 (ELv2) — source is publicly available; hosting or redistributing as a competing service is not permitted.
