
TyPy Spark

Live PySpark column autocomplete, schema-on-hover, and missing-column diagnostics for Jupyter notebooks in VS Code.

No schema classes. No type annotations. No typedspark-style boilerplate. Just type df. and see your columns.

Features

Column autocomplete everywhere a column name goes

df.<here>                       # attribute access
df["<here>"]                    # bracket access
df.select("<here>")             # any string-arg method
df.select("a", "<here>")        # any position
df.withColumn("new", col("<here>"))   # nested col() refs
df.filter("<here> > 5")         # SQL strings

Use the arrow keys or click to select. Standard VS Code autocomplete UX.

Hover any DataFrame variable

Shows the schema as a table with column names and types. Click "🔍 Search N columns" in the hover to open a fuzzy-search picker over all columns — type any part of a column name or its type (string, bigint, etc.).

Searchable schema explorer

Cmd+K Cmd+S (macOS) / Ctrl+K Ctrl+S (Linux/Windows) — opens a quick-pick scoped to whichever DataFrame your cursor is on. Selecting a column inserts it at your cursor.

If your cursor isn't on a DataFrame, you get a global search across every DataFrame currently in scope, with results formatted as df.column_name.

Sidebar tree

Activity bar → TyPy Spark icon → tree view of every DataFrame in the kernel, expandable to show columns and their types. Click any DataFrame to open the search picker.

Missing-column diagnostics

Red squiggle on df.bad_col, df["bad_col"], df.select("bad_col"), etc. The error message lists what's actually available so you don't have to guess.
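
For instance, given a hypothetical trips DataFrame with columns trip_id, fare, and started_at:

trips.select("faer")    # red squiggle: unknown column "faer"
trips.fare              # resolves fine
# The diagnostic for the first line lists trip_id, fare, started_at.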

Static flow analysis

Schemas update after each cell execution, but autocomplete also works on transformations you've written but haven't run yet. Recognized patterns (see the example after this list):

  • x = df (plain alias)
  • x = df.alias(...), .filter(...), .where(...), .limit(...), .distinct(...), .dropDuplicates(...), .cache(), .persist(), .repartition(...), .coalesce(...), .orderBy(...), .sort(...)
  • x = df.drop("a", "b")
  • x = df.withColumn("new_col", ...)
  • x = df.withColumnRenamed("old", "new")
  • x = df.select("a", "b")
  • x = df.toDF("a", "b", "c")
  • x = df.join(other, "key") (and ["k1", "k2"], on=..., how=... including left_semi / left_anti)
  • x = df.groupBy("a", "b").agg(F.sum("x").alias("total"), F.count("*")) — including .count(), .sum(...), .avg(...), .min(...), .max(...) shortcuts and dict-form .agg({"col": "sum"})
  • Chains of all of the above: df.join(...).filter(...).groupBy(...).agg(...).orderBy(...) resolves end-to-end
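
For example, this cell resolves end-to-end before it has ever been run, assuming orders and users are hypothetical DataFrames created in earlier, already-executed cells:

from pyspark.sql import functions as F

daily = (
    orders.join(users, "user_id", how="left")          # columns from both sides
          .filter(F.col("amount") > 0)                 # schema unchanged
          .withColumn("day", F.to_date("created_at"))  # adds "day"
          .groupBy("day")
          .agg(F.sum("amount").alias("revenue"),
               F.count("*").alias("n_orders"))
)
# Typing daily. here offers exactly: day, revenue, n_orders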

Requirements

  • VS Code 1.85 or newer
  • The Jupyter extension (ms-toolsai.jupyter) — auto-installed as a dependency
  • An active kernel with PySpark available — TyPy Spark reads from a running Python kernel via runtime introspection. It works anywhere VS Code can attach to such a kernel:
    • Standard .ipynb notebooks (local or remote)
    • Apache Zeppelin notebooks exported to .ipynb and opened in VS Code
    • Databricks notebooks via the Databricks VS Code extension (sync down as .ipynb, kernel runs against your cluster)
    • Remote Jupyter servers
    • Spark Connect / Databricks Connect setups (kernel local, cluster remote — works transparently; minimal sketch below)
    • Remote-SSH workspaces — extension installs on the remote side automatically

If you don't have a running kernel with DataFrames in scope, the extension stays quiet — there's nothing to introspect.
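
For the Spark Connect / Databricks Connect case, a minimal sketch (the endpoint and table name are made-up examples):

from pyspark.sql import SparkSession

# The kernel runs locally; the cluster sits behind the Connect endpoint.
spark = SparkSession.builder.remote("sc://spark-host:15002").getOrCreate()
df = spark.read.table("events")
# Introspection now works exactly as it would against a local SparkSession.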

Usage

  1. Open a .ipynb.
  2. Attach to a kernel that has PySpark.
  3. Run any cell that creates a DataFrame:
   df = spark.read.parquet("...")
  4. In any cell, start typing df. — the dropdown appears.

That's the whole workflow. Schemas refresh automatically after every successful cell execution. Manual refresh: TyPy: Refresh schemas from the Command Palette.

Commands

Command                          Default keybinding
TyPy: Search DataFrame columns   Cmd+K Cmd+S / Ctrl+K Ctrl+S
TyPy: Refresh schemas            —

How it works

When you run a cell, TyPy Spark silently executes a small Python probe in the same kernel. The probe walks the kernel's globals, finds every PySpark DataFrame (or wrapper around one), and dumps the schemas as JSON. The extension caches them and powers completion, hover, diagnostics, and the sidebar from that cache.
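
The probe itself ships inside the extension, but a minimal sketch of the idea (all names here are illustrative, not the extension's actual code) looks like:

import json

def _probe(namespace):
    schemas = {}
    for name, obj in namespace.items():
        # Duck-type rather than import pyspark: anything exposing a
        # schema with .fields is treated as DataFrame-like.
        fields = getattr(getattr(obj, "schema", None), "fields", None)
        if fields is None:
            continue
        try:
            schemas[name] = [
                {"name": f.name, "type": f.dataType.simpleString()}
                for f in fields
            ]
        except Exception:
            continue  # skip anything that merely looks like a DataFrame
    return json.dumps(schemas)

print(_probe(globals()))  # the extension reads this JSON back over Jupyter messaging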

This cache, combined with a static analyzer that walks the current cell from the top down to your cursor, lets transformations like x = df.filter(...).select("a", "b") resolve correctly even before the cell has been run, as long as the upstream DataFrame (df) is something the kernel already knows about.
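
To make the static side concrete, here is a toy resolver for a few of the recognized patterns (an assumption about the approach, not the extension's code):

# Column lists as reported by the last kernel probe.
KNOWN = {"df": ["a", "b", "c"]}

def resolve(base, calls):
    cols = list(KNOWN[base])
    for method, args in calls:
        if method == "select":
            cols = list(args)                     # keep only the selected names
        elif method == "drop":
            cols = [c for c in cols if c not in args]
        elif method == "withColumn" and args[0] not in cols:
            cols.append(args[0])                  # new column appended
        # filter / where / orderBy / limit ... leave the schema unchanged
    return cols

# x = df.filter(...).drop("c").withColumn("d", ...)
print(resolve("df", [("filter", ()), ("drop", ("c",)), ("withColumn", ("d",))]))
# -> ['a', 'b', 'd']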

License

MIT — see LICENSE.
