EMR Serverless PySpark Notebook
A VS Code / Cursor extension for running PySpark and Spark SQL in
Features
Prerequisites
Installation (development)
Quick start
Sidebar
The EMR Serverless activity bar has two views.
Applications
Tree structure: region → application → Livy sessions.
Only applications with
Session Presets
Saved Livy
Preset fields: name, execution role ARN, driver/executor memory and cores, executor count, heartbeat timeout, optional TTL, and free-form
Workspace presets file
Commit
Configure the file path with
Status bar
Two items appear on the left when the extension is active:
Kernel picker
Each notebook uses one of two controllers:
Use the kernel picker (Select EMR Serverless Session) or run a cell to connect. Running cells while disconnected opens session selection. Disconnect (
Commands
Settings
Click the $(key) item in the status bar to change profile. Region comes from the selected profile in
Notebook format (
| Magic | Supported |
|---|---|
| %%sql | Yes: set cell language to SQL or use %%sql at the top of a Python cell |
| %pip install … | Yes: runs python -m pip … on the Livy driver (same as Jupyter) |
| !pip install … | Yes: same as %pip |
Examples:
%pip install pandas
!pip install --quiet scikit-learn
Other %pip subcommands work too (%pip list, %pip show numpy, etc.). On install, packages go to a session directory on the driver, sys.path is updated immediately, and a zip is sent to executors via addPyFile (pure-Python packages only).
Limitations on EMR Serverless (unlike classic EMR Notebooks with sc.install_pypi_package):
- Packages with native binaries (e.g. some builds of pandas, numpy) may install but still fail on executors; use a venv archive on S3 in session presets for production deps.
- Run %pip install in one cell, then import in the next cell (or in the same cell after the magic line).
- If an import was attempted before install in the same session, importlib.reload won't fix the failed first import; start a new Livy session.
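The install flow and the native-binary limitation can be illustrated with a small stdlib-only sketch. This is not the extension's code: `find_native_files`, `zip_pure_python`, and `make_importable` are hypothetical helper names, and `target_dir` stands in for the per-session install directory on the driver. In a live session the resulting zip would be passed to `spark.sparkContext.addPyFile(...)`.

```python
import sys
import zipfile
from pathlib import Path

# Suffixes of compiled extension modules, which cannot be imported from a zip.
NATIVE_SUFFIXES = {".so", ".pyd", ".dylib", ".dll"}


def find_native_files(target_dir: Path) -> list[Path]:
    """Return compiled files under target_dir. A non-empty result suggests the
    package will not work when shipped to executors as a zip."""
    return sorted(p for p in target_dir.rglob("*") if p.suffix in NATIVE_SUFFIXES)


def zip_pure_python(target_dir: Path, zip_path: Path) -> Path:
    """Zip target_dir with package directories at the archive root,
    the layout zipimport expects."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(target_dir.rglob("*")):
            if path.is_file() and "__pycache__" not in path.parts:
                zf.write(path, path.relative_to(target_dir).as_posix())
    return zip_path


def make_importable(target_dir: Path) -> None:
    """Put the install directory at the front of sys.path on the driver."""
    entry = str(target_dir)
    if entry not in sys.path:
        sys.path.insert(0, entry)
```

Checking `find_native_files` before zipping is one way to decide early whether a dependency belongs in a venv archive on S3 instead.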
Browse tables with SHOW TABLES IN spark_catalog.stage or spark.table(...) in notebook cells.
Do not rely on SparkSession.builder in notebook cells for catalog setup — Livy already created spark with session conf; builder .config() for catalogs is ignored. The extension warns when a cell attempts this.
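A check of this kind could be approximated with a regular expression over the cell source. The sketch below is an assumption about how such a warning might be triggered, not the extension's actual detection logic; `uses_builder_catalog_config` is a hypothetical name.

```python
import re

# Matches SparkSession.builder followed (possibly on later lines) by a
# .config("spark.sql.catalog...") or .config("spark.sql.extensions...") call.
_BUILDER_CATALOG = re.compile(
    r"SparkSession\s*\.\s*builder\b.*?\.config\(\s*[\"']spark\.sql\.(catalog|extensions)",
    re.DOTALL,
)


def uses_builder_catalog_config(cell_source: str) -> bool:
    """True when a cell tries to configure catalogs through SparkSession.builder."""
    # Drop full-line comments so commented-out code does not trigger the warning.
    code = "\n".join(
        line for line in cell_source.splitlines() if not line.lstrip().startswith("#")
    )
    return _BUILDER_CATALOG.search(code) is not None
```

A regex is a heuristic: it can miss builders assigned to an intermediate variable, which is acceptable for a warning but not for enforcement.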
Iceberg catalogs (spark_catalog + glue_catalog)
Spark registers Iceberg catalogs at session creation only. The extension merges catalog conf into the Livy session body (settings + session presets). Both catalogs can coexist:
| Catalog | Typical use | Config source |
|---|---|---|
| spark_catalog | EMR default (SparkSessionCatalog on Glue) | emrServerless.icebergCatalog.sessionConf |
| glue_catalog | Notebooks with explicit GlueCatalog + warehouse | emrServerless.icebergCatalog.glueCatalog or preset sparkConf |
Enable glue_catalog in settings (replace warehouse):
"emrServerless.icebergCatalog.glueCatalog": {
  "enabled": true,
  "name": "glue_catalog",
  "warehouse": "s3://your-stage-bucket/"
}
Or put full keys in a session preset under sparkConf, or in emrServerless.icebergCatalog.additionalCatalogConf. Then create a new Livy session (attached sessions keep their original conf).
Default Iceberg/Glue session conf:
{
  "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
  "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
  "spark.sql.catalog.spark_catalog.type": "glue"
}
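How the settings and preset values might combine into one Livy session body is sketched below. The `catalog-impl` and `warehouse` keys are standard Iceberg catalog properties, but which keys the extension actually emits, and the precedence order (preset `sparkConf` last), are assumptions. `glue_catalog_conf` and `build_session_body` are hypothetical names.

```python
from typing import Optional

DEFAULT_SESSION_CONF = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
    "spark.sql.catalog.spark_catalog.type": "glue",
}


def glue_catalog_conf(setting: dict) -> dict:
    """Expand the glueCatalog setting into Spark conf keys.
    Assumption: standard Iceberg GlueCatalog properties; the extension may emit others."""
    if not setting.get("enabled"):
        return {}
    prefix = f"spark.sql.catalog.{setting.get('name', 'glue_catalog')}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.warehouse": setting["warehouse"],
    }


def build_session_body(glue_setting: dict, preset_spark_conf: Optional[dict] = None) -> dict:
    """Merge defaults, then the glue catalog, then preset sparkConf (later wins)."""
    conf = {
        **DEFAULT_SESSION_CONF,
        **glue_catalog_conf(glue_setting),
        **(preset_spark_conf or {}),
    }
    return {"kind": "pyspark", "conf": conf}
```

Because the merged `conf` is sent only when the session is created, changing any of these inputs has no effect on a session that is already running.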
IAM permissions
Your IAM user/role needs at minimum:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "emr-serverless:ListApplications",
        "emr-serverless:GetApplication",
        "emr-serverless:StartApplication",
        "emr-serverless:StopApplication",
        "emr-serverless:AccessLivyEndpoints",
        "emr-serverless:GetResourceDashboard",
        "emr-serverless:GetDashboardForJobRun"
      ],
      "Resource": "arn:aws:emr-serverless:*:ACCOUNT_ID:/applications/*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::ACCOUNT_ID:role/EMRServerlessExecutionRole",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": "emr-serverless.amazonaws.com"
        }
      }
    }
  ]
}
Replace ACCOUNT_ID and the execution role ARN with your values.
Spark UI and driver logs
After connecting, the extension fetches a Spark UI URL via GetDashboardForJobRun (using the Livy session appId or session id), with GetResourceDashboard as fallback. The link appears in the first cell output and a notification offers Open Spark UI / Refresh Link. URLs expire after about one hour — use Refresh Spark UI Link.
Driver logs: Spark UI → Executors tab → driver row → Logs.
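The fetch-with-fallback and expiry behaviour described above can be sketched as a small cache. This is an illustration, not the extension's implementation: `SparkUiLink` is a hypothetical name, the fetchers are injected callables so the example runs without AWS access, and the 55-minute refresh interval is an assumption derived from the "about one hour" expiry.

```python
import time
from typing import Callable, Optional, Sequence


class SparkUiLink:
    """Caches a dashboard URL and refetches it shortly before it expires."""

    def __init__(
        self,
        fetchers: Sequence[Callable[[], str]],
        ttl_seconds: float = 55 * 60,  # assumption: refresh a little before the ~1 hour expiry
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetchers = fetchers
        self._ttl = ttl_seconds
        self._clock = clock
        self._url: Optional[str] = None
        self._fetched_at = 0.0

    def _fetch(self) -> str:
        errors = []
        for fetch in self._fetchers:  # primary first, then fallbacks in order
            try:
                return fetch()
            except Exception as exc:  # broad on purpose: any failure triggers the fallback
                errors.append(exc)
        raise RuntimeError(f"all dashboard fetchers failed: {errors}")

    def get(self, force_refresh: bool = False) -> str:
        """Return the cached URL, refetching when forced or when the TTL has passed."""
        expired = self._url is None or self._clock() - self._fetched_at >= self._ttl
        if force_refresh or expired:
            self._url = self._fetch()
            self._fetched_at = self._clock()
        return self._url
```

In practice the first fetcher might wrap a GetDashboardForJobRun call and the second a GetResourceDashboard call; `get(force_refresh=True)` corresponds to the Refresh Spark UI Link action.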
Session isolation
AWS enforces Livy session isolation per IAM principal. You can only attach to sessions created by the same credentials.
Development
Scripts
| Script | Description |
|---|---|
| npm run build | Bundle extension (dist/extension.js) and table renderer (dist/tableRenderer.js) |
| npm run watch | Rebuild on file changes |
| npm run typecheck | TypeScript check without emit |
| npm run package | Build and create a .vsix in releases/ |
| npm run vscode:prepublish | Pre-publish build hook |
Build uses esbuild (Node 18 target for the extension host, ESM for the notebook renderer webview).
Package for sharing (.vsix)
Create an installable extension bundle for other machines:
npm install
npm run package
This builds the extension and writes releases/emr-serverless-pyspark-<version>.vsix.
Options:
npm run package -- --out ./releases # output directory (default: releases/)
npm run package -- --skip-build # package without rebuilding
npm run package -- --pre-release # mark as pre-release in metadata
The script refreshes media/icon.png from media/icon.svg when rsvg-convert is available (macOS: brew install librsvg). icon.png is also committed so packaging works without it.
Install the .vsix on another machine:
- VS Code: code --install-extension releases/emr-serverless-pyspark-0.1.0.vsix
- Cursor: cursor --install-extension releases/emr-serverless-pyspark-0.1.0.vsix
- Or: Extensions sidebar → ⋯ → Install from VSIX…
For local builds, package.json version controls the .vsix filename. In CI, semantic-release bumps the version automatically on merge to main.
Releases (CI/CD)
Releases are automated with semantic-release on every push to main.
Workflow
- Create a feature branch from main.
- Open a PR with a Conventional Commits title (squash merge uses the PR title as the commit message).
- CI runs typecheck, build, and a packaging smoke test.
- Merge to main: the release workflow bumps the version, updates CHANGELOG.md, builds a .vsix, and publishes a GitHub Release.
Version bumps (from merged PR titles)
| PR title prefix | Release | Example |
|---|---|---|
| fix: | patch | fix(livy): retry on 503 → 0.1.0 → 0.1.1 |
| feat: | minor | feat(sidebar): session presets → 0.1.0 → 0.2.0 |
| feat!: or BREAKING CHANGE: | minor while on 0.x | feat!: drop legacy kernel API → 0.1.0 → 0.2.0 |
| chore:, docs:, ci: | none | No GitHub Release (CI still runs) |
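The table can be read as a small decision function. The sketch below mirrors the rows above; it is not the project's semantic-release configuration, `release_type` and `bump` are hypothetical names, and the major bump for breaking changes from 1.0.0 onward is an assumption the table does not cover.

```python
import re
from typing import Optional

# type, optional (scope), optional "!" marker, then ": "
_TITLE = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<bang>!)?:\s")


def release_type(pr_title: str, current_version: str, body: str = "") -> Optional[str]:
    """Map a Conventional Commits PR title to 'major', 'minor', 'patch' or None."""
    match = _TITLE.match(pr_title)
    if not match:
        return None
    breaking = bool(match.group("bang")) or "BREAKING CHANGE:" in body
    if breaking:
        # Per the table: breaking changes are a minor bump while on 0.x.
        # Assumption: from 1.0.0 onward they become a major bump.
        return "minor" if current_version.startswith("0.") else "major"
    return {"feat": "minor", "fix": "patch"}.get(match.group("type"))


def bump(version: str, release: Optional[str]) -> str:
    """Apply a release type to a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if release == "major":
        return f"{major + 1}.0.0"
    if release == "minor":
        return f"{major}.{minor + 1}.0"
    if release == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return version  # no release
```

For example, `bump("0.1.0", release_type("fix(livy): retry on 503", "0.1.0"))` yields `"0.1.1"`, matching the first row.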
Install from a release
Download the .vsix from the GitHub Release assets, then:
code --install-extension emr-serverless-pyspark-X.Y.Z.vsix
cursor --install-extension emr-serverless-pyspark-X.Y.Z.vsix
Release baseline
Baseline tag v0.1.0 is set on main. Only releasable merges (feat:, fix:) produce new GitHub Releases.
Project layout
src/
  extension.ts              # Activation, command registration
  aws/                      # EMR Serverless SDK client, config, Iceberg helpers
  livy/                     # SigV4 Livy HTTP client, session, code transforms
  emr/connectionManager.ts  # Notebook ↔ Livy session bindings
  notebook/                 # Serializer, controller, kernel manager, ipynb compat
  browser/                  # Sidebar tree providers and context-menu actions
  session/                  # Session presets store and Livy body builder
  ui/                       # Status bar, kernel picker, connect wizard, preset editor
  output/                   # Livy result → notebook output mappers
  renderer/                 # DataFrame table webview renderer
media/                      # Icons (emr-serverless.svg, icon.svg) and renderer CSS
scripts/spike.mjs           # Standalone AWS / Livy connectivity test
Validate AWS connectivity (spike)
node scripts/spike.mjs
EMR_APPLICATION_ID=00fxxxxxxxx EMR_EXECUTION_ROLE_ARN=arn:aws:iam::...:role/... node scripts/spike.mjs
The spike lists Livy-enabled applications, optionally starts a test session, and prints dashboard URLs — useful before debugging the extension.
Architecture
┌─────────────────────────────────────────────────────────────┐
│  VS Code / Cursor UI                                        │
│  ┌──────────────┐  ┌─────────────┐  ┌─────────────────────┐ │
│  │ Sidebar tree │  │ Notebook    │  │ Status bar          │ │
│  │ + presets    │  │ controller  │  │                     │ │
│  └──────┬───────┘  └──────┬──────┘  └─────────────────────┘ │
└─────────┼─────────────────┼─────────────────────────────────┘
          │                 │
          ▼                 ▼
  ConnectionManager ──► LivySession
          │                 │
          ▼                 ▼
  EmrServerlessService  LivySigV4Client
  (AWS SDK)             (SigV4 HTTP)
          │                 │
          ▼                 ▼
  EMR Serverless API    https://{appId}.livy.emr-serverless-services.{region}.amazonaws.com
- Control plane: @aws-sdk/client-emr-serverless (list/start/stop applications, dashboard URLs)
- Data plane: SigV4-signed HTTP to the per-application Livy endpoint
- Credentials: @aws-sdk/credential-providers (default chain or explicit profile via emrServerless.awsProfile)
- Region: from the active profile in ~/.aws/config (or AWS_REGION when profile is auto)
- Table renderer: custom MIME type application/vnd.emr-spark.table+json rendered in a notebook webview
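Region resolution and endpoint construction from the list above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the extension's code: `livy_endpoint` and `resolve_region` are hypothetical names, and treating the literal string `"auto"` as the profile value is an assumption. The config text is passed in as a string so the example does not read a real `~/.aws/config`.

```python
import configparser
from typing import Mapping, Optional


def livy_endpoint(application_id: str, region: str) -> str:
    """Per-application Livy endpoint, following the URL pattern in the diagram."""
    return f"https://{application_id}.livy.emr-serverless-services.{region}.amazonaws.com"


def resolve_region(profile: str, aws_config_text: str, env: Mapping[str, str]) -> Optional[str]:
    """Region from the profile's section in ~/.aws/config, or AWS_REGION for 'auto'."""
    if profile == "auto":
        return env.get("AWS_REGION")
    parser = configparser.ConfigParser()
    parser.read_string(aws_config_text)
    # In ~/.aws/config named profiles use "[profile NAME]"; the default is "[default]".
    section = "default" if profile == "default" else f"profile {profile}"
    return parser.get(section, "region", fallback=None)
```

A caller would read `~/.aws/config` into a string, resolve the region, and pass it to `livy_endpoint` together with the application ID before signing the request.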
License
MIT