KF SQL Formatter (Databricks/Spark)

A custom token-based SQL formatter for VS Code that formats Databricks/Spark SQL with 100% idempotent output.

✨ Features

🎯 100% Idempotent - Format twice = same result (140/140 tests passing)
📏 Width-Aware Alignment - Automatic 80/120 character width preference per statement
📐 Content-Aware Separators - Comment separators snap to 80 or 120 dashes
📍 Leading Commas - SELECT list items formatted with leading commas (,col)
🔤 Smart Keyword Casing - Configurable keyword case (lower, upper, or preserve)
🔗 Multi-word JOIN Handling - Keeps INNER JOIN, LEFT JOIN, etc. together
📦 CTE Indentation - Perfect context tracking for WHERE/GROUP BY in CTEs
✨ Trailing Whitespace Removal - Clean, consistent output
⚡ Spark SQL Support - Arrays, maps, structs, named_struct, explode, lateral view
🧪 SQL Test Generator - Generate test queries to debug WHERE/JOIN filtering logic

📦 Installation

Option 1: VS Code Marketplace (Recommended)

Search for "KF SQL Formatter" in VS Code Extensions or install from: VS Code Marketplace - KF SQL Formatter

Option 2: Direct Download

⬇️ Download Latest Release (v0.4.3)

After downloading:

Open VS Code
Press Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (Mac)
Type "Extensions: Install from VSIX..."
Select the downloaded .vsix file

Option 3: Build from Source

git clone https://github.com/k-f-/kf.sql.formatter.git
cd kf.sql.formatter
npm install
npm run compile
npm run package

🚀 Usage

Open any .sql, .hql, or .spark.sql file
Run Format Document (Shift+Alt+F on Windows/Linux, Shift+Option+F on Mac)
Your SQL will be formatted with consistent style!

🧪 SQL Test Generator

Generate test queries to debug WHERE and JOIN filtering logic. Useful for identifying why data disappears between pipeline layers.

How to Use

Open a SQL file with a SELECT query
Use one of these methods:
- Keyboard: Ctrl+Shift+T (Windows/Linux) or Cmd+Shift+T (Mac)
- Right-click: Select KF.SQL: Generate SQL Tests from context menu
- Command Palette: Ctrl+Shift+P → KF.SQL: Generate SQL Tests
A new tab opens with your test query

What It Does

Extracts WHERE conditions → Creates TEST columns showing PASS/FAIL for each filter
Extracts CTE conditions → Creates CTE columns (CTE1, CTE2) for conditions inside WITH clauses
Extracts JOIN conditions → Creates JOIN columns showing PASS/FAIL/N/A for each join
Converts INNER JOIN → LEFT JOIN → Preserves unmatched rows for debugging
Comments out WHERE clause → Returns ALL rows so you can see what would be filtered
Adds dual-comment pattern → --[TEST1] markers link results to original conditions

Example

Input:

SELECT * FROM users u
INNER JOIN orders o ON u.id = o.user_id
WHERE u.status = 'active' AND o.amount > 100

Output (automatically formatted using your workspace settings):

select
     *
     -- [Test Indicators] ----------
    ,CASE
        WHEN (u.status = 'active' and o.amount > 100) THEN 'Y'
        ELSE 'N'
    END               as TEST1                                                 --[TEST1]
    ,CASE
        WHEN u.status = 'active' THEN 'Y'
        ELSE 'N'
    END               as TEST1_1                                               --[TEST1.1]
    ,CASE
        WHEN o.amount > 100 THEN 'Y'
        ELSE 'N'
    END               as TEST1_2                                               --[TEST1.2]
    ,CASE
        WHEN o.user_id is null THEN 'N/A'
        WHEN u.id = o.user_id THEN 'Y'
        ELSE 'N'
    END               as JOIN1                                                 --[JOIN1]
from users            as u
left join orders      as o
    on u.id = o.user_id                                                        --[JOIN1]
-- WHERE clause commented for testing:
-- WHERE u.status = 'active' --[TEST1.1] AND o.amount > 100 --[TEST1.2]

Key features shown:

TEST1: Combined parent test (both conditions AND'd)
TEST1_1, TEST1_2: Individual condition tests
JOIN1: JOIN condition with N/A for unmatched rows
Dual-comment pattern: --[TEST1] markers link results to original conditions
Auto-formatted: Output uses your workspace formatter settings

Test Generator Settings

Setting	Default	Description
`testGenerator.convertInnerJoins`	`true`	Convert INNER → LEFT JOIN for non-destructive testing
`testGenerator.addJoinNullHandling`	`true`	Add N/A for unmatched JOIN rows
`testGenerator.showMetadata`	`true`	Show notification with test counts

Supported Features

Feature	Test ID Pattern	Example
WHERE conditions	TEST1, TEST2, etc.	`WHERE status = 'active'` → TEST1
JOIN ON conditions	JOIN1, JOIN2, etc.	`ON u.id = o.user_id` → JOIN1
CTE WHERE conditions	CTE1, CTE2, etc.	`WITH cte AS (... WHERE x > 10)` → CTE1
HAVING conditions	HAVING1, HAVING2, etc.	`HAVING COUNT(*) > 5` → HAVING1
Subquery WHERE	SUB1, SUB2, etc.	`FROM (SELECT ... WHERE x > 10) subq` → SUB1
AND decomposition	Parent + children	`WHERE a AND b` → TEST1, TEST1_1, TEST1_2
BETWEEN decomposition	Parent + >= / <=	`BETWEEN 100 AND 500` → TEST1, TEST1_1 (>=), TEST1_2 (<=)
IN list decomposition	Parent + each value	`IN ('a', 'b')` → TEST1, TEST1_1, TEST1_2
CTE internal JOINs	CTE_JOIN1, CTE_JOIN2	`WITH cte AS (... INNER JOIN ...)` → CTE_JOIN1
Subquery JOINs	SUB_JOIN1	`FROM (SELECT ... INNER JOIN ...) subq` → SUB_JOIN1
Complex ON clauses	JOIN1	`ON LOWER(a.email) = LOWER(b.email)` → JOIN1
Multi-statement SQL	S1_TEST1, S2_TEST1	Multiple SELECTs → prefixed test IDs

Note: Output is automatically formatted using your workspace formatter settings (keywordCase, indent, leadingCommas, etc.).

⚙️ Configuration

Configure the formatter in VS Code settings (settings.json):

{
  "databricksSqlFormatter.keywordCase": "lower",
  "databricksSqlFormatter.functionCase": "lower",
  "databricksSqlFormatter.indent": 4,
  "databricksSqlFormatter.leadingCommas": true,
  "databricksSqlFormatter.dialect": "spark",
  "databricksSqlFormatter.aliasAlignmentScope": "file",
  "databricksSqlFormatter.aliasMinGap": 8,
  "databricksSqlFormatter.aliasMaxColumnCap": 120,
  "databricksSqlFormatter.trimTrailingWhitespace": true
}

📖 Complete Settings Guide - Detailed documentation for all 16 configuration options with examples, recommendations, and configuration recipes.

Quick Settings Access

Command Palette: Cmd+Shift+P → "KF.SQL: Open Settings"
Settings UI: Cmd+, → Search "databricks"
Documentation: See VSCode Settings Guide

Available Settings

Core Formatting

Setting	Type	Default	Description
`keywordCase`	`"lower"` \| `"upper"` \| `"preserve"`	`"lower"`	SQL keyword letter case
`functionCase`	`"lower"` \| `"upper"` \| `"preserve"`	`"lower"`	Function name letter case
`indent`	`number`	`4`	Spaces per indent level (2-8)
`leadingCommas`	`boolean`	`true`	Use leading commas in SELECT lists
`dialect`	`"spark"` \| `"hive"` \| `"ansi"`	`"spark"`	SQL dialect (for future use)
`trimTrailingWhitespace`	`boolean`	`true`	Remove trailing whitespace from lines

Alias Alignment

Setting	Type	Default	Description
`aliasAlignmentScope`	`"file"` \| `"statement"` \| `"none"`	`"statement"`	Scope for alias and comment alignment. `"statement"` aligns within each SQL statement, `"file"` aligns across entire file
`aliasMinGap`	`number`	`8`	Minimum spaces between expression and `AS` (1-20)
`aliasMaxColumnCap`	`number`	`120`	Maximum column for alias alignment (40-200)
`aliasColumnCapMode`	`"fixed"` \| `"adaptive"`	`"adaptive"`	Column cap behavior. `"adaptive"` uses 80 or 120 based on expression length, `"fixed"` always uses `aliasMaxColumnCap`
`aliasAdaptiveThreshold`	`number`	`80`	When `aliasColumnCapMode: "adaptive"`, expressions >= this length use 120 column cap, shorter use 80
`addExplicitAs`	`boolean`	`true`	Add explicit `AS` keyword to all aliases

Auto-Generate Aliases

Setting	Type	Default	Description
`autoGenerateAliases`	`boolean`	`true`	Master switch for auto-generating aliases (must be `true` to enable sub-options)
`autoGenerateTableAliases`	`boolean`	`true`	Auto-generate table aliases from table names (e.g., `customer_orders` → `co`)
`autoGenerateSelectAliases`	`boolean`	`false`	Auto-generate SELECT expression aliases for complex expressions

📐 Formatting Examples

Before Formatting

WITH active_users AS (SELECT user_id,user_name,account_type FROM users WHERE status='active' AND created_date>='2024-01-01'),order_stats AS(SELECT user_id,COUNT(*) as order_count,SUM(amount) as total_spent,AVG(amount) as avg_order FROM orders WHERE order_date>='2024-01-01' GROUP BY user_id)SELECT u.user_id,u.user_name as customer_name,u.account_type,CASE WHEN o.order_count>10 THEN 'high' WHEN o.order_count>5 THEN 'medium' ELSE 'low' END as engagement_level,o.total_spent,o.avg_order FROM active_users u INNER JOIN order_stats o ON u.user_id=o.user_id WHERE o.total_spent>100 ORDER BY o.total_spent DESC

After Formatting

Configuration Used: This example uses aliasAlignmentScope: 'statement' with all formatting features enabled:
{
  "keywordCase": "lower",
  "functionCase": "lower",
  "indent": 4,
  "leadingCommas": true,
  "aliasAlignmentScope": "statement",
  "aliasMinGap": 8,
  "aliasMaxColumnCap": 120,
  "addExplicitAs": true,
  "trimTrailingWhitespace": true
}

with active_users as (
    select
         user_id
        ,user_name
        ,account_type
    from users
    where status = 'active'
        and created_date >= '2024-01-01'
)
,order_stats as (
    select
         user_id
        ,count(*)           as order_count
        ,sum(amount)        as total_spent
        ,avg(amount)        as avg_order
    from orders
    where order_date >= '2024-01-01'
    group by user_id
)
select
     u.user_id
    ,u.user_name            as customer_name
    ,u.account_type
    ,case
        when o.order_count > 10 then 'high'
        when o.order_count > 5 then 'medium'
        else 'low'
    end                     as engagement_level
    ,o.total_spent
    ,o.avg_order
from active_users           as u
inner join order_stats      as o
    on u.user_id = o.user_id
where o.total_spent > 100
order by o.total_spent desc

Key Features Demonstrated:

📍 Leading commas for SELECT columns (leadingCommas: true)
📏 Smart alias alignment - all AS keywords align within statement scope (aliasAlignmentScope: 'statement', aliasMinGap: 8)
🔤 Consistent keyword casing (keywordCase: 'lower', functionCase: 'lower')
📦 Perfect CTE indentation with proper WHERE clause alignment (indent: 4)
🔗 Multi-word keywords kept together (INNER JOIN, ORDER BY)
⚡ Clean, readable structure with explicit AS keywords (addExplicitAs: true)

Understanding `aliasAlignmentScope`

The aliasAlignmentScope option controls how AS keywords align across your SQL file. Here's a visual comparison:

With scope: 'statement' (each statement aligns independently):

select
     customer_id
    ,very_long_customer_name        as name
    ,account_balance                as balance
from customers;
select
     id          as user_id
    ,name        as user_name
from users

Notice how:

Statement 1: AS keywords align at column 36 (due to long column name)
Statement 2: AS keywords align at column 17 (shorter expressions)
Each statement uses optimal spacing for its own expressions

With scope: 'file' (all statements align together):

select
     customer_id
    ,very_long_customer_name        as name
    ,account_balance                as balance
from customers;
select
     id                             as user_id
    ,name                           as user_name
from users

Notice how:

All AS keywords align at column 36 across both statements
Statement 2 has extra spacing to match statement 1's longest expression
Provides consistent alignment across the entire file

Recommendation: Use 'statement' for optimal readability in files with multiple queries, or 'file' when you want uniform alignment across all statements.

🏗️ Architecture

The formatter uses a token-based architecture with three stages:

Tokenize - SQL → Tokens (preserves 100% of original text)
Transform - Apply transformations:
- Keyword case normalization (configurable: lower, upper, or preserve)
- Line breaks insertion
- Leading commas conversion
- Context-aware indentation (CTEs, subqueries, CASE statements)
- Explicit AS keyword insertion
- Unified alias alignment (column, table, and subquery aliases align together)
- Comment separator adjustment (80/120 dashes)
- Trailing whitespace removal
Generate - Tokens → Formatted SQL

📊 Test Results

140/140 tests passing ✅

Unit Tests: Tokenizer (4/4), Generator (2/2), Integration (6/6)
Golden Files: 5 files - Idempotence tests + Similarity metrics (informational)
Test Cases: 8 files - Idempotent formatting verification
Edge Cases: 7 files covering complex SQL patterns
Minified Path: Unformatted → Formatted idempotence testing
Options Tests: 70+ tests for all configuration options
VSCode Integration: Extension and settings tests

📁 Project Structure

kf.sql.formatter/
├── package.json                   # Extension manifest
├── tsconfig.json                  # TypeScript config
├── src/                          # Source code
│   ├── extension.ts              # VS Code extension entry point
│   ├── formatter-custom.ts       # Main formatter
│   ├── tokenizer/               # SQL tokenizer
│   │   ├── tokenize.ts
│   │   └── token-types.ts
│   ├── transformers/            # Token transformers
│   │   ├── keywords.ts
│   │   ├── line-breaks.ts
│   │   ├── leading-commas.ts
│   │   ├── indentation.ts
│   │   ├── add-explicit-as.ts
│   │   ├── alias-alignment.ts
│   │   ├── comment-separators.ts
│   │   └── trim-trailing-whitespace.ts
│   └── generator/               # SQL generator
│       └── generate-sql.ts
├── dist/                        # Compiled JavaScript (gitignored)
├── tests/                       # Test suite
│   └── test-suite.js           # Comprehensive tests
├── examples/                    # Test SQL files
│   ├── manual/                 # Manual test cases (7 files)
│   ├── golden-references/      # Reference SQL (5 files)
│   └── edge-cases/            # Complex edge cases (7 files)
└── docs/                       # Documentation
    ├── development/           # Developer guides
    └── releases/             # Release notes

🔮 Future Features

The following features are planned but not yet implemented:

Semantic Auto-Aliasing - Automatically add meaningful aliases to expressions
Advanced Comment Handling - Comment alignment and wrapping
More Dialects - Full support for Hive and ANSI SQL variations

🤝 Contributing

Contributions are welcome! Please see the development documentation for:

Architecture overview
Testing guidelines
Development workflow

Development Workflow

Before Pushing

The repository includes a pre-push hook that automatically validates:

✅ TypeScript compilation (npm run compile)
✅ Test suite (140 tests via test-suite.js)
✅ Markdown linting (npm run lint)

If validation fails, you'll see helpful error messages with fix instructions.

The hook runs automatically before every git push and takes ~5-10 seconds.

Manual Validation

To check your changes before pushing:

npm run verify

This runs the same checks as the pre-push hook.

Fixing Markdown Linting Errors

Most markdown errors can be auto-fixed:

npm run lint:fix

Some errors (like MD040 - missing code fence languages) require manual fixes.

Bypassing Pre-Push Hook

⚠️ Use only in emergencies:

git push --no-verify

Warning: This will likely cause CI failures on GitHub Actions.

📝 License

MIT License - see LICENSE file for details

🔗 Links

📈 Version History

v0.4.3 (2026-02-12) - Fix alias line breaks, inline comment positioning, and GROUP BY loss with IN-lists
v0.4.0 (2025-11-25) - Phase 3: CTE/subquery JOINs, complex ON clauses, multi-statement support
v0.3.0 (2025-11-25) - Dead code cleanup, hidden options exposure, golden file improvements
v0.2.0 (2025-11-24) - Phase 2: HAVING, CTE WHERE, subquery WHERE, BETWEEN/IN/CASE decomposition
v0.1.0 (2025-11-24) - Test Generator MVP: WHERE/JOIN extraction, INNER→LEFT conversion
v0.0.9 (2025-11-13) - Table alias alignment fix, IN list indentation, examples reorganization (issue #18)
v0.0.8 (2025-11-12) - Width-aware formatting, comment separators, CTE indentation fixes, 6/7 golden files at 100%
v0.0.7 (2025-11-11) - Alias alignment improvements, CASE keyword casing
v0.0.5 (2025-11-01) - Custom token-based formatter, 100% idempotent, 31 tests passing
v0.0.4 (2025-10-12) - Security fixes, marketplace publishing
v0.0.3 (2025-10-02) - Comment handling fixes, idempotence improvements
v0.0.2 (2025-10-02) - Initial multi-pass formatter
v0.0.1 (2025-10-02) - First release

Made with ❤️ for the Databricks community

KF SQL Formatter (Databricks/Spark)

Kyle Fring

KF SQL Formatter (Databricks/Spark)

✨ Features

📦 Installation

Option 1: VS Code Marketplace (Recommended)

Option 2: Direct Download

Option 3: Build from Source

🚀 Usage

🧪 SQL Test Generator

How to Use

What It Does

Example

Test Generator Settings

Supported Features

⚙️ Configuration

Quick Settings Access

Available Settings

Core Formatting

Alias Alignment

Auto-Generate Aliases

📐 Formatting Examples

Before Formatting

After Formatting

Understanding aliasAlignmentScope

🏗️ Architecture

📊 Test Results

📁 Project Structure

🔮 Future Features

🤝 Contributing

Development Workflow

Before Pushing

Manual Validation

Fixing Markdown Linting Errors

Bypassing Pre-Push Hook

📝 License

🔗 Links

📈 Version History

Understanding `aliasAlignmentScope`