
A custom token-based SQL formatter for VS Code that formats Databricks/Spark SQL with 100% idempotent output.
✨ Features
- 🎯 100% Idempotent - Format twice = same result (381/381 tests passing)
- 📏 Width-Aware Alignment - Automatic 80/120 character width preference per statement
- 📐 Content-Aware Separators - Comment separators snap to 80 or 120 dashes
- 📍 Leading Commas - SELECT list items formatted with leading commas (
,col)
- 🔤 Smart Keyword Casing - Configurable keyword case (lower, upper, or preserve)
- 🔗 Multi-word JOIN Handling - Keeps
INNER JOIN, LEFT JOIN, etc. together
- 📦 CTE Indentation - Perfect context tracking for WHERE/GROUP BY in CTEs
- ✨ Trailing Whitespace Removal - Clean, consistent output
- ⚡ Spark SQL Support - Arrays, maps, structs, named_struct, explode, lateral view
- 🧪 SQL Test Generator - Generate test queries to debug WHERE/JOIN filtering logic
📦 Installation
Option 1: VS Code Marketplace (Recommended)
Search for "KF SQL Formatter" in VS Code Extensions or install from:
VS Code Marketplace - KF SQL Formatter
Option 2: Direct Download
⬇️ Download Latest Release (v0.5.1)
After downloading:
- Open VS Code
- Press
Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (Mac)
- Type "Extensions: Install from VSIX..."
- Select the downloaded
.vsix file
Option 3: Build from Source
git clone https://github.com/k-f-/kf.sql.formatter.git
cd kf.sql.formatter
npm install
npm run compile
npm run package
🚀 Usage
- Open any
.sql, .hql, or .spark.sql file
- Run Format Document (
Shift+Alt+F on Windows/Linux, Shift+Option+F on Mac)
- Your SQL will be formatted with consistent style!
🧪 SQL Test Generator
Generate test queries to debug WHERE and JOIN filtering logic. Useful for identifying why data disappears between pipeline layers.
How to Use
- Open a SQL file with a SELECT query
- Use one of these methods:
- Keyboard:
Ctrl+Shift+T (Windows/Linux) or Cmd+Shift+T (Mac)
- Right-click: Select
KF.SQL: Generate SQL Tests from context menu
- Command Palette:
Ctrl+Shift+P → KF.SQL: Generate SQL Tests
- A new tab opens with your test query
What It Does
- Extracts WHERE conditions → Creates TEST columns showing PASS/FAIL for each filter
- Extracts CTE conditions → Creates CTE columns (CTE1, CTE2) for conditions inside WITH clauses
- Extracts JOIN conditions → Creates JOIN columns showing PASS/FAIL/N/A for each join
- Converts INNER JOIN → LEFT JOIN → Preserves unmatched rows for debugging
- Comments out WHERE clause → Returns ALL rows so you can see what would be filtered
- Adds dual-comment pattern →
--[TEST1] markers link results to original conditions
Example
Input:
SELECT * FROM users u
INNER JOIN orders o ON u.id = o.user_id
WHERE u.status = 'active' AND o.amount > 100
Output (automatically formatted using your workspace settings):
select
*
-- [Test Indicators] ----------
,CASE
WHEN (u.status = 'active' and o.amount > 100) THEN 'Y'
ELSE 'N'
END as TEST1 --[TEST1]
,CASE
WHEN u.status = 'active' THEN 'Y'
ELSE 'N'
END as TEST1_1 --[TEST1.1]
,CASE
WHEN o.amount > 100 THEN 'Y'
ELSE 'N'
END as TEST1_2 --[TEST1.2]
,CASE
WHEN o.user_id is null THEN 'N/A'
WHEN u.id = o.user_id THEN 'Y'
ELSE 'N'
END as JOIN1 --[JOIN1]
from users as u
left join orders as o
on u.id = o.user_id --[JOIN1]
-- WHERE clause commented for testing:
-- WHERE u.status = 'active' --[TEST1.1] AND o.amount > 100 --[TEST1.2]
Key features shown:
- TEST1: Combined parent test (both conditions AND'd)
- TEST1_1, TEST1_2: Individual condition tests
- JOIN1: JOIN condition with N/A for unmatched rows
- Dual-comment pattern:
--[TEST1] markers link results to original conditions
- Auto-formatted: Output uses your workspace formatter settings
Test Generator Settings
| Setting |
Default |
Description |
testGenerator.convertInnerJoins |
true |
Convert INNER → LEFT JOIN for non-destructive testing |
testGenerator.addJoinNullHandling |
true |
Add N/A for unmatched JOIN rows |
testGenerator.showMetadata |
true |
Show notification with test counts |
Supported Features
| Feature |
Test ID Pattern |
Example |
| WHERE conditions |
TEST1, TEST2, etc. |
WHERE status = 'active' → TEST1 |
| JOIN ON conditions |
JOIN1, JOIN2, etc. |
ON u.id = o.user_id → JOIN1 |
| CTE WHERE conditions |
CTE1, CTE2, etc. |
WITH cte AS (... WHERE x > 10) → CTE1 |
| HAVING conditions |
HAVING1, HAVING2, etc. |
HAVING COUNT(*) > 5 → HAVING1 |
| Subquery WHERE |
SUB1, SUB2, etc. |
FROM (SELECT ... WHERE x > 10) subq → SUB1 |
| AND decomposition |
Parent + children |
WHERE a AND b → TEST1, TEST1_1, TEST1_2 |
| BETWEEN decomposition |
Parent + >= / <= |
BETWEEN 100 AND 500 → TEST1, TEST1_1 (>=), TEST1_2 (<=) |
| IN list decomposition |
Parent + each value |
IN ('a', 'b') → TEST1, TEST1_1, TEST1_2 |
| CTE internal JOINs |
CTE_JOIN1, CTE_JOIN2 |
WITH cte AS (... INNER JOIN ...) → CTE_JOIN1 |
| Subquery JOINs |
SUB_JOIN1 |
FROM (SELECT ... INNER JOIN ...) subq → SUB_JOIN1 |
| Complex ON clauses |
JOIN1 |
ON LOWER(a.email) = LOWER(b.email) → JOIN1 |
| Multi-statement SQL |
S1_TEST1, S2_TEST1 |
Multiple SELECTs → prefixed test IDs |
Note: Output is automatically formatted using your workspace formatter settings (keywordCase, indent, leadingCommas, etc.).
⚙️ Configuration
Configure the formatter in VS Code settings (settings.json):
{
"databricksSqlFormatter.keywordCase": "lower",
"databricksSqlFormatter.functionCase": "lower",
"databricksSqlFormatter.indent": 4,
"databricksSqlFormatter.leadingCommas": true,
"databricksSqlFormatter.dialect": "spark",
"databricksSqlFormatter.aliasAlignmentScope": "file",
"databricksSqlFormatter.aliasMinGap": 8,
"databricksSqlFormatter.aliasMaxColumnCap": 120,
"databricksSqlFormatter.trimTrailingWhitespace": true
}
Quick Settings Access
- Command Palette:
Cmd+Shift+P → "KF.SQL: Open Settings"
- Settings UI:
Cmd+, → Search "databricks"
Available Settings
| Setting |
Type |
Default |
Description |
keywordCase |
"lower" | "upper" | "preserve" |
"lower" |
SQL keyword letter case |
functionCase |
"lower" | "upper" | "preserve" |
"lower" |
Function name letter case |
indent |
number |
4 |
Spaces per indent level (2-8) |
leadingCommas |
boolean |
true |
Use leading commas in SELECT lists |
dialect |
"spark" | "hive" | "ansi" |
"spark" |
SQL dialect (for future use) |
trimTrailingWhitespace |
boolean |
true |
Remove trailing whitespace from lines |
Alias Alignment
| Setting |
Type |
Default |
Description |
aliasAlignmentScope |
"file" | "statement" | "none" |
"statement" |
Scope for alias and comment alignment. "statement" aligns within each SQL statement, "file" aligns across entire file |
aliasMinGap |
number |
8 |
Minimum spaces between expression and AS (1-20) |
aliasMaxColumnCap |
number |
120 |
Maximum column for alias alignment (40-200) |
aliasColumnCapMode |
"fixed" | "adaptive" |
"adaptive" |
Column cap behavior. "adaptive" uses 80 or 120 based on expression length, "fixed" always uses aliasMaxColumnCap |
aliasAdaptiveThreshold |
number |
80 |
When aliasColumnCapMode: "adaptive", expressions >= this length use 120 column cap, shorter use 80 |
addExplicitAs |
boolean |
true |
Add explicit AS keyword to all aliases |
Auto-Generate Aliases
| Setting |
Type |
Default |
Description |
autoGenerateAliases |
boolean |
true |
Master switch for auto-generating aliases (must be true to enable sub-options) |
autoGenerateTableAliases |
boolean |
true |
Auto-generate table aliases from table names (e.g., customer_orders → co) |
autoGenerateSelectAliases |
boolean |
false |
Auto-generate SELECT expression aliases for complex expressions |
WITH active_users AS (SELECT user_id,user_name,account_type FROM users WHERE status='active' AND created_date>='2024-01-01'),order_stats AS(SELECT user_id,COUNT(*) as order_count,SUM(amount) as total_spent,AVG(amount) as avg_order FROM orders WHERE order_date>='2024-01-01' GROUP BY user_id)SELECT u.user_id,u.user_name as customer_name,u.account_type,CASE WHEN o.order_count>10 THEN 'high' WHEN o.order_count>5 THEN 'medium' ELSE 'low' END as engagement_level,o.total_spent,o.avg_order FROM active_users u INNER JOIN order_stats o ON u.user_id=o.user_id WHERE o.total_spent>100 ORDER BY o.total_spent DESC
Configuration Used: This example uses aliasAlignmentScope: 'statement' with all formatting features enabled:
{
"keywordCase": "lower",
"functionCase": "lower",
"indent": 4,
"leadingCommas": true,
"aliasAlignmentScope": "statement",
"aliasMinGap": 8,
"aliasMaxColumnCap": 120,
"addExplicitAs": true,
"trimTrailingWhitespace": true
}
with active_users as (
select
user_id
,user_name
,account_type
from users
where status = 'active'
and created_date >= '2024-01-01'
)
,order_stats as (
select
user_id
,count(*) as order_count
,sum(amount) as total_spent
,avg(amount) as avg_order
from orders
where order_date >= '2024-01-01'
group by user_id
)
select
u.user_id
,u.user_name as customer_name
,u.account_type
,case
when o.order_count > 10 then 'high'
when o.order_count > 5 then 'medium'
else 'low'
end as engagement_level
,o.total_spent
,o.avg_order
from active_users as u
inner join order_stats as o
on u.user_id = o.user_id
where o.total_spent > 100
order by o.total_spent desc
Key Features Demonstrated:
- 📍 Leading commas for SELECT columns (
leadingCommas: true)
- 📏 Smart alias alignment - all AS keywords align within statement scope (
aliasAlignmentScope: 'statement', aliasMinGap: 8)
- 🔤 Consistent keyword casing (
keywordCase: 'lower', functionCase: 'lower')
- 📦 Perfect CTE indentation with proper WHERE clause alignment (
indent: 4)
- 🔗 Multi-word keywords kept together (INNER JOIN, ORDER BY)
- ⚡ Clean, readable structure with explicit AS keywords (
addExplicitAs: true)
Understanding aliasAlignmentScope
The aliasAlignmentScope option controls how AS keywords align across your SQL file. Here's a visual comparison:
With scope: 'statement' (each statement aligns independently):
select
customer_id
,very_long_customer_name as name
,account_balance as balance
from customers;
select
id as user_id
,name as user_name
from users
Notice how:
- Statement 1: AS keywords align at column 36 (due to long column name)
- Statement 2: AS keywords align at column 17 (shorter expressions)
- Each statement uses optimal spacing for its own expressions
With scope: 'file' (all statements align together):
select
customer_id
,very_long_customer_name as name
,account_balance as balance
from customers;
select
id as user_id
,name as user_name
from users
Notice how:
- All AS keywords align at column 36 across both statements
- Statement 2 has extra spacing to match statement 1's longest expression
- Provides consistent alignment across the entire file
Recommendation: Use 'statement' for optimal readability in files with multiple queries, or 'file' when you want uniform alignment across all statements.
🏗️ Architecture
The formatter uses a token-based architecture with three stages:
- Tokenize - SQL → Tokens (preserves 100% of original text)
- Transform - Apply transformations:
- Keyword case normalization (configurable: lower, upper, or preserve)
- Line breaks insertion
- Leading commas conversion
- Context-aware indentation (CTEs, subqueries, CASE statements)
- Explicit AS keyword insertion
- Unified alias alignment (column, table, and subquery aliases align together)
- Comment separator adjustment (80/120 dashes)
- Trailing whitespace removal
- Generate - Tokens → Formatted SQL
Split-Merge Pipeline (Experimental)
A new split-merge pipeline is available behind the useSplitMerge feature flag (default: false). This pipeline improves handling of complex multi-statement files by processing each statement independently before merging them back together.
📊 Test Results
381/381 tests passing ✅
- Unit Tests: Tokenizer, Generator
- Pipeline Tests: normalize, split, indent, merge, align, cosmetics, alias-generator, validation
- Options Tests: 70+ tests for all configuration options
- Golden Files: Idempotence tests + Similarity metrics (informational)
- Test Cases: Idempotent formatting verification
- Edge Cases: Complex SQL patterns
- Test Generator: AST condition extraction, comment injection, join conversion, named_struct, detail level
- Integration: End-to-end formatter tests
📁 Project Structure
kf.sql.formatter/
├── package.json # Extension manifest
├── tsconfig.json # TypeScript config
├── esbuild.js # Production bundler
├── vitest.config.ts # Test configuration
├── src/ # Source code
│ ├── extension.ts # VS Code extension entry point
│ ├── formatter-custom.ts # Main formatter orchestration
│ ├── tokenizer/ # SQL tokenizer
│ │ ├── tokenize.ts
│ │ └── token-types.ts
│ ├── transformers/ # Token transformers (15 files)
│ │ ├── keywords.ts
│ │ ├── function-case.ts
│ │ ├── operator-spacing.ts
│ │ ├── line-breaks.ts
│ │ ├── leading-commas.ts
│ │ ├── cleanup-whitespace.ts
│ │ ├── indentation.ts
│ │ ├── in-list-indentation.ts
│ │ ├── auto-generate-aliases.ts
│ │ ├── add-explicit-as.ts
│ │ ├── alias-alignment.ts
│ │ ├── comment-alignment.ts
│ │ ├── comment-separators.ts
│ │ ├── trim-trailing-whitespace.ts
│ │ └── symbol-table-builder.ts
│ ├── pipeline/ # Split-merge pipeline (experimental)
│ │ ├── normalize.ts
│ │ ├── split.ts
│ │ ├── indent.ts
│ │ ├── merge.ts
│ │ ├── align.ts
│ │ └── cosmetics.ts
│ ├── test-generator/ # SQL test query generator
│ │ ├── index.ts
│ │ ├── types.ts
│ │ ├── ast-condition-extractor.ts
│ │ ├── comment-injector.ts
│ │ ├── join-converter.ts
│ │ └── test-case-generator.ts
│ ├── utils/ # Utilities
│ │ └── alias-generator.ts
│ └── generator/ # SQL generator
│ └── generate-sql.ts
├── dist/ # Compiled bundle (gitignored)
│ └── extension.js # esbuild bundle (only VSIX file)
├── tests/ # Test suite (381 tests)
│ ├── unit/ # Tokenizer, generator tests
│ ├── pipeline/ # Pipeline stage tests
│ ├── options/ # 70+ configuration option tests
│ ├── golden/ # Idempotence + similarity tests
│ ├── test-generator/ # Test generator tests (149 tests)
│ ├── integration/ # End-to-end formatter tests
│ └── framework/ # Test framework utilities
└── examples/ # SQL examples
├── test-cases/ # Test case files (8 files)
├── golden-references/ # Reference SQL (12 files)
└── edge-cases/ # Complex SQL patterns (7 files)
🔗 Recommended Companions
For a complete SQL development workflow, we recommend using this formatter alongside:
- SQLFluff - The dialect-flexible SQL linter for modern data stacks.
While this extension handles the visual formatting and layout of your code, SQLFluff excels at linting and enforcing style rules. Using them together ensures your SQL is both beautiful and compliant with organizational standards.
🤝 Contributing
Contributions are welcome! Please see the guidelines below:
Development Workflow
Before Pushing
The repository includes a pre-push hook that automatically validates:
- ✅ TypeScript compilation (
npm run compile)
- ✅ Test suite (381 tests via vitest)
- ✅ Markdown linting (
npm run lint)
If validation fails, you'll see helpful error messages with fix instructions.
The hook runs automatically before every git push and takes ~5-10 seconds.
Manual Validation
To check your changes before pushing:
npm run verify
This runs the same checks as the pre-push hook.
Fixing Markdown Linting Errors
Most markdown errors can be auto-fixed:
npm run lint:fix
Some errors (like MD040 - missing code fence languages) require manual fixes.
Bypassing Pre-Push Hook
⚠️ Use only in emergencies:
git push --no-verify
Warning: This will likely cause CI failures on GitHub Actions.
📝 License
MIT License - see LICENSE file for details
🔗 Links
📈 Version History
For full version history, see CHANGELOG.md.
Made with ❤️ for the Databricks community