
A custom token-based SQL formatter for VS Code that formats Databricks/Spark SQL with 100% idempotent output.
✨ Features
- 🎯 100% Idempotent - Format twice = same result (140/140 tests passing)
- 📏 Width-Aware Alignment - Automatic 80/120 character width preference per statement
- 📐 Content-Aware Separators - Comment separators snap to 80 or 120 dashes
- 📍 Leading Commas - SELECT list items formatted with leading commas (
,col)
- 🔤 Smart Keyword Casing - Configurable keyword case (lower, upper, or preserve)
- 🔗 Multi-word JOIN Handling - Keeps
INNER JOIN, LEFT JOIN, etc. together
- 📦 CTE Indentation - Perfect context tracking for WHERE/GROUP BY in CTEs
- ✨ Trailing Whitespace Removal - Clean, consistent output
- ⚡ Spark SQL Support - Arrays, maps, structs, named_struct, explode, lateral view
- 🧪 SQL Test Generator - Generate test queries to debug WHERE/JOIN filtering logic
📦 Installation
Option 1: VS Code Marketplace (Recommended)
Search for "KF SQL Formatter" in VS Code Extensions or install from:
VS Code Marketplace - KF SQL Formatter
Option 2: Direct Download
⬇️ Download Latest Release (v0.4.0)
After downloading:
- Open VS Code
- Press
Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (Mac)
- Type "Extensions: Install from VSIX..."
- Select the downloaded
.vsix file
Option 3: Build from Source
git clone https://github.com/k-f-/kf.sql.formatter.git
cd kf.sql.formatter
npm install
npm run compile
npm run package
🚀 Usage
- Open any
.sql, .hql, or .spark.sql file
- Run Format Document (
Shift+Alt+F on Windows/Linux, Shift+Option+F on Mac)
- Your SQL will be formatted with consistent style!
🧪 SQL Test Generator
Generate test queries to debug WHERE and JOIN filtering logic. Useful for identifying why data disappears between pipeline layers.
How to Use
- Open a SQL file with a SELECT query
- Use one of these methods:
- Keyboard:
Ctrl+Shift+T (Windows/Linux) or Cmd+Shift+T (Mac)
- Right-click: Select
KF.SQL: Generate SQL Tests from context menu
- Command Palette:
Ctrl+Shift+P → KF.SQL: Generate SQL Tests
- A new tab opens with your test query
What It Does
- Extracts WHERE conditions → Creates TEST columns showing PASS/FAIL for each filter
- Extracts CTE conditions → Creates CTE columns (CTE1, CTE2) for conditions inside WITH clauses
- Extracts JOIN conditions → Creates JOIN columns showing PASS/FAIL/N/A for each join
- Converts INNER JOIN → LEFT JOIN → Preserves unmatched rows for debugging
- Comments out WHERE clause → Returns ALL rows so you can see what would be filtered
- Adds dual-comment pattern →
--[TEST1] markers link results to original conditions
Example
Input:
SELECT * FROM users u
INNER JOIN orders o ON u.id = o.user_id
WHERE u.status = 'active' AND o.amount > 100
Output (automatically formatted using your workspace settings):
select
*
-- [Test Indicators] ----------
,CASE
WHEN (u.status = 'active' and o.amount > 100) THEN 'Y'
ELSE 'N'
END as TEST1 --[TEST1]
,CASE
WHEN u.status = 'active' THEN 'Y'
ELSE 'N'
END as TEST1_1 --[TEST1.1]
,CASE
WHEN o.amount > 100 THEN 'Y'
ELSE 'N'
END as TEST1_2 --[TEST1.2]
,CASE
WHEN o.user_id is null THEN 'N/A'
WHEN u.id = o.user_id THEN 'Y'
ELSE 'N'
END as JOIN1 --[JOIN1]
from users as u
left join orders as o
on u.id = o.user_id --[JOIN1]
-- WHERE clause commented for testing:
-- WHERE u.status = 'active' --[TEST1.1] AND o.amount > 100 --[TEST1.2]
Key features shown:
- TEST1: Combined parent test (both conditions AND'd)
- TEST1_1, TEST1_2: Individual condition tests
- JOIN1: JOIN condition with N/A for unmatched rows
- Dual-comment pattern:
--[TEST1] markers link results to original conditions
- Auto-formatted: Output uses your workspace formatter settings
Test Generator Settings
| Setting |
Default |
Description |
testGenerator.convertInnerJoins |
true |
Convert INNER → LEFT JOIN for non-destructive testing |
testGenerator.addJoinNullHandling |
true |
Add N/A for unmatched JOIN rows |
testGenerator.showMetadata |
true |
Show notification with test counts |
Supported Features
| Feature |
Test ID Pattern |
Example |
| WHERE conditions |
TEST1, TEST2, etc. |
WHERE status = 'active' → TEST1 |
| JOIN ON conditions |
JOIN1, JOIN2, etc. |
ON u.id = o.user_id → JOIN1 |
| CTE WHERE conditions |
CTE1, CTE2, etc. |
WITH cte AS (... WHERE x > 10) → CTE1 |
| HAVING conditions |
HAVING1, HAVING2, etc. |
HAVING COUNT(*) > 5 → HAVING1 |
| Subquery WHERE |
SUB1, SUB2, etc. |
FROM (SELECT ... WHERE x > 10) subq → SUB1 |
| AND decomposition |
Parent + children |
WHERE a AND b → TEST1, TEST1_1, TEST1_2 |
| BETWEEN decomposition |
Parent + >= / <= |
BETWEEN 100 AND 500 → TEST1, TEST1_1 (>=), TEST1_2 (<=) |
| IN list decomposition |
Parent + each value |
IN ('a', 'b') → TEST1, TEST1_1, TEST1_2 |
Note: Output is automatically formatted using your workspace formatter settings (keywordCase, indent, leadingCommas, etc.).
⚙️ Configuration
Configure the formatter in VS Code settings (settings.json):
{
"databricksSqlFormatter.keywordCase": "lower",
"databricksSqlFormatter.functionCase": "lower",
"databricksSqlFormatter.indent": 4,
"databricksSqlFormatter.leadingCommas": true,
"databricksSqlFormatter.dialect": "spark",
"databricksSqlFormatter.aliasAlignmentScope": "file",
"databricksSqlFormatter.aliasMinGap": 8,
"databricksSqlFormatter.aliasMaxColumnCap": 120,
"databricksSqlFormatter.trimTrailingWhitespace": true
}
📖 Complete Settings Guide - Detailed documentation for all 16 configuration options with examples, recommendations, and configuration recipes.
Quick Settings Access
- Command Palette:
Cmd+Shift+P → "KF.SQL: Open Settings"
- Settings UI:
Cmd+, → Search "databricks"
- Documentation: See VSCode Settings Guide
Available Settings
| Setting |
Type |
Default |
Description |
keywordCase |
"lower" | "upper" | "preserve" |
"lower" |
SQL keyword letter case |
functionCase |
"lower" | "upper" | "preserve" |
"lower" |
Function name letter case |
indent |
number |
4 |
Spaces per indent level (2-8) |
leadingCommas |
boolean |
true |
Use leading commas in SELECT lists |
dialect |
"spark" | "hive" | "ansi" |
"spark" |
SQL dialect (for future use) |
trimTrailingWhitespace |
boolean |
true |
Remove trailing whitespace from lines |
Alias Alignment
| Setting |
Type |
Default |
Description |
aliasAlignmentScope |
"file" | "statement" | "none" |
"statement" |
Scope for alias and comment alignment. "statement" aligns within each SQL statement, "file" aligns across entire file |
aliasMinGap |
number |
8 |
Minimum spaces between expression and AS (1-20) |
aliasMaxColumnCap |
number |
120 |
Maximum column for alias alignment (40-200) |
aliasColumnCapMode |
"fixed" | "adaptive" |
"adaptive" |
Column cap behavior. "adaptive" uses 80 or 120 based on expression length, "fixed" always uses aliasMaxColumnCap |
aliasAdaptiveThreshold |
number |
75 |
When aliasColumnCapMode: "adaptive", expressions >= this length use 120 column cap, shorter use 80 |
addExplicitAs |
boolean |
true |
Add explicit AS keyword to all aliases |
Auto-Generate Aliases
| Setting |
Type |
Default |
Description |
autoGenerateAliases |
boolean |
true |
Master switch for auto-generating aliases (must be true to enable sub-options) |
autoGenerateTableAliases |
boolean |
true |
Auto-generate table aliases from table names (e.g., customer_orders → co) |
autoGenerateSelectAliases |
boolean |
false |
Auto-generate SELECT expression aliases for complex expressions |
WITH active_users AS (SELECT user_id,user_name,account_type FROM users WHERE status='active' AND created_date>='2024-01-01'),order_stats AS(SELECT user_id,COUNT(*) as order_count,SUM(amount) as total_spent,AVG(amount) as avg_order FROM orders WHERE order_date>='2024-01-01' GROUP BY user_id)SELECT u.user_id,u.user_name as customer_name,u.account_type,CASE WHEN o.order_count>10 THEN 'high' WHEN o.order_count>5 THEN 'medium' ELSE 'low' END as engagement_level,o.total_spent,o.avg_order FROM active_users u INNER JOIN order_stats o ON u.user_id=o.user_id WHERE o.total_spent>100 ORDER BY o.total_spent DESC
Configuration Used: This example uses aliasAlignmentScope: 'statement' with all formatting features enabled:
{
"keywordCase": "lower",
"functionCase": "lower",
"indent": 4,
"leadingCommas": true,
"aliasAlignmentScope": "statement",
"aliasMinGap": 8,
"aliasMaxColumnCap": 120,
"addExplicitAs": true,
"trimTrailingWhitespace": true
}
with active_users as (
select
user_id
,user_name
,account_type
from users
where status = 'active'
and created_date >= '2024-01-01'
)
,order_stats as (
select
user_id
,count(*) as order_count
,sum(amount) as total_spent
,avg(amount) as avg_order
from orders
where order_date >= '2024-01-01'
group by user_id
)
select
u.user_id
,u.user_name as customer_name
,u.account_type
,case
when o.order_count > 10 then 'high'
when o.order_count > 5 then 'medium'
else 'low'
end as engagement_level
,o.total_spent
,o.avg_order
from active_users as u
inner join order_stats as o
on u.user_id = o.user_id
where o.total_spent > 100
order by o.total_spent desc
Key Features Demonstrated:
- 📍 Leading commas for SELECT columns (
leadingCommas: true)
- 📏 Smart alias alignment - all AS keywords align within statement scope (
aliasAlignmentScope: 'statement', aliasMinGap: 8)
- 🔤 Consistent keyword casing (
keywordCase: 'lower', functionCase: 'lower')
- 📦 Perfect CTE indentation with proper WHERE clause alignment (
indent: 4)
- 🔗 Multi-word keywords kept together (INNER JOIN, ORDER BY)
- ⚡ Clean, readable structure with explicit AS keywords (
addExplicitAs: true)
Understanding aliasAlignmentScope
The aliasAlignmentScope option controls how AS keywords align across your SQL file. Here's a visual comparison:
With scope: 'statement' (each statement aligns independently):
select
customer_id
,very_long_customer_name as name
,account_balance as balance
from customers;
select
id as user_id
,name as user_name
from users
Notice how:
- Statement 1: AS keywords align at column 36 (due to long column name)
- Statement 2: AS keywords align at column 17 (shorter expressions)
- Each statement uses optimal spacing for its own expressions
With scope: 'file' (all statements align together):
select
customer_id
,very_long_customer_name as name
,account_balance as balance
from customers;
select
id as user_id
,name as user_name
from users
Notice how:
- All AS keywords align at column 36 across both statements
- Statement 2 has extra spacing to match statement 1's longest expression
- Provides consistent alignment across the entire file
Recommendation: Use 'statement' for optimal readability in files with multiple queries, or 'file' when you want uniform alignment across all statements.
🏗️ Architecture
The formatter uses a token-based architecture with three stages:
- Tokenize - SQL → Tokens (preserves 100% of original text)
- Transform - Apply transformations:
- Keyword case normalization (configurable: lower, upper, or preserve)
- Line breaks insertion
- Leading commas conversion
- Context-aware indentation (CTEs, subqueries, CASE statements)
- Explicit AS keyword insertion
- Unified alias alignment (column, table, and subquery aliases align together)
- Comment separator adjustment (80/120 dashes)
- Trailing whitespace removal
- Generate - Tokens → Formatted SQL
📊 Test Results
140/140 tests passing ✅
- Unit Tests: Tokenizer (4/4), Generator (2/2), Integration (6/6)
- Golden Files: 5 files - Idempotence tests + Similarity metrics (informational)
- Test Cases: 8 files - Idempotent formatting verification
- Edge Cases: 7 files covering complex SQL patterns
- Minified Path: Unformatted → Formatted idempotence testing
- Options Tests: 70+ tests for all configuration options
- VSCode Integration: Extension and settings tests
📁 Project Structure
kf.sql.formatter/
├── package.json # Extension manifest
├── tsconfig.json # TypeScript config
├── src/ # Source code
│ ├── extension.ts # VS Code extension entry point
│ ├── formatter-custom.ts # Main formatter
│ ├── tokenizer/ # SQL tokenizer
│ │ ├── tokenize.ts
│ │ └── token-types.ts
│ ├── transformers/ # Token transformers
│ │ ├── keywords.ts
│ │ ├── line-breaks.ts
│ │ ├── leading-commas.ts
│ │ ├── indentation.ts
│ │ ├── add-explicit-as.ts
│ │ ├── alias-alignment.ts
│ │ ├── comment-separators.ts
│ │ └── trim-trailing-whitespace.ts
│ └── generator/ # SQL generator
│ └── generate-sql.ts
├── dist/ # Compiled JavaScript (gitignored)
├── tests/ # Test suite
│ └── test-suite.js # Comprehensive tests
├── examples/ # Test SQL files
│ ├── manual/ # Manual test cases (7 files)
│ ├── golden-references/ # Reference SQL (5 files)
│ └── edge-cases/ # Complex edge cases (7 files)
└── docs/ # Documentation
├── development/ # Developer guides
└── releases/ # Release notes
🔮 Future Features
The following features are planned but not yet implemented:
- Semantic Auto-Aliasing - Automatically add meaningful aliases to expressions
- Advanced Comment Handling - Comment alignment and wrapping
- More Dialects - Full support for Hive and ANSI SQL variations
🤝 Contributing
Contributions are welcome! Please see the development documentation for:
- Architecture overview
- Testing guidelines
- Development workflow
Development Workflow
Before Pushing
The repository includes a pre-push hook that automatically validates:
- ✅ TypeScript compilation (
npm run compile)
- ✅ Test suite (24 tests via
test-suite.js)
- ✅ Markdown linting (
npm run lint)
If validation fails, you'll see helpful error messages with fix instructions.
The hook runs automatically before every git push and takes ~5-10 seconds.
Manual Validation
To check your changes before pushing:
npm run verify
This runs the same checks as the pre-push hook.
Fixing Markdown Linting Errors
Most markdown errors can be auto-fixed:
npm run lint:fix
Some errors (like MD040 - missing code fence languages) require manual fixes.
Bypassing Pre-Push Hook
⚠️ Use only in emergencies:
git push --no-verify
Warning: This will likely cause CI failures on GitHub Actions.
📝 License
MIT License - see LICENSE file for details
🔗 Links
📈 Version History
- v0.0.9 (2025-11-13) - Table alias alignment fix, IN list indentation, examples reorganization (issue #18)
- v0.0.8 (2025-11-12) - Width-aware formatting, comment separators, CTE indentation fixes, 6/7 golden files at 100%
- v0.0.7 (2025-11-11) - Alias alignment improvements, CASE keyword casing
- v0.0.5 (2025-11-01) - Custom token-based formatter, 100% idempotent, 31 tests passing
- v0.0.4 (2025-10-12) - Security fixes, marketplace publishing
- v0.0.3 (2025-10-02) - Comment handling fixes, idempotence improvements
- v0.0.2 (2025-10-02) - Initial multi-pass formatter
- v0.0.1 (2025-10-02) - First release
Made with ❤️ for the Databricks community