PDF to Markdown — VS Code Extension

Convert PDF files into clean, structured Markdown inside VS Code. The extension extracts text, detects document structure (headings, paragraphs, lists, tables, code blocks), and produces a .md file — all processed locally with zero cloud dependencies.

Features

One-click conversion: Right-click any .pdf file in the Explorer → "Convert PDF to Markdown"
Command Palette: Ctrl+Shift+P → "PDF to Markdown: Convert PDF to Markdown"
Structure preservation: Headings, bold, italic, lists (bullet + ordered), links, code blocks
Table detection: Heuristic column-boundary analysis renders tables as pipe-delimited Markdown
Side-by-side preview: Automatically opens the VS Code Markdown preview next to the raw file
Configurable: Adjust layout preservation, table detection, line merging, and output folder
Graceful error handling: Clear notifications for corrupt, encrypted, empty, and inaccessible PDFs
Progress indicator: Per-page progress notification for large documents
100% local: No data leaves your machine — all processing happens in the extension host

Installation

From Marketplace (when published)

Open VS Code
Go to Extensions (Ctrl+Shift+X)
Search for "PDF to Markdown"
Click Install

From Source

git clone https://github.com/karthik-dasari/pdf-to-markdown.git
cd pdf-to-markdown
npm install
npm run build

Then press F5 to launch the Extension Development Host.

Usage

Right-click a .pdf file in the Explorer sidebar
Select "Convert PDF to Markdown"
The .md file is saved next to the original and opened with a live preview

Command Palette

Press Ctrl+Shift+P (or Cmd+Shift+P on macOS)
Type "PDF to Markdown: Convert PDF to Markdown"
Select a PDF file from the file picker
Conversion runs with a progress indicator

Configuration

All settings are under pdfToMarkdown.* in VS Code Settings:

Setting	Type	Default	Description
`pdfToMarkdown.preserveLayout`	boolean	`false`	Retain original spacing and line breaks from the PDF
`pdfToMarkdown.detectTables`	boolean	`true`	Enable heuristic table detection and Markdown table output
`pdfToMarkdown.mergeLines`	boolean	`true`	Merge broken lines into coherent paragraphs
`pdfToMarkdown.outputFolder`	string	`""`	Custom output folder (absolute or workspace-relative). Empty = same directory as source PDF

Example settings.json

{
  "pdfToMarkdown.preserveLayout": false,
  "pdfToMarkdown.detectTables": true,
  "pdfToMarkdown.mergeLines": true,
  "pdfToMarkdown.outputFolder": "converted"
}

Supported PDF Types

PDF Type	Support
Text-based PDFs	✅ Full support
Mixed text + images	✅ Text extracted, images ignored
Multi-page documents	✅ All pages processed
PDFs with tables	✅ Heuristic detection
Scanned/image-only PDFs	⚠️ Warning shown (no OCR)
Password-protected PDFs	❌ Warning shown
Corrupt/invalid files	❌ Error shown

Architecture

src/
├── extension.ts              # Entry point (activate/deactivate)
├── commands/
│   └── convert-command.ts    # VS Code command handler (only file with vscode imports)
├── services/
│   ├── pdf-extractor.ts      # pdfjs-dist text extraction
│   ├── text-normalizer.ts    # Line grouping, font detection, paragraph merging
│   ├── table-detector.ts     # Heuristic table structure detection
│   ├── markdown-transformer.ts  # Heading/list/code/table → Markdown
│   └── conversion-pipeline.ts   # Orchestrator: extract → normalize → detect → transform
├── models/
│   ├── types.ts              # All shared TypeScript interfaces
│   └── errors.ts             # PdfExtractionError with ErrorCode enum
└── utils/
    └── font-utils.ts         # Font name parsing (bold/italic/monospace)

Design principle: Only src/commands/ and src/extension.ts import vscode. All services are pure TypeScript — testable in isolation.

Development

Prerequisites

Node.js 18+
VS Code ^1.85.0

Setup

npm install

Build

npm run build            # Production bundle
npm run watch            # Watch mode with auto-rebuild
npm run compile          # TypeScript type-checking only

Test

npm run test:unit        # Vitest unit tests (fast, no VS Code needed)
npm run test:unit:watch  # Vitest watch mode
npm run test:integration # Integration tests in Extension Development Host
npm run test             # Both unit + integration

Lint

npm run lint

Package

npm run package          # Creates .vsix file

Debug

Open this project in VS Code
Press F5 → "Run Extension" to launch the Extension Development Host
Set breakpoints in src/ files
Use the "Extension Tests" launch configuration for debugging tests

Known Limitations

No OCR: Image-only (scanned) PDFs will produce empty output with a warning
No password support: Encrypted PDFs are not supported in this version
Table detection is heuristic: Complex tables with merged cells, nested tables, or irregular layouts may not be detected correctly
No image extraction: Images in PDFs are ignored; only text content is converted
Font-based heading detection: Headings are detected by font size + bold heuristics — documents with non-standard font hierarchies may produce incorrect heading levels
Single-file conversion: Batch conversion is not yet supported

Contributing

Fork the repository
Create a feature branch
Write tests first (TDD approach)
Implement the feature
Ensure all tests pass (npm test)
Submit a pull request

License

MIT

PDF to Markdown

Karthik Dasari