PDF Xtract
A Visual Studio Code extension that converts PDF files to TXT or JSON format — with built-in OCR for scanned and vector-graphics PDFs.
Features
- Convert PDF to TXT: Extract plain text content from PDF files
- Convert PDF to JSON: Extract PDF content with metadata in structured JSON format
- Built-in OCR: Automatically uses Tesseract OCR for PDFs without embedded text (scanned docs, "Print to PDF", etc.)
- Context Menu Integration: Right-click on any PDF file in the explorer to convert
- Command Palette: Access conversion commands via Command Palette (Ctrl+Shift+P)
- Progress Feedback: Visual progress indicator during conversion
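The OCR fallback described above implies some check for "no embedded text". The following is a minimal sketch of one plausible heuristic, not the extension's actual logic: the function name `needsOcr` and the `minCharsPerPage` threshold are assumptions made for illustration.

```typescript
// Hypothetical heuristic: treat a PDF as image-only when text extraction
// yields almost no visible characters per page.
// The default threshold of 20 characters per page is an assumption.
function needsOcr(
  extractedText: string,
  pageCount: number,
  minCharsPerPage: number = 20
): boolean {
  // A document with no pages has nothing to OCR.
  if (pageCount <= 0) {
    return false;
  }
  // Count only non-whitespace characters, since image-only PDFs often
  // still produce newlines or form feeds during extraction.
  const visibleChars = extractedText.replace(/\s+/g, "").length;
  return visibleChars / pageCount < minCharsPerPage;
}
```

A scanned three-page document that yields only whitespace would return `true` here, while a five-page document with a few hundred characters of embedded text would return `false`.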
Usage
Method 1: Context Menu
- Right-click on any .pdf file in the VS Code Explorer
- Select either:
  - PDF: Convert to TXT - Extracts plain text
  - PDF: Convert to JSON - Extracts text with metadata
Method 2: Command Palette
- Press Ctrl+Shift+P (or Cmd+Shift+P on Mac)
- Type "PDF" and select:
  - PDF: Convert to TXT
  - PDF: Convert to JSON
- Select the PDF file you want to convert
Output Formats
TXT Format
- Simple plain text extraction
- Preserves text content from all pages
- Saved as filename.txt in the same directory as the source PDF
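The naming rule above (same directory, same base name, new extension) can be sketched as a small helper. This is an illustration under stated assumptions: `outputPathFor` is a hypothetical name, and applying the same rule to JSON output (filename.json) is an assumption, since only the TXT naming is documented here.

```typescript
// Hypothetical helper: derive the output path for a converted PDF.
// Assumes the output sits next to the source and only the extension changes.
type OutputFormat = "txt" | "json";

function outputPathFor(pdfPath: string, format: OutputFormat): string {
  // Locate the last path separator (POSIX or Windows) so that dots in
  // directory names are never mistaken for the file extension.
  const lastSep = Math.max(pdfPath.lastIndexOf("/"), pdfPath.lastIndexOf("\\"));
  const dir = pdfPath.slice(0, lastSep + 1);
  const base = pdfPath.slice(lastSep + 1);
  // Strip a trailing ".pdf" (case-insensitive); other names are kept whole.
  const stem = base.replace(/\.pdf$/i, "");
  return `${dir}${stem}.${format}`;
}
```

For example, `outputPathFor("/home/me/docs/report.pdf", "txt")` yields `/home/me/docs/report.txt`.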
JSON Format
Structured output with metadata:
{
  "metadata": {
    "filename": "document.pdf",
    "convertedAt": "2026-02-26T...",
    "totalPages": 5,
    "info": {
      "Title": "Document Title",
      "Author": "Author Name",
      ...
    }
  },
  "content": {
    "text": "Full text content...",
    "pages": 5,
    "version": "1.7"
  }
}
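For consumers of the JSON output, the structure above can be described with TypeScript types. The sketch below also shows one way such an object might be assembled. The `ParsedPdf` interface and `buildJsonOutput` function are hypothetical, and the field names on `ParsedPdf` are assumptions loosely modeled on what a parser such as pdf-parse exposes; the extension's real mapping may differ.

```typescript
// Assumed shape of a parser result (field names are illustrative).
interface ParsedPdf {
  numpages: number;
  info: Record<string, unknown>;
  text: string;
  version: string;
}

// Shape of the JSON document shown in this README.
interface PdfJsonOutput {
  metadata: {
    filename: string;
    convertedAt: string; // ISO 8601 timestamp
    totalPages: number;
    info: Record<string, unknown>;
  };
  content: {
    text: string;
    pages: number;
    version: string;
  };
}

// Hypothetical builder: maps a parser result onto the output shape.
// `now` is injectable so the timestamp is deterministic in tests.
function buildJsonOutput(
  filename: string,
  parsed: ParsedPdf,
  now: Date = new Date()
): PdfJsonOutput {
  return {
    metadata: {
      filename,
      convertedAt: now.toISOString(),
      totalPages: parsed.numpages,
      info: parsed.info,
    },
    content: {
      text: parsed.text,
      pages: parsed.numpages,
      version: parsed.version,
    },
  };
}
```

Serializing the result with `JSON.stringify(output, null, 2)` produces a file in the layout shown above.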
Installation
From Source
- Clone or download this repository
- Copy the folder to your VS Code extensions directory:
  - Windows: %USERPROFILE%\.vscode\extensions
  - macOS/Linux: ~/.vscode/extensions
- Run npm install in the extension folder
- Restart VS Code
From VSIX Package (if available)
- Download the .vsix file
- In VS Code, go to Extensions view (Ctrl+Shift+X)
- Click the "..." menu at the top
- Select "Install from VSIX..."
- Choose the downloaded file
Development Setup
# Install dependencies
npm install
# Run the extension in development mode
# Press F5 in VS Code to open Extension Development Host
Requirements
- Visual Studio Code 1.80.0 or higher
- Node.js installed for dependency management
Dependencies
- pdf-parse: PDF parsing library
Known Limitations
- Complex PDF layouts may not preserve exact formatting in text output
- Scanned PDFs (images) rely on OCR, so text accuracy depends on scan quality
- Very large PDFs may take longer to process
Release Notes
1.0.0
- Initial release
- PDF to TXT conversion
- PDF to JSON conversion with metadata
- Context menu integration
Contributing
Feel free to submit issues and enhancement requests!
License
MIT License