PDF/DOCX Reader for Cursor IDE
A small Python tool I built to read PDF and DOCX files directly in Cursor IDE. I was tired of not being able to feed documents into my AI workflows, so I wrote this to extract the text and metadata from PDFs and Word documents.
What it does
- Reads PDF files (using pdfplumber and PyPDF2 as backup)
- Reads DOCX files (using python-docx)
- Extracts metadata like title, author, creation date
- Outputs in JSON or plain text format
- Handles errors gracefully, with messages for missing libraries and unsupported file types
- Preserves document structure (pages/paragraphs)
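Here's a rough sketch of how a "main reader plus backup" arrangement like the one above can work. It is not the tool's actual code: the function names (extract_with_pdfplumber, extract_with_pypdf2, extract_pages) are made up for illustration, and the rule "fall back when the first reader fails or finds no text" is an assumption. Only the library calls themselves (pdfplumber.open, PyPDF2's PdfReader, extract_text) are real.

```python
from typing import Callable, List, Sequence


def extract_with_pdfplumber(path: str) -> List[str]:
    """Return one string per page using pdfplumber."""
    import pdfplumber  # imported lazily so a missing package triggers the fallback

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


def extract_with_pypdf2(path: str) -> List[str]:
    """Return one string per page using PyPDF2 (3.x API)."""
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def extract_pages(
    path: str,
    extractors: Sequence[Callable[[str], List[str]]] = (
        extract_with_pdfplumber,
        extract_with_pypdf2,
    ),
) -> List[str]:
    """Try each extractor in order; return the first result that contains text."""
    errors: List[str] = []
    last_pages = None
    for extractor in extractors:
        try:
            pages = extractor(path)
        except Exception as exc:  # missing library, corrupt file, ...
            errors.append(f"{extractor.__name__}: {exc}")
            continue
        if any(page.strip() for page in pages):
            return pages
        last_pages = pages  # readable but empty, e.g. a scanned PDF
    if last_pages is not None:
        return last_pages
    raise RuntimeError("No PDF reader succeeded: " + "; ".join(errors))
```

Passing the extractors in as a parameter keeps the fallback order easy to change and easy to test with stand-in functions.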
Setup
- Clone this repo
- Install the Python dependencies:
pip install -r requirements.txt
That's it! No complex setup needed.
How to use it
Just run it with a PDF or DOCX file:
# Read a PDF (defaults to JSON output)
python pdf_docx_reader.py document.pdf
# Read a DOCX file
python pdf_docx_reader.py document.docx
# Get plain text instead of JSON
python pdf_docx_reader.py document.pdf --output-format text
You can also pipe the output to files or use it in scripts:
# Save to file
python pdf_docx_reader.py document.pdf > output.json
# Process multiple files (plain text, one .txt per PDF)
for file in *.pdf; do
  python pdf_docx_reader.py "$file" --output-format text > "${file%.pdf}.txt"
done
Options
- file_path: The PDF or DOCX file to read (required)
- --output-format: Choose json (default) or text
- --help: Show help
- --version: Show version
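For anyone wiring up a similar command line, the options above map onto argparse roughly like this. It's a sketch, not the script's real parser: build_parser is a hypothetical name, the description text is mine, and the 0.0.0 version string is a placeholder because the real version isn't stated here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the options listed above (--help comes for free)."""
    parser = argparse.ArgumentParser(
        prog="pdf_docx_reader.py",
        description="Extract text and metadata from a PDF or DOCX file.",
    )
    parser.add_argument("file_path", help="The PDF or DOCX file to read")
    parser.add_argument(
        "--output-format",
        choices=["json", "text"],
        default="json",
        help="Output format (default: json)",
    )
    # Placeholder version string; the real one is an unknown here.
    parser.add_argument("--version", action="version", version="%(prog)s 0.0.0")
    return parser


args = build_parser().parse_args(["document.pdf", "--output-format", "text"])
print(args.file_path, args.output_format)  # prints "document.pdf text"
```

Using choices means a typo such as --output-format txt is rejected by argparse itself, before any file is opened.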
JSON Output (Default)
{
"file_path": "/path/to/document.pdf",
"file_type": "PDF",
"pages": [
{
"page_number": 1,
"text": "Page content here...",
"char_count": 150
}
],
"full_text": "Complete document text...",
"metadata": {
"title": "Document Title",
"author": "Author Name",
"creation_date": "2024-01-01",
"modification_date": "2024-01-02"
},
"page_count": 1
}
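If you save the JSON to a file, a few lines of Python are enough to pull out what you need. The sketch below assumes the output has exactly the fields shown in the sample above; summarize is a hypothetical helper, not part of the tool.

```python
import json


def summarize(raw_json: str) -> dict:
    """Reduce the reader's JSON output to a title, page count, and size."""
    data = json.loads(raw_json)
    pages = data.get("pages", [])
    return {
        "title": data.get("metadata", {}).get("title"),
        "page_count": data.get("page_count", len(pages)),
        # Fall back to measuring the text if char_count is missing.
        "total_chars": sum(
            page.get("char_count", len(page.get("text", ""))) for page in pages
        ),
    }


# Sample input shaped like the JSON output shown above.
sample = json.dumps(
    {
        "file_path": "/path/to/document.pdf",
        "file_type": "PDF",
        "pages": [
            {"page_number": 1, "text": "Page content here...", "char_count": 150}
        ],
        "full_text": "Complete document text...",
        "metadata": {"title": "Document Title", "author": "Author Name"},
        "page_count": 1,
    }
)
print(summarize(sample))
```

With a real file this would be something like summarize(open("output.json").read()) after running python pdf_docx_reader.py document.pdf > output.json.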
Text Output
File: /path/to/document.pdf
Type: PDF
Pages: 1
Metadata:
title: Document Title
author: Author Name
creation_date: 2024-01-01
Content:
Complete document text here...
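The text format carries the same information as the JSON, just laid out for reading. A minimal sketch of that conversion, assuming the field names from the JSON sample; render_text is a hypothetical function and the tool's real formatting may differ in details such as spacing or which metadata keys it prints.

```python
def render_text(data: dict) -> str:
    """Lay out a result dictionary in the plain text shape shown above."""
    lines = [
        f"File: {data['file_path']}",
        f"Type: {data['file_type']}",
        f"Pages: {data['page_count']}",
        "Metadata:",
    ]
    # One "key: value" line per metadata entry, in insertion order.
    for key, value in data.get("metadata", {}).items():
        lines.append(f"{key}: {value}")
    lines.append("Content:")
    lines.append(data["full_text"])
    return "\n".join(lines)


example = {
    "file_path": "/path/to/document.pdf",
    "file_type": "PDF",
    "page_count": 1,
    "metadata": {"title": "Document Title", "author": "Author Name"},
    "full_text": "Complete document text here...",
}
print(render_text(example))
```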
Using with Cursor IDE
I built this specifically for Cursor IDE, so it works great there. Just open the terminal in Cursor and run:
python pdf_docx_reader.py your_document.pdf
You can also create a simple wrapper script if you want:
#!/bin/bash
# pdf_reader.sh
python /path/to/pdf_docx_reader.py "$1" --output-format text
Or use it in your Python code:
from pdf_docx_reader import FileReader
reader = FileReader()
data = reader.read_file("document.pdf")
print(data['full_text'])
Troubleshooting
If something goes wrong:
"PDF reading libraries not available"
- Run pip install -r requirements.txt
"File is not a PDF/DOCX"
- Make sure the file has a .pdf or .docx extension
Empty text extraction
- Some PDFs are just images - you'll need OCR for those
- Try the other PDF reader (it switches between pdfplumber and PyPDF2)
Permission errors
- Make sure the file isn't locked by another app
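To tell whether a PDF is image-only, you can check which pages came back without text. This sketch works on the parsed JSON output and assumes the pages/page_number/text fields from the sample output; pages_without_text is a hypothetical helper, not something the tool ships.

```python
from typing import List


def pages_without_text(data: dict, min_chars: int = 1) -> List[int]:
    """Return the page numbers whose extracted text is shorter than min_chars."""
    return [
        page["page_number"]
        for page in data.get("pages", [])
        if len(page.get("text", "").strip()) < min_chars
    ]


result = {
    "pages": [
        {"page_number": 1, "text": "Intro"},
        {"page_number": 2, "text": ""},
        {"page_number": 3, "text": "  \n"},
    ]
}
print(pages_without_text(result))  # prints [2, 3]
```

If every page shows up in the list, the document is probably scanned and needs OCR before any text can be extracted.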
Dependencies
- pdfplumber>=0.9.0: Main PDF reader
- PyPDF2>=3.0.0: Backup PDF reader
- python-docx>=0.8.11: DOCX reader
Examples
Reading a research paper:
python pdf_docx_reader.py research_paper.pdf --output-format text
Batch processing (plain text output):
for pdf in *.pdf; do
  echo "Processing: $pdf"
  python pdf_docx_reader.py "$pdf" --output-format text > "${pdf%.pdf}.txt"
done
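The same batch job can be done from Python if you'd rather not depend on a shell. This is a sketch under a few assumptions: the script sits in the current directory, it exits with a non-zero status when a file can't be read, and output_path and convert_all are hypothetical names.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def output_path(pdf: Path, output_format: str = "text") -> Path:
    """Pick an extension that matches the chosen output format."""
    suffix = ".txt" if output_format == "text" else ".json"
    return pdf.with_suffix(suffix)


def convert_all(
    folder: Path,
    output_format: str = "text",
    script: str = "pdf_docx_reader.py",
) -> List[Path]:
    """Run the reader on every PDF in folder; return the files that failed."""
    failed: List[Path] = []
    for pdf in sorted(folder.glob("*.pdf")):
        result = subprocess.run(
            [sys.executable, script, str(pdf), "--output-format", output_format],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:  # assumed failure signal
            failed.append(pdf)
            continue
        output_path(pdf, output_format).write_text(result.stdout, encoding="utf-8")
    return failed
```

Calling convert_all(Path(".")) would process the current directory. Matching the extension to the format avoids ending up with JSON inside .txt files.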
Get just the metadata:
python pdf_docx_reader.py document.pdf | jq '.metadata'
License
MIT License - feel free to use it however you want.
Contributing
Found a bug? Have an idea? Open an issue or send a PR. I'm always looking to improve this tool.