Similar Files Extension

A VS Code extension that shows files similar to the currently open file, ranked with the BM25 search algorithm. Well suited to managing Markdown notes, documentation, and related documents.

✨ Features

This extension adds a "Similar Files" sidebar to VS Code that intelligently shows files similar to the one you're currently editing.

🎯 Core Functionality

📋 Similar Files Sidebar: A dedicated view in the activity bar that updates automatically
🔍 BM25 Search Algorithm: Ranks files with BM25, the relevance function used by search engines such as Elasticsearch and Lucene
📊 Similarity Scores: Shows relevance scores next to each suggested file
🖱️ Clickable Results: Click any file in the sidebar to open it instantly
⚡ Real-time Updates: Index updates automatically when you save files
⚙️ Fully Configurable: Customize file patterns, result limits, and content filters

🛠️ Advanced Features

🔄 Incremental Re-indexing: Efficient updates without full rebuilds
🎯 Smart Filtering: Excludes current file from results
📁 Multi-format Support: Works with Markdown, text files, and more
⏱️ Performance Optimized: Debounced refresh and intelligent caching
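
As an illustration of the filtering and ranking described in these feature lists, here is a minimal sketch. The `ScoredFile` type and the `topSimilar` and `formatLabel` helpers are hypothetical names chosen for this example, not the extension's actual code.

```typescript
// Hypothetical shape of one search hit; the extension's real types may differ.
interface ScoredFile {
  path: string;
  score: number;
}

// Drop the current file, sort by descending score, keep the top `maxResults`.
function topSimilar(hits: ScoredFile[], currentPath: string, maxResults: number): ScoredFile[] {
  return hits
    .filter((hit) => hit.path !== currentPath) // filter() copies, so `hits` is not mutated
    .sort((a, b) => b.score - a.score)
    .slice(0, maxResults);
}

// Build a sidebar label in the "(score) filename" form, rounded to 2 decimals.
function formatLabel(hit: ScoredFile): string {
  const fileName = hit.path.split("/").pop() ?? hit.path;
  return `(${hit.score.toFixed(2)}) ${fileName}`;
}

const hits: ScoredFile[] = [
  { path: "notes/ai-ml.md", score: 7.31 },
  { path: "notes/deep-learning.md", score: 3.42 },
  { path: "notes/cooking.md", score: 0.2 },
  { path: "notes/statistics.md", score: 1.9 },
];

const labels = topSimilar(hits, "notes/ai-ml.md", 2).map(formatLabel);
console.log(labels); // ["(3.42) deep-learning.md", "(1.90) statistics.md"]
```

The current file would normally score highest against itself, which is why it is removed before the list is truncated.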

🚀 How to Use

1. Installation & Setup

Install the extension in VS Code
Open a workspace with text files (Markdown works best)
The extension will automatically index your files on activation

2. Finding Similar Files

Open any text file in your workspace
Look for the "Similar Files" panel in the sidebar (activity bar)
The panel will show files similar to your currently open file
Files are ranked by similarity score (higher = more similar)

3. Navigating Results

Click any file in the Similar Files panel to open it
Switch files in the editor to see updated similarity results
Save changes to a file to update the similarity index

4. Customization

Open VS Code settings (Ctrl/Cmd + ,) and search for "Similar Files":

similarFiles.maxResults (default: 5): Number of similar files to show
similarFiles.fileGlobs (default: ["**/*.md", "**/*.markdown"]): File patterns to include
similarFiles.minContentLength (default: 10): Minimum number of characters a file must contain to be indexed

💡 Pro Tips

Works best with content-rich files (documentation, notes, articles)
The more text content in your files, the better the similarity matching
Files with similar keywords, topics, or writing style will rank higher
Try opening different files to see how the suggestions change

Example: The Similar Files panel showing relevant documents with similarity scores

📋 Requirements

VS Code 1.100.0 or higher
Workspace with text files (works best with Markdown files)
Files with meaningful text content for best similarity results

⚙️ Configuration

This extension contributes the following settings:

similarFiles.maxResults: Number of similar files to show in the sidebar (default: 5)
similarFiles.fileGlobs: File patterns to include in similarity search (default: ["**/*.md", "**/*.markdown"])
similarFiles.minContentLength: Minimum character length for files to be indexed (default: 10)

Example Configuration

{
  "similarFiles.maxResults": 8,
  "similarFiles.fileGlobs": ["**/*.md", "**/*.txt", "**/*.rst"],
  "similarFiles.minContentLength": 50
}
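
To show how these three settings and their documented defaults might combine, here is a small sketch. The `resolveSettings` and `shouldIndex` helpers are hypothetical. Inside an extension the raw values would typically come from `vscode.workspace.getConfiguration("similarFiles")`, which is left out here so the example runs on its own.

```typescript
interface SimilarFilesSettings {
  maxResults: number;
  fileGlobs: string[];
  minContentLength: number;
}

// Defaults as documented in the Configuration section.
const DEFAULT_SETTINGS: SimilarFilesSettings = {
  maxResults: 5,
  fileGlobs: ["**/*.md", "**/*.markdown"],
  minContentLength: 10,
};

// Hypothetical helper: apply user values over the defaults and fall back to
// the default when a value is unusable (non-positive limit, empty glob list).
function resolveSettings(user: Partial<SimilarFilesSettings>): SimilarFilesSettings {
  const maxResults =
    typeof user.maxResults === "number" && user.maxResults > 0
      ? Math.floor(user.maxResults)
      : DEFAULT_SETTINGS.maxResults;
  const fileGlobs =
    Array.isArray(user.fileGlobs) && user.fileGlobs.length > 0
      ? user.fileGlobs.slice()
      : DEFAULT_SETTINGS.fileGlobs.slice();
  const minContentLength =
    typeof user.minContentLength === "number" && user.minContentLength >= 0
      ? Math.floor(user.minContentLength)
      : DEFAULT_SETTINGS.minContentLength;
  return { maxResults, fileGlobs, minContentLength };
}

// Assumption: a file is indexed only if its text reaches the minimum length.
function shouldIndex(content: string, settings: SimilarFilesSettings): boolean {
  return content.length >= settings.minContentLength;
}

const settings = resolveSettings({ maxResults: 8, minContentLength: 50 });
console.log(settings.maxResults);                 // 8
console.log(settings.fileGlobs);                  // ["**/*.md", "**/*.markdown"]
console.log(shouldIndex("too short", settings));  // false
```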

🧪 Testing the Extension

Method 1: Quick Manual Testing (Recommended for trying it out)

📋 See MANUAL_TESTING.md for a complete step-by-step testing guide!

Quick Start:

Clone this repository and run npm install
Open in VS Code and press F5 to launch Extension Development Host
In the new window, open the test-workspace folder
Open any .md file and check the "Similar Files" panel in the sidebar
Click on suggested files to test navigation

Method 2: Extension Development Host (Full Development)

Clone and setup the repository:

git clone <repository-url>
cd first-extension
npm install

Open in VS Code and launch:

```
code .
```

Start debugging (F5):
- Press F5 or go to Run > Start Debugging
- This opens a new "Extension Development Host" window with the extension loaded

Test with sample content:
- Open the test-workspace folder in the Extension Development Host
- Open any .md file (e.g., ai-ml.md)
- Check the "Similar Files" panel in the sidebar
- Click on suggested files to test navigation
- Edit and save files to test real-time updates

Method 3: Manual Testing with Your Own Content

Prepare test content:
- Create several Markdown files with related content
- For example: project documentation, meeting notes, research files

Test scenarios:
- Open files with similar topics and verify relevant suggestions appear
- Save changes to a file and check if suggestions update
- Try different file types if you've configured custom fileGlobs
- Test the settings by changing maxResults and observing the sidebar

Method 4: Automated Testing

npm run test

The test suite includes 9 comprehensive tests covering:

Extension activation and setup
Index building and querying
File updates and incremental indexing
Configuration handling
TreeDataProvider functionality

🚀 Installation for Daily Use

From Source (Development)

Clone this repository
Run npm install and npm run compile
Press F5 to test in Extension Development Host

From Package (Future)

Will be available on VS Code Marketplace
Install via Extensions panel in VS Code

🔧 How It Works

The extension uses the BM25 ranking function (Best Matching 25), the default relevance scoring in search engines such as Elasticsearch:

📚 Indexing Phase: When you open VS Code, the extension scans your workspace for configured file types and builds a search index
🔍 Query Phase: When you open or edit a file, it uses the file's content as a search query against the index
📊 Ranking: Results are ranked by BM25 similarity scores, showing the most relevant files first
⚡ Updates: The index updates incrementally when you save files, keeping suggestions current
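
The indexing and incremental-update steps above can be sketched with a tiny in-memory inverted index. This is an illustrative stand-in, not the extension's code: the real index is provided by MiniSearch, and the `TinyIndex` class and its method names are invented for this example.

```typescript
// Minimal inverted index: term -> (file path -> occurrence count).
class TinyIndex {
  private postings = new Map<string, Map<string, number>>();
  private termsByFile = new Map<string, string[]>();

  // Add a file, or replace its previous entry if it was already indexed.
  upsert(path: string, content: string): void {
    this.remove(path);
    const terms: string[] = content.toLowerCase().match(/[a-z0-9]+/g) ?? [];
    this.termsByFile.set(path, terms);
    for (const term of terms) {
      const byFile = this.postings.get(term) ?? new Map<string, number>();
      byFile.set(path, (byFile.get(path) ?? 0) + 1);
      this.postings.set(term, byFile);
    }
  }

  // Remove only this file's postings; entries for other files are untouched.
  remove(path: string): void {
    const terms = this.termsByFile.get(path);
    if (!terms) return;
    for (const term of terms) {
      const byFile = this.postings.get(term);
      if (!byFile) continue;
      byFile.delete(path);
      if (byFile.size === 0) this.postings.delete(term);
    }
    this.termsByFile.delete(path);
  }

  // Sorted paths of all files that contain the term.
  filesWith(term: string): string[] {
    const byFile = this.postings.get(term.toLowerCase());
    return byFile ? Array.from(byFile.keys()).sort() : [];
  }

  get fileCount(): number {
    return this.termsByFile.size;
  }
}

const tinyIndex = new TinyIndex();
tinyIndex.upsert("a.md", "Neural networks and gradient descent");
tinyIndex.upsert("b.md", "Gradient boosting for tabular data");
console.log(tinyIndex.filesWith("gradient")); // ["a.md", "b.md"]

// Simulate saving a.md with new content: only a.md is re-indexed.
tinyIndex.upsert("a.md", "Notes on sourdough baking");
console.log(tinyIndex.filesWith("gradient")); // ["b.md"]
```

Replacing a single file's entries on save is what keeps updates cheap compared with rebuilding the whole index.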

Why BM25?

Relevance: Considers both term frequency and document length
Performance: Fast queries even with large document collections
Proven: Used by major search engines and information retrieval systems
Adaptable: Works well with both short and long documents
Balanced: Properly handles common terms without overwhelming results

🔢 Understanding Similarity Scores

How Scoring Works

The Similar Files extension uses the BM25 algorithm (implemented via MiniSearch) to calculate similarity between documents. Here's how the scoring works:

Score Range: BM25 scores are not normalized and have no fixed upper bound. They typically range from 0 to around 10, with:
- Higher scores (e.g., 2.0+) indicating strong similarity
- Medium scores (e.g., 0.5-2.0) indicating moderate similarity
- Lower scores (e.g., <0.5) indicating minimal similarity
Scores are only comparable within the same result list, because they depend on the content of the current file and on what else is in the index.
Score Calculation: The BM25 algorithm considers:
- Term Frequency (TF): How often terms from the current file appear in the candidate file
- Inverse Document Frequency (IDF): How rare those terms are across all indexed documents
- Document Length: Term frequencies are normalized by document length to avoid bias toward longer documents
Display Format: Scores are displayed as (score) filename in the TreeView, rounded to 2 decimal places.
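
To make the three factors concrete, here is a minimal sketch of the textbook Okapi BM25 formula. It is not MiniSearch's implementation, whose tokenization and parameters may differ. The function names and the values `k1 = 1.2` and `b = 0.75` (commonly used defaults) are assumptions made for this example.

```typescript
// Lowercase and split into alphanumeric tokens (a deliberately simple tokenizer).
function bm25Tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Score every document in `docs` against `query` with textbook BM25.
function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const tokenized = docs.map(bm25Tokenize);
  const totalLength = tokenized.reduce((sum, terms) => sum + terms.length, 0);
  const avgLength = totalLength / Math.max(docs.length, 1);
  const queryTerms = Array.from(new Set(bm25Tokenize(query)));

  return tokenized.map((terms) => {
    let score = 0;
    for (const term of queryTerms) {
      // TF: occurrences of the query term in this document.
      const tf = terms.filter((t) => t === term).length;
      if (tf === 0) continue;
      // IDF: rarer terms across the collection contribute more.
      const docsWithTerm = tokenized.filter((t) => t.indexOf(term) !== -1).length;
      const idf = Math.log(1 + (docs.length - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
      // Length normalization: long documents need more matches for the same score.
      const lengthNorm = 1 - b + b * (terms.length / avgLength);
      score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    }
    return score;
  });
}

const corpus = [
  "Neural networks learn with gradient descent",
  "Gradient boosting builds an ensemble of decision trees",
  "Sourdough bread needs a long fermentation",
];
// In the extension, the query would be the full text of the current file.
const scores = bm25Scores("gradient descent for neural networks", corpus);
console.log(scores.map((s) => s.toFixed(2)));
```

In this example the first document shares four terms with the query and ranks highest, the second shares only "gradient", and the third shares none, so its score is 0.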

About MiniSearch (Our Dependency)

This extension uses MiniSearch, a small yet powerful full-text search engine that:

Implements the BM25 ranking algorithm (same as used by Elasticsearch and Lucene)
Provides fuzzy matching with configurable edit distance
Has zero external dependencies
Supports incremental indexing (essential for our file-change updates)
Is optimized for in-memory usage in JavaScript environments

Potential Improvements

Several approaches could enhance the similarity detection in future versions:

Semantic Search Integration:
- Using embeddings from transformer models (e.g., run locally via @xenova/transformers) to capture semantic meaning
- Creating vector representations of documents and measuring cosine similarity
- Implementing hybrid search combining BM25 lexical matching with semantic similarity
Advanced Preprocessing:
- Adding stemming to match different forms of the same word
- Implementing stopword removal for more meaningful comparisons
- Using language-specific tokenization for international content
Alternative Algorithms:
- BM25+: A BM25 variant that lower-bounds the term-frequency component so that very long documents are not over-penalized
- TF-IDF with SVD: Using singular value decomposition for dimension reduction
- SimHash: A technique for quickly finding similar documents using locality-sensitive hashing
Custom Weighting:
- Giving higher weight to document titles, headings, or specific sections
- Adjusting the relative importance of rare vs. common terms
- Implementing user-defined boosting for certain keywords
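
As a sketch of the cosine-similarity idea listed under Semantic Search Integration, the function below compares two document vectors. The vectors are made-up three-dimensional examples. In a real semantic search they would be model embeddings with hundreds of dimensions.

```typescript
// Cosine of the angle between two equal-length vectors:
// 1 = same direction, 0 = orthogonal (or one vector is all zeros).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical document vectors, for illustration only.
const docA = [1, 2, 3];
const docB = [2, 4, 6]; // same direction as docA
const docC = [3, 0, -1]; // orthogonal to docA

console.log(cosineSimilarity(docA, docB).toFixed(2)); // "1.00"
console.log(cosineSimilarity(docA, docC).toFixed(2)); // "0.00"
```

Unlike BM25 scores, cosine similarity is bounded, which would make a hybrid of the two easier to combine once both are brought onto a common scale.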
