Diffinite — Source Code Comparison
Forensic source-code comparison tool for IP litigation and code audit, now available as a VSCode extension. It compares two directories of source code using Winnowing fingerprints (Schleimer et al., 2003, the algorithm behind Stanford MOSS) and generates professional PDF, HTML, or Markdown reports, all from within VSCode.
Features
Usage
How It Works
Diffinite runs a two-stage pipeline: Stage 1: 1:1 File Matching (
| Column | Description |
|---|---|
| File A / File B | Matched file paths |
| Match | `SequenceMatcher.ratio()`: similarity in [0, 1], computed as 2·M/T for M matching characters out of T total characters in both files (1.0 = identical) |
| Added / Deleted | Lines added to or deleted from File A to produce File B |
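The Match column can be reproduced with Python's standard `difflib` module. The following is a minimal sketch, assuming the comparison runs on raw file text; the helper name `match_ratio` is hypothetical and is not part of Diffinite's API.

```python
from difflib import SequenceMatcher


def match_ratio(text_a: str, text_b: str, autojunk: bool = True) -> float:
    """Return 2*M/T, where M is the number of matched characters and
    T is the combined length of both inputs (1.0 means identical)."""
    return SequenceMatcher(None, text_a, text_b, autojunk=autojunk).ratio()


line_a = "int total = a + b;\n"
line_b = "int total = a + c;\n"

print(match_ratio(line_a, line_a))            # identical inputs give 1.0
print(round(match_ratio(line_a, line_b), 3))  # one character differs
```

Passing `autojunk=False` corresponds to the idea behind the No Autojunk option: `difflib` then stops treating very frequent elements as junk in long sequences, which is slower but avoids skipped matches.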
Diff Pages
Side-by-side diff for each matched pair:
- 🟢 Green — Lines present only in File B (additions)
- 🔴 Red — Lines present only in File A (deletions)
- No highlight — Identical lines (with configurable context folding)
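The colour coding and context folding above can be illustrated with `difflib` opcodes. This is only a sketch under the assumption that lines are compared as whole strings; `classify_lines` and its markers (`-`, `+`, space) are hypothetical names chosen for the example, not Diffinite internals.

```python
from difflib import SequenceMatcher


def classify_lines(lines_a, lines_b, context=3):
    """Return a list of hunks. Each hunk is a list of (marker, line) pairs:
    '-' = only in File A (red), '+' = only in File B (green),
    ' ' = identical line kept as context."""
    matcher = SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    hunks = []
    # get_grouped_opcodes keeps at most `context` unchanged lines around
    # each change and drops the rest, which is the folding behaviour.
    for group in matcher.get_grouped_opcodes(context):
        hunk = []
        for tag, i1, i2, j1, j2 in group:
            if tag == "equal":
                hunk.extend((" ", line) for line in lines_a[i1:i2])
                continue
            if tag in ("replace", "delete"):
                hunk.extend(("-", line) for line in lines_a[i1:i2])
            if tag in ("replace", "insert"):
                hunk.extend(("+", line) for line in lines_b[j1:j2])
        hunks.append(hunk)
    return hunks


file_a = [str(n) for n in range(20)]
file_b = list(file_a)
file_b[10] = "x"  # change a single line in the middle

for marker, line in classify_lines(file_a, file_b)[0]:
    print(marker, line)
```

With 3 context lines, the single changed line is shown with three unchanged lines on each side, and the remaining identical lines are folded away.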
Deep Compare Section
N:M cross-matching table (deep mode):
| Column | Description |
|---|---|
| File A | Source file from directory A |
| Matched Files (B) | All B-files sharing fingerprints above the Jaccard threshold |
| Jaccard | \|A∩B\| / \|A∪B\|: fraction of shared Winnowing fingerprints |
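As an illustration of the Jaccard column, the sketch below computes the score on sets of fingerprint hashes and builds an N:M table. It assumes fingerprints are plain integer sets and that a pair is kept when its score is at or above the threshold; `jaccard`, `cross_match`, and the sample file names are hypothetical.

```python
def jaccard(fp_a: set[int], fp_b: set[int]) -> float:
    """|A ∩ B| / |A ∪ B| for two fingerprint sets (0.0 if both are empty)."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return len(fp_a & fp_b) / len(union)


def cross_match(fps_a: dict[str, set[int]],
                fps_b: dict[str, set[int]],
                threshold: float = 0.05) -> dict[str, list[tuple[str, float]]]:
    """For every A-file, list all B-files whose Jaccard score reaches the
    threshold, best match first."""
    table = {}
    for name_a, fp_a in fps_a.items():
        hits = [(name_b, jaccard(fp_a, fp_b)) for name_b, fp_b in fps_b.items()]
        hits = sorted((h for h in hits if h[1] >= threshold),
                      key=lambda h: h[1], reverse=True)
        if hits:
            table[name_a] = hits
    return table


dir_a = {"a/util.c": {1, 2, 3, 4}}
dir_b = {"b/helpers.c": {2, 3, 4, 5}, "b/main.c": {9, 10}}
print(cross_match(dir_a, dir_b))
```

Because the score depends only on set membership, moving a function to a different position in a file does not change it.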
Extension Settings
| Setting | Default | Description |
|---|---|---|
| `diffinite.pythonPath` | `python` | Path to Python interpreter with diffinite installed |
| `diffinite.defaultMode` | `deep` | Default execution mode (`simple` or `deep`) |
GUI Options Panel
All options are configurable through the built-in GUI panel:
| Option | Default | Description |
|---|---|---|
| Mode | `deep` | `simple` = 1:1 only. `deep` = 1:1 + N:M cross-matching |
| Strip Comments | off | Remove comments before comparison |
| By Word | off | Compare by word instead of by line |
| Normalize | off | Normalize identifiers/literals for Type-2 clone detection |
| Collapse Identical | off | Fold unchanged blocks (3 context lines) |
| No Autojunk | off | Disable autojunk heuristic for more precise forensic analysis |
| Threshold | 60 | Fuzzy file-name matching threshold (0–100) |
| K-gram | 5 | Winnowing K-gram size (Schleimer 2003 §4.2) |
| Window | 4 | Winnowing window size. Detection guarantee: any shared sequence of at least K+W−1 tokens (8 with the defaults) is detected |
| Threshold (Deep) | 0.05 | Minimum Jaccard similarity to include in results |
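To make the K-gram and Window options concrete, here is a minimal sketch of Winnowing as described by Schleimer et al.: hash every K-gram of tokens, then keep the minimum hash of each window of W consecutive hashes, preferring the rightmost minimum on ties. The regex tokenizer, the use of CRC32 as the hash, and all function names are assumptions made for this example and may differ from Diffinite's implementation.

```python
import re
import zlib


def tokenize(source: str) -> list[str]:
    # Hypothetical single-regex tokenizer: identifiers, integers,
    # or any other single non-space character.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def kgram_hashes(tokens: list[str], k: int = 5) -> list[int]:
    """Hash every run of k consecutive tokens (CRC32 is an arbitrary choice)."""
    return [zlib.crc32(" ".join(tokens[i:i + k]).encode("utf-8"))
            for i in range(len(tokens) - k + 1)]


def winnow(hashes: list[int], w: int = 4) -> set[tuple[int, int]]:
    """Select (position, hash) pairs: the rightmost minimum of each window."""
    if not hashes:
        return set()
    selected = set()
    last_start = max(len(hashes) - w, 0)  # short inputs form one window
    for start in range(last_start + 1):
        window = hashes[start:start + w]
        lowest = min(window)
        offset = len(window) - 1 - window[::-1].index(lowest)
        selected.add((start + offset, lowest))
    return selected


def fingerprints(source: str, k: int = 5, w: int = 4) -> set[int]:
    """Hash values only, suitable for a Jaccard comparison."""
    return {h for _, h in winnow(kgram_hashes(tokenize(source), k), w)}


left = "x1 x2 x3 a b c d e f g h y1 y2 y3"
right = "p q r s a b c d e f g h t u"
print(len(fingerprints(left) & fingerprints(right)) > 0)
```

The two sample inputs share a run of 8 tokens, which is K+W−1 for the defaults. That run produces one full window of identical hashes in both inputs, so both select the same minimum and share at least one fingerprint.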
Requirements
This extension requires either:
- Bundled binary (included for Windows / Linux / macOS when available), or
- Python ≥ 3.10 with diffinite installed: `pip install diffinite`
Comment Stripping Support
The Strip Comments option removes comments using a 5-state FSM parser:
| Extensions | Comment Styles |
|---|---|
| `.py` | `# line`, `"""docstrings"""` |
| `.js`, `.ts`, `.jsx`, `.tsx` | `// line`, `/* block */`, `` `template literals` `` |
| `.java`, `.c`, `.cpp`, `.h`, `.cs`, `.go`, `.rs`, `.kt`, `.scala` | `// line`, `/* block */` |
| `.html`, `.xml`, `.svg` | `<!-- block -->` |
| `.css`, `.scss`, `.less` | `/* block */` |
| `.sql` | `-- line`, `/* block */` |
| `.rb`, `.sh`, `.bash`, `.r` | `# line` |
| `.lua` | `-- line`, `--[[ block ]]` |
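A state machine is used for this job because comment markers inside string literals must be left alone. The sketch below handles only the C-style row of the table (`// line` and `/* block */`). Its five states are an assumption chosen for illustration and are not necessarily the states Diffinite uses; `strip_c_style_comments` is a hypothetical name.

```python
from enum import Enum, auto


class State(Enum):
    CODE = auto()
    LINE_COMMENT = auto()
    BLOCK_COMMENT = auto()
    STRING = auto()
    STRING_ESCAPE = auto()


def strip_c_style_comments(source: str) -> str:
    """Remove // and /* */ comments, keeping string literals and newlines."""
    out = []
    state = State.CODE
    quote = ""
    i = 0
    while i < len(source):
        ch = source[i]
        nxt = source[i + 1] if i + 1 < len(source) else ""
        if state is State.CODE:
            if ch == "/" and nxt == "/":
                state = State.LINE_COMMENT
                i += 2
                continue
            if ch == "/" and nxt == "*":
                state = State.BLOCK_COMMENT
                i += 2
                continue
            if ch in ('"', "'"):
                state = State.STRING
                quote = ch
            out.append(ch)
        elif state is State.LINE_COMMENT:
            if ch == "\n":          # the newline ends the comment and is kept
                out.append(ch)
                state = State.CODE
        elif state is State.BLOCK_COMMENT:
            if ch == "*" and nxt == "/":
                state = State.CODE
                i += 2
                continue
            if ch == "\n":          # keep newlines so line numbers stay stable
                out.append(ch)
        elif state is State.STRING:
            out.append(ch)
            if ch == "\\":
                state = State.STRING_ESCAPE
            elif ch == quote:
                state = State.CODE
        else:                       # STRING_ESCAPE: copy the escaped character
            out.append(ch)
            state = State.STRING
        i += 1
    return "".join(out)


print(strip_c_style_comments('a = "//not"; // gone\nb /* x\ny */ c'))
```

This sketch treats every single quote as the start of a literal, so constructs such as Rust lifetimes would need extra handling in a real implementation.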
📊 Benchmark results and dataset rationale → see GitHub README
Limitations
- General-purpose tokenizer — Uses a single regex tokenizer, not language-specific parsers.
- Position-independent — Reordered functions may produce higher similarity than expected.
- No corpus-wide weighting — Pairwise comparison only; no TF-IDF to down-weight common idioms.
- Not a legal opinion — Similarity scores are mathematical measurements, not legal conclusions.
License
See NOTICE for attribution.