Diffinite — Source Code Comparison

Forensic source-code comparison tool for IP litigation and code audit, now available as a VSCode extension.

Compare two directories of source code with Winnowing fingerprints (Schleimer et al., 2003 — the algorithm behind Stanford MOSS) and generate professional PDF/HTML/Markdown reports — all from within VSCode.

Design Principle: Diffinite reports how similar two codebases are and where the similarities occur. It does not classify the type of copying; that is the expert witness's job.

Features

1:1 File Matching — Pairs files across two directories using fuzzy name matching, then computes line-by-line or word-by-word diffs with syntax highlighting.
N:M Cross-Matching (Deep Mode) — Winnowing fingerprint-based Jaccard similarity across all file pairs. Detects code reuse even across renamed, split, or merged files.
Comment Stripping — 5-state FSM parser supporting 30+ file extensions (.py, .js, .ts, .java, .c, .cpp, .go, .rs, .rb, .sql, .html, .css, and more).
Multiple Report Formats — Export to PDF, HTML, or Markdown.
Forensic Annotations — Page numbers, file numbers, Bates stamps, and filenames on every page.
GUI Options Panel — Configure all analysis parameters visually without touching the CLI.
Bundled Binary Support — Ships with standalone binaries when available; falls back to Python if needed.

Usage

1. Open the Command Palette (Ctrl+Shift+P / Cmd+Shift+P).
2. Run "Diffinite: Compare Directories".
3. Select the original directory (A) and the comparison directory (B).
4. Configure options in the GUI panel (mode, thresholds, comment stripping, etc.).
5. View results in the built-in diff viewer or export a report.

How It Works

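Diffinite runs a pipeline of up to two stages. `simple` mode runs Stage 1 only; `deep` mode (the default) runs Stage 1 followed by Stage 2: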

Stage 1: 1:1 File Matching (`simple` mode)

Fuzzy name matching — Pairs files across directories using string similarity (configurable threshold).
Comment stripping — Optionally removes comments using a 5-state finite state machine parser.
Side-by-side diff — Computes line-by-line (or word-by-word) diffs using difflib.SequenceMatcher.
Report generation — Renders syntax-highlighted HTML diffs via Pygments and converts them to PDF with xhtml2pdf.
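
The fuzzy name-matching step can be sketched as follows. This is an illustration only: the scoring function (`difflib` ratio on lower-cased base names, scaled to 0-100), the greedy pairing strategy, and the names `name_score` and `pair_files` are assumptions made for this example, not Diffinite's documented internals. Only the 0-100 threshold with a default of 60 comes from the options table.

```python
from difflib import SequenceMatcher
from pathlib import PurePosixPath


def name_score(path_a: str, path_b: str) -> float:
    """Similarity of two file names on a 0-100 scale (assumed scoring)."""
    name_a = PurePosixPath(path_a).name.lower()
    name_b = PurePosixPath(path_b).name.lower()
    return 100.0 * SequenceMatcher(None, name_a, name_b).ratio()


def pair_files(files_a, files_b, threshold=60.0):
    """Greedily pair each A-file with its best unused B-file above threshold."""
    candidates = sorted(
        ((name_score(a, b), a, b) for a in files_a for b in files_b),
        reverse=True,
    )
    used_a, used_b, pairs = set(), set(), []
    for score, a, b in candidates:
        if score < threshold:
            break  # remaining candidates are all below the threshold
        if a in used_a or b in used_b:
            continue  # each file takes part in at most one 1:1 pair
        used_a.add(a)
        used_b.add(b)
        pairs.append((a, b, round(score, 1)))
    return pairs


pairs = pair_files(
    ["src/parser.py", "src/utils.py"],
    ["lib/parser_v2.py", "lib/helpers.py"],
)
```

Files that stay unpaired here (for example after a rename) are the cases that Stage 2 is meant to catch.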

Stage 2: N:M Cross-Matching (`deep` mode, default)

Winnowing fingerprint extraction — Extracts position-independent code fingerprints (K-gram → rolling hash → window selection).
Inverted index construction — Builds a hash-to-file mapping for all B-directory fingerprints.
Jaccard similarity computation — For each A-file, computes |A∩B| / |A∪B| against all B-files sharing fingerprints.
Cross-match reporting — Appends an N:M similarity matrix showing which files from A are similar to which files in B.
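
The four steps above can be sketched in a few dozen lines. This is a minimal illustration, not Diffinite's implementation: the tokenizer regex is a guess, a truncated SHA-1 stands in for the rolling hash to keep the code short, and the function names (`fingerprints`, `build_index`, `cross_match`) are hypothetical. The defaults K=5, W=4, and a Jaccard threshold of 0.05 are taken from the options table.

```python
import hashlib
import re
from collections import defaultdict


def tokenize(text):
    """Rough tokenizer (assumption): identifiers, numbers, single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", text)


def fingerprints(text, k=5, w=4):
    """Winnowing: hash every k-gram, keep the minimum hash of each window."""
    tokens = tokenize(text)
    grams = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    # A stable hash (not Python's salted hash()) keeps runs reproducible.
    hashes = [int(hashlib.sha1(g.encode()).hexdigest()[:8], 16) for g in grams]
    if len(hashes) < w:
        return set(hashes)
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def build_index(files_b):
    """Inverted index: fingerprint hash -> names of B-files containing it."""
    index, fps_b = defaultdict(set), {}
    for name, text in files_b.items():
        fps_b[name] = fingerprints(text)
        for h in fps_b[name]:
            index[h].add(name)
    return index, fps_b


def cross_match(files_a, files_b, threshold=0.05):
    """For each A-file, score only the B-files that share a fingerprint."""
    index, fps_b = build_index(files_b)
    results = {}
    for name_a, text in files_a.items():
        fp_a = fingerprints(text)
        candidates = set()
        for h in fp_a:
            candidates |= index.get(h, set())
        scored = []
        for name_b in candidates:
            fp_b = fps_b[name_b]
            jaccard = len(fp_a & fp_b) / len(fp_a | fp_b)
            if jaccard >= threshold:
                scored.append((name_b, jaccard))
        results[name_a] = sorted(scored, key=lambda item: -item[1])
    return results


files_a = {
    "a/math.c": "int add(int x, int y) { return x + y; }\n"
                "int sub(int x, int y) { return x - y; }",
}
files_b = {
    "b/arith.c": "int add(int x, int y) { return x + y; }",
    "b/io.c": "void log(char *m) { puts(m); }",
}
matches = cross_match(files_a, files_b)
```

The inverted index is what keeps the N:M comparison tractable: a B-file that shares no fingerprint with an A-file (`b/io.c` in this example) is never scored at all.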

Output Report

Cover Page

Summary table for each matched file pair:

Column	Description
File A / File B	Matched file paths
Match	`SequenceMatcher.ratio()` — computed as 2·M/T, where M is the number of matching characters and T is the total number of characters in both files (`1.0` = identical)
Added / Deleted	Lines added to or deleted from File A to produce File B
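
As a rough sketch of how these columns could be derived with `difflib`: the helper name `summarize` is hypothetical, and the split between a character-level ratio and line-level counts is an assumption based on the column descriptions above, not a statement about Diffinite's code.

```python
from difflib import SequenceMatcher


def summarize(text_a: str, text_b: str):
    """Return (match ratio, lines added, lines deleted) for one file pair."""
    # Character-level ratio, following the "Match" column description.
    ratio = SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()

    # Line-level opcodes for the Added / Deleted counts.
    lines_a, lines_b = text_a.splitlines(), text_b.splitlines()
    matcher = SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    added = deleted = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted += i2 - i1  # lines present only in File A
        if tag in ("replace", "insert"):
            added += j2 - j1    # lines present only in File B
    return ratio, added, deleted


old = "def area(r):\n    return 3.14 * r * r\n"
new = "import math\n\ndef area(r):\n    return math.pi * r * r\n"
ratio, added, deleted = summarize(old, new)
```

Note that a modified line counts once as deleted and once as added, because `difflib` reports it as a `replace` spanning both sides.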

Diff Pages

Side-by-side diff for each matched pair:

🟢 Green — Lines present only in File B (additions)
🔴 Red — Lines present only in File A (deletions)
No highlight — Identical lines (with configurable context folding)
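
Context folding of this kind maps naturally onto `difflib.SequenceMatcher.get_grouped_opcodes`, which keeps a fixed number of unchanged lines around each change and drops the rest. The sketch below assumes that approach; the helper name `folded_hunks` and the `"+"`/`"-"`/`" "` markers (standing in for green, red, and no highlight) are illustrative choices.

```python
from difflib import SequenceMatcher


def folded_hunks(lines_a, lines_b, context=3):
    """Group changes into hunks with `context` unchanged lines around them."""
    matcher = SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    hunks = []
    for group in matcher.get_grouped_opcodes(context):
        hunk = []
        for tag, i1, i2, j1, j2 in group:
            if tag == "equal":
                hunk += [(" ", line) for line in lines_a[i1:i2]]  # no highlight
                continue
            if tag in ("replace", "delete"):
                hunk += [("-", line) for line in lines_a[i1:i2]]  # red
            if tag in ("replace", "insert"):
                hunk += [("+", line) for line in lines_b[j1:j2]]  # green
        hunks.append(hunk)
    return hunks


lines_a = [f"line {n}" for n in range(20)]
lines_b = list(lines_a)
lines_b[10] = "line ten"
hunks = folded_hunks(lines_a, lines_b)
```

With one changed line in a 20-line file, the result is a single hunk holding the change plus three context lines on each side; the other unchanged lines are folded away.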

Deep Compare Section

N:M cross-matching table (deep mode):

Column	Description
File A	Source file from directory A
Matched Files (B)	All B-files sharing fingerprints above the Jaccard threshold
Jaccard	`|A∩B| / |A∪B|` — fraction of shared Winnowing fingerprints

Extension Settings

Setting	Default	Description
`diffinite.pythonPath`	`python`	Path to Python interpreter with diffinite installed
`diffinite.defaultMode`	`deep`	Default execution mode (`simple` or `deep`)

GUI Options Panel

All options are configurable through the built-in GUI panel:

Option	Default	Description
Mode	`deep`	`simple` = 1:1 only. `deep` = 1:1 + N:M cross-matching
Strip Comments	off	Remove comments before comparison
By Word	off	Compare by word instead of by line
Normalize	off	Normalize identifiers/literals for Type-2 clone detection
Collapse Identical	off	Fold unchanged blocks (3 context lines)
No Autojunk	off	Disable `difflib`'s autojunk heuristic (which ignores very frequent items in long inputs) for more precise forensic analysis
Threshold	`60`	Fuzzy file-name matching threshold (0–100)
K-gram	`5`	Winnowing K-gram size (Schleimer 2003 §4.2)
Window	`4`	Winnowing window size. Detection guarantee: any shared sequence of at least K+W−1 tokens (8 with the defaults) is detected
Threshold (Deep)	`0.05`	Minimum Jaccard similarity to include in results
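
To see why the No Autojunk option matters, consider what `difflib`'s heuristic does: once the second sequence has 200 or more items, any item accounting for more than 1% of it is treated as "popular" and ignored when seeding matches. The sketch below uses a deliberately extreme input to make the effect visible. It assumes the option corresponds to passing `autojunk=False` to `SequenceMatcher`, which this page does not state explicitly.

```python
from difflib import SequenceMatcher

# File B: 200 copies of one boilerplate line (200+ items triggers the heuristic).
lines_b = ["    return None"] * 200
# File A: the same content preceded by a single new header line.
lines_a = ["# header"] + lines_b

default_ratio = SequenceMatcher(None, lines_a, lines_b).ratio()
precise_ratio = SequenceMatcher(None, lines_a, lines_b, autojunk=False).ratio()

print(f"autojunk on : {default_ratio:.4f}")
print(f"autojunk off: {precise_ratio:.4f}")
```

With the heuristic on, the repeated line is discarded as noise and the two nearly identical files score far lower than they should. Disabling it costs time on large inputs but avoids understating similarity, which is usually the safer trade-off in a forensic setting.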

Requirements

This extension requires either:

Bundled binary (included for Windows / Linux / macOS when available), or
Python ≥ 3.10 with diffinite installed:
```
pip install diffinite
```

Comment Stripping Support

The Strip Comments option removes comments using a 5-state FSM parser:

Extensions	Comment Styles
`.py`	`# line`, `"""docstrings"""`
`.js`, `.ts`, `.jsx`, `.tsx`	`// line`, `/* block */`, `template literals`
`.java`, `.c`, `.cpp`, `.h`, `.cs`, `.go`, `.rs`, `.kt`, `.scala`	`// line`, `/* block */`
`.html`, `.xml`, `.svg`	`<!-- block -->`
`.css`, `.scss`, `.less`	`/* block */`
`.sql`	`-- line`, `/* block */`
`.rb`, `.sh`, `.bash`, `.r`	`# line`
`.lua`	`-- line`, `--[[ block ]]`
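
For readers unfamiliar with FSM-based stripping, here is a minimal sketch for the `// line` and `/* block */` family. The five states chosen here (code, line comment, block comment, double-quoted string, single-quoted string) are an assumption for illustration; this page does not say which five states Diffinite's parser uses. The function name `strip_c_comments` is likewise hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    CODE = auto()
    LINE_COMMENT = auto()
    BLOCK_COMMENT = auto()
    DOUBLE_QUOTED = auto()
    SINGLE_QUOTED = auto()


def strip_c_comments(src: str) -> str:
    """Remove // and /* */ comments while leaving string contents intact."""
    out, state, i = [], State.CODE, 0
    while i < len(src):
        ch = src[i]
        nxt = src[i + 1] if i + 1 < len(src) else ""
        if state is State.CODE:
            if ch == "/" and nxt in ("/", "*"):
                state = State.LINE_COMMENT if nxt == "/" else State.BLOCK_COMMENT
                i += 2
                continue
            if ch == '"':
                state = State.DOUBLE_QUOTED
            elif ch == "'":
                state = State.SINGLE_QUOTED
            out.append(ch)
        elif state is State.LINE_COMMENT:
            if ch == "\n":
                state = State.CODE
                out.append(ch)  # keep the newline so line numbers stay aligned
        elif state is State.BLOCK_COMMENT:
            if ch == "*" and nxt == "/":
                state = State.CODE
                i += 2
                continue
            if ch == "\n":
                out.append(ch)  # preserve line structure inside block comments
        else:  # inside a string or character literal
            out.append(ch)
            if ch == "\\" and nxt:
                out.append(nxt)  # copy the escaped character verbatim
                i += 2
                continue
            closing = '"' if state is State.DOUBLE_QUOTED else "'"
            if ch == closing:
                state = State.CODE
        i += 1
    return "".join(out)


source = 'int x = 1; // counter\nchar *u = "http://a/*b*/"; /* gone */\n'
stripped = strip_c_comments(source)
```

The string states are the reason a state machine is used rather than a simple regex: the `//` and `/* */` sequences inside the URL literal must survive, while the real comments are removed.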

📊 Benchmark results and dataset rationale → see GitHub README

Limitations

General-purpose tokenizer — Uses a single regex tokenizer, not language-specific parsers.
Position-independent — Fingerprints carry no ordering information, so files whose functions were only reordered can still score as highly similar.
No corpus-wide weighting — Pairwise comparison only; no TF-IDF to down-weight common idioms.
Not a legal opinion — Similarity scores are mathematical measurements, not legal conclusions.
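
To make the first limitation concrete, here is a hedged sketch of what a single-regex tokenizer with identifier and literal normalization (the Normalize option) might look like. The regex, the `KEYWORDS` set, and the `ID`/`LIT` placeholders are all assumptions for this example; Diffinite's real tokenizer may differ.

```python
import re

TOKEN_RE = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<ident>[A-Za-z_]\w*)
  | (?P<symbol>\S)
    """,
    re.VERBOSE,
)

# One small keyword list shared by every language (illustrative only).
KEYWORDS = frozenset(
    {"if", "else", "for", "while", "return", "def", "class", "int", "void"}
)


def normalize(source: str) -> list[str]:
    """Tokenize, replacing identifiers with ID and literals with LIT."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "ident" and text not in KEYWORDS:
            tokens.append("ID")
        elif kind in ("string", "number"):
            tokens.append("LIT")
        else:
            tokens.append(text)
    return tokens


original = normalize("total = count + 1")
renamed = normalize("sum = n + 42")
```

Both statements reduce to the same token stream, which is how renamed-variable (Type-2) clones become detectable. The trade-off is also visible: a tokenizer with no language-specific grammar cannot tell a keyword it does not know from an ordinary identifier, so structure that a real parser would preserve is flattened.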

License

Apache License 2.0

See NOTICE for attribution.

nash-dir
