Bread Dataset Viewer
Open Source Dataset Viewer for ML Engineers | Stop fighting with crashing editors and unreadable data

Viewing rollouts and training data shouldn't be this hard. Built because existing tools crash, freeze, or force you to load entire datasets into memory. Open 100GB+ JSONL, Parquet, and CSV files instantly with a clean, understandable interface that actually works.
Install from Marketplace | View on GitHub | Report Issue

Why Developers Choose Bread Dataset Viewer
⚡ Actually Works
Open 100GB+ files in under 1 second. No crashes, no freezing, no "out of memory" errors. Built because viewing rollouts shouldn't crash your editor.
👁️ Readable Interface
Clean Cards, Table, and Raw views. Collapsible JSON, searchable data, jump-to-line navigation. Finally understand your data without squinting.
🎯 Real Token Counts
16 production tokenizers from GPT, Claude, Llama, Qwen. Not estimates—actual counts for ML training and rollout validation.
What It Does
Viewing rollouts is annoying. Viewing data in general is annoying. You shouldn't have to:
- Wait for VS Code to crash when opening large files
- Load entire datasets into memory just to check a few rows
- Struggle with unreadable JSON or nested data structures
- Use external tools that break your workflow
This extension provides a clean, understandable interface that just works:
- Opens any size file instantly by streaming and lazy-loading data
- Never crashes — handles 100GB+ files by only loading what's visible
- Clear visualization with Cards, Table, and Raw views
- Real tokenizers for accurate ML token counting (16 production models)
Supported formats:
- JSONL (JSON Lines)
- JSON
- CSV/TSV
- Parquet
- Arrow/Feather
Who This Is For
✅ ML Engineers Training LLMs
Problem: Need to verify token counts before training runs
Solution: See exact tokens with your target model's tokenizer
Example: Check if GPT-4o and Claude tokenize your prompts differently
✅ Data Scientists Exploring Datasets
Problem: CSV/Parquet files too large for Excel or Pandas preview
Solution: Instantly browse millions of rows without loading into memory
Example: Explore 50GB customer conversation logs in Cards view
✅ DevOps Debugging Production Logs
Problem: Log files in JSON Lines format are unreadable
Solution: Collapsible JSON trees with line numbers and search
Example: Find specific error patterns in 80GB application logs
Features
🚀 Lazy Loading Architecture
Opens files instantly by streaming data on-demand. Jump to line 10,000,000 without loading rows 1-9,999,999 into memory.
Technical details:
- Streams data in 100-row pages
- Constant memory usage regardless of file size
- Supports random access (jump to any line)
- Compatible with VS Code Remote (SSH, WSL, Containers)
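The idea behind the lazy-loading approach above can be sketched as a sparse byte-offset index over a JSONL file. This is an illustrative sketch, not the extension's actual implementation; the function names, the 100-line index stride, and the demo file are all made up for the example:

```python
import json
import os
import tempfile

def build_line_index(path, every=100):
    """Record the byte offset of every `every`-th line (a sparse index)."""
    index = {0: 0}
    with open(path, "rb") as f:
        line_no = 0
        while f.readline():
            line_no += 1
            if line_no % every == 0:
                index[line_no] = f.tell()
    return index

def read_row(path, index, target, every=100):
    """Seek to the nearest indexed offset, then scan forward to `target`."""
    start = (target // every) * every
    with open(path, "rb") as f:
        f.seek(index[start])
        for _ in range(target - start):
            f.readline()
        return json.loads(f.readline())

# Demo: write a 1,000-row JSONL file, then random-access row 737
# without reading rows 0-736 into memory as parsed objects.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    for i in range(1000):
        tmp.write(json.dumps({"row": i}) + "\n")
index = build_line_index(tmp.name)
row_737 = read_row(tmp.name, index, 737)
print(row_737)  # {'row': 737}
os.remove(tmp.name)
```

With an index like this, jumping to any line costs one seek plus at most `every - 1` line reads, regardless of file size.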
Format support with maximum tested sizes:
| Format | Extension | Max Tested Size | Notes |
|---|---|---|---|
| JSONL | .jsonl | 120 GB | Line-oriented JSON (one object per line) |
| JSON | .json | 500 MB | Standard JSON arrays (full file parsed) |
| CSV/TSV | .csv, .tsv | 80 GB | Comma/tab-separated with header detection |
| Parquet | .parquet | 150 GB | Columnar format with schema inference |
| Arrow/Feather | .arrow, .feather | 100 GB | Apache Arrow IPC format |
Handling massive datasets:

Open a 100GB+ file with 50 million rows. Jump to any line instantly, search through data, and navigate without loading the entire file into memory. Memory stays under 300MB regardless of file size.
🎯 Production-Grade Token Counting
Real tokenizers from HuggingFace—not regex approximations. See the exact token count your LLM will use.
16 Bundled Tokenizers (No internet required):
Qwen Family (2 tokenizers)
- Qwen 3.x - Latest generation with reasoning support (151k vocab)
- Qwen 2.5 - Legacy version (152k vocab)
DeepSeek Family (1 tokenizer)
- DeepSeek V3 / R1 - Shared tokenizer for V3 and R1 (128k vocab)
Llama Family (1 tokenizer)
- Llama 3.x - Covers 3.1, 3.2, 3.3 (128k Tiktoken)
Gemma Family (2 tokenizers)
- Gemma 3.x - Multimodal-ready (262k vocab)
- Gemma 2.x - Standard SentencePiece (256k vocab)
Mistral Family (3 tokenizers)
- Mistral Tekken - For NeMo 12B, Pixtral, Ministral (131k vocab)
- Mistral V3 - For Large 2, Codestral (32,768 vocab)
- Mistral V1 - Legacy Llama 2 compatible (32k vocab)
Phi Family (1 tokenizer)
- Phi 4.x - Microsoft's latest (100k vocab)
Command R Family (1 tokenizer)
- Command R - For R7B, R, R+ (256k Cohere vocab)
GPT Family (4 tokenizers)
- GPT-5.x / gpt-oss - o200k_harmony with Thinking tokens
- GPT-4o Family - For GPT-4o, o1-preview, o1-mini
- GPT-4 Classic - cl100k_base for GPT-4 Turbo, GPT-3.5
- GPT-2 - Historical baseline (50k BPE)
Claude Family (1 tokenizer)
- Claude 3.x / 4.x - Anthropic proxy (~100k vocab)
Tokenization modes:
Auto: Detects chat format and applies template, or falls back to full JSON
Chat: Apply model's chat template to messages array
Full JSON: Count tokens in the entire JSON object
Key: <name>: Count tokens in a specific field
Raw Text: Count tokens as plain text
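The mode selection above can be sketched as a small dispatcher. This is a hypothetical sketch of the idea, with a stub whitespace tokenizer standing in for a real BPE model and made-up function names; the actual extension applies real model tokenizers and chat templates:

```python
import json

def count_tokens(text, tokenizer=lambda s: s.split()):
    """Stub tokenizer: whitespace split stands in for a real BPE model."""
    return len(tokenizer(text))

def tokens_for_row(row_json, mode="auto", key=None):
    """Pick which text to count, mirroring the documented tokenization modes."""
    row = json.loads(row_json)
    if mode == "auto":
        # Chat-formatted rows get the chat treatment; everything else is full JSON.
        mode = "chat" if isinstance(row, dict) and "messages" in row else "full_json"
    if mode == "chat":
        # A real implementation would apply the model's chat template here.
        text = "\n".join(m["content"] for m in row["messages"])
    elif mode == "full_json":
        text = json.dumps(row)
    elif mode == "key":
        text = str(row[key])
    else:  # raw text
        text = row_json
    return count_tokens(text)

row = json.dumps({"messages": [{"role": "user", "content": "hello world"}]})
print(tokens_for_row(row))              # 2: auto detects the messages array
print(tokens_for_row(row, "full_json")) # 6: counts the serialized JSON
```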
What you get:
- Per-row token counts displayed inline
- Average, max, and total token statistics
- Real-time updates when switching tokenizers
- Chat template support for multi-turn conversations
See it in action:

Switch between any of the 16 production tokenizers and see token counts update instantly. Compare how different models tokenize your prompts and training data.
🎨 Three View Modes for Different Workflows
Switch between views instantly with toolbar buttons.
Cards View (Default)
Best for: JSONL chat completions, nested JSON structures
Features: Collapsible JSON trees, expandable arrays, token counts per line
Table View
Best for: CSV/TSV data, columnar exploration
Features: Spreadsheet layout, sortable columns, quick scanning
Raw View
Best for: Exact text representation, line-by-line analysis
Features: Line numbers, syntax highlighting, search highlighting
🔍 Search and Navigation
Search features:
- Full-text search across all rows
- Real-time highlighting
- Case-sensitive toggle
- JSON path filtering
Navigation:
- Jump to specific line number
- Page through results (Next/Previous)
- Load more rows on scroll
Performance:
- Search operates on loaded rows only
- Instant highlighting with no lag
- Maintains scroll position
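Searching only loaded rows is what keeps highlighting instant; a minimal sketch of that behavior (hypothetical function name, not the extension's code):

```python
def search_rows(loaded_rows, query, case_sensitive=False):
    """Match only the rows already streamed in, as the viewer does."""
    if not case_sensitive:
        query = query.lower()
    hits = []
    for i, row in enumerate(loaded_rows):
        haystack = row if case_sensitive else row.lower()
        if query in haystack:
            hits.append(i)
    return hits

loaded = ['{"level": "ERROR", "msg": "timeout"}', '{"level": "info", "msg": "ok"}']
print(search_rows(loaded, "error"))        # [0]  (case-insensitive default)
print(search_rows(loaded, "error", True))  # []   (case-sensitive toggle)
```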
Getting Started
Installation
Option 1: VS Code Marketplace (Recommended)
- Open VS Code
- Press Cmd/Ctrl + Shift + X to open Extensions
- Search for "Bread Dataset Viewer"
- Click Install
Option 2: Command Line
code --install-extension bread-tech.bread-dataset-viewer
Option 3: Download VSIX
Download from GitHub Releases
Quick Start (30 seconds)
- Open a file: Click any .jsonl, .csv, .parquet, .arrow, or .json file in Explorer
- Viewer opens automatically: Default view is Cards mode
- Select a tokenizer: Click the "Tokenizer:" dropdown → Choose your model
- See token counts: Numbers appear next to each row
That's it! No configuration needed.
First-Time Tips
- For chat datasets: Use Cards view + Chat mode tokenization
- For CSV data: Switch to Table view (button in toolbar)
- For debugging: Use Raw view to see exact text
- For large files: Jump to line instead of scrolling
Requirements
VS Code 1.85.0 or higher
Advanced Usage
Using with VS Code Remote
Works seamlessly with:
- SSH: Open datasets on remote servers
- WSL: Access Windows files from Linux
- Containers: View data inside Docker containers
The extension runs in the remote environment, so large files never transfer over the network.
Command Palette Commands
Access features via Cmd/Ctrl + Shift + P:
Bread: Open Text Viewer (Fast, All Formats) - Force open in text viewer
Bread: Set Mode - Switch between Cards/Table/Raw
Bread: Set Tokenizer - Change tokenizer
Bread: Search - Open search bar
Bread: Go to Line - Jump to specific line
Bread: Next/Previous Page - Navigate pages
Bread: Open File for Editing - Switch to standard VS Code editor
Right-click any supported file in Explorer → "Open Text Viewer (Fast, All Formats)"
Integrating with Data Pipelines
Validate tokenization workflow:
# Example workflow
1. Generate training data → output.jsonl
2. Open in Bread Dataset Viewer
3. Select target tokenizer (e.g., "Qwen 3.x")
4. Check "Avg Tokens" in status bar
5. Verify against your budget (e.g., <2048 tokens/row)
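The budget check in step 5 can also be scripted in a pipeline. This is a rough sketch with made-up function names, using a ~4-characters-per-token heuristic in place of a real tokenizer; swap in your target model's tokenizer for exact counts:

```python
import json

def token_stats(jsonl_lines, budget=2048, count=lambda s: len(s) // 4):
    """Compute avg/max/total counts and flag rows over the token budget."""
    counts = [count(line) for line in jsonl_lines]
    return {
        "rows": len(counts),
        "avg": sum(counts) / len(counts),
        "max": max(counts),
        "total": sum(counts),
        # Row indices to inspect in the viewer via jump-to-line.
        "over_budget": [i for i, c in enumerate(counts) if c > budget],
    }

lines = [json.dumps({"prompt": "x" * n}) for n in (100, 400, 12000)]
stats = token_stats(lines)
print(stats["max"], stats["over_budget"])  # 3003 [2]
```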
Performance tuning:
- Check file encoding (UTF-8 recommended)
- For Parquet: Ensure columnar compression
- For CSV: Use TSV for faster parsing
Contributing & Support
We welcome contributions! This project is open source under MIT license.
Ways to contribute: report bugs, request features, or open a pull request on GitHub.
Development setup:
git clone https://github.com/Bread-Technologies/mle_vscode_extension.git
cd mle_vscode_extension
npm install
npm run compile
# Press F5 in VS Code to launch Extension Development Host
Need help? Open an issue on GitHub.
Privacy & Analytics
This extension collects anonymous usage analytics to help improve the product. We take your privacy seriously.
What We Collect
- Feature usage (which features you use)
- Performance metrics (parse times, load times)
- Error events (crashes and bugs)
- View mode interactions (Cards, Table, Raw switches)
- Tokenizer selection and mode usage
What We DO NOT Collect
- ❌ File paths, names, or contents
- ❌ Search terms or queries
- ❌ Actual token counts or metric values
- ❌ Training data or dataset contents
- ❌ Hyperparameters or configurations
- ❌ Any personally identifiable information (PII)
How to Opt Out
Option 1: Via Settings UI
- Open Settings (Cmd+, or Ctrl+,)
- Search for "telemetry level"
- Set Telemetry Level to "off"
Option 2: Via settings.json
- Open Command Palette (Cmd+Shift+P / Ctrl+Shift+P)
- Type "Preferences: Open User Settings (JSON)"
- Add:
"telemetry.telemetryLevel": "off"
The extension respects your editor's global telemetry.telemetryLevel setting. Learn more about VS Code telemetry.
Technical Details
- Uses Azure Application Insights for analytics
- GDPR compliant with automatic PII sanitization
- All telemetry code is open source in this repository
- Application Insights key is included in the extension (standard practice for client-side telemetry)
- Rate limiting and security are handled server-side by Azure
License
MIT License - Copyright (c) 2025 Bread Technologies
What this means:
- ✅ Free for commercial use
- ✅ Modify and distribute freely
- ✅ Private use allowed
- ✅ No warranty provided
See LICENSE file for full text.
Bundled Tokenizer Licenses
This extension includes tokenizer files from various HuggingFace models. Each tokenizer retains its original license:
- Apache 2.0: Qwen, DeepSeek, Mistral, Phi, GPT-2
- MIT: Command R (Xenova port)
- Gemma License: Gemma 2.x, 3.x
- Llama License: Llama 3.x (Unsloth mirror)
- Community Proxy: Claude, GPT-4, GPT-4o (Xenova)
See tokenizers/MANIFEST.md for details.
For commercial use: Verify the specific tokenizer license for your target model. Counting tokens locally is low-risk, but note that some model licenses (e.g. Gemma, Llama) cover the bundled tokenizer files as well as the model weights.
Made with ❤️ by Bread Technologies