A VS Code extension + local PyTorch Transformer model that generates code comments from selected source code.
The project combines:
- A TypeScript VS Code extension (`src/`) that detects comment targets and inserts comments in-place
- A from-scratch GPT-style decoder-only Transformer (`model/`) that performs local inference
## Project Goals
- Generate comments that are relevant, specific, and action-oriented.
- Reduce vague/boilerplate outputs using repetition penalty and quality gates.
- Keep inference fully local (no external API dependency).
- Achieve < 100ms inference latency on local CPU/GPU.
- Measure quality with repeatable benchmark metrics.
## Repository Layout

```
├── src/
│   ├── extension.ts       # VS Code command (auto-comment.generate), target detection, comment insertion
│   └── aiProvider.ts      # Invokes the Python inference process and parses JSON telemetry
├── model/
│   ├── model.py           # GPT-style decoder-only Transformer (CausalSelfAttention, pre-norm blocks)
│   ├── dataset.py         # BPE tokenizer, FormattingPipe prompt template, CodeSearchNet data pipeline
│   ├── train_pipeline.py  # Training loop (AdamW, linear warmup + cosine decay, early stopping)
│   └── predict.py         # Inference with greedy/top-k/top-p/beam decoding, KV cache, repetition penalty
├── package.json           # VS Code extension manifest
├── webpack.config.js      # Extension bundler config
└── README.md
```
## Architecture Overview

### Model Architecture
Decoder-only Transformer (GPT-style) — ~12M parameters
| Component | Detail |
| --- | --- |
| CausalSelfAttention | Multi-head attention with a triangular causal mask |
| FeedForward | Linear → GELU → Linear (d_ff = 4 × d_model) |
| TransformerBlock | Pre-norm residual: x + Attn(LN(x)), then x + FFN(LN(x)) |
| Embeddings | Token embedding (weight-tied with the LM head) + sinusoidal positional encoding |
| LM Head | Linear projection to vocab_size |
Default hyperparameters:

```
n_layers   = 6      d_model = 384    n_heads = 6
d_ff       = 1536   max_seq = 512    dropout = 0.1
vocab_size = 8192   (BPE)
```
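Below is a minimal PyTorch sketch of the pre-norm block described in the table. It uses the default hyperparameters above, but substitutes `nn.MultiheadAttention` for illustration; the actual `CausalSelfAttention` in `model/model.py` may be implemented differently.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Linear -> GELU -> Linear with d_ff = 4 * d_model."""
    def __init__(self, d_model: int, d_ff: int, dropout: float):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class TransformerBlock(nn.Module):
    """Pre-norm residual block: x + Attn(LN(x)), then x + FFN(LN(x))."""
    def __init__(self, d_model=384, n_heads=6, d_ff=1536, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff, dropout)

    def forward(self, x):
        # Triangular causal mask: position i may attend only to positions <= i.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        return x + self.ffn(self.ln2(x))
```

The pre-norm placement (LayerNorm before attention and FFN rather than after) tends to stabilize training for deeper stacks.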
### Tokenization
Custom BPE (Byte-Pair Encoding) tokenizer trained on the joint code+comment corpus:
- Handles camelCase splitting (`attendanceScore` → `attend`, `ance`, `Score`)
- Handles snake_case splitting
- Shared vocabulary for both code and natural language (8192 tokens)
- Prompt template: `Code:\n{code}\n\nComment: {comment}<eos>`
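For illustration, here is how the prompt template could be applied; `format_example` and `format_prompt` are hypothetical helper names, and the real logic lives in `dataset.py`'s FormattingPipe.

```python
EOS = "<eos>"  # end-of-sequence marker from the prompt template

def format_example(code: str, comment: str) -> str:
    """Serialize one (code, comment) training pair with the prompt template."""
    return f"Code:\n{code}\n\nComment: {comment}{EOS}"

def format_prompt(code: str) -> str:
    """At inference time the comment is left empty for the model to complete."""
    return f"Code:\n{code}\n\nComment:"

print(format_example("def add(a, b): return a + b",
                     "Add two numbers and return the sum."))
```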
### Extension Flow

1) The user runs the **Auto-Comment Code** command.
2) The extension analyzes the selection/file and finds classes, functions, and control-flow blocks.
3) For each target snippet, the extension calls the Python predictor via `child_process`.
4) The predictor returns JSON containing the `comment` plus telemetry fields (a sketch of the payload follows this list).
5) The extension decides whether to use the model output or the deterministic rule-based fallback.
6) The comment is inserted with a language-aware prefix (`#`, `//`, `<!-- -->`, etc.).
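A sketch of what that JSON payload might look like. Only `comment` and `fallback_reason` are named in this README; the other fields are illustrative assumptions, not the exact schema.

```python
import json

# Hypothetical payload shape; only `comment` and `fallback_reason` are
# documented here, the remaining fields are illustrative.
payload = {
    "comment": "Add two numbers and return the sum.",
    "telemetry": {
        "mode": "top_p",
        "latency_ms": 42,
        "fallback_reason": None,  # e.g. "model_load_failed" when falling back
    },
}
print(json.dumps(payload))
```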
### Decoding Strategies

| Strategy | Description |
| --- | --- |
| Greedy | argmax at each step |
| Top-k | Sample from the k = 50 highest-probability tokens |
| Top-p (nucleus) | Sample from the smallest token set whose cumulative probability ≥ p = 0.95 |
| Beam search | Beam search with length-normalized scoring |
All strategies support:

- Temperature scaling to control randomness
- Logit-based repetition penalty to avoid boilerplate loops
- A KV cache, so past keys/values are reused at each step instead of re-encoding the whole prefix
- Min/max length constraints
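As a concrete illustration, here is a minimal sketch of a single top-p decoding step with temperature and a CTRL-style repetition penalty (the divide/multiply penalty form is an assumption; `predict.py` may apply it differently):

```python
import torch

def sample_top_p(logits: torch.Tensor, generated: list[int],
                 temperature: float = 0.7, top_p: float = 0.95,
                 repetition_penalty: float = 1.2) -> int:
    """One decoding step over a (vocab_size,) logit vector."""
    logits = logits.clone()
    # Penalize tokens that already appeared (CTRL-style assumption).
    for t in set(generated):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty
    probs = torch.softmax(logits / temperature, dim=-1)
    # Keep the smallest set of tokens whose cumulative probability >= top_p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int(torch.searchsorted(cumulative, top_p)) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept, 1)  # sample within the nucleus
    return int(sorted_idx[choice])
```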
### Training Pipeline

| Aspect | Configuration |
| --- | --- |
| Optimizer | AdamW (weight decay 0.01, β = (0.9, 0.95)) |
| LR schedule | Linear warmup (2000 steps) + cosine decay |
| Loss | Cross-entropy with ignore_index for padding |
| Data | CodeSearchNet (Python, Java, JavaScript, Go) |
| Metrics | Loss, perplexity, token F1, BLEU-4, gradient norm |
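A sketch of that optimizer/schedule combination. The base LR (`3e-4`), `total_steps`, and the pad token id are illustrative placeholders, not values taken from `train_pipeline.py`:

```python
import math
import torch

def lr_lambda(step: int, warmup: int = 2000, total_steps: int = 100_000) -> float:
    """Linear warmup to the base LR, then cosine decay to zero."""
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

model = torch.nn.Linear(384, 8192)  # stand-in for the Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=0)  # 0 = assumed pad id
```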
## Prerequisites
- Windows/macOS/Linux
- Python 3.11+ with PyTorch for local inference
- Node.js 18+
- VS Code 1.80+
## Setup

1) Create and activate a virtual environment:

```bash
python -m venv venv11
# Windows:
venv11\Scripts\activate
# Linux/macOS:
source venv11/bin/activate
```
2) Install Python dependencies:

```bash
pip install torch datasets tqdm tabulate matplotlib
```

For GPU training, install PyTorch with CUDA support:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121
```
3) Install Node dependencies:

```bash
npm install
```

4) Build the extension:

```bash
npm run compile
```
## Training

```bash
python model/train_pipeline.py
```

Outputs:

- `model/checkpoints/checkpoint.pt`: best model weights
- `model/bpe_vocab.json`: trained BPE tokenizer vocabulary
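A generic sketch of consuming these artifacts; the checkpoint's internal keys are not documented here, so this shows only loading, not model reconstruction (which lives in `model/predict.py`):

```python
import json
import torch

# Load the best checkpoint and tokenizer vocabulary produced by training.
state = torch.load("model/checkpoints/checkpoint.pt", map_location="cpu")
with open("model/bpe_vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)
print(type(state), len(vocab))
```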
## Inference

Plain-text output:

```bash
python model/predict.py "def add(a, b): return a + b"
```

JSON output with telemetry:

```bash
python model/predict.py --b64 "<base64-code>" --json
```

Decoding controls:

```bash
python model/predict.py "..." --mode top_p --top-p 0.95 --temperature 0.7 --repetition-penalty 1.2
python model/predict.py "..." --mode beam --beam-width 5 --min-len 6 --max-len 80
python model/predict.py "..." --mode top_k --top-k 50
python model/predict.py "..." --mode greedy
```
## VS Code Usage

- Install the extension from a packaged VSIX or the VS Code Marketplace.
- Ensure the configured Python executable can import `torch`.
- If needed, set `autoComment.pythonPath` to the Python executable that has PyTorch installed.
- Open a code file and optionally select code.
- Run the **Auto-Comment Code** command.
- Generated comments are inserted above the detected targets.
## Marketplace Packaging Notes

The extension is packaged from the compiled `dist/extension.js` bundle plus the local inference assets under `model/`. Training-only artifacts, telemetry logs, experiment manifests, source maps, virtual environments, and `.env` files are excluded via `.vscodeignore`. Before publishing, confirm that the `publisher` field in `package.json` matches the Azure DevOps Marketplace publisher ID that owns the extension.
## Fallback Behavior

If the model checkpoint is missing, corrupted, or produces low-quality output:

- Inference does not crash.
- The predictor falls back to the deterministic rule-based comment generator, so generation stays fully local.
- Telemetry records the cause, e.g. `fallback_reason: model_load_failed`.
- The extension continues to work end-to-end through these fallback mechanisms.

The fallback path is also used when the local model emits empty, repetitive, or jargon-heavy text. It is intentionally pattern-aware for common return expressions, filters, predicates, loops, storage, API, and UI code, so comments remain specific rather than generic.
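A sketch of the kind of quality gate that could route output to the fallback; the heuristics and thresholds below are illustrative assumptions, not the predictor's actual rules:

```python
def needs_fallback(comment: str) -> bool:
    """Illustrative quality gate: empty, repetitive, or jargon-heavy text."""
    words = comment.split()
    if not words:
        return True  # empty output
    # Repetitive: too few distinct words relative to length.
    if len(words) >= 6 and len(set(w.lower() for w in words)) / len(words) < 0.5:
        return True
    # Jargon-heavy: mostly non-alphabetic tokens (illustrative threshold).
    non_alpha = sum(1 for w in words if not w.isalpha())
    return non_alpha / len(words) > 0.6

print(needs_fallback(""))                                     # True
print(needs_fallback("the the the the the the"))              # True
print(needs_fallback("Add two numbers and return the sum."))  # False
```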
## NPM Scripts

| Script | Description |
| --- | --- |
| `npm run compile` | Build the extension bundle |
| `npm run watch` | Incremental webpack build |
| `npm run package` | Production bundle for publishing |
| `npm run lint` | Lint TypeScript sources |
| `npm test` | Run the extension test entrypoint |
## Version

- Extension: `0.0.1`
- Command ID: `auto-comment.generate`
- Model: decoder-only Transformer v3
## License
MIT License