
Building SmallGPT: A Recreation of GPT-2
A GPT-style transformer language model built from scratch in PyTorch, trained on OpenWebText with GPT-2 tokenization. The project demonstrates the full LLM pipeline including tokenization, transformer architecture, training optimization, and autoregressive text generation.
Most developers interact with Large Language Models through APIs. Few actually build one from the ground up. But understanding how a GPT-style model works internally — from tokenization to training loops — reveals the mechanics behind modern AI systems.
This project implements SmallGPT, a GPT-style decoder-only transformer language model trained from scratch on OpenWebText using PyTorch. The goal is not to compete with billion-parameter models, but to demonstrate how a complete LLM pipeline works: data ingestion, tokenization, transformer architecture, training optimization, and autoregressive text generation.
This article breaks down the entire project and explains how each component works.
Project Overview
SmallGPT is a miniature implementation of a GPT-style transformer language model.
Key characteristics:
- Architecture: Decoder-only Transformer
- Tokenizer: GPT-2 Byte Pair Encoding (BPE)
- Dataset: OpenWebText
- Framework: PyTorch
- Parameters: ~10M–26M depending on configuration
- Training Objective: Next-token prediction
The pipeline follows the standard LLM workflow:
Raw Text
↓
Tokenization
↓
Dataset Processing
↓
Transformer Model
↓
Training
↓
Text Generation
The system learns by predicting the next token in a sequence, enabling it to generate coherent text.
Dataset: OpenWebText
The model is trained on OpenWebText, an open-source replication of the dataset originally used to train GPT-2.
OpenWebText contains web page content scraped from high-quality sources, making it suitable for training general-purpose language models.
Data Processing Steps
- Stream documents from the dataset.
- Tokenize each document.
- Concatenate tokens into a long sequence.
- Split the sequence into fixed-size blocks.
Example:
Input tokens:
[The, future, of, artificial, intelligence]
Target tokens:
[future, of, artificial, intelligence, is]
The model learns the mapping:
current token sequence → next token
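The processing steps and the shifted-by-one mapping above can be sketched in a few lines of Python (small integers stand in for real GPT-2 token IDs, and a tiny block size keeps the example readable):

```python
# Sketch of the dataset pipeline: concatenate token IDs into one stream,
# then cut fixed-size blocks where targets are inputs shifted left by one.
block_size = 4  # the real project uses 256

stream = list(range(12))  # stands in for ~22.6M concatenated token IDs

samples = []
for i in range(0, len(stream) - block_size, block_size):
    x = stream[i : i + block_size]          # input tokens
    y = stream[i + 1 : i + 1 + block_size]  # targets: shifted by one
    samples.append((x, y))

print(samples[0])  # ([0, 1, 2, 3], [1, 2, 3, 4])
```

Each position in a block therefore contributes one next-token prediction, so a single 256-token sample yields 256 training signals.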
Dataset Statistics
- Documents processed: ~20,000
- Total tokens: ~22.6 million
- Unique tokens: ~49,607
- Training samples: ~88,572
Each training sample consists of:
- 256 input tokens
- 256 target tokens
Tokenization: Byte Pair Encoding (BPE)
Natural language must be converted into numerical tokens before it can be processed by neural networks.
This project uses GPT-2’s Byte Pair Encoding tokenizer via tiktoken.
Vocabulary Size
50,257 tokens
BPE works by breaking text into frequent subword units, allowing the model to represent rare words efficiently.
Example:
"unbelievable"
↓
["un", "believ", "able"]
Advantages:
- Handles rare words
- Keeps vocabulary size manageable
- Improves generalization
Model Architecture
SmallGPT implements a decoder-only transformer, the same architecture used in GPT-2, GPT-3, and ChatGPT.
Model Configuration
| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention Heads | 6 |
| Embedding Size | 384 |
| Context Length | 256 tokens |
| Vocabulary Size | 50,257 |
| Dropout | 0.1 |
Total parameters:
~10.7 million
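The ~10.7M figure can be checked with back-of-the-envelope arithmetic, assuming it counts only the transformer-block parameters plus the final LayerNorm (a common convention when the large token-embedding matrix is weight-tied with the output layer; with embeddings included the total is closer to 30M):

```python
# Back-of-the-envelope parameter count for the configuration above,
# counting only transformer-block weights (with biases) plus the final
# LayerNorm -- an assumed convention, since the exact counting isn't stated.
n_layer, n_embd = 6, 384
d_ff = 4 * n_embd  # 1536, the 4x MLP expansion

attn = 3 * n_embd * n_embd + 3 * n_embd      # fused Q, K, V projection
attn += n_embd * n_embd + n_embd             # attention output projection
mlp = n_embd * d_ff + d_ff                   # expand to 4x
mlp += d_ff * n_embd + n_embd                # project back
ln = 2 * (2 * n_embd)                        # two LayerNorms (scale + shift)

per_block = attn + mlp + ln
total = n_layer * per_block + 2 * n_embd     # + final LayerNorm

print(per_block, total)  # 1774464 10647552  (~10.6M)
```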
The architecture consists of the following components:
Token Embedding + Position Embedding
↓
Transformer Block × N
↓
LayerNorm
↓
Linear Output Layer
Token and Position Embeddings
Tokens are first converted into dense vectors.
Two embeddings are used:
Token Embeddings
Maps each token ID to a vector representation.
token_id → embedding vector
Positional Embeddings
Transformers do not inherently understand word order.
Positional embeddings encode the position of tokens in the sequence, enabling the model to understand sentence structure.
Example:
Token: "cat", Position: 3
Both embeddings are summed before entering the transformer blocks.
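A minimal sketch of the two embedding tables and their sum, using the dimensions from the configuration above:

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 50257, 256, 384

tok_emb = nn.Embedding(vocab_size, n_embd)  # token ID -> vector
pos_emb = nn.Embedding(block_size, n_embd)  # position -> vector

idx = torch.randint(0, vocab_size, (2, block_size))  # (batch, seq) of IDs
pos = torch.arange(block_size)                       # positions 0..255

x = tok_emb(idx) + pos_emb(pos)  # pos_emb broadcasts over the batch
print(x.shape)  # torch.Size([2, 256, 384])
```

The result is one vector per token that encodes both identity and position, ready for the first transformer block.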
Transformer Blocks
Each transformer block contains two main components:
- Causal Self-Attention
- Feedforward Network (MLP)
Residual connections and layer normalization stabilize training.
Causal Self-Attention
Self-attention allows the model to determine which previous tokens are relevant when predicting the next token.
Example sentence:
The cat sat on the
To predict the next word, the model attends to previous tokens.
However, it cannot look ahead. This is enforced using a causal mask.
Mathematically:
Attention(Q,K,V) = softmax(QKᵀ / √d) V
Where:
- Q: Query
- K: Key
- V: Value
Multiple attention heads operate in parallel.
Advantages of multi-head attention:
- Capture multiple relationships simultaneously
- Improve contextual understanding
The implementation uses:
torch.nn.functional.scaled_dot_product_attention
which automatically enables Flash Attention kernels when supported by the hardware.
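A sketch of multi-head causal attention built around that call, with the head splitting shown explicitly (batch and sequence sizes here are arbitrary; the head and embedding dimensions follow the configuration above):

```python
import torch
import torch.nn.functional as F

B, T, n_head, n_embd = 2, 8, 6, 384
head_dim = n_embd // n_head  # 64

x = torch.randn(B, T, n_embd)
qkv = torch.nn.Linear(n_embd, 3 * n_embd)(x)  # fused Q, K, V projection
q, k, v = qkv.split(n_embd, dim=-1)

# Reshape to (B, n_head, T, head_dim) so each head attends independently.
q = q.view(B, T, n_head, head_dim).transpose(1, 2)
k = k.view(B, T, n_head, head_dim).transpose(1, 2)
v = v.view(B, T, n_head, head_dim).transpose(1, 2)

# is_causal=True applies the lower-triangular mask: no attending to the future.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out = out.transpose(1, 2).contiguous().view(B, T, n_embd)
print(out.shape)  # torch.Size([2, 8, 384])
```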
Feedforward Network (MLP)
After attention, each token passes through a position-wise feedforward network.
Structure:
Linear → GELU → Linear
The hidden dimension expands by 4× before projecting back to the original embedding size.
This allows the model to learn richer transformations of the token representations.
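The structure above maps directly onto a small PyTorch module (a sketch with the 4× expansion from the configuration; the real implementation also includes dropout):

```python
import torch
import torch.nn as nn

n_embd = 384

# Position-wise feedforward: expand 4x, apply GELU, project back.
mlp = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),  # 384 -> 1536
    nn.GELU(),
    nn.Linear(4 * n_embd, n_embd),  # 1536 -> 384
)

x = torch.randn(2, 256, n_embd)
y = mlp(x)  # applied independently at every token position
print(y.shape)  # torch.Size([2, 256, 384])
```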
Training Setup
Training is performed using next-token prediction with cross-entropy loss.
Optimizer
AdamW
Configuration:
| Parameter | Value |
|---|---|
| Learning Rate | 3e-4 |
| Weight Decay | 0.1 |
| β1 | 0.9 |
| β2 | 0.95 |
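The table above maps directly onto `torch.optim.AdamW` (the model here is a stand-in for the full transformer):

```python
import torch

model = torch.nn.Linear(384, 384)  # stand-in for the full model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),     # (β1, β2) from the table
    weight_decay=0.1,
)
print(optimizer.defaults["lr"], optimizer.defaults["betas"])
```

AdamW applies weight decay directly to the weights rather than through the gradient, which is the standard choice for transformer training.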
Training Optimizations
Several techniques are used to improve efficiency.
Gradient Accumulation
Allows the model to simulate larger batch sizes without exceeding GPU memory.
effective_batch = batch_size × grad_accum_steps
Example:
8 × 4 = 32 samples per update
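A sketch of the accumulation loop using the 8 × 4 example above (the model, data, and loss are stand-ins):

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch_size, grad_accum_steps = 8, 4  # effective batch = 32
updates = 0

for step in range(grad_accum_steps * 2):  # enough micro-steps for 2 updates
    x = torch.randn(batch_size, 4)
    loss = model(x).pow(2).mean()

    # Scale so accumulated gradients average over the effective batch.
    (loss / grad_accum_steps).backward()

    if (step + 1) % grad_accum_steps == 0:  # every 4th micro-batch...
        optimizer.step()                     # ...apply one optimizer update
        optimizer.zero_grad()
        updates += 1

print(updates)  # 2
```

Only one micro-batch of activations lives in memory at a time, yet each update sees gradients from all 32 samples.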
Mixed Precision Training
The model uses bfloat16 or float16 precision via PyTorch AMP.
Benefits:
- Reduced memory usage
- Faster computation on GPUs
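A minimal autocast sketch, shown on CPU with bfloat16 so it runs anywhere (on GPU the `device_type` would be `"cuda"`, and a gradient scaler is typically added for float16):

```python
import torch

a = torch.randn(64, 64)  # float32 inputs
b = torch.randn(64, 64)

# Inside autocast, eligible ops such as matmul run in reduced precision.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    c = a @ b

print(c.dtype)  # torch.bfloat16
```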
Learning Rate Schedule
Training uses a cosine decay schedule with warm-up.
Training phases:
Warm-up → Stable Training → Gradual Decay
Warm-up prevents unstable gradients early in training.
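The schedule can be written as a small function of the step count (the minimum learning rate and warm-up length here are illustrative assumptions; the peak rate and total steps follow the figures in this article):

```python
import math

max_lr, min_lr = 3e-4, 3e-5       # min_lr is an assumed value
warmup_steps, max_steps = 100, 5000  # warmup_steps is an assumed value

def get_lr(step):
    # Phase 1: linear warm-up from ~0 to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Phases 2-3: cosine decay from max_lr down to min_lr.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)

print(get_lr(99), get_lr(2500), get_lr(4999))
```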
Training Performance
Training ran for:
5000 iterations
Loss progression:
| Stage | Loss |
|---|---|
| Initial | ~10.67 |
| Final | ~5.12 |
Perplexity decreased significantly during training:
Final Perplexity ≈ 169
Perplexity measures how well the model predicts text.
Lower values indicate better language modeling.
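The relationship can be checked directly: perplexity is the exponential of the cross-entropy loss, so the loss figures above imply the perplexities reported here (up to rounding of the quoted losses):

```python
import math

# Perplexity = exp(cross-entropy loss).
initial_loss, final_loss = 10.67, 5.12

print(math.exp(initial_loss))  # ~43,000: essentially random over a 50k vocab
print(math.exp(final_loss))    # ~167, consistent with the reported ~169
```

Intuitively, a perplexity of 169 means the model is, on average, as uncertain as if it were choosing uniformly among 169 tokens instead of 50,257.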
Text Generation (Inference)
After training, the model can generate text using autoregressive sampling.
Process:
- Provide a prompt.
- Predict next token.
- Append token to sequence.
- Repeat.
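The loop above in sketch form, with a random stand-in for the trained model and hypothetical prompt token IDs:

```python
import torch

vocab_size, block_size = 50257, 256

def dummy_model(idx):
    # Stand-in for the trained transformer: random logits per position.
    return torch.randn(idx.shape[0], idx.shape[1], vocab_size)

idx = torch.tensor([[464, 2003, 286]])  # hypothetical prompt token IDs

for _ in range(5):                       # generate 5 new tokens
    idx_cond = idx[:, -block_size:]      # crop to the context window
    logits = dummy_model(idx_cond)[:, -1, :]           # last position only
    probs = torch.softmax(logits, dim=-1)
    next_id = torch.multinomial(probs, num_samples=1)  # sample one token
    idx = torch.cat([idx, next_id], dim=1)             # append and repeat

print(idx.shape)  # torch.Size([1, 8])
```

Each iteration feeds the growing sequence back in, which is why generation cost grows with output length.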
Generation controls include:
Temperature
Controls randomness.
| Value | Effect |
|---|---|
| <1 | More deterministic |
| 1 | Balanced |
| >1 | More creative |
Top-K Sampling
Limits predictions to the K most probable tokens.
Example:
top_k = 50
Only the 50 highest probability tokens are considered.
This prevents low-probability outputs from appearing.
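Both controls can be sketched as a few tensor operations on the model's output logits (random logits stand in for a real forward pass):

```python
import torch

logits = torch.randn(1, 50257)  # stand-in for the model's output logits
temperature, top_k = 0.8, 50

# Temperature: divide logits before softmax; <1 sharpens, >1 flattens.
logits = logits / temperature

# Top-K: mask everything below the K-th largest logit.
v, _ = torch.topk(logits, top_k)
logits[logits < v[:, [-1]]] = -float("inf")

probs = torch.softmax(logits, dim=-1)  # masked tokens get probability 0
next_id = torch.multinomial(probs, num_samples=1)

print((probs > 0).sum().item())  # at most 50 tokens remain in play
```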
Example Generation
Prompt:
The future of artificial intelligence is
The trained model generates a paragraph continuing the idea.
While not as coherent as large-scale models, it demonstrates that the system has learned basic language structure and context prediction.
Why This Project Matters
SmallGPT demonstrates how modern language models actually work under the hood.
Instead of relying on high-level libraries, the implementation manually builds:
- Tokenization pipeline
- Dataset streaming
- Transformer architecture
- Attention mechanisms
- Training loop
- Sampling-based inference
This provides a complete view of how GPT-style models function.
Conclusion
Building a transformer language model from scratch reveals the core mechanics behind today’s most powerful AI systems.
SmallGPT shows that with the right architecture and training pipeline, even a relatively small model trained on commodity hardware can learn meaningful patterns in natural language.
Understanding these systems at a low level is essential for engineers working on:
- LLM training
- model optimization
- AI infrastructure
- generative AI applications
Projects like this bridge the gap between using AI models and truly understanding how they work.