
Building SmallGPT: A Recreation of GPT-2
A GPT-style transformer language model built from scratch in PyTorch, trained on OpenWebText with GPT-2 tokenization. The project demonstrates the full LLM pipeline including tokenization, transformer architecture, training optimization, and autoregressive text generation.
Most developers interact with Large Language Models through APIs. Few actually build one from the ground up. But understanding how a GPT-style model works internally — from tokenization to training loops — reveals the mechanics behind modern AI systems.
This project implements SmallGPT, a GPT-style decoder-only transformer language model trained from scratch on OpenWebText using PyTorch. The goal is not to compete with billion-parameter models, but to demonstrate how a complete LLM pipeline works: data ingestion, tokenization, transformer architecture, training optimization, and autoregressive text generation.
This article breaks down the entire project and explains how each component works.
Project Overview
SmallGPT is a miniature implementation of a GPT-style transformer language model.
Key characteristics:
- Architecture: Decoder-only Transformer
- Tokenizer: GPT-2 Byte Pair Encoding (BPE)
- Dataset: OpenWebText
- Framework: PyTorch
- Parameters: ~10M–26M depending on configuration
- Training Objective: Next-token prediction
The pipeline follows the standard LLM workflow:
Raw Text
↓
Tokenization
↓
Dataset Processing
↓
Transformer Model
↓
Training
↓
Text Generation
The system learns by predicting the next token in a sequence, enabling it to generate coherent text.
Dataset: OpenWebText
The model is trained on OpenWebText, an open-source replication of the dataset originally used to train GPT-2.
OpenWebText contains web page content scraped from high-quality sources, making it suitable for training general-purpose language models.
Data Processing Steps
- Stream documents from the dataset.
- Tokenize each document.
- Concatenate tokens into a long sequence.
- Split the sequence into fixed-size blocks.
Example:
Input tokens:
[The, future, of, artificial, intelligence]
Target tokens:
[future, of, artificial, intelligence, is]
The model learns the mapping:
current token sequence → next token
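The processing steps and the shifted-by-one mapping above can be sketched in a few lines of Python (small integers stand in for real GPT-2 token IDs, and a tiny block size keeps the example readable):

```python
# Sketch of the dataset pipeline: concatenate token IDs into one stream,
# then cut fixed-size blocks where targets are inputs shifted left by one.
block_size = 4  # the real project uses 256

stream = list(range(12))  # stands in for ~22.6M concatenated token IDs

samples = []
for i in range(0, len(stream) - block_size, block_size):
    x = stream[i : i + block_size]          # input tokens
    y = stream[i + 1 : i + 1 + block_size]  # targets: shifted by one
    samples.append((x, y))

print(samples[0])  # ([0, 1, 2, 3], [1, 2, 3, 4])
```

Each position in a block therefore contributes one next-token prediction, so a single 256-token sample yields 256 training signals.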
Dataset Statistics
- Documents processed: ~20,000
- Total tokens: ~22.6 million
- Unique tokens: ~49,607
- Training samples: ~88,572
Each training sample consists of:
- 256 input tokens
- 256 target tokens
Tokenization: Byte Pair Encoding (BPE)
Natural language must be converted into numerical tokens before it can be processed by neural networks.
This project uses GPT-2’s Byte Pair Encoding tokenizer via tiktoken.
Vocabulary Size
50,257 tokens
BPE works by breaking text into frequent subword units, allowing the model to represent rare words efficiently.
Example:
"unbelievable"
↓
["un", "believ", "able"]
Advantages:
- Handles rare words
- Keeps vocabulary size manageable
- Improves generalization
Model Architecture
SmallGPT implements a decoder-only transformer, the same architecture used in GPT-2, GPT-3, and ChatGPT.
Model Configuration
| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention Heads | 6 |
| Embedding Size | 384 |
| Context Length | 256 tokens |
| Vocabulary Size | 50,257 |
| Dropout | 0.1 |
Total parameters:
~10.7 million
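The ~10.7M figure can be checked with back-of-the-envelope arithmetic, assuming it counts only the transformer-block parameters plus the final LayerNorm (a common convention when the large token-embedding matrix is weight-tied with the output layer; with embeddings included the total is closer to 30M):

```python
# Back-of-the-envelope parameter count for the configuration above,
# counting only transformer-block weights (with biases) plus the final
# LayerNorm -- an assumed convention, since the exact counting isn't stated.
n_layer, n_embd = 6, 384
d_ff = 4 * n_embd  # 1536, the 4x MLP expansion

attn = 3 * n_embd * n_embd + 3 * n_embd      # fused Q, K, V projection
attn += n_embd * n_embd + n_embd             # attention output projection
mlp = n_embd * d_ff + d_ff                   # expand to 4x
mlp += d_ff * n_embd + n_embd                # project back
ln = 2 * (2 * n_embd)                        # two LayerNorms (scale + shift)

per_block = attn + mlp + ln
total = n_layer * per_block + 2 * n_embd     # + final LayerNorm

print(per_block, total)  # 1774464 10647552  (~10.6M)
```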
The architecture consists of the following components:
Token Embedding + Position Embedding
↓
Transformer Block × N
↓
LayerNorm
↓
Linear Output Layer
Token and Position Embeddings
Tokens are first converted into dense vectors.
Two embeddings are used:
Token Embeddings
Maps each token ID to a vector representation.
token_id → embedding vector
Positional Embeddings
Transformers do not inherently understand word order.
Positional embeddings encode the position of tokens in the sequence, enabling the model to understand sentence structure.
Example:
Token: "cat", Position: 3
Both embeddings are summed before entering the transformer blocks.
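A minimal sketch of the two embedding tables and their sum, using the dimensions from the configuration above:

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 50257, 256, 384

tok_emb = nn.Embedding(vocab_size, n_embd)  # token ID -> vector
pos_emb = nn.Embedding(block_size, n_embd)  # position -> vector

idx = torch.randint(0, vocab_size, (2, block_size))  # (batch, seq) of IDs
pos = torch.arange(block_size)                       # positions 0..255

x = tok_emb(idx) + pos_emb(pos)  # pos_emb broadcasts over the batch
print(x.shape)  # torch.Size([2, 256, 384])
```

The result is one vector per token that encodes both identity and position, ready for the first transformer block.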
Transformer Blocks
Each transformer block contains two main components:
- Causal Self-Attention
- Feedforward Network (MLP)
Residual connections and layer normalization stabilize training.
Causal Self-Attention
Self-attention allows the model to determine which previous tokens are relevant when predicting the next token.
Example sentence:
The cat sat on the
To predict the next word, the model attends to previous tokens.
However, it cannot look ahead. This is enforced using a causal mask.
Mathematically:
Attention(Q,K,V) = softmax(QKᵀ / √d) V
Where:
- Q: Query
- K: Key
- V: Value
Multiple attention heads operate in parallel.
Advantages of multi-head attention:
- Capture multiple relationships simultaneously
- Improve contextual understanding
The implementation uses:
torch.nn.functional.scaled_dot_product_attention
which automatically enables Flash Attention kernels when supported by the hardware.
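A sketch of multi-head causal attention built around that call, with the head splitting shown explicitly (batch and sequence sizes here are arbitrary; the head and embedding dimensions follow the configuration above):

```python
import torch
import torch.nn.functional as F

B, T, n_head, n_embd = 2, 8, 6, 384
head_dim = n_embd // n_head  # 64

x = torch.randn(B, T, n_embd)
qkv = torch.nn.Linear(n_embd, 3 * n_embd)(x)  # fused Q, K, V projection
q, k, v = qkv.split(n_embd, dim=-1)

# Reshape to (B, n_head, T, head_dim) so each head attends independently.
q = q.view(B, T, n_head, head_dim).transpose(1, 2)
k = k.view(B, T, n_head, head_dim).transpose(1, 2)
v = v.view(B, T, n_head, head_dim).transpose(1, 2)

# is_causal=True applies the lower-triangular mask: no attending to the future.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out = out.transpose(1, 2).contiguous().view(B, T, n_embd)
print(out.shape)  # torch.Size([2, 8, 384])
```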
Feedforward Network (MLP)
After attention, each token passes through a position-wise feedforward network.
Structure:
Linear → GELU → Linear
The hidden dimension expands by 4× before projecting back to the original embedding size.
This allows the model to learn richer transformations of the token representations.
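The structure above maps directly onto a small PyTorch module (a sketch with the 4× expansion from the configuration; the real implementation also includes dropout):

```python
import torch
import torch.nn as nn

n_embd = 384

# Position-wise feedforward: expand 4x, apply GELU, project back.
mlp = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),  # 384 -> 1536
    nn.GELU(),
    nn.Linear(4 * n_embd, n_embd),  # 1536 -> 384
)

x = torch.randn(2, 256, n_embd)
y = mlp(x)  # applied independently at every token position
print(y.shape)  # torch.Size([2, 256, 384])
```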
Training Setup
Training is performed using next-token prediction with cross-entropy loss.
Optimizer
AdamW
Configuration:
| Parameter | Value |
|---|---|
| Learning Rate | 3e-4 |
| Weight Decay | 0.1 |
| β1 | 0.9 |
| β2 | 0.95 |
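The table above maps directly onto `torch.optim.AdamW` (the model here is a stand-in for the full transformer):

```python
import torch

model = torch.nn.Linear(384, 384)  # stand-in for the full model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),     # (β1, β2) from the table
    weight_decay=0.1,
)
print(optimizer.defaults["lr"], optimizer.defaults["betas"])
```

AdamW applies weight decay directly to the weights rather than through the gradient, which is the standard choice for transformer training.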
Training Optimizations
Several techniques are used to improve efficiency.
Gradient Accumulation
Allows the model to simulate larger batch sizes without exceeding GPU memory.
effective_batch = batch_size × grad_accum_steps
Example:
8 × 4 = 32 samples per update
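A sketch of the accumulation loop using the 8 × 4 example above (the model, data, and loss are stand-ins):

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch_size, grad_accum_steps = 8, 4  # effective batch = 32
updates = 0

for step in range(grad_accum_steps * 2):  # enough micro-steps for 2 updates
    x = torch.randn(batch_size, 4)
    loss = model(x).pow(2).mean()

    # Scale so accumulated gradients average over the effective batch.
    (loss / grad_accum_steps).backward()

    if (step + 1) % grad_accum_steps == 0:  # every 4th micro-batch...
        optimizer.step()                     # ...apply one optimizer update
        optimizer.zero_grad()
        updates += 1

print(updates)  # 2
```

Only one micro-batch of activations lives in memory at a time, yet each update sees gradients from all 32 samples.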
Mixed Precision Training
The model uses bfloat16 or float16 precision via PyTorch AMP.
Benefits:
- Reduced memory usage
- Faster computation on GPUs
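A minimal autocast sketch, shown on CPU with bfloat16 so it runs anywhere (on GPU the `device_type` would be `"cuda"`, and a gradient scaler is typically added for float16):

```python
import torch

a = torch.randn(64, 64)  # float32 inputs
b = torch.randn(64, 64)

# Inside autocast, eligible ops such as matmul run in reduced precision.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    c = a @ b

print(c.dtype)  # torch.bfloat16
```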
Learning Rate Schedule
Training uses a cosine decay schedule with warm-up.
Training phases:
Warm-up → Stable Training → Gradual Decay
Warm-up prevents unstable gradients early in training.
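The schedule can be written as a small function of the step count (the minimum learning rate and warm-up length here are illustrative assumptions; the peak rate and total steps follow the figures in this article):

```python
import math

max_lr, min_lr = 3e-4, 3e-5       # min_lr is an assumed value
warmup_steps, max_steps = 100, 5000  # warmup_steps is an assumed value

def get_lr(step):
    # Phase 1: linear warm-up from ~0 to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Phases 2-3: cosine decay from max_lr down to min_lr.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)

print(get_lr(99), get_lr(2500), get_lr(4999))
```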
Training Performance
Training ran for:
5000 iterations
Loss progression:
| Stage | Loss |
|---|---|
| Initial | ~10.67 |
| Final | ~5.12 |
Perplexity decreased significantly during training:
Final Perplexity ≈ 169
Perplexity measures how well the model predicts text.
Lower values indicate better language modeling.
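The relationship can be checked directly: perplexity is the exponential of the cross-entropy loss, so the loss figures above imply the perplexities reported here (up to rounding of the quoted losses):

```python
import math

# Perplexity = exp(cross-entropy loss).
initial_loss, final_loss = 10.67, 5.12

print(math.exp(initial_loss))  # ~43,000: essentially random over a 50k vocab
print(math.exp(final_loss))    # ~167, consistent with the reported ~169
```

Intuitively, a perplexity of 169 means the model is, on average, as uncertain as if it were choosing uniformly among 169 tokens instead of 50,257.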
Text Generation (Inference)
After training, the model can generate text using autoregressive sampling.
Process:
- Provide a prompt.
- Predict next token.
- Append token to sequence.
- Repeat.
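The loop above in sketch form, with a random stand-in for the trained model and hypothetical prompt token IDs:

```python
import torch

vocab_size, block_size = 50257, 256

def dummy_model(idx):
    # Stand-in for the trained transformer: random logits per position.
    return torch.randn(idx.shape[0], idx.shape[1], vocab_size)

idx = torch.tensor([[464, 2003, 286]])  # hypothetical prompt token IDs

for _ in range(5):                       # generate 5 new tokens
    idx_cond = idx[:, -block_size:]      # crop to the context window
    logits = dummy_model(idx_cond)[:, -1, :]           # last position only
    probs = torch.softmax(logits, dim=-1)
    next_id = torch.multinomial(probs, num_samples=1)  # sample one token
    idx = torch.cat([idx, next_id], dim=1)             # append and repeat

print(idx.shape)  # torch.Size([1, 8])
```

Each iteration feeds the growing sequence back in, which is why generation cost grows with output length.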
Generation controls include:
Temperature
Controls randomness.
| Value | Effect |
|---|---|
| <1 | More deterministic |
| 1 | Balanced |
| >1 | More creative |
Top-K Sampling
Limits predictions to the K most probable tokens.
Example:
top_k = 50
Only the 50 highest probability tokens are considered.
This prevents low-probability outputs from appearing.
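Both controls can be sketched as a few tensor operations on the model's output logits (random logits stand in for a real forward pass):

```python
import torch

logits = torch.randn(1, 50257)  # stand-in for the model's output logits
temperature, top_k = 0.8, 50

# Temperature: divide logits before softmax; <1 sharpens, >1 flattens.
logits = logits / temperature

# Top-K: mask everything below the K-th largest logit.
v, _ = torch.topk(logits, top_k)
logits[logits < v[:, [-1]]] = -float("inf")

probs = torch.softmax(logits, dim=-1)  # masked tokens get probability 0
next_id = torch.multinomial(probs, num_samples=1)

print((probs > 0).sum().item())  # at most 50 tokens remain in play
```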
Example Generation
Prompt:
The future of artificial intelligence is
The trained model generates a paragraph continuing the idea.
While not as coherent as large-scale models, it demonstrates that the system has learned basic language structure and context prediction.
Why This Project Matters
SmallGPT demonstrates how modern language models actually work under the hood.
Instead of relying on high-level libraries, the implementation manually builds:
- Tokenization pipeline
- Dataset streaming
- Transformer architecture
- Attention mechanisms
- Training loop
- Sampling-based inference
This provides a complete view of how GPT-style models function.
Conclusion
Building a transformer language model from scratch reveals the core mechanics behind today’s most powerful AI systems.
SmallGPT shows that with the right architecture and training pipeline, even a relatively small model trained on commodity hardware can learn meaningful patterns in natural language.
Understanding these systems at a low level is essential for engineers working on:
- LLM training
- model optimization
- AI infrastructure
- generative AI applications
Projects like this bridge the gap between using AI models and truly understanding how they work.