build GPT with me! implementing GPT from scratch step-by-step

Dev Shah
30 min read · Jan 31, 2024


The emergence of ChatGPT, the groundbreaking AI-powered chatbot by OpenAI, has sparked a surge of fascination and interest in the realm of artificial intelligence. The allure of these conversational wonders extends not only to the broader field of AI but also to the intricate class of technologies that underpins them. Large Language Models (LLMs), like ChatGPT and Google Bard, have taken the spotlight, demonstrating their remarkable ability to generate text across an astonishing array of subjects. These advanced chatbots promise to revolutionize various aspects of our lives, from transforming web search to producing boundless creative literature, and even acting as a repository of global knowledge.

But one of the things we take for granted is how these language models actually work. In this article, I’m going to go over how their pieces fit together to generate text. To understand how language models like ChatGPT work, let’s delve into the core mechanics behind their impressive capabilities. At the heart of these models lies a neural network architecture known as the Transformer. Developed by researchers at Google, the Transformer architecture has become a cornerstone in the world of natural language processing.

Before we go any further, this article was heavily inspired by Andrej Karpathy’s tutorial on implementing GPT from scratch.

breaking down language models

Before diving straight into the thick of implementing GPT from scratch, let’s understand what language models are and how they work. In very simple terms, a language model assigns a probability to a sequence of words. In other words, it’s a probability distribution over words / sequences of words: given the words so far, it tells you how likely each candidate next word is. Whether a word is “valid” isn’t based on grammatical correctness; it’s more about whether the next word makes sense given the patterns in the text the model was trained on.

There are 2 main types of language models: probabilistic methods & modern neural network-based language models. Let’s start with the probabilistic ones. Constructing a basic probabilistic language model involves calculating probabilities for n-grams, where an n-gram is a sequence of n words (with n being an integer greater than zero). The probability of an n-gram is the likelihood that its last word follows a specific n-1 gram (the n-gram without its last word). In practice, this is the fraction of occurrences of that n-1 gram in the training text that are followed by the last word. This concept is rooted in the Markov assumption: given the present (the n-1 gram), the future (the next word) is independent of the more distant past (the words before that).

Modern large language models instead use something called a transformer, trained on massive datasets. ChatGPT and many other famous LLMs are built off the transformer architecture, which was proposed back in 2017. Back then, the transformer architecture was mainly used for translation tasks, i.e. language translation. However, its versatility and effectiveness paved the way for its application in a wide range of natural language processing tasks. The transformer model revolutionized the field by introducing a self-attention mechanism, allowing the model to weigh different words in a sequence differently based on their relevance to the current word. Now let’s dive into how transformers work and how all of this comes together to build GPT from scratch!

understanding the transformer architecture.

Transformers were developed by a group of researchers back in 2017, in the paper named ‘Attention is All You Need’. The transformer architecture is heavily influenced by the concept of self attention → this mechanism allows the LLM to consider all the different parts of the text input together. This allows the model to put greater significance on the “more important” parts of the text input; in doing so, the model is able to identify relationships between words and as a result, it will be able to generate a highly accurate output.

transformers visualized.

The general idea of the attention mechanism is to compute a score for each word depending on the task. The model uses these scores to create a weighted representation of the input → this representation is then passed through the feed-forward neural network. This weighted representation generated by the attention mechanism plays a crucial role in enhancing the model’s ability to focus on relevant parts of the input when performing various tasks. By assigning higher scores to certain words or tokens in the input, the attention mechanism effectively prioritizes the information that is most relevant to the task at hand. This selective attention mechanism allows the model to filter out noise and irrelevant details, leading to more accurate and contextually informed predictions.

One of the key advantages of the attention mechanism is its ability to capture long-range dependencies in the input data. Traditional neural network architectures often struggle with maintaining information across distant elements in a sequence, which can be a limitation in tasks involving long sentences or sequences of data. However, the attention mechanism enables the model to look back at any position in the input sequence and weigh its importance according to the current context, providing a solution to this problem. Now that we have a rough understanding of what transformers are + how large language models work, let’s jump into building a smaller version of GPT from scratch 🚀

extracting data + encoding / decoding process.

For the sake of this implementation, we will be building a GPT model that can produce Shakespeare-esque text. To do this, we’re going to need a lot of Shakespeare text. To get it, run the following line in your Jupyter notebook:


This command will download a large text file of Shakespeare’s writing. If you want to check out what the text looks like, click here. Once we have this data, we can’t train a model directly on raw text; we first need a way to encode it into numbers and decode it back. Assuming that you have all the contents of the file stored in a variable called text, we will first do the following:

chars = sorted(list(set(text))) # get all the unique characters in the text, sorted
vocab_size = len(chars) # get the size of it

The first line is extracting unique characters from the given text and then sorting them. The purpose of this is to create a set of distinct characters in the text, and then convert it into a sorted list. This list essentially represents the vocabulary of characters present in the text. The second line calculates the size of the vocabulary, i.e., the total number of unique characters in the text. The variable vocab_size will be used to define the size of the embedding layer in the neural network (this will be touched on later in the article).
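To make this concrete, here is a minimal sketch of the same two lines on a tiny stand-in string. The name `sample_text` is made up for illustration; the real `text` is the full Shakespeare file, which will give a larger vocabulary.

```python
# sample_text is a hypothetical stand-in for the full Shakespeare text
sample_text = "hello world"

chars = sorted(list(set(sample_text)))  # unique characters, in sorted order
vocab_size = len(chars)                 # number of unique characters

print(chars)       # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(vocab_size)  # 8
```

Note that repeated characters (the three l's and two o's) only show up once, and that the space counts as a character too.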

After this is done, we need to map each unique character to an index and each index to each unique character. In order to do this, we can run the following lines of code:

char_to_index = {char: index for index, char in enumerate(chars)}
index_to_char = {index: char for index, char in enumerate(chars)}

These dictionaries are created to map each unique character to an index and vice versa. char_to_index maps a character to its corresponding index, and index_to_char maps an index back to the character. This is crucial for converting characters into numerical representations that can be fed into the neural network. The primary motivation for creating these mappings lies in converting textual data into a numerical format suitable for input to a neural network. Neural networks inherently operate on numerical data, and the character-to-index mapping provides a means to represent textual information in a numerical sequence. This numeric representation is essential for the network to learn and generalize patterns, relationships, and dependencies within the text during the training process.

Moreover, in many natural language processing models, including those based on transformer architectures like GPT, the numeric representations obtained from the character-to-index mapping are often further processed through an embedding layer. This layer learns to represent each character as a dense vector in a continuous space, capturing semantic relationships between characters and enhancing the model’s ability to understand and generalize from the input text more effectively.
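As a rough sketch of what such an embedding layer does, the snippet below looks up a dense vector for each index. The sizes here (`vocab_size = 8`, `embedding_dim = 4`) and the example indices are arbitrary values chosen for illustration, not the ones used later in this article.

```python
import torch
import torch.nn as nn

vocab_size = 8       # assumed toy vocabulary size
embedding_dim = 4    # assumed size of each dense vector

embedding = nn.Embedding(vocab_size, embedding_dim)  # one learnable row per character index

indices = torch.tensor([3, 2, 4, 4, 5], dtype=torch.long)  # e.g. an encoded string
vectors = embedding(indices)                               # look up one row per index

print(vectors.shape)  # torch.Size([5, 4])
# the same index always maps to the same vector
print(torch.equal(vectors[2], vectors[3]))  # True
```

The vectors start out random; they only become meaningful once the layer is trained together with the rest of the network.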

Now that we’ve done this, we can now define 2 functions: an encoding function & a decoding function. Here is the encoding function:

def encode_string(s):
    encoded_list = [char_to_index[char] for char in s]
    return encoded_list

The `encode_string(s)` function is a vital component in the process of preparing textual data for neural network input, particularly within natural language processing contexts. This function takes a string `s` as its input, which represents a segment of text. Leveraging the previously constructed `char_to_index` dictionary, the function systematically replaces each character in the input string with its corresponding numerical index. This conversion results in the creation of a list of integers, where each integer signifies the numeric index of the character in the vocabulary.

This numeric representation serves as a fundamental input for the neural network during training or inference. By transforming the original string into a sequence of integers, the neural network can efficiently process and learn intricate patterns, relationships, and dependencies within the text. For instance, if the input string is “hello” and the `char_to_index` mapping assigns indices {h: 0, e: 1, l: 2, o: 3}, the `encode_string` function would output the list [0, 1, 2, 2, 3] as the numerical representation of “hello.”

This numeric representation of the text is not only a bridge between the original characters and the neural network but also facilitates subsequent layers in the network, such as the embedding layer. The embedding layer further transforms these indices into dense vectors, capturing semantic relationships between characters. This process enhances the model’s capacity to comprehend and generalize from the input text effectively. In summary, the `encode_string(s)` function plays a pivotal role in seamlessly integrating textual data into the numeric domain, enabling the neural network to glean meaningful insights and generate coherent outputs.

Now for the decode string function:

def decode_list(l):
    decoded_string = ''.join([index_to_char[index] for index in l])
    return decoded_string

The `decode_list(l)` function plays a pivotal role in the post-processing phase of neural network output, particularly within the domain of natural language processing. Operating on a list of integers `l`, which represents a sequence of numerical indices potentially generated by the neural network during inference or generation, this function relies on the previously constructed `index_to_char` dictionary. This dictionary associates each numerical index with its corresponding character in the vocabulary.

During the conversion process, for every index in the input list `l`, the function retrieves the associated character from the `index_to_char` dictionary. The result is a reconstructed sequence of characters, forming a string that represents the human-readable output of the neural network. This string essentially serves as a textual reconstruction of the numerical predictions generated by the model.

For example, if the input list is [0, 1, 2, 2, 3], and the `index_to_char` mapping for indices 0, 1, 2, 3 corresponds to characters {0: ‘h’, 1: ‘e’, 2: ‘l’, 3: ‘o’}, the `decode_list` function would output the string “hello” as the reconstructed text.

The `decode_list(l)` function proves particularly valuable when interpreting the output of a neural network, especially in language generation tasks. It facilitates the conversion of the model’s numeric predictions back into human-readable text, making it easier to comprehend, evaluate, and further utilize the generated content. In summary, this function is a critical component in the bidirectional process of converting textual data into numeric representations for neural network input and then translating the network’s output back into human-readable form.
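Here is a small sketch of the round trip using the toy mapping from the “hello” examples above (so the indices are not the ones the real Shakespeare vocabulary would give). It also shows one thing worth knowing: a character that is not in the vocabulary has no index, so encoding it fails.

```python
# toy mappings taken from the "hello" examples above (not the real Shakespeare vocabulary)
char_to_index = {'h': 0, 'e': 1, 'l': 2, 'o': 3}
index_to_char = {index: char for char, index in char_to_index.items()}

def encode_string(s):
    return [char_to_index[char] for char in s]

def decode_list(l):
    return ''.join(index_to_char[index] for index in l)

encoded = encode_string("hello")
print(encoded)               # [0, 1, 2, 2, 3]
print(decode_list(encoded))  # hello

# decoding undoes encoding for any string made of known characters
assert decode_list(encode_string("hole")) == "hole"

# a character outside the vocabulary has no index, so the lookup raises KeyError
try:
    encode_string("help")
except KeyError as missing:
    print("not in vocabulary:", missing)
```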

Now that I’ve broken down what these 2 functions do, you can test them out on any string. For example:

input_string = "hello"
encoded_result = encode_string(input_string)
decoded_result = decode_list(encoded_result)
print(encoded_result)
print(decoded_result)

>>> [46, 43, 50, 50, 53]
>>> hello

What happened above is that the encoded result contains a number that is mapped to each letter / character of the input_string. When we decode it, it reverses the mapping and returns the correct characters. This is crucial for when we build the language model later on! Now, we’re going to build on this idea and understand what ‘context’ means in a language model setting.

understanding context.

To understand how context works, we will be using PyTorch to better understand what’s going on. The first step is to convert all of the text that we have into tensors, like this:

import torch
data = torch.tensor(encode_string(text), dtype=torch.long)

You can print the first 100 elements of data, but you’ll just see a tensor with a bunch of numbers. This basically means that we have successfully represented the textual information as numerical data using a PyTorch tensor. Each number in the tensor is the index of one character, and the order of the numbers preserves the order of the characters in the text. This numerical representation allows us to perform various mathematical operations and apply machine learning techniques to make predictions based on the given text.
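If you want to see what that looks like without loading the whole file, here is a minimal sketch on a short stand-in string (`sample_text` is a hypothetical name; the real tensor is built from `text` and is much longer):

```python
import torch

# sample_text is a hypothetical stand-in for the full Shakespeare text
sample_text = "hello world"
chars = sorted(list(set(sample_text)))
char_to_index = {char: index for index, char in enumerate(chars)}

data = torch.tensor([char_to_index[c] for c in sample_text], dtype=torch.long)

print(data.shape)  # torch.Size([11]): one entry per character
print(data.dtype)  # torch.int64
print(data[:5])    # tensor([3, 2, 4, 4, 5]) -> 'h', 'e', 'l', 'l', 'o'
```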

The next step will be to split the data into a training set and a testing (validation) set. Use the first 90% of the data for training and the remaining 10% for testing. Now to explain what context is: in very simple terms, context is the sequence of previous values that the model uses to predict what will come next. For example, if we have the following sequence:

>>> tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In the above sequence, we have something called “context”. For example, in the context of (18, 47, 56), 57 should come next. And this is only true for this specific context. In the realm of natural language processing, understanding context is a pivotal part of putting this numerical representation to use. When we train a model on our numerical tensor, we are essentially teaching it to recognize patterns within this sequence: it looks at the preceding values and learns to predict the ones that follow.

The model learns these patterns from the training set, while the testing set is held back to check how well it does on text it has not seen. By seeing a diverse range of sequences during training, the model hones its ability to predict what comes next in a given context. We can define the following function to sample a random batch of contexts (x) and their targets (y):

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    # train_data / val_data are the 90% / 10% split of data described above
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,)) # random starting positions, one per sequence in the batch
    x = torch.stack([data[i:i+block_size] for i in ix]) # the contexts: block_size tokens each
    y = torch.stack([data[i+1:i+block_size+1] for i in ix]) # the targets: the same windows shifted one token to the right
    return x, y
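To see the split and the context/target idea on something small, here is a sketch that uses the short example sequence from above in place of the full dataset. The value `block_size = 4` is an assumed toy value, not the one used for training.

```python
import torch

# the example sequence from above stands in for the full data tensor
data = torch.tensor([18, 47, 56, 57, 58, 1, 15, 47, 58])

# 90% / 10% split (with only 9 tokens, the "validation" part is a single token)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

block_size = 4  # assumed toy context length
x = train_data[:block_size]        # inputs
y = train_data[1:block_size + 1]   # targets: the inputs shifted by one position

for t in range(block_size):
    context = x[:t + 1]
    target = y[t]
    print(f"when the context is {context.tolist()}, the target is {target.item()}")
```

A single window of block_size tokens therefore contains block_size training examples, one for every context length from 1 up to block_size.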

Now time to jump into building the language model itself!

bigram language models.

Now let’s define the language model that we’re going to use to train our own GPT model. There are a bunch of language models that are open for use and implementation, but we’re going to be using something called a bigram language model. For some context: A bigram language model is a simple statistical approach to language modeling in natural language processing. It calculates the probability of a word based on the occurrence of its preceding word. Mathematically, it is represented as:

P(w_n | w_(n-1)) = count(w_(n-1), w_n) / count(w_(n-1))

Here w_n is the current word and w_(n-1) is the previous one. The probability of a specific bigram is calculated by dividing the count of that bigram by the count of its preceding word. While bigram models are simple, they capture some local language structure, although they do not handle long-range dependencies as well as more advanced models like trigrams, higher-order n-grams, or neural network-based models. If this is still confusing, consider the following analogy:

The bigram language model is very similar to a game of chess. Just as each move in chess depends on the preceding move, a bigram model calculates the likelihood of a word based on the word that came before it. It’s as if each word on the chessboard has its own set of possible next moves, determined by the strategic arrangement of the words played so far. While a bigram model might not foresee every future move (similar to its limitation in handling long-range dependencies), it captures the immediate tactical considerations, much like a chess player evaluating the current board position.
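The counting definition of a bigram probability can be sketched in a few lines of plain Python. The corpus below is made up purely for illustration, and the helper name `bigram_probability` is hypothetical. One small assumption: the preceding word is counted only where it is actually followed by another word, so the very last word of the corpus is not counted as a context.

```python
from collections import Counter

# a made-up toy corpus, purely for illustration
words = "the cat sat on the mat the cat ran".split()

bigram_counts = Counter(zip(words, words[1:]))  # count of each (previous, current) pair
context_counts = Counter(words[:-1])            # how often each word appears as a "previous" word

def bigram_probability(previous, current):
    # P(current | previous) = count(previous, current) / count(previous)
    if context_counts[previous] == 0:
        return 0.0  # the previous word was never seen as a context
    return bigram_counts[(previous, current)] / context_counts[previous]

print(bigram_probability("the", "cat"))  # 2 of the 3 words that follow "the" are "cat"
print(bigram_probability("the", "mat"))  # 1 of the 3
```

The model we build next is character-level and learns these probabilities with a neural network instead of counting, but the quantity it is trying to estimate is the same.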

Now that we’ve understood what is going on behind the scenes, let’s implement this language model in python:

import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__() # required before registering any layers on an nn.Module
        self.token_embedding = nn.Embedding(vocab_size, vocab_size) # table of size vocab_size x vocab_size

    def forward(self, idx, targets=None):

        logits = self.token_embedding(idx) # (B,T,C) (batch, time, channel) tensor, e.g. (4,8,vocab_size)

        if targets is None:
            loss = None # nothing to compare against, e.g. during generation
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C) # reshapes the logits tensor into a 2D tensor (flatten it)
            targets = targets.view(B*T) # flatten the target tensor
            loss = F.cross_entropy(logits, targets) # calculate the loss between the 2, measures how well the logits match the targets

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # this is used to generate new sequences of tokens
        # also note that idx is (batch size, block size)

        for i in range(max_new_tokens):
            logits, loss = self(idx) # obtain the predictions for the given input sequence (idx)
            logits = logits[:, -1, :] # this becomes (batch size, channels), focus on the last step
            probs = F.softmax(logits, dim=-1) # convert into probabilities
            idx_next = torch.multinomial(probs, num_samples=1) # give us 1 sample (1 prediction)
            idx = torch.cat((idx, idx_next), dim=1) # whatever the prediction is, concatenate it with the current idx and use this to predict the next element

        return idx


It may seem like there’s a lot going on, but it’s not all too complicated. Let’s break down each method. The first goal is to initialize a table with a size of vocab_size by vocab_size, allowing the model to embed and predict the next token based on the current token. Typically, embedding layers are constructed with dimensions of vocab_size by embedding_dim, where embedding_dim represents the size of the continuous vector representation for each token. However, in this model, the embedding layer serves a dual purpose—it both embeds discrete token indices into continuous vectors and predicts the next token based on the current one.

In simple terms, think of this process like setting up a kitchen where you not only prepare ingredients but also predict what dish you’ll cook next. Imagine you have a large table with rows and columns labeled with all the ingredients in your pantry (vocab_size by vocab_size). Now, each ingredient has a corresponding recipe (embedding_dim) that describes its taste, texture, and how it blends with others. Here, the table is like your embedding layer. When you pick an ingredient (token), you look at its corresponding recipe (continuous vector representation) on the table. This helps you understand its unique qualities. But here’s the twist: your table is not just for finding recipes; it’s also a magical table that suggests what ingredient might come next in your cooking adventure based on what you’re currently using.

forward method.

During the forward pass in the BigramLanguageModel, the forward method takes two inputs: idx, representing a sequence of tokens, and targets, which are the ground truth labels used during training. The heart of the forward pass lies in the token_embedding(idx) operation, where an embedding lookup is performed for each token in the input sequence. This operation results in a tensor with dimensions (batch_size, sequence_length, vocab_size): for every token in the input, we get one row of the table, which the model reads as the scores (logits) for each possible next token.

Analogously, consider a library where each book is represented by a unique identification number. The embedding process in the model is akin to looking up a shelf containing books based on their identification numbers. Each book on the shelf corresponds to a token in the sequence, and the embedding layer transforms these identification numbers into continuous vector representations, much like assigning a specific location to each book on the shelf.

If the targets are provided, indicating that the model is in training mode, the forward pass proceeds to calculate the cross-entropy loss. This loss is computed between the flattened logits (reshaped into a 2D tensor) and the flattened targets. The cross-entropy loss serves as a measure of how well the model's predictions align with the actual targets, providing feedback to adjust the model's parameters during the training process.
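To get a feel for the flattening and the loss value, here is a small sketch that skips the model entirely and feeds all-zero logits (a completely uniform guess) into the loss. The vocabulary size of 65 is an assumed value for illustration. For a uniform guess, the cross-entropy comes out to ln(C), which is a handy baseline: a model that has learned anything should get below it.

```python
import math
import torch
from torch.nn import functional as F

B, T, C = 4, 8, 65  # batch, time, channels (65 is an assumed vocabulary size)

logits = torch.zeros(B, T, C)          # all-zero logits = a uniform guess over the vocabulary
targets = torch.randint(0, C, (B, T))  # random "correct" next tokens

# F.cross_entropy expects (N, C) logits and (N,) targets, hence the flattening
loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

print(loss.item())  # for a uniform guess this equals ln(C)
print(math.log(C))
```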

generate method.

This is the most intuitive method out of the 3. The generate method in the BigramLanguageModel is designed for generating new sequences of tokens. Given an initial input sequence idx and a specified maximum number of new tokens (max_new_tokens), the method iteratively predicts the next token and appends it to the sequence. During each iteration, the model obtains predictions and performs operations to determine the next token in the sequence.

Firstly, logits, loss = self(idx, None) retrieves the model's predictions for the input sequence, with the None value indicating that no ground truth targets are used during generation. Next, logits[:, -1, :] extracts the logits corresponding to the last token in the current sequence. These logits are then transformed into probabilities using the softmax function (F.softmax(logits, dim=-1)).

The model then uses the torch.multinomial function to sample one token based on the computed probabilities. This sampled token is then concatenated with the current input sequence. This process is repeated for the specified number of iterations, gradually extending the sequence. In essence, the generation method simulates the model's creative process of predicting and appending tokens to construct a new sequence. This generated sequence, representing a continuation of the input, is ultimately returned as the output of the generate method.
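The loop described above can be sketched on its own, without the model class. In this sketch a random `table` stands in for the embedding table (row = current token, columns = logits for the next token); the name `table`, the vocabulary size of 5, and the seed are all assumptions made for illustration.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)  # only so the sketch is repeatable

vocab_size = 5                               # assumed toy vocabulary size
table = torch.randn(vocab_size, vocab_size)  # stand-in for the embedding table

idx = torch.zeros((1, 1), dtype=torch.long)  # start from token 0, shape (batch, time)
max_new_tokens = 10

for _ in range(max_new_tokens):
    logits = table[idx[:, -1]]                          # (batch, vocab_size): logits for the last token only
    probs = F.softmax(logits, dim=-1)                   # convert into probabilities
    idx_next = torch.multinomial(probs, num_samples=1)  # sample one token per sequence, shape (batch, 1)
    idx = torch.cat((idx, idx_next), dim=1)             # append it to the running sequence

print(idx.shape)  # torch.Size([1, 11]): the start token plus 10 sampled ones
```

Because the next token is sampled rather than always taking the most likely one, running the loop with a different seed gives a different sequence.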

quick recap.

Now that we’ve implemented this, let’s go back to the transformer architecture and understand a couple concepts. For a quick refresher, this is what the transformer architecture looks like:

In the transformer architecture, there are a bunch of sections outlined which we haven’t really touched on yet. For example, multi-head attention, feed forward, etc. In order to train the language model that we defined above, we need to implement these as classes in Python. Before we jump into that, let’s understand a very important concept called self-attention.

understand self attention.

Self-attention is a mechanism used in deep learning models to capture relationships and dependencies within a sequence of elements, such as words in a sentence. The fundamental idea behind self-attention is to assign different levels of importance or relevance to each element in the sequence based on its relationship with other elements. This allows the model to focus more on the contextually significant parts of the input when making predictions or generating outputs.

The self-attention mechanism involves three main components:

  1. Query (Q): This component involves transforming the input sequence into a set of query vectors. Each query vector represents a position in the sequence that needs to be attended to or focused on.
  2. Key (K): The key component transforms the input sequence into a set of key vectors. These key vectors represent positions in the sequence that will be compared with the queries to determine their relevance.
  3. Value (V): The value component transforms the input sequence into a set of value vectors. These vectors contain the information associated with each position in the sequence.

The attention weights are calculated by taking the dot product of the queries and keys, and the result is passed through a softmax function. These attention weights indicate how much focus each position (or element) in the sequence should receive with respect to the current position. The final output is obtained by aggregating the values, weighted by the attention weights. There are multiple ways that it has been implemented, but we’re going to be focusing on the method that was proposed in the transformers paper back in 2017.
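For reference, the version from the 2017 paper (called scaled dot-product attention there) is usually written as follows. The symbols are the paper's notation: Q, K, and V are the query, key, and value matrices, and d_k is the dimension of the key vectors. Dividing by the square root of d_k keeps the dot products from growing with the vector size, which would otherwise push the softmax toward putting nearly all of its weight on a single position.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```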

Now consider the following python code:

B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x) # this becomes size (B, T, 16)
q = query(x) # this becomes (B,T,16)
v = value(x)

# now we want all the keys to perform a dot product with the queries

wei = q @ k.transpose(-2, -1) # (B,T,16) x (B,16,T) => (B,T,T)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

output = wei @ v # v is the elements we aggregate, this makes it so the output will be 16 dimensional (the headsize)

The first couple of lines initialize some constants, and a random tensor x is created with dimensions corresponding to batch size (B), sequence length (T), and number of channels (C). Following this, linear transformations are applied to the input tensor x to obtain the key (k), query (q), and value (v) tensors. Each linear layer (nn.Linear) transforms the input tensor from the original channel size (C) to a specified head_size without bias. We do this because we want to project the original input tensor x, with dimensions (B, T, C), into three separate spaces (keys k, queries q, and values v), each with a reduced dimensionality specified by the hyperparameter head_size. Setting bias=False means each projection is a plain matrix multiplication, controlled solely by the learned weights.

Following this, the expression wei = q @ k.transpose(-2, -1) calculates the dot product between the query tensor q and the transpose of the key tensor k. This operation produces a matrix where each element (i, j) is the raw attention score (the similarity) between the i-th query and the j-th key. The transpose is performed to align the dimensions properly for the dot product. Next, we create a lower triangular matrix tril filled with ones using torch.tril(torch.ones(T, T)). This matrix has ones in the lower triangle (including the diagonal) and zeros elsewhere. We use it to mask out the upper triangular part of the attention matrix by replacing those positions with negative infinity, so that each position can only attend to itself and earlier positions, never to tokens that come after it. This is achieved with wei = wei.masked_fill(tril == 0, float('-inf')).

Finally, we apply the softmax function along the last dimension of the attention matrix (dim=-1). The softmax operation ensures that the attention weights sum to 1 across each row, representing a probability distribution. This step is crucial for the attention mechanism, as it determines how much focus should be given to each element in the sequence when aggregating information. The resulting attention weights are stored in the variable wei.
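The effect of the mask plus softmax is easiest to see when every score is the same. In this small sketch (T = 4 and the all-zero scores are assumptions chosen to make the result easy to read), each position ends up averaging evenly over itself and everything before it:

```python
import torch
from torch.nn import functional as F

T = 4
wei = torch.zeros(T, T)                          # pretend every query/key score is 0
tril = torch.tril(torch.ones(T, T))              # ones on and below the diagonal
wei = wei.masked_fill(tril == 0, float('-inf'))  # hide "future" positions
wei = F.softmax(wei, dim=-1)                     # -inf becomes weight 0 after softmax

print(wei)
# tensor([[1.0000, 0.0000, 0.0000, 0.0000],
#         [0.5000, 0.5000, 0.0000, 0.0000],
#         [0.3333, 0.3333, 0.3333, 0.0000],
#         [0.2500, 0.2500, 0.2500, 0.2500]])
```

With real query/key scores the non-zero entries in each row are no longer equal, but the zeros above the diagonal stay exactly where they are.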

The expression output = wei @ v calculates the weighted sum of the values tensor v based on the attention weights stored in the matrix wei. Each row of the matrix wei represents the attention distribution for a specific query, determining how much attention each element in the sequence should receive. The values tensor v contains the elements we want to aggregate.

By performing the matrix multiplication, we essentially combine the attention weights with the corresponding values. This operation produces a new tensor output, where each row represents a weighted sum of the values, with the weights determined by the attention mechanism. In the context of self-attention, this step allows the model to focus on different parts of the input sequence when creating the final representation. The resulting output tensor captures the contextual information based on the attention mechanism, and it is typically used as an input for subsequent layers in the neural network.
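One way to convince yourself that the mask does its job is to change only the last token and check that the outputs at all earlier positions stay the same. The sketch below wraps the steps from above in a helper (the name `self_attention` is made up for this example, and the seed and batch size of 1 are arbitrary choices):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)  # only so the sketch is repeatable
B, T, C, head_size = 1, 8, 32, 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
tril = torch.tril(torch.ones(T, T))

def self_attention(x):
    # the same steps as the code above, wrapped in a function
    k, q, v = key(x), query(x), value(x)
    wei = q @ k.transpose(-2, -1)                    # (B, T, T) raw scores
    wei = wei.masked_fill(tril == 0, float('-inf'))  # hide future positions
    wei = F.softmax(wei, dim=-1)
    return wei @ v                                   # (B, T, head_size)

x = torch.randn(B, T, C)
x_changed = x.clone()
x_changed[:, -1, :] = torch.randn(C)  # change only the last position

out = self_attention(x)
out_changed = self_attention(x_changed)

print(out.shape)  # torch.Size([1, 8, 16])
print(torch.allclose(out[:, :-1], out_changed[:, :-1]))  # True: earlier positions never saw the change
```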

That was a lot of information, but to summarize (TL;DR):

This code sets up and transforms a random input tensor to create key, query, and value tensors for self-attention. Linear transformations without bias project the input into separate spaces. The expression wei = q @ k.transpose(-2, -1) calculates the attention scores between queries and keys. A lower triangular matrix is used to mask out attention to future positions, and softmax turns each row of scores into weights that sum to 1. Finally, output = wei @ v calculates a weighted sum of values based on the attention weights, capturing contextual information for each element in the sequence. This output is what gets passed on to subsequent layers in the neural network.

Now it’s time to extrapolate this simple self-attention code into implementing the building blocks of the transformer architecture, which will in turn help us build our own version of GPT! Let’s begin by implementing a head block.

implementing the head.

Before we implement anything, let’s define the following constants for our model:

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 64
block_size = 256
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
C1 = 384
n_head = 6
n_layer = 6
dropout = 0.2

These constants don’t mean much now, but they will when we implement the Head block of the transformer architecture. The head block builds off the self-attention block we built above and to avoid redundancy, I’m going to provide the implementation and jump to the next portion of the architecture.

class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__() # required before registering any layers on an nn.Module
        self.key = nn.Linear(C1, head_size, bias=False)
        self.query = nn.Linear(C1, head_size, bias=False)
        self.value = nn.Linear(C1, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size,block_size))) # this creates the lower triangle matrix
        self.dropout = nn.Dropout(dropout)

    def forward(self, x): # copied from above
        B,T,C = x.shape
        k = self.key(x)
        q = self.query(x)

        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5 # scale the scores by 1/sqrt(head_size), as in the paper
        wei = wei.masked_fill(self.tril[:T,:T]==0, float('-inf')) # only attend to current and earlier positions
        wei = F.softmax(wei,dim=-1)
        wei = self.dropout(wei)
        v = self.value(x)
        out = wei @ v
        return out
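To check that a head behaves as expected, here is a self-contained sketch that repeats a compact version of the class and runs a random batch through it. The hyperparameters are copied from the constants above; the choice `head_size = C1 // n_head` is an assumption (a common way to size the heads), and the scores are scaled by 1/sqrt(head_size) as in the 2017 paper.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters copied from the article; head_size = C1 // n_head is an assumption
C1, n_head, block_size, dropout = 384, 6, 256, 0.2
head_size = C1 // n_head  # 64

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(C1, head_size, bias=False)
        self.query = nn.Linear(C1, head_size, bias=False)
        self.value = nn.Linear(C1, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # scaled scores, (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v

head = Head(head_size)
head.eval()                 # turn dropout off for a deterministic check
x = torch.randn(4, 10, C1)  # (batch, time, channels); time may be shorter than block_size
out = head(x)
print(out.shape)  # torch.Size([4, 10, 64])
```

Storing the mask with register_buffer means it is saved and moved to the GPU together with the module, but it is not treated as a trainable parameter.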

The Head block doesn’t mean much yet, but when we implement the MultiHeadAttention, we’re going to use the Head class. Speaking of, now let’s jump into implementing the Multi Head Attention component.

multi-head attention.

Multi-head attention is quite literally what it sounds like: multiple heads are used in parallel. This allows the model to focus on different aspects of the input sequence simultaneously. In standard self-attention, a single set of learned linear projections (head) is applied to the input sequence, limiting its capacity to capture diverse relationships within the data. Multi-Head Attention addresses this limitation by utilizing multiple parallel attention heads. Each head independently projects the input into different spaces, allowing the model to attend to various aspects of the sequence concurrently. The outputs from all heads are then concatenated and linearly transformed to generate the final multi-head attention output. This process enables the model to capture intricate patterns, dependencies, and context information across different dimensions of the input, leading to more robust and nuanced representations.

This sounds a little confusing, but consider this analogy:

Imagine you’re trying to analyze a complex scene in a movie. In a regular self-attention mechanism, it’s like having a single camera angle to focus on the entire scene — you might miss some crucial details happening in different parts. Now, think of Multi-Head Attention as having multiple cameras capturing the scene simultaneously from different angles. Each camera (attention head) independently focuses on specific areas, ensuring that no important details are overlooked. By combining the views from all cameras (heads), you get a comprehensive understanding of the entire scene, making it easier to grasp the intricate relationships and nuances present. Similarly, in Multi-Head Attention, the model gains a more thorough understanding of the input sequence by considering various aspects simultaneously, leading to improved performance in understanding complex patterns within the data.

Now building off this theoretical understanding, let’s write out the python code for this implementation:

class MultiHeadAttention(nn.Module):
    """Several attention heads run in parallel, then concatenated and projected."""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])  # create multiple heads
        self.proj = nn.Linear(C1, C1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # concatenate all of the head outputs
        out = self.dropout(self.proj(out))  # project, then apply dropout
        return out

In the __init__ method, an instance of nn.ModuleList is used to create a list of num_heads individual Head modules, each initialized with the specified head_size. These heads will operate independently, allowing the model to focus on different aspects of the input sequence. Additionally, the class includes a projection layer (self.proj) and a dropout layer (self.dropout), which are applied after the heads' outputs are concatenated.

During the forward pass, the input sequence x is passed through each head in the ModuleList, and the outputs of all heads are concatenated along the last dimension (dim=-1). This concatenation allows the model to consider diverse aspects of the input sequence captured by different heads. The concatenated output is then projected using the self.proj linear layer. The final output is returned, representing the multi-headed attention's aggregated understanding of the input sequence.

In simpler terms, the MultiHeadAttention class organizes multiple attention heads to collectively focus on different aspects of the input, and then combines their insights to provide a more comprehensive representation of the sequence. The projection layer helps refine and combine these representations, and the dropout layer aids in preventing overfitting by randomly dropping out some information during training.
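To make the shapes concrete, here is a minimal sketch of the concatenation step on its own. It assumes the embedding size of 384 and the 6 heads from the hyperparameters; the batch size and sequence length are made up, and random tensors stand in for the real head outputs.

```python
import torch

B, T = 2, 5                   # hypothetical batch size and sequence length
n_embd, n_head = 384, 6       # same values as C1 and n_head in the hyperparameters
head_size = n_embd // n_head  # 64 features per head

# Random tensors standing in for the output of each attention head.
head_outputs = [torch.randn(B, T, head_size) for _ in range(n_head)]

out = torch.cat(head_outputs, dim=-1)  # concatenate along the feature dimension
print(out.shape)  # torch.Size([2, 5, 384])
```

Six heads of 64 features each add back up to 384, which is why the projection layer can map from the embedding size to the embedding size.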

We’re nearly done making our GPT model. We have 2 things left to build: the feed-forward network and the Blocks of the transformer.

feed-forward network.

In a transformer architecture, the feedforward network is a crucial component responsible for processing and transforming the information obtained from the self-attention mechanism. Its primary role is to apply a nonlinear transformation to the representations learned through self-attention, introducing complexity and enabling the model to capture intricate patterns in the data.

The feedforward network typically consists of two linear transformations separated by a non-linear activation function. The input to the feedforward network is the output of the self-attention layer. Each position in the input sequence is treated independently during this operation.

Here’s how the feedforward network works:

  1. First Linear Transformation: The input sequence from the self-attention layer is passed through a linear layer (fully connected layer), projecting the input into a higher-dimensional space. On its own this step is still linear; the extra dimensions give the activation function that follows more room to work with.
  2. Activation Function: After the first linear transformation, a non-linear activation function, commonly a rectified linear unit (ReLU), is applied element-wise to introduce non-linearity. This step enables the model to learn complex patterns and relationships in the data that may not be captured by linear transformations alone.
  3. Second Linear Transformation: The output of the activation function is then passed through another linear layer, projecting it back into a lower-dimensional space. This transformation allows the model to compress and refine the information, focusing on the most relevant features.
  4. Final Output: The output of the second linear transformation is the final representation of the sequence after passing through the feedforward network. This representation is then used as input for subsequent layers in the transformer architecture.

Now let’s look at the python implementation of the feed-forward network:

class FeedForward(nn.Module):

    def __init__(self, n_embd):
        super().__init__()
        # the factor of 4: the input dimensionality is n_embd, but the inner layer dimensionality is 4 * n_embd
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # linear layer with n_embd inputs and 4 * n_embd outputs
            nn.ReLU(),  # activation function, allows for non linearity; ReLU also helps with vanishing
                        # gradients, i.e. gradients that become very small as they are propagated backward
                        # through many layers, so the early layers are updated only by tiny amounts, if at all
            nn.Linear(4 * n_embd, n_embd),  # project back down to n_embd
            nn.Dropout(dropout),  # regularization
        )

    def forward(self, x):
        return self.net(x)

Very similar to the process described above, the code follows the process of linear transformations, activation functions and doing a dropout on the layers.

  1. Sequential Layers: The nn.Sequential container allows for the sequential application of layers. The first linear layer (nn.Linear(n_embd, 4*n_embd)) projects the input from dimension n_embd to 4*n_embd. The choice of multiplying by 4 is a common practice in transformers to introduce higher-dimensional representations, enabling the model to learn more complex features.
  2. ReLU Activation: After the first linear transformation, a rectified linear unit (ReLU) activation function (nn.ReLU()) is applied element-wise. ReLU introduces non-linearity by setting negative values to zero, allowing the network to capture complex patterns. Because its gradient is 1 for positive inputs, it also helps mitigate the vanishing gradient problem.
  3. Second Linear Transformation: The second linear layer (nn.Linear(n_embd * 4, n_embd)) projects the data back to the original dimensionality n_embd. This step helps in compressing and refining the information, focusing on the most relevant features.
  4. Dropout: The final layer is a dropout layer (nn.Dropout(dropout)), which randomly sets a fraction of input units to zero during training. Dropout is a regularization technique that helps prevent overfitting by introducing noise, improving the model's generalization performance.
  5. Forward Pass: The forward method implements the forward pass of the feedforward network. It takes an input tensor x and passes it through the defined sequential network. The output is the transformed representation of the input after going through the linear transformations and activation functions.
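The claim that each position is treated independently can be checked directly. The sketch below uses a small made-up embedding size and leaves dropout out so the result is deterministic; it runs the same network once on a whole sequence and once on a single token, then compares the two.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 8  # small hypothetical embedding size for this demo

# Same layout as the feed-forward network, without dropout.
net = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),
)

x = torch.randn(2, 5, n_embd)  # (B, T, n_embd)
full = net(x)                  # the whole sequence at once
single = net(x[:, 3, :])       # only the token at position 3

print(full.shape)
print(torch.allclose(full[:, 3, :], single, atol=1e-5))
```

The output for position 3 is the same either way, so the tokens only exchange information in the attention layer, never in the feed-forward network.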

implementing the Block.

The Block isn’t formally defined, but when looking at the architecture of the transformer, the image you see below is considered a block.

With respect to the transformer architecture, a “block” refers to a fundamental building unit that performs a sequence of operations on the input data. A transformer block typically consists of two main components: a self-attention mechanism and a feedforward neural network. These components are augmented with layer normalization and residual connections to enhance training stability and information flow.

Here’s a breakdown of the key components within a transformer block:

  1. Self-Attention Mechanism: This mechanism allows the model to weigh different parts of the input sequence differently based on their relevance to each other. It captures dependencies between words in a sequence and helps the model focus on important relationships. The self-attention mechanism calculates attention scores for each element in the sequence with respect to all other elements.
  2. Feedforward Neural Network: Following the self-attention mechanism, the output is passed through a feedforward neural network. This network processes each element independently and provides a non-linear transformation, allowing the model to capture complex patterns and relationships in the data.
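  3. Layer Normalization: In this implementation, layer normalization is applied to the input of each sub-layer (self-attention and feedforward network), an arrangement often called pre-norm. The original Transformer instead normalized after each sub-layer. Either way, it normalizes the features of each token, helping stabilize training and improve the convergence of the model.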
  4. Residual Connections: Residual connections are used to create shortcuts for the flow of information. The original input to the block is added to the output of each sub-layer, facilitating the smooth propagation of gradients during training. This helps alleviate the vanishing gradient problem and enables more effective learning.
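As a rough illustration of what layer normalization computes, the sketch below compares nn.LayerNorm with the same calculation written by hand. The sizes and the input scale are made up, and it relies on a freshly created nn.LayerNorm starting with a scale of 1 and a shift of 0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 16                            # hypothetical embedding size for this demo
x = torch.randn(2, 5, n_embd) * 3 + 7  # (B, T, n_embd), deliberately off-center

ln = nn.LayerNorm(n_embd)
y = ln(x)

# The same computation by hand: normalize each token over its features.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + ln.eps)

print(torch.allclose(y, manual, atol=1e-5))
print(y.mean(dim=-1).abs().max().item())  # close to 0 for every token
```

Every token ends up with roughly zero mean and unit variance across its features, no matter how the input was scaled.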

To summarize these past 2 concepts into a simple analogy, think of a transformer block like a team working on a project. Each person (word) talks about their part in the project, paying attention to what others are saying (self-attention). After the discussion, each person improves their work individually (feedforward network), refining details.

To keep everyone on the same page, they make sure everyone understands the project the same way (layer normalization). And to avoid confusion, they maintain a constant flow of information (residual connections) so that the original ideas stay intact while incorporating the improvements from the discussion.

Now let’s take a look at the python code for the Block:

class Block(nn.Module):

    def __init__(self, n_embd, n_head):  # n_embd is the embedding dimension, n_head is the number of heads
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # normalize, attend, add the residual
        x = x + self.ffwd(self.ln2(x))  # normalize, feed forward, add the residual
        return x

Expanding on the ideas above:

  1. Initialization: The constructor (__init__) takes two parameters, n_embd (embedding dimension) and n_head (number of attention heads). It initializes components of the block, including a multi-head self-attention mechanism (MultiHeadAttention), a feedforward network (FeedForward), and two layer normalization layers (nn.LayerNorm).
  2. Head Size Calculation: The head_size is calculated as the integer division of n_embd by n_head. This determines the size of each attention head, ensuring that the dimensions align properly.
  3. Components:
  • self.sa: The self-attention mechanism is created using the MultiHeadAttention class with the specified number of heads and head size.
  • self.ffwd: The feedforward network is created using the FeedForward class with the specified embedding dimension.
  • self.ln1 and self.ln2: Two layer normalization layers are created using nn.LayerNorm with the specified embedding dimension. Each one normalizes the input of a sub-layer to stabilize training.
  4. Forward Pass: The forward method implements the forward pass of the block. It takes an input tensor x and processes it through the self-attention mechanism and the feedforward network. The input of each sub-layer is passed through layer normalization first, and the sub-layer's output is then added back to its un-normalized input. This residual connection helps in mitigating the vanishing gradient problem and aids in the flow of information.
  • x = x + self.sa(self.ln1(x)): Normalizes x with layer normalization, applies self-attention, and adds the result to the original input.
  • x = x + self.ffwd(self.ln2(x)): Normalizes the intermediate result, applies the feedforward network, and adds the output back to the intermediate result.
  5. Return: The final output of the block is returned, which represents the processed and enriched representation of the input tensor x. The layer normalization and residual connections contribute to the stability and effectiveness of training in the transformer architecture.
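The residual pattern is easier to see with the sub-layer swapped for something trivial. In the sketch below, ResidualPreNorm is a hypothetical helper (it is not part of the model in this article) and a plain linear layer stands in for attention or the feedforward network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 16  # hypothetical embedding size for this demo

class ResidualPreNorm(nn.Module):
    """Hypothetical helper: computes x + sublayer(LayerNorm(x))."""

    def __init__(self, n_embd, sublayer):
        super().__init__()
        self.ln = nn.LayerNorm(n_embd)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.ln(x))

# A linear layer stands in for attention or the feedforward network.
sub = nn.Linear(n_embd, n_embd)
layer = ResidualPreNorm(n_embd, sub)

x = torch.randn(2, 5, n_embd)
out_shape = layer(x).shape
print(out_shape)  # same shape as x, so layers like this can be stacked

# If the sub-layer outputs zeros, the input passes through unchanged.
nn.init.zeros_(sub.weight)
nn.init.zeros_(sub.bias)
print(torch.equal(layer(x), x))
```

The second check is the useful intuition: the shortcut path always carries x through untouched, and the sub-layer only has to learn what to add on top of it.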

final touches.

Now we are almost done with the implementation of our own GPT from scratch. Let’s take our implementation of the bigram language model from earlier and change it up to match the classes we’ve implemented.

class BigramLanguageModel_new(nn.Module):

    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, C1)
        self.position_embedding_table = nn.Embedding(block_size, C1)
        self.blocks = nn.Sequential(*[Block(C1, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(C1)  # final layer norm
        self.lm_head = nn.Linear(C1, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T,C)
        x = tok_emb + pos_emb  # (B,T,C) array, includes both the information about the tokens and their positions in the sequence
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)  # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]  # crop the context to the last block_size tokens
            logits, loss = self(idx_cond)
            logits = logits[:, -1, :]  # focus on the last time step
            probs = F.softmax(logits, dim=-1)  # probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # get the i +1th prediction
            idx = torch.cat((idx, idx_next), dim=1)  # concatenate the prediction with the current sequence
        return idx
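One step in the forward pass that is easy to gloss over is the addition of token and position embeddings, since the two tensors don't have the same shape. Here is a minimal sketch of just that step; the sizes are deliberately tiny and made up for the demo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, n_embd = 10, 8, 4  # small hypothetical sizes
B, T = 2, 5

tok_table = nn.Embedding(vocab_size, n_embd)
pos_table = nn.Embedding(block_size, n_embd)

idx = torch.randint(0, vocab_size, (B, T))  # (B, T) token ids
tok_emb = tok_table(idx)                    # (B, T, n_embd)
pos_emb = pos_table(torch.arange(T))        # (T, n_embd)

x = tok_emb + pos_emb  # pos_emb is broadcast over the batch dimension
print(x.shape)  # torch.Size([2, 5, 4])
```

Every sequence in the batch gets the same positional vectors added. It also shows why generation has to crop the context to the last block_size tokens: the position table only has block_size rows, so a longer sequence would index past its end.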

There are several changes that were made to align the model with the transformer architecture and enhance its capacity to capture complex patterns in input sequences. One notable modification is the introduction of separate embedding tables for tokens and positions (self.token_embedding_table and self.position_embedding_table, respectively). This adjustment allows the model to incorporate both token-specific information and positional information, contributing to a more comprehensive understanding of the input sequence. Additionally, the adoption of a stack of transformer blocks (Block) using the nn.Sequential container facilitates the incorporation of self-attention mechanisms, feedforward networks, layer normalization, and residual connections. This modular and hierarchical structure enhances the model's ability to capture intricate relationships and dependencies within the data. Furthermore, the inclusion of positional embeddings (pos_emb) and their addition to token embeddings (tok_emb) ensures that the model is aware of the sequential order of tokens. The layer normalization (self.ln_f) applied after the transformer blocks contributes to training stability. Finally, the linear head (self.lm_head) is employed to project the model's output to the vocabulary size during training and generation. Now that this model is defined, you can train the model on pretty much any text-based dataset that you would like. The last step is to define the optimizer and train.

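model = BigramLanguageModel_new().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    # (evaluation code not shown here)

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)

    # clear old gradients, backpropagate, and update the parameters
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()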

For those who are curious, we used the AdamW optimizer, a variant of Adam that decouples weight decay from the gradient update. Adam combines the benefits of both the AdaGrad and RMSProp optimizers by maintaining individual learning rates for each parameter and utilizing moving averages of the gradients and of the squared gradients. This adaptive learning rate mechanism often helps the model converge faster and more effectively.
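For a feel of what "moving averages" means here, the sketch below writes out a single Adam step by hand and compares it with torch.optim.Adam. It uses plain Adam rather than AdamW so that weight decay stays out of the picture, and the toy loss and parameter vector are made up for the demo.

```python
import torch

torch.manual_seed(0)
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8  # PyTorch's default Adam settings

w = torch.randn(5, requires_grad=True)  # toy parameter vector
w0 = w.detach().clone()

loss = (w ** 2).sum()  # toy loss
loss.backward()
g = w.grad.detach().clone()

# One Adam step written out by hand (both averages start at 0).
m = (1 - beta1) * g       # moving average of the gradients
v = (1 - beta2) * g ** 2  # moving average of the squared gradients
m_hat = m / (1 - beta1)   # bias correction for step 1
v_hat = v / (1 - beta2)
manual = w0 - lr * m_hat / (v_hat.sqrt() + eps)

# The same step with the built-in optimizer.
opt = torch.optim.Adam([w], lr=lr, betas=(beta1, beta2), eps=eps)
opt.step()

print(torch.allclose(w.detach(), manual, atol=1e-6))
```

Dividing by the square root of the squared-gradient average is what gives each parameter its own effective learning rate.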

The training loop involves iterating over a specified number of iterations (max_iters). During each iteration, a batch of data (xb, yb) is sampled from the training set using the get_batch function. The model is then fed with the input sequence (xb) and target sequence (yb), and the logits and loss are computed.

The optimizer (torch.optim.AdamW) is employed to update the model parameters based on the calculated gradients. The optimizer.zero_grad(set_to_none=True) operation clears the gradients in preparation for the next iteration, and loss.backward() computes the gradients of the loss with respect to the model parameters. Finally, optimizer.step() performs a parameter update using the computed gradients.
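The three optimizer calls are easier to follow on a problem small enough to run in a second. The sketch below keeps the same zero_grad, backward, step pattern, but everything else is a stand-in: toy_model is a single linear layer rather than the language model, and the data is a made-up regression problem instead of batches from get_batch.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)

# Toy regression data standing in for the batches from get_batch: y = 2x
x = torch.randn(64, 1)
y = 2 * x

toy_model = nn.Linear(1, 1)  # hypothetical stand-in for the language model
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-2)

first_loss = None
for step in range(200):
    pred = toy_model(x)
    loss = F.mse_loss(pred, y)
    if first_loss is None:
        first_loss = loss.item()

    optimizer.zero_grad(set_to_none=True)  # clear gradients from the previous step
    loss.backward()                        # compute gradients of the loss
    optimizer.step()                       # update the parameters

print(first_loss, loss.item())
```

The loss printed at the end should be lower than the one at the start, which is the same signal you would watch for when training the full model.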

That marks the end of the implementation of a mini-GPT from scratch. I hope that reading this article added value to you and provided a clear understanding of how to build a transformer-based language model. That was a long article, but if you’ve read till the end, thank you so much for taking time to read my article and I hope you walked away with more knowledge than you came in with :)

If you have any questions regarding this article or just want to connect, you can find me on LinkedIn or my personal website :)