Simple Positional Encoding with PyTorch

Why do we even need positional encoding? How does it work? And most importantly, how do I code it myself? I decided to roll up my sleeves, write a minimal version in PyTorch, and really understand what’s going on.


What’s the Point of Positional Encoding?

Transformers don’t process sequences in order like RNNs or LSTMs. They look at the whole sequence at once (self-attention). But that also means they have no idea what the order of the tokens is. That’s a problem because:

“I love pizza” and “Pizza love I” shouldn’t mean the same thing.

So, we need to give the model some idea of position — enter positional encoding. There are different ways to do this. Some use sine and cosine functions (like in the original Transformer paper), but models like GPT-2 use something simpler: learnable positional embeddings.
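
For contrast, here is a minimal sketch of the fixed sinusoidal flavor from the original Transformer paper. The function name is mine and it assumes an even embedding dimension; we won't use it below, since we're building the learnable kind instead:

import math
import torch

def sinusoidal_positional_encoding(context_len, embedding_dim):
    # Fixed (non-learnable) encodings: each position gets sin/cos values at different frequencies.
    # Assumes embedding_dim is even.
    positions = torch.arange(context_len, dtype=torch.float).unsqueeze(1)   # (context_len, 1)
    freqs = torch.exp(torch.arange(0, embedding_dim, 2, dtype=torch.float)
                      * (-math.log(10000.0) / embedding_dim))               # (embedding_dim / 2,)
    pe = torch.zeros(context_len, embedding_dim)
    pe[:, 0::2] = torch.sin(positions * freqs)   # even dimensions
    pe[:, 1::2] = torch.cos(positions * freqs)   # odd dimensions
    return pe                                    # (context_len, embedding_dim)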


The Code

Here’s the simple version I came up with:

import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, vocab_size, context_len, embedding_dim):
        super().__init__()
        # Token embedding layer: maps token IDs to vectors
        self.embedding_layer = nn.Embedding(vocab_size, embedding_dim)
        # Positional embedding layer: a learnable vector for each position
        self.positional_encoding = nn.Embedding(context_len, embedding_dim)

    def forward(self, data):
        # data is (batch_size, seq_len) of token IDs
        batch_size, seq_len = data.shape

        # 1. Get token embeddings -> (batch_size, seq_len, embedding_dim)
        e_data = self.embedding_layer(data)

        # 2. Create and embed position indices (0, 1, 2, ..., seq_len-1)
        # torch.arange generates indices, .unsqueeze(0) makes it (1, seq_len)
        positions = torch.arange(0, seq_len, device=data.device).unsqueeze(0)
        p_data = self.positional_encoding(positions)

        # 3. Add token and positional embeddings (broadcast over the batch)
        embedded_data = e_data + p_data
        return embedded_data

What Is It Actually Doing?

nn.Embedding(vocab_size, embedding_dim) maps token IDs to vectors.
nn.Embedding(context_len, embedding_dim) does the same thing, but for positions (0, 1, 2, ...).

In forward, we add the token embedding and the positional embedding. This gives us a final vector that contains both what the token is and where it is in the sequence.
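
Here's a quick sanity check using the same sizes as the dummy example below (the weights are randomly initialized, so your actual numbers will differ; what matters is the shape):

model = PositionalEncoding(vocab_size=10, context_len=5, embedding_dim=3)

data = torch.tensor([[1, 4, 3, 2, 0]])  # (batch_size=1, seq_len=5)
out = model(data)

print(out.shape)  # torch.Size([1, 5, 3]) -> one 3-dim vector per token, with position baked in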


A Dummy Example That Helped Me

Let’s say:

vocab_size = 10
context_len = 5
embedding_dim = 3

And we have one sample like this:

data = torch.tensor([[1, 4, 3, 2, 0]])  # shape = (1, 5)

Token Embeddings

Token ID Embeddings ($\mathbf{E}_{token}$)
1 [0.3, 0.1, 0.4]
4 [0.2, 0.5, 0.3]
3 [0.4, 0.2, 0.1]
2 [0.1, 0.3, 0.2]
0 [0.0, 0.1, 0.3]

Positional Embeddings

Position ID Embeddings ($\mathbf{E}_{pos}$)
0 [0.1, 0.0, 0.0]
1 [0.0, 0.1, 0.0]
2 [0.0, 0.0, 0.1]
3 [0.1, 0.0, 0.1]
4 [0.0, 0.1, 0.1]

Now we add the token embedding and positional embedding for each position ($\mathbf{E}_{final} = \mathbf{E}_{token} + \mathbf{E}_{pos}$):

Token ID  Position ID  Token Embedding + Positional Embedding = Combined Vector
1 0 [0.3, 0.1, 0.4] + [0.1, 0.0, 0.0] = [0.4, 0.1, 0.4]
4 1 [0.2, 0.5, 0.3] + [0.0, 0.1, 0.0] = [0.2, 0.6, 0.3]
3 2 [0.4, 0.2, 0.1] + [0.0, 0.0, 0.1] = [0.4, 0.2, 0.2]
2 3 [0.1, 0.3, 0.2] + [0.1, 0.0, 0.1] = [0.2, 0.3, 0.3]
0 4 [0.0, 0.1, 0.3] + [0.0, 0.1, 0.1] = [0.0, 0.2, 0.4]

Now the model knows that Token 1 appeared first, and Token 4 came second. That order information is now baked into the embedding vectors.
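
If you want to check the arithmetic with real tensors, here is a small sketch that plants the made-up table values into the two embedding layers (under torch.no_grad(), since we're overwriting learnable weights purely for illustration) and reproduces the combined vectors:

# Recreate the dummy example with the table values above (illustrative numbers, not trained weights)
model = PositionalEncoding(vocab_size=10, context_len=5, embedding_dim=3)

token_table = torch.tensor([[0.0, 0.1, 0.3],   # token ID 0
                            [0.3, 0.1, 0.4],   # token ID 1
                            [0.1, 0.3, 0.2],   # token ID 2
                            [0.4, 0.2, 0.1],   # token ID 3
                            [0.2, 0.5, 0.3]])  # token ID 4
pos_table = torch.tensor([[0.1, 0.0, 0.0],     # position 0
                          [0.0, 0.1, 0.0],     # position 1
                          [0.0, 0.0, 0.1],     # position 2
                          [0.1, 0.0, 0.1],     # position 3
                          [0.0, 0.1, 0.1]])    # position 4

with torch.no_grad():
    model.embedding_layer.weight[:5] = token_table   # only the first 5 token IDs matter here
    model.positional_encoding.weight[:] = pos_table

data = torch.tensor([[1, 4, 3, 2, 0]])
print(model(data))
# (values shown rounded)
# tensor([[[0.4, 0.1, 0.4],
#          [0.2, 0.6, 0.3],
#          [0.4, 0.2, 0.2],
#          [0.2, 0.3, 0.3],
#          [0.0, 0.2, 0.4]]])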