Simple Positional Encoding with PyTorch
Why do we even need positional encoding? How does it work? And most importantly, how do I code it myself? I decided to roll up my sleeves, write a minimal version in PyTorch, and really understand what’s going on.
What’s the Point of Positional Encoding?
Transformers don’t process sequences in order like RNNs or LSTMs. They look at the whole sequence at once (self-attention). But that also means they have no idea what the order of the tokens is. That’s a problem because:
“I love pizza” and “Pizza love I” shouldn’t mean the same thing.
So, we need to give the model some idea of position — enter positional encoding. There are different ways to do this. Some use sine and cosine functions (like in the original Transformer paper), but models like GPT-2 use something simpler: learnable positional embeddings.
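For comparison, here’s a rough sketch of the fixed sine/cosine flavor from the original paper, where $PE(pos, 2i) = \sin(pos / 10000^{2i/d_{model}})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{model}})$. The function name is mine and it assumes an even embedding dimension; it’s not what we build below, just the alternative for reference:

```python
import torch

def sinusoidal_positional_encoding(context_len, embedding_dim):
    # Fixed (non-learned) sine/cosine encodings, assuming embedding_dim is even.
    positions = torch.arange(context_len).unsqueeze(1)                       # (context_len, 1)
    div_term = 10000 ** (torch.arange(0, embedding_dim, 2) / embedding_dim)  # (embedding_dim / 2,)
    pe = torch.zeros(context_len, embedding_dim)
    pe[:, 0::2] = torch.sin(positions / div_term)   # even dimensions get sine
    pe[:, 1::2] = torch.cos(positions / div_term)   # odd dimensions get cosine
    return pe                                       # (context_len, embedding_dim)
```

The learnable version we’re about to build skips the trigonometry entirely and just lets the model figure out useful position vectors during training.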
The Code
Here’s the simple version I came up with:
```python
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    def __init__(self, vocab_size, context_len, embedding_dim):
        super().__init__()
        # Token embedding layer
        self.embedding_layer = nn.Embedding(vocab_size, embedding_dim)
        # Positional embedding layer (learnable embeddings for each position)
        self.positional_encoding = nn.Embedding(context_len, embedding_dim)

    def forward(self, data):
        # Assuming data is (batch_size, seq_len)
        batch_size, seq_len = data.shape

        # 1. Get token embeddings -> (batch_size, seq_len, embedding_dim)
        e_data = self.embedding_layer(data)

        # 2. Create and embed position indices (0, 1, 2, ..., seq_len-1)
        #    torch.arange generates the indices, .unsqueeze(0) makes it (1, seq_len)
        positions = torch.arange(0, seq_len, device=data.device).unsqueeze(0)
        p_data = self.positional_encoding(positions)

        # 3. Add token and positional embeddings (broadcasts over the batch)
        embedded_data = e_data + p_data
        return embedded_data
```
What Is It Actually Doing?
`nn.Embedding(vocab_size, embedding_dim)` maps token IDs to vectors.
`nn.Embedding(context_len, embedding_dim)` does the same thing, but for positions (0, 1, 2, ...).
In `forward`, we add the token embedding and the positional embedding. This gives us a final vector that contains both what the token is and where it is in the sequence.
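To sanity-check the shapes, here’s a quick usage sketch (the sizes are made up, matching the dummy example below):

```python
import torch

# Made-up sizes, just to check the output shape
model = PositionalEncoding(vocab_size=10, context_len=5, embedding_dim=3)

data = torch.tensor([[1, 4, 3, 2, 0]])   # (batch_size=1, seq_len=5)
out = model(data)

print(out.shape)   # torch.Size([1, 5, 3]) -> (batch_size, seq_len, embedding_dim)
```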
A Dummy Example That Helped Me
Let’s say:
```python
vocab_size = 10
context_len = 5
embedding_dim = 3
```
And we have one sample like this:
```python
data = torch.tensor([[1, 4, 3, 2, 0]])   # shape = (1, 5)
```
Token Embeddings
| Token ID | Embeddings ($\mathbf{E}_{token}$) |
|---|---|
| 1 | [0.3, 0.1, 0.4] |
| 4 | [0.2, 0.5, 0.3] |
| 3 | [0.4, 0.2, 0.1] |
| 2 | [0.1, 0.3, 0.2] |
| 0 | [0.0, 0.1, 0.3] |
Positional Embeddings
| Position ID | Embeddings ($\mathbf{E}_{pos}$) |
|---|---|
| 0 | [0.1, 0.0, 0.0] |
| 1 | [0.0, 0.1, 0.0] |
| 2 | [0.0, 0.0, 0.1] |
| 3 | [0.1, 0.0, 0.1] |
| 4 | [0.0, 0.1, 0.1] |
Now we add the token embedding and positional embedding for each position ($\mathbf{E}_{final} = \mathbf{E}_{token} + \mathbf{E}_{pos}$):
| Token ID | Position ID | Token Embeddings | Positional Embeddings | Combined Vector |
|---|---|---|---|---|
| 1 | 0 | [0.3, 0.1, 0.4] | + [0.1, 0.0, 0.0] | = [0.4, 0.1, 0.4] |
| 4 | 1 | [0.2, 0.5, 0.3] | + [0.0, 0.1, 0.0] | = [0.2, 0.6, 0.3] |
| 3 | 2 | [0.4, 0.2, 0.1] | + [0.0, 0.0, 0.1] | = [0.4, 0.2, 0.2] |
| 2 | 3 | [0.1, 0.3, 0.2] | + [0.1, 0.0, 0.1] | = [0.2, 0.3, 0.3] |
| 0 | 4 | [0.0, 0.1, 0.3] | + [0.0, 0.1, 0.1] | = [0.0, 0.2, 0.4] |
Now the model knows that Token 1 appeared first, and Token 4 came second. That order information is now baked into the embedding vectors.
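If you want to reproduce the table, here’s a small sketch that copies the dummy values above into the module’s weights (in reality the weights start out random and are learned during training):

```python
import torch

model = PositionalEncoding(vocab_size=10, context_len=5, embedding_dim=3)

# Plant the dummy values from the tables above (real weights would be random, then learned).
with torch.no_grad():
    model.embedding_layer.weight[torch.tensor([1, 4, 3, 2, 0])] = torch.tensor([
        [0.3, 0.1, 0.4],  # token 1
        [0.2, 0.5, 0.3],  # token 4
        [0.4, 0.2, 0.1],  # token 3
        [0.1, 0.3, 0.2],  # token 2
        [0.0, 0.1, 0.3],  # token 0
    ])
    model.positional_encoding.weight[:5] = torch.tensor([
        [0.1, 0.0, 0.0],  # position 0
        [0.0, 0.1, 0.0],  # position 1
        [0.0, 0.0, 0.1],  # position 2
        [0.1, 0.0, 0.1],  # position 3
        [0.0, 0.1, 0.1],  # position 4
    ])

data = torch.tensor([[1, 4, 3, 2, 0]])
print(model(data))
# Each row matches the "Combined Vector" column above:
# [0.4, 0.1, 0.4], [0.2, 0.6, 0.3], [0.4, 0.2, 0.2], [0.2, 0.3, 0.3], [0.0, 0.2, 0.4]
```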