Building a Tic Tac Toe Agent with Proximal Policy Optimization
Setting up the Tic Tac Toe Environment
I needed to create an environment that simulates Tic Tac Toe. I used a 3×3 NumPy array to represent the board:
- 0 means the cell is empty.
- 1 means the agent (X).
- -1 means the opponent (O).
The environment takes an action (which cell to put an X in), updates the board, then the opponent makes a random valid move. I also programmed it to check for wins, losses, draws, or invalid moves.
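The core step logic looks roughly like the sketch below. This is a simplified, illustrative version: the class name, the reward values (+1 for a win, -1 for a loss or invalid move, 0 otherwise), and the `(state, reward, done)` return format are placeholders rather than my exact implementation.

```python
import numpy as np

class TicTacToeEnv:
    """Illustrative sketch: the agent plays X (1) against a random-moving O (-1)."""

    WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
                 (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
                 (0, 4, 8), (2, 4, 6)]              # diagonals

    def reset(self):
        self.board = np.zeros((3, 3), dtype=np.int8)
        return self.board.copy()

    def _won(self, player):
        flat = self.board.reshape(-1)
        return any(all(flat[i] == player for i in line) for line in self.WIN_LINES)

    def step(self, action):
        flat = self.board.reshape(-1)                # view onto the 3x3 board
        if flat[action] != 0:                        # invalid move: penalize, end episode
            return self.board.copy(), -1.0, True
        flat[action] = 1                             # agent places X
        if self._won(1):
            return self.board.copy(), 1.0, True      # agent wins
        empty = np.flatnonzero(flat == 0)
        if empty.size == 0:
            return self.board.copy(), 0.0, True      # draw
        flat[np.random.choice(empty)] = -1           # opponent places O at random
        if self._won(-1):
            return self.board.copy(), -1.0, True     # agent loses
        return self.board.copy(), 0.0, not (flat == 0).any()
```

Ending the episode on an invalid move is one simple choice; the action masking described in the next section means the trained agent should rarely, if ever, pick one.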
Here’s how the empty board looks initially:
Initial State:
[[0 0 0]
[0 0 0]
[0 0 0]]
When I took an action (for example, placing an X in cell 1), the board updated like this:
State after taking action 1:
[[0 1 0]
[0 0 0]
[0 0 0]]
The Actor-Critic Neural Network
To teach the agent how to play, I built a neural network with an Actor-Critic architecture. The network does two things:
- The actor picks the next move.
- The critic estimates how good the current board state is (the expected return).
I fed the board (flattened into 9 numbers) into the network. The actor outputs scores (called logits) for each possible move, but since some cells might be taken, I masked out invalid moves so the agent never chooses them.
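Here's a minimal sketch of that architecture, assuming PyTorch; the layer sizes, hidden width, and the ActorCritic name are illustrative rather than my exact code:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    """Sketch: 9 board cells in, 9 masked move logits and 1 value estimate out."""

    def __init__(self, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(9, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, 9)    # one logit per cell
        self.critic = nn.Linear(hidden, 1)   # state-value estimate

    def forward(self, board):
        x = self.shared(board)
        logits = self.actor(x)
        # Mask occupied cells so invalid moves can never be sampled.
        logits = logits.masked_fill(board != 0, float('-inf'))
        dist = Categorical(logits=logits)
        return dist, self.critic(x).squeeze(-1)

# Example: sample a move for the empty board with an untrained network.
board = torch.zeros(1, 9)
dist, value = ActorCritic()(board)
action = dist.sample()
print(action.item(), dist.log_prob(action).item(), dist.entropy().item(), value.item())
```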
Here’s an example output from the network before training:
Selected action: 1 Log probability: -2.20 Entropy: 2.20 Value estimate: 0.007
This shows the network’s initial guess, the uncertainty in its choice, and how it values the current board.
How PPO Helps the Agent Learn
Proximal Policy Optimization (PPO) tweaks the agent's policy (its way of choosing moves) gently, avoiding big changes that can destabilize learning. It does this by clipping the ratio between the new and old policy's probability for each move, so that an update weighted by the advantage (basically, how much better a move turned out than expected) can only shift the policy a limited amount.
I also used Generalized Advantage Estimation (GAE) to compute smoother advantage values, balancing bias and variance.
PPO and GAE Equations
- Temporal Difference (TD) Residual: \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\)
- Generalized Advantage Estimation (GAE): \(\hat{A}_t = \delta_t + \gamma \lambda (1 - d_t) \hat{A}_{t+1}\),
  or equivalently: \(\hat{A}_t = \delta_t + (\gamma \lambda)\delta_{t+1} + (\gamma \lambda)^2\delta_{t+2} + \cdots + (\gamma \lambda)^{T-t-1}\delta_{T-1}\)
- Return Estimate for the Value Function: \(R_t = \hat{A}_t + V(s_t)\)
- PPO Clipped Objective: \(L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min \Big( r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t \Big) \right]\),
  where the probability ratio is \(r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\text{old}}(a_t | s_t)}\)
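Translated into code, the TD residual and the GAE recursion run backwards over a finished episode. Here's a sketch (assuming NumPy; the function name, the dones flag convention, and the default gamma and lambda values are illustrative):

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value=0.0, gamma=0.99, lam=0.95):
    """Backward pass over one rollout: advantages A_t and value targets R_t = A_t + V(s_t)."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    next_adv, next_value = 0.0, last_value
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]              # d_t = 1 when s_{t+1} is terminal
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V zeroed at episode end
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * (1 - d_t) * A_{t+1}
        next_adv = delta + gamma * lam * nonterminal * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    returns = advantages + np.asarray(values, dtype=np.float32)
    return advantages, returns
```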
Training the Agent
I trained the agent by letting it play many episodes against a random opponent.
After each episode, I gathered data about states, actions, rewards, and estimated values. Then I computed advantages and returns using GAE. Finally, I updated the neural network using the PPO loss function.
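The update itself is a direct translation of the clipped objective above. Here's a sketch of one PPO update, assuming PyTorch and a model with the same interface as the ActorCritic sketch earlier (the clip range, loss coefficients, number of epochs, and the advantage normalization are illustrative choices, not necessarily what I used):

```python
import torch
import torch.nn.functional as F

def ppo_update(model, optimizer, states, actions, old_log_probs, advantages, returns,
               clip_eps=0.2, value_coef=0.5, entropy_coef=0.01, epochs=4):
    """One PPO update over a collected batch (all batch arguments are torch tensors)."""
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    for _ in range(epochs):
        dist, values = model(states)                                # fresh policy and values
        ratio = torch.exp(dist.log_prob(actions) - old_log_probs)   # r_t(theta)
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
        policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
        value_loss = F.mse_loss(values, returns)                    # fit V(s_t) to R_t
        entropy = dist.entropy().mean()                             # keep exploring
        loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```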
Here’s a snippet of the training progress where I evaluated the agent’s win rate every 100 iterations:
| Iteration | Win Rate |
|---|---|
| 100 | 0.30 |
| 200 | 0.45 |
| 300 | 0.60 |
| … | … |
Slowly but surely, the agent learned to win more often!
Fig: Training Loss, Rewards, Entropy

Playing Against My Agent
Once trained, I added a simple interface to play against the agent myself. Here’s a sample interaction:
…
Your move (0–8): 4
Agent moves…
X . .
. O .
. . .
I enter a move by typing a number from 0 to 8 (mapping to board cells), and then the agent makes its move. This was a fun little project that I did on my vacation. I got my hands dirty with reinforcement learning, and it was really interesting. :)