A complete PyTorch implementation of the Deep Q-Network (DQN) algorithm from DeepMind's groundbreaking paper "Playing Atari with Deep Reinforcement Learning" (Mnih et al., 2013). This project successfully trains an AI agent to play Atari Breakout using deep reinforcement learning.
After 6 million training steps (~15 hours on GPU), the agent achieved the following over 10 evaluation episodes:
- Mean Score: 35.20 ± 3.60
- Max Score: 42.00 (the best single training episode reached 49.00)
- Min Score: 29.00
This represents a roughly 18-35x improvement over random play (typical score: ~1-2), demonstrating that the agent learned effective strategies for breaking bricks and maintaining ball control.
[Figures: Training Progress | Test Performance | Episode Details | Performance Summary]
The trained agent demonstrates strategic gameplay, effectively tracking the ball and positioning the paddle:
- 🎥 Gameplay Episode 2
- 🎥 Gameplay Episode 3
This implementation faithfully recreates the DQN algorithm that marked a breakthrough in deep reinforcement learning by combining:
- Deep Neural Networks for function approximation
- Q-Learning for value-based decision making
- Experience Replay for stable training
- Target Networks for convergence
The result is an agent that learns directly from raw pixel inputs, without any hand-crafted features or domain knowledge about the game.
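The Deep Q-Network uses the three-convolution architecture of the Nature DQN paper (Mnih et al., 2015); the 2013 paper used a smaller network with two convolutional layers (16 and 32 filters) and a 256-unit hidden layer: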
Input: 84×84×4 (4 stacked grayscale frames)
│
├─ Conv Layer 1: 32 filters, 8×8 kernel, stride 4, ReLU
├─ Conv Layer 2: 64 filters, 4×4 kernel, stride 2, ReLU
├─ Conv Layer 3: 64 filters, 3×3 kernel, stride 1, ReLU
│
├─ Flatten
│
├─ Fully Connected 1: 512 units, ReLU
└─ Fully Connected 2: 4 units (action outputs)
Total Parameters: ~1.69 million
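The parameter total can be checked by hand from the layer sizes listed above. The following is a small sketch in plain Python; the helper names `conv_out` and `count_dqn_params` are illustrative and are not taken from the notebook. It assumes unpadded convolutions and one bias per filter or unit.

```python
def conv_out(size, kernel, stride):
    """Output width/height of an unpadded convolution."""
    return (size - kernel) // stride + 1


def count_dqn_params(in_channels=4, frame_size=84, num_actions=4):
    """Return (flattened feature size, total trainable parameters)."""
    conv_layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]  # (filters, kernel, stride)
    total, channels, size = 0, in_channels, frame_size
    for filters, kernel, stride in conv_layers:
        total += channels * filters * kernel * kernel + filters  # weights + biases
        channels, size = filters, conv_out(size, kernel, stride)
    flat = channels * size * size              # 64 * 7 * 7
    total += flat * 512 + 512                  # Fully Connected 1
    total += 512 * num_actions + num_actions   # Fully Connected 2
    return flat, total


flat, total = count_dqn_params()
print(flat, total)  # 3136 1686180
```

Almost all of the weights (about 1.6 million of the 1,686,180) sit in the first fully connected layer, which is why the flattened size of 3136 matters so much for the total.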
RGB Frame (210×160×3)
↓
Grayscale Conversion
↓
Resize to 84×84
↓
Stack 4 consecutive frames
↓
84×84×4 tensor input to network
This preprocessing:
- Reduces dimensionality from 100,800 values per RGB frame (210×160×3) to 28,224 values per stacked state (84×84×4)
- Captures motion through frame stacking
- Maintains game state (ball velocity, paddle position)
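The pipeline above can be sketched with NumPy alone. This is only an illustration under stated assumptions: the names `preprocess` and `FrameStack` are hypothetical, the resize is a simple nearest-neighbour pick (a typical implementation would call `cv2.resize` instead), and the stacked state is stored channel-first as (4, 84, 84), which is the layout PyTorch convolutions expect.

```python
from collections import deque

import numpy as np


def preprocess(frame, size=84):
    """Convert one HxWx3 RGB frame to a size x size uint8 grayscale image."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)  # luminance weights
    gray = frame.astype(np.float32) @ weights
    # Nearest-neighbour resize (assumption; not necessarily what the notebook does).
    rows = np.arange(size) * gray.shape[0] // size
    cols = np.arange(size) * gray.shape[1] // size
    return gray[np.ix_(rows, cols)].astype(np.uint8)


class FrameStack:
    """Keeps the k most recent preprocessed frames as one state."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        processed = preprocess(frame)
        for _ in range(self.frames.maxlen):  # fill the stack with the first frame
            self.frames.append(processed)
        return self.state()

    def step(self, frame):
        self.frames.append(preprocess(frame))  # oldest frame drops out automatically
        return self.state()

    def state(self):
        return np.stack(list(self.frames), axis=0)  # shape (k, 84, 84)


stack = FrameStack()
first = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
state = stack.reset(first)
print(state.shape, state.dtype)
```

Keeping frames as `uint8` and converting to floats only when a batch is fed to the network keeps the stored state four times smaller than a `float32` copy would be.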
Experience replay memory:
- Capacity: 1,000,000 transitions
- Purpose: Break temporal correlations in training data
- Sampling: Random minibatches of 32 transitions
- Stores tuples: (state, action, reward, next_state, done)
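A circular replay buffer of this kind can be sketched as follows. This is a minimal illustration, not the notebook's class: the name `ReplayBuffer`, its fields, and the seeded generator are assumptions. It stores full `state` and `next_state` arrays for clarity, so the demo uses a small capacity and small states, and sampling is uniform with replacement.

```python
import numpy as np


class ReplayBuffer:
    """Fixed-size circular buffer of (state, action, reward, next_state, done)."""

    def __init__(self, capacity, state_shape=(4, 84, 84), seed=0):
        self.capacity, self.pos, self.size = capacity, 0, 0
        self.states = np.zeros((capacity, *state_shape), dtype=np.uint8)
        self.next_states = np.zeros((capacity, *state_shape), dtype=np.uint8)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.bool_)
        self.rng = np.random.default_rng(seed)

    def push(self, state, action, reward, next_state, done):
        i = self.pos
        self.states[i], self.next_states[i] = state, next_state
        self.actions[i], self.rewards[i], self.dones[i] = action, reward, done
        self.pos = (self.pos + 1) % self.capacity      # wrap around: overwrite the oldest
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size=32):
        idx = self.rng.integers(0, self.size, size=batch_size)  # uniform, with replacement
        return (self.states[idx], self.actions[idx], self.rewards[idx],
                self.next_states[idx], self.dones[idx])

    def __len__(self):
        return self.size


buf = ReplayBuffer(capacity=5, state_shape=(4, 8, 8))
for t in range(7):  # two more pushes than the capacity
    s = np.full((4, 8, 8), t, dtype=np.uint8)
    buf.push(s, action=t, reward=1.0, next_state=s, done=False)
states, actions, rewards, next_states, dones = buf.sample(batch_size=3)
print(len(buf), states.shape)
```

After seven pushes into a buffer of five, the two oldest transitions have been overwritten, which is the behaviour the 1,000,000-transition memory relies on during long runs.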
Target network:
- Update Frequency: Every 10,000 steps
- Purpose: Stabilize Q-value targets during training
- Mechanism: Periodic copy of policy network weights
Epsilon-greedy exploration:
- Initial ε: 1.0 (100% random actions)
- Final ε: 0.1 (10% random actions)
- Decay: Linear over 1,000,000 frames
- Strategy: Balance exploration vs. exploitation
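The schedule and the action rule can be written in a few lines. This sketch assumes the decay starts at step 0 and counts one frame per step; the function names `epsilon_at` and `select_action` are illustrative rather than the notebook's API.

```python
import random


def epsilon_at(step, eps_start=1.0, eps_end=0.1, decay_steps=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it constant."""
    fraction = min(max(step, 0) / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def select_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


for step in (0, 500_000, 1_000_000, 6_000_000):
    print(step, round(epsilon_at(step), 3))
```

Passing `epsilon=0.0` to `select_action` gives the purely greedy policy used for evaluation.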
for step in range(6_000_000):
    1. Select action using ε-greedy policy
    2. Execute action, observe reward and next state
    3. Store transition in replay memory
    if step > 50_000:  # After initial exploration
        4. Sample random minibatch (32 samples)
        5. Compute Q-learning targets using target network
        6. Perform gradient descent on policy network
    if step % 10_000 == 0:
        7. Update target network (copy policy weights)
    8. Decay epsilon linearly

The loss uses Huber loss (smooth L1 loss) for robustness to outlier TD errors:
target = reward + γ × max_a' Q_target(next_state, a')
loss = HuberLoss(Q_policy(state, action), target)

The hyperparameters follow the Nature DQN paper (Mnih et al., 2015), which added the target network to the original 2013 setup:
| Parameter | Value | Description |
|---|---|---|
| Replay Memory Size | 1,000,000 | Maximum transitions stored |
| Learning Starts | 50,000 | Random exploration before learning |
| Batch Size | 32 | Minibatch size for training |
| Discount Factor (γ) | 0.99 | Future reward importance |
| Target Update Freq | 10,000 | Steps between target updates |
| Learning Rate | 0.00025 | RMSprop learning rate |
| Optimizer | RMSprop | Adaptive learning rate method |
| Initial Epsilon | 1.0 | Starting exploration rate |
| Final Epsilon | 0.1 | Minimum exploration rate |
| Epsilon Decay | 1,000,000 | Frames for epsilon annealing |
| Frame Stack | 4 | Consecutive frames stacked |
| Frame Size | 84×84 | Preprocessed frame dimensions |
# Core dependencies
torch
gymnasium[atari]
gymnasium[accept-rom-license]
ale-py
# Utilities
numpy
opencv-python
matplotlib

- Clone the repository:
git clone https://github.com/Aafimalek/atari_rl
cd atari_rl
- Install dependencies (the extras are quoted so shells such as zsh do not expand the brackets):
pip install "gymnasium[atari]" "gymnasium[accept-rom-license]" ale-py
pip install torch opencv-python numpy matplotlib
- Open the notebook:
jupyter notebook atari-dqn.ipynb

Training a new agent:
# Configure environment
class Config:
    ENV_NAME = "ALE/Breakout-v5"  # Change to any Atari game
    TOTAL_TIMESTEPS = 6_000_000
    # ... other configs

# Train
config = Config()
agent, rewards, losses = train_dqn_resume(config)

Testing a trained agent:
# Load trained model
agent.load('dqn_final_6000000.pt')
# Test performance
test_rewards = test_agent(agent, num_episodes=10)

Changing the game:
# Available Atari games:
ENV_NAME = "ALE/Pong-v5"
ENV_NAME = "ALE/SpaceInvaders-v5"
ENV_NAME = "ALE/Seaquest-v5"
ENV_NAME = "ALE/MsPacman-v5"
# ... 50+ games available

atari_rl/
│
├── atari-dqn.ipynb  # Main training notebook
├── atari-dqn-test.ipynb  # Testing and evaluation
│
├── dqn_checkpoint_4200000.pt  # Checkpoint at 4.2M steps (19MB)
├── dqn_final_6000000.pt  # Final trained model (19MB)
│
├── training_results.png  # Training curves
├── test_results.png  # Test performance
├── detailed_ep.png  # Episode-level analysis
├── performance summary.png  # Overall performance metrics
├── final_summary.png  # Final results summary
├── compare.png  # Comparison visualizations
│
├── gameplay_episode_2.mp4  # Gameplay video #1
├── gameplay_episode_3.mp4  # Gameplay video #2
│
├── 1312.5602v1.pdf  # Original DQN paper
└── README.md  # This file
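import torch.nn as nn
import torch.nn.functional as F

class DQN(nn.Module):
    def __init__(self, input_channels, num_actions):
        super().__init__()
        self.conv1 = nn.Conv2d(input_channels, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc1 = nn.Linear(3136, 512)  # 64 channels × 7 × 7 after the conv stack
        self.fc2 = nn.Linear(512, num_actions)

    def forward(self, x):  # sketched from the layer list; x has shape (batch, 4, 84, 84)
        x = F.relu(self.conv3(F.relu(self.conv2(F.relu(self.conv1(x))))))
        x = F.relu(self.fc1(x.flatten(start_dim=1)))
        return self.fc2(x)  # one Q-value per action

Replay buffer:
- Circular buffer with 1M capacity
- Efficient random sampling
- Numpy-based for speed
Agent:
- Manages policy and target networks
- Implements epsilon-greedy action selection
- Handles Q-learning updates with Huber loss
- Checkpoint saving/loading
Environment wrapper:
- Frame preprocessing pipeline
- Frame stacking logic
- Action/observation handling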
Training statistics:
- Total Training Steps: 6,000,000
- Training Duration: ~15 hours on Kaggle GPU (T4)
- Episodes Completed: 1,689
- FPS: ~115 frames/second
- Final Epsilon: 0.1
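The duration and throughput figures above are consistent with each other, which a short calculation shows. The helper below is a hypothetical sketch (the name `eta_hours` is not from the notebook) and assumes that "FPS" means environment steps per second.

```python
def eta_hours(current_step, total_steps, fps):
    """Remaining wall-clock time in hours at a constant step rate."""
    return (total_steps - current_step) / fps / 3600


print(f"{eta_hours(0, 6_000_000, 115):.1f}")            # 14.5
print(f"{eta_hours(5_990_000, 6_000_000, 115.1):.2f}")  # 0.02
```

Six million steps at about 115 steps per second comes to roughly 14.5 hours, in line with the ~15 hours reported.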
Training phases:
- Random Exploration (0-50K steps): Fill replay memory
- Initial Learning (50K-1M steps): Epsilon decays to 0.1
- Policy Refinement (1M-6M steps): Stable policy improvement
| Step | Mean Reward | Max Reward | Notes |
|---|---|---|---|
| 0 | ~1.0 | ~2.0 | Random policy |
| 1M | ~8.0 | ~15.0 | Basic paddle control |
| 3M | ~20.0 | ~35.0 | Strategic positioning |
| 6M | ~35.2 | ~49.0 | Final policy of this run |
Models are saved:
- Every 100,000 steps during training
- At completion: dqn_final_6000000.pt
- Includes: network weights, optimizer state, step count, epsilon
This implementation is optimized for Kaggle notebooks:
- Upload notebook to Kaggle
- Enable GPU (Settings → Accelerator → GPU T4 x2)
- Run all cells
- Monitor progress (logs every 10K steps)
Kaggle-specific features:
- Auto-installs dependencies
- Checkpoint resume support
- Memory-efficient replay buffer
- Progress tracking with ETA
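Why memory efficiency matters for the replay buffer becomes clear from a rough size estimate. The sketch below is an illustration under assumptions: frames are stored as `uint8` (one byte per pixel), and the function name `replay_memory_gb` is hypothetical. It compares a naive layout that keeps full `state` and `next_state` stacks per transition with a layout that keeps each 84×84 frame only once; it does not describe how the notebook's buffer is actually laid out.

```python
def replay_memory_gb(capacity, frame_shape=(84, 84), stack=4, store_next_state=True):
    """Approximate frame storage in GB (10^9 bytes) for uint8 observations."""
    bytes_per_state = frame_shape[0] * frame_shape[1] * stack  # 1 byte per pixel
    copies = 2 if store_next_state else 1
    return capacity * bytes_per_state * copies / 1e9


naive = replay_memory_gb(1_000_000)                                    # full stacks, twice
frames_only = replay_memory_gb(1_000_000, stack=1, store_next_state=False)
halved = replay_memory_gb(500_000)
print(f"{naive:.1f} GB, {frames_only:.1f} GB, {halved:.1f} GB")  # 56.4 GB, 7.1 GB, 28.2 GB
```

Storing each frame once cuts the footprint by a factor of eight in this estimate, and halving the capacity halves it again.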
For faster experimentation:
# Quick test (2-3 hours)
TOTAL_TIMESTEPS = 1_000_000
# Medium test (8-10 hours)
TOTAL_TIMESTEPS = 3_000_000
# Extended training (roughly 24+ hours at the ~115 FPS reported for this run)
TOTAL_TIMESTEPS = 10_000_000

If running into memory issues:
# Reduce replay memory
REPLAY_MEMORY_SIZE = 500_000  # Default: 1M
# Reduce batch size
BATCH_SIZE = 16  # Default: 32

The implementation works with any Atari 2600 game. Popular choices:
| Game | Difficulty | Expected Performance |
|---|---|---|
| Breakout | Easy | 30-50 after 6M steps |
| Pong | Easy | 15-20 after 3M steps |
| Space Invaders | Medium | 500-1000 after 10M steps |
| Seaquest | Hard | Varies significantly |
| Ms. Pac-Man | Hard | Requires extended training |
Sample training log output:
Step: 5,990,000/6,000,000 (99.8%)
FPS: 115.1 | ETA: ~0.0 hours
Epsilon: 0.1000
Episodes: 1689
Mean Reward (last 10): 29.60
Max Reward: 49.00
Mean Loss: 0.0021
Memory Size: 1,000,000
Metrics tracked:
- Episode Rewards: Total score per game
- Training Loss: Q-learning TD error
- Epsilon Decay: Exploration rate over time
- FPS: Training speed
10 evaluation episodes after training:
Episode 1: Reward = 36.00
Episode 2: Reward = 36.00
Episode 3: Reward = 29.00
Episode 4: Reward = 42.00
Episode 5: Reward = 30.00
Episode 6: Reward = 36.00
Episode 7: Reward = 37.00
Episode 8: Reward = 33.00
Episode 9: Reward = 38.00
Episode 10: Reward = 35.00
Test Results:
Mean Reward: 35.20 ± 3.60
Min Reward: 29.00
Max Reward: 42.00
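The summary figures can be reproduced from the ten episode rewards listed above. This short check uses only the standard library; the ± value matches the population standard deviation (the default of `numpy.std`), which is an inference from the numbers rather than something the notebook states.

```python
import statistics

rewards = [36, 36, 29, 42, 30, 36, 37, 33, 38, 35]

mean = statistics.fmean(rewards)
std = statistics.pstdev(rewards)  # population standard deviation (divides by N)

print(f"Mean Reward: {mean:.2f} ± {std:.2f}")  # Mean Reward: 35.20 ± 3.60
print(f"Min Reward: {min(rewards):.2f}")       # Min Reward: 29.00
print(f"Max Reward: {max(rewards):.2f}")       # Max Reward: 42.00
```

With the sample standard deviation (dividing by N - 1) the spread would be about 3.79 instead, so it is worth stating which convention is used when comparing against other runs.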
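def test_agent(agent, num_episodes=10):
    """Test trained agent with greedy policy (ε = 0)."""
    # Assumes `env` was created earlier and follows the Gymnasium API:
    # reset() -> (obs, info), step() -> (obs, reward, terminated, truncated, info).
    rewards = []
    for episode in range(num_episodes):
        state, _ = env.reset()
        episode_reward, done = 0.0, False
        while not done:
            action = agent.select_action(state, evaluate=True)
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_reward += reward
        rewards.append(episode_reward)
        print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")
    return rewards

1. Installation Errors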
# If gymnasium[atari] fails:
pip install gymnasium ale-py --upgrade
pip install "gymnasium[accept-rom-license]"
# If ALE ROMs not found:
ale-import-roms --import-from-pkg atari_py.atari_roms

2. Out of Memory
# Reduce replay memory size
REPLAY_MEMORY_SIZE = 500_000  # or 250_000
# Use smaller batch size
BATCH_SIZE = 16

3. Slow Training
- Ensure GPU is enabled
- Check CUDA availability: torch.cuda.is_available()
- Monitor GPU usage: nvidia-smi
- Reduce frame preprocessing overhead

4. Agent Not Learning
- Wait until LEARNING_STARTS (50K steps)
- Verify epsilon is decaying
- Check loss values (should be > 0)
- Ensure replay memory is filling
The agent learns an action-value function Q(s, a) that estimates the expected return from taking action a in state s:
Q(s, a) = E[R_t + γ·R_{t+1} + γ²·R_{t+2} + ... | s_t = s, a_t = a]
The optimal policy is to always choose the action with highest Q-value:
π*(s) = argmax_a Q*(s, a)
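The quantity inside the expectation is the discounted return of one trajectory. As a small worked example (the function name `discounted_return` is illustrative), it can be computed backwards through the reward sequence using the recursion G_t = R_t + γ·G_{t+1}:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards discounted by gamma per time step, computed back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Rewards 1, 0, 1 give 1 + 0.99 * 0 + 0.99**2 * 1 = 1.9801
print(round(discounted_return([1, 0, 1]), 4))  # 1.9801
```

With γ = 0.99, a brick hit two steps in the future is worth almost as much as one hit now, so the agent is encouraged to keep the ball in play rather than chase only immediate points.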
DQN approximates Q(s, a) with a neural network Q(s, a; θ):
- Input: State (4 stacked frames)
- Output: Q-value for each action
- Training: Minimize TD error using gradient descent
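The TD error being minimized can be made concrete with a NumPy sketch of the target and the Huber loss. It is an illustration under assumptions, not the notebook's training step: the function names are hypothetical, `next_q_target` stands for the target network's outputs on the next states, and the `(1 - done)` mask, which stops bootstrapping at episode ends, is the standard convention and is assumed here.

```python
import numpy as np


def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """r + gamma * max_a' Q_target(s', a'), without bootstrapping on terminal steps."""
    return rewards + gamma * next_q_target.max(axis=1) * (1.0 - dones)


def huber_loss(pred, target, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it; averaged over the batch."""
    err = np.abs(pred - target)
    quadratic = np.minimum(err, delta)
    linear = err - quadratic
    return float(np.mean(0.5 * quadratic**2 + delta * linear))


rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])                  # second transition ends the episode
next_q_target = np.array([[0.5, 2.0, 1.0, 0.0],
                          [1.0, 3.0, 2.0, 0.5]])
q_taken = np.array([2.48, 3.0])               # Q_policy(state, action) for the actions taken

targets = td_targets(rewards, next_q_target, dones)
loss = huber_loss(q_taken, targets)
print(targets, loss)
```

Here the first target is 1 + 0.99 × 2.0 = 2.98 and the second is 0 because the episode ended. The first error (0.5) falls in the quadratic region and the second (3.0) in the linear region, which is what limits the influence of large errors on the gradient.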
Experience Replay breaks correlations:
- Consecutive frames are highly correlated
- Random sampling from the replay buffer provides approximately i.i.d. samples
- Enables multiple passes over rare transitions
Target Network stabilizes learning:
- Q-learning updates chase a moving target (bootstrap)
- Fixed target network provides stable regression targets
- Updated periodically (every 10K steps)
Frame Stacking captures motion:
- Single frame lacks velocity information
- 4 frames reveal ball direction and speed
- Essential for optimal decision-making
This implementation uses the original DQN formulation (Mnih et al., 2013; 2015). Several enhancements have since been proposed:
- Double DQN (2015): Reduces overestimation bias
- Dueling DQN (2016): Separate value/advantage streams
- Prioritized Replay (2016): Sample important transitions more
- Rainbow DQN (2017): Combines multiple improvements
- Noisy Networks (2017): Parameter noise for exploration
Ideas for further experiments:
- 🔹 Train on multiple games
- 🔹 Compare different hyperparameters
- 🔹 Implement curriculum learning
- 🔹 Add multi-step returns (n-step Q-learning)
- 🔹 Visualize learned features (activation maps)
- 🔹 Analyze failure cases
- Original DQN Paper: Mnih, V., et al. (2013). "Playing Atari with Deep Reinforcement Learning". [arXiv:1312.5602]
- Nature DQN Paper (Extended version): Mnih, V., et al. (2015). "Human-level control through deep reinforcement learning". Nature, 518(7540), 529-533.
- Double DQN: van Hasselt, H., Guez, A., & Silver, D. (2016). "Deep Reinforcement Learning with Double Q-learning". [arXiv:1509.06461]
- Dueling DQN: Wang, Z., et al. (2016). "Dueling Network Architectures for Deep Reinforcement Learning". [arXiv:1511.06581]
- Rainbow: Hessel, M., et al. (2017). "Rainbow: Combining Improvements in Deep Reinforcement Learning". [arXiv:1710.02298]
- CleanRL: Clean implementations of RL algorithms (github.com/vwxyzjn/cleanrl)
- OpenAI Baselines: Reference implementations (github.com/openai/baselines)
- Stable-Baselines3: Production-ready RL algorithms (github.com/DLR-RM/stable-baselines3)
- Sutton & Barto: "Reinforcement Learning: An Introduction" (free online)
- DeepMind x UCL Lecture Series: RL course videos
- OpenAI Spinning Up: RL education resource
Contributions are welcome! Areas for improvement:
- Implement Rainbow DQN enhancements
- Add TensorBoard logging
- Improve visualization tools
- Add more documentation
- Add unit tests
- Support more environments
This code is provided for educational purposes. Please cite the original DeepMind paper if you use this implementation in your research.
- DeepMind team for pioneering DQN research
- CleanRL for implementation inspiration
- Kaggle for providing free GPU resources
- Farama Foundation's Gymnasium for Atari environments
- The open-source RL community for resources and support
For questions, issues, or discussions:
- Open an issue on GitHub
- Check existing issues for solutions
- Refer to the original paper for algorithm details
🎮 Happy Training! 🤖
"The only way to do great work is to love what you do."



