Deep Q-Network (DQN) for Atari Breakout 🎮🤖

A complete PyTorch implementation of the Deep Q-Network (DQN) algorithm introduced in DeepMind's paper "Playing Atari with Deep Reinforcement Learning" (Mnih et al., 2013), with the network architecture and target network from the 2015 Nature version. This project trains an agent to play Atari Breakout directly from raw pixels.

📊 Results

After 6 million training steps (~15 hours on GPU), the agent achieved the following over 10 evaluation episodes:

  • Mean Score: 35.20 ± 3.60
  • Max Score: 42.00 (the best single training episode reached 49.00)
  • Min Score: 29.00

This is roughly an 18-35x improvement over random play (typical score: ~1-2), showing that the agent learned to keep the ball in play and break bricks consistently.

Performance Visualization

Figures: Training Progress, Test Performance, Episode Details, Performance Summary (see training_results.png, test_results.png, detailed_ep.png, and performance summary.png in the repository).

Gameplay Videos

The trained agent demonstrates strategic gameplay, effectively tracking the ball and positioning the paddle:


🎯 Project Overview

This implementation faithfully recreates the DQN algorithm that marked a breakthrough in deep reinforcement learning by combining:

  1. Deep Neural Networks for function approximation
  2. Q-Learning for value-based decision making
  3. Experience Replay for stable training
  4. Target Networks for convergence

The result is an agent that learns directly from raw pixel inputs, without any hand-crafted features or domain knowledge about the game.


๐Ÿ—๏ธ Architecture

Network Structure

The Deep Q-Network follows the architecture from the 2015 Nature DQN paper (the 2013 paper used a smaller network with two convolutional layers and a 256-unit hidden layer):

Input: 84×84×4 (4 stacked grayscale frames)
│
├─ Conv Layer 1:  32 filters, 8×8 kernel, stride 4, ReLU
├─ Conv Layer 2:  64 filters, 4×4 kernel, stride 2, ReLU
├─ Conv Layer 3:  64 filters, 3×3 kernel, stride 1, ReLU
│
├─ Flatten
│
├─ Fully Connected 1:  512 units, ReLU
└─ Fully Connected 2:  4 units (action outputs)

Total Parameters: ~1.69 million (1,686,180 with 4 actions)
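
The parameter total can be checked by hand from the layer list above. The sketch below is an illustrative calculation, not code from the notebook: the helper names `conv_params` and `linear_params` are made up for this example, and the flattened size of 3136 is taken from the `fc1` layer shown later in this README.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel for a square-kernel conv layer."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def linear_params(in_features, out_features):
    """Weights plus one bias per output unit for a fully connected layer."""
    return in_features * out_features + out_features


layer_params = {
    "conv1": conv_params(4, 32, 8),
    "conv2": conv_params(32, 64, 4),
    "conv3": conv_params(64, 64, 3),
    "fc1": linear_params(3136, 512),
    "fc2": linear_params(512, 4),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name}: {count:,}")
print(f"total: {total_params:,}")  # total: 1,686,180
```

Almost all of the weights sit in the first fully connected layer, so the total changes very little with the number of actions.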

Frame Preprocessing Pipeline

RGB Frame (210×160×3)
    ↓
Grayscale Conversion
    ↓
Resize to 84×84
    ↓
Stack 4 consecutive frames
    ↓
84×84×4 tensor input to network

This preprocessing:

  • Reduces the input from 100,800 values per RGB frame (210×160×3) to 28,224 values for the whole 4-frame stack (84×84×4)
  • Captures motion through frame stacking
  • Maintains game state (ball velocity, paddle position)
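
The stacking step can be sketched with a fixed-length queue. This is a minimal illustration, not the notebook's `AtariEnv` code: `FrameStack` is a hypothetical helper, the choice to repeat the first frame at episode start is an assumption, and short strings stand in for the 84×84 grayscale arrays.

```python
from collections import deque


class FrameStack:
    """Keeps the most recent `num_frames` preprocessed frames (hypothetical helper)."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # At episode start there is no history yet, so repeat the first frame.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(num_frames=4)
state = stack.reset("f0")
state = stack.push("f1")
print(state)  # ['f0', 'f0', 'f0', 'f1']
```

With real frames, the four entries would be stacked along the channel axis to form the 84×84×4 network input.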

🔬 Algorithm Details

Core Components

1. Experience Replay Memory

  • Capacity: 1,000,000 transitions
  • Purpose: Break temporal correlations in training data
  • Sampling: Random minibatches of 32 transitions

Stores tuples: (state, action, reward, next_state, done)
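
As a rough sketch of the idea (not the notebook's `ReplayMemory`, which is described later as a NumPy-based circular buffer), a fixed-capacity buffer with uniform sampling can be written with the standard library alone. The class name `SimpleReplayMemory` and the small capacity are assumptions for illustration.

```python
import random
from collections import deque


class SimpleReplayMemory:
    """Fixed-capacity buffer; the oldest transition is evicted when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within one minibatch.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


memory = SimpleReplayMemory(capacity=100)
for t in range(150):
    # Integers stand in for real observations.
    memory.push(t, t % 4, 0.0, t + 1, False)

states, actions, rewards, next_states, dones = memory.sample(32)
print(len(memory), len(states))  # 100 32
```

Because the buffer holds only the latest 100 pushes, every sampled state here comes from steps 50 to 149.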

2. Target Network

  • Update Frequency: Every 10,000 steps
  • Purpose: Stabilize Q-value targets during training
  • Mechanism: Periodic copy of policy network weights

3. Epsilon-Greedy Exploration

  • Initial ε: 1.0 (100% random actions)
  • Final ε: 0.1 (10% random actions)
  • Decay: Linear over 1,000,000 frames
  • Strategy: Balance exploration vs. exploitation
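
The schedule and the action rule above can be sketched in a few lines. The function names `linear_epsilon` and `epsilon_greedy` are illustrative and are not taken from the notebook; the default values mirror the settings listed above.

```python
import random


def linear_epsilon(step, eps_start=1.0, eps_end=0.1, decay_steps=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it constant."""
    fraction = min(max(step, 0) / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


for step in (0, 500_000, 1_000_000, 6_000_000):
    print(step, round(linear_epsilon(step), 3))

print(epsilon_greedy([0.1, 0.9, 0.3, 0.2], epsilon=0.0))  # 1
```

After the first million steps the agent keeps acting randomly 10% of the time for the rest of training.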

Training Process

for step in range(6_000_000):
    1. Select action using ε-greedy policy
    2. Execute action, observe reward and next state
    3. Store transition in replay memory

    if step > 50_000:  # After initial exploration
        4. Sample random minibatch (32 samples)
        5. Compute Q-learning targets using target network
        6. Perform gradient descent on policy network

    if step % 10_000 == 0:
        7. Update target network (copy policy weights)

    8. Decay epsilon linearly
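
To make the two conditions in the loop concrete, the sketch below counts how often each branch fires. `plan_step` is a hypothetical helper written for this example, and the run is shortened to 200,000 steps so it finishes quickly; the thresholds are the ones listed above.

```python
def plan_step(step, learning_starts=50_000, target_update_freq=10_000):
    """Return which optional actions the loop takes at a given step."""
    return {
        "learn": step > learning_starts,
        "sync_target": step % target_update_freq == 0,
    }


total_steps = 200_000  # shortened run for illustration
gradient_updates = sum(plan_step(s)["learn"] for s in range(total_steps))
target_syncs = sum(plan_step(s)["sync_target"] for s in range(total_steps))
print(gradient_updates, target_syncs)  # 149999 20
```

Under this reading of the loop there is one gradient update per environment step once learning has started, and the target network is refreshed 20 times in 200,000 steps (including the trivial copy at step 0).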

Loss Function

Uses Huber Loss (smooth L1 loss) for robustness to large TD errors. The bootstrap term is dropped on terminal transitions (done = 1):

target = reward + γ × (1 − done) × max_a' Q_target(next_state, a')
loss = HuberLoss(Q_policy(state, action), target)
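
For a single transition, the target and loss can be worked through in plain Python. This is a scalar sketch with illustrative function names, assuming terminal transitions do not bootstrap (the usual convention, and the reason `done` is stored) and a Huber threshold of delta = 1, which corresponds to the smooth L1 form.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target; no bootstrapping past a terminal state."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def huber_loss(prediction, target, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    error = abs(prediction - target)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)


target = td_target(reward=1.0, next_q_values=[0.5, 2.0, 1.0, 0.0], done=False)
print(round(target, 2))       # 2.98
print(huber_loss(0.0, 3.0))   # 2.5 (linear region)
print(huber_loss(1.0, 1.5))   # 0.125 (quadratic region)
```

The linear region caps the gradient magnitude at delta, which keeps a few very wrong Q-estimates from dominating an update.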

📈 Hyperparameters

All hyperparameters follow the 2015 Nature DQN paper:

| Parameter | Value | Description |
|---|---|---|
| Replay Memory Size | 1,000,000 | Maximum transitions stored |
| Learning Starts | 50,000 | Random exploration before learning |
| Batch Size | 32 | Minibatch size for training |
| Discount Factor (γ) | 0.99 | Future reward importance |
| Target Update Freq | 10,000 | Steps between target updates |
| Learning Rate | 0.00025 | RMSprop learning rate |
| Optimizer | RMSprop | Adaptive learning rate method |
| Initial Epsilon | 1.0 | Starting exploration rate |
| Final Epsilon | 0.1 | Minimum exploration rate |
| Epsilon Decay | 1,000,000 | Frames for epsilon annealing |
| Frame Stack | 4 | Consecutive frames stacked |
| Frame Size | 84×84 | Preprocessed frame dimensions |

🚀 Getting Started

Prerequisites

# Core dependencies
torch
gymnasium[atari]
gymnasium[accept-rom-license]
ale-py

# Utilities
numpy
opencv-python
matplotlib

Installation

  1. Clone the repository:
git clone https://github.com/Aafimalek/atari_rl
cd atari_rl
  2. Install dependencies (the quotes keep shells such as zsh from expanding the brackets):
pip install "gymnasium[atari]" "gymnasium[accept-rom-license]" ale-py
pip install torch opencv-python numpy matplotlib
  3. Open the notebook:
jupyter notebook atari-dqn.ipynb

Quick Start

Training a new agent:

# Configure environment
class Config:
    ENV_NAME = "ALE/Breakout-v5"  # Change to any Atari game
    TOTAL_TIMESTEPS = 6_000_000
    # ... other configs

# Train (train_dqn_resume is defined in the notebook)
config = Config()
agent, rewards, losses = train_dqn_resume(config)

Testing a trained agent:

# Load trained model
agent.load('dqn_final_6000000.pt')

# Test performance
test_rewards = test_agent(agent, num_episodes=10)

Changing the game:

# Available Atari games:
ENV_NAME = "ALE/Pong-v5"
ENV_NAME = "ALE/SpaceInvaders-v5"
ENV_NAME = "ALE/Seaquest-v5"
ENV_NAME = "ALE/MsPacman-v5"
# ... 50+ games available

📁 Project Structure

atari_rl/
│
├── atari-dqn.ipynb              # Main training notebook
├── atari-dqn-test.ipynb         # Testing and evaluation
│
├── dqn_checkpoint_4200000.pt    # Checkpoint at 4.2M steps (19MB)
├── dqn_final_6000000.pt         # Final trained model (19MB)
│
├── training_results.png         # Training curves
├── test_results.png             # Test performance
├── detailed_ep.png              # Episode-level analysis
├── performance summary.png      # Overall performance metrics
├── final_summary.png            # Final results summary
├── compare.png                  # Comparison visualizations
│
├── gameplay_episode_2.mp4       # Gameplay video #1
├── gameplay_episode_3.mp4       # Gameplay video #2
│
├── 1312.5602v1.pdf              # Original DQN paper
└── README.md                    # This file

💡 Implementation Highlights

Key Code Components

1. DQN Network (DQN class)

import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, input_channels, num_actions):
        super().__init__()
        self.conv1 = nn.Conv2d(input_channels, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc1 = nn.Linear(3136, 512)  # 3136 = 64 channels × 7 × 7 after conv3
        self.fc2 = nn.Linear(512, num_actions)
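
The 3136 input size of `fc1` follows from the convolution settings. The short calculation below is an illustrative check (the helper `conv_output_size` is not part of the notebook) and assumes no padding, which is the default for `nn.Conv2d`.

```python
def conv_output_size(size, kernel_size, stride):
    """Spatial size after a convolution with no padding."""
    return (size - kernel_size) // stride + 1


size = 84
for kernel_size, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel_size, stride)
    print(size)  # 20, then 9, then 7

flattened = 64 * size * size  # 64 output channels from the last conv layer
print(flattened)  # 3136
```

Changing the input resolution or any kernel or stride would change this number, so the `fc1` input size has to be updated along with it.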

2. Experience Replay (ReplayMemory class)

  • Circular buffer with 1M capacity
  • Efficient random sampling
  • Numpy-based for speed

3. DQN Agent (DQNAgent class)

  • Manages policy and target networks
  • Implements epsilon-greedy action selection
  • Handles Q-learning updates with Huber loss
  • Checkpoint saving/loading

4. Environment Wrapper (AtariEnv class)

  • Frame preprocessing pipeline
  • Frame stacking logic
  • Action/observation handling

📊 Training Details

Training Timeline

  • Total Training Steps: 6,000,000
  • Training Duration: ~15 hours on Kaggle GPU (T4)
  • Episodes Completed: 1,689
  • FPS: ~115 frames/second
  • Final Epsilon: 0.1

Training Phases

  1. Random Exploration (0-50K steps): Fill replay memory
  2. Initial Learning (50K-1M steps): Epsilon decays to 0.1
  3. Policy Refinement (1M-6M steps): Stable policy improvement

Performance Milestones

| Step | Mean Reward | Max Reward | Notes |
|---|---|---|---|
| 0 | ~1.0 | ~2.0 | Random policy |
| 1M | ~8.0 | ~15.0 | Basic paddle control |
| 3M | ~20.0 | ~35.0 | Strategic positioning |
| 6M | ~35.2 | ~49.0 | Best policy of this run |

Checkpointing

Models are saved:

  • Every 100,000 steps during training
  • At completion: dqn_final_6000000.pt
  • Includes: network weights, optimizer state, step count, epsilon

🔧 Customization

Training on Kaggle

This implementation is optimized for Kaggle notebooks:

  1. Upload notebook to Kaggle
  2. Enable GPU (Settings → Accelerator → GPU T4 x2)
  3. Run all cells
  4. Monitor progress (logs every 10K steps)

Kaggle-specific features:

  • Auto-installs dependencies
  • Checkpoint resume support
  • Memory-efficient replay buffer
  • Progress tracking with ETA

Reducing Training Time

For faster experimentation:

# Quick test (~2.5 hours at the ~115 steps/s reported above)
TOTAL_TIMESTEPS = 1_000_000

# Medium test (~7 hours)
TOTAL_TIMESTEPS = 3_000_000

# Extended training (~24 hours)
TOTAL_TIMESTEPS = 10_000_000
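
A quick way to budget a run is to divide the step count by the observed throughput. The helper below is a back-of-the-envelope sketch, assuming the ~115 steps per second reported for the T4 run stays constant; `estimated_hours` is an illustrative name, not a notebook function.

```python
def estimated_hours(total_steps, steps_per_second=115.0):
    """Wall-clock estimate, assuming throughput stays constant."""
    return total_steps / steps_per_second / 3600


for steps in (1_000_000, 3_000_000, 6_000_000, 10_000_000):
    print(f"{steps:>10,} steps: ~{estimated_hours(steps):.1f} h")
```

At that rate the 6M-step run comes out at about 14.5 hours, in line with the ~15 hours reported. Slower hardware or extra logging and checkpointing will stretch these figures.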

Memory Optimization

If running into memory issues:

# Reduce replay memory
REPLAY_MEMORY_SIZE = 500_000  # Default: 1M

# Reduce batch size
BATCH_SIZE = 16  # Default: 32
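
How much the replay size matters depends on how frames are stored, which this README does not specify. The estimate below is a sketch under two assumed layouts: one where each 84×84 uint8 frame is stored once and states are rebuilt from neighbouring frames, and a naive one that stores both 4-frame states for every transition. `replay_gib` is a hypothetical helper, and actions, rewards, and done flags are ignored as negligible.

```python
def replay_gib(capacity, frames_per_transition, frame_shape=(84, 84), bytes_per_pixel=1):
    """Approximate observation storage in GiB for a replay buffer."""
    height, width = frame_shape
    total_bytes = capacity * frames_per_transition * height * width * bytes_per_pixel
    return total_bytes / 1024 ** 3


for capacity in (1_000_000, 500_000, 250_000):
    shared = replay_gib(capacity, frames_per_transition=1)  # one new frame per step
    naive = replay_gib(capacity, frames_per_transition=8)   # state + next_state, 4 frames each
    print(f"{capacity:>9,} transitions: ~{shared:.1f} GiB shared, ~{naive:.1f} GiB naive")
```

Halving the replay size halves host memory in either layout. The batch size does not affect buffer memory; lowering it mainly reduces GPU memory per update.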

🎮 Supported Games

The implementation works with any Atari 2600 game. Popular choices:

| Game | Difficulty | Expected Performance |
|---|---|---|
| Breakout | Easy | 30-50 after 6M steps |
| Pong | Easy | 15-20 after 3M steps |
| Space Invaders | Medium | 500-1000 after 10M steps |
| Seaquest | Hard | Varies significantly |
| Ms. Pac-Man | Hard | Requires extended training |

📈 Monitoring Training

Real-time Metrics (every 10K steps)

Step: 5,990,000/6,000,000 (99.8%)
FPS: 115.1 | ETA: ~0.0 hours
Epsilon: 0.1000
Episodes: 1689
Mean Reward (last 10): 29.60
Max Reward: 49.00
Mean Loss: 0.0021
Memory Size: 1,000,000

Evaluation Metrics

  • Episode Rewards: Total score per game
  • Training Loss: Q-learning TD error
  • Epsilon Decay: Exploration rate over time
  • FPS: Training speed

🧪 Testing & Evaluation

Test Results

10 evaluation episodes after training:

Episode 1: Reward = 36.00
Episode 2: Reward = 36.00
Episode 3: Reward = 29.00
Episode 4: Reward = 42.00
Episode 5: Reward = 30.00
Episode 6: Reward = 36.00
Episode 7: Reward = 37.00
Episode 8: Reward = 33.00
Episode 9: Reward = 38.00
Episode 10: Reward = 35.00

Test Results:
Mean Reward: 35.20 ± 3.60
Min Reward: 29.00
Max Reward: 42.00
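
The summary can be reproduced from the ten episode rewards listed above. The check below uses only the standard library and shows that the reported spread is the population standard deviation (the default behaviour of `numpy.std`); the sample standard deviation would be about 3.79.

```python
from statistics import mean, pstdev

# Episode rewards copied from the evaluation log above.
episode_rewards = [36, 36, 29, 42, 30, 36, 37, 33, 38, 35]

print(f"Mean Reward: {mean(episode_rewards):.2f} ± {pstdev(episode_rewards):.2f}")
print(f"Min Reward: {min(episode_rewards):.2f}")
print(f"Max Reward: {max(episode_rewards):.2f}")
```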

Evaluation Protocol

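def test_agent(agent, num_episodes=10):
    """Test trained agent with greedy policy (ε = 0)"""
    # `env` is assumed to be the AtariEnv wrapper created earlier in the notebook,
    # with reset() -> state and step() -> (state, reward, done, info).
    rewards = []
    for episode in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        done = False  # reset the flag at the start of every episode

        while not done:
            action = agent.select_action(state, evaluate=True)
            state, reward, done, _ = env.step(action)
            episode_reward += reward

        rewards.append(episode_reward)
        print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")
    return rewards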

🛠️ Troubleshooting

Common Issues

1. Installation Errors

# If gymnasium[atari] fails:
pip install gymnasium ale-py --upgrade
pip install "gymnasium[accept-rom-license]"

# If ALE ROMs not found:
ale-import-roms --import-from-pkg atari_py.atari_roms

2. Out of Memory

# Reduce replay memory size
REPLAY_MEMORY_SIZE = 500_000  # or 250_000

# Use smaller batch size
BATCH_SIZE = 16

3. Slow Training

  • ✅ Ensure GPU is enabled
  • ✅ Check CUDA availability: torch.cuda.is_available()
  • ✅ Monitor GPU usage: nvidia-smi
  • ✅ Reduce frame preprocessing overhead

4. Agent Not Learning

  • ✅ Wait until LEARNING_STARTS (50K steps); no updates happen before then
  • ✅ Verify epsilon is decaying
  • ✅ Check loss values (they should be finite and nonzero, not NaN or exactly 0)
  • ✅ Ensure replay memory is filling

📚 Background & Theory

Q-Learning Fundamentals

The agent learns an action-value function Q(s, a) that estimates the expected return from taking action a in state s:

Q(s, a) = E[R_t + γ·R_{t+1} + γ²·R_{t+2} + ... | s_t = s, a_t = a]

The optimal policy always chooses the action with the highest optimal Q-value:

π*(s) = argmax_a Q*(s, a)
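
The quantity inside the expectation is a discounted sum of rewards. As a small worked example (the function name `discounted_return` is illustrative), it can be accumulated backwards over one observed reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**k * r_k over a reward sequence, accumulated backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# 1 + 0.5 * 0 + 0.25 * 1 = 1.25
print(discounted_return([1.0, 0.0, 1.0], gamma=0.5))  # 1.25
```

Q(s, a) is the average of this quantity over all trajectories that start with action a in state s, which is what the network is trained to predict.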

Deep Q-Learning

DQN approximates Q(s, a) with a neural network Q(s, a; θ):

  1. Input: State (4 stacked frames)
  2. Output: Q-value for each action
  3. Training: Minimize TD error using gradient descent

Why This Works

Experience Replay breaks correlations:

  • Consecutive frames are highly correlated
  • Random sampling from the replay buffer yields minibatches that are much closer to i.i.d.
  • Enables multiple passes over rare transitions

Target Network stabilizes learning:

  • Q-learning updates chase a moving target (bootstrap)
  • Fixed target network provides stable regression targets
  • Updated periodically (every 10K steps)

Frame Stacking captures motion:

  • Single frame lacks velocity information
  • 4 frames reveal ball direction and speed
  • Essential for optimal decision-making

🔬 Research Extensions

Possible Improvements

This implementation uses the original DQN algorithm (2013), together with the target network and architecture from the 2015 Nature version. Several enhancements have since been proposed:

  1. Double DQN (2015): Reduces overestimation bias
  2. Dueling DQN (2016): Separate value/advantage streams
  3. Prioritized Replay (2016): Sample important transitions more
  4. Rainbow DQN (2017): Combines multiple improvements
  5. Noisy Networks (2017): Parameter noise for exploration

Experimentation Ideas

  • 🔹 Train on multiple games
  • 🔹 Compare different hyperparameters
  • 🔹 Implement curriculum learning
  • 🔹 Add multi-step returns (n-step Q-learning)
  • 🔹 Visualize learned features (activation maps)
  • 🔹 Analyze failure cases

📖 References

Papers

  1. Original DQN Paper:
    Mnih, V., et al. (2013). "Playing Atari with Deep Reinforcement Learning"
    [arXiv:1312.5602]

  2. Nature DQN Paper (Extended version):
    Mnih, V., et al. (2015). "Human-level control through deep reinforcement learning"
    Nature, 518(7540), 529-533.

  3. Double DQN:
    van Hasselt, H., Guez, A., & Silver, D. (2016). "Deep Reinforcement Learning with Double Q-learning"
    [arXiv:1509.06461]

  4. Dueling DQN:
    Wang, Z., et al. (2016). "Dueling Network Architectures for Deep Reinforcement Learning"
    [arXiv:1511.06581]

  5. Rainbow:
    Hessel, M., et al. (2017). "Rainbow: Combining Improvements in Deep Reinforcement Learning"
    [arXiv:1710.02298]

Learning Resources

  • Sutton & Barto: "Reinforcement Learning: An Introduction" (free online)
  • DeepMind x UCL Lecture Series: RL course videos
  • OpenAI Spinning Up: RL education resource

🤝 Contributing

Contributions are welcome! Areas for improvement:

  • 🔧 Implement Rainbow DQN enhancements
  • 📊 Add TensorBoard logging
  • 🎨 Improve visualization tools
  • 📝 Add more documentation
  • 🧪 Add unit tests
  • 🎮 Support more environments

📝 License

This code is provided for educational purposes. Please cite the original DeepMind paper if you use this implementation in your research.


🙏 Acknowledgments

  • DeepMind team for pioneering DQN research
  • CleanRL for implementation inspiration
  • Kaggle for providing free GPU resources
  • Farama Foundation's Gymnasium (the successor to OpenAI Gym) for Atari environments
  • The open-source RL community for resources and support

📞 Contact & Support

For questions, issues, or discussions:

  • Open an issue on GitHub
  • Check existing issues for solutions
  • Refer to the original paper for algorithm details

🎮 Happy Training! 🤖

"The only way to do great work is to love what you do."
