COMP 579 — Reinforcement Learning

Contributors: Nicolas Smits, James Randolph (final project), Abdullah Paraha (final project)
Projects completed for COMP 579: Reinforcement Learning (McGill University).
This repository contains implementations and analyses spanning bandit algorithms, tabular RL, deep RL, and a final project on DQN variants for clinical decision-making.

🎰 A1 — Multi‑Armed Bandits

Topics: Exploration–exploitation, regret minimization, non‑stationary environments

This assignment introduced core bandit algorithms using simulated Gaussian bandits. The work compares learning rules, exploration strategies, and performance under both stationary and drifting reward distributions.

Included methods:

Incremental updates & fixed learning‑rate updates
Decaying‑α updates for non‑stationary problems
ϵ‑greedy (constant & decaying ϵ)
Gradient bandits
Thompson Sampling
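
As an illustration of the first three items, here is a minimal sketch of ϵ‑greedy action selection with an incremental value update on a simulated Gaussian bandit. It is not the assignment's code: the function name `run_epsilon_greedy`, the unit reward variance, and the default parameters are assumptions made for the example. Passing `alpha=None` gives the sample-average rule, while a float gives a fixed learning rate.

```python
import random

def run_epsilon_greedy(true_means, steps=1000, epsilon=0.1, alpha=None, seed=0):
    """Run epsilon-greedy on a Gaussian bandit with unit-variance rewards.

    alpha=None uses the sample-average (incremental) rule; a float uses a
    fixed learning rate, which weights recent rewards more heavily.
    """
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k        # action-value estimates
    counts = [0] * k     # number of pulls per arm
    rewards = []
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                     # explore
        else:
            arm = max(range(k), key=lambda a: q[a])    # exploit (ties go to lowest index)
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        step = alpha if alpha is not None else 1.0 / counts[arm]
        q[arm] += step * (reward - q[arm])             # incremental update
        rewards.append(reward)
    return q, counts, rewards

q, counts, rewards = run_epsilon_greedy([0.2, 0.8, 0.5])
```

A decaying ϵ could be obtained by recomputing `epsilon` from the step index inside the loop.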

Outputs:

Reward trajectories
Estimated vs. true action values
Instantaneous and cumulative regret plots
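
For reference, the regret quantities plotted can be computed as in the sketch below. It assumes a stationary bandit with known true means and measures expected regret (the gap between the best mean and the chosen arm's mean); `regret_curves` is a hypothetical helper, not a function from this repository. In a drifting environment the best mean would have to be recomputed at every step.

```python
def regret_curves(true_means, chosen_arms):
    """Return (instantaneous, cumulative) expected regret for a sequence of pulls."""
    best = max(true_means)
    instantaneous = [best - true_means[a] for a in chosen_arms]
    cumulative = []
    total = 0.0
    for r in instantaneous:
        total += r
        cumulative.append(total)
    return instantaneous, cumulative

# Three arms with means 1, 3, 2; the agent pulls arms 0, 1, 2, 1 in turn.
inst, cum = regret_curves([1.0, 3.0, 2.0], [0, 1, 2, 1])
```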

🧊 A2 — Tabular RL & MDP Analysis

Topics: SARSA, Expected SARSA, policy evaluation, small MDPs

This assignment explored tabular reinforcement learning on FrozenLake-v1, studied softmax (Boltzmann) exploration, and analyzed a simple three‑state MDP both analytically and numerically.
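
The difference between the two update targets under a Boltzmann policy can be sketched as follows. The function names are illustrative assumptions rather than names used in the assignment code. SARSA bootstraps from the single next action that was sampled, whereas Expected SARSA averages over the policy's action probabilities.

```python
import math

def boltzmann_probs(q_values, temperature):
    """Softmax over action values; a lower temperature gives a greedier policy."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def sarsa_target(reward, gamma, q_next, next_action, done):
    """SARSA bootstraps from the action actually sampled in the next state."""
    return reward + (0.0 if done else gamma * q_next[next_action])

def expected_sarsa_target(reward, gamma, q_next, temperature, done):
    """Expected SARSA averages next-state values under the Boltzmann policy."""
    if done:
        return reward
    probs = boltzmann_probs(q_next, temperature)
    return reward + gamma * sum(p * q for p, q in zip(probs, q_next))
```

In both cases the tabular update would then move `Q[s][a]` toward the target by a fraction equal to the learning rate.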

Key components:

SARSA vs. Expected SARSA with Boltzmann policies
Hyperparameter sweeps (temperature, learning rate)
Train/test reward curves
Closed‑form solution of Bellman equations
Numerical policy evaluation via iterative updates
Derivation of the optimal policy
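
The closed-form and iterative approaches to policy evaluation can be compared on a small example. The three-state chain below is hypothetical and is not the MDP from the assignment: the transition matrix `P`, reward vector `r`, and discount `gamma` are made-up values chosen only to show that solving the linear Bellman system and repeating Bellman backups arrive at the same value function.

```python
import numpy as np

# Hypothetical three-state chain under a fixed policy (NOT the assignment's MDP).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
r = np.array([0.0, 1.0, 0.0])     # expected one-step reward in each state
gamma = 0.9

# Closed form: solve (I - gamma * P) v = r.
v_exact = np.linalg.solve(np.eye(3) - gamma * P, r)

# Iterative policy evaluation: repeat the Bellman backup until it stops changing.
v = np.zeros(3)
for _ in range(10_000):
    v_new = r + gamma * P @ v
    converged = np.max(np.abs(v_new - v)) < 1e-12
    v = v_new
    if converged:
        break
```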

Training results: Tabular SARSA and Expected SARSA final training score vs learning rate and temperature.

Training results: Tabular SARSA and Expected SARSA rewards over training episodes.

Testing results: Tabular SARSA and Expected SARSA rewards over testing episodes.

🔥 A3 — Deep Reinforcement Learning

Topics: Value‑based & policy‑based deep RL

This assignment implemented neural approximators for both Q‑learning and policy gradients using environments such as Acrobot‑v1 and ALE/Assault‑ram‑v5.

Value‑Based Methods

Q‑learning with MLP approximators
Expected SARSA with function approximation
Replay buffers
Exploration strategies (ϵ‑greedy, decays)
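
A uniform replay buffer and a linear ϵ decay schedule, two of the components listed above, might look like the following sketch. The class and function names, the transition tuple layout, and the default schedule values are assumptions for illustration and may differ from the assignment's implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Sample uniformly without replacement and return column-wise tuples."""
        batch = self.rng.sample(list(self.buffer), batch_size)
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)

def linear_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from start to end over decay_steps, then hold at end."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

Sampling from the buffer rather than training on consecutive transitions reduces the correlation between samples in each minibatch.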

Policy‑Based Methods

REINFORCE
Actor–Critic (separate policy & value networks)
Experiments with Boltzmann exploration (fixed & decaying temperature)
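
The learning signals used by the two policy-based methods can be sketched without any network code. REINFORCE weights each log-probability gradient by the discounted return from that step, while a one-step actor-critic uses the TD error from the value network instead. The helper names below are hypothetical, and `values` stands in for the critic's predictions as a plain Python list.

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def td_errors(rewards, values, gamma):
    """One-step TD errors delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` holds one critic estimate per visited state; the value after
    the final step is taken to be 0, i.e. the episode is assumed to terminate.
    """
    next_values = list(values[1:]) + [0.0]
    return [r + gamma * nv - v for r, v, nv in zip(rewards, values, next_values)]
```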

Outputs

Training curves (mean ± std) across seeds
Comparisons of stability and convergence across architectures

Results

Training curves for the Deep Q-Learning and Expected SARSA models (both with replay buffers) on the Acrobot-v1 and ALE/Assault-ram-v5 Gymnasium environments for various learning rates α and exploration rates ϵ.

Training curves for the REINFORCE and Actor-Critic models on the Acrobot-v1 and ALE/Assault-ram-v5 Gymnasium environments with fixed and decaying temperature.

🏥 Final Project — DQN Variants for ICU Sepsis Treatment

Environment: ICU‑Sepsis‑v2 — a clinically inspired RL benchmark built from MIMIC‑III data.

This project evaluated whether modern DQN extensions outperform the baseline DQN on sepsis‑treatment decision‑making. The MDP includes 716 health states and 25 discrete treatment actions involving fluid and vasopressor dosing.

Algorithms Implemented (only those completed by Nicolas Smits)

Base DQN (MLP + replay + target networks)
Prioritized DQN (TD‑error‑based sampling)
Double DQN (reduced overestimation bias)
Dueling DQN (value + advantage streams)
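
The core idea behind each of the three extensions fits in a few lines of NumPy. This is a sketch of the standard formulations, not the project's network code: the function names, array shapes, and the defaults `alpha=0.6` and `eps=1e-6` are assumptions for illustration.

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine the two streams: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).

    value has shape (batch, 1); advantages has shape (batch, n_actions).
    """
    return value + advantages - advantages.mean(axis=-1, keepdims=True)

def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """The online network picks the next action; the target network evaluates it."""
    best = np.argmax(q_online_next, axis=-1)
    evaluated = np.take_along_axis(q_target_next, best[..., None], axis=-1).squeeze(-1)
    return reward + gamma * (1.0 - done) * evaluated

def priority_probs(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|TD error| + eps) ** alpha."""
    p = (np.abs(td_errors) + eps) ** alpha
    return p / p.sum()
```

For ICU‑Sepsis‑v2, `n_actions` would be 25, matching the discrete treatment actions described above.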

(Rainbow, Distributional DQN, Noisy DQN, and Multi‑Step DQN were implemented by teammates. They are mentioned for completeness but are not included in this repo.)

Key Findings

Dueling DQN achieved the strongest final return (~0.8238).
Prioritized DQN converged faster but plateaued slightly lower (~0.8135).
Baselines (standard DQN, Double DQN) reached final returns between ~0.8056 and ~0.8089.
More complex methods (e.g., Rainbow) were highly sensitive to hyperparameters, and the hyperparameter search here was limited.

Conclusion:
Architectural improvements such as dueling networks and prioritized replay gave the highest final returns in this simulated ICU setting, although the margins over the baselines were small (about 0.005 to 0.018 in final return). Deeper hyperparameter tuning may unlock stronger performance for the more complex variants.

Survival rate and episode length results over training episodes.

📂 Repository Structure

COMP-579-Reinforcement-Learning/
│
├── A1/
│   └── ... bandit algorithms & analysis
├── A2/
│   └── ... tabular RL & MDP work
├── A3/
│   └── ... deep RL implementations (value & policy based)
├── Final Project/
│   └── ... DQN variants for ICU-Sepsis-v2
└── README.md
