Most reinforcement learning methods rely on dense, well-shaped rewards, which are often unavailable, biased, or expensive to engineer. Reward-free exploration (RFE) instead learns good state representations or exploratory policies without using rewards, and only introduces rewards later for downstream tasks. However, most existing RFE frameworks assume a static environment. In realistic settings, the data distribution can drift over time: goals move, dynamics change, or noise increases. This project explores how reward-free methods behave under such distributional drift.
- Build a synthetic MDP (GridWorld) with tunable drift.
- Pretrain reward-free representations using UCRL-RFE and baselines.
- Introduce downstream tasks with rewards after pretraining.
- Compare how quickly and how robustly different representations adapt under drift.
Environment: Drift-Enabled GridWorld
Implement a small GridWorld-style MDP with:
- Drift strength: how much the transition or reward structure changes.
- Drift schedule: when drift happens (e.g., sudden jump, gradual shift, periodic).
- Examples:
  - Shifting goal locations.
  - Changing transition noise.
  - Altering blocked cells or wall layouts.
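
The drift controls above can be sketched in a few lines. This is a minimal illustration and not the repository's `DriftGridWorld`: the names `drift_level` and `TinyDriftGrid`, the `ramp` parameter, and the choice to let drift both add transition noise and relocate the goal are assumptions made for the example.

```python
import random


def drift_level(step, drift_time, strength, schedule="sudden", ramp=100):
    """Return the drift magnitude in [0, strength] at a given step.

    Hypothetical helper: the three schedules mirror the sudden, gradual,
    and periodic cases listed above.
    """
    if schedule == "sudden":
        return strength if step >= drift_time else 0.0
    if schedule == "gradual":
        # Linear ramp from 0 to `strength` over `ramp` steps after drift_time.
        progress = (step - drift_time) / ramp
        return strength * min(1.0, max(0.0, progress))
    if schedule == "periodic":
        # Alternate between original and drifted every `drift_time` steps.
        return strength if (step // drift_time) % 2 == 1 else 0.0
    raise ValueError(f"unknown schedule: {schedule}")


class TinyDriftGrid:
    """Illustrative grid whose transition noise and goal depend on drift."""

    MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up

    def __init__(self, size=5, drift_time=200, strength=0.7, seed=0):
        self.size = size
        self.drift_time = drift_time
        self.strength = strength
        self.rng = random.Random(seed)
        self.t = 0
        self.pos = (0, 0)  # (row, column)

    def level(self):
        return drift_level(self.t, self.drift_time, self.strength)

    def goal(self):
        # Assumed drift effect: the goal jumps from the bottom-right
        # corner to the top-right corner once drift is active.
        last = self.size - 1
        return (last, last) if self.level() == 0.0 else (0, last)

    def step(self, action):
        # Assumed noise model: with probability equal to the drift level,
        # the chosen action is replaced by a uniformly random one.
        if self.rng.random() < self.level():
            action = self.rng.randrange(4)
        dr, dc = self.MOVES[action]
        r = min(self.size - 1, max(0, self.pos[0] + dr))
        c = min(self.size - 1, max(0, self.pos[1] + dc))
        self.pos = (r, c)
        self.t += 1
        return self.pos
```

Keeping the schedule in a standalone function means the same environment class can be reused for sudden, gradual, and periodic experiments by swapping one argument.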
Reward-Free Exploration with UCRL-RFE
Apply a reward-free exploration algorithm (e.g., UCRL-RFE) to:
- Collect trajectories without rewards.
- Maximize state coverage.
- Produce a replay buffer or dataset for later representation learning.
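
As a rough illustration of the collection step, the sketch below uses a greedy count-based rule (move toward the least-visited neighbouring cell) as a simplified stand-in for UCRL-RFE, which plans with model-based exploration bonuses rather than a one-step lookahead. The deterministic grid, the use of the true transition function for the lookahead, the function names, and the `(state, action, next_state)` buffer format are all assumptions.

```python
import random

MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up


def move(state, action, size):
    """Deterministic grid transition, clipped at the walls."""
    r, c = state
    dr, dc = MOVES[action]
    return (min(size - 1, max(0, r + dr)), min(size - 1, max(0, c + dc)))


def explore(size=5, num_steps=500, seed=0):
    """Collect reward-free transitions with a least-visited-neighbour rule."""
    rng = random.Random(seed)
    state = (0, 0)
    counts = {state: 1}  # visit counts per state
    buffer = []          # (state, action, next_state); no rewards stored
    for _ in range(num_steps):
        # Score each action by how often its successor has been visited.
        scores = [counts.get(move(state, a, size), 0) for a in range(4)]
        best = min(scores)
        action = rng.choice([a for a in range(4) if scores[a] == best])
        nxt = move(state, action, size)
        buffer.append((state, action, nxt))
        counts[nxt] = counts.get(nxt, 0) + 1
        state = nxt
    coverage = len(counts) / (size * size)
    return buffer, coverage
```

The returned buffer can be passed unchanged to representation training, and `coverage` gives the fraction of cells visited at least once.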
Representation Pretraining
Train two families of state encoders:
- Fixed-environment encoder
  - Pretrained on data from a single (or early) environment configuration.
  - Ignores later drift during pretraining.
- Drift-aware encoder
  - Pretrained across time with drifting dynamics.
  - May condition on time, drift index, or inferred context.
  - Goal: learn stable or adaptable features under nonstationarity.
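
The difference between the two families lies mainly in what the encoder is allowed to see. The sketch below makes that concrete with hand-written features instead of learned networks: `encode_fixed` and `encode_drift_aware` are hypothetical names, and the one-hot drift-phase input is just one possible way to condition on a drift index.

```python
def encode_fixed(state, size):
    """Features from position only: normalised row and column."""
    r, c = state
    scale = max(1, size - 1)
    return [r / scale, c / scale]


def encode_drift_aware(state, size, drift_index, num_phases):
    """Position features plus a one-hot drift-phase indicator."""
    if not 0 <= drift_index < num_phases:
        raise ValueError("drift_index out of range")
    context = [1.0 if i == drift_index else 0.0 for i in range(num_phases)]
    return encode_fixed(state, size) + context


# The same cell maps to one vector for the fixed encoder, but to
# different vectors before and after drift for the drift-aware one.
before = encode_drift_aware((4, 2), size=5, drift_index=0, num_phases=2)
after = encode_drift_aware((4, 2), size=5, drift_index=1, num_phases=2)
```

A learned encoder would replace these fixed features with a trained network, but the input signature would differ between the two families in the same way.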
Downstream Rewarded Tasks
After pretraining:
- Introduce explicit reward functions.
- Train simple RL agents.
- Evaluate:
  - Learning speed
  - Final performance (reward earned)
  - Representation stability
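
A downstream agent and a learning-speed measurement could look roughly like this. The sketch uses plain tabular Q-learning on raw grid states, so it does not consume a pretrained encoder. The reward values (1.0 at the goal, -0.01 per step), the hyperparameters, and the helper `episodes_to_threshold` are assumptions chosen for illustration.

```python
import random

MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up


def q_learning(size=4, goal=(3, 3), episodes=200, max_steps=50,
               alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Q-learning on a deterministic grid; returns episode returns."""
    rng = random.Random(seed)
    q = {}  # maps (state, action) to an estimated value, default 0.0
    returns = []
    for _episode in range(episodes):
        state, total = (0, 0), 0.0
        for _step in range(max_steps):
            values = [q.get((state, a), 0.0) for a in range(4)]
            if rng.random() < epsilon:
                action = rng.randrange(4)
            else:
                # Greedy action with random tie-breaking.
                best = max(values)
                action = rng.choice([a for a in range(4) if values[a] == best])
            dr, dc = MOVES[action]
            nxt = (min(size - 1, max(0, state[0] + dr)),
                   min(size - 1, max(0, state[1] + dc)))
            done = nxt == goal
            reward = 1.0 if done else -0.01
            # Standard one-step Q-learning target.
            target = reward
            if not done:
                target += gamma * max(q.get((nxt, a), 0.0) for a in range(4))
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (target - old)
            total += reward
            state = nxt
            if done:
                break
        returns.append(total)
    return returns


def episodes_to_threshold(returns, threshold, window=10):
    """Episodes needed before the moving-average return reaches threshold."""
    for i in range(window, len(returns) + 1):
        if sum(returns[i - window:i]) / window >= threshold:
            return i
    return None  # threshold never reached
```

Comparing `episodes_to_threshold` across encoders gives one concrete reading of "learning speed", while the mean of the last few returns gives "final performance".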
Baselines and Comparisons
Metrics to track:
- Coverage of state space over time.
- Downstream sample efficiency.
- Performance drop when drift occurs.
- Representation similarity / drift across environments.
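
For reference, three of these metrics can be computed as follows. The function names, the windowed definition of performance drop, and the use of cosine similarity for representation comparison are assumptions; the utilities in the repository may define them differently.

```python
import math


def coverage_curve(visited_states, num_states):
    """Fraction of distinct states seen after each step of a trajectory."""
    seen, curve = set(), []
    for s in visited_states:
        seen.add(s)
        curve.append(len(seen) / num_states)
    return curve


def performance_drop(returns, drift_episode, window=20):
    """Mean return in the window before drift minus the window after it."""
    before = returns[max(0, drift_episode - window):drift_episode]
    after = returns[drift_episode:drift_episode + window]
    if not before or not after:
        raise ValueError("need episodes on both sides of the drift point")
    return sum(before) / len(before) - sum(after) / len(after)


def cosine_similarity(u, v):
    """Cosine similarity between two representation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0
```

A positive `performance_drop` means the agent earned less after the drift point; values near zero indicate a representation that transferred well.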
pip install -r requirements.txt
python run.py

This runs the complete pipeline:
- Reward-free exploration
- Representation training (fixed and drift-aware encoders)
- Downstream RL training
- Evaluation with all metrics
Results are saved to results/final_results.png
rfe-drift-gridworld/
├── rfe_drift/ # Core implementation
│ ├── env/ # DriftGridWorld environment
│ ├── exploration/ # UCRL-RFE algorithm
│ ├── representations/ # State encoders
│ ├── rl/ # RL agents (Q-Learning, DQN)
│ └── utils/ # Metrics, rewards, visualization
├── run.py # Main script
├── requirements.txt # Dependencies
└── README.md # This file
- State Coverage: Fraction of state space explored during RFE
- Sample Efficiency: Learning speed in downstream tasks
- Performance Drop: Reward before vs after drift (key metric!)
- Robustness: How well agents adapt to environmental changes
Edit the CONFIG dict in run.py to adjust parameters:
CONFIG = {
    "grid_size": 10,                 # Grid dimensions
    "drift_strength": 0.7,           # How much environment changes (0-1)
    "drift_time": 200,               # When drift occurs (in steps)
    "num_exploration_steps": 10000,  # Reward-free exploration
    "num_train_episodes": 300,       # Downstream RL training
    "num_eval_episodes": 50,         # Evaluation episodes
}
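
When editing these values, a small sanity check can catch out-of-range settings before a long run. The `validate_config` helper below is hypothetical and not part of `run.py`; the ranges it enforces are inferred from the comments in the dict above.

```python
CONFIG = {
    "grid_size": 10,
    "drift_strength": 0.7,
    "drift_time": 200,
    "num_exploration_steps": 10000,
    "num_train_episodes": 300,
    "num_eval_episodes": 50,
}


def validate_config(cfg):
    """Raise ValueError if a setting is outside its documented range."""
    if not 0.0 <= cfg["drift_strength"] <= 1.0:
        raise ValueError("drift_strength must be in [0, 1]")
    for key in ("grid_size", "num_exploration_steps",
                "num_train_episodes", "num_eval_episodes"):
        if not isinstance(cfg[key], int) or cfg[key] <= 0:
            raise ValueError(f"{key} must be a positive integer")
    if cfg["drift_time"] < 0:
        raise ValueError("drift_time must be non-negative")


validate_config(CONFIG)  # passes silently for the default values
```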