Skip to content

sevba/nim-q-learning

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

1 Commit
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Nim Misรจre Q-Learning Agent

A reinforcement learning agent that learns to play Nim Misรจre (reverse Nim) through self-play using the Q-Learning algorithm with reward shaping. Built in TypeScript with no external ML libraries - pure implementation of advanced Q-learning techniques.

๐ŸŽฎ What is Nim Misรจre?

Nim Misรจre is a variant of the classic game of Nim with a twist:

  • Setup: Start with 4 piles of stones: [1, 3, 5, 7]
  • Gameplay: Players alternate turns, removing any number of stones from a single pile
  • Win Condition: The player who takes the LAST stone LOSES (this is the "misรจre" twist)

This variant makes strategy somewhat more complex than standard Nim, where taking the last stone wins.

๐Ÿง  Q-Learning with Reward Shaping

This project implements an advanced Q-Learning algorithm with reward shaping for faster convergence:

Q-Learning Formula:

Q(s,a) โ† Q(s,a) + ฮฑ[r + ฮณยทmax_a' Q(s',a') - Q(s,a)]

Current Parameters (Optimized for Reward Shaping):

  • ฮฑ (alpha) = 0.05: Learning rate (lower for stability)
  • ฮณ (gamma) = 0.99: Discount factor (higher for long-term strategy)
  • ฮต (epsilon) = 1.0 โ†’ 0.05: Exploration rate with slower decay
  • Epsilon decay = 0.99995: Maintains exploration longer
  • Min epsilon = 0.05: Maintains 5% exploration throughout training

๐ŸŽฏ Reward Shaping Innovation

Unlike traditional Q-learning that only rewards winning/losing, this implementation uses Nim-sum based reward shaping:

Traditional Approach:

Move 1-5: reward = 0 (no feedback)
Final move: reward = ยฑ1 (win/loss)

Our Approach with Reward Shaping:

Each move: reward = 0.5 if Nim-sum = 0 (good strategic position)
           reward = 0 if Nim-sum โ‰  0 (suboptimal)
Final move: reward = ยฑ1 (win/loss)

This provides immediate feedback on every move, teaching the agent the optimal Nim-sum strategy implicitly through reinforcement rather than explicit programming.

๐Ÿ“ Project Structure

nim-q-learning/
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ game/
โ”‚   โ”‚   โ””โ”€โ”€ NimGame.ts          # Game state, rules, and Nim-sum calculation
โ”‚   โ”œโ”€โ”€ agent/
โ”‚   โ”‚   โ””โ”€โ”€ QLearningAgent.ts   # Q-learning + file I/O
โ”‚   โ”œโ”€โ”€ training/
โ”‚   โ”‚   โ””โ”€โ”€ trainer.ts          # Self-play with reward shaping
โ”‚   โ””โ”€โ”€ demo/
โ”‚       โ””โ”€โ”€ play.ts             # Human vs Agent gameplay
โ”œโ”€โ”€ models/
โ”‚   โ”œโ”€โ”€ agent1.json         # Saved trained models
โ”‚   โ””โ”€โ”€ agent2.json
โ”œโ”€โ”€ CLAUDE.md                    # Architecture documentation
โ”œโ”€โ”€ README.md
โ”œโ”€โ”€ package.json
โ””โ”€โ”€ tsconfig.json

๐Ÿš€ Getting Started

Prerequisites

  • Node.js (v16 or higher)
  • npm or yarn

Installation

# Install dependencies
npm install

# Build the project
npm run build

๐Ÿ“Š Training the Agent

Train the agent through self-play with reward shaping:

npm run train

Training Output:

โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘         Nim Misรจre Q-Learning Training System            โ•‘
โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

๐Ÿค– Initializing agents...
โœ… Agents created with parameters:
   Learning rate (ฮฑ): 0.05
   Discount factor (ฮณ): 0.99
   Initial exploration (ฮต): 1.0
   Epsilon decay: 0.99995
   Min epsilon: 0.05
   ๐ŸŽฏ Reward Shaping: ENABLED (Nim-sum based)

๐ŸŽฎ Starting Training...

๐Ÿ“Š Progress Report - Game 500000/5000000
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
โฑ๏ธ  Time: 125.3s | Speed: 3990 games/s
๐Ÿ† Agent 1 wins: 251234 (50.2%)
๐Ÿ† Agent 2 wins: 248766 (49.8%)
๐Ÿ“ˆ Avg game length: 8.1 moves
๐ŸŽฒ Epsilon: Agent1=0.0821 | Agent2=0.0821
๐Ÿ—‚๏ธ  Q-table: Agent1=2847 states | Agent2=2891 states
๐Ÿ”„ Updates: Agent1=4052341 | Agent2=4048123

Training Configuration:

  • Default: 5,000,000 games (adjustable in src/training/trainer.ts)
  • Progress reports: Every 500,000 games
  • Training time: ~2-3 minutes for 5M games (on Apple M3)
  • Alternating starts: Agents swap starting position each game for balanced learning
  • Auto-save: Models saved to models/agent1.json and models/agent2.json

Adjusting Training Duration

Edit src/training/trainer.ts:

const numGames = 5000000;  // Try 100000 (quick), 1000000 (standard), or 10000000 (expert)
const reportInterval = 500000;  // Adjust based on numGames

๐ŸŽฏ Playing Against the Agent

Play against the trained agent:

npm run play

Demo Mode Features:

  • Load pre-trained models from file
  • Quick training mode (30,000 games)
  • Interactive console interface with visual pile representation
  • Choose who starts each game (you or agent)
  • Agent shows its Q-values for transparency
  • Automatic winner detection (no manual input needed)
  • Score tracking across multiple games

Agent Setup Options:

๐Ÿค– Agent Setup Options:
   1. Load pre-trained agent from file (default)
   2. Train a new agent (quick training)
   3. Use untrained agent (random play)

Choose option (1/2/3, default: 1): 1

๐Ÿ“‚ Attempting to load from: models/agent1.json
โœ… Q-table loaded from: models/agent1.json
   States: 2847
   Q-values: 18234
   Epsilon: 0.0500

Example Gameplay:

โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Who should start this game? (you/agent, default: you): you
๐Ÿ‘ค You will start this game.

๐ŸŽฎ Starting new game!
You are Player 0, Agent is Player 1

โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
๐ŸŽฎ Nim Misรจre - Remember: Taking the LAST stone LOSES!
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

Current piles:
  Pile 0: [ 1] โ—
  Pile 1: [ 3] โ—โ—โ—
  Pile 2: [ 5] โ—โ—โ—โ—โ—
  Pile 3: [ 7] โ—โ—โ—โ—โ—โ—โ—

Total stones remaining: 16
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

๐Ÿ‘ค Your turn!
Choose pile (0-3): 1
How many stones? (1-3): 2

โœ… You removed 2 stone(s) from pile 1

๐Ÿค– Agent is thinking...
๐ŸŽฏ Agent chose: Remove 1 stone(s) from pile 2
   (Q-value: 0.8734)

๐Ÿ’พ Model Persistence

Saving Models

Models are automatically saved after training to models/ directory:

  • agent1.json - First agent's Q-table
  • agent2.json - Second agent's Q-table

Loading Models

The demo automatically loads agent1.json when you choose option 1.

Manual Save/Load

import { QLearningAgent } from './src/agent/QLearningAgent';

// Save a trained agent
const agent = new QLearningAgent();
// ... train the agent ...
agent.saveToFile('models/my-expert-agent.json');

// Load from file
const loadedAgent = QLearningAgent.fromFile('models/my-expert-agent.json');

// Or load into existing agent
agent.loadFromFile('models/my-expert-agent.json');

๐Ÿ” How Reward Shaping Works

Mid-Game Rewards (piles with >1 stones)

if (nextNimSum === 0) {
    reward = 0.5;  // Good move! Left opponent in losing position
} else {
    reward = 0;    // Suboptimal move
}

Example:

Before: [1,3,5,7] โ†’ Nim-sum = 0
Move: Take 2 from pile 1
After: [1,1,5,7] โ†’ Nim-sum = 2

Reward = 0 (you gave opponent advantage)

Better move: Take 1 from pile 2
After: [1,2,5,7] โ†’ Nim-sum = 1
Still need to find the move that makes Nim-sum = 0...

Agent learns through these immediate rewards!

Endgame Rewards (all piles โ‰ค1)

const onesCount = piles.filter(p => p === 1).length;
if (onesCount % 2 === 1) {  // Odd number of 1s
    reward = 0.5;  // Good position for Misรจre
} else {
    reward = 0;
}

Optimal Strategy (Nim-Sum)

The mathematically optimal strategy uses XOR (Nim-sum):

Nim-sum = pile[0] โŠ• pile[1] โŠ• pile[2] โŠ• pile[3]

Example: [1,3,5,7]
Binary:   0001
          0011
          0101
        โŠ• 0111
        ------
          0000 = 0

Nim-sum = 0 โ†’ Losing position for current player

Strategy:

  • If Nim-sum = 0: You're in a losing position (with perfect opponent)
  • If Nim-sum โ‰  0: Make a move that leaves Nim-sum = 0 for opponent

The agent discovers this strategy through reward shaping!

๐Ÿ› ๏ธ Customization

Modify Training Parameters

Edit src/training/trainer.ts:

// Training parameters optimized for reward shaping
const alpha = 0.05;      // Learning rate: Try 0.01 (slower) or 0.1 (faster)
const gamma = 0.99;      // Discount: Try 0.95 (short-term) or 0.999 (long-term)
const epsilon = 1.0;     // Initial exploration
const epsilonDecay = 0.99995;  // Try 0.9999 (faster) or 0.99999 (slower)
const epsilonMin = 0.05;       // Try 0.01 (less) or 0.1 (more exploration)

Modify Game Rules

Edit src/game/NimGame.ts:

// Change initial piles
constructor(piles: number[] = [1, 3, 5, 7]) {  // Try [2,4,6,8] or [1,2,3]
  // ...
}

๐Ÿงช Testing the Implementation

Quick Game Test

npx ts-node -e "
import { NimGame } from './src/game/NimGame';
const game = new NimGame();
console.log(game.toString());
console.log('Nim-sum:', game.getNimSum());
console.log('Legal moves:', game.getLegalMoves().length);
"

Agent Action Test

npx ts-node -e "
import { QLearningAgent } from './src/agent/QLearningAgent';
import { NimGame } from './src/game/NimGame';

const agent = QLearningAgent.fromFile('models/agent1.json');
const game = new NimGame();
const move = agent.getBestAction(game);
console.log('Agent recommends:', move);
"

๐Ÿ“– Further Reading

๐Ÿ“„ License

MIT

About

โ™Ÿ๏ธ ๐ŸŽฏ ๐Ÿง  Nim Misรจre Q-Learning Agent

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors