@noi01
This is your current training loop structure:
for e in range(EPISODES):
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    for time in range(500):
        # env.render()  # no preview of environment
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
- Is the inner loop meant to cap the number of steps the agent can take before the episode ends? If so, that limit should live inside the environment's step function. The environment could then truncate the episode itself and assign a reward reflecting that the agent did not reach the goal in time.
- The reward assignment here sets the reward to -10 whenever the environment reports done. Is that really what you want? As I understand it, solving the environment should be rewarded positively, but this line penalizes every terminal step, including a successful one. What is the goal of this line?
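
To make the first point concrete, here is a minimal sketch of moving the step limit into the environment, assuming the 4-tuple `step` API used in your loop. `StepLimit`, `timeout_reward`, `CountingEnv`, and `run_episode` are hypothetical names made up for illustration, and the -10 timeout penalty is only a placeholder value, not a recommendation.

```python
class StepLimit:
    """Hypothetical wrapper: ends an episode after max_steps and marks it as truncated."""

    def __init__(self, env, max_steps=500, timeout_reward=-10.0):
        self.env = env
        self.max_steps = max_steps
        self.timeout_reward = timeout_reward  # assumed penalty for running out of time
        self.steps = 0

    def reset(self):
        self.steps = 0
        return self.env.reset()

    def step(self, action):
        next_state, reward, done, info = self.env.step(action)
        self.steps += 1
        # Truncation means the time limit was hit without a natural termination.
        truncated = (not done) and self.steps >= self.max_steps
        if truncated:
            reward = self.timeout_reward
            done = True
        info = dict(info, truncated=truncated)
        return next_state, reward, done, info


class CountingEnv:
    """Toy stand-in environment (an assumption for this sketch): it never terminates on its own."""

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        return [float(self.t)], 1.0, False, {}


def run_episode(env):
    """Run one episode with a fixed action; the loop no longer needs its own step cap."""
    env.reset()
    total, done, info = 0.0, False, {}
    while not done:
        _, reward, done, info = env.step(0)
        total += reward
    return total, info
```

With this shape, the training loop can simply run until `done`, and `info["truncated"]` lets the agent tell a timeout apart from a genuine terminal state, so a solved episode does not have to share the timeout penalty.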