Video

Project Summary

The goal of this project is to solve an intricate maze containing traps, lava, and other hazards. We optimize either by reducing the time the agent takes to solve the maze or by finding the optimal path. The input to the project is a section of the map that the agent will traverse, and the output is the best path the agent discovers. We assume every block is unknown, so the agent must discover each path on its own. A direct application of this project would let users beat mazes in video games more efficiently. At a higher level, the reinforcement learning explored in this project can be used to determine ideal behavior in the manufacturing, delivery, and finance industries.

Compared to our earlier proposal, we added a new Level 0 to study the relationship between reward values and the agent's actions.
We also discarded our old Level 5 (which included a zombie), since the zombie does not affect what we want to study in this project. Instead, we made a larger and more complex map for Level 4. The agent can now either move 1 step or jump 2 steps. On this kind of map, a traditional shortest-path algorithm would not work because of the complexity of the map and the number of possible actions, which is why we implemented reinforcement learning (e.g., the Q-Learning algorithm) in this project.

Here is a brief description of our test environment:

Level 0: Flat terrain, with edge boundary and hazards, two agents; agent 1 must jump to find the optimal path (Actions: Walk, Jump)
Level 1: Flat terrain, with edge boundary (Actions: Walk, Jump)
Level 2: Flat terrain, with edge boundary and hazards in the middle of the map (Actions: Walk, Jump)
Level 3: 3D terrain, hills, hazards, blocks (Actions: Walk, Jump)
Level 4: 3D terrain, hills, hazards, blocks, a larger and more complex map (Actions: Walk, Jump)


Approaches

For the final report, our approach used the Q-Learning algorithm. Here are the Q-Learning update equation and our parameter set-up.
The Q-Learning update equation:

Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))

self.epsilon = 0.01  # chance of taking a random action instead of the best one
self.alpha = 0.1     # learning rate
self.gamma = 1.0     # discount factor

# Set of actions ("move*" = walk one block, "tp*" = jump two blocks)
self.actions = ["movewest 1", "moveeast 1", "movenorth 1", "movesouth 1", "tpn", "tps", "tpe", "tpw"]

# Additional action costs to take into account (each action costs 1 by default).
# Total action cost = 1 + self.action_cost[i]
# Agent 1
self.action_cost = [0, 0, 0, 0, 9, 9, 9, 9]
# Agent 2
self.action_cost = [0, 0, 0, 0, 300, 300, 300, 300]

# Initial Q values (this snippet runs every time the agent discovers a new area of the map)
if current_s not in self.q_table:
    self.q_table[current_s] = [0, 0, 0, 0, -2, -2, -2, -2]

Learning Rate:
Alpha is the learning rate, a value between 0 and 1 (0 < alpha < 1). It indicates how much the utility values are updated each time the agent takes an action. With alpha = 0 the agent does not learn anything; with alpha = 1 the agent disregards what it has already learned and only considers the most recent information. In a stochastic environment, alpha should be closer to 0 than to 1. In our approach we set alpha to 0.1.
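For example, under the parameters above (alpha = 0.1, gamma = 1.0), if the current value is Q(s, a) = 0, the immediate reward is r = -1, and the best value in the next state is max_a' Q(s', a') = 0, a single update only moves the estimate a tenth of the way toward the new information:

Q(s, a) <- 0 + 0.1 * ((-1) + 1.0 * 0 - 0) = -0.1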

Discount Factor:
Gamma is the discount factor. It determines how much importance is given to future information. A gamma closer to 0 encourages the agent to seek rewards sooner rather than later, because it assigns a smaller value to future actions. A gamma closer to 1 makes the agent seek high rewards further in the future. This value is usually set close to 1; we set gamma to 1.0 in our approach.
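Numerically, each additional step into the future multiplies a reward's weight by another factor of gamma: with gamma = 1.0 future rewards keep their full value, while with gamma = 0.25 a reward three steps further away would be scaled by 0.25^3 ≈ 0.016, pushing the agent toward nearby rewards.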

Random Action: Epsilon is the probability of taking a random action instead of the best one.

Immediate Reward Value: r is the immediate reward value.

Max Q Value: The highest utility value over the actions available in the next state, max_a' Q(s', a'), is used as the estimate of future reward when updating the Q value of the current state.
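To make the roles of these terms concrete, here is a minimal sketch of an epsilon-greedy action choice and one Q-value update. This is not the project's agent code; the function names and the q_table structure (a dict mapping a state key to a list of Q values, one per action) mirror the snippet above but are our own simplification.

import random

def choose_action(q_table, s, epsilon=0.01):
    # Epsilon-greedy: with probability epsilon pick a random action,
    # otherwise pick the action with the highest Q value in state s.
    values = q_table[s]
    if random.random() < epsilon:
        return random.randrange(len(values))
    return values.index(max(values))

def q_update(q_table, s, a, r, s_next, alpha=0.1, gamma=1.0):
    # One Q-Learning update, matching the equation above:
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    best_next = max(q_table[s_next])    # max_a' Q(s', a')
    td_target = r + gamma * best_next   # immediate reward + discounted future estimate
    q_table[s][a] += alpha * (td_target - q_table[s][a])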

The reason we chose this parameter set:
We changed these parameters to see their effect on the agent's performance and tested the agent with different parameter sets (80 combinations in total, drawn from the lists below; a sketch of the sweep follows the lists). The tests were based on Level 1 (we planned to collect data from Levels 2, 3, and 4 as well, but gathering all of it took too long, so we only collected data from Level 1).

elist = [0.01,0.1,0.3,0.5] #list of epsilon, 4 values
alist = [0.05,0.1,0.5,0.75,1] #list of alpha, 5 values
glist = [0.25,0.5,0.75,1] #list of gamma, 4 values
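A sweep over these lists could be scripted as below. This is only a sketch: run_level_1 and plot_rewards stand in for our actual training run and plotting code and are hypothetical names.

import itertools

elist = [0.01, 0.1, 0.3, 0.5]       # epsilon values
alist = [0.05, 0.1, 0.5, 0.75, 1]   # alpha values
glist = [0.25, 0.5, 0.75, 1]        # gamma values

# 4 * 5 * 4 = 80 parameter combinations in total
for epsilon, alpha, gamma in itertools.product(elist, alist, glist):
    rewards = run_level_1(epsilon=epsilon, alpha=alpha, gamma=gamma)         # hypothetical training run
    plot_rewards(rewards, label="e=%s a=%s g=%s" % (epsilon, alpha, gamma))  # hypothetical plotting helper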

For each combination we plotted a graph. We found that alpha = 0.1, gamma = 1.0, and epsilon = 0.01 work well for all of our maps. Larger alpha and gamma values (e.g., 0.75 each) let the agent find the best reward sooner on a simple map (Level 1), but they do not work on a very complex map (Level 4).

Initial Q Values:
We give each newly discovered state 8 initial Q values, where the first 4 (the walk actions) are 0 and the last 4 (the jump actions) are -2. These values make the agent initially prefer walking over jumping, until a walk action's Q value drops to -2. This initialization was only used for Levels 1, 2, and 3.

Note: Each action costs 1 by default. We created an action_cost array to add extra cost to specific actions: a jump has an additional cost of 9 on top of the default cost of 1, whereas a move has no added cost.
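Concretely, following the comment in the set-up snippet, the total cost of each action can be read off directly:

# Total action cost = 1 (default) + additional cost from self.action_cost
action_cost = [0, 0, 0, 0, 9, 9, 9, 9]   # agent 1 costs from the set-up above
walk_cost = 1 + action_cost[0]           # 1 + 0 = 1   (e.g. "movewest 1")
jump_cost = 1 + action_cost[4]           # 1 + 9 = 10  (e.g. "tpn")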

Advantages of Q-Learning:

Disadvantages of Q-Learning:

Test Environments/Maps:

Currently, we have five maps: Level 0 and Level 2 are based on Level 1, and Level 4 is based on Level 3. The figures show the grid layouts in two dimensions and mark the start and end blocks. They also show the terrain of the mazes (the floor of the maze is cobblestone, obstacles are built from glass_blocks, and hills are built from cobblestone_blocks). The numbers in each grid cell give its (x, z) coordinates, and each cell also has an altitude value y.
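As an illustration of how such a grid cell could be turned into a state key for the Q table (this is an assumption for illustration, not the project's actual representation):

# Hypothetical helper (not from the project code): key a grid cell for the Q table
# by its (x, z) position and its altitude y.
def make_state_key(x, y, z):
    return "%d:%d:%d" % (int(x), int(y), int(z))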

Level 0:


Level 1:


Level 2:


Level 3:


Level 4:


Evaluation

For qualitative evaluation, we check how well the agent can solve the mazes at all four maze levels (Levels 1 through 4). We watch the agent while it solves each maze to verify that it behaves correctly, and we also check the agent using the cumulative rewards table.

For quantitative evaluation, we plot the reward values on a graph to see whether the reward found by the agent eventually converges near the optimal solution. The optimal solution is plotted as a dashed red line and the rewards found by the agent as a solid blue line.
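A plot of this kind can be produced with matplotlib. The sketch below assumes the per-episode rewards and the optimal reward have already been recorded; the variable and function names are ours, not the project's.

import matplotlib.pyplot as plt

def plot_convergence(cumulative_rewards, optimal_reward):
    # cumulative_rewards: reward collected in each training episode
    # optimal_reward: reward of the best known path for the map
    episodes = range(1, len(cumulative_rewards) + 1)
    plt.plot(episodes, cumulative_rewards, "b-", label="Agent reward")        # solid blue line
    plt.axhline(optimal_reward, color="r", linestyle="--", label="Optimal")   # dashed red line
    plt.xlabel("Episode")
    plt.ylabel("Cumulative reward")
    plt.legend()
    plt.show()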








References

Images