In this lab you will implement a simple Q-learning algorithm to solve a particular task: balancing a pole on a cart.
The pole balancing problem is a classic one in reinforcement learning: the controller attempts to keep a pole that is balanced on a cart from falling over by applying pushes to the cart. The code for this problem can be found in the archive sbp.tar.Z. It includes a simple simulator with a window that displays the current pole and cart situation. To generate the window output I used a package called EZX; because of this, the simulation code is written in C rather than C++. If you like, you may convert this code to Java and write your simulator in Java.
The code provided includes a simple game form of the problem. The game asks you, as the user, to choose from among five actions to try to balance the pole: a small positive push, a medium positive push, a small negative push, a medium negative push, or no push. Try playing the game several times to get a feel for how it works.
Once you understand the simulator you should implement a new version of the code that learns a Q table for choosing actions. You will then test how well your Q representation works by adding code that periodically evaluates the learned controller at set points during the training process.
The game version of this problem has five possible actions; you should use the same set of actions in your learned controller. You then need to decide how to represent the current pole balancing situation as a state. Consider building the state from two or three values describing the cart and pole, such as the pole's angle, the pole's angular velocity, and the cart's position, each discretized into a small number of bins.
If your representation has 10 bins for the angle, 8 for the angular velocity, and 3 for the cart position, then there are 10 × 8 × 3 = 240 possible states. You can give each state a unique number by calculating a value like this:
state = (ANGLEBIN - 1) * 8 * 3 + (VELOCITYBIN - 1) * 3 + (POSITIONBIN - 1)
Then your Q table is simply a two-dimensional array whose first dimension is the number of states and whose second dimension is the number of actions.
To learn the Q table you should run a large number of pole balancing games. A game ends either when the pole drops or when 500 steps have been taken. The controller receives a large negative reward when the pole drops or when the cart hits the wall; all other rewards should be 0.
For the discount factor I would suggest a high value such as 0.9 or 0.95 (you may want to make this an input to your system). The Q update rule you should use is the one from class (on the second "Nondeterministic Case" overhead). To select among the actions I would suggest using a probability p of selecting the "best" (highest Q value) action: at each step, with probability p choose the best action, and otherwise choose an action at random. To make this technique work, start p at a low value during early learning (allowing lots of exploration) and then increase it for later games (to allow more exploitation).
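These pieces can be sketched in C as follows. This is only an illustration, not the required implementation: the learning-rate schedule alpha = 1/(1 + visits(s, a)) is one common form of the nondeterministic Q update, so check it against the overhead from class; all names here are my own.

```c
#include <stdlib.h>

#define NUM_STATES  240
#define NUM_ACTIONS 5

double Q[NUM_STATES][NUM_ACTIONS];
int visits[NUM_STATES][NUM_ACTIONS]; /* times each (state, action) was updated */

double gamma_discount = 0.9;         /* discount factor (could be an input) */

/* Index of the action with the highest Q value in this state. */
int best_action(int state)
{
    int a, best = 0;
    for (a = 1; a < NUM_ACTIONS; a++)
        if (Q[state][a] > Q[state][best])
            best = a;
    return best;
}

/* With probability p pick the best action, otherwise pick at random. */
int select_action(int state, double p)
{
    if ((double)rand() / RAND_MAX < p)
        return best_action(state);
    return rand() % NUM_ACTIONS;
}

/* Nondeterministic Q update with a decaying learning rate
   alpha = 1 / (1 + visits(s, a)):
   Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a')) */
void update_q(int s, int a, double reward, int s_next)
{
    double alpha = 1.0 / (1.0 + visits[s][a]);
    double target = reward + gamma_discount * Q[s_next][best_action(s_next)];
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * target;
    visits[s][a]++;
}
```

The decaying alpha gives frequently visited state-action pairs smaller and smaller updates, which helps the estimates settle despite the randomness in the simulation.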
Your general learning approach should work as follows:
FOR game = 1 TO MaxNumberOfTrainingGames DO
    initialize a pole balancing game
    REPEAT
        determine the state
        select an action
        perform that action
        measure the reward
        update the Q function
    UNTIL the pole drops OR 500 steps have happened
    IF a certain number of games have passed THEN
        evaluate the current Q table
To evaluate the current Q table you should run a certain number of games (perhaps 50) and count, for each game, how many steps occur before the pole drops (or until the 500-step limit is reached). Note that this differs slightly from the learning process: when selecting an action during testing you should always choose the action with the highest Q value.
Conduct several experiments to see how well your state representation works. You may want to make the definition of the states (i.e., how many bins, what the bin thresholds are, etc.) inputs to your system so that you can try different representations. Train each of your representations several times for a large number of games (100,000), stopping periodically (after every 1,000 games) to evaluate how good the solution is so far. Graph your results for each experiment and discuss how well each of your representations works.
Print out a copy of all of your code files. You should hand in printouts demonstrating how your program works by running your program on several data sets, including your own.
You should also write up a short report (at least one page, no more than three) discussing your design decisions in implementing the Q-learning algorithm and how your version of the code works.
You must also submit your code electronically. To do this go to the link https://webapps.d.umn.edu/service/webdrop/rmaclin/cs8751-1-s2003/upload.cgi and follow the directions for uploading a file (you can do this multiple times, though it would be helpful if you would tar your files and upload one file archive).
To make your code easier to check and grade please use the following procedure for collecting the code before uploading it:
    rmaclin/prog05_poleb_cc

Note that the suffix of all C++ code files (not .h files) should be ".cc". Only code files (for example, in C++, only .cc and .h files) should be stored in this directory.
    tar cf prog05_poleb.tar login/prog05_poleb_PLcode