Here is a demonstration of a Reinforcement Learning algorithm which efficiently learns to control an inverted pendulum.

The task is to swing up a pendulum attached to a cart, and to balance it in the (unstable) upright position. The control signal is a horizontal force applied to the cart.

In the diagrams below, the cart is represented as a box in black outline, with the pendulum attached to its center. The pendulum can rotate freely around the point where it is attached to the cart. The force applied to the cart is represented by the horizontal green bar and the immediate reward by a yellow bar. The blue plus sign at the top indicates the desired position of the point of the pendulum.

The physical properties of the system are: pendulum length 0.6 m, pendulum mass 0.5 kg (evenly distributed along its length), mass of cart 0.5 kg, coefficient of friction between cart and ground 0.1 N/(m/s). The maximum allowed magnitude of the applied force is 10 N.
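Using the stated parameters, the cart-pendulum dynamics can be sketched as follows. This is a minimal simulation, assuming a uniform rod pendulum pivoted at the cart's centre and viscous cart friction; the actual simulator behind the demo may differ in integration scheme and conventions:

```python
import numpy as np

# Physical parameters from the text
L = 0.6    # pendulum length [m]
m = 0.5    # pendulum mass [kg], evenly distributed along its length
M = 0.5    # cart mass [kg]
b = 0.1    # cart friction coefficient [N/(m/s)]
g = 9.81   # gravitational acceleration [m/s^2]
l = L / 2            # distance from pivot to pendulum centre of mass
I = m * L**2 / 3     # moment of inertia of a uniform rod about its end

def step(state, F, dt=1e-3):
    """Advance the state (x, x_dot, theta, theta_dot) by dt seconds.

    theta is measured from the downward vertical, so theta = pi is
    upright. Plain Euler integration; a real implementation would use
    a higher-order integrator.
    """
    x, xd, th, thd = state
    # Coupled equations of motion, solved for (x_ddot, theta_ddot):
    #   (M+m) x_dd + m l cos(th) th_dd = F - b x_d + m l th_d^2 sin(th)
    #   m l cos(th) x_dd + I th_dd     = -m g l sin(th)
    A = np.array([[M + m,              m * l * np.cos(th)],
                  [m * l * np.cos(th), I                 ]])
    rhs = np.array([F - b * xd + m * l * thd**2 * np.sin(th),
                    -m * g * l * np.sin(th)])
    xdd, thdd = np.linalg.solve(A, rhs)
    return (x + dt * xd, xd + dt * xdd, th + dt * thd, thd + dt * thdd)
```

With zero force and a small initial angle, the pendulum simply oscillates about the hanging position, losing a little energy through the cart friction.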

This task (or variants thereof) is a standard problem in control. One point that makes the problem less trivial is that, to solve it, one sometimes has to take actions which temporarily move the system further away from the target. In control theory, this task is considered easy; however, the solution is usually based on an intricate understanding of the dynamics of the system. Here, the objective is to try to *learn* how to control the system, without a prior understanding of it.

The state of the system is given by the (horizontal) position of the cart, the (horizontal) speed of the cart, the angle of the pendulum and the angular velocity of the pendulum. The controller represents the state and action as continuous values, and acts in discrete time. Every 200 ms, the controller selects a new force, which is then applied for the next 200 ms time window. (The control law is thus piece-wise constant over time.) The controller can observe the four state variables, corrupted by a small amount of measurement noise (additive, Gaussian, with standard deviation 0.01).
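The interaction loop described above — a new force every 200 ms, held constant over the window, chosen from noisy observations and clipped to ±10 N — might look like this. The dynamics step and the policy below are placeholders, not the demo's actual code:

```python
import numpy as np

DT_CONTROL = 0.2   # controller decision interval [s]
DT_SIM = 0.001     # inner simulation time step [s]
F_MAX = 10.0       # maximum allowed force magnitude [N]
NOISE_STD = 0.01   # observation noise standard deviation

rng = np.random.default_rng(0)

def sim_step(state, F, dt):
    # Placeholder dynamics: any cart-pendulum integrator could go here.
    x, xd, th, thd = state
    return np.array([x + dt * xd, xd + dt * F, th + dt * thd, thd])

def policy(obs):
    # Placeholder controller; the demo *learns* this mapping from data.
    return rng.uniform(-F_MAX, F_MAX)

def run_episode(duration=5.0):
    state = np.zeros(4)
    forces = []
    n_decisions = int(round(duration / DT_CONTROL))
    for _ in range(n_decisions):
        # Observe the four state variables, corrupted by Gaussian noise.
        obs = state + rng.normal(0.0, NOISE_STD, size=4)
        # Choose a force and clip it to the allowed range.
        F = float(np.clip(policy(obs), -F_MAX, F_MAX))
        forces.append(F)
        # Hold the force constant for the whole 200 ms window.
        for _ in range(int(round(DT_CONTROL / DT_SIM))):
            state = sim_step(state, F, DT_SIM)
    return np.array(forces), state
```

A 5-second episode thus consists of 25 control decisions, each held for 200 ms.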

Initially, the learner knows nothing about the behaviour of the system. The only built-in assumption is that the state variables evolve smoothly over time. Technically, the only feedback the system gets about the quality of its behaviour is the squared distance between the endpoint of the pendulum and its desired position, measured every 200 ms.
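This loss signal can be written down directly. Assuming the angle is measured from the downward vertical and the target sits a pendulum-length above the cart's attachment line (the coordinate convention here is an assumption, not stated in the text), a sketch is:

```python
import math

L = 0.6  # pendulum length [m]

def cost(x, theta):
    """Squared distance between the pendulum endpoint and the target.

    x: cart position; theta: pendulum angle from the downward vertical.
    With theta = pi and x = 0 the tip sits exactly at the target (0, L),
    so the cost is zero there.
    """
    tip_x = x + L * math.sin(theta)
    tip_y = -L * math.cos(theta)
    return (tip_x - 0.0) ** 2 + (tip_y - L) ** 2
```

Hanging straight down at the target's horizontal position gives the maximum-distance cost (2L)² = 1.44 m².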

Below is a series of short movies showing the learning process. The first two movies show random forces; the remaining movies show trajectories generated by the learned policy, using all the experience available so far. The applied force and the immediate reward are indicated by coloured bars.

first random trajectory

second random trajectory

optimized trajectory after 10 seconds experience

optimized trajectory after 15 seconds experience

optimized trajectory after 20 seconds experience

optimized trajectory after 25 seconds experience

optimized trajectory after 30 seconds experience

optimized trajectory after 35 seconds experience

optimized trajectory after 40 seconds experience

optimized trajectory after 45 seconds experience

optimized trajectory after 50 seconds experience

optimized trajectory after 55 seconds experience

optimized trajectory after 60 seconds experience


At the outset, the learner knows nothing about the behaviour of the system. It needs to learn both the system dynamics and a useful control law, and it tries to learn these by observing the behaviour of the system. Since initially nothing is known about the dynamics, we generate two 5-second sequences using purely random forces. In the first sequence, the cart drives off the screen.

Then we are ready to learn from our experience. The learner updates its beliefs about the system dynamics and about what a good control law could be, based on its combined experience (which amounts to 10 seconds of recorded data at a sampling interval of 200 ms, i.e. 50 observations). The performance is now better than random. However, the controller does not head very directly towards the target, and when it gets close, it strays off to the right. This is because the learner has very little experience yet; in particular, it has never experienced the pendulum in the close-to-upright position, so how should it know what to do?
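The update step just described can be sketched generically as model-based reinforcement learning: fit a dynamics model to the recorded transitions, then search for a policy that minimises the predicted loss under that model. The toy version below uses a linear least-squares model and a random-shooting search over constant forces; the actual algorithm uses probabilistic Bayesian models and a far richer policy class, so this shows only the skeleton of the loop:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_linear_model(states, actions, next_states):
    """Least-squares fit of next_state ~ [state, action] @ W.

    A stand-in for the probabilistic dynamics model; it captures only
    the idea of 'learn the dynamics from recorded transitions'.
    """
    Z = np.hstack([states, actions])          # (N, 5) model inputs
    W, *_ = np.linalg.lstsq(Z, next_states, rcond=None)
    return W                                  # (5, 4) weight matrix

def predict(W, state, action):
    # One-step prediction of the next state under the learned model.
    return np.concatenate([state, [action]]) @ W

def shooting_policy_search(W, start, cost, horizon=25, n_candidates=100):
    """Pick the best constant-force candidate under the learned model."""
    best_F, best_loss = 0.0, np.inf
    for F in rng.uniform(-10, 10, size=n_candidates):
        s, total = start.copy(), 0.0
        for _ in range(horizon):
            s = predict(W, s, F)
            total += cost(s)
        if total < best_loss:
            best_F, best_loss = float(F), total
    return best_F
```

With 50 observations and only five model inputs, even this crude least-squares fit is well determined; the interesting part of the real algorithm is that its model also quantifies its own uncertainty.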

With even more experience, the controller learns to balance the pendulum. Already after 15 seconds it seems to balance, but after 25 seconds it falls over again. In later sessions, it never falls over. The strategy gradually changes: the controller avoids the initial trip to the left and swings the pendulum up rapidly, without much cart motion. The final policy is probably not optimal, but its total loss is not far from that of the optimal policy.

Exactly the same experiment was repeated below, except that this time we used a deterministic model of the dynamics rather than a probabilistic one. The experiment shows that the controller is not able to learn under this condition.

first random trajectory

second random trajectory

optimized trajectory after 10 seconds experience

optimized trajectory after 15 seconds experience

optimized trajectory after 20 seconds experience

optimized trajectory after 25 seconds experience

optimized trajectory after 30 seconds experience

optimized trajectory after 35 seconds experience

optimized trajectory after 40 seconds experience

optimized trajectory after 45 seconds experience

optimized trajectory after 50 seconds experience

optimized trajectory after 55 seconds experience

optimized trajectory after 60 seconds experience


The Reinforcement Learning algorithm demonstrated here was developed by Carl Edward Rasmussen and Marc Deisenroth. The algorithm is based on Bayesian inference. A technical paper describing the algorithm is forthcoming.

Last modified on July 24th 2008.