Marc Deisenroth's Homepage

Learning to control a dynamical system—from scratch

Optimal control is, in theory, one of the most general frameworks for designing control laws and for studying the cost functions underlying human behavior. However, most control problems defy an analytical solution, so approximate optimal control methods, better known as Reinforcement Learning (RL), are needed. Unlike classical optimal control solutions, RL relies on samples obtained from the system.

Efficient use of data is essential for fast progress in artificial learning. Learning models of the underlying dynamics is one way to achieve this goal. However, when only a few samples are available, the model uncertainty is large, and the model error results in a bias, which can have disastrous effects on the control performance. Bayesian (probabilistic) methods can naturally deal with uncertainty in data-driven models. To date, reinforcement learning still lacks fast learning methods and an appropriate quantification of currently available knowledge. Both seem essential when attacking a general reinforcement learning problem in a principled way. Bayesian inference averages over these uncertainties, which enables fast learning of feasible control laws even for inaccurate models.
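To illustrate how a probabilistic model reports its own uncertainty when data are scarce, here is a minimal sketch of Gaussian process regression in NumPy. The choice of a Gaussian process, the kernel, its hyperparameters, and the toy data are all assumptions made for illustration; the text above only states that Bayesian (probabilistic) models are used.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_predict(x_train, y_train, x_test, noise_var=1e-2):
    """Posterior mean and variance of a zero-mean GP at the test inputs."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = np.diag(rbf_kernel(x_test, x_test)) - np.sum(v ** 2, axis=0)
    return mean, var

# Only three observations (hypothetical toy data).
x_train = np.array([-1.0, 0.0, 0.5])
y_train = np.sin(x_train)

# One test input between the observations, one far away from them.
x_test = np.array([0.25, 4.0])
mean, var = gp_predict(x_train, y_train, x_test)
print("predictive variance near data:", var[0])
print("predictive variance far from data:", var[1])
```

The predictive variance is small near the observed inputs and returns to the prior variance far from them. A controller that is optimized against such a model can take this into account instead of trusting a single, possibly wrong, prediction.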

Some applications:

Under-actuated cart-pole swing up

As an initial test case, we consider the under-actuated cart-pole swing-up problem.

Figure: Under-actuated cart-pole swing up.

The task is to swing the pendulum up and to stabilize it in the inverted position (blue cross) by applying only a horizontal force to the cart. We demonstrate that our approach is able to learn both to swing up and to stabilize the pole, using fairly general assumptions and less than a minute of interaction time with the system. As far as we know, this is an unprecedented speed of learning for this kind of problem, which has previously been considered very hard to learn. The experiments have been carried out in simulation and on actual hardware, which demonstrates the viability and applicability of the proposed framework.

Simulation

Simulation results are described in the paper.

Hardware experiment

We applied our algorithm to a cart-pole system in hardware to see whether it also works in practice. The cart moves on a track of finite length (approximately 70 cm) and can be pulled up and down the track by a wire attached to a torque motor. The system input is the voltage to the power amplifier, which then produces a current in the torque motor. The four state variables (position and velocity of the cart, angle and angular velocity of the pendulum) can be measured. The results of learning a controller that solves the task are given below.


The video of the hardware experiment shows all training trials (each of length 2.5 seconds) and a final test trial of 20 seconds.

Description of the video

The video shows 7 trials, each of length 2.5 seconds. After each trial, the system is reset to its initial position, or at least close to it. The target position is the initial position, but the pendulum has to be inverted. The cost function only penalizes deviations of the tip of the pendulum from its goal position.

In the first two trials, we apply actions (horizontal forces to the cart) randomly since we do not know any better. The 5 seconds of data collected in these trials are used to build a probabilistic dynamics model. Using this model to internally simulate the dynamics, we optimize the parameters of a nonlinear controller. The third trial in the video is the result of applying this controller. The controller already manages to keep the cart in the middle of the belt. However, the problem is not yet solved because of the discrepancy between the real system and its model, which was based on state observations none of which was close to the goal state. We therefore take the new observations into account, re-train the dynamics model, and determine the controller corresponding to the updated model. Applying this new controller for 2.5 seconds results in the fourth trial shown in the video. For the first time, we observe states close to the target state, although the pendulum is spinning through. Taking these observations into account, the dynamics model is updated, and the corresponding controller is learned. In the fifth trial, the controller substantially reduces the angular velocity. After two more trials, the learned controller can solve the task based on 17.5 seconds of experience. It can even deal with small disturbances, as shown in the final test trajectory.
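The trial-by-trial procedure described above (random trials, fit a model, optimize a controller on the model, apply it, add the new data, repeat) can be sketched on a toy problem. This is only a structural illustration under strong simplifying assumptions: a hypothetical scalar linear system stands in for the cart-pole, ordinary least squares stands in for the probabilistic dynamics model, a linear feedback gain found by grid search stands in for the nonlinear controller, and all names and constants are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A_TRUE, B_TRUE = 1.1, 0.5   # hypothetical "real" system, unknown to the learner
STEPS = 25                  # time steps per trial

def cost(x, width=0.5):
    """Saturating cost that depends only on the distance to the target x = 0."""
    return 1.0 - np.exp(-0.5 * (x / width) ** 2)

def run_trial(policy):
    """Run one trial on the real system and return transitions (x, u, x_next)."""
    x, data = 1.0, []
    for _ in range(STEPS):
        u = policy(x)
        x_next = A_TRUE * x + B_TRUE * u + 0.01 * rng.standard_normal()
        data.append((x, u, x_next))
        x = x_next
    return data

def fit_model(data):
    """Least-squares fit of x_next = a * x + b * u (not a probabilistic model)."""
    arr = np.array(data)
    coef, *_ = np.linalg.lstsq(arr[:, :2], arr[:, 2], rcond=None)
    return coef  # (a_hat, b_hat)

def simulated_cost(gain, model):
    """Total cost of the controller u = -gain * x when simulated on the model."""
    a_hat, b_hat = model
    x, total = 1.0, 0.0
    for _ in range(STEPS):
        x = a_hat * x + b_hat * (-gain * x)
        total += cost(x)
    return total

def optimize_gain(model):
    """Grid search over the feedback gain using model rollouts only."""
    gains = np.linspace(0.0, 4.0, 81)
    return min(gains, key=lambda g: simulated_cost(g, model))

# Trials 1 and 2: random actions, since nothing is known about the system yet.
data = []
for _ in range(2):
    data += run_trial(lambda x: rng.uniform(-1.0, 1.0))

# Later trials: re-fit the model on all data, re-optimize, apply, and record.
for trial in range(3):
    model = fit_model(data)
    gain = optimize_gain(model)
    new_data = run_trial(lambda x, g=gain: -g * x)
    data += new_data
    real_cost = sum(cost(x_next) for _, _, x_next in new_data)
    print(f"trial {trial + 3}: gain={gain:.2f}, real cost={real_cost:.3f}")
```

The point of the sketch is the data flow: the real system is only touched inside `run_trial`, while all controller optimization happens on the learned model, and every new trajectory is added to the training set before the next model update.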

Snapshots of the final trajectory are given below:

Figure: Snapshots 1 to 6 from the cart-pole hardware experiment.

Hardware specifications

length of pendulum: 0.125 m
mass of pendulum: 0.32 kg
mass of cart: 0.7 kg
moment of inertia on motor shaft: 8.0 × 10⁻⁵ kg m²
torque motor constant: 0.08 Nm/A
amplifier constant: -0.5 A/V

Note that none of the parameters listed above, except the pendulum length, is needed when applying our learning framework.
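For readers who want to relate the system input to the actuation, the chain described in the hardware experiment (voltage to amplifier, amplifier current to torque motor) can be sketched with the two constants listed above. This is a static calculation that ignores shaft inertia and friction; the function name is hypothetical, and the learning framework itself does not use this relation.

```python
TORQUE_MOTOR_CONSTANT = 0.08   # Nm/A, from the hardware specifications
AMPLIFIER_CONSTANT = -0.5      # A/V, from the hardware specifications

def motor_torque(voltage):
    """Static motor torque (Nm) produced by an input voltage (V)."""
    current = AMPLIFIER_CONSTANT * voltage   # amplifier: volts to amperes
    return TORQUE_MOTOR_CONSTANT * current   # motor: amperes to torque

for voltage in (-2.0, 0.0, 2.0):
    print(f"{voltage:+.1f} V -> {motor_torque(voltage):+.3f} Nm")
```

The negative amplifier constant means that a positive voltage produces a negative torque, a sign convention the learned dynamics model picks up from data without being told.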

Algorithmic parameters

In our RL framework, there are not many parameters to choose. All of them somewhat depend on the length of the pendulum.

time discretization: 100 ms (shorter than the characteristic period of the pendulum)
prediction horizon: 2.5 seconds (longer than the characteristic period of the pendulum)
width of the Gaussian cost function: ~ 0.07 m (the pendulum has to cross the horizontal to substantially reduce the cost)
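A quick numerical check makes the three choices above concrete. The sketch below assumes a point-mass, small-angle pendulum for the characteristic time scale, and assumes that the cost has the saturating form 1 - exp(-d² / (2 w²)), where d is the distance of the pendulum tip from the target and w is the width. The exact cost expression and the angle convention (0 = upright) are assumptions, not taken from the text.

```python
import math

PENDULUM_LENGTH = 0.125   # m
COST_WIDTH = 0.07         # m
TIME_STEP = 0.1           # s
HORIZON = 2.5             # s
GRAVITY = 9.81            # m/s^2

# Small-angle period of a point-mass pendulum (an idealization).
period = 2.0 * math.pi * math.sqrt(PENDULUM_LENGTH / GRAVITY)
print(f"characteristic period: {period:.2f} s")

def tip_cost(cart_pos, angle):
    """Cost of a state, based only on the tip's distance from the target.

    angle is measured from the upright position (0 = inverted, pi = hanging).
    The target is the tip position with the cart at 0 and the pendulum upright.
    """
    tip_x = cart_pos + PENDULUM_LENGTH * math.sin(angle)
    tip_y = PENDULUM_LENGTH * math.cos(angle)
    dist_sq = tip_x ** 2 + (tip_y - PENDULUM_LENGTH) ** 2
    return 1.0 - math.exp(-0.5 * dist_sq / COST_WIDTH ** 2)

for name, angle in [("hanging down", math.pi),
                    ("horizontal", math.pi / 2),
                    ("45 degrees from upright", math.pi / 4),
                    ("upright", 0.0)]:
    print(f"{name:>24}: cost = {tip_cost(0.0, angle):.3f}")
```

Under these assumptions the period comes out at roughly 0.7 s, which lies between the 100 ms time step and the 2.5 s horizon. The cost stays close to 1 until the pendulum is above the horizontal and only then drops noticeably.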

Once we managed to communicate with the hardware, the algorithm worked right out of the box without changing any of the above parameters.

Acknowledgements

Many thanks to James Ingram, who helped us communicate with the physical cart-pole system.

Under-actuated cart-double pendulum

Applying exactly the same algorithm, we succeeded in learning a controller that swings up a double pendulum attached to a cart and stabilizes it in the upright position. A sketch of the problem is given in the figure below.

Figure: Under-actuated cart-double pendulum swing up.
Figure: The cost solely depends on the distance to the target (dashed, green).

The chaotic two-link system can be controlled only by pushing the cart to the left or right. The blue cross is the target position. Note that it is not sufficient to stabilize the double pendulum somewhere: the cart also has to be in a particular position to incur the least cost. The 6D system state is given by the position and velocity of the cart and the angles and angular velocities of the two links. We assume that the state is fully observable but the measurements are noisy. Moreover, we know the lengths of the two links. Using solely this information, we fix the width of the cost function and the length of the time discretization used for control. The cost function is based on the distance between the current position of the tip of the pendulum and the target position (blue cross). In particular, it involves neither velocity information nor a penalty on the magnitude of the control signal. A learning system should be able to discover that low (angular) velocities around the target position are crucial for solving the task.
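The following sketch shows why a purely distance-based cost also pins down the cart position. The link lengths, the cost width, the saturating cost form, and the angle convention (absolute angles measured from upright) are hypothetical choices for illustration; the text does not give these values.

```python
import math

def double_pendulum_cost(cart_pos, theta1, theta2, len1, len2, width):
    """Saturating cost based only on the tip's distance from the target.

    theta1 and theta2 are absolute link angles measured from upright.
    The target is the tip position with the cart at 0 and both links upright.
    """
    tip_x = cart_pos + len1 * math.sin(theta1) + len2 * math.sin(theta2)
    tip_y = len1 * math.cos(theta1) + len2 * math.cos(theta2)
    dist_sq = tip_x ** 2 + (tip_y - (len1 + len2)) ** 2
    return 1.0 - math.exp(-0.5 * dist_sq / width ** 2)

LEN1, LEN2, WIDTH = 0.6, 0.6, 0.5   # hypothetical values, in metres

# Both links upright with the cart at the target position: no cost.
print(double_pendulum_cost(0.0, 0.0, 0.0, LEN1, LEN2, WIDTH))

# Both links upright, but the cart is 1 m away from the target position.
print(double_pendulum_cost(1.0, 0.0, 0.0, LEN1, LEN2, WIDTH))

# Cart at the target position, both links hanging down.
print(double_pendulum_cost(0.0, math.pi, math.pi, LEN1, LEN2, WIDTH))
```

A perfectly balanced pendulum with the cart in the wrong place still incurs a substantial cost, while velocities never enter the expression.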

Simulation

We simulated the cart-double-pendulum system exactly following the approach described in (Rasmussen and Deisenroth, 2008). Initially, we applied random forces to collect data (3 seconds in total), from which our first dynamics model is learned. The following videos show the learning progress. The states visited by each new trajectory are incorporated in the update of the dynamics model in the next iteration.


This video shows the trajectory when applying the first optimized controller, i.e., the result using only the 3 seconds of data when applying random forces. The controller already keeps the cart close to the desired position, but the pendulum is still not above the horizontal.


After a couple of trials (experience of 12 seconds), the pendulum gets closer to the target state, i.e., the yellow reward bar gets closer to the right boundary. However, the pendulum cycles through.


Now, we have 42 seconds of total experience. The pendulum no longer cycles through, and the controller attempts to keep it in the upright position but does not fully succeed. However, we can already see the learning effect.


After 60 seconds of experience, the system performs very well and collects much reward during the trial. Stabilization is not yet possible.


After about 100 seconds of experience, the trial looks very promising: the pendulum no longer falls over. The problem is basically solved.


This long test trajectory is based on about 150 seconds of experience and shows that the learned controller succeeds in a) swinging up the double pendulum by only pushing the cart to the left or to the right and b) stabilizing the system in the inverted position. The system gains (almost) full reward (the yellow bar is almost at the right border of the frame). Although we did not penalize the applied force, the controller learned that only small forces can be applied to balance the double pendulum.

Pendubot

The Pendubot is a two-link arm, where a torque can be applied to the first link only. The second link can be considered a simple pendulum whose motion can be controlled by actuation of the first link.


The Pendubot's dynamics can be chaotic. This flash-video shows a typical trajectory without actuation.


After a couple of trials, we can also solve the Pendubot task: swing it up and balance it.

References


© University of Cambridge, Department of Engineering
Information provided by Marc Deisenroth (mpd37)