A Simple Solution without Artificial Intelligence

CartPole is a game in the OpenAI Gym reinforcement learning environment. It is widely used in textbooks and articles to illustrate the power of machine learning. However, all these machine learning methods require a decent amount of coding and plenty of computing power to train. Is there a simpler solution?

The answer is yes. In this article, I will show an extremely simple solution. Although it is only 5 lines long, it performs better than any commonly found machine learning method and completely beats the CartPole game. Now let’s start!

Table of Contents

  1. Review of the CartPole problem
  2. Analysis of some simple policies
  3. Arriving at the 5-Line solution
  4. Conclusion

Review of the CartPole problem

The CartPole problem is also known as the “Inverted Pendulum” problem. It consists of a pole attached to a cart. Since the pole’s center of mass is above its pivot point, it is an unstable system. The official full description can be found here on the OpenAI website.

The pole starts in an upright position with a small perturbation. The goal is to move the cart left and right to keep the pole from falling.

Following is a graphical illustration of the system (if you want to know how to set up the OpenAI Gym environment and render this illustration, this article can help).

Image by author, rendered from OpenAI Gym CartPole-v1 environment

In the OpenAI CartPole environment, the status of the system is specified by an “observation” of four parameters (x, v, θ, ω), where

  • x: the horizontal position of the cart (positive means to the right)
  • v: the horizontal velocity of the cart (positive means moving to the right)
  • θ: the angle between the pole and the vertical position (positive means clockwise)
  • ω: the angular velocity of the pole (positive means rotating clockwise)

Given an observation, a player can perform either one of the following two possible “actions”:

  • 0: pushing the cart to the left
  • 1: pushing the cart to the right

The game is “done” when the pole deviates more than 15 degrees from vertical (|θ| ≥ π/12 ≈ 0.26 rad). At each time step, if the game is not “done”, the cumulative “reward” increases by 1. The goal of the game is to make the cumulative reward as high as possible.
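If you want to poke at the environment yourself, a minimal interaction could look like the sketch below. This assumes the classic Gym API, where reset() returns the observation and step() returns a 4-tuple; newer gym/gymnasium releases changed these signatures slightly.

import gym

# Create the environment and inspect its interface
env = gym.make("CartPole-v1")
print(env.observation_space)   # Box with 4 values: (x, v, theta, omega)
print(env.action_space)        # Discrete(2): 0 = push left, 1 = push right

obs = env.reset()                        # pole starts upright with a small perturbation
obs, reward, done, info = env.step(1)    # push the cart to the right once
print(obs, reward, done)                 # new observation, reward of 1.0, done flag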

Let’s take a look at the following example in detail:

The observation:

  • x=0.018: the cart is on the right side of the origin O
  • v=0.669: the cart is moving to the right
  • θ=0.286: the pole is tilted about 16.4 degrees (0.286 × 180/π) clockwise from vertical
  • ω=0.618: the pole is rotating clockwise

Action=1: the player is pushing the cart to the right

Cumulative Reward=47: the player has successfully sustained 47 time steps in this game

Done=1: this game is already “done” (because |θ| > 15 degrees)

Now that we understand the setup, let’s see how to play this game to achieve high rewards.

Analysis of some simple policies

In the context of reinforcement learning, a “policy” essentially means a function that takes an observation (or a series of observations) and outputs an action.

Random Policy

Before we try to be smart, let’s first imagine a monkey randomly pushing the cart left and right, and see how well it performs. This helps us establish a baseline. Its implementation is, of course, very simple:

import random

def rand_policy(obs):
    # Ignore the observation and pick a random action: 0 = push left, 1 = push right
    return random.randint(0, 1)

We played this “random policy” 1,000 times and plotted the cumulative reward of each game. The mean reward is 22.03, with a standard deviation of 8.00.
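For completeness, the evaluation loop behind these numbers can be as simple as the following sketch (the run_episode helper and the 1,000-game loop are my illustration, again assuming the classic 4-tuple Gym step API):

import gym
import numpy as np

def run_episode(env, policy, max_steps=500):
    # Play one game with the given policy and return the cumulative reward
    obs = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        obs, reward, done, _ = env.step(policy(obs))
        total_reward += reward
        if done:
            break
    return total_reward

env = gym.make("CartPole-v1")
rewards = [run_episode(env, rand_policy) for _ in range(1000)]
print(f"mean = {np.mean(rewards):.2f}, std = {np.std(rewards):.2f}")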

Theta Policy

Of course, we can do better than the monkey. In Aurélien Géron’s book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Chapter 18), there is a very simple policy that depends only on the pole’s angle θ:

def theta_policy(obs):
    # obs = (x, v, theta, omega); act on the pole angle alone
    theta = obs[2]
    return 0 if theta < 0 else 1

In plain language, it says that if the pole is tilted to the left (θ<0), then push the cart to the left, and vice versa. Very intuitive, isn’t it? It’s indeed better than the random policy, with the mean reward almost doubled to 41.67.

The following is one game showing how it performs. It does show some intention to prevent the pole from falling.

One game played by the Theta Policy

Analysis of the Theta Policy

Although better than the monkey, the Theta Policy is far from satisfactory. For those with some physics background, this policy is obviously flawed: when the cart is pushed to the left, the pole receives a clockwise angular acceleration, not a clockwise angular velocity. As a consequence, the pole may end up rotating clockwise as well as counter-clockwise. Also, when the pole is already moving towards the center, say θ > 0 and ω < 0, the action (pushing to the right) still accelerates the angular velocity towards the center rather than slowing it down, so the pole overshoots past the center.

Based on the above mechanical analysis, a much more reasonable proposition is to counteract the pole’s rotation: when the pole is rotating counter-clockwise (ω < 0), push the cart to the left (action = 0) so that the resulting clockwise angular acceleration slows the rotation down, and vice versa. Since it depends only on the angular velocity ω, let’s name it the “Omega Policy”. Its implementation is just as simple as the Theta Policy:

def omega_policy(obs):
    # Act on the pole's angular velocity alone: always counteract the rotation
    w = obs[3]
    return 0 if w < 0 else 1

Surprise! Based on one simple law from physics, a one-line change turns the poor Theta Policy into a winner! This Omega Policy gets a mean reward of ~200!

To appreciate the 200 average reward, let’s compare it with the average rewards of some commonly found machine learning policies. Please keep in mind that these machine learning policies are much more complicated, much harder to explain, and require long training times to achieve these results:

You can see that our two-line Omega Policy already performs on par with, or better than, the AI-powered Deep Q-Learning policy.

The official CartPole webpage defines the problem as “solved” when the average reward is at least 195 over 100 consecutive trials. So our 2-line Omega Policy already solves the CartPole problem!
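If you want to check that criterion directly, a quick sketch (reusing the env and run_episode helper from the evaluation sketch above) could be:

# Average reward of the Omega Policy over 100 consecutive trials
scores = [run_episode(env, omega_policy) for _ in range(100)]
print(f"average over 100 trials: {np.mean(scores):.1f}")   # ~200, comfortably above the 195 bar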

Arriving at the 5-Line solution

Although the simple Omega Policy already solved the CartPole problem, I am still not satisfied. A quick visualization reveals why:

One iteration played by the Omega Policy

We can see that the game ends not because the pole falls, but because the cart deviates too far from the origin. This indicates that the policy successfully “stabilizes” the pole (keeping the angular velocity ω ≈ 0), but at a “tilted” position (angle θ ≠ 0), so the cart keeps moving in one direction. This is not surprising, because the Omega Policy does nothing about the angle θ.

After identifying the problem, it’s easy to propose an improved policy:

  • When the angle θ is “small”, we want to stabilize θ. This is the same as the Omega Policy.
  • When the angle θ is “large”, we want to correct θ, i.e., give an angular acceleration towards the center. This is the same as the Theta Policy.

As for the criterion for “small” versus “large”, there is no single correct value, but a reasonable starting point is 10% of the 15-degree “done” threshold, i.e., ~0.026 rad. In practice, the result is not very sensitive to this value: anywhere from 0.02 to 0.04 produces excellent results. The following example uses 0.03 as the threshold:

def theta_omega_policy(obs):
    theta, w = obs[2:4]
    if abs(theta) < 0.03:
        # Small angle: stabilize by counteracting the angular velocity (Omega Policy)
        return 0 if w < 0 else 1
    else:
        # Large angle: push the pole back towards vertical (Theta Policy)
        return 0 if theta < 0 else 1

How good is this simple 5-line policy?

Bingo! The pole simply CANNOT fall! Not even once! The reason the cumulative reward caps at 500 is just a limitation of the CartPole-v1 environment itself: after 500 time steps, the game automatically stops. In other words, our Theta-Omega Policy not only “solves” the problem, but also “breaks” the game!
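If you want to verify this, and the earlier claim that any threshold between roughly 0.02 and 0.04 works, a small sweep (again reusing the env and run_episode helper from my sketch above; make_policy is just an illustrative helper) might look like:

def make_policy(threshold):
    # Build a Theta-Omega policy with a configurable angle threshold
    def policy(obs):
        theta, w = obs[2:4]
        if abs(theta) < threshold:
            return 0 if w < 0 else 1    # small angle: damp the angular velocity
        return 0 if theta < 0 else 1    # large angle: push back towards vertical
    return policy

for threshold in (0.02, 0.03, 0.04):
    rewards = [run_episode(env, make_policy(threshold)) for _ in range(100)]
    # Every game should reach the 500-step cap of CartPole-v1
    print(threshold, min(rewards), np.mean(rewards))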

The following shows how this simple 5-line policy performs in action:

One iteration played by the Theta-Omega Policy

The full notebook of the system setup, analysis, and GIF generation is available here on GitHub.

Conclusion

Obviously, this is not an artificial intelligence exercise. But by showing how to break the CartPole game in 5 lines, I hope you can appreciate how condensed the laws of physics are. Essentially, we used the results of thousands of years of human learning to replace the machine learning code, and got a far better, far simpler result.

So the next time we apply any machine learning algorithm, it’s always better to check existing knowledge first.

