A Simple Solution without Artificial Intelligence
CartPole is a game in the OpenAI Gym reinforcement learning environment. It is widely used in textbooks and articles to illustrate the power of machine learning. However, all these machine learning methods require a decent amount of coding and lots of computing power to train. Is there any simpler solution?
The answer is yes. In this article, I will show an extremely simple solution. Although it is only 5 lines long, it performs better than any commonly found machine learning method and completely beats the CartPole game. Now let’s start!
Table of Contents
- Review of the CartPole problem
- Analysis of some simple policies
- Arriving at the 5-Line solution
Review of the CartPole problem
The CartPole problem is also known as the “Inverted Pendulum” problem. It has a pole attached to a cart. Since the pole’s center of mass is above its pivot point, it is an unstable system. The official full description can be found here on the OpenAI website.
The pole starts in an upright position with a small perturbation. The goal is to move the cart left and right to keep the pole from falling.
Following is a graphical illustration of the system (if you want to know how to set up the OpenAI Gym environment and render this illustration, this article can help).
In the OpenAI CartPole environment, the status of the system is specified by an “observation” of four parameters (x, v, θ, ω), where
- x: the horizontal position of the cart (positive means to the right)
- v: the horizontal velocity of the cart (positive means moving to the right)
- θ: the angle between the pole and the vertical position (positive means clock-wise)
- ω: angular velocity of the pole (positive means rotating clock-wise)
Given an observation, a player can perform either one of the following two possible “actions”:
- 0: pushing the cart to the left
- 1: pushing the cart to the right
The game is “done” when the pole deviates more than 15 degrees from vertical (|θ| ≥ π/12 ≈0.26). In each time step, if the game is not “done”, then the cumulative “reward” increases by 1. The goal of the game is to have the cumulative reward as high as possible.
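The termination condition is easy to express directly. Here is a minimal sketch in Python (the helper name `is_done` is my own, not part of the Gym API):

```python
import math

def is_done(theta):
    # The game is "done" once the pole tilts 15 degrees (pi/12 radians)
    # or more from vertical, in either direction.
    return abs(theta) >= math.pi / 12

print(is_done(0.1))    # False: still within the ~0.26 radian threshold
print(is_done(0.286))  # True: past the threshold
```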
Let’s take a look at the following example in detail:
- x=0.018: the cart is on the right side of the origin O
- v=0.669: the cart is moving to the right
- θ=0.286: the pole is at (0.286/2π*360≈16.4 degrees) clockwise from vertical
- ω=0.618: the pole is rotating clockwise
Action=1: the player is pushing the cart to the right
Cumulative Reward=47: the player has successfully sustained 47 time steps in this game
Done=1: this game is already “done” (because |θ| > 15 degrees)
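The radians-to-degrees conversion above can be double-checked with Python’s standard library:

```python
import math

# theta = 0.286 radians, as in the example observation above
print(math.degrees(0.286))  # ~16.4 degrees, past the 15-degree threshold
```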
Now that we understand the setup, let’s see how to play this game to achieve high rewards.
Analysis of some simple policies
In the context of reinforcement learning, a “policy” essentially means a function that takes an observation (or a series of observations) and outputs an action.
Before we try to be smart, let’s first imagine a monkey randomly pushing the cart left and right, and see how well it performs. This helps us establish a baseline. Its implementation is, of course, very simple:
return random.randint(0, 1)  # requires: import random
We played this “random policy” 1,000 times and plotted the cumulative reward of each game. The mean reward is 22.03, and the standard deviation is 8.00.
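A sketch of the evaluation loop behind these numbers. The helper names are my own, and the bare `reset`/`step` interface is a simplification of the real Gym API (which also returns an info dict), so the sketch runs against any environment-like object:

```python
import random
import statistics

def random_policy(obs):
    # The "monkey": ignore the observation, push left (0) or right (1) at random
    return random.randint(0, 1)

def run_episode(env, policy, max_steps=500):
    # Play one game and return its cumulative reward.
    obs = env.reset()
    total = 0
    for _ in range(max_steps):
        obs, reward, done = env.step(policy(obs))
        total += reward
        if done:
            break
    return total

def evaluate(env, policy, n_games=1000):
    # Play many games and summarize the cumulative rewards.
    rewards = [run_episode(env, policy) for _ in range(n_games)]
    return statistics.mean(rewards), statistics.pstdev(rewards)
```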
Of course, we can do better than the monkey. In Aurélien Géron’s book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Chapter 18), there is a very simple policy that depends only on the pole’s angle θ:
theta = obs[2]
return 0 if theta < 0 else 1
In plain language, it says that if the pole is tilted to the left (θ<0), then push the cart to the left, and vice versa. Very intuitive, isn’t it? It’s indeed better than the random policy, with the mean reward almost doubled to 41.67.
The following shows one episode of how it performs. It indeed shows some intention of preventing the pole from falling.
Analysis of the Theta Policy
Although better than a monkey, the Theta Policy is far from satisfactory. For those with some physics background, this policy is obviously flawed: when the cart is pushed to the left, the pole gets a clockwise angular acceleration, not a clockwise angular velocity. The consequence is that after the push, the pole can still be rotating either clockwise or counter-clockwise. Also, when the pole is already moving towards the center, say θ > 0 and ω < 0, the action (pushing to the right) will still accelerate the angular velocity towards the center rather than slow it down, so the pole overshoots past the center.
Based on the above mechanical analysis, a much more reasonable proposition is: when the pole is rotating counter-clockwise (ω < 0), push the cart to the left (action = 0) to give it a counteracting clockwise angular acceleration, and vice versa. Since it depends only on the angular velocity ω, let’s name it the “Omega Policy”. Its implementation is just as simple as the Theta Policy’s:
w = obs[3]
return 0 if w < 0 else 1
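Wrapped into a complete function (the function name is my own choice) and checked against the example observation from earlier:

```python
def omega_policy(obs):
    # obs = (x, v, theta, omega); only the angular velocity omega matters here
    w = obs[3]
    # If the pole rotates counter-clockwise (w < 0), push left (0) to give it
    # a clockwise angular acceleration that counteracts the rotation;
    # otherwise push right (1).
    return 0 if w < 0 else 1

print(omega_policy((0.018, 0.669, 0.286, 0.618)))  # omega > 0, so push right: 1
```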
Surprise! Based on one simple law of physics, a one-line change turns the poor Theta Policy into a winner! This Omega Policy gets a mean reward of ~200!
To appreciate the average reward of 200, let’s compare it with the average rewards of some commonly found machine learning policies. Keep in mind that these machine learning policies are much more complicated, much harder to explain, and require long training times to achieve these results:
- Sequential Neural-Network (in Géron’s book): ~46
- Deep Q-Learning (in article 1): ~130
- Deep Q-Learning (in article 2): ~200
You can see that our two-line Omega Policy already performs on par with or better than the AI-powered Deep Q-Learning policies.
The official CartPole webpage defines the problem as “solved” when the average reward is at least 195 over 100 consecutive trials. So our 2-line Omega Policy already solves the CartPole problem!
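The “solved” criterion can be checked mechanically; a minimal sketch (the helper name `is_solved` is hypothetical, not part of Gym):

```python
def is_solved(rewards, threshold=195.0, window=100):
    # "Solved": the average reward over 100 consecutive trials
    # reaches the threshold.
    if len(rewards) < window:
        return False
    return any(
        sum(rewards[i:i + window]) / window >= threshold
        for i in range(len(rewards) - window + 1)
    )

print(is_solved([200] * 100))  # True: every 100-trial window averages 200
print(is_solved([22] * 100))   # False: the random baseline is nowhere close
```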
Arriving at the 5-Line solution
Although the simple Omega Policy already solved the CartPole problem, I am still not satisfied. A quick visualization reveals why:
We can see that the game ends not because the pole falls but because the cart deviates too far from the origin. This indicates the policy successfully “stabilizes” the pole (keeping angular velocity ω ≈ 0), but at a “tilted” position (angle θ ≠ 0), so the cart keeps moving in one direction. This is not surprising, because the Omega Policy does nothing about the angle θ.
After identifying the problem, it’s easy to propose an improved policy:
- When the angle θ is “small”, we want to stabilize θ. This is the same as the Omega Policy.
- When the angle θ is “large”, we want to correct θ, i.e., give an angular acceleration towards the center. This is the same as the Theta Policy.
As for the criterion for “small” and “large”, it is not well defined, but a reasonable starting point is 10% of the 15-degree “done” threshold, i.e., ~0.026 radians. In practice, the result is not very sensitive to this value: anywhere from 0.02 to 0.04 produces amazing results. The following uses 0.03 as the threshold:
theta, w = obs[2:4]
if abs(theta) < 0.03:
    return 0 if w < 0 else 1
return 0 if theta < 0 else 1
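Wrapped into a complete function (the name `theta_omega_policy` is mine), the combined policy can be sanity-checked against the example observation from earlier:

```python
def theta_omega_policy(obs):
    # obs = (x, v, theta, omega), as in the OpenAI Gym CartPole observation
    theta, w = obs[2:4]
    if abs(theta) < 0.03:
        # Small angle: stabilize by counteracting the angular velocity (Omega Policy)
        return 0 if w < 0 else 1
    # Large angle: push to accelerate the pole back toward vertical (Theta Policy)
    return 0 if theta < 0 else 1

print(theta_omega_policy((0.018, 0.669, 0.286, 0.618)))  # large positive theta: 1
```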
How good is this simple 5-line policy?
Bingo! The pole simply CANNOT fall! Not even once! The cumulative reward caps at 500 only because of a limitation of the CartPole-v1 environment itself: after 500 time steps, the game automatically stops. In other words, our Theta-Omega Policy not only “solves” the problem, but also “breaks” the game!
Following is how this simple 5-line policy performs in real action:
The full notebook of the system setup, analysis, and GIF generation is available here on GitHub.
Obviously, this is not an artificial intelligence exercise. But by showing how to break the CartPole game in 5 lines, I hope you can appreciate how condensed the laws of physics are. Essentially, we used the results of thousands of years of human learning to replace the machine learning code, and got a far better, far simpler result.
So the next time we apply any machine learning algorithm, it’s always better to check for existing knowledge first.