A cat sitting victorious on a pile of treats.

Understanding and Using Reinforcement Learning

The field of artificial intelligence (AI) has taken center stage in recent months; Large Language Models (LLMs) like ChatGPT and LLaMA have caught the world’s collective imagination. We’ve talked about what an LLM is and how it works in detail in other posts; but I’d like to talk about a different approach to machine learning that has always interested me – more due to my psychological origins (that is, my origins working in psychology, not my relationship with my mother).

Reinforcement Learning (RL) is a type of machine learning that owes its existence to behavioral theories of psychology, and it is responsible for other magnificent feats of AI (such as AlphaZero dominating the world of chess). It is a technique that can be used to train robotic arms to serve us coffee, and it also powers the final step of training LLMs, the current machine learning golden child. In this blog post, I’ll share some insights into the history of RL and provide a general overview. Hopefully, you’ll come away finding it as interesting as I do…

The Psychology Behind RL

The psychological roots of reinforcement learning can be traced back to several key concepts borrowed from behaviorism, an area of psychology that focuses on analyzing and modifying what can be observed (behavior) and ignoring what can’t (such as the internal processes of human cognition). One major concept here is operant conditioning. Note that this is not classical conditioning, which is about creating associations between unrelated stimuli and involuntary responses (think Pavlov’s dogs); operant conditioning is about altering voluntary behaviors via rewards and punishments.

These associations are created through methods like Positive Reinforcement (adding something desirable, like a treat or gift card), Negative Reinforcement (taking away something undesirable, like when a bill is smaller if you pay it immediately), and Punishment (both positive and negative, like getting a speeding ticket or losing driving privileges).

reinforcement learning in action

My cat Mowgli isn’t incredibly smart, but he is very food motivated and, thus, the perfect stooge to demonstrate the power of operant conditioning. Despite not being in business, he knows how to shake.

Another key concept here is shaping: rewarding successive approximations of a desired behavior. To teach a complex behavior to an animal that is completely unused to it – say, teaching a cat to run an entire dog-show agility course – you can’t just expect the cat to know what to do; you have to reinforce behaviors that get closer and closer to the one you’re after.

These ideas are usually connected to the names of psychologists like Edward Thorndike and B.F. Skinner (the man who tried to convince the Department of Defense to use pigeons to guide missiles). These behavioral conditioning principles laid an early foundation for what would later be crafted into RL algorithms.

The problem wasn’t that it didn’t work (it did), but that they thought people would laugh at them (they did).

Understanding Reinforcement Learning

Reinforcement Learning focuses on training algorithms, known as agents, to make a sequence of decisions. Unlike supervised learning, where the learning comes from a labeled dataset, and unsupervised learning, where patterns are inferred from unlabeled data, reinforcement learning depends on interactions with an environment: the agent learns from the consequences of its actions. The behavior of the agent is shaped via reinforcement and punishment as it gets closer and closer to the desired behavior.

This is very similar to the concepts from behaviorism described above. Some of the terminology here is slightly different, and some of it is the same (a small sketch tying these terms together follows the list):

  • Policy: The strategy an agent employs to determine its actions.🕹️
  • Value Function: An estimation of future rewards given a state and action.🎯
  • Reward: A signal indicating whether the agent’s actions are moving closer to or further from the target behavior.
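
To make those terms concrete, here is a tiny, purely illustrative sketch of the loop in Python. Everything in it – the one-step "environment," the preference table standing in for a policy and value estimate – is made up for illustration; we’ll use a real library later in the post.

import random

# Toy "environment": the target behavior is picking action 1.
def step(action):
    return 1 if action == 1 else 0  # reward: signals whether we did the desired thing

# Toy "policy"/value estimate: a preference score per action (higher = chosen).
preferences = {0: 0.0, 1: 0.0}

def choose_action():
    # Pick the currently preferred action, breaking ties at random.
    return max(preferences, key=lambda a: (preferences[a], random.random()))

# The reinforcement learning loop: act, observe the reward, nudge the policy.
for _ in range(100):
    action = choose_action()
    reward = step(action)
    preferences[action] += 0.1 * (reward - preferences[action])

print(preferences)  # the preference for action 1 should now dominate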

Of course, something we still have to contend with is the underlying machinery. While biological organisms tend to maximize food above all else (the cheese is at the end of the maze), when training an agent we have a far freer hand in determining the strategy that will power its policy of action. We can choose which algorithm will be used to shape the behavior.

Algorithm Choice

There are many different algorithms out there to choose from, and the choice largely depends on the problem we are solving and what actions we can take (which we’ll look at a little later).

Two main types of algorithms to choose from are:

Model-Free RL: This flavor uses only the current state values to decide what to do, learning directly from experience. It does not build an explicit understanding or representation of how the environment works.

Model-Based RL: Here, predictions about the future state of the environment are used to choose the best possible action. It involves learning and using an explicit model of the environment, which is used to predict how the environment will respond to different actions; those predictions are then used to decide which actions are likely to lead to the best outcomes.
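
To make the contrast concrete, here’s a hedged toy sketch of the model-based idea (this is not any particular library’s API, and in real model-based RL the model itself is usually learned from experience): if we have a model that predicts the next state and reward for each action, we can plan by simulating each action before committing to one.

# Toy, hand-written transition model, for illustration only.
def model(state, action):
    """Predict (next_state, reward) for taking `action` in `state`."""
    next_state = state + (1 if action == "right" else -1)
    reward = -abs(next_state - 10)  # the goal here: get the state close to 10
    return next_state, reward

def plan(state, actions=("left", "right")):
    # One-step lookahead: simulate every action with the model, pick the best predicted reward.
    predicted = {a: model(state, a)[1] for a in actions}
    return max(predicted, key=predicted.get)

state = 0
for _ in range(12):
    action = plan(state)
    state, _ = model(state, action)  # pretend the real environment behaves like the model
print(state)  # the state has been steered toward 10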

Here are a few of the algorithms to choose from, with a short description of each:

  • Proximal Policy Optimization (PPO): PPO promotes small, incremental updates to the policy — hence ‘proximal’ in its name. It achieves this by limiting the size of policy updates, leading to steady and stable learning progress.
  • Q-Learning: A model-free algorithm known for learning the quality of actions – how beneficial an action is in terms of gaining future rewards (see the sketch after this list).
  • Deep Q Networks (DQN): Integrate deep learning with Q-Learning, using neural networks to approximate Q-values.
  • Asynchronous Advantage Actor-Critic (A3C): Introduces the concept of parallel agent training to stabilize learning and reduce training time, using separate memory settings for policy (actor) and value (critic) to make decisions and evaluate them.
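
To ground the Q-Learning entry above, here’s a minimal sketch of its core update rule. The states, actions, and hyperparameter values below are placeholders, not taken from any specific environment.

from collections import defaultdict

alpha, gamma = 0.1, 0.99  # learning rate and discount factor (example values)
actions = [0, 1]
Q = defaultdict(lambda: {a: 0.0 for a in actions})  # Q[state][action] -> estimated quality

def q_update(state, action, reward, next_state):
    # Nudge Q(s, a) toward: reward now + discounted value of the best action from the next state.
    best_next = max(Q[next_state].values())
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

# One hypothetical environment step: we were in "s0", took action 1, got a reward of 1.0.
q_update(state="s0", action=1, reward=1.0, next_state="s1")
print(Q["s0"])  # action 1's estimate has been nudged upward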

Beyond the problem at hand and the spaces your algorithm can work in, something key to know is the exploration-exploitation tradeoff: whether to try new possibilities (exploration) or choose options that are known to yield high rewards (exploitation).
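
One simple, widely used way to manage that tradeoff is an epsilon-greedy strategy: with a small probability, take a random action (explore); otherwise take the best-known action (exploit). A hedged sketch, reusing the kind of Q-values from the snippet above:

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: a dict mapping each action to its estimated value in the current state.
    if random.random() < epsilon:
        return random.choice(list(q_values))  # explore: try something new
    return max(q_values, key=q_values.get)    # exploit: take the best-known action

action = epsilon_greedy({0: 0.2, 1: 0.5})  # usually returns 1, occasionally 0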

Training Your Own Model

If all this has piqued your interest and you want to train your own models, I’ve got you covered! Or rather, OpenAI has you covered. Or rather rather, the Farama Foundation has you covered. They maintain a wonderful library called “Gymnasium,” which gives you access to a diverse collection of reference environments where you can put your reinforcement learning skills (and algorithms) to the test. Using Python and Jupyter notebooks, you can spin up your own environments quickly and with relative ease.

Let’s train up a model to do something simple: balance a pole on a moving cart. It’s one of the simpler offerings from Gymnasium (along with classic Atari games), but it should demonstrate all the areas of reinforcement learning we’ve talked about so far.

A moving cart balancing a pole

Woooaaaa don’t drop it!

!pip install 'stable-baselines3[extra]'
!pip install gymnasium

import os
import gymnasium as gym # https://gymnasium.farama.org/
import pygame

from IPython import display
from stable_baselines3 import PPO # the algorithm we'll train our agent with
from stable_baselines3.common.vec_env import DummyVecEnv # wraps environments so they look vectorized
from stable_baselines3.common.evaluation import evaluate_policy # handy for scoring a trained agent

environment_name = 'CartPole-v0'

Action Spaces and Observation Spaces

Within a given environment, an agent will be able to glean certain observations (“My pole is falling!”) or take certain actions (“move the cart more to the right”). These two dimensions are captured by the terms observation space and action space: what can I see, and what can I do.

Observation Space: Defines what you can see:
| Num | Observation           | Min                 | Max                |
|-----|-----------------------|---------------------|--------------------|
| 0   | Cart Position         | -4.8                | 4.8                |
| 1   | Cart Velocity         | -Inf                | Inf                |
| 2   | Pole Angle            | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°)  |
| 3   | Pole Angular Velocity | -Inf                | Inf                |

These differ per environment. In our case, the observation is a single vector of four values: the cart position, the cart velocity, the pole angle, and the pole angular velocity.

This is what’s known as a Box space: the observations are arrays or vectors of numbers. These arrays are referred to as “boxes” because each value is constrained to a specific range.

Our action space, on the other hand, is discrete – simply a 0 (push the cart left) or a 1 (push it right).
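
You don’t have to take my word for it – Gymnasium lets you inspect (and sample from) both spaces directly:

env = gym.make(environment_name)

print(env.observation_space)           # a Box of 4 values: position, velocity, angle, angular velocity
print(env.action_space)                # Discrete(2): 0 = push left, 1 = push right
print(env.observation_space.sample())  # a random observation drawn from that Box
print(env.action_space.sample())       # a random action: 0 or 1

env.close()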

The goal for this environment is to keep the pole upright as long as possible, so a reward is granted for every step the pole stays upright. The episode ends if:

  1. Termination: Pole Angle is greater than ±12°
  2. Termination: Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display)
  3. Truncation: Episode length is greater than 200

More code! Let’s run our environment:

env = gym.make(environment_name, render_mode="human")

episodes =  5 
for episode in range(1, episodes+1):
    state, info = env.reset() # Initial set of observations (plus an info dict)
    terminated = False 
    truncated = False
    score = 0

    while not (terminated or truncated): # run until the episode ends
        env.render()
        action = env.action_space.sample() # taking a random action
        obs, reward, terminated, truncated, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
pygame.display.quit()
env.close()

And that looks like…

The environment keeps terminating (four times) as the pole falls too far.

Taking random actions gets us nowhere fast. Rather than an agent taking actions based on its observations and a policy, here we are simply making random choices (left or right).

Let’s instead initialize this environment for our agent, and train them up on it:

env = gym.make(environment_name)
env = DummyVecEnv([lambda: env]) # wrap the environment so PPO sees a (single-instance) vectorized environment
model = PPO('MlpPolicy', env, verbose=1) # 'MlpPolicy': a simple multi-layer-perceptron policy network
model.learn(total_timesteps=20000)

We’ve initialized our environment, vectorized it, and started training our agent on it.

We vectorized it because the algorithm we’re using (PPO) expects a vectorized environment by default (and vectorizing would also let us run multiple instances in parallel, speeding up training). Here, though, we’re just using one instance.
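
(As an aside: if we did want to run several instances in parallel, Stable-Baselines3 ships a make_vec_env helper that spins up multiple copies of the environment in one call. The snippet below is just a sketch – the choice of four environments and the names parallel_env and parallel_model are arbitrary.)

from stable_baselines3.common.env_util import make_vec_env

# Four CartPole instances running side by side; PPO collects experience from all of them.
parallel_env = make_vec_env(environment_name, n_envs=4)
parallel_model = PPO('MlpPolicy', parallel_env, verbose=1)
parallel_model.learn(total_timesteps=20000)

Back to our single-instance model: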

It’s going to train over 20,000 time steps, running through the environment many times and (slowly) honing its policy to rack up more and more reward.

Once that’s finished (and it can take a while!), we can see it in action!

env = gym.make(environment_name, render_mode="human")

episodes =  5 
for episode in range(1, episodes+1):
    obs, info = env.reset()
    terminated = False 
    truncated = False
    score = 0

    while not (terminated or truncated): # run until the episode ends (or hits the step limit)
        env.render()
        action, _states = model.predict(obs) # the trained model picks the action (the second value is only used by recurrent policies)
        obs, reward, terminated, truncated, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
pygame.display.quit()
env.close()

And that looks like…

I should note this is not perfect: I didn’t train this agent very long at all, which is why it’s still failing slowly. With more training runs, you can achieve near perfection.

That’s much better! It really seems like our agent has got the hang of it!
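
Eyeballing the render window is fun, but the evaluate_policy helper we imported earlier will also put a number on it (the ten evaluation episodes here are an arbitrary choice):

eval_env = gym.make(environment_name)
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10) # average score over 10 episodes
print('Mean reward: {} +/- {}'.format(mean_reward, std_reward))
eval_env.close()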

And that’s really all we have time for! There’s much more to dive into – training metrics, evaluation metrics, callbacks, TensorBoard… but I only have so much time!

Get out there and train! Maybe train up some Pokemon using reinforcement learning?

I am indebted to Nicholas Renotte on YouTube, as well as Brian Christian’s excellent book The Alignment Problem.

Want to chat more on this subject? Or any tech topic? Connect with us. (We love this stuff!)

We're building an AI-powered Product Operations Cloud, leveraging AI in almost every aspect of the software delivery lifecycle. Want to test drive it with us? Join the ProdOps party at ProdOps.ai.