Bellman equation explained

In this article, I am going to explain the Bellman equation, which is one of the fundamental elements of reinforcement learning. The equation tells us what long-term reward can we expect, given the state we are in and assuming that we take the best possible action now and at each subsequent step.

Obviously, the goal of reinforcement learning is to maximize the long-term reward, so the Bellman equation can be used to calculate whether we have achieved the goal. The equation below is the Bellman equation for deterministic environments. It gives the value of the current state when the best possible action is chosen in this (and all following steps).

\[V(s) = \max_{ a }( R(s,a) + \gamma V(s'))\]

The equation consists of three elements:

  • the max function which picks the action that maximizes the reward (max_a)
  • the discount factor - the hyperparameter that can be tuned to emphasize the long-term reward or make the model focus on low hanging fruits and encourage to pick the best short-term solution. (gamma)
  • the function that calculates the reward given the selected action and the current state (R(s,a))

We see that the Bellman equation is a recursive function because it calls itself (s’ is the state in the next step).

Why s’ is the next step?

It may seem counterintuitive that the function calculated inside the current step refers to the next step, not to the previous one.

It is like that because we can calculate the value of actions only when we have reached the terminal state. At this point, we go backward: we apply the discount factor and add the reward function at every step until we reach the initial step. Then we have the total reward.

Gamma parameter

The gamma hyperparameter is the discount factor which tells us how valuable is a reward which we will get in a future step.

A low value encourages the model to ignore long-term rewards and focus on getting a reward quickly (even if it is significantly lower than the possible long-term reward).

A high value of the discount factor emphasizes long-term rewards, so the model may take an action which results in punishment in the current step, but allows it to get a higher reward in the future.

According to the authors of the Move 37 “Reinforcement learning” course, in a typical case, the discount factor has values between 0.9 and 0.99.

Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?

Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!

Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.

Bartosz Mikulski

Bartosz Mikulski

  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz
Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.