Bellman equation explained

In this article, I am going to explain the Bellman equation, which is one of the fundamental elements of reinforcement learning. The equation tells us what long-term reward can we expect, given the state we are in and assuming that we take the best possible action now and at each subsequent step.

Obviously, the goal of reinforcement learning is to maximize the long-term reward, so the Bellman equation can be used to calculate whether we have achieved the goal. The equation below is the Bellman equation for deterministic environments. It gives the value of the current state when the best possible action is chosen in this (and all following steps).

\[V(s) = \max_{ a }( R(s,a) + \gamma V(s'))\]

The equation consists of three elements:

  • the max function which picks the action that maximizes the reward (max_a)
  • the discount factor - the hyperparameter that can be tuned to emphasize the long-term reward or make the model focus on low hanging fruits and encourage to pick the best short-term solution. (gamma)
  • the function that calculates the reward given the selected action and the current state (R(s,a))

We see that the Bellman equation is a recursive function because it calls itself (s’ is the state in the next step).

Would you like to help fight youth unemployment while getting mentoring experience?

Develhope is looking for tutors (part-time, freelancers) for their upcoming Data Engineer Courses.

The role of a tutor is to be the point of contact for students, guiding them throughout the 6-month learning program. The mentor supports learners through 1:1 meetings, giving feedback on assignments, and responding to messages in Discord channels—no live teaching sessions.

Expected availability: 15h/week. You can schedule the 1:1 sessions whenever you want, but the sessions must happen between 9 - 18 (9 am - 6 pm) CEST Monday-Friday.

Check out their job description.

(free advertisement, no affiliate links)

Why s’ is the next step?

It may seem counterintuitive that the function calculated inside the current step refers to the next step, not to the previous one.

It is like that because we can calculate the value of actions only when we have reached the terminal state. At this point, we go backward: we apply the discount factor and add the reward function at every step until we reach the initial step. Then we have the total reward.

Gamma parameter

The gamma hyperparameter is the discount factor which tells us how valuable is a reward which we will get in a future step.

A low value encourages the model to ignore long-term rewards and focus on getting a reward quickly (even if it is significantly lower than the possible long-term reward).

A high value of the discount factor emphasizes long-term rewards, so the model may take an action which results in punishment in the current step, but allows it to get a higher reward in the future.

According to the authors of the Move 37 “Reinforcement learning” course, in a typical case, the discount factor has values between 0.9 and 0.99.

Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Bartosz Mikulski
Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group

Subscribe to the newsletter and get access to my free email course on building trustworthy data pipelines.