Bellman equation explained

In this article, I am going to explain the Bellman equation, which is one of the fundamental elements of reinforcement learning. The equation tells us what long-term reward we can expect, given the state we are in and assuming that we take the best possible action now and at each subsequent step.

The goal of reinforcement learning is to maximize the long-term reward, so the Bellman equation can be used to measure how well we are achieving that goal. The equation below is the Bellman equation for deterministic environments. It gives the value of the current state when the best possible action is chosen in this step and in every following step.

\[V(s) = \max_{a}\big(R(s,a) + \gamma V(s')\big)\]

The equation consists of three elements:

  • the max function (max_a), which picks the action that maximizes the sum of the immediate reward and the discounted value of the next state
  • the discount factor (gamma) - a hyperparameter that can be tuned to emphasize long-term rewards, or to make the model focus on low-hanging fruit and pick the best short-term solution
  • the reward function (R(s,a)), which returns the reward for taking the selected action in the current state

We see that the Bellman equation is a recursive function because it calls itself: s’ is the state we reach in the next step after taking the chosen action. A minimal sketch of applying this update repeatedly is shown below.
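
The following sketch applies the deterministic Bellman update repeatedly until the state values stop changing. The three-state chain, its transitions, and its rewards are made-up assumptions used only for illustration.

```python
# A minimal sketch of the deterministic Bellman update, assuming a
# hypothetical three-state chain where each state offers two actions.

GAMMA = 0.9  # discount factor

# Hypothetical environment: transitions[state][action] -> next state
transitions = {
    "s0": {"stay": "s0", "move": "s1"},
    "s1": {"stay": "s1", "move": "s2"},
    "s2": {"stay": "s2", "move": "s2"},  # s2 loops on itself (terminal-like)
}
# Hypothetical reward function R(s, a)
rewards = {
    ("s0", "stay"): 0.0, ("s0", "move"): 0.0,
    ("s1", "stay"): 0.0, ("s1", "move"): 1.0,  # reward for reaching s2
    ("s2", "stay"): 0.0, ("s2", "move"): 0.0,
}

# Start with V(s) = 0 and repeatedly apply
# V(s) = max_a ( R(s, a) + gamma * V(s') ) until the values converge.
V = {s: 0.0 for s in transitions}
for _ in range(100):
    new_V = {
        s: max(rewards[(s, a)] + GAMMA * V[s_next] for a, s_next in actions.items())
        for s, actions in transitions.items()
    }
    converged = all(abs(new_V[s] - V[s]) < 1e-9 for s in V)
    V = new_V
    if converged:
        break

print(V)  # V(s2) = 0.0, V(s1) = 1.0, V(s0) = 0.9 (the discounted future reward)
```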

Why does s’ refer to the next step?

It may seem counterintuitive that the value calculated for the current step refers to the next step, not to the previous one.

It is like that because we can calculate the value of actions only when we have reached the terminal state. At that point, we go backward: at every step we apply the discount factor and add the reward, until we reach the initial step. Then we have the total reward. The sketch below shows this backward pass on a single episode.
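
Here is a minimal sketch of that backward pass, assuming a hypothetical episode in which the reward received at each step is already known (the reward values are made up for illustration).

```python
# Walk a finished episode backward: the value at each step is the reward
# received at that step plus the discounted value of the step after it.

GAMMA = 0.9
rewards = [0.0, 0.0, 1.0, 5.0]  # hypothetical rewards; the last step is terminal

value = 0.0  # the value "after" the terminal state is zero
for r in reversed(rewards):
    value = r + GAMMA * value

print(value)  # total discounted reward as seen from the initial step (= 4.455)
```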

Gamma parameter

The gamma hyperparameter is the discount factor, which tells us how valuable a reward received in a future step is.

A low value encourages the model to ignore long-term rewards and focus on getting a reward quickly (even if it is significantly lower than the possible long-term reward).

A high value of the discount factor emphasizes long-term rewards, so the model may take an action which results in punishment in the current step, but allows it to get a higher reward in the future.
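
To see the effect of gamma, here is a quick illustration with made-up numbers: an immediate reward of 3 versus a reward of 10 that arrives only after 5 steps.

```python
# How much is a delayed reward worth today, for different discount factors?

def discounted(reward, steps_away, gamma):
    return (gamma ** steps_away) * reward

for gamma in (0.5, 0.9, 0.99):
    delayed = discounted(10, 5, gamma)
    print(f"gamma={gamma}: delayed reward worth {delayed:.2f} vs immediate 3.00")

# gamma=0.5  -> 0.31: the model prefers the quick reward of 3
# gamma=0.99 -> 9.51: waiting for the bigger reward pays off
```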

According to the authors of the Move 37 “Reinforcement learning” course, the discount factor typically takes values between 0.9 and 0.99.
