# Using the Boltzmann distribution as the exploration policy in TensorFlow Agents (TF-Agents) reinforcement learning models

In this article, I am going to show you how to use the Boltzmann policy in TF-Agents, how to configure it, and what results to expect from different temperature values.

# Use Boltzmann policy with DQN Agent

When we use the deep Q-network (DQN) agent as our reinforcement learning model, we can enable the Boltzmann policy by passing the boltzmann_temperature parameter to the DqnAgent constructor.


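    import tensorflow as tf
    from tf_agents.agents.dqn import dqn_agent

    # tf_env is the environment implementation; q_net is the Q-network used as the model
    agent = dqn_agent.DqnAgent(
        tf_env.time_step_spec(),
        tf_env.action_spec(),
        q_network=q_net,
        epsilon_greedy=None,  # assumption: epsilon_greedy may default to a non-None value, so it is disabled explicitly
        boltzmann_temperature=0.8,  # <-- this parameter configures the Boltzmann policy
        # tf.train.AdamOptimizer is the TF 1.x name; the compat.v1 alias is also available in TF 2.x
        optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=0.001))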

It is important to remember that **we cannot use both the epsilon_greedy and boltzmann_temperature parameters at the same time** because they configure two different exploration methods, and the agent has only one collect policy.

In the DqnAgent code, there is the following if statement:


    # DQNAgent implementation in Tensorflow-Agents
    # https://github.com/tensorflow/agents/blob/a155216ded2ad151359c6f719149aacc9503b5f5/tf_agents/agents/dqn/dqn_agent.py#L285
    if boltzmann_temperature is not None:
        collect_policy = boltzmann_policy.BoltzmannPolicy(
            policy, temperature=self._boltzmann_temperature)
    else:
        collect_policy = epsilon_greedy_policy.EpsilonGreedyPolicy(
            policy, epsilon=self._epsilon_greedy)

We see that boltzmann_temperature is used to create the proper exploration policy object (called collect_policy in the TF-Agents code). When the parameter is not set, the agent falls back to the epsilon-greedy policy.
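The choice between the two exploration methods can be mimicked in plain Python. This is only a sketch: `select_collect_policy` is a hypothetical helper, not part of TF-Agents, and the `ValueError` for the case when both parameters are set is an assumption of this sketch rather than something shown in the quoted snippet.

```python
def select_collect_policy(epsilon_greedy=None, boltzmann_temperature=None):
    """Hypothetical helper: return the name and setting of the exploration method."""
    # Assumption of this sketch: configuring both methods at once is an error.
    if epsilon_greedy is not None and boltzmann_temperature is not None:
        raise ValueError("Configure either epsilon_greedy or boltzmann_temperature, not both.")
    # Same branching as the quoted if statement: Boltzmann when a temperature is given...
    if boltzmann_temperature is not None:
        return ("boltzmann", boltzmann_temperature)
    # ...and epsilon-greedy otherwise.
    return ("epsilon_greedy", epsilon_greedy)


print(select_collect_policy(boltzmann_temperature=0.8))
print(select_collect_policy(epsilon_greedy=0.1))
```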

# How does it work

While exploring, the agent creates an action distribution. This distribution **describes how good each action looks according to the data gathered by the agent so far**. If you want, you can say that the action distribution describes the agent’s belief about the optimal action.

In the Boltzmann policy implementation, the **logits of the original action distribution get divided by the temperature parameter**. Because of that, the Boltzmann policy turns the agent’s exploration behavior into a **spectrum between picking the action randomly (random policy) and always picking the best action (greedy policy)**.
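Written as a formula, the resulting distribution looks as follows. The notation is mine, not the library's: I assume that the logits of the original distribution are the Q-values $Q(s, a)$ estimated by the network, and I write $\tau > 0$ for the temperature.

$$
\pi_\tau(a \mid s) = \frac{\exp\big(Q(s, a) / \tau\big)}{\sum_{a'} \exp\big(Q(s, a') / \tau\big)}
$$

With $\tau = 1$ this is the ordinary softmax over the logits, so the temperature only rescales the logits before the softmax is applied.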


    # BoltzmannPolicy implementation in Tensorflow-Agents
    # https://github.com/tensorflow/agents/blob/a155216ded2ad151359c6f719149aacc9503b5f5/tf_agents/policies/boltzmann_policy.py#L67
    def _apply_temperature(self, dist):
        """Change the action distribution to incorporate the temperature."""
        logits = dist.logits / self._get_temperature_value()
        return dist.copy(logits=logits)
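To see what the division does to the probabilities, here is a minimal sketch that uses plain Python lists instead of TensorFlow distributions. The functions `apply_temperature` and `softmax` and the three Q-values are made up for illustration; they are not part of TF-Agents.

```python
import math


def apply_temperature(logits, temperature):
    """Divide every logit by the temperature, like the method quoted above."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [logit / temperature for logit in logits]


def softmax(logits):
    """Turn logits into probabilities; subtracting the max avoids overflow."""
    highest = max(logits)
    exps = [math.exp(logit - highest) for logit in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical Q-values of three actions, used as logits.
q_values = [1.0, 2.0, 3.0]

original = softmax(q_values)
scaled = softmax(apply_temperature(q_values, 0.8))

print([round(p, 3) for p in original])
print([round(p, 3) for p in scaled])
```

With the temperature of 0.8 used earlier in the DqnAgent example, the best action gets a slightly higher probability than in the original distribution, because a temperature below 1 stretches the differences between the logits.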

If we specify a small temperature value (below 1), dividing by it magnifies the differences between the logits. The differences between the action probabilities become more substantial, so the action with the highest probability is even more likely to be selected.
**If the temperature parameter is very close to zero, it turns the Boltzmann policy into a greedy policy** because the most probable action gets selected almost all the time.

On the other hand, a **huge value of the temperature parameter** shrinks all logits toward zero and drowns out the original action distribution. As a result, there are almost no differences between the probabilities, and we end up with a **random policy**.
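Both extremes are easy to observe by sampling. The sketch below does not use TF-Agents; `boltzmann_sample` and `action_frequencies` are hypothetical helpers that draw actions with probability proportional to exp(Q / temperature) and count how often each action gets selected. The Q-values and the temperatures are made up.

```python
import math
import random
from collections import Counter


def boltzmann_sample(q_values, temperature, rng):
    """Draw one action index with probability proportional to exp(q / temperature)."""
    highest = max(q_values)
    # Subtracting the max keeps the exponent non-positive, so exp() cannot overflow.
    weights = [math.exp((q - highest) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]


def action_frequencies(q_values, temperature, draws=10_000, seed=0):
    """Return the fraction of draws in which each action was selected."""
    rng = random.Random(seed)
    counts = Counter(boltzmann_sample(q_values, temperature, rng) for _ in range(draws))
    return [counts[action] / draws for action in range(len(q_values))]


# Hypothetical Q-values of three actions; the last action is the best one.
q_values = [1.0, 2.0, 3.0]

for temperature in (0.01, 0.8, 100.0):
    print(temperature, action_frequencies(q_values, temperature))
```

At a temperature of 0.01, practically every draw selects the last action. At a temperature of 100, each of the three actions is selected in roughly one-third of the draws.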

## Bartosz Mikulski

- MLOps engineer by day
- AI and data engineering consultant by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz
- Mastodon: @mikulskibartosz@mathstodon.xyz