Using the Boltzmann distribution as the exploration policy in TF-Agents reinforcement learning models
In this article, I am going to show you how to use the Boltzmann policy in TF-Agents (TensorFlow Agents), how to configure the policy, and what results to expect from various configuration options.
Use Boltzmann policy with DQN Agent
When we use the deep Q-network agent as our reinforcement learning model, we can configure the Boltzmann policy by passing the boltzmann_temperature parameter to the DqnAgent constructor.
```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent

# tf_env is the environment implementation,
# q_net is the neural network used as the model
agent = dqn_agent.DqnAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    q_network=q_net,
    boltzmann_temperature=0.8,  # <-- this parameter configures the Boltzmann policy
    # compat.v1 optimizer, available in TensorFlow 2 as well
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=0.001))
```
It is important to remember that we cannot combine the epsilon_greedy and boltzmann_temperature parameters because they configure two different exploration methods. Since epsilon_greedy has a default value, recent TF-Agents releases may require passing epsilon_greedy=None explicitly when you set boltzmann_temperature.
In the DqnAgent code, there is the following if statement:
```python
# DQNAgent implementation in Tensorflow-Agents
# https://github.com/tensorflow/agents/blob/a155216ded2ad151359c6f719149aacc9503b5f5/tf_agents/agents/dqn/dqn_agent.py#L285
if boltzmann_temperature is not None:
  collect_policy = boltzmann_policy.BoltzmannPolicy(
      policy, temperature=self._boltzmann_temperature)
else:
  collect_policy = epsilon_greedy_policy.EpsilonGreedyPolicy(
      policy, epsilon=self._epsilon_greedy)
```
We see that when boltzmann_temperature is set, the agent wraps its policy in a BoltzmannPolicy and uses it as the exploration policy (called collect_policy in TF-Agents code). Otherwise, it falls back to EpsilonGreedyPolicy.
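To make the difference between the two exploration methods concrete, here is a minimal pure-Python sketch of how each one could pick an action from a list of Q-values. The function names (`epsilon_greedy_action`, `boltzmann_action`) and the Q-values are made up for this illustration. They are not part of TF-Agents, which implements the same ideas with TensorFlow distributions.

```python
import math
import random


def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, otherwise the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann_action(q_values, temperature, rng):
    """Sample an action with probability proportional to exp(q / temperature)."""
    # Subtracting the maximum keeps exp() from overflowing
    # and does not change the resulting probabilities.
    top = max(q_values)
    weights = [math.exp((q - top) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so that the run is repeatable
q_values = [1.0, 2.0, 0.5]  # hypothetical Q-values for three actions

print(epsilon_greedy_action(q_values, epsilon=0.1, rng=rng))
print(boltzmann_action(q_values, temperature=0.8, rng=rng))
```

When epsilon-greedy explores, it treats every action the same, no matter how bad it looks. Boltzmann sampling still prefers actions with higher Q-values while exploring.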
How does it work
While exploring, the agent creates an action distribution. This distribution describes how good each action is according to the data gathered by the agent. If you want, you can say that the action distribution describes the agent’s belief about the optimal action.
In the Boltzmann policy implementation, the logits of the original action distribution get divided by the temperature parameter before an action is sampled. Because of that, the Boltzmann policy turns the agent’s exploration behavior into a spectrum between picking the action randomly (random policy) and always picking the best action (greedy policy).
```python
# BoltzmannPolicy implementation in Tensorflow-Agents
# https://github.com/tensorflow/agents/blob/a155216ded2ad151359c6f719149aacc9503b5f5/tf_agents/policies/boltzmann_policy.py#L67
def _apply_temperature(self, dist):
  """Change the action distribution to incorporate the temperature."""
  logits = dist.logits / self._get_temperature_value()
  return dist.copy(logits=logits)
```
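Written as a formula, the scaled logits correspond to the following action probabilities. This is a sketch under the assumption that the distribution is categorical and turns its logits into probabilities with a softmax. The symbols are my own notation: $z_a$ is the logit of action $a$ (for a DQN agent, presumably the Q-value predicted by the network) and $T$ is the temperature.

$$
\pi(a \mid s) = \frac{\exp\left(z_a / T\right)}{\sum_{b} \exp\left(z_b / T\right)}
$$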
If we specify a very small (but positive) temperature value, dividing by it magnifies the differences between the original logits, so the action with the highest probability is even more likely to be selected. If the temperature parameter is very close to zero, the Boltzmann policy behaves like a greedy policy because the most probable action gets selected almost all the time.
On the other hand, dividing by a huge temperature value shrinks all logits toward zero and erases the information in the original action distribution. As a result, there are almost no differences between the probabilities, and we end up with a random policy.
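The effect of both extremes is easy to check numerically. The sketch below is plain Python, not TF-Agents code. The helper `boltzmann_probabilities` and the logits are hypothetical, and I assume the probabilities are a softmax over the scaled logits.

```python
import math


def boltzmann_probabilities(logits, temperature):
    """Softmax of logits / temperature, mirroring the division in _apply_temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 0.5]  # hypothetical Q-values for three actions

for temperature in (0.01, 0.8, 1.0, 100.0):
    probs = boltzmann_probabilities(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

With the temperature set to 0.01, practically all of the probability mass lands on the second action. With the temperature set to 100, every action gets roughly one third.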
- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz