# How to use a custom metric with TensorFlow Agents

In this article, I am going to implement a custom TensorFlow Agents metric that calculates the maximal discounted reward of an episode.

First, I have to import the metric-related modules and the driver module (the driver runs the simulation). Additionally, I need an environment. I’m going to use the one I implemented in a separate article.

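```python
import numpy as np
import tensorflow as tf

from tf_agents.drivers import dynamic_episode_driver
from tf_agents.drivers import py_driver
from tf_agents.metrics import py_metric
from tf_agents.metrics import tf_py_metric
from tf_agents.policies import random_tf_policy
```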

My metric needs to store the rewards and discounts from the current episode and the maximal discounted total score. For that, I need two lists (for the rewards and discounts of the episode) and one variable to keep the maximal reward.

```python
class MaxEpisodeScoreMetric(py_metric.PyStepMetric):
    def __init__(self, name='MaxEpisodeScoreMetric'):
        super(MaxEpisodeScoreMetric, self).__init__(name)
        self.rewards = []    # rewards of the current episode
        self.discounts = []  # discounts of the current episode
        self.max_discounted_reward = None  # best episode score seen so far
        self.reset()
```

The reset function is mandatory. It clears the whole state, so the same metric instance can be reused by separate driver runs.

```python
    # Add it inside the MaxEpisodeScoreMetric class
    def reset(self):
        self.rewards = []
        self.discounts = []
        self.max_discounted_reward = None
```

In the call function, I append the reward and discount of the current step to the lists. Then, if the current step is also the last step of an episode, I calculate the discounted reward of the episode. Every reward is multiplied by the discount of the preceding step (in the Bellman equation, the discount applies to the value of the next step), and the first reward is not discounted at all.

After that, I compare the total discounted reward of the current episode with the maximal reward. If the new value is larger than the current maximum, or there is no maximum yet, I replace the maximum with the new value.

Because the instance is not reset between episodes, I have to clear the lists that keep the episode rewards and discounts myself.

```python
    # Add it inside the MaxEpisodeScoreMetric class
    def call(self, trajectory):
        # += extends the lists with the elements of the reward/discount arrays
        self.rewards += trajectory.reward
        self.discounts += trajectory.discount
        if trajectory.is_last():
            # Shift the discounts by one step: the first reward is not discounted,
            # every later reward is weighted by the discount of the preceding step
            adjusted_discounts = [1.0] + self.discounts
            # Drop the discount of the last step; no step follows it, so it is never applied
            adjusted_discounts = adjusted_discounts[:-1]
            discounted_reward = np.sum(np.multiply(self.rewards, adjusted_discounts))
            print(self.rewards, adjusted_discounts, discounted_reward)  # debug output
            if (self.max_discounted_reward is None
                    or discounted_reward > self.max_discounted_reward):
                self.max_discounted_reward = discounted_reward
            # Clear the per-episode state, but keep the maximum
            self.rewards = []
            self.discounts = []
```
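To make the arithmetic concrete, here is a small plain-Python sketch of the weighting used in the call function, applied to made-up rewards and discounts. The names shifted_discount_return and cumulative_discount_return are hypothetical helpers, not part of TensorFlow Agents. The second helper is an alternative I am adding only for comparison: it multiplies all preceding discounts together, which is the more common definition of a discounted return when the per-step discount is below 1.

```python
def shifted_discount_return(rewards, discounts):
    """Weight each reward by the discount of the preceding step (as in call)."""
    adjusted = [1.0] + list(discounts[:-1])
    return sum(r * d for r, d in zip(rewards, adjusted))


def cumulative_discount_return(rewards, discounts):
    """Weight each reward by the product of all preceding discounts."""
    total, weight = 0.0, 1.0
    for reward, discount in zip(rewards, discounts):
        total += weight * reward
        weight *= discount
    return total


# Made-up three-step episode; the last discount is 0.0 because the episode ends
rewards = [1.0, 0.5, 2.0]
discounts = [0.9, 0.9, 0.0]

print(shifted_discount_return(rewards, discounts))     # 1.0 + 0.5 * 0.9 + 2.0 * 0.9
print(cumulative_discount_return(rewards, discounts))  # 1.0 + 0.5 * 0.9 + 2.0 * 0.81
```

When every discount before the final step equals 1.0, both helpers return the plain sum of the rewards, so the difference matters only for environments that discount each step.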

In the result function, I don’t need to perform any additional operations, so I return the maximal discounted total reward.

```python
    # Add it inside the MaxEpisodeScoreMetric class
    def result(self):
        return self.max_discounted_reward
```
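The interplay between call, result, and reset is easier to see without a simulator. Below is a minimal sketch that mimics the metric in plain Python. Step and MaxScoreTracker are hypothetical stand-ins I made up for illustration. They are not TensorFlow Agents classes, and Step carries only the three pieces of information the metric reads.

```python
from collections import namedtuple

# Hypothetical stand-in for a trajectory: only the fields the metric reads
Step = namedtuple("Step", ["reward", "discount", "is_last"])


class MaxScoreTracker:
    """Plain-Python imitation of MaxEpisodeScoreMetric (illustration only)."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.rewards = []
        self.discounts = []
        self.max_discounted_reward = None

    def call(self, step):
        self.rewards.append(step.reward)
        self.discounts.append(step.discount)
        if step.is_last:
            adjusted = [1.0] + self.discounts[:-1]
            score = sum(r * d for r, d in zip(self.rewards, adjusted))
            if (self.max_discounted_reward is None
                    or score > self.max_discounted_reward):
                self.max_discounted_reward = score
            # Clear the per-episode state, but keep the maximum
            self.rewards = []
            self.discounts = []

    def result(self):
        return self.max_discounted_reward


tracker = MaxScoreTracker()
episodes = [
    [Step(1.0, 1.0, False), Step(2.0, 0.0, True)],  # score 3.0
    [Step(0.5, 1.0, False), Step(0.5, 0.0, True)],  # score 1.0
]
for episode in episodes:
    for step in episode:
        tracker.call(step)

print(tracker.result())  # 3.0: the maximum survives the weaker second episode
tracker.reset()
print(tracker.result())  # None: reset() forgets the maximum as well
```

The per-episode lists are cleared inside call, while the maximum is cleared only by reset, which is why one instance can track the best score across many episodes of a single run.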

I want to use my metric as a TensorFlow metric, so I have to wrap it in a class extending TFPyMetric.

```python
class TFMaxEpisodeScoreMetric(tf_py_metric.TFPyMetric):
    def __init__(self, name='MaxEpisodeScoreMetric', dtype=tf.float32):
        py_metric = MaxEpisodeScoreMetric()
        super(TFMaxEpisodeScoreMetric, self).__init__(
            py_metric=py_metric, name=name, dtype=dtype)
```

Finally, I can add the metric to the driver’s observers and run the driver.

```python
# tf_env is the environment from the article mentioned in the second paragraph
tf_policy = random_tf_policy.RandomTFPolicy(
    action_spec=tf_env.action_spec(),
    time_step_spec=tf_env.time_step_spec())

max_score = TFMaxEpisodeScoreMetric()
observers = [max_score]
driver = dynamic_episode_driver.DynamicEpisodeDriver(
    tf_env, tf_policy, observers, num_episodes=1000)

final_time_step, policy_state = driver.run()
print('Max score:', max_score.result().numpy())
```

Result:

Max score: 1.715

## Bartosz Mikulski

- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz
- Mastodon: @mikulskibartosz@mathstodon.xyz