How to use a custom metric with TensorFlow Agents

In this article, I am going to implement a custom TensorFlow Agents metric that calculates the maximal discounted reward.

First, I have to import the metric-related modules and the driver module (the driver runs the simulation), together with NumPy, TensorFlow, and the random policy I use later. Additionally, I need an environment. I’m going to use the one I implemented in this article.

import numpy as np
import tensorflow as tf

from tf_agents.metrics import tf_py_metric
from tf_agents.metrics import py_metric
from tf_agents.drivers import py_driver
from tf_agents.drivers import dynamic_episode_driver
from tf_agents.policies import random_tf_policy
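
The tf_env used at the end comes from the linked article, but any TF-Agents environment works as a stand-in. Here is a minimal sketch assuming a Gym-based environment (suite_gym and CartPole-v0 are placeholders, not part of the original setup):

from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment

# placeholder environment; swap in the environment from the linked article
py_env = suite_gym.load('CartPole-v0')
tf_env = tf_py_environment.TFPyEnvironment(py_env)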

My metric needs to store the rewards and discounts from the current episode and the maximal discounted total score. For that, I need two arrays (one for the episode rewards and one for the discounts) and one variable to keep the maximal reward.

class MaxEpisodeScoreMetric(py_metric.PyStepMetric):
  def __init__(self, name='MaxEpisodeScoreMetric'):
    super(MaxEpisodeScoreMetric, self).__init__(name)
    self.rewards = []
    self.discounts = []
    self.max_discounted_reward = None
    self.reset()

The reset function is mandatory; it allows the metric instance to be reused across separate driver runs.

#add it inside the MaxEpisodeScoreMetric class
  def reset(self):
    self.rewards = []
    self.discounts = []
    self.max_discounted_reward = None
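
To illustrate that reuse, here is a minimal sketch; the driver runs themselves are elided:

# illustration only: the same instance serves two separate runs
metric = MaxEpisodeScoreMetric()
# ... first run populates the metric ...
first_max = metric.result()
metric.reset()  # clears the rewards, the discounts, and the stored maximum
# ... second run starts from a clean state ...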

In the call function, I am going to copy the reward and discount of the current step to the arrays. Then, if the current step is also the last step of an episode, I am going to calculate the discounted reward using the Bellman equation.

After that, I compare the total discounted reward of the current episode with the current maximum. If the new value is larger, I replace the maximum with it.

Because the instance is not reset between episodes, I need to clear the lists I use to keep the episode rewards and discounts.

#add it inside the MaxEpisodeScoreMetric class
  def call(self, trajectory):
    self.rewards += trajectory.reward
    self.discounts += trajectory.discount

    if trajectory.is_last():
      adjusted_discounts = [1.0] + self.discounts # because a step has its value + the discount of the NEXT step (Bellman equation)
      adjusted_discounts = adjusted_discounts[:-1] # dropping the discount of the last step because it is not followed by a next step, so the value is useless
      discounted_reward = np.sum(np.multiply(self.rewards, adjusted_discounts))
      print(self.rewards, adjusted_discounts, discounted_reward)

      if self.max_discounted_reward is None:
        self.max_discounted_reward = discounted_reward

      if discounted_reward > self.max_discounted_reward:
        self.max_discounted_reward = discounted_reward

      self.rewards = []
      self.discounts = []
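
To see what the adjusted discounts do, here is a worked example with made-up reward and discount values:

# made-up values for illustration only
rewards = [1.0, 1.0, 1.0]
discounts = [0.9, 0.9, 0.0]

adjusted_discounts = [1.0] + discounts        # [1.0, 0.9, 0.9, 0.0]
adjusted_discounts = adjusted_discounts[:-1]  # [1.0, 0.9, 0.9]

discounted_reward = np.sum(np.multiply(rewards, adjusted_discounts))
print(discounted_reward)  # 1.0*1.0 + 1.0*0.9 + 1.0*0.9 = 2.8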

In the result function, I don’t need to perform any additional operations, so I return the maximal discounted total reward.

#add it inside the MaxEpisodeScoreMetric class
  def result(self):
    return self.max_discounted_reward

I want to use my metric as a TensorFlow metric, so I have to wrap it in a class extending TFPyMetric.

class TFMaxEpisodeScoreMetric(tf_py_metric.TFPyMetric):

  def __init__(self, name='MaxEpisodeScoreMetric', dtype=tf.float32):
    max_score_metric = MaxEpisodeScoreMetric()  # the Python metric that does the actual work

    super(TFMaxEpisodeScoreMetric, self).__init__(
        py_metric=max_score_metric, name=name, dtype=dtype)

Finally, I can add the metric to the driver’s observers and run the driver.

#tf_env is from the article mentioned in the second paragraph
tf_policy = random_tf_policy.RandomTFPolicy(action_spec=tf_env.action_spec(),
                                            time_step_spec=tf_env.time_step_spec())

max_score = TFMaxEpisodeScoreMetric()

observers = [max_score]
driver = dynamic_episode_driver.DynamicEpisodeDriver(tf_env, tf_policy, observers, num_episodes=1000)

final_time_step, policy_state = driver.run()

print('Max score:', max_score.result().numpy())

Result:

Max score: 1.715
