How to use a behavior policy with Tensorflow Agents

In this blog post, I am going to show how to configure a basic behavior policy and use it to interact with a Tensorflow Agents environment.

As examples, I will use two policies: the random policy and the scripted policy. Note that using a neural network to determine the behavior policy is outside the scope of this article.

I defined the environment in the previous article. Copy the imports, the environment class, and the initialization code from that post, and then add the following imports:

from tf_agents.policies import random_py_policy
from tf_agents.policies import scripted_py_policy

The environment represents the bare basics of TicTacToe. I limited the game to a single player whose only goal is to fill the board without picking the same spot more than once.
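
If you do not have the previous article at hand, the snippet below is a minimal sketch of what that setup might look like. The class name, the observation layout, and the reward values are my own assumptions (picked so that the numbers roughly match the results printed later), so prefer the actual implementation from the previous post.

import numpy as np
from tf_agents.environments import py_environment
from tf_agents.environments import tf_py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts

class TicTacToeEnvironment(py_environment.PyEnvironment):  # hypothetical name
  def __init__(self):
    super().__init__()
    self._action_spec = array_spec.BoundedArraySpec((), np.int32, minimum=0, maximum=8)
    self._observation_spec = array_spec.BoundedArraySpec((9,), np.int32, minimum=0, maximum=1)
    self._board = np.zeros((9,), dtype=np.int32)
    self._episode_ended = False

  def action_spec(self):
    return self._action_spec

  def observation_spec(self):
    return self._observation_spec

  def _reset(self):
    self._board = np.zeros((9,), dtype=np.int32)
    self._episode_ended = False
    return ts.restart(self._board)

  def _step(self, action):
    if self._episode_ended:
      return self.reset()
    if self._board[action] == 1:
      # Picking an already-occupied spot ends the episode with a penalty (assumed value).
      self._episode_ended = True
      return ts.termination(self._board, reward=-1.0)
    self._board[action] = 1
    if self._board.sum() == 9:
      # Filling the whole board ends the episode with a bonus (assumed value).
      self._episode_ended = True
      return ts.termination(self._board, reward=1.0)
    # Every other valid move earns a small reward (assumed value).
    return ts.transition(self._board, reward=0.1)

tf_env = tf_py_environment.TFPyEnvironment(TicTacToeEnvironment())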

Random policy

In the previous article, I used a random policy to choose actions, but I implemented it manually. This time, I will pick random actions using a Tensorflow Agents policy.

action_spec = array_spec.BoundedArraySpec((), np.int32, minimum=0, maximum=8)
random_policy = random_py_policy.RandomPyPolicy(time_step_spec=None, action_spec=action_spec)
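
One detail worth knowing before using it: the action function does not return a raw action. It returns a PolicyStep, a named tuple with action, state, and info fields, so the chosen action lives in its action field. A quick check (the printed value is just an example):

step = random_policy.action(tf_env.current_time_step())
print(step.action)  # a random integer between 0 and 8, sampled from action_spec

That is why the loop below passes action.action to the environment instead of the whole PolicyStep.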

Now, I can use the policy to interact with the environment. The only difference between the previous implementation and the following code is that I replaced the tf.random_uniform call with a call to the action function of the random policy (and pass its action field to the environment, as described above).

time_step = tf_env.reset()
rewards = []
steps = []
num_episodes = 10000

for _ in range(num_episodes):
  episode_reward = 0
  episode_steps = 0
  tf_env.reset()
  while not tf_env.current_time_step().is_last():
    action = random_policy.action(tf_env.current_time_step())
    next_time_step = tf_env.step(action.action)  # action() returns a PolicyStep; the environment needs its action field
    episode_steps += 1
    episode_reward += next_time_step.reward.numpy()
  rewards.append(episode_reward)
  steps.append(episode_steps)

num_steps = np.sum(steps)
avg_length = np.mean(steps)
avg_reward = np.mean(rewards)
max_reward = np.max(rewards)
max_length = np.max(steps)

print('num_episodes:', num_episodes, 'num_steps:', num_steps)
print('avg_length', avg_length, 'avg_reward:', avg_reward)
print('max_length', max_length, 'max_reward:', max_reward)

Result:

num_episodes: 10000 num_steps: 44350
avg_length 4.435 avg_reward: -0.6551
max_length 9 max_reward: 1.8000001

Scripted policy

Because I have oversimplified the rules of the game, a simple scripted policy can get the maximal reward every time: it only needs to pick the spots sequentially, one after another. The following scripted policy represents that behavior.

action_spec = array_spec.BoundedArraySpec((), np.int32, minimum=0, maximum=8)
action_script = [(1, np.int32(0)),
                 (1, np.int32(1)),
                 (1, np.int32(2)),
                 (1, np.int32(3)),
                 (1, np.int32(4)),
                 (1, np.int32(5)),
                 (1, np.int32(6)),
                 (1, np.int32(7)),
                 (1, np.int32(8))]

scripted_policy = scripted_py_policy.ScriptedPyPolicy(
    time_step_spec=None, action_spec=action_spec, action_script=action_script)

Note two things. The script is a list of tuples, one per action. The first element of each tuple is the number of times the action should be repeated. I want to perform every action exactly once, so I put 1 in each tuple.

The second element is the action itself, which must have the same shape as the action specification defined in the first line. I specified the shape as an empty tuple, which denotes a scalar (a single numerical value). Therefore, to create the scalar action, I call np.int32, which returns a value of that type.
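
If the action specification were not a scalar, every script entry would hold an array of the matching shape instead. A hypothetical illustration (this two-element action spec is not part of the TicTacToe example; it reuses the np and array_spec imports from earlier):

action_spec_2d = array_spec.BoundedArraySpec((2,), np.int32, minimum=0, maximum=8)
action_script_2d = [(1, np.array([0, 1], dtype=np.int32)),  # performed once
                    (2, np.array([2, 3], dtype=np.int32))]  # repeated twice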

Using a scripted policy

A scripted policy instance does not internally track its position in the script; it returns that information as a policy state that the caller has to pass back in. Because of that, I have to modify the code I use to test the policy.

time_step = tf_env.reset()
rewards = []
steps = []
num_episodes = 10000

for _ in range(num_episodes):
  episode_reward = 0
  episode_steps = 0
  tf_env.reset()
  policy_state = scripted_policy.get_initial_state() #1
  while not tf_env.current_time_step().is_last():
    action = scripted_policy.action(tf_env.current_time_step(), policy_state) #2
    next_time_step = tf_env.step(action.action) #3
    policy_state = action.state #4
    episode_steps += 1
    episode_reward += next_time_step.reward.numpy()
  rewards.append(episode_reward)
  steps.append(episode_steps)

num_steps = np.sum(steps)
avg_length = np.mean(steps)
avg_reward = np.mean(rewards)
max_reward = np.max(rewards)
max_length = np.max(steps)

print('num_episodes:', num_episodes, 'num_steps:', num_steps)
print('avg_length', avg_length, 'avg_reward:', avg_reward)
print('max_length', max_length, 'max_reward:', max_reward)

First, I call the get_initial_state function and store the policy state in a variable (#1). In the inner loop, I pass both the current time step and the policy state to the policy instance (#2). Its action function returns an object that contains both the next action (#3) and the new policy state (#4), which I pass back on the following call.
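
To make the flow more tangible, this is roughly what a single call looks like outside the loop (the printed values are only illustrative):

policy_state = scripted_policy.get_initial_state()
step = scripted_policy.action(tf_env.current_time_step(), policy_state)
print(step.action)  # 0 on the first call, because the script starts with spot 0
print(step.state)   # the policy's position in the script
policy_state = step.state  # pass the updated state back on the next call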
