How to use a behavior policy with TensorFlow Agents

In this blog post, I show how to configure a basic behavior policy and use it to interact with a TensorFlow Agents environment.

As examples, I will use two policies: the random policy and the scripted policy. Note that using a neural network as the behavior policy is outside the scope of this article.

I defined the environment in the previous article. Please copy the imports, the environment class, and the code that initializes it from that text. Then add the following imports:

from tf_agents.policies import random_py_policy
from tf_agents.policies import scripted_py_policy

The environment implements only the bare basics of TicTacToe. I limited the game to one player, whose only goal is to fill the board without picking the same spot more than once.

Random policy

In the previous text, I used a random policy to choose actions, but I was doing it manually. This time, I will implement picking a random action as a TensorFlow Agents policy.

action_spec = array_spec.BoundedArraySpec((), np.int32, minimum=0, maximum=8)
random_policy = random_py_policy.RandomPyPolicy(time_step_spec=None, action_spec=action_spec)

Now, I can use the policy to interact with the environment. The only difference between the previous implementation and the following code is that I replaced the tf.random_uniform function with a call to the action method of the random policy.

time_step = tf_env.reset()
rewards = []
steps = []
num_episodes = 10000

for _ in range(num_episodes):
  episode_reward = 0
  episode_steps = 0
  tf_env.reset()
  while not tf_env.current_time_step().is_last():
    # The policy returns an object whose `action` field holds the chosen action.
    action_step = random_policy.action(tf_env.current_time_step())
    next_time_step = tf_env.step(action_step.action)
    episode_steps += 1
    episode_reward += next_time_step.reward.numpy()
  rewards.append(episode_reward)
  steps.append(episode_steps)

# Summary statistics over all episodes
num_steps = np.sum(steps)
avg_length = np.mean(steps)
avg_reward = np.mean(rewards)
max_reward = np.max(rewards)
max_length = np.max(steps)

print('num_episodes:', num_episodes, 'num_steps:', num_steps)
print('avg_length', avg_length, 'avg_reward:', avg_reward)
print('max_length', max_length, 'max_reward:', max_reward)

Result:

num_episodes: 10000 num_steps: 44350
avg_length 4.435 avg_reward: -0.6551
max_length 9 max_reward: 1.8000001
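
As a rough sanity check on the average length, the expected episode length of a uniformly random player can be computed directly. The sketch below is not part of the original experiment. It assumes that an episode ends on the step that picks an occupied spot, or on the step that fills the board; the function name expected_random_episode_length is made up for this illustration.

```python
from fractions import Fraction


def expected_random_episode_length(num_spots: int = 9) -> Fraction:
    """Expected number of steps taken by a uniformly random player.

    Assumption (not taken from the environment code): the episode ends on
    the step that picks an occupied spot, or when the board is full.
    """
    expected = Fraction(0)
    all_distinct = Fraction(1)  # P(the first k-1 picks hit distinct spots)
    for k in range(1, num_spots + 1):
        # The episode reaches step k only if the first k-1 picks were distinct.
        expected += all_distinct
        # Probability that the k-th pick also lands on a free spot.
        all_distinct *= Fraction(num_spots - (k - 1), num_spots)
    return expected


print(float(expected_random_episode_length()))
```

Under these assumptions, the expected length is about 4.46 steps, which is close to the measured average of 4.435.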


Scripted policy

Because I have oversimplified the rules of the game, a simple scripted policy can get the maximal reward every time. It only has to pick the spots sequentially, one after another. The following scripted policy does exactly that.

action_spec = array_spec.BoundedArraySpec((), np.int32, minimum=0, maximum=8)
action_script = [(1, np.int32(0)), 
                 (1, np.int32(1)),
                 (1, np.int32(2)), 
                 (1, np.int32(3)),
                 (1, np.int32(4)), 
                 (1, np.int32(5)),
                 (1, np.int32(6)), 
                 (1, np.int32(7)),
                 (1, np.int32(8))]

scripted_policy = scripted_py_policy.ScriptedPyPolicy(
    time_step_spec=None, action_spec=action_spec, action_script=action_script)

Note two things. The script is a list of tuples, one per action. The first element of each tuple is the number of times the action should be repeated. I must perform every action exactly once, so I put 1 in every tuple.
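
To make the repeat count concrete, here is a small sketch of how a script of (repeat count, action) tuples unrolls into a flat sequence of actions. It is plain Python for illustration only, not the TF-Agents implementation; expand_script is a hypothetical helper, and plain integers stand in for the np.int32 values.

```python
from typing import Iterator, List, Tuple


def expand_script(script: List[Tuple[int, int]]) -> Iterator[int]:
    """Yield each action as many times as its repeat count says."""
    for repeats, action in script:
        for _ in range(repeats):
            yield action


# One tuple per board spot, each action performed exactly once.
fill_board_script = [(1, spot) for spot in range(9)]
print(list(expand_script(fill_board_script)))

# A repeat count of 2 would pick spot 0 twice in a row.
print(list(expand_script([(2, 0), (1, 1)])))
```

In this game, a repeat count above 1 would pick an occupied spot, which is why every tuple in the script uses 1.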

The second element is the action itself, which must have the same shape as the action specification defined in the first line. I specified the shape as an empty tuple, which denotes a scalar (a single numerical value). Therefore, to create the scalar, I call the numpy type constructor np.int32, which returns a value of that type.
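
A quick way to see why np.int32(...) matches a specification with an empty shape is to compare shapes and data types in plain numpy. This check does not touch TF-Agents, and the variable names are chosen only for this illustration.

```python
import numpy as np

scalar_action = np.int32(4)                    # the form used in the script
array_action = np.array([4], dtype=np.int32)   # a one-element array instead

# A numpy scalar has the empty tuple as its shape.
print(scalar_action.shape, scalar_action.dtype)
# A one-element array has shape (1,), so it is not a scalar.
print(array_action.shape, array_action.dtype)
```

Both values have the int32 data type, but only the first one has the scalar shape that the action specification asks for.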

Using a scripted policy

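An instance of the scripted policy class does not store any information about the current position in the script. That information lives in the policy state, which the caller passes in and gets back on every call. Because of that, I have to modify the code I use to test the policy.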

time_step = tf_env.reset()
rewards = []
steps = []
num_episodes = 10000

for _ in range(num_episodes):
  episode_reward = 0
  episode_steps = 0
  tf_env.reset()
  # Start every episode from the beginning of the script.
  policy_state = scripted_policy.get_initial_state() #1
  while not tf_env.current_time_step().is_last():
    action_step = scripted_policy.action(tf_env.current_time_step(), policy_state) #2
    next_time_step = tf_env.step(action_step.action) #3
    policy_state = action_step.state #4
    episode_steps += 1
    episode_reward += next_time_step.reward.numpy()
  rewards.append(episode_reward)
  steps.append(episode_steps)

# Summary statistics over all episodes
num_steps = np.sum(steps)
avg_length = np.mean(steps)
avg_reward = np.mean(rewards)
max_reward = np.max(rewards)
max_length = np.max(steps)

print('num_episodes:', num_episodes, 'num_steps:', num_steps)
print('avg_length', avg_length, 'avg_reward:', avg_reward)
print('max_length', max_length, 'max_reward:', max_reward)

First, I call the get_initial_state method and store the policy state in a variable (#1). In the inner loop, I pass both the current time step and the policy state to the policy (#2). Its action method does not return a bare action. It returns an object that contains the action, which I pass to the environment (#3), and the new policy state, which I keep for the next iteration (#4).
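
The pattern of threading the policy state through the loop can be sketched without TF-Agents. The classes below are hypothetical stand-ins written for this illustration. They are not the library's classes, and the real policy state may have a different structure.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    """Hypothetical stand-in for the object returned by a policy."""
    action: int
    state: int  # index of the next action in the script


class StatelessScriptedPolicy:
    """Illustration only: stores the script but no position in it."""

    def __init__(self, actions: List[int]) -> None:
        self._actions = actions

    def get_initial_state(self) -> int:
        return 0

    def action(self, state: int) -> Step:
        # The state supplied by the caller decides which action comes next.
        return Step(action=self._actions[state], state=state + 1)


policy = StatelessScriptedPolicy(list(range(9)))

for _ in range(2):  # two episodes share one policy object
    state = policy.get_initial_state()
    picked = []
    while state < 9:
        step = policy.action(state)
        picked.append(step.action)
        state = step.state
    print(picked)
```

Because the position lives in the caller's variable, resetting it to the initial state at the start of each episode is enough to replay the script from the beginning.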



Bartosz Mikulski
Bartosz Mikulski * big data engineer * conference speaker * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group