# How to use a behavior policy with TensorFlow Agents

In this blog post, I am going to show how to configure a basic behavior policy and use it to interact with a TensorFlow Agents environment.

As examples, I will use two policies: the random policy and the scripted policy. Note that using a neural network to determine the behavior policy is beyond the scope of this article.

I defined the environment in the previous article. Copy the imports, the environment class, and the initialization code from that post. Then, add the following imports:

import numpy as np

from tf_agents.specs import array_spec
from tf_agents.policies import random_py_policy
from tf_agents.policies import scripted_py_policy


The environment represents the bare basics of TicTacToe. I limited the game to a single player, whose only goal is to fill the board without picking the same spot more than once.
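The full environment class lives in the previous article. As a rough, hypothetical sketch of the rules described above (the class name and the exact reward values here are illustrative, not the actual implementation):

```python
import numpy as np

# Hypothetical sketch of the simplified one-player TicTacToe rules;
# the real environment class is defined in the previous article.
class BoardSketch:
    def __init__(self):
        self.reset()

    def reset(self):
        self.board = np.zeros(9, dtype=np.int32)
        self.done = False

    def step(self, action):
        if self.done:
            self.reset()
        if self.board[action] != 0:
            # Picking an occupied spot ends the episode with a penalty
            # (illustrative value).
            self.done = True
            return -1.0
        self.board[action] = 1
        if self.board.all():
            # Filling the whole board ends the episode with a reward
            # (illustrative value).
            self.done = True
            return 1.0
        return 0.1  # small reward for a valid intermediate move (illustrative)
```

With these assumed values, a perfect game of nine distinct moves would collect 8 × 0.1 + 1.0 = 1.8, which matches the maximal reward reported in the experiments below.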

# Random policy

In the previous text, I used a random policy to choose actions, but I was doing it manually. This time, I will implement picking a random action as a TensorFlow Agents policy.

action_spec = array_spec.BoundedArraySpec((), np.int32, minimum=0, maximum=8)
random_policy = random_py_policy.RandomPyPolicy(time_step_spec=None, action_spec=action_spec)
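Conceptually, RandomPyPolicy samples an action uniformly from the bounds of the action spec. Here is a minimal NumPy sketch of that idea (not the library's actual implementation; the function name is hypothetical):

```python
import numpy as np

# Sketch of uniform sampling within the bounds of the action spec
# defined above (minimum=0, maximum=8). RandomPyPolicy's real
# implementation lives in tf_agents; this only mimics the behavior.
def sample_random_action(minimum=0, maximum=8, rng=np.random.default_rng()):
    # The upper bound of a BoundedArraySpec is inclusive, hence +1.
    return np.int32(rng.integers(minimum, maximum + 1))

actions = [sample_random_action() for _ in range(1000)]
```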


Now, I can use the policy to interact with the environment. The only difference between the previous implementation and the following code is that I replaced the tf.random_uniform call with a call to the action function of the random policy.

time_step = tf_env.reset()
rewards = []
steps = []
num_episodes = 10000

for _ in range(num_episodes):
    episode_reward = 0
    episode_steps = 0
    tf_env.reset()
    while not tf_env.current_time_step().is_last():
        # action() returns a PolicyStep; the chosen action is its `action` field.
        action_step = random_policy.action(tf_env.current_time_step())
        next_time_step = tf_env.step(action_step.action)
        episode_steps += 1
        episode_reward += next_time_step.reward.numpy()
    rewards.append(episode_reward)
    steps.append(episode_steps)

num_steps = np.sum(steps)
avg_length = np.mean(steps)
avg_reward = np.mean(rewards)
max_reward = np.max(rewards)
max_length = np.max(steps)

print('num_episodes:', num_episodes, 'num_steps:', num_steps)
print('avg_length', avg_length, 'avg_reward:', avg_reward)
print('max_length', max_length, 'max_reward:', max_reward)


Result:

num_episodes: 10000 num_steps: 44350
avg_length 4.435 avg_reward: -0.6551
max_length 9 max_reward: 1.8000001


# Scripted policy

Because I have oversimplified the rules of the game, a simple scripted policy can get the maximal reward every time: it just has to pick the spots sequentially, one after another. The following scripted policy represents exactly that behavior.

action_spec = array_spec.BoundedArraySpec((), np.int32, minimum=0, maximum=8)
action_script = [(1, np.int32(0)),
                 (1, np.int32(1)),
                 (1, np.int32(2)),
                 (1, np.int32(3)),
                 (1, np.int32(4)),
                 (1, np.int32(5)),
                 (1, np.int32(6)),
                 (1, np.int32(7)),
                 (1, np.int32(8))]

scripted_policy = scripted_py_policy.ScriptedPyPolicy(
    time_step_spec=None, action_spec=action_spec, action_script=action_script)


Note two things. The script is a list of tuples. The first element of each tuple is the number of times the action should be repeated; since every spot must be picked exactly once, I put 1 in every tuple.

The second element is the action itself, and it must match the shape of the action specification defined in the first line. I specified the shape as an empty tuple, which is a special value that denotes a scalar (a single numerical value). Therefore, to create the scalar, I call a NumPy function that returns a value of the given type (in this case, int32).
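To make the (num_repeats, action) format concrete, here is a small sketch of how such a script could be expanded into the flat sequence of actions the policy will emit. The helper function is hypothetical, not part of tf_agents:

```python
import numpy as np

# Hypothetical helper: expand a list of (num_repeats, action) tuples
# into the flat sequence of actions a scripted policy would emit.
def expand_script(action_script):
    actions = []
    for num_repeats, action in action_script:
        actions.extend([action] * num_repeats)
    return actions

script = [(1, np.int32(0)), (2, np.int32(5)), (1, np.int32(8))]
# expand_script(script) yields the sequence 0, 5, 5, 8
```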

## Using a scripted policy

An instance of the scripted policy class does not internally track its position in the script; the policy state has to be passed in and carried between calls. Because of that, I have to modify the code I use to test the policy.

time_step = tf_env.reset()
rewards = []
steps = []
num_episodes = 10000

for _ in range(num_episodes):
    episode_reward = 0
    episode_steps = 0
    tf_env.reset()
    policy_state = scripted_policy.get_initial_state()  #1
    while not tf_env.current_time_step().is_last():
        action = scripted_policy.action(tf_env.current_time_step(), policy_state)  #2
        next_time_step = tf_env.step(action.action)  #3
        policy_state = action.state  #4
        episode_steps += 1
        episode_reward += next_time_step.reward.numpy()
    rewards.append(episode_reward)
    steps.append(episode_steps)

num_steps = np.sum(steps)
avg_length = np.mean(steps)
avg_reward = np.mean(rewards)
max_reward = np.max(rewards)
max_length = np.max(steps)

print('num_episodes:', num_episodes, 'num_steps:', num_steps)
print('avg_length', avg_length, 'avg_reward:', avg_reward)
print('max_length', max_length, 'max_reward:', max_reward)


First, I call the get_initial_state function and store the policy state in a variable (#1). In the inner loop, I pass both the current time step and the policy state to the policy (#2). Its action function returns a PolicyStep object that contains both the chosen action (#3) and the new policy state (#4).
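The pattern of threading the policy state through the loop can be sketched in plain Python. In this simplified stand-in, an integer index plays the role of the policy state, and the script is flattened to one action per entry (so this is not the actual ScriptedPyPolicy internals, just the state-passing idea):

```python
import numpy as np

# Simplified stand-in for a stateless scripted policy: the position in
# the script (`index`) is the policy state, passed in and returned
# rather than stored on the policy object.
def scripted_action(action_script, policy_state):
    index = policy_state
    action = action_script[index]
    return action, index + 1  # (action, new policy state)

script = [np.int32(i) for i in range(9)]
state = 0  # plays the role of get_initial_state()
taken = []
for _ in range(len(script)):
    action, state = scripted_action(script, state)
    taken.append(int(action))
```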

Remember to share on social media!