How to use a behavior policy with TensorFlow Agents

In this blog post, I am going to show how to configure a basic behavior policy and use it to interact with a TensorFlow Agents environment.

As examples, I will use two policies: the random policy and the scripted policy. Note that using a neural network as the behavior policy is beyond the scope of this article.

I defined the environment while writing the previous article, so please copy the imports, the environment class, and the initialization code from that text. In addition, you will need the following imports:

from tf_agents.policies import random_py_policy
from tf_agents.policies import scripted_py_policy

The environment implements the bare basics of TicTacToe. I limited the game to only one player, and the player's only goal is to fill the board without picking the same spot more than once.
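For readers who do not have the previous article at hand, the game logic can be sketched in plain Python. This is my reconstruction, not the actual environment class; the reward values (0.1 per valid move, 1.0 for filling the last spot, -1.0 for a repeat) are inferred from the results printed later in this post and may differ from the original.

```python
class SimplifiedTicTacToe:
    """One-player board-filling game (a sketch of the environment logic).

    Rewards are reconstructed from the results below: 0.1 per valid move,
    1.0 for the final move that fills the board, -1.0 for repeating a spot.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        # 3x3 board flattened into 9 cells; 0 = empty, 1 = taken.
        self.board = [0] * 9

    def step(self, action):
        """Mark cell `action`; return (reward, episode_finished)."""
        if self.board[action]:
            # Picking an occupied spot ends the episode with a penalty.
            return -1.0, True
        self.board[action] = 1
        if all(self.board):
            # Filling the last empty spot wins the game.
            return 1.0, True
        return 0.1, False
```

With this scheme, a perfect game of 9 moves collects 8 × 0.1 + 1.0 = 1.8 reward, which matches the maximal reward reported in the results below.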

Random policy

In the previous text, I used a random policy to choose actions, but I was doing it manually. This time I will implement picking a random action as a TensorFlow Agents policy.

action_spec = array_spec.BoundedArraySpec((), np.int32, minimum=0, maximum=8)
random_policy = random_py_policy.RandomPyPolicy(time_step_spec=None, action_spec=action_spec)
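Conceptually, with time_step_spec set to None, the random policy just samples uniformly from the closed range defined by the spec. A plain-NumPy sketch of that sampling (my illustration, not the library's implementation):

```python
import numpy as np

rng = np.random.default_rng()

def random_action(minimum=0, maximum=8):
    # Uniform sample over the closed range [minimum, maximum], matching
    # the BoundedArraySpec above; Generator.integers' upper bound is
    # exclusive, hence the + 1.
    return np.int32(rng.integers(minimum, maximum + 1))
```

The policy object adds value over this one-liner mainly through the uniform interface: every TF-Agents policy exposes the same action function, so the surrounding loop does not change when the policy does.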

Now, I can use the policy to interact with the environment. The only difference between the previous implementation and the following code is that the call to the tf.random_uniform function is replaced with a call to the action function of the random policy.

time_step = tf_env.reset()
rewards = []
steps = []
num_episodes = 10000

for _ in range(num_episodes):
  episode_reward = 0
  episode_steps = 0
  tf_env.reset()
  while not tf_env.current_time_step().is_last():
    action_step = random_policy.action(tf_env.current_time_step())
    next_time_step = tf_env.step(action_step.action)
    episode_steps += 1
    episode_reward += next_time_step.reward.numpy()
  rewards.append(episode_reward)
  steps.append(episode_steps)
  
num_steps = np.sum(steps)
avg_length = np.mean(steps)
avg_reward = np.mean(rewards)
max_reward = np.max(rewards)
max_length = np.max(steps)

print('num_episodes:', num_episodes, 'num_steps:', num_steps)
print('avg_length', avg_length, 'avg_reward:', avg_reward)
print('max_length', max_length, 'max_reward:', max_reward)

Result:

num_episodes: 10000 num_steps: 44350
avg_length 4.435 avg_reward: -0.6551
max_length 9 max_reward: 1.8000001

Scripted policy

Because I have oversimplified the rules of the game, a simple scripted policy can achieve the maximal reward every time: it just picks the spots sequentially, one after another. The following scripted policy represents such behavior.

action_spec = array_spec.BoundedArraySpec((), np.int32, minimum=0, maximum=8)
action_script = [(1, np.int32(0)), 
                 (1, np.int32(1)),
                 (1, np.int32(2)), 
                 (1, np.int32(3)),
                 (1, np.int32(4)), 
                 (1, np.int32(5)),
                 (1, np.int32(6)), 
                 (1, np.int32(7)),
                 (1, np.int32(8))]

scripted_policy = scripted_py_policy.ScriptedPyPolicy(
    time_step_spec=None, action_spec=action_spec, action_script=action_script)

Note two things. The script is represented as a list of tuples. The first element of every tuple is the number of times the action should be repeated. I must perform every action exactly once, so I put 1 in every tuple.

The second element is the action itself, which must have the same shape as the action specification in the first line. I specified the shape as an empty tuple, which is a special value that denotes a scalar (a single numerical value). Therefore, to create each action I call np.int32, a NumPy function that returns a scalar of the given type.
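The (repeat count, action) encoding can be unrolled to see exactly which actions the policy will emit over an episode; a quick sketch:

```python
import numpy as np

# Same script as above: every action performed exactly once, in order.
action_script = [(1, np.int32(i)) for i in range(9)]

def unroll(script):
    # Repeat each action `count` times, preserving the script order -
    # this is the sequence a scripted policy steps through.
    return [action for count, action in script for _ in range(count)]
```

Unrolling the script above yields the actions 0 through 8 in order; a script such as [(2, 7)] would yield the action 7 twice.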

Using a scripted policy

An instance of the scripted policy class does not store any information about the current step. Because of that, I have to modify the code I use to test the policy.

time_step = tf_env.reset()
rewards = []
steps = []
num_episodes = 10000

for _ in range(num_episodes):
  episode_reward = 0
  episode_steps = 0
  tf_env.reset()
  policy_state = scripted_policy.get_initial_state() #1
  while not tf_env.current_time_step().is_last():
    action = scripted_policy.action(tf_env.current_time_step(), policy_state) #2
    next_time_step = tf_env.step(action.action) #3
    policy_state = action.state #4
    episode_steps += 1
    episode_reward += next_time_step.reward.numpy()
  rewards.append(episode_reward)
  steps.append(episode_steps)
  
num_steps = np.sum(steps)
avg_length = np.mean(steps)
avg_reward = np.mean(rewards)
max_reward = np.max(rewards)
max_length = np.max(steps)

print('num_episodes:', num_episodes, 'num_steps:', num_steps)
print('avg_length', avg_length, 'avg_reward:', avg_reward)
print('max_length', max_length, 'max_reward:', max_reward)

First, I call the get_initial_state function and store the policy state in a variable (#1). In the inner loop, I pass both the current time step and the policy state to the policy instance (#2). Its action function returns not just the next action but a PolicyStep object that contains both the chosen action (#3) and the new policy state (#4).
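The state-threading pattern in #1-#4 is generic and worth internalizing. Stripped of the TF-Agents types, it looks like this; ToyScriptedPolicy is a hypothetical stand-in whose state is just an index into the script:

```python
class ToyScriptedPolicy:
    """Hypothetical stand-in for a scripted policy: the state is simply
    the index of the next scripted action."""

    def __init__(self, script):
        self.script = script

    def get_initial_state(self):
        return 0  # start at the first scripted action

    def action(self, observation, state):
        # Return the next action together with the advanced state,
        # mirroring PolicyStep(action=..., state=...) in TF-Agents.
        return self.script[state], state + 1

policy = ToyScriptedPolicy(list(range(9)))
state = policy.get_initial_state()
actions = []
for _ in range(9):
    # The caller owns the state and threads it through every call.
    act, state = policy.action(None, state)
    actions.append(act)
```

Because the caller carries the state, the policy object itself stays immutable between calls, which is why the same scripted policy instance can be reused across episodes as long as you reset the state with get_initial_state.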

Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?

Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!

Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.

Bartosz Mikulski


  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz