In this tutorial, we walk through the Proximal Policy Optimization (PPO) algorithm as realized in this specific implementation. In the following sections, we introduce the algorithm, discuss some of its key features, outline the expected behavior of some key metrics, and explain the configuration file in detail.
At its core, PPO is a soft trust region policy optimization approach. The classic TRPO approach by John Schulman formulates the problem as maximizing the likelihood of actions subject to the constraint that the change in policy (measured by KL divergence) stays below a certain threshold.
PPO, on the other hand, takes a soft approach: the algorithm is encouraged to make updates that lie within a “trust region”, i.e. at most delta away from the original policy in terms of KL divergence. Although KL divergence is not a strict distance metric in theory, in this document it will generally be used to indicate how far apart two policies are.
There are two major variants of PPO at the moment, one by OpenAI and the other by DeepMind. The first, commonly referred to as `clip` mode, pessimistically clips the probability ratio (which can be seen as an importance sampling weight). The other, commonly referred to as `adapt` mode, adaptively penalizes the KL divergence between the old and new policy (imagine the KL penalty as a regularization term).
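To make the two variants concrete, here is a minimal PyTorch-style sketch of the two surrogate losses. The tensor names (`log_prob_new`, `log_prob_old`, `advantage`) and the default hyperparameter values are illustrative and do not correspond to the exact code in this repository.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective (the `clip` variant).

    `ratio` is the importance sampling weight pi_new(a|s) / pi_old(a|s).
    It is pessimistically clipped to [1 - eps, 1 + eps] so that updates
    moving too far from the old policy get no extra credit.
    """
    ratio = torch.exp(log_prob_new - log_prob_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Maximize the surrogate => minimize its negation.
    return -torch.min(ratio * advantage, clipped * advantage).mean()

def ppo_adapt_loss(log_prob_new, log_prob_old, advantage, kl, beta):
    """KL-penalized surrogate objective (the `adapt` variant).

    `kl` is the estimated KL divergence between old and new policy, and
    `beta` is a penalty coefficient adapted between updates depending on
    whether the measured KL over- or under-shoots a target.
    """
    ratio = torch.exp(log_prob_new - log_prob_old)
    return -(ratio * advantage).mean() + beta * kl
```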
In its vanilla form, PPO is an on-policy algorithm: the distribution the learner optimizes must be the same as the one the agents used to generate the data. However, in a distributed setting such as Surreal, we cannot guarantee such synchronization. Thus, we apply a few modifications to the algorithm to stabilize and improve its performance.
- Target Network: Due to our design's asynchronous nature and inevitable network latency, the actors' behavior policy that generates the experience trajectories can lag behind the learner's policy by several updates at the time of gradient computation. To address this issue, the PPO learner keeps a target network that is broadcast to all actors at a lower frequency. This ensures that a much larger portion of the experience trajectories is on-policy, except for those generated within the policy lag. Empirically, we find that the target network mechanism reduces off-policyness and improves PPO's performance significantly.
- Observation Z-filtering: we keep a running mean and variance of the observations. At each iteration, we pass the raw observation through this z-filter to obtain a zero-mean, unit-variance whitened observation, which is then fed to the actor/critic network. The running mean and variance are released at the same time as the actor/critic network parameters (a minimal sketch of such a filter is given after this list). Despite its effectiveness on low-dimensional input, z-filtering does not work with pixel-based inputs (at least on HalfCheetah): because the camera is centered on the agent itself, the input image becomes black after z-filtering. On the note of normalization, LayerNorm and BatchNorm were briefly considered as replacements for the z-filter, but neither turned out to be ideal.
- LSTM: instead of taking only the observation at a timestep as input, our implementation also takes a tuple of hidden states from previous timesteps using an LSTM. Notably, this introduces the idea of a horizon: the horizon is used as the depth of the generalized advantage estimate when computing advantages for every data point in a truncated rollout. We found that the LSTM, when used with action repeat, is highly beneficial for longer-horizon tasks such as our robotic benchmark.
- Other modifications include using generalized advantage estimation (details here) and batch advantage normalization; a short sketch of the advantage computation follows this list. For full implementation details and pseudocode, please see the manuscript.
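As referenced in the z-filtering item above, below is a minimal sketch of a running mean/variance normalizer, assuming a Welford-style update; Surreal's actual filter is additionally synchronized with the parameter releases described above, so treat this only as an illustration of the idea.

```python
import numpy as np

class ZFilter:
    """Running mean/variance normalizer for low-dimensional observations.

    Simplified sketch: maintains Welford running statistics and whitens
    each incoming observation to zero mean and unit variance.
    """
    def __init__(self, shape, eps=1e-8):
        self.n = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)   # running sum of squared deviations
        self.eps = eps

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def __call__(self, x):
        self.update(x)
        var = self.m2 / max(self.n - 1, 1)
        return (x - self.mean) / np.sqrt(var + self.eps)
```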
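And here is the short sketch of the advantage computation mentioned in the last item: generalized advantage estimation over a truncated rollout, followed by batch advantage normalization. The variable names and the `gamma`/`lam` defaults are illustrative, not the exact values used in this codebase.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.97):
    """Generalized advantage estimation over a truncated rollout.

    `values` has one more entry than `rewards` (the bootstrap value for
    the final state). Returns one advantage per step of the rollout.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # Batch advantage normalization: zero mean, unit variance per batch.
    return (advantages - advantages.mean()) / (advantages.std() + 1e-8)
```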
Key Metrics:
- `_pol_kl`: KL divergence between the target and the new policy. This should fluctuate and should not exceed `2 * kl_target`. If it does, a large penalty is incurred in the `adapt` method and you should expect a dip in performance (a sketch of one such adaptive penalty rule follows this list).
- `_avg_return_target`: highly correlated with the performance of the agents. This metric denotes the average N-step return for each update. You can see when the agents stop learning by looking at this metric.
- `_ref_behave_diff`: tracks how different the target network is from the policy used to generate the data. With the target network, this should be in the range of `1e-4` or lower.
- `_avg_log_sig`, `entropy`: these two tell you the state of your algorithm. A decreasing `_avg_log_sig` means the algorithm is zoning in on a specific policy. Both should generally decrease as training progresses if the agent is learning useful information; they may increase in the first few iterations, but in the long term they should decrease.
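To ground the `_pol_kl` discussion above, the sketch below shows one common way an adaptive KL penalty coefficient reacts to the measured KL. The thresholds and scaling factors are illustrative conventions from the PPO literature, not necessarily the exact rule used in this implementation.

```python
def update_kl_penalty(beta, measured_kl, kl_target, scale=1.5):
    """Illustrative adaptive rule: grow the penalty when the measured KL
    overshoots the target, shrink it when the policy barely moves."""
    if measured_kl > kl_target * scale:
        beta *= 2.0   # policy moved too far: penalize harder
    elif measured_kl < kl_target / scale:
        beta *= 0.5   # policy barely moved: relax the penalty
    return beta
```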
The code for Surreal PPO is contained in the following files:
- Custom PPO Learner class
- Custom PPO Agent class
- Custom PPO Model class
- PPO network architectures (Note: we use TorchX for our networks)
- PPO configurations
The configuration file contains many hyperparameters. In the following sections, we go through each group in detail. For each algorithm, we have three nested dictionary configs that separate learner, environment, and session configurations. For PPO in particular, there are many relevant parameters; in this section we explain the most important ones in detail.
Learner Config:
- `learner_config.model`: contains model architecture design such as `actor_fc_hidden_sizes` and `cnn_feature_dim`
- `learner_config.algo.network`: contains the optimizer, learning rate, and annealing setup. Note that `network.lr_scheduler` should specify a class in either `torch.optim.lr_scheduler` or `torchx.nn.hyper_scheduler`
- `learner_config.algo.rnn`: specifies the LSTM layers and hidden units
- `learner_config.algo.consts`: specifies some training constants, such as the initial log standard deviation `consts.init_log_sig` and the target KL divergence for each parameter release `consts.kl_target`
- `learner_config.algo.adapt_consts`: specifies hyperparameters specific to `adapt` PPO. Important hyperparameters include `adapt_consts.kl_cutoff_coeff`, the coefficient of the KL penalty applied when the KL divergence of an update exceeds twice the target KL divergence
- `learner_config.replay`: specifies the replay buffer used. For PPO we use `FIFOQueue` for `replay.replay_class` and a small queue length `replay.memory_size`; in this case we use 96
- `learner_config.parameter_publish.exp_interval`: specifies how often the learner pushes parameters to the Parameter Server. A value of 4096 denotes the number of sub-trajectories processed before parameters are published. With a batch size of 64, we publish parameters every `4096 / 64 = 64` mini-batches.
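To illustrate how these keys nest, below is a hedged sketch of a partial `learner_config`. Values marked as placeholders are not the defaults shipped with Surreal, and the `rnn` sub-key names are hypothetical; consult the PPO configuration file for the authoritative layout.

```python
# Illustrative partial learner config; not the shipped defaults.
learner_config = {
    'model': {
        'actor_fc_hidden_sizes': [300, 200],   # placeholder sizes
        'cnn_feature_dim': 256,                # placeholder
    },
    'algo': {
        'network': {
            # class name from torch.optim.lr_scheduler or torchx.nn.hyper_scheduler
            'lr_scheduler': 'ExponentialLR',   # placeholder choice
        },
        'rnn': {'num_layers': 1, 'hidden_size': 100},   # hypothetical key names
        'consts': {'init_log_sig': -1.0, 'kl_target': 0.015},   # placeholders
        'adapt_consts': {'kl_cutoff_coeff': 250},               # placeholder
    },
    'replay': {'replay_class': 'FIFOQueue', 'memory_size': 96},
    'parameter_publish': {'exp_interval': 4096},
}
```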
Environment Config:
- `env_config.action_repeat`: specifies how many times the input action is repeated before the actor is allowed to input the next action. We find this highly impactful for Robotic Manipulation benchmark tasks
- `env_config.stochastic_eval`: specifies whether the agent uses a deterministic (mean) or stochastic (sampled around the mean with the standard deviation) policy for evaluation
- `env_config.demonstration`: specifies the setup for sampling states from collected expert demonstrations
  - `demonstration.adaptive`: whether we use an adaptive or open-loop curriculum. In the adaptive setup, the curriculum updates only if certain improvement criteria are met, whereas in the open-loop setup the curriculum is updated at a fixed episode interval. Please read our manuscript and appendix for full details
  - `demonstration.mixing` and `demonstration.mixing_ratio`: `demonstration.mixing` specifies which of the `random`, `uniform`, `forward`, and `reverse` curricula we want to use, and `demonstration.mixing_ratio` specifies how often each should be used. Note that the values in `demonstration.mixing_ratio` must sum to `1.0`
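A hedged sketch of a partial `env_config` with the keys discussed above; the values are placeholders, not recommended settings.

```python
# Illustrative partial environment config; values are placeholders.
env_config = {
    'action_repeat': 10,        # placeholder
    'stochastic_eval': True,
    'demonstration': {
        'adaptive': True,
        # one mixing ratio per curriculum; the ratios must sum to 1.0
        'mixing': ['random', 'uniform', 'forward', 'reverse'],
        'mixing_ratio': [0.25, 0.25, 0.25, 0.25],
    },
}
```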
Session Config:
- `session_config.agent.fetch_parameter_mode`: specifies how the agent should poll parameters from the Parameter Server. The choice is either `step` (every certain number of steps) or `episode` (every certain number of episodes)
- `session_config.agent.fetch_parameter_interval`: specifies how often agents poll the Parameter Server for new parameters. If no new parameters are available, this is a no-op
- `session_config.checkpoint`: specifies the interval for checkpointing models
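Finally, a hedged sketch of a partial `session_config`; the values are placeholders and the `checkpoint` sub-key name is hypothetical.

```python
# Illustrative partial session config; values and checkpoint sub-key are illustrative.
session_config = {
    'agent': {
        'fetch_parameter_mode': 'step',    # or 'episode'
        'fetch_parameter_interval': 100,   # placeholder
    },
    'checkpoint': {'interval': 1000},      # hypothetical key name
}
```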