This folder holds all of the environment definitions used for training, as well as the reward functions. Envs are unified for Cassie and Digit.

All environments should inherit from the abstract class `GenericEnv`. This ensures that all of the functions required for training (`reset`, `step`, `get_state`, `compute_reward`, etc.) are defined. `GenericEnv` also defines the functions that are likely to be the same across all robot envs, such as resetting the simulation, stepping the simulation forward, dynamics randomization, and trackers.
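As a rough sketch of the required structure (method names are taken from this doc; the import path and exact signatures in `GenericEnv` are assumptions and may differ), a child env looks something like:

```python
from env.genericenv import GenericEnv  # assumed import path


class MyTaskEnv(GenericEnv):
    """Minimal sketch of a child env; signatures are illustrative only."""

    def __init__(self, robot_name: str, **kwargs):
        super().__init__(robot_name=robot_name, **kwargs)
        # Define every class variable used in get_state() here, so a dummy
        # state can be built by the end of __init__ (see the notes below).
        self.x_velocity = 0.0

    def reset(self):
        # Reset the sim, randomize dynamics/commands, return the initial state.
        ...

    def step(self, action):
        # Step the simulation forward and update any per-step bookkeeping.
        ...

    def get_state(self):
        # Policy input: robot state first, then appended command inputs.
        ...

    def compute_reward(self, action):
        # Build a dict of raw cost terms and combine them using reward_weight.
        ...

    def compute_done(self):
        # Termination condition.
        ...
```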
The environment definitions are split up between "tasks" and "robots", so that the same task environment definition can be used for different robots. These "task" environments live in the `env/tasks` folder. Each one holds on to a `BaseRobot` object that defines the robot-specific attributes. Note that the envs are robot agnostic, so any task environment can be used with any robot (at least where applicable). For example, to train Digit walking instead of Cassie, one only needs to change the `robot_name` arg; a separate environment definition is not required. The robots are defined in the `env/robots` dir. The currently implemented robots are `Cassie` and `Digit`, which both inherit from `BaseRobot`. They create the actual simulation object `self.sim` that you'll interact with. Note that the environment should be blind to the simulator type and should interact with it using only the functions defined in `generic_sim.py`. This is so that any simulator type can be used with any environment seamlessly. See the sim documentation for more details. Additionally, each robot dir defines dynamics randomization ranges in `dynamics_randomization.json`. See below for more details.
The "child" environments like LocomotionEnvClock
and LocomotionEnv
are the final object that the PPO training will see. These are what actually define the functions required by GenericEnv
. Here you can define what the final inputs to the policy will be, how resets and commands are handled, and any extra things to handle in a policy step.
We define two main walking environments as examples: `LocomotionClockEnv` and `LocomotionEnv`. Both are compatible with both robots by simply setting the `robot_name` arg. For most of our research we use `LocomotionClockEnv`.
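For instance, constructing one of these directly might look like the snippet below (the import path follows the naming convention described later in this doc; the exact constructor arguments and the `robot_name` strings are assumptions):

```python
from env.tasks.locomotionclockenv.locomotionclockenv import LocomotionClockEnv  # assumed path

env = LocomotionClockEnv(robot_name="cassie")  # or robot_name="digit"; exact strings may differ
env.reset()  # remember to reset before using the env (see the notes further below)
```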
Each child env will have `compute_reward` and `compute_done` functions that define the reward function and termination condition that the training will use. Note that the functions themselves are defined in the `rewards` folder and are loaded in the `__init__` when the env is created. Rewards consist of two files: a "function" file that defines the actual reward function and termination condition, and a "weighting" json file (always called `reward_weight.json`) that defines the scaling and weighting of each reward term. Function files should be named the same as the folder they are in. For example, to make a new reward, the file `my_reward.py` and `reward_weight.json` should be in the folder `rewards/my_reward/`. This allows the reward to be automatically loaded just by specifying which reward function to use with a string input to the env constructor.
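Conceptually, the naming convention lets the reward be resolved from its string name, roughly like the sketch below (the package path and the repo's actual loading code are assumptions; this only illustrates the folder/file convention):

```python
import importlib
import json
import os


def load_reward(reward_name: str):
    """Illustrative sketch: resolve rewards/<name>/<name>.py and its reward_weight.json."""
    module = importlib.import_module(f"rewards.{reward_name}.{reward_name}")  # assumed package path
    with open(os.path.join("rewards", reward_name, "reward_weight.json")) as f:
        reward_weight = json.load(f)
    return module, reward_weight
```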
The `compute_reward` function itself constructs a dictionary of named "raw" cost/error components, where the lower the magnitude, the higher the reward. For example, "x_vel" in `locomotion_vonmises_clock_reward` is the absolute difference between the current velocity and the target velocity. At the end, all of the reward components are added up by scaling each component, putting it through a kernel function (usually exp(-x)), and then weighting it. These scalings and weights are defined in `reward_weight.json` and get loaded into the `self.reward_weight` dictionary at env construction. The maximum reward is almost always 1, so it's easier to compare rewards between runs and it scales nicely with the trajectory length.
The difference between weightings and scalings is that scalings are applied before the kernel function and weightings after. The scalings are meant to scale the error terms into a range the kernel function can act on (roughly 0-2). For example, some force-based components are in units of N, and thus can be in the hundreds, which is way outside the kernel function's effective range. Scaling them down by 0.01 allows the reward component to actually have an effect.

Weightings are meant to specify what portion of the reward function that reward component takes up, i.e. how important that component is. For this reason, it can be useful to make the sum of the reward weightings equal to 1, so that each individual weighting represents what percentage of the reward it makes up. For example, "x_vel" is 15% of the total reward. Note that this is not strictly necessary, as the env will automatically scale the reward weights so that they sum to 1 anyway.
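Putting scalings, kernels, and weightings together, the final reward is conceptually computed like the sketch below (the term names, values, and the structure of the weighting dict are illustrative, not the repo's actual `reward_weight.json` format):

```python
import numpy as np

# Hypothetical reward_weight-style dict: each term has a scaling (applied
# before the kernel) and a weighting (applied after the kernel).
reward_weight = {
    "x_vel":       {"scaling": 1.0,  "weighting": 0.15},
    "grf":         {"scaling": 0.01, "weighting": 0.10},  # forces in N get scaled down
    "orientation": {"scaling": 1.0,  "weighting": 0.75},
}

# Raw error/cost components, as built up inside compute_reward().
q = {"x_vel": 0.3, "grf": 150.0, "orientation": 0.05}

total_reward = 0.0
for name, raw_error in q.items():
    scaled = reward_weight[name]["scaling"] * raw_error
    kernel = np.exp(-scaled)  # maps error -> (0, 1], 1 when the error is zero
    total_reward += reward_weight[name]["weighting"] * kernel

print(total_reward)  # max possible is 1 when the weightings sum to 1 and all errors are 0
```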
Remember that the env is blind to the simulator type, so when getting state info from the simulator use the `self.sim` functions; do not assume that it is a Mujoco instance and that you have access to `self.sim.data`.
The ranges for dynamics randomization are specified in each env's `dynamics_randomization.json` file. Ranges are specified in terms of percentages, so `[-0.25, 0.25]` corresponds to +-25% of the nominal value. However, note that this is not the case for `ipos` randomization, where the ranges instead correspond to absolute distance in meters, so `[-0.01, 0.01]` corresponds to +-1 cm of randomization in the body's CoM location. Except for friction and encoder noise, each randomization type specifies a range for each of the bodies/joints to randomize. Each could have different ranges if you want, but we always keep them equal. If you don't want to randomize a body or joint, simply remove it from the list.
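As an illustration of the two conventions (percentage ranges for most parameters, absolute ranges in meters for `ipos`), the randomization could conceptually be applied like this, using hypothetical nominal values:

```python
import numpy as np

# Percentage-based range: [-0.25, 0.25] means +-25% of the nominal value.
nominal_damping = 2.0
damping = nominal_damping * (1.0 + np.random.uniform(-0.25, 0.25))

# ipos range: [-0.01, 0.01] is an absolute offset in meters (+-1 cm) on the body's CoM location.
nominal_ipos = np.array([0.05, 0.0, 0.1])  # hypothetical CoM offset
ipos = nominal_ipos + np.random.uniform(-0.01, 0.01, size=3)
```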
Our locomotion/walking envs utilize a clock in both the state input and the reward to regulate the walking motion. These clocks are defined in `periodicclock.py` with the `PeriodicClock` class, and any env that uses a clock should have it as a class variable. `PeriodicClock` takes in the following arguments:
- `cycle_time`: How long in seconds a single cycle should last.
- `phase_add`: How much in seconds to increment the clock by each step. This, in conjunction with the cycle time, defines the effective stepping frequency. This should just be set to the env's policy rate (since that's how much time passes each env step), and stepping frequency should instead be controlled by changing the `cycle_time`.
- `swing_ratios`: What percentage of the cycle should be spent in swing (foot in the air). Should be a 2-long list, one swing ratio for each foot.
- `period_shifts`: The shift value for each foot, i.e. how much to offset each foot's clock by. Should be a 2-long list, one shift value for each foot. `[0, 0]` (or any list where both values are equal) corresponds to the feet being perfectly synced up (hopping). `[0, 0.5]` corresponds to the feet being directly out of phase with each other (walking).
There are getter/setter functions for each of these variables. You should use the `increment` function after each env step in order to increment the clock forward. This will add `phase_add` to the internal `phase` variable as well as take care of any wraparound (phase is always between 0 and `cycle_time`).
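A rough usage sketch (the constructor signature is assumed from the argument list above, and the import path is a guess, so both may differ from the real class):

```python
from env.util.periodicclock import PeriodicClock  # assumed import path

# Hypothetical values: 0.8 s cycle, incremented by one 50 Hz policy step per call,
# 50% swing for each foot, feet half a cycle out of phase (walking).
clock = PeriodicClock(
    cycle_time=0.8,
    phase_add=1.0 / 50,
    swing_ratios=[0.5, 0.5],
    period_shifts=[0.0, 0.5],
)

for _ in range(100):      # one policy rollout
    # ... env step happens here ...
    clock.increment()     # advances phase by phase_add and handles wraparound
```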
`PeriodicClock` has a couple of options to use for input clocks:

- `input_clock`: A basic clock signal that consists of just a sine and cosine pair.
- `input_sine_only_clock`: Consists of two sine functions, one to represent each leg. Note that this clock empirically requires an LSTM to learn, since in the case when the two sine functions line up (when period shifts are equal) there is no way to know where you are in the cycle without a history/memory (which is why the basic clock uses a sine/cosine pair). This old clock type isn't great and should be avoided; use the full clock below instead.
- `input_full_clock`: Same as the sine-only clock in that it has separate representations for each leg, but has a full sine/cosine pair for each leg to resolve the equal-period-shift degenerate case described above.
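Conceptually, a sine/cosine pair per leg encodes the phase like the sketch below (this illustrates the idea only and is not the exact formula used in `PeriodicClock`):

```python
import numpy as np


def full_clock(phase, cycle_time, period_shifts):
    """Illustrative per-leg sine/cosine pair; not the library's exact implementation."""
    signal = []
    for shift in period_shifts:
        angle = 2 * np.pi * (phase / cycle_time + shift)
        signal += [np.sin(angle), np.cos(angle)]
    return np.array(signal)


print(full_clock(phase=0.2, cycle_time=0.8, period_shifts=[0.0, 0.5]))
```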
`PeriodicClock` also has clock functions for use in reward functions, usually to describe the actual motion of the feet, whereas the input clocks detailed above are just clock functions to describe where you are in a periodic cycle. The reward clock functions here actually use the swing ratio values. The two types are:

- `linear_clock`: Just a piecewise linear clock. Goes from 0 to 1 and linearly interpolates in between. Takes in a `percent_transition` argument to control how quickly to transition between stance (0) and swing (1), i.e. the slope of the linear interpolation. `percent_transition` is what percentage of the swing time to use for the linear transition. Outputs a swing clock value for the left and right leg.
- `von_mises`: A clock function using the von Mises distribution for actually (mathematically) smooth transitions between 0 and 1. In this case the `std` argument controls the "slope" of the transition. Note that the von Mises function can be slow to compute, so during training, when the clock will be queried multiple times per episode, it can be useful to precompute all of the von Mises values once and then query them later. To facilitate this we provide the `precompute_von_mises` and `get_von_mises_values` functions. To use these, first set the `cycle_time`, `swing_ratios`, and `period_shifts` values as desired and then call `precompute_von_mises()`, which will compute the von Mises values along the cycle (from 0 to `cycle_time`; you can control the density with the `num_points` argument). You can then call the `get_von_mises_values()` function, which will give you von Mises clock values for each leg by interpolating between the precomputed values at the current phase value.
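A hedged usage sketch, continuing the hypothetical `clock` object from the earlier sketch (whether `get_von_mises_values()` returns a two-element pair is an assumption based on the description above):

```python
# After setting cycle_time / swing_ratios / period_shifts via their setters,
# precompute the values once per configuration, then query during the episode.
clock.precompute_von_mises()                            # density controlled by num_points
left_swing, right_swing = clock.get_von_mises_values()  # interpolated at the current phase
```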
The envs follow a similar naming scheme to the rewards. To add a new env called `NewEnv`, make a new folder `env/tasks/newenv/` which should contain the file `newenv.py`. This will allow for out-of-the-box usage with the training code. You can then just specify the `env_name` argument to be "NewEnv" alongside a robot with the `robot_name` arg, and things will get loaded correctly automatically. `newenv.py` will define the class `NewEnv`, which will inherit from `GenericEnv` (or even `LocomotionEnv` or `LocomotionClockEnv` if your new env is based off of a walking task) and should define `reset`, `step`, `compute_reward`, `compute_done`, and `get_state`, as well as the mirroring and interactive control functions described below. Alternatively, if you only need to make minor changes from an existing env, you can just inherit from that.
For example, let's say I want to make another walking env that doesn't care about sidestepping and thus doesn't need a y velocity input. Basically, I want `LocomotionClockEnv` with just a small change to the policy inputs. Then I can make a new env `EnvForwardOnly` which inherits from `LocomotionClockEnv`, and the only thing I have to do is overload the `_get_state` function with my modifications. One note about `_get_state` and the policy input state: later, for logging purposes, we assume that the robot state always comes first, so when making your env you should remember to always append to `self.get_robot_state()`. This is also necessary for the mirror indices, as they assume the state starts with the robot state.
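A rough sketch of that example env, assuming a hypothetical import path and command variable names (`self.x_velocity`, `self.turn_rate`, and the clock call are illustrative and may not match the real `LocomotionClockEnv`):

```python
import numpy as np

from env.tasks.locomotionclockenv.locomotionclockenv import LocomotionClockEnv  # assumed path


class EnvForwardOnly(LocomotionClockEnv):
    """Walking env without a y velocity command input (illustrative sketch)."""

    def _get_state(self):
        # Robot state always comes first; command inputs are appended after it.
        return np.concatenate((
            self.get_robot_state(),
            [self.x_velocity, self.turn_rate],  # hypothetical commands, no y velocity
            self.clock.input_full_clock(),      # assuming this returns an array of clock values
        ))
```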
You may want to add additional arguments to the environment's constructor to allow for different options. In this case, in addition to adding them to the env's constructor, you also need to add them to the `args` dictionary in the `get_env_args` function. The elements of `args` map the argument name to a tuple of (default value, description of argument). This maintains compatibility with the env_factory (`util/env_factory.py`), which handles the command line argument parser for both training and evaluating.
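A hedged sketch of adding a new option, using a hypothetical `max-speed` argument (the existing argument names and the exact structure of the repo's `get_env_args` are assumptions; only the name -> (default, description) mapping comes from the description above):

```python
def get_env_args():
    # Maps argument name -> (default value, description of argument).
    args = {
        "robot-name":  ("cassie", "Which robot to use"),
        "policy-rate": (50, "Rate in Hz at which the policy runs"),
        "max-speed":   (1.5, "Hypothetical new option: maximum commanded x velocity"),
    }
    return args
```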
Some extra notes to pay attention to when making your own environment:

- You should define all needed class variables in the `__init__` function, including whatever will be part of the input. Basically, by the end of `__init__` you need to make sure you can call `get_state` to at least get a dummy state to define the observation size.
- As a result of the above, remember to call `reset()` after making an environment object. Some inputs (like commands) may not be properly set just by object construction, and you need to call `reset()` to randomize and set all inputs to valid values.
- For use in hardware evaluation (the `digit_udp.py` script), Digit envs require some additions that Cassie envs do not:
  - For logging purposes they require lists of names for the action output, the proprioceptive "robot state" input, and the other "extra" command inputs. These are set/defined by the `set_logging_fields` function. You're not likely to change the output and robot state, so those name lists are defined in `Digit` and can just be called in your custom Digit envs with `super()`. The `extra_input_names` list is what you will need to make, with just a string name for each extra input beyond the robot state. Note that the `set_logging_fields` function itself is called in `Digit`'s `__init__`, so any class variables you might need in your `set_logging_fields` need to be defined before the `super().__init__` call. See `Digit` for an example.
  - Envs also require an additional function, `hw_step`, which is intended to be a minimal environment step function to be called on hardware. Your regular env `step` function likely does some extra stuff beyond just stepping the simulation forward; there are often extra class variables to track and update (like incrementing `orient_add` and the clock in `LocomotionClockEnv`). The `hw_step` function exists to allow you to update whatever you need between each policy step on hardware. So for `LocomotionClockEnv`, `hw_step` just increments the `orient_add` and `clock`.
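For instance, a `hw_step` for a clock-based env might be as small as the sketch below (the exact `orient_add` bookkeeping is a guess, and `self.turn_rate` / `self.policy_rate` are hypothetical attribute names):

```python
def hw_step(self):
    # Minimal per-policy-step bookkeeping for hardware: no simulation stepping here.
    # Hypothetical update: advance the heading command by the commanded turn rate
    # over one policy step (the repo's actual bookkeeping may differ).
    self.orient_add += self.turn_rate * (1.0 / self.policy_rate)
    self.clock.increment()
```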
Though the policies/envs usually run at 50Hz (you can actually set this to whatever you want with the "policy-rate" argument), the underlying simulation itself runs at 2kHz. You'll notice that in the `step` function we call `step_simulation` with `simulator_repeat_steps`, which will step the simulation forward `simulator_repeat_steps` number of times. So if we want the env to run at 50Hz, since our simulation runs at 2kHz, we will step the simulation forward 40 times in every env `step` call.
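The repeat count is just the ratio of the two rates, e.g. (variable names here are illustrative):

```python
sim_freq = 2000                                    # simulation rate in Hz
policy_rate = 50                                   # env/policy rate in Hz
simulator_repeat_steps = sim_freq // policy_rate   # 2000 / 50 = 40 sim steps per env step
```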
For reward purposes, there are often 2kHz signals that you would like to track, like foot forces, foot velocities, etc. In this case you can add what we call a "tracker" to the env. These are meant to track and average signals that happen faster than the policy rate.

To add your own, first define some tracker function `update_my_tracker(self, weighting: float, sim_step: int)`. `weighting` is the weighting used for averaging, and is dependent on the policy rate and the tracker rate. `sim_step` tells you which simulation forward step you are currently at in the `simulator_repeat_steps` loop. This is usually used just to reset the value at the beginning of each loop, i.e. when `sim_step` is 0. You can look at `update_tracker_grf` for an example. After defining your tracker function, you simply add it to the `self.trackers` dictionary in your env's `__init__` along with the frequency you want it to run at, anything from the policy rate to 2kHz. Note that these calls can add up, especially expensive computations that involve collisions or forces, and calling these trackers at a high rate will slow down your sampling speed. The tracker frequency should be an even multiple of the policy rate; if it is not, the env will throw a warning and round down to the closest even multiple. Once you've added your tracker to the `self.trackers` dictionary, it will be automatically called in the `step_simulation` loop, and you can grab whatever class variable it updated in the reward function or wherever else you want.
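A hedged sketch of adding a custom tracker, following the pattern described above (the registration format, the sim accessor, and the averaging scheme are all assumptions; see `update_tracker_grf` in the repo for the real convention):

```python
from env.tasks.locomotionclockenv.locomotionclockenv import LocomotionClockEnv  # assumed path


class MyDigitEnv(LocomotionClockEnv):
    """Illustrative env that tracks a sim-rate signal for use in the reward."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.foot_vel_avg = 0.0
        # Register the tracker along with the frequency (Hz) to run it at.
        self.trackers[self.update_tracker_foot_vel] = {"frequency": 100}  # assumed dict format

    def update_tracker_foot_vel(self, weighting: float, sim_step: int):
        # Reset the running average at the start of each simulator_repeat_steps loop.
        if sim_step == 0:
            self.foot_vel_avg = 0.0
        # Accumulate a weighted average of a signal that changes at the sim rate.
        self.foot_vel_avg += weighting * self.sim.get_body_velocity("left-foot")  # assumed sim accessor
```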
To better encourage symmetric motions we utilize a "mirror loss" in our training. To facilitate this, each env needs to have a `get_action_mirror_indices` and a `get_observation_mirror_indices` function that define the "mirror indices" for the policy output and input respectively.
The mirror indices function by indicating which elements in the action/obs array need to be swapped and negated. For example, let's say the elements of my obs array are `[0, 1, 2, 3, 4, 5]`, where the first 3 represent the left leg and the last 3 represent the right leg. To get the "mirror state", where the left and right leg are swapped, we need to swap the first 3 indices with the last 3. To indicate this we define the observation mirror indices as `[3, 4, 5, 0, 1, 2]`, showing that for the mirror state the first observation is the 3rd index of the original state, and so on.
Since we are always mirroring about the sagittal plane, and due to the definition of the robots' kinematics, some indices need to be negated as well. For example, hip roll and yaw need to be negated, since the mirror of a clockwise rotation on the left side is a counter-clockwise rotation on the right side. The direction of rotation flips, so we need to negate as well. Let's say that in my original obs array, indices 0, 1 and 3, 4 are the left/right hip roll/yaw respectively. Then the full observation mirror indices are `[-3, -4, 5, -0.1, -1, 2]`. Note that `0` turned into `-0.1` so that the negative sign will actually be interpretable. Note that the Digit model definition is weird and has all indices negated, which is unintuitive.
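To make the convention concrete, here is a small hypothetical helper showing how such mirror indices could be applied to an observation (this is an illustration of the indexing convention, not the repo's actual mirroring code):

```python
import numpy as np


def mirror_obs(obs, mirror_inds):
    """Illustrative only: apply swap-and-negate mirror indices to an obs array."""
    mirrored = np.zeros_like(obs)
    for i, idx in enumerate(mirror_inds):
        j = int(abs(round(idx)))            # -0.1 rounds to 0, i.e. index 0
        sign = -1.0 if idx < 0 else 1.0     # negative entries flip the sign
        mirrored[i] = sign * obs[j]
    return mirrored


obs = np.array([0.3, -0.1, 0.8, 0.2, 0.4, 0.9])  # [left roll, left yaw, ..., right roll, right yaw, ...]
mirror_inds = [-3, -4, 5, -0.1, -1, 2]
print(mirror_obs(obs, mirror_inds))
```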
You can just grab the action mirror indices from the Cassie/Digit parent envs, since you are unlikely to modify the action output. For the observation mirror indices, look at one of the existing envs for an example. You are unlikely to change the proprioceptive "robot state", so the mirror inds for those can be gotten from `CassieEnv`/`DigitEnv`. Then swap and negate the appended command indices as needed. Remember, if you don't need to mirror a command input, just keep its index as its actual index in the observation.
One of our policy visualization options (see `eval.py` and `evaluation_factory.py`) is "interactive" evaluation, where you can use the keyboard to command the policy. To use this, your env needs to implement 3 functions: `_init_interactive_key_bindings`, `_update_control_commands_dict`, and `interactive_control`.
`_init_interactive_key_bindings` adds entries to the `self.input_keys_dict` dictionary, which maps keyboard keys to a description and a lambda function of what the key should actually do (usually increment/decrement one of the command inputs). For example, look at `LocomotionClockEnv`: it maps the "w" key to incrementing the x velocity, using the lambda function to add 0.1 to `self.x_velocity`.
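A hedged sketch of what such an entry could look like (the exact entry format of `self.input_keys_dict` is an assumption; check `LocomotionClockEnv` for the real one):

```python
def _init_interactive_key_bindings(self):
    # Maps a keyboard key to a description and the action to take when it is pressed.
    self.input_keys_dict["w"] = {
        "description": "increment x velocity",
        "func": lambda env: setattr(env, "x_velocity", env.x_velocity + 0.1),
    }
```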
`_update_control_commands_dict` populates this dictionary with the current corresponding internal command inputs. This gets called after a keyboard press to update the dictionary with the new values for the later printouts.
`interactive_control` is defined in `GenericEnv` and will be the same across all envs; it just takes in the parsed keyboard character and passes it to the above two functions. If there are other special cases that you'd like to add, they would go here. For example, the `LocomotionClockEnv` env has an extra hidden key command, "0", which resets all of the commands to zero.
You can also use an Xbox controller to change the commands during eval. To use a controller with a USB adapter, you will need to install the xone Linux driver. Otherwise only a wired connection will work; Bluetooth does not seem to be supported by the inputs library that we use. For Mac users, you'll have to find another compatible driver for the Microsoft USB adapter.
Xbox interactive control functions very similarly to keyboard control; your environment just needs to define an additional `_init_interactive_xbox_bindings` function, which also creates a dictionary, except the keys are now Xbox controls. Look at the class variables of the `XboxController` class to see the available controls. Joysticks and triggers are continuous values in (-1, 1) and (0, 1) respectively, while all other controls are discrete button values of 0 or 1. In the interactive Xbox control implementation, the bumpers are reserved for defining layers, similar to layers on a keyboard. This can be used to map the same button/joystick to multiple actions. For example, in `LocomotionClockEnv`, d-pad up and down control the cycle time, but holding down the right bumper maps to the second layer, where d-pad up and down instead control the period shift. There are 4 total layers: no bumpers, left bumper, right bumper, and both bumpers held down. Note that beyond this case, multiple inputs at once are not supported; holding down the A button and moving the joystick, or pressing two buttons at once, will all be treated as separate, distinct events.
There are a few button combinations that are reserved for special functions outside of the env, in `interactive_xbox_eval` in `evaluation_factory.py` itself. These are:
- Start: Toggles pausing/unpausing the viewer
- Right Bumper + Start: Exits the script
- Right Bumper + Back: Reset the env
- Right Bumper + Left Bumper + Start: Re-print the controls menu
Another thing to note is that in the current environment examples, the continuous joystick values are used additively, rather than scaling the absolute joystick values. For example, moving the joystick increases the speed command, and not touching anything just holds the current speed command, rather than the joystick directly mapping to the speed command. This makes things a bit easier to use and deals with large command ranges better, and is also easier to implement (we can just clip values rather than having to remap them, which is tricky when the command ranges are non-symmetric). However, this does make things dependent on the rate your script is running at: the faster your loop runs, the more times the current joystick value gets added to a command. To help equalize things across different run rates you can change the environment's Xbox scaling factor, `env.xbox_scale_factor`. By default this is 1, but if you feel like the joystick/triggers are too sensitive you can make it smaller.