[Feature] multiagent data standardization: PPO advantages #2677
Hello!
Context
So, in this library we frequently apply standardization to data (also called normalization), where we subtract the mean from the data and divide it by the standard deviation.
A few places where we do this that come to mind are the transforms (where it can be applied to observations and rewards) and the advantages in PPO. Please let me know if you can think of other places.
Issue
The issue is that, normally, the mean and std statistics are computed over all dimensions. But in multiagent settings, where for example a reward tensor can have an "agent" dimension, this is not the proper way to do it: different agents might have different reward distributions, and naively standardizing across them mixes their statistics and distorts the values each agent sees.
Example
For example, consider the case where we have 2 agents with 2 rewards each (`agent_1_r = [100, 200]`, `agent_2_r = [0, -1]`), structured in a 2x2 tensor:
`r = torch.tensor([[100, 200], [0, -1]])`
If we standardize using `mean(r)` and `std(r)`, we get `r = torch.tensor([[0.3, 1.5], [-0.9, -0.91]])`, which is bad.
If we instead standardize using `mean(r, dim=1)` and `std(r, dim=1)`, hence excluding the agent dimension from the statistics, we get `r = torch.tensor([[-1, 1], [1, -1]])`, which is better as it does not cross-pollute statistics across agents (see the quick check below).
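Here is a quick, runnable check of the two computations above (using the population std, i.e. `unbiased=False`, which is what reproduces the numbers shown):

```python
import torch

# Rewards for 2 agents x 2 samples; the agent dimension is dim 0.
r = torch.tensor([[100.0, 200.0], [0.0, -1.0]])

# Naive standardization: statistics computed over all elements.
naive = (r - r.mean()) / r.std(unbiased=False)
# ~[[ 0.30,  1.51], [-0.90, -0.91]]  -> the two agents' statistics are mixed

# Per-agent standardization: keep the agent dim out of the reduction,
# i.e. compute mean/std only over dim 1.
mean = r.mean(dim=1, keepdim=True)
std = r.std(dim=1, unbiased=False, keepdim=True)
per_agent = (r - mean) / std
# [[-1.,  1.], [ 1., -1.]]  -> each agent is standardized with its own statistics
```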
Proposal
We should adapt all places in the library where standardization is used to take a list of dimensions to exclude from the statistics. This way, multiagent users can pass the agent dimension as input and still use the standardization feature.
What I did here
I did this for the advantages in PPO.
To do it, I made a nice standardization util function that takes as input a list of dimensions to exclude from the statistics (essentially a more flexible implementation of a functional layer norm).
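For reference, here is a minimal sketch of what such a util could look like. The name `standardize`, the `exclude_dims` parameter, and the `eps` handling are illustrative assumptions, not necessarily the exact signature added in this PR:

```python
import torch


def standardize(x: torch.Tensor, exclude_dims=(), eps: float = 1e-8) -> torch.Tensor:
    """Standardize ``x`` to zero mean / unit std, computing the statistics over
    all dimensions except those listed in ``exclude_dims`` (e.g. the agent dim).

    Illustrative sketch, not the exact implementation in the PR.
    """
    if isinstance(exclude_dims, int):
        exclude_dims = (exclude_dims,)
    # Normalize negative dims so that e.g. -2 and x.ndim - 2 mean the same axis.
    exclude_dims = {d % x.ndim for d in exclude_dims}
    # Reduce over every dimension that is not excluded.
    reduce_dims = [d for d in range(x.ndim) if d not in exclude_dims]
    if not reduce_dims:
        return x
    mean = x.mean(dim=reduce_dims, keepdim=True)
    std = x.std(dim=reduce_dims, keepdim=True)
    return (x - mean) / std.clamp_min(eps)
```

With an advantage tensor shaped `[*batch, n_agents, 1]`, a multiagent user could then call `standardize(adv, exclude_dims=(-2,))` so each agent is standardized with its own statistics.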