Overestimated Value Function in Actor Critic Framework #124
Implementing SAC would resolve this, I think.
I opened this issue partially to share my experience and to record the current limitations. But for sure, working on SAC would be an amazing way forward regardless.
If I remember correctly, the 'Distributional Q-learning' approach from researchers at Google Brain addresses this issue: https://www.youtube.com/watch?v=ba_l8IKoMvU
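For context, a toy illustration of the distributional idea (not taken from the linked talk): instead of a single scalar, the critic outputs a categorical distribution over a fixed support of return "atoms", and the scalar Q-value is just its expectation. The numbers below are made up for the example.

```python
import numpy as np

# C51-style toy example: a categorical return distribution over fixed atoms.
v_min, v_max, n_atoms = -10.0, 10.0, 51
support = np.linspace(v_min, v_max, n_atoms)      # return atoms z_i
probs = np.random.dirichlet(np.ones(n_atoms))     # stand-in for the network output p_i(s, a)
q_value = float(np.sum(support * probs))          # Q(s, a) = sum_i z_i * p_i(s, a)
```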
I'm playing with the idea of trying to implement SAC for btgym. It might be a bit of a stretch for my RL skill set, but it could be an interesting challenge in itself. The benefits of working with well-established RL frameworks are clear. So, a few questions come to mind in that regard:
@JacobHanouna,
just throw out the embedded algorithms and use btgym as a standalone gym-API environment; some refactoring may be necessary (e.g. btgym uses its own spaces, but those can easily be rolled back to standard gym spaces) to use it with frameworks like RLlib;
those have been intentionally adapted to the domain, while general implementations already exist
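For what it's worth, a rough sketch of that "standalone gym-API environment" route, assuming btgym's `BTgymEnv` constructor as shown in the project README and RLlib's `register_env` helper; the space conversion mentioned above would live inside the creator function, and exact RLlib APIs differ between versions.

```python
# Sketch only: register btgym as a plain gym-API environment for RLlib.
from ray.tune.registry import register_env


def btgym_env_creator(env_config):
    # Imported lazily so each RLlib worker constructs its own environment.
    from btgym import BTgymEnv
    # Any conversion from btgym's own spaces to standard gym spaces
    # would be wrapped around the env here.
    return BTgymEnv(**env_config)  # e.g. env_config={'filename': '<path to CSV data>'}


register_env('btgym-v0', btgym_env_creator)
```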
Seems this problem is deeper than I thought:
@Kismuz,
I believe I have encountered a framework (A3C) limitation.
While training a few of my recent models I noticed a strange behavior. For the first part of training everything seems to work fine, as indicated by the TensorBoard metrics (total reward and value function increase while entropy decreases). After a couple of thousand steps, the total reward and value function metrics no longer correlate. At first in a modest way (the value function continues to increase while the total reward hovers in place), but then what happens can be described as a policy breakdown (both metrics crash, entropy shoots up, and agent actions seem to be almost random).
I searched online to try to identify the problem. I now believe that the issue I'm experiencing is a well-known limitation, value function overestimation, as described in the following paper (which also proposes a way to mitigate the problem):
Addressing Function Approximation Error in Actor-Critic Methods
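Roughly, the mitigation in that paper (TD3) keeps two critics and bootstraps from the smaller of the two target estimates, plus smoothing noise on the target action. A minimal NumPy sketch, with placeholder callables standing in for the target networks:

```python
import numpy as np

def clipped_double_q_target(reward, done, next_state, q1_target, q2_target,
                            target_policy, gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """TD3-style bootstrap target; q1_target, q2_target and target_policy
    are placeholder callables standing in for the target networks."""
    # Target policy smoothing: add clipped noise to the target action.
    next_action = target_policy(next_state)
    noise = np.clip(np.random.normal(0.0, noise_std, size=np.shape(next_action)),
                    -noise_clip, noise_clip)
    next_action = next_action + noise
    # Clipped double-Q: take the minimum of the two critics to curb overestimation.
    min_q = np.minimum(q1_target(next_state, next_action),
                       q2_target(next_state, next_action))
    # Standard one-step bootstrapped target.
    return reward + gamma * (1.0 - done) * min_q
```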
The above solution also seems to be used in more advanced actor-critic frameworks (which is really interesting in itself):
Soft Actor-Critic Algorithms and Applications
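For comparison, SAC's soft target reuses the same twin-critic minimum and adds an entropy bonus; again just a sketch with hypothetical placeholders, where `alpha` is the entropy temperature:

```python
import numpy as np

def soft_q_target(reward, done, next_state, q1_target, q2_target,
                  sample_action, gamma=0.99, alpha=0.2):
    """SAC-style bootstrap target; sample_action is a placeholder returning an
    action sampled from the current policy together with its log-probability."""
    next_action, log_prob = sample_action(next_state)
    # Same clipped double-Q minimum as above, plus the entropy term -alpha * log pi(a'|s').
    min_q = np.minimum(q1_target(next_state, next_action),
                       q2_target(next_state, next_action))
    soft_value = min_q - alpha * log_prob
    return reward + gamma * (1.0 - done) * soft_value
```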
And a new paper (still unpublished) seems to take the solution one step further:
Dynamically Balanced Value Estimates for Actor-Critic Methods