Overestimated Value Function in Actor Critic Framework #124

Open
JaCoderX opened this issue Oct 30, 2019 · 7 comments

Comments

@JaCoderX
Contributor

@Kismuz,
I believe I have encountered a framework (A3C) limitation.
While training a few of my recent models I noticed strange behavior. For the first part of training everything seems to work fine, as indicated by the TensorBoard metrics (total reward and value function increase while entropy decreases). After a couple of thousand steps, the total reward and value function metrics no longer correlate: at first in a modest way (the value function continues to increase while the total reward hovers in place), but then what happens can be described as a policy breakdown (both metrics crash, entropy shoots up, and the agent's actions seem to be almost random).
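As a side note, a rough way to quantify this, assuming logged per-step rewards and the critic's value predictions for a finished episode, is to compare V(s_t) against the realized discounted return; a minimal sketch (function and variable names are just illustrative):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Realized discounted return G_t for every step of a finished episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def value_bias(value_preds, rewards, gamma=0.99):
    """Mean of V(s_t) - G_t; a persistently positive value suggests overestimation."""
    return float(np.mean(np.asarray(value_preds) - discounted_returns(rewards, gamma)))
```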

I searched online to try to identify the problem. I now believe the issue I'm experiencing is a well-known limitation, value function overestimation, as described in the following paper (which also proposes a way to mitigate it):

Addressing Function Approximation Error in Actor-Critic Methods

In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic.
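For context, the core mechanism proposed there (Clipped Double Q-learning, as used in TD3) replaces the single-critic TD target with the minimum over two target critics. A minimal numpy sketch, with `q1_next` / `q2_next` standing in for the two target critics' estimates at the next state-action:

```python
import numpy as np

def cdq_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped Double Q-learning TD target (TD3-style): take the minimum of the
    two target critics so that approximation error is less likely to be
    amplified into overestimation."""
    return reward + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)
```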

The above solution also seems to be used in more advanced actor-critic frameworks (which is interesting in itself):
Soft Actor-Critic Algorithms and Applications

And a new paper (still unpublished) seems to take the solution one step further:
Dynamically Balanced Value Estimates for Actor-Critic Methods

Reinforcement learning in an actor-critic setting relies on accurate value estimates of the critic. However, the combination of function approximation, temporal difference (TD) learning and off-policy training can lead to an overestimating value function. A solution is to use Clipped Double Q-learning (CDQ), which is used in the TD3 algorithm and computes the minimum of two critics in the TD-target. We show that CDQ induces an underestimation bias and propose a new algorithm that accounts for this by using a weighted average of the target from CDQ and the target coming from a single critic. The weighting parameter is adjusted during training such that the value estimates match the actual discounted return on the most recent episodes and by that it balances over- and underestimation.
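Reusing `cdq_target` from the sketch above, the weighted-average target described in this abstract could look roughly like the following; the exact rule for adjusting `beta` is in the paper, so this only illustrates the convex combination:

```python
def balanced_target(reward, done, q1_next, q2_next, beta, gamma=0.99):
    """Convex combination of the CDQ target (min of two critics) and a plain
    single-critic target. beta is adjusted during training so that value
    estimates track the realized discounted returns of recent episodes."""
    y_cdq = cdq_target(reward, done, q1_next, q2_next, gamma)   # clipped, biased low
    y_single = reward + gamma * (1.0 - done) * q1_next          # unclipped, biased high
    return beta * y_cdq + (1.0 - beta) * y_single
```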

@Kismuz
Owner

Kismuz commented Oct 30, 2019

Implementing SAC would resolve this I think.

@JaCoderX
Contributor Author

JaCoderX commented Oct 30, 2019

I opened this issue partially to share my experience and to record the current limitations. Incorrect value function estimation still looks like an open RL research issue.

But working on SAC would for sure be an amazing way forward regardless.

@Kismuz
Owner

Kismuz commented Oct 30, 2019

Incorrect value function estimation still looks like an open RL research issue.

If I remember correctly, the "Distributional Q-learning" approach from researchers at Google Brain addresses this issue: https://www.youtube.com/watch?v=ba_l8IKoMvU
btw: spot the listener leaving at ~39:50 ))
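For reference, a minimal sketch of the categorical ("C51"-style) value head those distributional methods use: the network outputs logits over a fixed support of return atoms instead of a single scalar, and the point estimate is the probability-weighted mean of the atoms. The support bounds here are arbitrary placeholders, and the full algorithm also needs the distributional Bellman projection for its target:

```python
import numpy as np

def categorical_value(logits, v_min=-10.0, v_max=10.0):
    """Scalar value estimate from a categorical value distribution."""
    n_atoms = logits.shape[-1]
    atoms = np.linspace(v_min, v_max, n_atoms)                    # fixed support z_i
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))   # stable softmax
    probs /= probs.sum(axis=-1, keepdims=True)
    return (probs * atoms).sum(axis=-1)                           # E[Z] = sum_i p_i * z_i
```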

@JaCoderX
Contributor Author

JaCoderX commented Nov 6, 2019

Implementing SAC would resolve this I think.

I'm playing with the idea of trying to implement SAC for btgym. It might be a bit of a stretch for my RL skill set, but it could be an interesting challenge in itself.

The benefits of working with well-established RL frameworks are clear. So, a few questions come to mind in that regard:

  • How can external RL algorithm repos be integrated with btgym, and is that even feasible with btgym's current algorithmic part?
  • How can we wrap the current btgym-specific algorithms to make them agnostic and modular, so they could easily be integrated into external frameworks? (For example, btgym's encoder implementations are modular and can be exported, but the Stacked-LSTM implementations are deeply fused with btgym's A3C implementation.)

@Kismuz
Owner

Kismuz commented Nov 7, 2019

@JacobHanouna,

How can external repos in RL algorithms be integrated with btgym? and if it is even feasible under the current RL algorithmic part of btgym?

just throw out the embedded algorithms and use btgym as a standalone gym-API environment; some refactoring may be necessary (e.g. btgym uses its own spaces, but those can easily be rolled back to standard gym spaces) to use it with frameworks like RLlib;
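A rough sketch of what such an adapter could look like; the shapes, action meanings, and attribute names are placeholders, not btgym's actual interface:

```python
import gym
import numpy as np

class GymAPIWrapper(gym.Env):
    """Thin adapter exposing a btgym-style environment through the plain gym API,
    so external frameworks (RLlib etc.) can consume it unchanged."""

    def __init__(self, inner_env):
        self.inner = inner_env
        # Roll the environment's own spaces back to standard gym spaces
        # (placeholder shapes -- adjust to the real observation/action layout).
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(64,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(4)

    def reset(self):
        return np.asarray(self.inner.reset(), dtype=np.float32)

    def step(self, action):
        obs, reward, done, info = self.inner.step(action)
        return np.asarray(obs, dtype=np.float32), float(reward), bool(done), info

    def close(self):
        self.inner.close()
```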

how can we wrap current btgym-specific algorithms to become agnostic

those have been intentionally adapted to the domain, while general implementations already exist

@Kismuz
Owner

Kismuz commented Feb 15, 2020

Seems this problem is deeper than I thought:
https://bair.berkeley.edu/blog/2019/12/05/bear/
