Overestimated Value Function in Actor Critic Framework #124

Open
JaCoderX opened this issue Oct 30, 2019 · 7 comments

Comments

@JaCoderX
Contributor

@Kismuz,
I believe I have encountered a framework (A3C) limitation.
While training a few of my recent models I noticed strange behavior. For the first part of training everything seems to work fine, as indicated by the TensorBoard metrics (total reward and value function increase while entropy decreases). After a couple of thousand steps, the total reward and value function metrics no longer correlate: at first in a modest way (the value function continues to increase while the total reward hovers in place), but then what happens can be described as a policy breakdown (both metrics crash, entropy shoots up, and the agent's actions seem to be almost random).
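As a side note, a rough way to quantify this, assuming logged per-step rewards and the critic's value predictions for a finished episode, is to compare V(s_t) against the realized discounted return; a minimal sketch (function and variable names are just illustrative):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Realized discounted return G_t for every step of a finished episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def value_bias(value_preds, rewards, gamma=0.99):
    """Mean of V(s_t) - G_t; a persistently positive value suggests overestimation."""
    return float(np.mean(np.asarray(value_preds) - discounted_returns(rewards, gamma)))
```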

I searched online to try to identify the problem. I now believe the issue I'm experiencing is a well-known limitation, value function overestimation, as described in the following paper (which also proposes a way to mitigate it):

Addressing Function Approximation Error in Actor-Critic Methods

In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic.
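For context, the core mechanism proposed there (Clipped Double Q-learning, as used in TD3) replaces the single-critic TD target with the minimum over two target critics. A minimal numpy sketch, with `q1_next` / `q2_next` standing in for the two target critics' estimates at the next state-action:

```python
import numpy as np

def cdq_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped Double Q-learning TD target (TD3-style): take the minimum of the
    two target critics so that approximation error is less likely to be
    amplified into overestimation."""
    return reward + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)
```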

The above solution also seems to be used in more advanced actor-critic frameworks (which is interesting in itself):
Soft Actor-Critic Algorithms and Applications

And a new paper (still unpublished) seems to take the solution one step further:
Dynamically Balanced Value Estimates for Actor-Critic Methods

Reinforcement learning in an actor-critic setting relies on accurate value estimates of the critic. However, the combination of function approximation, temporal difference (TD) learning and off-policy training can lead to an overestimating value function. A solution is to use Clipped Double Q-learning (CDQ), which is used in the TD3 algorithm and computes the minimum of two critics in the TD-target. We show that CDQ induces an underestimation bias and propose a new algorithm that accounts for this by using a weighted average of the target from CDQ and the target coming from a single critic. The weighting parameter is adjusted during training such that the value estimates match the actual discounted return on the most recent episodes and by that it balances over- and underestimation.
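Reusing `cdq_target` from the sketch above, the weighted-average target described in this abstract could look roughly like the following; the exact rule for adjusting `beta` is in the paper, so this only illustrates the convex combination:

```python
def balanced_target(reward, done, q1_next, q2_next, beta, gamma=0.99):
    """Convex combination of the CDQ target (min of two critics) and a plain
    single-critic target. beta is adjusted during training so that value
    estimates track the realized discounted returns of recent episodes."""
    y_cdq = cdq_target(reward, done, q1_next, q2_next, gamma)   # clipped, biased low
    y_single = reward + gamma * (1.0 - done) * q1_next          # unclipped, biased high
    return beta * y_cdq + (1.0 - beta) * y_single
```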

@Kismuz
Owner

Kismuz commented Oct 30, 2019

Implementing SAC would resolve this I think.

@JaCoderX
Contributor Author

JaCoderX commented Oct 30, 2019

I opened this issue partially to share my experience and to record the current limitations. Incorrect value function estimation still looks like an open RL research issue.

But working on SAC would for sure be an amazing way forward regardless.

@Kismuz
Owner

Kismuz commented Oct 30, 2019

Incorrect value function estimation still looks like an open RL research issue.

If I remember correctly, the "Distributional Q-learning" approach from researchers at Google Brain addresses this issue: https://www.youtube.com/watch?v=ba_l8IKoMvU
btw: spot the listener leaving at ~39:50 ))
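For reference, a minimal sketch of the categorical ("C51"-style) value head those distributional methods use: the network outputs logits over a fixed support of return atoms instead of a single scalar, and the point estimate is the probability-weighted mean of the atoms. The support bounds here are arbitrary placeholders, and the full algorithm also needs the distributional Bellman projection for its target:

```python
import numpy as np

def categorical_value(logits, v_min=-10.0, v_max=10.0):
    """Scalar value estimate from a categorical value distribution."""
    n_atoms = logits.shape[-1]
    atoms = np.linspace(v_min, v_max, n_atoms)                    # fixed support z_i
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))   # stable softmax
    probs /= probs.sum(axis=-1, keepdims=True)
    return (probs * atoms).sum(axis=-1)                           # E[Z] = sum_i p_i * z_i
```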

@JaCoderX
Contributor Author

JaCoderX commented Nov 6, 2019

Implementing SAC would resolve this I think.

I'm playing with the idea of trying to implement SAC for btgym. It might be a bit of a stretch for my RL skill set, but it could be an interesting challenge in itself.

The benefits of working with well-established RL frameworks are clear. So, a few questions come to mind in that regard:

  • How can external RL algorithm repos be integrated with btgym, and is that even feasible with btgym's current algorithmic part?
  • How can we wrap the current btgym-specific algorithms to make them agnostic and modular, so they could easily be integrated into external frameworks? (For example, btgym's encoder implementations are modular and can be exported, but the Stacked-LSTM implementations are deeply fused with btgym's A3C implementation.)

@Kismuz
Owner

Kismuz commented Nov 7, 2019

@JacobHanouna,

How can external repos in RL algorithms be integrated with btgym? and if it is even feasible under the current RL algorithmic part of btgym?

just throw out the embedded algorithms and use btgym as a standalone gym-API environment; some refactoring may be necessary (e.g. btgym uses its own spaces, but those can easily be rolled back to standard gym spaces) to use it with frameworks like RLlib;
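A rough sketch of what such an adapter could look like; the shapes, action meanings, and attribute names are placeholders, not btgym's actual interface:

```python
import gym
import numpy as np

class GymAPIWrapper(gym.Env):
    """Thin adapter exposing a btgym-style environment through the plain gym API,
    so external frameworks (RLlib etc.) can consume it unchanged."""

    def __init__(self, inner_env):
        self.inner = inner_env
        # Roll the environment's own spaces back to standard gym spaces
        # (placeholder shapes -- adjust to the real observation/action layout).
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(64,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(4)

    def reset(self):
        return np.asarray(self.inner.reset(), dtype=np.float32)

    def step(self, action):
        obs, reward, done, info = self.inner.step(action)
        return np.asarray(obs, dtype=np.float32), float(reward), bool(done), info

    def close(self):
        self.inner.close()
```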

how can we wrap current btgym-specific algorithms to become agnostic

those have been intentionally adapted to the domain, while general implementations already exist

@Kismuz
Owner

Kismuz commented Feb 15, 2020

Seems this problem is deeper than I thought:
https://bair.berkeley.edu/blog/2019/12/05/bear/
