<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta name="description" content="">
<meta name="author" content="">
<link rel="icon" href="favicon.ico">
<title>Critic Guided Segmentation of Rewarding Objects in First-Person Views</title>
<meta http-equiv="Cache-Control" content="no-cache, no-store, must-revalidate" />
<meta http-equiv="Pragma" content="no-cache" />
<meta http-equiv="Expires" content="0" />
<!-- Bootstrap core CSS -->
<link href="bootstrap.min.css" rel="stylesheet">
<!-- Custom styles for this template -->
<link href="starter-template.css" rel="stylesheet">
</head>
<body>
<main role="main" class="">
<div class="container-fluid page-element">
<div class="starter-template">
<h1>Critic Guided Segmentation of <br> Rewarding Objects in First-Person Views</h1>
<p class="lead">
</p>
</div>
<div class="row">
<div class="col-md-4 my-auto" style="text-align: center;">
<div class="row"> <div class="col-md-12"><b>Andrew Melnik</b></div></div>
<div class="row"> <div class="col-md-12">Bielefeld University</div></div>
</div>
<div class="col-md-4 my-auto" style="text-align: center;">
<div class="row"> <div class="col-md-12"><b>Augustin Harter</b></div></div>
<div class="row"> <div class="col-md-12">Bielefeld University</div></div>
</div>
<div class="col-md-4 my-auto" style="text-align: center;">
<div class="row"> <div class="col-md-12"><b>Christian Limberg</b></div></div>
<div class="row"> <div class="col-md-12">Bielefeld University</div></div>
</div>
</div>
<br>
<div class="row">
<div class="col-md-4 my-auto" style="text-align: center;">
<div class="row"> <div class="col-md-12"><b>Krishan Rana</b></div></div>
<div class="row"> <div class="col-md-12">Queensland University of Technology</div></div>
</div>
<div class="col-md-4 my-auto" style="text-align: center;">
<div class="row"> <div class="col-md-12"><b>Niko Sünderhauf</b></div></div>
<div class="row"> <div class="col-md-12">Queensland University of Technology</div></div>
</div>
<div class="col-md-4 my-auto" style="text-align: center;">
<div class="row"> <div class="col-md-12"><b>Helge Ritter</b></div></div>
<div class="row"> <div class="col-md-12">Bielefeld University</div></div>
</div>
</div>
<br><br>
<div class="text-center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/8by_5TKDvvE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>
<div class="container">
<div class="text-center">
<h3><a href="https://ki2021.uni-luebeck.de/index.html">44<sup>th</sup> German Conference on Artificial Intelligence</a></h3>
<br>
<div class="row">
<div class="col-md-6">
<h3>Paper</h3>
<a href="https://arxiv.org/pdf/2107.09540"><img style="height:100px;" src="pdf.png"></a>
</div>
<div class="col-md-6 center justify-content-center">
<h3>Code</h3>
<a href="https://github.com/ndrwmlnk/critic-guided-segmentation-of-rewarding-objects-in-first-person-views"><img style="height:100px;" class="" src="Octocat.jpg"></a>
</div>
</div>
</div>
</div>
<br><br>
</div>
<div class="container-fluid page-element">
<h1 id="introduction">Introduction</h1>
<p>Training a semantic segmentation network can be a difficult problem in the absence of label information. We propose using sparse reward signals from a Reinforcement Learning (RL) environment to train a semantic segmentation model for masking rewarding objects. Moreover, our approach allows training the semantic segmentation model entirely on an imitation learning dataset, without interaction with the environment. This is of interest for a number of use cases: such a technique can contribute to explainable AI <span class="citation" data-cites="gunning2019darpa"></span>, symbolic and causal reasoning in the space of detected objects, robotics, or auxiliary information <span class="citation" data-cites="jaderberg2016reinforcement"></span> for an RL setup. Learning the optimal policy in sparse reward environments <span class="citation" data-cites="guss2021towards"></span> is an important challenge in Deep Reinforcement Learning (DRL) <span class="citation" data-cites="bach2020learn 10.3389/frobt.2021.538773 harter2020solving"></span>. A masking network for rewarding objects can support an actor-critic RL setup in a way that is intuitive and explainable to a human. Semantic segmentation of rewarding objects can aid transfer learning, improve the sample efficiency of training, and improve the performance of a trained agent. In the following, we consider how a trained critic model (a value network trained on discounted reward values) can guide the direct segmentation of such rewarding objects in a feed-forward manner.</p>
<p>To this end, we use the critic to select two sets of images characterized by high vs. low expected reward values. Using these two sets of images from an imitation learning dataset, we can train a segmentation network for rewarding object regions: the network is applied to a high expected reward image, and the masked region is exchanged with content from a low expected reward image. This makes it possible to evaluate how the mask changes the reward value prediction and therefore allows us to train the segmentation network to mask the rewarding objects (Figure <a href="#fig-results" data-reference-type="ref" data-reference="fig-results">1</a>, see Section <a href="#sec:experiments" data-reference-type="ref" data-reference="sec:experiments">3</a> for details).</p>
<p>This work was motivated by the “MineRL NeurIPS 2020 Competition: Sample Efficient Reinforcement Learning in Minecraft” <span class="citation" data-cites="guss2021towards"></span>, where our AI agent won first place. See Section <a href="#sec:experiments" data-reference-type="ref" data-reference="sec:experiments">3</a> for details on the challenge, its goals, and the objects that an agent needs to learn in the world of Minecraft. While our approach is formulated in a domain-agnostic way, we evaluate it on first-person views in 3D Minecraft environments from the NeurIPS MineRL 2020 Competition.</p>
<figure>
<img src="images/fig-results.jpg" style="width: 100%;" alt="Figure 1. Segmentation results: The Hourglass model learns to segment rewarding objects (tree trunks) without any label information but only from reward signals. In the first four columns, showing high critic-value images, the trained Hourglass model detects different instances of rewarding objects (white and brown tree trunks). The model is resistant to generation of false-positive masks in low score images (columns 5-8). The first row shows the input images, the second row shows the segments extracted from the input images using the generated masks (not ground truth), and the third row shows the masks generated by the Hourglass model. Video demonstration and code: https://rebrand.ly/critic-guided-segmentation" id="fig-results" /><figcaption>Figure 1. Segmentation results: The <em>Hourglass</em> model learns to segment rewarding objects (tree trunks) without any label information but only from reward signals. In the first four columns, showing high critic-value images, the trained <em>Hourglass</em> model detects different instances of rewarding objects (white and brown tree trunks). The model is resistant to generation of false-positive masks in low score images (columns 5-8). The first row shows the input images, the second row shows the segments extracted from the input images using the generated masks (not ground truth), and the third row shows the masks generated by the <em>Hourglass</em> model. Video demonstration and code: <a href="https://rebrand.ly/critic-guided-segmentation">https://rebrand.ly/critic-guided-segmentation</a><span label="fig-results"></span></figcaption>
</figure>
<br><br><br>
<figure>
<img src="images/criticsegm-pseudocode.png" style="width: 100%;" alt="Figure 2. Training pipeline overview. Phase 2 is explained in more detail in Figure 3." id="pseudo-code" /><figcaption>Figure 2. Training pipeline overview. Phase 2 is explained in more detail in Figure <a href="#fig-unet" data-reference-type="ref" data-reference="fig-unet">3</a>.<span label="pseudo-code"></span></figcaption>
</figure>
<br><br><br>
<figure>
<img src="images/fig-unet.jpg" style="width: 100%;" alt="Figure 3. This figure shows the second phase of our pipeline containing the segmentation training, which consists of two stages: Stage 1 (highlighted in red): Image A (high critic value) passes through the Hourglass network, forming a mask M. Stage 2: the mask M is used to merge image A (high critic value) with image B (low critic value) resulting in image X (masked parts of A replaced with B) and image Y (masked parts of A injected in B). Images A, B, X, and Y are then passed through the encoder and critic. The losses penalize differences in critic values for image pairs A : Y, and B : X. A linear regularization term penalizes mask intensity and prevents collapse to a trivial solution where the mask M fills the entire image. Video explanation and code: https://rebrand.ly/critic-guided-segmentation" id="fig-unet" /><figcaption>Figure 3. This figure shows the second phase of our pipeline containing the segmentation training, which consists of two stages: Stage 1 (highlighted in red): Image <strong>A</strong> (high critic value) passes through the <em>Hourglass network</em>, forming a mask <strong>M</strong>. Stage 2: the mask <strong>M</strong> is used to merge image <strong>A</strong> (high critic value) with image <strong>B</strong> (low critic value) resulting in image <strong>X</strong> (masked parts of <strong>A</strong> replaced with <strong>B</strong>) and image <strong>Y</strong> (masked parts of <strong>A</strong> injected in <strong>B</strong>). Images <strong>A</strong>, <strong>B</strong>, <strong>X</strong>, and <strong>Y</strong> are then passed through the encoder and critic. The losses penalize differences in critic values for image pairs <strong>A</strong> : <strong>Y</strong>, and <strong>B</strong> : <strong>X</strong>. A linear regularization term penalizes mask intensity and prevents collapse to a trivial solution where the mask <strong>M</strong> fills the entire image. Video explanation and code: <a href="https://rebrand.ly/critic-guided-segmentation">https://rebrand.ly/critic-guided-segmentation</a><span label="fig-unet"></span></figcaption>
</figure>
<br><br><br>
<h1 id="methods">Methods</h1>
<p>Refer to Figure <a href="#pseudo-code" data-reference-type="ref" data-reference="pseudo-code">2</a> for a high-level overview of our approach, which we will now describe in more detail: We first train a critic model to predict the discounted reward for images and then train an <em>Hourglass</em> model <span class="citation" data-cites="hourglass"></span> to infer a segmentation mask over rewarding objects while continuing to train the critic (Figure <a href="#fig-results" data-reference-type="ref" data-reference="fig-results">1</a>). The architecture is based on two sub-modules (Figure <a href="#fig-unet" data-reference-type="ref" data-reference="fig-unet">3</a>): a critic network trained to predict the discounted reward for a given image, and an <em>Hourglass</em> model that produces a segmentation mask for rewarding objects in the image. Such masks, when used to replace parts of high critic score images, lower the predicted score and, when used to inject these parts into low critic score images, increase the predicted score. To achieve this objective, the <em>Hourglass</em> model learns to segment the rewarding objects.</p>
<p>We evaluate our approach in the first-person Minecraft environments from the <em>MineRL 2020 Competition</em> <span class="citation" data-cites="guss2021towards"></span>. The provided imitation learning database consists of data recorded from human players, and we use only images and the recorded sparse reward signals from the database to train our model.</p>
<h2 id="pipeline">Pipeline</h2>
<p>The training is divided into two phases: <em>Phase 1</em> contains the initial critic training, enabling us to use the critic in <em>phase 2</em> to train our segmentation model. <em>Phase 2</em> happens in two stages and is shown in Figure <a href="#fig-unet" data-reference-type="ref" data-reference="fig-unet">3</a>: <em>Stage 1</em> is the pass through the segmentation model producing a segmentation mask. We then use this mask in <em>stage 2</em> to construct new merged images and pass them through the critic to calculate the gradient with respect to the mask-based merging.</p>
<h3 id="phase-1-initial-critic-training">Phase 1: Initial Critic Training</h3>
<p>We train the critic model to directly predict the time-discounted reward value of states, which are 64x64x3 RGB single-image observations. We use the time-discounted reward from the data set episodes as a supervised training signal together with the mean squared error as a loss function. After training converges, we use the critic to split the database into images with high critic values <strong>A</strong> and low critic values <strong>B</strong> for <em>phase 2</em>.</p>
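<p>A minimal sketch of this phase, assuming a PyTorch setup (the function and variable names below are illustrative and not taken from our released code):</p>
<pre><code>
import torch
import torch.nn.functional as F

def discounted_returns(rewards, gamma=0.98):
    """Compute the time-discounted return for every step of one episode."""
    returns = torch.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def critic_loss(critic, images, returns):
    """Supervised regression of the critic onto the discounted returns (MSE)."""
    predictions = critic(images).squeeze(-1)  # images: (B, 3, 64, 64) float tensor
    return F.mse_loss(predictions, returns)
</code></pre>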
<h3 id="phase-2-segmentation-training">Phase 2: Segmentation Training</h3>
<p><em>Phase 2</em> is subdivided into two stages and visualized in Figure <a href="#fig-unet" data-reference-type="ref" data-reference="fig-unet">3</a>: In <em>stage 1</em> we pass an image through the segmentation model to produce a mask. In <em>stage 2</em> we use this mask to swap out pixels in an image pair and pass the resulting merged images through the critic to infer how the mask-based pixel swap changed the original critic value prediction for the image pair. We do this because we do not have explicit ground truth masks segmenting the rewarding objects as a training target. Instead, we pick image pairs such that one (image <strong>A</strong>) has a high critic value and the other (image <strong>B</strong>) a low critic value and use these pairs to formulate a loss which enables learning to segment rewarding objects: The key idea is that reward-related parts of images should decrease the critic value when removed and increase the critic value when injected into another image.</p>
<h4 id="implementation"><strong>Implementation:</strong></h4>
<p>We implement this idea by first using the segmentation model in <em>stage 1</em> to produce a mask <strong>M</strong> based on the high critic value image <strong>A</strong>. Next, in <em>stage 2</em> we use the mask to swap the highlighted pixels in <strong>A</strong> and replace them with the corresponding pixels in low critic value image <strong>B</strong>. This generates a second pair of images: <strong>X</strong> (image <strong>A</strong> with masked pixels replaced by content of <strong>B</strong>) and <strong>Y</strong> (image <strong>B</strong> with content substituted by masked pixels from image <strong>A</strong>). For a “perfect” mask that captures all reward-related structures, the critic values between the pairs should be swapped: high in <strong>Y</strong>, since it received all reward-related content of <strong>A</strong>, and low in <strong>X</strong>, since all reward-related content has been replaced by contents from low-value image <strong>B</strong>. The <em>inject loss</em> penalizes the squared difference between critic values of <strong>A</strong> and <strong>Y</strong>: <span class="math inline"><em>L</em><sub><em>I</em></sub> = (<em>V</em><sub><em>A</em></sub> − <em>V</em><sub><em>Y</em></sub>)<sup>2</sup></span>, and the <em>replace loss</em> penalizes the squared difference between critic values of <strong>B</strong> and <strong>X</strong>: <span class="math inline"><em>L</em><sub><em>R</em></sub> = (<em>V</em><sub><em>B</em></sub> − <em>V</em><sub><em>X</em></sub>)<sup>2</sup></span>.</p>
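<p>In code, the mask-based merging and the two losses can be written compactly. The sketch below assumes images are float tensors of shape (B, 3, 64, 64), the mask has values in [0, 1] and shape (B, 1, 64, 64), and the critic returns one value per image; all names are illustrative:</p>
<pre><code>
def inject_replace_losses(critic, segmenter, img_a, img_b):
    """Stage 1: mask from the high-value image A. Stage 2: mask-based merging."""
    mask = segmenter(img_a)                    # M, shape (B, 1, 64, 64), values in [0, 1]
    img_x = (1 - mask) * img_a + mask * img_b  # X: masked parts of A replaced with B
    img_y = (1 - mask) * img_b + mask * img_a  # Y: masked parts of A injected into B
    v_a, v_b = critic(img_a), critic(img_b)
    v_x, v_y = critic(img_x), critic(img_y)
    inject_loss = (v_a - v_y).pow(2).mean()    # L_I: Y should be as valuable as A
    replace_loss = (v_b - v_x).pow(2).mean()   # L_R: X should be as worthless as B
    return inject_loss, replace_loss, mask
</code></pre>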
<p>To further illustrate this, we can consider a bad mask that leaves all essential reward elements in <strong>X</strong> contributing to a high <em>replace loss</em>, since then <span class="math inline"><em>V</em><sub><em>X</em></sub></span> remains high, while <span class="math inline"><em>V</em><sub><em>B</em></sub></span> is low; similarly, <strong>Y</strong> would hardly receive any reward related content from <strong>A</strong>, keeping <span class="math inline"><em>V</em><sub><em>Y</em></sub></span> low whereas <span class="math inline"><em>V</em><sub><em>A</em></sub></span> is high and therefore producing a high <em>inject loss</em>.</p>
<h4 id="regularization"><strong>Regularization:</strong></h4>
<p>In order to avoid trivial solutions, such as a full-image mask that replaces the complete image, we apply a linear regularization to enforce minimal masks: <span class="math inline"><em>L</em><sub><em>N</em></sub> = |<em>M</em>|</span></p>
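<p>In the notation of the sketch above, the regularizer is simply the mean mask intensity; the relative weighting of the three terms below is an assumption for illustration:</p>
<pre><code>
def total_loss(inject_loss, replace_loss, mask, reg_weight=1.0):
    """Combine the critic-based losses with the linear mask penalty L_N = |M|."""
    mask_penalty = mask.mean()  # mask values lie in [0, 1], so this is the mean mask intensity
    return inject_loss + replace_loss + reg_weight * mask_penalty
</code></pre>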
<h4 id="continued-critic-training"><strong>Continued Critic Training:</strong></h4>
<p>Training the mask has an influence on the encoder weights, which are shared between the critic and segmentation models and can therefore degrade the critic predictions. A straightforward solution would be to freeze the encoder weights after the initial critic training in <em>phase 1</em> is over. This works, but we can allow the segmentation model to influence the encoder weights if we also continue to train the critic as in <em>phase 1</em>, which keeps the encoder functional for reward prediction. This increases performance; see the <em>Frozen Encoder</em> variant for comparison in Section <a href="#sec:results" data-reference-type="ref" data-reference="sec:results">4</a>.</p>
<h4 id="handling-false-positives"><strong>Handling False Positives:</strong></h4>
<p>So far, we have explained how we pass high critic value images through the segmentation model to produce reward-related segmentation masks. However, this means that the model would only see high-value images and therefore be heavily biased towards producing a mask even for low-reward images, producing “false positives”. Therefore, we use low-value images for <strong>A</strong> half of the time. This means that a batch containing 128 images would contain 32 high-value and 32 low-value images as <strong>A</strong> and 64 low-value images for <strong>B</strong>. With this, we can keep the above-stated losses without producing “bad” gradients, since <span class="math inline">(<em>V</em><sub><em>A</em></sub> − <em>V</em><sub><em>Y</em></sub>)<sup>2</sup></span> and <span class="math inline">(<em>V</em><sub><em>B</em></sub> − <em>V</em><sub><em>X</em></sub>)<sup>2</sup></span> are small when both images already have similar (low) critic values before merging. This allows the gradient from the regularization term that favors sparse masks to drive the segmentation response towards the desired empty mask output when receiving a low critic image as input. Segmentation results for both high and low-value images are shown in Figure <a href="#fig-results" data-reference-type="ref" data-reference="fig-results">1</a>.</p>
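<p>A sketch of this batch composition, assuming the two image pools are already index-able tensors (names and the sampling scheme are illustrative):</p>
<pre><code>
import torch

def sample_phase2_batch(high_value_images, low_value_images, batch_size=128):
    """Half of the A images are low-value, so the segmenter also sees reward-free views."""
    quarter = batch_size // 4
    a_high = high_value_images[torch.randint(len(high_value_images), (quarter,))]
    a_low = low_value_images[torch.randint(len(low_value_images), (quarter,))]
    img_a = torch.cat([a_high, a_low])  # 32 high-value + 32 low-value images as A
    img_b = low_value_images[torch.randint(len(low_value_images), (2 * quarter,))]  # 64 B images
    return img_a, img_b
</code></pre>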
<h1 id="sec:experiments">Experiments</h1>
<h4 id="data-set"><strong>Data Set:</strong></h4>
<p>We apply our method on an imitation learning data set from the NeurIPS 2020 MineRL Challenge. It is based on Minecraft, a game providing a 3D world where players can interact with the world from a first-person view. We used the <em>TreeChop</em> environment, which contains episodes of human players chopping trees and collecting wood: The players can repeatedly use the <em>attack action</em> on a tree trunk to destroy it, which produces the item <em>log</em> that is automatically collected when in proximity. Only upon collection does the player receive <em>reward = 1</em>; all other images give <em>reward = 0</em>. Less than 1% of images get a reward signal, while roughly a third of the images contain views of trees in close proximity and almost all images contain some tree features in sight but possibly further away.</p>
<p>After chopping the base trunk of a tree, players usually stand below the “floating” tree crown to chop the remaining wood in the tree crown. We remove most of this tree-crown chopping, which makes up roughly 20% of images, to focus the reward signal on approaching and chopping the tree trunks. This is done automatically by removing the 35 images after a reward signal. From that, we assemble a data set containing 100k images, where we clip reward values higher than 1 (which occur when collecting multiple <em>log items</em> at once). We then use a factor of 0.98 to discount the reward every time step and use the resulting discounted reward as the training label for that image. This results in a data set with a quite balanced histogram of discounted reward values, meaning that there are roughly equal amounts of images for every reward value in the range from 0 to 1, which stabilized the critic training. Additionally, we apply a small data augmentation that shifts the images randomly by up to 12 pixels to the right or left, which counteracts the strong bias in the data set that trees are almost always in the middle of the image when a reward signal is received.</p>
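<p>Roughly, this preprocessing could look as follows; the 35-frame cut, the reward clipping, the discount factor of 0.98, and the 12-pixel shift come from the description above, while the array layout and function names are assumptions:</p>
<pre><code>
import numpy as np

def preprocess_episode(frames, rewards, gamma=0.98, crown_cut=35, max_shift=12):
    """Drop tree-crown frames, clip and discount rewards, and jitter frames horizontally."""
    frames = np.asarray(frames)            # assumed shape (N, H, W, C)
    rewards = np.asarray(rewards, dtype=float)

    keep = np.ones(len(frames), dtype=bool)
    for t in np.flatnonzero(rewards):      # every step with a reward signal
        keep[t + 1 : t + 1 + crown_cut] = False  # remove the 35 frames after a reward
    frames, rewards = frames[keep], np.clip(rewards[keep], 0.0, 1.0)

    returns = np.zeros(len(rewards))       # discounted reward labels for the critic
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running

    shifts = np.random.randint(-max_shift, max_shift + 1, size=len(frames))
    frames = np.stack([np.roll(f, s, axis=1) for f, s in zip(frames, shifts)])
    return frames, returns
</code></pre>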
<h4 id="architecture"><strong>Architecture</strong></h4>
<p>We use a convolutional encoder-decoder architecture with skip connections inspired by <em>Hourglass Networks</em>, which has two outputs: The first output is the critic score estimating the value of images. It is implemented through two additional linear layers after the bottleneck (Figure <a href="#fig-unet" data-reference-type="ref" data-reference="fig-unet">3</a>). The second output is the segmentation mask, which is produced by the decoder. Our simple custom-made network has an encoder with 5 convolution layers with 40, 40, 40, 80, and 160 channels, respectively, and kernel size 3x3. The last layer results in a non-spatial bottleneck (dimensions: 1x1x160). Each layer is followed by a LeakyReLU, and we use max pooling after the first 4 layers. The decoder has a mirrored structure, but with the pooling layers replaced by upsampling. Its output layer is passed through a sigmoid function to produce the mask. The critic shares the encoder and, after the bottleneck, additionally consists of two fully connected layers with 160 units each. Further, we use a 50% dropout after the third and fourth encoder layers and after the first fully connected critic layer.</p>
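<p>A rough PyTorch sketch of this architecture, following the layer sizes above; the skip connections and the encoder dropout are omitted for brevity, and the final pooling step that produces the 1x1x160 bottleneck is an assumption:</p>
<pre><code>
import torch
import torch.nn as nn

class HourglassCriticSegmenter(nn.Module):
    """Shared encoder, a critic head after the bottleneck, and a mask decoder."""

    def __init__(self):
        super().__init__()
        chans = [3, 40, 40, 40, 80, 160]
        enc = []
        for i in range(5):
            enc.append(nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, padding=1))
            enc.append(nn.LeakyReLU())
            if i != 4:
                enc.append(nn.MaxPool2d(2))         # spatial size 64, 32, 16, 8, 4
        enc.append(nn.AdaptiveMaxPool2d(1))          # non-spatial 1x1x160 bottleneck (assumption)
        self.encoder = nn.Sequential(*enc)

        self.critic_head = nn.Sequential(            # two fully connected layers with 160 units
            nn.Flatten(), nn.Linear(160, 160), nn.LeakyReLU(), nn.Dropout(0.5),
            nn.Linear(160, 160), nn.LeakyReLU(), nn.Linear(160, 1),
        )

        dchans = [160, 80, 40, 40, 40]
        dec = [nn.Upsample(scale_factor=4)]          # back to 4x4, then mirror the encoder
        for i in range(4):
            dec.append(nn.Conv2d(dchans[i], dchans[i + 1], kernel_size=3, padding=1))
            dec.append(nn.LeakyReLU())
            dec.append(nn.Upsample(scale_factor=2))  # spatial size 8, 16, 32, 64
        dec.append(nn.Conv2d(40, 1, kernel_size=3, padding=1))
        dec.append(nn.Sigmoid())                     # mask values in [0, 1]
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        z = self.encoder(x)                          # (B, 160, 1, 1)
        return self.critic_head(z), self.decoder(z)  # critic value, (B, 1, 64, 64) mask
</code></pre>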
<p>Reward prediction and segmentation demand similar features, so instead of using separate models they both share the encoder: In the first phase, the encoder is trained to create a meaningful feature representation of images that makes it possible for the critic to predict the reward. In the second phase, the decoder can use skip connections to access the encoder representation, which we found greatly improves mask quality in comparison to a separate critic. See Section <a href="#sec:results" data-reference-type="ref" data-reference="sec:results">4</a>.</p>
<h4 id="evaluation"><strong>Evaluation:</strong></h4>
<p>To test the performance of our model, we collect a test set containing 18k images from 10 episodes of a player chopping tree trunks. We modified the environment simulator, enabling us to extract the ground-truth segmentation masks of tree trunks; such masks were never seen during training. With this, we can measure the <em>Intersection over Union</em> (IoU) score as a performance metric to evaluate our segmentation model. We report the performance in Table <a href="#tab:iou" data-reference-type="ref" data-reference="tab:iou">[tab:iou]</a> and compare it to a saliency map baseline.</p>
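<p>The metric itself is standard; a minimal sketch for a single pair of predicted soft mask and binary ground-truth mask:</p>
<pre><code>
import numpy as np

def iou(pred_mask, gt_mask, threshold=0.5):
    """Intersection over Union between a thresholded predicted mask and the ground truth."""
    pred = np.greater_equal(np.asarray(pred_mask), threshold)
    gt = np.asarray(gt_mask).astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union if union else 0.0
</code></pre>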
<h4 id="training-details"><strong>Training Details:</strong></h4>
<p>We train the critic in <em>phase 1</em> on the 100k images for 15 epochs using a batch size of 64 until convergence and use it to split the data set into images with a predicted reward higher than 0.7 (resulting in 20k images) and images with predicted reward lower than 0.3 (resulting in roughly 30k images). These values were obtained through a hyper-parameter grid-search with the evaluation dataset. We then use these two subsets to continue our training in <em>phase 2</em> for one epoch with a batch size of 128.</p>
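<p>The split itself is a simple thresholding of the critic predictions; a sketch, ignoring batching and with illustrative names:</p>
<pre><code>
import torch

@torch.no_grad()
def split_by_critic(critic, images, high=0.7, low=0.3):
    """Split the imitation data into the high-value pool A and the low-value pool B."""
    values = critic(images).squeeze(-1)
    high_pool = images[values.ge(high)]  # predicted reward above 0.7 (roughly 20k images)
    low_pool = images[values.le(low)]    # predicted reward below 0.3 (roughly 30k images)
    return high_pool, low_pool
</code></pre>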
<h1 id="sec:results">Results</h1>
<p>We report the intersection-over-union (IoU) scores of the segmentation masks compared to the ground-truth data. To better understand how our model works, we compare four different model variants with two baseline approaches, which we describe in detail below before presenting our final results in Table <a href="#tab:iou" data-reference-type="ref" data-reference="tab:iou">[tab:iou]</a>. Further, we report the performance when post-processing with Conditional Random Fields (CRF) <span class="citation" data-cites="CRF CRFrepo"></span>, which is a common method to improve inaccurate or noisy segmentation masks. We additionally provide some visual segmentation results attained by our approach in Figure <a href="#fig-results" data-reference-type="ref" data-reference="fig-results">1</a> for high and low critic score images. Once again, these masks are learned only from a sparse reward signal in the challenging 3D Minecraft environment with different lighting conditions, tree types, and tree colors.</p>
<table>
<caption><em>IoU</em> mean value over 10 training seeds.<span label="tab:iou"></span></caption>
<thead>
<tr class="header">
<th style="text-align: left;"><strong>Model Variant</strong></th>
<th style="text-align: center;"><strong>IoU</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">Baseline Full Mask</td>
<td style="text-align: center;">0.12</td>
</tr>
<tr class="even">
<td style="text-align: left;">Baseline Saliency Map</td>
<td style="text-align: center;">0.22</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Baseline Saliency Map + CRF</td>
<td style="text-align: center;">0.11</td>
</tr>
<tr class="even">
<td style="text-align: left;">Separate Critic</td>
<td style="text-align: center;">0.27</td>
</tr>
<tr class="odd">
<td style="text-align: left;">No Inject Loss</td>
<td style="text-align: center;">0.35</td>
</tr>
<tr class="even">
<td style="text-align: left;">Frozen Encoder</td>
<td style="text-align: center;">0.38</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Full Model</td>
<td style="text-align: center;">0.41</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>Full Model + CRF</strong></td>
<td style="text-align: center;"><strong>0.45</strong></td>
</tr>
</tbody>
</table>
<h4 id="baseline-full-mask"><strong>Baseline Full Mask:</strong></h4>
<p>Full mask covering everything in every image. We report this value for better comparison and interpretation of the other scores. This value can also be interpreted as the percentage of ground truth tree trunk pixels in our test set.</p>
<h4 id="baseline-saliency-map"><strong>Baseline Saliency Map:</strong></h4>
<p>Following <span class="citation" data-cites="saliency"></span> we compute the Jacobian of the input image with respect to the critic’s prediction and weight the Jacobian of each image with the critic’s predicted score. We then produce a mask by thresholding each pixels weighted gradient based on a value that is a multiple of the mean pixel gradient values. The exact multiple is determined through a hyper-parameter search. This method can act as a baseline for the task, since calculating saliency maps to visualize where a model is “looking” at is common practice and requires no training of an additional model. However, the resulting masks are noisy and do not allow to focus on the area of the rewarding object in the image.</p>
<h4 id="baseline-saliency-map-crf"><strong>Baseline Saliency Map + CRF:</strong></h4>
<p>Post-processing the saliency maps with CRF. The CRF hyperparameters were obtained through a grid search with the test set. For some images, it leads to an improved segmentation, but overall the saliency maps are too chaotic and noisy for this method to work properly.</p>
<h4 id="separate-critic"><strong>Separate Critic:</strong></h4>
<p>To test our hypothesis that the encoder features learned during critic training are also useful for the segmentation training (which motivated the idea of sharing the encoder), we trained a model with a separate encoder for the critic. This means that a separate encoder has to be learned from scratch during segmentation training. The resulting decrease in IoU performance supports our hypothesis and favours weight sharing.</p>
<h4 id="no-inject-loss"><strong>No Inject Loss:</strong></h4>
<p>This model is trained without the <em>inject loss</em>, only using the mask to replace parts of high reward images and not injecting them into low reward images. The results show a clear decrease in performance, emphasizing the usefulness of the inject objective. Apparently, only having to decrease the critic value can lead to more degenerate solutions than also having to increase the critic value with the same mask.</p>
<h4 id="frozen-encoder"><strong>Frozen Encoder:</strong></h4>
<p>Here we train the model with frozen encoder weights, since the segmentation gradients would otherwise influence the encoder and thereby degrade the critic predictions needed for our replace and inject losses. With a frozen encoder, the decoder can access the encoder features but not change them to better fit the segmentation task.</p>
<h4 id="full-model"><strong>Full Model:</strong></h4>
<p>The alternative to a frozen encoder is to let the decoder gradients influence the encoder while continuing to train the critic, which prevents the encoder features from becoming dysfunctional for the critic’s reward prediction. This setup results in the best-performing variant of our approach without post-processing. It combines the inject loss with a shared encoder and continued critic training in phase 2 into our “Full Model”.</p>
<h4 id="full-model-crf"><strong>Full Model + CRF:</strong></h4>
<p>Post-processing the model output with CRF. The CRF hyperparameters were obtained through a grid search with the test set. In contrast to the saliency maps, here the CRF improves performance.</p>
<h1 id="discussion">Discussion</h1>
<p>Humans explicitly learn notions of objects. There has been extensive research inspired by psychology and cognitive science <span class="citation" data-cites="melnik2018world konig2018embodied konen2019biologically"></span> on explicitly learning object-centric and reward-centric representations from pixels <span class="citation" data-cites="simonyan2013deep"></span>. Focusing on the reward quality of objects also adds an interesting perspective on the question of how neural networks build up their understanding of images. These questions have been studied by feature visualization methods <span class="citation" data-cites="olah2017feature"></span> and interpretability techniques <span class="citation" data-cites="olah2018building"></span>.</p>
<p>Understanding RL vision is a challenging problem. Hilton et al. <span class="citation" data-cites="hilton2020understanding"></span> proposed to analyze, diagnose, and edit deep reinforcement learning models using attribution. Although this technique allows highlighting objects related to positive and negative rewards in 2D environments, it requires human domain knowledge to select the proper attributions, and it has not been shown that this technique works in more complicated 3D first-person-view environments. To understand how an agent learns and executes a policy, a method for generating saliency maps was introduced <span class="citation" data-cites="greydanus2018visualizing"></span>. In this method, the change in the value function under a grid of perturbations is used to compute the saliency.</p>
<p>It has been empirically observed that RL from raw pixels is sample-inefficient <span class="citation" data-cites="kaiser2019model melnik2019modularization"></span>. Learning policies from state-based features is significantly more sample-efficient than learning from pixels. An approach called CURL <span class="citation" data-cites="srinivas2020curl"></span> shows that extracting high-level features from raw pixels using contrastive learning and then using these features as the state in the RL setting results in superior sample efficiency when training an RL agent. In contrast, our approach allows learning explicit masks over rewarding objects. Extracting these segmentation masks, which are derived directly from reward, can help preprocess the most relevant information from the raw pixel state representation. This is of interest for a number of use cases in explainable AI, symbolic and causal reasoning <span class="citation" data-cites="melnik2019combining"></span> in the space of detected objects, robotics, as well as in RL setups. Moreover, our approach can learn the segmentation model solely from an imitation learning dataset.</p>
<p>DRL has been applied successfully in various domains for training high-performing agents <span class="citation" data-cites="schilling2018approach"></span>. Our approach could be used as an auxiliary module to train the agent from demonstrations and improve the sample efficiency of learning. Our contribution is a novel and intuitive use of joint training of a critic network and an image segmentation model to highlight rewarding object segments in reinforcement learning environments. In this contribution, we showed that it is possible to train a model to generate high-quality masks depicting rewarding objects in images without explicit label information, using only feedback from the critic model. Our approach was part of the first-place winning solution in the “MineRL NeurIPS 2020 Competition: Sample Efficient Reinforcement Learning in Minecraft”.</p>
<p>Future work may include further experiments with RL or imitation learning setups extended with our model that provides rewarding-object masks. Identifying a mask or heatmap for negative rewards as well as non-reward entities may be a possible continuation of this work. Self-supervised learning of embedded classification of reward-centric objects can facilitate the development of causal and symbolic reasoning models.</p>
</div>
<!--<div class="container-fluid page-element black-bg">-->
<div class="container-fluid page-element">
Bibtex:
<div class="cite-div">
<code style="font-size:12px;">
@inproceedings{melnik2021critic,<br>
title={Critic Guided Segmentation of Rewarding Objects in First-Person Views},<br>
author={Melnik, Andrew and Harter, Augustin and Limberg, Christian and Rana, Krishan and Sünderhauf, Niko and Ritter, Helge},<br>
booktitle={Proceedings of the German Conference on Artificial Intelligence},<br>
year={2021}<br>
}<br>
</code>
</div>
<br><br>
</div>
</main><!-- /.container -->
<!-- Bootstrap core JavaScript
================================================== -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="https://code.jquery.com/jquery-3.2.1.slim.min.js" integrity="sha384-KJ3o2DKtIkvYIK3UENzmM7KCkRr/rE9/Qpg6aAZGJwFDMVNA/GpGFF93hXpG5KkN" crossorigin="anonymous"></script>
<script src="popper.min.js"></script>
<script src="bootstrap.min.js"></script>
</body>
</html>