PyTorch implementation of "Category-aware Allocation Transformer for Weakly Supervised Object Localization".
Category-aware Allocation Transformer for Weakly Supervised Object Localization (link)
- Authors: Zhiwei Chen, Jinren Ding, Liujuan Cao, Yunhang Shen, Shengchuan Zhang, Guannan Jiang, Rongrong Ji
- Institutions: Xiamen University, Xiamen, China; Tencent Youtu Lab, Shanghai, China; CATL, China
Weakly supervised object localization (WSOL) aims to localize objects based on only image-level labels as supervision. Recently, transformers have been introduced into WSOL, yielding impressive results. The self-attention mechanism and multilayer perceptron structure in transformers preserve long-range feature dependency, facilitating complete localization of the full object extent. However, current transformer-based methods predict bounding boxes using category-agnostic attention maps, which may lead to confused and noisy object localization. To address this issue, we propose a novel Category-aware Allocation TRansformer (CATR) that learns category-aware representations for specific objects and produces corresponding category-aware attention maps for object localization. First, we introduce a Category-aware Stimulation Module (CSM) to induce learnable category biases for self-attention maps, providing auxiliary supervision to guide the learning of more effective transformer representations. Second, we design an Object Constraint Module (OCM) to refine the object regions for the category-aware attention maps in a self-supervised manner. Extensive experiments on the CUB-200-2011 and ILSVRC datasets demonstrate that the proposed CATR achieves significant and consistent performance improvements over competing approaches.
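The core CSM idea of injecting a learnable, class-conditioned bias into the self-attention maps can be pictured with a minimal sketch. The module name `CategoryBias` and the exact way the bias enters the attention logits are illustrative assumptions, not the paper's actual formulation; see the code in this repository for the real design.

```python
# Minimal, hypothetical sketch of a category-conditioned attention bias (not the paper's exact CSM).
import torch
import torch.nn as nn

class CategoryBias(nn.Module):
    """Adds a learnable, class-conditioned bias to self-attention logits."""
    def __init__(self, num_classes: int, num_tokens: int):
        super().__init__()
        # One learnable per-token bias vector per category.
        self.bias = nn.Parameter(torch.zeros(num_classes, num_tokens))

    def forward(self, attn_logits: torch.Tensor, class_ids: torch.Tensor) -> torch.Tensor:
        # attn_logits: (B, heads, N, N); class_ids: (B,) image-level labels.
        # Broadcast the class-specific bias over heads and query positions.
        return attn_logits + self.bias[class_ids][:, None, None, :]

# Toy usage: bias one block's attention with the image-level label.
B, H, N, C = 2, 6, 197, 200
attn_logits = torch.randn(B, H, N, N)
labels = torch.tensor([3, 42])
csm = CategoryBias(num_classes=C, num_tokens=N)
attn = csm(attn_logits, labels).softmax(dim=-1)  # category-aware attention maps
```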
The architecture of the proposed CATR consists of a vision transformer backbone, a Category-aware Stimulation Module (CSM), and an Object Constraint Module (OCM).

Requirements:
- PyTorch==1.10.1
- torchvision==0.11.2
- timm==0.4.12
git clone [email protected]:zhiweichen0012/CATR.git
cd CATR
- CUB (http://www.vision.caltech.edu/datasets/cub_200_2011/)
- ILSVRC (https://www.image-net.org/challenges/LSVRC/)
The directory structure is the standard layout for torchvision's `datasets.ImageFolder`; the training and validation data are expected to be in the `train/` and `val/` folders, respectively (a loading sketch follows the directory tree below):
/path/to/imagenet/
train/
class1/
img1.jpeg
class2/
img2.jpeg
val/
class1/
img3.jpeg
class2/
img4.jpeg
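A minimal loading sketch for this layout, assuming the standard 224x224 DeiT preprocessing; the normalization constants, batch size, and worker count are illustrative, not values taken from this repository's scripts.

```python
# Hypothetical ImageFolder loading sketch for the directory layout above.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder("/path/to/imagenet/train", transform=transform)
val_set = datasets.ImageFolder("/path/to/imagenet/val", transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128,
                                           shuffle=True, num_workers=8)
```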
We provide the trained CATR models.
Name | Loc. Acc@1 (%) | Loc. Acc@5 (%) | URL |
---|---|---|---|
CATR_CUB (This repository) | 80.066 | 91.992 | model |
CATR_ILSVRC (This repository) | 56.976 | 66.794 | model |
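For reference, Loc. Acc@1 is the standard WSOL Top-1 localization accuracy: a prediction counts as correct only if the Top-1 class is right and the predicted box reaches IoU >= 0.5 with a ground-truth box; Loc. Acc@5 relaxes the class check to the top 5 predictions. A small illustrative check, with hypothetical helper names:

```python
# Illustrative Top-1 localization check (helper names are hypothetical).
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def top1_loc_correct(pred_class, gt_class, pred_box, gt_box, thr=0.5):
    # Correct only when both the class and the box (IoU >= thr) are right.
    return pred_class == gt_class and box_iou(pred_box, gt_box) >= thr
```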
To train CATR on CUB with 4 GPUs, run:
bash scripts/train.sh deit_small_patch16_224_CATR_cub CUB 80 output_ckpt/CUB
To train CATR on ILSVRC with 4 GPUs, run:
bash scripts/train.sh deit_small_patch16_224_CATR_imnet IMNET 14 output_ckpt/IMNET
NOTE: Please check the paths to the "torchrun" command, the dataset, and the pre-training weights in the scripts/train.sh
.
To test the CUB models, you can run:
bash scripts/test.sh deit_small_patch16_224_CATR_cub CUB /path/to/CATR_CUB_model
To test the ILSVRC models, you can run:
bash scripts/test.sh deit_small_patch16_224_CATR_imnet IMNET /path/to/CATR_ILSVRC_model
NOTE: Please check the paths to the "python" command and the dataset in the scripts/test.sh
.
@inproceedings{chen2023category,
title={Category-aware Allocation Transformer for Weakly Supervised Object Localization},
author={Chen, Zhiwei and Ding, Jinren and Cao, Liujuan and Shen, Yunhang and Zhang, Shengchuan and Jiang, Guannan and Ji, Rongrong},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={6643--6652},
year={2023}
}
We use DeiT and its pre-trained weights as the backbone. Many thanks for their brilliant work!