MARVEL: Unlocking the Multi-Modal Capability of Dense Retrieval via Visual Module Plugin

Source code for our ACL 2024 paper: MARVEL: Unlocking the Multi-Modal Capability of Dense Retrieval via Visual Module Plugin

Click the links below to view our papers and checkpoints

If you find this work useful, please cite our paper and give us a shining star 🌟

@inproceedings{zhou2024marvel,
 title={MARVEL: Unlocking the Multi-Modal Capability of Dense Retrieval via Visual Module Plugin},
 author={Zhou, Tianshuo and Mei, Sen and Li, Xinze and Liu, Zhenghao and Xiong, Chenyan and Liu, Zhiyuan and Gu, Yu and Yu, Ge},
 booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics},
 year={2024}
}

Overview

MARVEL unlocks the multi-modal capability of dense retrieval via a visual module plugin. It encodes queries and multi-modal documents with a unified encoder to bridge the modality gap between images and texts, and it conducts retrieval, modality routing, and result fusion within a single embedding space.
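To make the idea of a single embedding space concrete, here is a toy sketch. It is not MARVEL's encoder: the vectors are hand-made, and the names `rank_unified`, `docs`, and `query` are hypothetical. It only shows that once text and image documents live in the same space, one similarity ranking covers both modalities, so no separate per-modality result lists need to be merged afterwards.

```python
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product, used here as the relevance score."""
    return sum(x * y for x, y in zip(a, b))


def rank_unified(
    query_vec: Sequence[float],
    docs: Dict[str, Tuple[str, Sequence[float]]],
    top_k: int = 3,
) -> List[Tuple[str, str, float]]:
    """Rank text and image documents together in one list.

    docs maps doc_id -> (modality, embedding). Both modalities are assumed
    to be embedded in the same space, so their scores are directly comparable.
    """
    scored = [
        (doc_id, modality, dot(query_vec, vec))
        for doc_id, (modality, vec) in docs.items()
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_k]


# Hand-made toy embeddings (assumption: real embeddings come from the model).
docs = {
    "txt_1": ("text", [0.9, 0.1, 0.0]),
    "img_1": ("image", [0.2, 0.9, 0.1]),
    "txt_2": ("text", [0.1, 0.2, 0.9]),
}
query = [0.3, 0.8, 0.1]

for doc_id, modality, score in rank_unified(query, docs):
    print(f"{doc_id}\t{modality}\t{score:.2f}")
```

In this toy example the image document ranks first because its vector is closest to the query, which is the kind of modality routing a unified space makes possible without an explicit classifier.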


Requirement

1. Install the following packages with pip or conda in this environment:

Python==3.7
Pytorch
transformers
clip
faiss-cpu==1.7.0
tqdm
numpy
base64
Install pytrec_eval from https://github.com/cvangysel/pytrec_eval

We provide requirements.txt, which lists the versions of all the packages we used. If you have any problems configuring the environment, please refer to that file.
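Before running any script, it can help to confirm that the packages above are importable. The following is a minimal sketch, not part of the repository. The import names are assumptions: `torch` for PyTorch and `faiss` for faiss-cpu. Note that `base64` ships with the Python standard library, so it needs no separate installation.

```python
import importlib.util
import sys
from typing import List

# Import names are assumptions: "torch" for PyTorch, "faiss" for faiss-cpu.
REQUIRED_MODULES = [
    "torch",
    "transformers",
    "clip",
    "faiss",
    "tqdm",
    "numpy",
    "base64",  # standard library, always present
    "pytrec_eval",
]


def find_missing(modules: List[str]) -> List[str]:
    """Return the top-level module names that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only tests whether each module can be located. It does not verify versions, so requirements.txt remains the reference for exact versions.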

2. Prepare the pretrained CLIP and T5-ANCE

MARVEL is built on the CLIP and T5-ANCE models.

Reproduce MARVEL

Download Code & Dataset

  • First, use git clone to download this project:
git clone https://github.com/OpenMatch/MARVEL
cd MARVEL
  • Download our WebQA data: WebQA. (❗️Note: to obtain imgs.tsv, download the data from this link and run 7z x imgs.7z.001.)
  • Please refer to ClueWeb22-MM to obtain the pretraining data and the retrieval benchmark.
  • Place the downloaded dataset in the data folder:
data/
├──WebQA/
│   ├── train.json
│   ├── dev.json
│   ├── test.json
│   ├── test_qrels.txt
│   ├── all_docs.json
│   ├── all_imgs.json
│   ├── imgs.tsv
│   └── imgs.lineidx.new
├──ClueWeb22-MM/
│   ├── train.parquet
│   ├── dev.parquet
│   ├── test.parquet
│   ├── test_qrels.txt
│   ├── text.parquet
│   └── image.parquet
└──pretrain/
    ├── train.parquet
    └── dev.parquet
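A quick way to confirm that the data folder matches the layout above is a small check script. The sketch below is not part of the repository; the helper name `missing_files` is hypothetical, and the file list is copied from the tree above.

```python
from pathlib import Path
from typing import List

# Expected files, copied from the directory tree above.
EXPECTED_FILES = {
    "WebQA": [
        "train.json", "dev.json", "test.json", "test_qrels.txt",
        "all_docs.json", "all_imgs.json", "imgs.tsv", "imgs.lineidx.new",
    ],
    "ClueWeb22-MM": [
        "train.parquet", "dev.parquet", "test.parquet", "test_qrels.txt",
        "text.parquet", "image.parquet",
    ],
    "pretrain": ["train.parquet", "dev.parquet"],
}


def missing_files(data_root) -> List[str]:
    """Return the expected files (relative to data_root) that are absent."""
    root = Path(data_root)
    return [
        (Path(folder) / name).as_posix()
        for folder, names in EXPECTED_FILES.items()
        for name in names
        if not (root / folder / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_files("data")
    if missing:
        print("Missing files:")
        for path in missing:
            print("  " + path)
    else:
        print("The data folder matches the expected layout.")
```

Run it from the project root after downloading. If you only use one dataset, the entries for the other folders will be reported as missing and can be ignored.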

Train MARVEL-ANCE

Using the WebQA dataset as an example, we show how to reproduce the results in the MARVEL paper. The same procedure applies to the ClueWeb22-MM dataset. We also provide the checkpoint for each step, so you can skip any step and continue training from the provided checkpoint.

  • First step: Go to the pretrain folder and pretrain the checkpoint of MARVEL's visual module:
cd pretrain
bash train.sh
  • Second step: Go to the DPR folder and train the MARVEL-DPR checkpoint using in-batch negatives:
cd DPR
bash train_webqa.sh
  • Third step: Use MARVEL-DPR to generate hard negatives for training MARVEL-ANCE:
bash get_hn_webqa.sh
  • Final step: Go to the ANCE folder and train the MARVEL-ANCE checkpoint using the hard negatives:
cd ANCE
bash train_ance_webqa.sh
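The third and final steps rely on hard negatives: documents that the current retriever ranks highly but that are not labeled relevant. The sketch below illustrates that general idea in plain Python with toy vectors. It is an assumption about the technique in general, not a description of what get_hn_webqa.sh does, and the function name `mine_hard_negatives` and its parameters are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product, used here as the retrieval score."""
    return sum(x * y for x, y in zip(a, b))


def mine_hard_negatives(
    query_vecs: Dict[str, Sequence[float]],
    doc_vecs: Dict[str, Sequence[float]],
    positives: Dict[str, Set[str]],
    top_k: int = 100,
    num_negatives: int = 1,
) -> Dict[str, List[str]]:
    """For each query, keep the top-ranked documents that are not positives."""
    hard_negatives = {}
    for qid, qvec in query_vecs.items():
        # Retrieve the top_k documents with the current (e.g. DPR-stage) model.
        ranked = sorted(
            doc_vecs, key=lambda did: dot(qvec, doc_vecs[did]), reverse=True
        )[:top_k]
        # Drop labeled positives; what remains near the top is "hard".
        negatives = [did for did in ranked if did not in positives.get(qid, set())]
        hard_negatives[qid] = negatives[:num_negatives]
    return hard_negatives


# Toy embeddings (assumption: real ones come from the trained encoder).
query_vecs = {"q1": [1.0, 0.0]}
doc_vecs = {"d1": [0.9, 0.1], "d2": [0.8, 0.3], "d3": [0.0, 1.0]}
positives = {"q1": {"d1"}}

print(mine_hard_negatives(query_vecs, doc_vecs, positives, top_k=2))
```

Here d2 is selected because it scores almost as high as the positive d1, while the clearly irrelevant d3 falls outside the top-k. A real pipeline would typically replace the exhaustive sort with an approximate or exact nearest-neighbor index.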

Evaluate Retrieval Effectiveness

  • These experimental results are shown in Table 2 of our paper.
  • Go to the DPR or ANCE folder and evaluate model performance as follows:
bash gen_embeds.sh
bash retrieval.sh
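For readers unfamiliar with the reported metrics, the following sketch computes MRR@10, NDCG@10, and Rec@100 for a single query with binary relevance labels. It is a simplified illustration, not the repository's evaluation code (which relies on pytrec_eval), and results may differ from trec_eval in edge cases such as tied scores. The function names are hypothetical.

```python
import math
from typing import List, Set


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 100) -> float:
    """Fraction of the relevant documents that appear within the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Toy ranking for one query: relevant documents sit at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print("MRR@10 ", mrr_at_k(ranked, relevant))
print("NDCG@10", round(ndcg_at_k(ranked, relevant), 4))
print("Rec@100", recall_at_k(ranked, relevant))
```

Per-query values are averaged over all queries, and the tables in this README report them multiplied by 100.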

Results

The results are shown as follows.

  • WebQA
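| Setting | Model | MRR@10 | NDCG@10 | Rec@100 |
|---|---|---|---|---|
| Single Modality (Text Only) | BM25 | 53.75 | 49.60 | 80.69 |
| Single Modality (Text Only) | DPR (Zero-Shot) | 22.72 | 20.06 | 45.43 |
| Single Modality (Text Only) | CLIP-Text (Zero-Shot) | 18.16 | 16.76 | 39.83 |
| Single Modality (Text Only) | Anchor-DR (Zero-Shot) | 39.96 | 37.09 | 71.32 |
| Single Modality (Text Only) | T5-ANCE (Zero-Shot) | 41.57 | 37.92 | 69.33 |
| Single Modality (Text Only) | BERT-DPR | 42.16 | 39.57 | 77.10 |
| Single Modality (Text Only) | NQ-DPR | 41.88 | 39.65 | 42.44 |
| Single Modality (Text Only) | NQ-ANCE | 45.54 | 42.05 | 69.31 |
| Divide-Conquer | VinVL-DPR | 22.11 | 22.92 | 62.82 |
| Divide-Conquer | CLIP-DPR | 37.35 | 37.56 | 85.53 |
| Divide-Conquer | BM25 & CLIP-DPR | 42.27 | 41.58 | 87.50 |
| UnivSearch | CLIP (Zero-Shot) | 10.59 | 8.69 | 20.21 |
| UnivSearch | VinVL-DPR | 38.14 | 35.43 | 69.42 |
| UnivSearch | CLIP-DPR | 48.83 | 46.32 | 86.43 |
| UnivSearch | UniVL-DR | 62.40 | 59.32 | 89.42 |
| UnivSearch | MARVEL-DPR | 55.71 | 52.94 | 88.23 |
| UnivSearch | MARVEL-ANCE | 65.15 | 62.95 | 92.40 |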
  • ClueWeb22-MM
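| Setting | Model | MRR@10 | NDCG@10 | Rec@100 |
|---|---|---|---|---|
| Single Modality (Text Only) | BM25 | 40.81 | 46.08 | 78.22 |
| Single Modality (Text Only) | DPR (Zero-Shot) | 20.59 | 23.24 | 44.93 |
| Single Modality (Text Only) | CLIP-Text (Zero-Shot) | 30.13 | 33.91 | 59.53 |
| Single Modality (Text Only) | Anchor-DR (Zero-Shot) | 42.92 | 48.50 | 76.52 |
| Single Modality (Text Only) | T5-ANCE (Zero-Shot) | 45.65 | 51.71 | 83.23 |
| Single Modality (Text Only) | BERT-DPR | 38.56 | 44.41 | 80.38 |
| Single Modality (Text Only) | NQ-DPR | 42.35 | 61.71 | 83.50 |
| Single Modality (Text Only) | NQ-ANCE | 45.89 | 51.83 | 81.21 |
| Divide-Conquer | VinVL-DPR | 29.97 | 36.13 | 74.56 |
| Divide-Conquer | CLIP-DPR | 39.54 | 47.16 | 87.25 |
| Divide-Conquer | BM25 & CLIP-DPR | 41.58 | 48.67 | 83.50 |
| UnivSearch | CLIP (Zero-Shot) | 16.28 | 18.52 | 40.36 |
| UnivSearch | VinVL-DPR | 35.09 | 40.36 | 75.06 |
| UnivSearch | CLIP-DPR | 42.59 | 49.24 | 87.07 |
| UnivSearch | UniVL-DR | 47.99 | 55.41 | 90.46 |
| UnivSearch | MARVEL-DPR | 46.93 | 53.76 | 88.74 |
| UnivSearch | MARVEL-ANCE | 55.19 | 62.83 | 93.16 |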

Contact

If you have questions, suggestions, or bug reports, please email: