Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey

Chen-Yang-Liu/Awesome-RS-Temporal-VLM




Chenyang Liu · Jiafan Zhang · Keyan Chen · Man Wang · Zhengxia Zou · Zhenwei Shi


This repository records and tracks recent Remote Sensing Temporal Vision-Language Models (RS-TVLMs). If you find any work missing or have suggestions (papers, implementations, or other resources), feel free to open a pull request.

⭐ Share us a ⭐

Give this repo a ⭐ if you find it interesting. We will continue to track relevant progress and update this repository.

🙌 Add Your Paper to Our Repo and Survey!

  • You are welcome to open an issue or PR for your RS-TVLM work! We will record it for the next version of our survey.

🥳 New

🔥🔥🔥 Updated on 2024.12.04 🔥🔥🔥

  • 2024.12.04: The first version is available.

✨ Highlight!!

  • The first survey for Remote Sensing Temporal Vision-Language models.

  • Some public datasets and code links are provided.

📖 Introduction

[Figure: Timeline of representative RS-TVLMs]

📚 Methods: A Survey

Change Captioning

| Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
|---|---|---|---|---|
| CNN-RNN | Captioning changes in bi-temporal remote sensing images | VGG-16 | RNN | N/A |
| CC-RNN/SVM | Change captioning: A new paradigm for multitemporal remote sensing image analysis | VGG-16 | RNN, SVM | N/A |
| RSICCformer | Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset | ResNet-101 | Transformer Decoder | code |
| PSNet | Progressive Scale-aware Network for Remote sensing Image Change Captioning | ViT-B/32 | Transformer Decoder | code |
| PromptCC | A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning | ViT-B/32 | GPT-2 | code |
| Chg2Cap | Changes to Captions: An Attentive Network for Remote Sensing Change Captioning | ResNet-101 | Transformer Decoder | code |
| ICT-Net | Interactive Change-Aware Transformer Network for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | code |
| SITS-CC | Change Caption for Satellite Images Time Series | ResNet-101 | Transformer Decoder | code |
| RSCaMa | RSCaMa: Remote Sensing Image Change Captioning with State Space Model | ViT-B/32 | Mamba, Transformer Decoder, GPT-2 | code |
| SparseFocus | A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | code |
| SEN | Single-stream Extractor Network with Contrastive Pre-training for Remote Sensing Change Captioning | ResNet with 6-channel | Transformer Decoder | code |
| Diffusion-RSCC | Diffusion model for learning cross-modal data distribution | ResNet-101 | Diffusion | code |
| CARD | Context-aware Difference Distilling for Multi-change Captioning | ResNet-101 | Transformer Decoder | code |
| ChangeRetCap | Towards a multimodal framework for remote sensing image change retrieval and captioning | ResNet-101 | Transformer Decoder | code |
| Intelli-Change | Intelli-Change Remote Sensing - A Novel Transformer Approach | ResNet-101 | Transformer Decoder | N/A |
| ChangeExp | Towards Temporal Change Explanations from Bi-Temporal Satellite Images | LLaVA-1.5 | LLaVA-1.5 | N/A |
| MAF-Net | Multi-scale Attentive Fusion Network for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | N/A |
| SFEN | Scale-wised feature enhancement network for change captioning of remote sensing images | WideResNet | Transformer Decoder | N/A |
| MfrNet | MfrNet: A New Multi-Scale Feature Refining Method for Remote Sensing Image Change Captioning | ResNet-18 | Transformer Decoder | N/A |
| SEIFNet | Inter-Temporal Interaction and Symmetric Difference Learning for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | code |
| MV-CC | MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption | InternVideo2 | Transformer Decoder | code |
| Chareption | Chareption: Change-Aware Adaption Empowers Large Language Model for Effective Remote Sensing Image Change Captioning | CLIP ViT-L/14 | LLaMA-7B | N/A |
| MADiffCC | Remote Sensing Image Change Captioning Using Multi-Attentive Network with Diffusion Model | Diffusion | Transformer Decoder | N/A |
| CCExpert | CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset | Diffusion | Transformer Decoder | code |
......

Multitask Learning of Change Detection and Captioning

| Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
|---|---|---|---|---|
| Pix4Cap | Pixel-Level Change Detection Pseudo-Label Learning for Remote Sensing Change Captioning | ViT-B/32 | Transformer Decoder | code |
| Change-Agent | Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis | ViT-B/32 | Transformer Decoder | code |
| Semantic-CC | Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance | SAM | Vicuna | N/A |
| DetACC | Detection Assisted Change Captioning for Remote Sensing Image | ResNet-101 | Transformer Decoder | N/A |
| KCFI | Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning | ViT | Qwen | code |
| ChangeMinds | ChangeMinds: Multi-task Framework for Detecting and Describing Changes in Remote Sensing | Swin Transformer | Transformer Decoder | code |
| CTMTNet | A Multi-Task Network and Two Large Scale Datasets for Change Detection and Captioning in Remote Sensing Images | ResNet-101 | Transformer Decoder | N/A |
......

Change Visual Question Answering

| Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
|---|---|---|---|---|
| change-aware VQA | Change-Aware Visual Question Answering | CNN | RNN | N/A |
| CDVQA-Net | Change Detection Meets Visual Question Answering | CNN | RNN | code |
| ChangeChat | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | CLIP-ViT | Vicuna-v1.5 | code |
| CDChat | CDChat: A Large Multimodal Model for Remote Sensing Change Description | CLIP ViT-L/14 | Vicuna-v1.5 | code |
| TEOChat | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | CLIP ViT-L/14 | LLaMA-2 | code |
| GeoLLaVA | GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing | Video encoder | LLaVA-NeXT and Video-LLaVA | code |
| CDQAG | Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | CLIP Image Encoder | CLIP Text Encoder | code |
......

Text2Change Retrieval

| Model Name | Paper Title | Code/Project |
|---|---|---|
| ChangeRetCap | Towards a multimodal framework for remote sensing image change retrieval and captioning | code |
......

Change Grounding

| Model Name | Paper Title | Code/Project |
|---|---|---|
| ChangeChat | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | code |
| CDChat | CDChat: A Large Multimodal Model for Remote Sensing Change Description | code |
| TEOChat | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | code |
| CDQAG | Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | code |
......

Large Language Models Meet Temporal Images

| Method | Release Time | LLM | Fine-tuning | Task | Paper Title | Code/Project |
|---|---|---|---|---|---|---|
| PromptCC | 2023.06 | GPT-2 | Prompt Learning | CC | A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning | code |
| Change-Agent | 2024.07 | ChatGPT | -- | CC, CD | Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis | code |
| Semantic-CC | 2024.07 | Vicuna | LoRA | CC | Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance | N/A |
| ChangeChat | 2024.09 | Vicuna-v1.5 | LoRA | CVQA, CG | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | code |
| KCFI | 2024.09 | Qwen | Prompt | CC | Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning | code |
| CDChat | 2024.09 | Vicuna-v1.5 | LoRA | CVQA | CDChat: A Large Multimodal Model for Remote Sensing Change Description | code |
| TEOChat | 2024.10 | LLaMA-2 | LoRA | CVQA, CG | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | code |
| GeoLLaVA | 2024.10 | LLaVA-NeXT | LoRA | CVQA | GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing | code |
| Chareption | 2024.10 | LLaMA-7B | Adapter | CC | Chareption: Change-Aware Adaption Empowers Large Language Model for Effective Remote Sensing Image Change Captioning | N/A |
| CCExpert | 2024.11 | Qwen-2 | LoRA | CC | CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset | code |
......

📊 Dataset

  • Dataset Matching Temporal Images and Text:

| Dataset | Image Size/Resolution | Image pairs | Captions | Annotation | Download Link |
|---|---|---|---|---|---|
| DUBAI CCD | 50×50 (30m) | 500 | 2,500 | Manual | Link |
| LEVIR CCD | 256×256 (0.5m) | 500 | 2,500 | Manual | Link |
| LEVIR-CC | 256×256 (0.5m) | 10,077 | 50,385 | Manual | Link |
| WHU-CDC | 256×256 (0.075m) | 7,434 | 37,170 | Manual | Link |

  • Dataset Matching Temporal Images, Text, and Masks:

| Dataset | Image Size/Resolution | Image pairs | Captions | Pixel-level Masks | Annotation | Download Link |
|---|---|---|---|---|---|---|
| LEVIR-MCI | 256×256 (0.5m) | 10,077 | 50,385 | 44,380 (building, road) | Manual | Link |
| LEVIR-CDC | 256×256 (0.5m) | 10,077 | 50,385 | -- (building) | Manual | Link |
| WHU-CDC | 256×256 (0.075m) | 7,434 | 37,170 | -- (building) | Manual | Link |

  • Dataset Matching Temporal Images and Question-Answer Instructions:

| Dataset | Temporal Images | Image Resolution | Instruction Samples | Change-related Task | Annotation | Download Link |
|---|---|---|---|---|---|---|
| CDVQA | 2,968 pairs (bi-temporal) | 0.5m~3m | 122,000 | CVQA | Manual | Link |
| ChangeChat-87k | 10,077 pairs (bi-temporal) | 0.5m | 87,195 | CVQA, Grounding | Automated | Link |
| GeoLLaVA | 100,000 pairs (bi-temporal) | -- | 100,000 | CVQA | Automated | Link |
| TEOChatlas | -- (variable temporal length) | -- | 554,071 | Classification, CVQA, Grounding | Automated | Link |
| QVG-360K | 6,810 pairs (bi-temporal) | 0.1m~3m | 360,000 | CVQA, Grounding | Automated | Link |

......

👨‍🏫 Other Surveys

| Year | Paper Title |
|---|---|
| 2023 | An Agenda for Multimodal Foundation Models for Earth Observation |
| 2023 | Self-Supervised Remote Sensing Feature Learning: Learning Paradigms, Challenges, and Future Works |
| 2023 | Large Remote Sensing Model: Progress and Prospects |
| 2023 | Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey |
| 2023 | On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications |
| 2024 | Vision-Language Models in Remote Sensing: Current Progress and Future Trends |
| 2024 | On the Foundations of Earth and Climate Foundation Models |
| 2024 | Towards Vision-Language Geo-Foundation Model: A Survey |
| 2024 | Language Integration in Remote Sensing: Tasks, datasets, and future directions |
| 2024 | Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques |
| 2024 | An LLM Agent for Automatic Geospatial Data Analysis |
| 2024 | COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models |

🖊️ Citation

If you find our survey and repository useful for your research, please consider citing our paper:

@misc{liu2024remotesensingtemporalvisionlanguage,
      title={Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey}, 
      author={Chenyang Liu and Jiafan Zhang and Keyan Chen and Man Wang and Zhengxia Zou and Zhenwei Shi},
      year={2024},
      eprint={2412.02573},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.02573}, 
}

🐲 Contact
