Awesome-Remote-Sensing-Cross-Modal-Image-Text-Retrieval

A collection of papers, datasets, benchmarks, code, and model weights for Remote Sensing Cross-Modal Image-Text Retrieval (RSCMIT).

📢 Latest Updates

🔥🔥🔥 Last Updated on 2025.01.10 🔥🔥🔥

2025.01.10: Update TeoChat, RingMoGPT, and VHM.
2025.01.09: Update PERSVL and GeoChat.
2024.12.24: Update CFITR.
2024.12.09: Update CDMAN, MSA, KTIR, CMPAGL, CCLS2T, SARCI, FSISR, and SCAT.
2024.12.05: Update SIRS and HVSA.

Remote Sensing Cross-Modal Image-Text Survey

Paper	Title	Publication	Affiliation
Paper	Advancements in Vision–Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques	Remote Sensing 2025	Northwestern Polytechnical University
Paper	Foundation Models for Remote Sensing and Earth Observation: A Survey	Arxiv 2024	University of Tokyo and the RIKEN Center for Advanced Intelligence Project, Japan
Paper	When Geoscience Meets Foundation Models: Toward a general geoscience artificial intelligence system	GRSM 2024	Nanjing University of Aeronautics and Astronautics
Paper	Vision-Language Models in Remote Sensing: Current progress and future trends	GRSM 2024	King Abdullah University of Science and Technology
Paper	Language Integration in Remote Sensing: Tasks, datasets, and future directions	GRSM 2023	King Saud University
Paper	Self-Supervised Remote Sensing Feature Learning: Learning Paradigms, Challenges, and Future Works	TGRS 2023	Central South University
Paper	The Potential of Visual ChatGPT For Remote Sensing	Arxiv 2023	University of Western São Paulo
Paper	Remote Sensing Large Models: Progress and Prospects (遥感大模型：进展与前瞻)	Geomatics and Information Science of Wuhan University 2023	Wuhan University

Remote Sensing Image-Text Datasets

Dataset Name	Number of Images	Image Resolution	VLMs	Note
UCM-Captions	2,100	256 × 256	-	-
Sydney-Captions	613	500 × 500	-	-
RSICD	10,921	224 × 224	-	-
RSITMD	4,743	256 × 256	-	-
NWPU-Captions	31,500	256 × 256	-	-
RS5M	5 million+	All Resolutions	GeoRSCLIP	-
SkyScript	5.2 million+	All Resolutions	SkyCLIP	-
ChatEarthNet	163,488 + 10,000	-	ChatEarthNet	The dataset will be made publicly available.
GEOBench-VLM	over 10,000	-	GEOBench-VLM	-
VRSBench	29,614	-	VRSBench	29,614 images, with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs (NeurIPS 2024 Datasets and Benchmarks Track)

Remote Sensing Cross-Modal Image-Text Retrieval Models

Paper	Title	Publication	Affiliation	Code
CFITR	Toward Efficient and Accurate Remote Sensing Image–Text Retrieval With a Coarse-to-Fine Approach	GRSL 2024	Beijing Foreign Studies University	Github
PERSVL	Prior-Experience-based Vision-Language Model for Remote Sensing Image-Text Retrieval	TGRS 2024	Xidian University	Github
CDMAN	Thread the Needle: Cues-Driven Multi-Association for Remote Sensing Cross-Modal Retrieval	TGRS 2024	Wuhan University of Technology	-
MSA	Transcending Fusion: A Multiscale Alignment Method for Remote Sensing Image–Text Retrieval	TGRS 2024	Xidian University	Github
KTIR	Knowledge-aware Text-Image Retrieval for Remote Sensing Images	TGRS 2024	EPFL	-
CMPAGL	Cross-Modal Prealigned Method With Global and Local Information for Remote Sensing Image and Text Retrieval	TGRS 2024	Shanghai Maritime University	Github
FGIS	Fine-Grained Information Supplementation and Value-Guided Learning for Remote Sensing Image-Text Retrieval	JSTARS 2024	Chongqing University	-
EBAKER	Eliminate Before Align: A Remote Sensing Image-Text Retrieval Framework with Keyword Explicit Reasoning	ACMMM 2024	Tianjin University	-
CUP	Cross-Modal Remote Sensing Image–Text Retrieval via Context and Uncertainty-Aware Prompt	TNNLS 2024	Xidian University	Github
CCLS2T	Cross-Modal Contrastive Learning With Spatiotemporal Context for Correlation-Aware Multiscale Remote Sensing Image Retrieval	TGRS 2024	Xidian University	-
MIIA	Global–Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image–Text Retrieval	TGRS 2024	Northwestern Polytechnical University	-
SARCI	Scale-Aware Adaptive Refinement and Cross-Interaction for Remote Sensing Audio-Visual Cross-Modal Retrieval	TGRS 2024	Wuhan University of Technology	Github
GLISA	Masking-Based Cross-Modal Remote Sensing Image–Text Retrieval via Dynamic Contrastive Learning	TGRS 2024	China University of Mining and Technology	-
SCAT	Spatial–Channel Attention Transformer With Pseudo Regions for Remote Sensing Image-Text Retrieval	TGRS 2024	Northwestern Polytechnical University	-
FSISR	Cross-Modal Hashing With Feature Semi-Interaction and Semantic Ranking for Remote Sensing Ship Image Retrieval	TGRS 2024	Harbin Institute of Technology	-
SkyEyeGPT	Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model	Arxiv 2024	Northwestern Polytechnical University	Github
MFF-SFE	Cross-modal retrieval method based on MFF-SFE for remote sensing image-text	Journal of University of Chinese Academy of Sciences 2024	Aerospace Information Research Institute, Chinese Academy of Sciences	-
RemoteCLIP	RemoteCLIP: A Vision Language Foundation Model for Remote Sensing	TGRS 2024	Hohai University	Github
C2F-ITR	From Coarse To Fine: An Offline-Online Approach for Remote Sensing Cross-Modal Retrieval	IGARSS 2024	Beijing Foreign Studies University	-
MGRM-EL	Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval	TGRS 2024	Northwestern Polytechnical University	-
SIRS	Multitask Joint Learning for Remote Sensing Foreground-Entity Image–Text Retrieval	TGRS 2024	Soochow University	Github
PIR	A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval	ACMMM 2023 oral	Zhejiang University of Technology	Github
PE-RSITR	Parameter-Efficient Transfer Learning for Remote Sensing Image–Text Retrieval	TGRS 2023	Northwestern Polytechnical University	Github
HVSA	Hypersphere-Based Remote Sensing Cross-Modal Text–Image Retrieval via Curriculum Learning	TGRS 2023	Aerospace Information Research Institute, Chinese Academy of Sciences	Github
SWAN	Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval	ICMR 2023 oral	Zhejiang University of Technology	Github
KAMCL	Knowledge-Aided Momentum Contrastive Learning for Remote-Sensing Image Text Retrieval	TGRS 2023	Tianjin University	Github
IEFT	Interacting-Enhancing Feature Transformer for Cross-Modal Remote-Sensing Image and Text Retrieval	TGRS 2023	Xidian University	Github
-	A Texture and Saliency Enhanced Image Learning Method For Cross-Modal Remote Sensing Image-Text Retrieval	IGARSS 2023	Xidian University	-
Multilanguage Transformer	Multilanguage Transformer for Improved Text to Remote Sensing Image Retrieval	JSTARS 2022	King Saud University	-
GaLR	Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information	TGRS 2022	Aerospace Information Research Institute, Chinese Academy of Sciences	Github
-	Cross-modal retrieval of remote sensing images and text based on self-attention unsupervised deep common feature space	IJRS 2022	National University of Defense Technology	-
AMFMN	Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval	TGRS 2021	Aerospace Information Research Institute, Chinese Academy of Sciences	Github
LW-MCR	A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing	TGRS 2021	Aerospace Information Research Institute, Chinese Academy of Sciences	Github
VSE++	VSE++: Improving Visual-Semantic Embeddings with Hard Negatives	BMVC 2018 spotlight	University of Toronto	Github

Remote Sensing Vision-Language Models for More Tasks

Paper	Title	Publication	Affiliation	Code	Note
FIANet	Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation	TGRS 2024	Southwest Jiaotong University	Github	Image Segmentation
FedRSClip	FedRSClip: Federated Learning for Remote Sensing Scene Classification Using Vision-Language Models	Arxiv 2024	China Academy of Electronics and Information Technology	-	Scene Classification

Remote Sensing Vision-Language Large & Foundation Models

Abbreviation	Title	Publication	Paper	Code & Weights
VHM	VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis	Arxiv 2024	VHM	link
TeoChat	TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data	Arxiv 2024	TeoChat	link
RingMoGPT	RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and Grounded Tasks	TGRS 2024	RingMoGPT	-
EarthMarker	EarthMarker: A Visual Prompting Multi-modal Large Language Model for Remote Sensing	TGRS 2024	EarthMarker	link
GeoChat	GeoChat: Grounded Large Vision-Language Model for Remote Sensing	CVPR 2024	GeoChat	link
EarthGPT	EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain	TGRS 2024	EarthGPT	link
RemoteCLIP	RemoteCLIP: A Vision Language Foundation Model for Remote Sensing	TGRS 2024	RemoteCLIP	link
RSGPT	RSGPT: A Remote Sensing Vision Language Model and Benchmark	Arxiv 2023	RSGPT	link
GeoRSCLIP	RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model	Arxiv 2023	GeoRSCLIP	link
GRAFT	Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment	ICLR 2024	GRAFT	-
CSP	CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations	ICML 2023	CSP	link
GeoCLIP	GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization	NeurIPS 2023	GeoCLIP	link
SatCLIP	SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery	Arxiv 2023	SatCLIP	link

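Questions, Feedback, and Contributions to This Repository

I welcome all kinds of feedback, ideally shared through GitHub Issues. Likewise, if you have any questions or simply want to exchange ideas with others, feel free to post them there.

Acknowledgements

Thanks to the authors of the related papers and projects.

Citation

If you find this project useful for your research, please consider citing it.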

BaolanChen/Awesome-Remote-Sensing-Cross-Modal-Image-Text-Retrieval
