A collection of papers, datasets, benchmarks, code, and model weights for Remote Sensing Cross-Modal Image-Text Retrieval (RSCMIT).
🔥🔥🔥 Last Updated on 2025.01.10 🔥🔥🔥
- 2025.01.10: Update TeoChat, RingMoGPT and VHM.
- 2025.01.09: Update PERSVL and GeoChat.
- 2024.12.24: Update CFITR.
- 2024.12.09: Update CDMAN, MSA, KTIR, CMPAGL, CCLS2T, SARCI, FSISR and SCAT.
- 2024.12.05: Update SIRS and HVSA.
- Models
- Datasets & Benchmarks
- Survey
- Projects
Paper | Title | Publication | Affiliation | Note |
---|---|---|---|---|
Paper | Advancements in Vision–Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques | Remote Sensing 2025 | Northwestern Polytechnical University | |
Paper | Foundation Models for Remote Sensing and Earth Observation: A Survey | Arxiv 2024 | University of Tokyo and the RIKEN Center for Advanced Intelligence Project, Japan | |
Paper | When Geoscience Meets Foundation Models: Toward a general geoscience artificial intelligence system | GRSM 2024 | Nanjing University of Aeronautics and Astronautics | |
Paper | Vision-Language Models in Remote Sensing: Current progress and future trends | GRSM 2024 | King Abdullah University of Science and Technology | |
Paper | Language Integration in Remote Sensing: Tasks, datasets, and future directions | GRSM 2023 | King Saud University | |
Paper | Self-Supervised Remote Sensing Feature Learning: Learning Paradigms, Challenges, and Future Works | TGRS 2023 | Central South University | |
Paper | The Potential of Visual ChatGPT For Remote Sensing | Arxiv 2023 | University of Western São Paulo | |
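Paper | 遥感大模型:进展与前瞻 (Large Remote Sensing Models: Progress and Prospects) | Geomatics and Information Science of Wuhan University (武汉大学学报 信息科学版) 2023 | Wuhan University | |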
Dataset Name | Number of Images | Image Size (pixels) | VLMs | Note |
---|---|---|---|---|
UCM-Captions | 2,100 | 256 × 256 | - | - |
Sydney-Captions | 613 | 500 × 500 | - | - |
RSICD | 10,921 | 224 × 224 | - | - |
RSITMD | 4,743 | 256 × 256 | - | - |
NWPU-Captions | 31,500 | 256 × 256 | - | - |
RS5M | 5 million+ | All Resolutions | GeoRSCLIP | - |
SkyScript | 5.2 million+ | All Resolutions | SkyCLIP | - |
ChatEarthNet | 163,488 + 10,000 | - | ChatEarthNet | The dataset will be made publicly available. |
GEOBench-VLM | over 10,000 | - | GEOBench-VLM | - |
VRSBench | 29,614 | - | VRSBench | 29,614 images, with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs (NeurIPS 2024 Datasets and Benchmarks Track) |
Paper | Title | Publication | Affiliation | Code | Note |
---|---|---|---|---|---|
CFITR | Toward Efficient and Accurate Remote Sensing Image–Text Retrieval With a Coarse-to-Fine Approach | GRSL 2024 | Beijing Foreign Studies University | Github | |
PERSVL | Prior-Experience-based Vision-Language Model for Remote Sensing Image-Text Retrieval | TGRS 2024 | Xidian University | Github | |
CDMAN | Thread the Needle: Cues-Driven Multi-Association for Remote Sensing Cross-Modal Retrieval | TGRS 2024 | Wuhan University of Technology | - | |
MSA | Transcending Fusion: A Multiscale Alignment Method for Remote Sensing Image–Text Retrieval | TGRS 2024 | Xidian University | Github | |
KTIR | Knowledge-aware Text-Image Retrieval for Remote Sensing Images | TGRS 2024 | EPFL | - | |
CMPAGL | Cross-Modal Prealigned Method With Global and Local Information for Remote Sensing Image and Text Retrieval | TGRS 2024 | Shanghai Maritime University | Github | |
FGIS | Fine-Grained Information Supplementation and Value-Guided Learning for Remote Sensing Image-Text Retrieval | JSTARS 2024 | Chongqing University | - | |
EBAKER | Eliminate Before Align: A Remote Sensing Image-Text Retrieval Framework with Keyword Explicit Reasoning | ACMMM 2024 | Tianjin University | - | |
CUP | Cross-Modal Remote Sensing Image–Text Retrieval via Context and Uncertainty-Aware Prompt | TNNLS 2024 | Xidian University | Github | |
CCLS2T | Cross-Modal Contrastive Learning With Spatiotemporal Context for Correlation-Aware Multiscale Remote Sensing Image Retrieval | TGRS 2024 | Xidian University | - | |
GLISA | Global–Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image–Text Retrieval | TGRS 2024 | Northwestern Polytechnical University | - | |
SARCI | Scale-Aware Adaptive Refinement and Cross-Interaction for Remote Sensing Audio-Visual Cross-Modal Retrieval | TGRS 2024 | Wuhan University of Technology | Github | |
MIIA | Masking-Based Cross-Modal Remote Sensing Image–Text Retrieval via Dynamic Contrastive Learning | TGRS 2024 | China University of Mining and Technology | - | |
SCAT | Spatial–Channel Attention Transformer With Pseudo Regions for Remote Sensing Image-Text Retrieval | TGRS 2024 | Northwestern Polytechnical University | - | |
FSISR | Cross-Modal Hashing With Feature Semi-Interaction and Semantic Ranking for Remote Sensing Ship Image Retrieval | TGRS 2024 | Harbin Institute of Technology | - | |
SkyEyeGPT | Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model | Arxiv 2024 | Northwestern Polytechnical University | Github | |
MFF-SFE | Cross-modal retrieval method based on MFF-SFE for remote sensing image-text | Journal of University of Chinese Academy of Sciences (中国科学院大学学报) 2024 | Aerospace Information Research Institute, Chinese Academy of Sciences | - | |
RemoteCLIP | RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | TGRS 2024 | Hohai University | Github | |
C2F-ITR | From Coarse To Fine: An Offline-Online Approach for Remote Sensing Cross-Modal Retrieval | IGARSS 2024 | Beijing Foreign Studies University | - | |
MGRM-EL | Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval | TGRS 2024 | Northwestern Polytechnical University | - | |
SIRS | Multitask Joint Learning for Remote Sensing Foreground-Entity Image–Text Retrieval | TGRS 2024 | Soochow University | Github | |
PIR | A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval | ACMMM 2023 oral | Zhejiang University of Technology | Github | |
PE-RSITR | Parameter-Efficient Transfer Learning for Remote Sensing Image–Text Retrieval | TGRS 2023 | Northwestern Polytechnical University | Github | |
HVSA | Hypersphere-Based Remote Sensing Cross-Modal Text–Image Retrieval via Curriculum Learning | TGRS 2023 | Aerospace Information Research Institute, Chinese Academy of Sciences | Github | |
SWAN | Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval | ICMR 2023 oral | Zhejiang University of Technology | Github | |
KAMCL | Knowledge-Aided Momentum Contrastive Learning for Remote-Sensing Image Text Retrieval | TGRS 2023 | Tianjin University | Github | |
IEFT | Interacting-Enhancing Feature Transformer for Cross-Modal Remote-Sensing Image and Text Retrieval | TGRS 2023 | Xidian University | Github | |
- | A Texture and Saliency Enhanced Image Learning Method For Cross-Modal Remote Sensing Image-Text Retrieval | IGARSS 2023 | Xidian University | - | |
Multilanguage Transformer | Multilanguage Transformer for Improved Text to Remote Sensing Image Retrieval | JSTARS 2022 | King Saud University | - | |
GaLR | Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information | TGRS 2022 | Aerospace Information Research Institute, Chinese Academy of Sciences | Github | |
- | Cross-modal retrieval of remote sensing images and text based on self-attention unsupervised deep common feature space | IJRS 2022 | National University of Defense Technology | - | |
AMFMN | Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval | TGRS 2021 | Aerospace Information Research Institute, Chinese Academy of Sciences | Github | |
LW-MCR | A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing | TGRS 2021 | Aerospace Information Research Institute, Chinese Academy of Sciences | Github | |
VSE++ | VSE++: Improving Visual-Semantic Embeddings with Hard Negatives | BMVC 2018 spotlight | University of Toronto | Github | |
Paper | Title | Publication | Affiliation | Code | Note |
---|---|---|---|---|---|
FIANet | Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation | TGRS 2024 | Southwest Jiaotong University | Github | Image Segmentation |
FedRSClip | FedRSClip: Federated Learning for Remote Sensing Scene Classification Using Vision-Language Models | Arxiv 2024 | China Academy of Electronics and Information Technology | - | Scene Classification |
Abbreviation | Title | Publication | Paper | Code & Weights |
---|---|---|---|---|
VHM | VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis | Arxiv 2024 | VHM | link |
TeoChat | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | Arxiv 2024 | TeoChat | link |
RingMoGPT | RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and Grounded Tasks | TGRS 2024 | RingMoGPT | - |
EarthMarker | EarthMarker: A Visual Prompting Multi-modal Large Language Model for Remote Sensing | TGRS 2024 | EarthMarker | link |
GeoChat | GeoChat: Grounded Large Vision-Language Model for Remote Sensing | CVPR 2024 | GeoChat | link |
EarthGPT | EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain | TGRS 2024 | EarthGPT | link |
RemoteCLIP | RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | TGRS 2024 | RemoteCLIP | link |
RSGPT | RSGPT: A Remote Sensing Vision Language Model and Benchmark | Arxiv 2023 | RSGPT | link |
GeoRSCLIP | RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model | Arxiv 2023 | GeoRSCLIP | link |
GRAFT | Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment | ICLR 2024 | GRAFT | - |
CSP | CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations | ICML 2023 | CSP | link |
GeoCLIP | GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization | NeurIPS 2023 | GeoCLIP | link |
SatCLIP | SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery | Arxiv 2023 | SatCLIP | link |
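I welcome all kinds of feedback, preferably shared via GitHub Issues. Likewise, if you have any questions or just want to exchange ideas with others, feel free to post them there.
Thanks to the authors of the related papers and projects.
If you find this project useful for your research, please consider citing it.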