DeepAndes: A Self-Supervised Vision Foundation Model for Multispectral Remote Sensing Imagery of the Andes

Guo, J., Zimmer-Dauphinee, J. R., Nieusma, J. M., Lu, S., Liu, Q., Deng, R., Cui, C., Yue, J., Lin, Y., Yao, T., Xiong, J., Zhu, J., Qu, C., Yang, Y., Wilkes, M., Wang, X., VanValkenburgh, P., Wernke, S. A., & Huo, Y. (2025). DeepAndes: A Self-Supervised Vision Foundation Model for Multispectral Remote Sensing Imagery of the Andes. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 18, 26983-26999. https://doi.org/10.1109/JSTARS.2025.3619423

Archaeologists often use remote sensing, which involves studying landscapes through satellite imagery, to understand how past societies grew, interacted, and adapted over long periods of time. These large-scale surveys can reveal patterns that ground-based fieldwork alone cannot. Their power increases even more when combined with deep learning and computer vision, which help detect archaeological features automatically. However, traditional supervised deep learning methods struggle because they require huge amounts of detailed annotations, which are difficult and time-consuming to create for subtle archaeological features.

At the same time, new vision foundation models, large, general-purpose computer vision systems, have shown impressive performance using minimal annotations. But most of these models are designed for standard three-channel RGB images, not the eight-band multispectral satellite imagery that archaeologists rely on for detecting subtle, buried, or eroded features.

To address this gap, the researchers created DeepAndes, a transformer-based vision foundation model specifically built for Andean archaeology. It was trained on three million multispectral satellite images and uses a customized version of the DINOv2 self-supervised learning algorithm, adapted to handle eight-band data. This makes DeepAndes the first foundation model tailored to the Andean region and its archaeological detection challenges.
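To make the adaptation concrete, the following is an illustrative sketch (not the authors' code) of the key architectural change: a ViT/DINOv2-style model tokenizes an image through a patch-embedding layer, and for eight-band imagery that layer simply takes 8 input channels instead of RGB's 3. The patch size, embedding width, and weight initialization below are assumptions for illustration.

```python
import numpy as np

PATCH = 14        # assumed patch size (DINOv2 uses 14x14 patches)
BANDS = 8         # eight spectral bands instead of RGB's three
EMBED_DIM = 384   # assumed embedding width (e.g., a ViT-Small scale)

rng = np.random.default_rng(0)
# Linear patch projection: flattened (BANDS * PATCH * PATCH) -> EMBED_DIM.
# This is the only weight whose shape depends on the number of bands.
W = rng.normal(0, 0.02, size=(BANDS * PATCH * PATCH, EMBED_DIM))

def patch_embed(img):
    """Split an (H, W, BANDS) image into non-overlapping patches and
    project each flattened patch to one token embedding."""
    h, w, c = img.shape
    assert c == BANDS and h % PATCH == 0 and w % PATCH == 0
    gh, gw = h // PATCH, w // PATCH
    patches = (img.reshape(gh, PATCH, gw, PATCH, c)
                  .transpose(0, 2, 1, 3, 4)       # group pixels by patch
                  .reshape(gh * gw, PATCH * PATCH * c))
    return patches @ W                            # (num_tokens, EMBED_DIM)

tokens = patch_embed(rng.random((224, 224, BANDS)))
print(tokens.shape)  # (256, 384): a 16x16 patch grid, one token per patch
```

Everything downstream of this projection (the transformer blocks and the DINOv2 self-distillation objective) is agnostic to the number of input bands, which is why the change is localized to this one layer.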

The team tested DeepAndes on tasks such as classifying difficult, imbalanced image datasets, retrieving specific types of images, and performing pixel-level semantic segmentation. Across all tasks, the model outperformed systems trained from scratch or on smaller datasets, achieving higher F1 scores, mean average precision, and Dice scores, especially in few-shot learning situations where only a small number of labeled examples are available.
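For readers unfamiliar with the metrics above, here is a minimal sketch (with toy data, not results from the paper) of two of them: macro-averaged F1, which weights rare classes equally and so suits imbalanced classification, and the Dice coefficient used for pixel-level segmentation.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Average of per-class F1 scores; rare classes count as much as
    common ones, which matters for imbalanced datasets."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

def dice(pred_mask, true_mask):
    """Dice coefficient between two binary segmentation masks:
    2 * |intersection| / (|pred| + |true|)."""
    inter = np.sum(pred_mask & true_mask)
    return 2.0 * inter / (pred_mask.sum() + true_mask.sum())

# Toy example: one overlapping pixel out of masks of size 2 and 1.
pred = np.array([[1, 1], [0, 0]], dtype=bool)
true = np.array([[1, 0], [0, 0]], dtype=bool)
print(round(dice(pred, true), 3))  # 2*1 / (2+1) = 0.667
```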

Overall, these results show that large-scale self-supervised pretraining can greatly improve archaeological remote sensing, helping researchers identify ancient sites and landscapes more accurately and efficiently.

Fig. 1. Overview of DeepAndes. This figure shows the training dataset (a)-(d) and three domain-specific downstream tasks (e) using DeepAndes, a vision foundation model designed for multispectral satellite imagery of the Andes region. Specifically, (a) shows a large-scale map of the imagery used to train DeepAndes, highlighting various land cover types, with their area distribution shown in (c). (b) presents a unit sample patch [red box in (a), (b), (d)] with eight spectral bands. (d) illustrates image patching for DINOv2 training, with geospatial sampling densely covering different archaeological sites.