Machine learning benchmarks for Earth-monitoring models

GEO-Bench: Toward foundation models for Earth monitoring

Authors: Alexandre Lacoste, Alexandre Drouin, and David Vazquez

Applying machine learning to Earth monitoring is becoming a hot area of research. Various projects are using AI to identify sources of methane pollution, quantify the amount of carbon stored in a forest, predict extreme weather, and monitor crops. Most of these rely on large pretrained models known as foundation models, the same type of model that boosted ChatGPT to fame.

Foundation models are critical for the future of Earth monitoring. To create models that function well across many domains, researchers must evaluate them on benchmarks composed of numerous datasets.

A recent ServiceNow research paper, GEO-Bench: Toward Foundation Models for Earth Monitoring, addressed this by proposing a multimodal geographic data benchmark that can help determine which models excel across a broad range of tasks.

Other benchmarks exist, but they target narrower problems. GEO-Bench is designed to be multimodal, spanning many sensors, regions, and tasks.

Foundation models encapsulate multimodal data streams through self-supervised training. The trained models can then be fine-tuned for a variety of climate-related remote sensing tasks.
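As a concrete illustration of this pretrain-then-fine-tune pattern, here is a minimal PyTorch sketch. The checkpoint, the 10-class head, and the linear-probing setup are illustrative assumptions, not the paper's exact protocol:

```python
import timm
import torch
from torch import nn

# Load a backbone pretrained on a large corpus. The checkpoint and the
# 10-class head are illustrative, not the paper's exact setup.
model = timm.create_model("resnet50", pretrained=True, num_classes=10)

# Linear probing: freeze everything except the classification head ("fc"
# in timm's ResNet), a cheap way to adapt a pretrained model.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of RGB chips.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))  # hypothetical land-cover classes
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```

Freezing the backbone and training only the head is one cheap adaptation strategy; full fine-tuning instead unfreezes all weights at a lower learning rate.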

The right data

The paper lays out potential sources of Earth observation data to train foundation models. Images from Earth monitoring differ from typical machine learning datasets in that they're usually taken from satellites overhead rather than from the ground. Satellites can gather large numbers of images in multiple spectral bands, with periodic revisits.

Satellites provide rich fodder for long-term datasets: petabytes of accessible satellite data contain images of Earth under various modalities, dating from the present day back to the 1960s. With GPS coordinates and timestamps, each acquisition can be linked to additional data sources, including weather conditions, semantic maps, and elevation.

Distilling this vast amount of information into pretrained models of various sizes offers the opportunity to repackage it and make it accessible to research labs, increasing performance on downstream tasks.

Data from satellites, such as Sentinel-2 and Landsat 8, provides more than just red, green, and blue (RGB) images—it offers images in multiple spectral bands with periodic revisits. That means there are four dimensions of structured data: longitude, latitude, wavelength, and time.
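One way to picture those four dimensions is as a data cube. The sketch below uses xarray, a common choice for labeled geospatial arrays; the shapes, dates, and coordinates are made up for illustration (band names loosely follow Sentinel-2):

```python
import numpy as np
import xarray as xr

# A toy Earth-observation cube: time x band x latitude x longitude.
data = np.random.rand(4, 3, 64, 64).astype("float32")

cube = xr.DataArray(
    data,
    dims=("time", "band", "lat", "lon"),
    coords={
        "time": np.array(
            ["2023-01-01", "2023-01-11", "2023-01-21", "2023-01-31"],
            dtype="datetime64[ns]",
        ),
        "band": ["B02", "B03", "B04"],  # blue, green, red
        "lat": np.linspace(45.0, 45.1, 64),
        "lon": np.linspace(5.0, 5.1, 64),
    },
)

# Slicing along any of the four axes is then straightforward.
red_over_time = cube.sel(band="B04")   # every revisit, red band only
composite = cube.mean(dim="time")      # per-band temporal composite
```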

Data from written sources, such as articles, could also play a role: combining geotagged text with satellite images could yield a deeper understanding of what the images contain.

We created a committee of six domain experts to analyze a wide range of potential datasets suitable for the benchmark. After reviewing each dataset's characteristics, we selected and modified 12 datasets that cover a diversity of sensors, resolutions, and geographical spans. For example, some sensors, such as Sentinel-1, use radar and can see through clouds.

It was also necessary to choose datasets with permissive licenses that allowed us to adapt them. The GEO-Bench benchmark encompasses multispectral, synthetic-aperture radar, hyperspectral, elevation, and cloud-probability modalities, with spatial resolutions ranging from 0.1 m to 30 m per pixel.

Recommendations for using the benchmark

We encourage users to report the results of their fine-tuned models and to tune hyperparameters with a budget of at most 16 trials per task. We also propose limiting image augmentations to 90-degree rotations and vertical and horizontal flips.
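Here is a minimal sketch of that protocol, assuming PyTorch image tensors in CHW layout. The search space and the train_and_evaluate placeholder are hypothetical stand-ins, not values from the paper:

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment(image: torch.Tensor) -> torch.Tensor:
    """Augmentations limited to the benchmark's recommendation:
    horizontal/vertical flips and multiples of 90-degree rotations."""
    if random.random() < 0.5:
        image = TF.hflip(image)
    if random.random() < 0.5:
        image = TF.vflip(image)
    k = random.randint(0, 3)  # rotate by 0, 90, 180, or 270 degrees
    return torch.rot90(image, k, dims=(-2, -1))

def train_and_evaluate(config: dict) -> float:
    """Placeholder: fine-tune with `config` and return a validation score."""
    return random.random()

# Random search capped at the 16-trial budget per task.
search_space = {"lr": [1e-4, 3e-4, 1e-3], "weight_decay": [0.0, 1e-4]}
best_score, best_config = float("-inf"), None
for _ in range(16):
    config = {name: random.choice(values) for name, values in search_space.items()}
    score = train_and_evaluate(config)
    if score > best_score:
        best_score, best_config = score, config
```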

Figure: Classification benchmark (RGB only). Normalized accuracies of various baselines (higher is better).

To gauge the performance of existing models, we reported results for 20 baselines. We found that SwinV2 performed best among the baselines on RGB satellite data.

Another model, ConvNeXt, beats SwinV2 when limited training data is available. ConvNeXt is a convolutional architecture that borrows design choices from transformer-based models, so its inductive biases differ from SwinV2's, and it generalizes better from fewer examples.

To leverage multispectral information, we evaluated models pretrained on Sentinel-2 imagery and found that a ResNet50 pretrained on Sentinel-2 with MoCo, a self-supervised contrastive method, increases performance across many tasks.
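For readers wondering how an RGB architecture ingests 13 spectral bands, one common adaptation is to widen the network's first convolution, sketched below in PyTorch. The initialization trick and the checkpoint filename are assumptions for illustration; the paper's exact recipe may differ:

```python
import torch
from torch import nn
from torchvision.models import resnet50, ResNet50_Weights

# Start from an ImageNet ResNet50 and widen its stem from 3 RGB channels
# to Sentinel-2's 13 spectral bands.
model = resnet50(weights=ResNet50_Weights.DEFAULT)
old_conv = model.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
model.conv1 = nn.Conv2d(
    13,
    old_conv.out_channels,
    kernel_size=old_conv.kernel_size,
    stride=old_conv.stride,
    padding=old_conv.padding,
    bias=False,
)

# One common initialization: average the RGB filters and tile the mean
# across all 13 bands so pretrained features are roughly preserved.
with torch.no_grad():
    mean_filters = old_conv.weight.mean(dim=1, keepdim=True)  # (64, 1, 7, 7)
    model.conv1.weight.copy_(mean_filters.repeat(1, 13, 1, 1))

# A MoCo-pretrained Sentinel-2 checkpoint would be loaded here instead;
# the filename is a placeholder, not a real artifact.
# model.load_state_dict(torch.load("moco_sentinel2_resnet50.pth"), strict=False)

x = torch.randn(2, 13, 224, 224)  # two 13-band Sentinel-2 chips
logits = model(x)                 # (2, 1000) with the default ImageNet head
```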

Future research

The current GEO-Bench has limitations; for example, it doesn't cover fine-tuning on temporal, weather, or text data. Subsequent versions could address these constraints.

Goals include collaborating with the wider community to increase adoption of the benchmark, so that, ideally, every new Earth-monitoring model from the machine learning world is evaluated against it.

We also propose a potential update to the benchmark, called GEO-Bench 2.0, which could use vision-language models to interpret the contents of images, for example, counting the number of cows in a particular satellite image. This kind of on-the-fly adaptation, which doesn't require engineers to manually label parts of images, could lead to more efficient workflows and more discoveries.
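As a taste of that kind of on-the-fly adaptation, here is a sketch of zero-shot tagging with an open vision-language model via open_clip. The model and checkpoint names, prompts, and blank stand-in image are illustrative assumptions; actual counting would additionally require a detection or density model, which this sketch does not attempt:

```python
import torch
import open_clip
from PIL import Image

# Zero-shot tagging: score a satellite chip against free-form text prompts
# without any task-specific labels or fine-tuning.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.new("RGB", (256, 256))).unsqueeze(0)  # stand-in chip
prompts = [
    "a satellite image of cattle grazing in a field",
    "a satellite image of an empty field",
    "a satellite image of a forest",
]
text = tokenizer(prompts)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(dict(zip(prompts, probs.squeeze(0).tolist())))
```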

Find out more about ServiceNow Research.

Resources

Paper: https://arxiv.org/abs/2306.03831

Code: https://github.com/ServiceNow/geo-bench