How We Built Our Record-Breaking AI Model, WeatherMesh

2.15.24

Yesterday WindBorne revealed the scale of its forecasting accuracy for the first time: WindBorne’s WeatherMesh is the world’s most accurate global forecast model, delivering significantly lower errors than either traditional physics-based NWP or peers in the AI weather prediction space. Here we discuss the model details that enabled this improvement and take a closer look at our model’s performance across a range of metrics. Key takeaways include:

  • Higher performance than IFS ENS mean when running in ensemble mode
  • Higher performance than IFS HRES when running in deterministic mode
  • Higher performance & better time resolution than leading AI models such as DeepMind’s GraphCast — while also using significantly less compute
  • Designed from the start to be used as an operational forecast; WindBorne already uses it live for controlling its global balloon constellation

While we go deeper into the numbers in the results section, to frame the magnitude of WindBorne WeatherMesh’s improvement: we achieved a remarkable 14% improvement in 500 hPa geopotential RMSE (7.15 m²/s²) at a 24h lead time relative to the IFS ENS mean. For context, this is more than 3x the improvement operational GraphCast made relative to IFS HRES at the same lead time. WindBorne’s forecast skill remains superior across all other forecast lead times and variables (see the figure for 2m temperature, which outperforms by an even greater margin); we cite 500 hPa geopotential because it is a key diagnostic variable used across the weather literature. Geopotential is in effect a representation of pressure, which in turn is a good proxy for the location and movement of weather systems. Indeed, the fact that WindBorne’s WeatherMesh is so accurate in predicting geopotential is reflected in case studies such as predicting Hurricane Ian, in which we delivered far more accurate ground track forecasts (hundreds of kilometers more accurate, even many days out) than even dedicated hurricane models.

500 hPa geopotential height
We focus on 500 hPa geopotential for our headline metrics, but forecasts of 2 meter temperatures surpass even that. Comparisons to ECMWF ENS were made running WeatherMesh in ensemble mode; comparisons to GraphCast were made running WeatherMesh in deterministic mode.

Model

At its core, WeatherMesh is a roughly 1-billion-parameter model based on transformers,1 the same innovation that powers ChatGPT. While Pangu-Weather explored one approach to applying transformers to weather forecasting, WindBorne has built on the core transformer concept in novel ways, making numerous key advancements that enable significantly higher performance. For competitive reasons we of course can’t share everything, but here we outline many of the key details.

Inputs and Outputs

WeatherMesh currently predicts surface temperature, pressure, winds, precipitation, and radiation, as well as geopotential, temperature, winds, and moisture at 28 pressure levels. The model’s inputs and outputs are on a 0.25 degree resolution grid for compatibility with ERA-5. The model produces hourly outputs.

In total, this makes the input and output tensors roughly 720 x 1440 x 150. These are stored and operated on as normalized fp16; as such, each tensor representing the state of the weather over the globe is ~300MB on disk. We call one of these tensors a “weather instance.” In training, we build samples based on sequences of weather instances for the model to predict.
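As a quick sanity check on that figure, the ~300MB size follows directly from the tensor shape and fp16 storage. The grid and channel counts below are the rough numbers quoted above, not precise model internals:

```python
import numpy as np

# One "weather instance": latitude x longitude x channels on a 0.25-degree grid.
LAT, LON, CHANNELS = 720, 1440, 150

instance = np.zeros((LAT, LON, CHANNELS), dtype=np.float16)  # 2 bytes per value
print(f"{instance.nbytes / 1e6:.0f} MB per instance")        # ~311 MB, i.e. roughly 300MB
```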

Encoder-Processor-Decoder with a Recurrent Processor

The high-level structure of WeatherMesh is based on an Encoder-Processor-Decoder structure. The encoder converts from physical weather space into a latent space, and the decoder converts back. The processor operates solely in the latent space and can be chained multiple times to achieve a forecast of the desired length.

Model Diagram

WeatherMesh currently uses two independently trained processors, one with a one-hour timestep and one with a six-hour timestep. With this combination, forecasts of arbitrary length at one-hour resolution can be produced, as in the sketch below. For operational use navigating our balloon constellation, we chain these to produce forecasts out to one week.
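To make the structure concrete, here is a minimal sketch of how an encoder-processor-decoder rollout with two processor timesteps might be wired up. The module names, placeholder conv layers, and latent dimensions are illustrative rather than WindBorne's actual architecture; the point is that encoding and decoding each happen once, while the 6-hour and 1-hour processors are chained purely in latent space to reach an arbitrary hourly lead time.

```python
import torch
import torch.nn as nn

class WeatherMeshSketch(nn.Module):
    """Toy encoder-processor-decoder skeleton. Real components would be
    vision-transformer based; these placeholder conv layers just show the
    data flow."""

    def __init__(self, channels: int = 150, latent_dim: int = 256):
        super().__init__()
        # Encoder: physical fields -> latent space (downsamples 4x).
        self.encoder = nn.Conv2d(channels, latent_dim, kernel_size=4, stride=4)
        # Two independently trained processors, each advancing the latent
        # state by a fixed timestep.
        self.processor_1h = nn.Conv2d(latent_dim, latent_dim, 3, padding=1)
        self.processor_6h = nn.Conv2d(latent_dim, latent_dim, 3, padding=1)
        # Decoder: latent space -> physical fields (upsamples 4x).
        self.decoder = nn.ConvTranspose2d(latent_dim, channels, kernel_size=4, stride=4)

    def forward(self, x: torch.Tensor, lead_time_hours: int) -> torch.Tensor:
        z = self.encoder(x)  # encode once
        # Chain processors entirely in latent space: as many 6h steps as fit,
        # then 1h steps for the remainder (e.g. 39h -> six 6h steps + three 1h steps).
        six_hour_steps, one_hour_steps = divmod(lead_time_hours, 6)
        for _ in range(six_hour_steps):
            z = self.processor_6h(z)
        for _ in range(one_hour_steps):
            z = self.processor_1h(z)
        return self.decoder(z)  # decode once at the target lead time
```

Note that no decoding happens between processor steps; the latent state is handed forward directly, which is the latent-space recurrence discussed below.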

The existing timesteps were determined by our desire to start with 1-3 day forecasts with high time resolution to meet initial customer & balloon navigation needs. That said, there is significant potential for further improvement to longer-term forecasting should we opt to train a processor with a longer timestep.

A key difference to highlight between WeatherMesh and other AI-based weather models is that our processor operates entirely in latent space. This includes recurrence: successive processor steps pass the latent state forward directly, without decoding back to physical space in between. This architecture does something a one-shot system wouldn’t: it asserts an inductive bias that weather is a dynamical system. Because weather evolves dynamically, we believe letting the model evolve naturally in its latent space allows it to avoid errors that would otherwise be introduced in repeated encoding and decoding steps.

Adapters

From the start, WindBorne’s WeatherMesh was intended to be used in live, real-world scenarios. Indeed, we have a strong internal need for more accurate forecasts to fly our Global Sounding Balloons, which collect data throughout the atmosphere, including in historically undersampled regions such as over the ocean. As such, it was a strict requirement that WeatherMesh be capable of making a forecast from the input of a live analysis like HRES or GFS while maintaining high performance. This posed a challenge: the primary training data is ERA-5 reanalysis, and other AI models have consistently seen a degradation in performance when used operationally with a live analysis. To reduce this effect and produce a better operational forecast, WeatherMesh has adapters trained to ingest HRES and GFS, which then feed into the rest of the model.

The adapters are based on a convolutional U-Net structure, which is well suited to the adapter task, a simpler one than that of the core WeatherMesh vision transformer. We did try a transformer here as well, but the limited training data available for this specific task meant the U-Net achieved better results.
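For readers unfamiliar with the pattern, a convolutional U-Net-style adapter looks roughly like the sketch below. The layer sizes, channel counts, and single down/up level are hypothetical placeholders (a real adapter would likely be deeper); only the overall shape (downsample, process, upsample, combine with a skip connection) is the point.

```python
import torch
import torch.nn as nn

class AnalysisAdapter(nn.Module):
    """Minimal U-Net-style adapter: maps a live analysis (e.g. HRES or GFS
    fields regridded to the model grid) into the representation the core
    model was trained on. Hypothetical layer sizes, for illustration only."""

    def __init__(self, channels: int = 150, hidden: int = 64):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )
        self.bottleneck = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.GELU(),
        )
        self.up = nn.ConvTranspose2d(hidden, channels, kernel_size=2, stride=2)

    def forward(self, analysis: torch.Tensor) -> torch.Tensor:
        skip = analysis                  # full-resolution skip connection
        x = self.down(analysis)          # downsample to half resolution
        x = self.bottleneck(x)           # process at coarse resolution
        x = self.up(x)                   # upsample back to the input grid
        return x + skip                  # residual correction of the analysis
```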

Ensemble Forecasting

Thanks to adapters, WeatherMesh can be initialized from a range or even a combination of live analyses, including GFS, IFS HRES, and IFS ENS. This in turn means that WeatherMesh can be run in either deterministic mode or ensemble mode. For controlling our balloon constellation, we use ensemble forecasts with 51 members, as balloon navigation is ultimately probabilistic and the ensemble mean provides a lower RMSE.

Additionally, the ability to use a combination of ensembles means that WeatherMesh can produce what we have termed “compound ensembles”. For example, our model can combine all 31 GEFS members with all 51 ENS members to create a 31 * 51 = 1581 member ensemble. Running this many members with physics-based NWP would be computationally challenging, to say the least; with each WeatherMesh inference step taking ~3.5 seconds on even a consumer-grade GPU, massive ensemble counts become tractable.
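To put that tractability in rough numbers: the member counts and per-step time come from above, while the step count per forecast is a hypothetical choice purely for illustration.

```python
gefs_members, ens_members = 31, 51
compound_members = gefs_members * ens_members          # 31 * 51 = 1581

seconds_per_step = 3.5   # per inference step on a consumer-grade GPU (from above)
steps = 28               # hypothetical: a 7-day forecast as 28 six-hour steps

gpu_hours = compound_members * steps * seconds_per_step / 3600
print(f"{compound_members} members, full week: ~{gpu_hours:.0f} GPU-hours")  # ~43
```

Under these assumptions, a full week of compound-ensemble forecasts costs on the order of tens of GPU-hours, which a small cluster can finish in a few wall-clock hours.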

Training

Models were trained on an on-prem cluster of 33 RTX 4090s. The decision to build out our own cluster was motivated by the large size of the datasets: weather data requires hundreds of terabytes, and the cost of cloud solutions with high bandwidth between GPUs and storage of that size is not only high but also persistent (i.e., there are no on-demand solutions that you can turn on and off). So as not to commit many tens of thousands of dollars per month, we opted instead to take advantage of our strengths as a hardware company and set up our own machines.

Training took approximately 15X less compute than GraphCast, and roughly 10X less compute than Pangu-Weather. We believe that WeatherMesh’s ability to achieve such performance with a fraction of the compute validates the strength of our architecture decisions.

Training Requirements
WeatherMesh is significantly cheaper to train than peer AI models. Compute requirements were estimated based on theoretical peak FLOPs at fp16 for each of the training configurations.

The hardware cost of our compute infrastructure was approximately $100k. Had we run all of our experiments in the cloud instead, across our R&D rather than just our training run, it easily would have cost four times as much.

By far the biggest limitation of our current training stack is that we currently do autoregressive training out to only 24 hours, due to the limited VRAM per GPU (each has just 24 GB). This has forced us to come up with a number of clever VRAM-saving tricks to fit a high-resolution vision transformer training stack into an amount of VRAM it had no right to fit in. The fact that we’re getting this performance without longer autoregressive training is quite impressive, and frankly a little shocking. Extending autoregressive training should further improve performance in longer-lead forecasting.
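We aren't detailing our specific VRAM-saving tricks here, but to illustrate the general class of technique, standard activation (gradient) checkpointing trades recomputation for memory during autoregressive rollout training. This is a generic sketch of that common approach, not a description of WindBorne's actual method; the model call and loss are placeholders.

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def autoregressive_loss(model, x0, targets):
    """Rollout training where each step's activations are recomputed during
    the backward pass instead of being held in VRAM for the whole rollout."""
    state, loss = x0, 0.0
    for target in targets:  # e.g. 24 hourly targets for a 24h rollout
        state = checkpoint(model, state, use_reentrant=False)
        loss = loss + F.mse_loss(state, target)
    return loss / len(targets)
```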

Indeed, increasing VRAM to unlock more autoregressive training is one of the biggest motivators to augment our compute infrastructure. While on-prem has been great for cost-effectively experimenting, we’ve always had plans to move some of the larger runs to the cloud for further scaling as the project matures. Thanks to the performance of our on-prem compute, these plans have been repeatedly pushed back (we were originally considering doing so several months ago); however, we’ve reached the point where it’s become a priority. We’re eager to see performance increase on long-lead time forecasts once we have a full 80GB of VRAM per GPU to work with!

Other Tricks

Generally, in the development of our model and training stack, we’ve leveraged our comprehensive understanding of conventional weather models, an understanding we wouldn’t have were we solely AI researchers. This manifests not in the model architecture itself,2 but in everything else. Our deep knowledge of NWP has let us avoid common pitfalls numerous times when designing our systems.

We’ve also made advancements on the AI side of things, of course. While we aren’t yet revealing all the techniques we use under the hood, we did make a number of key advancements that we have yet to see in the literature. To that end, if you specialize in vision transformers and are reading this, please reach out: while we aren’t yet sharing these techniques openly on the internet, the same doesn’t apply to conversations with individuals. If our techniques truly haven’t been used for vision transformers before, we may be interested in publishing (and already have a witty name picked out!).

One such advancement relates to significantly reducing the VRAM requirements of vision transformers, allowing us to scale them much further while still using GPUs with very limited VRAM. When we move our training runs to the cloud, this will enable scaling the models significantly further, as it’s a key ingredient for unlocking longer autoregressive training. This in turn is a key step on our roadmap to further improving performance, particularly at long lead times.

Results

Across all variables, altitudes, and lead times tested, WindBorne WeatherMesh significantly outperforms ECMWF. WeatherMesh also significantly outperforms GraphCast at the time horizons for which WeatherMesh was designed. As noted in the model section, WeatherMesh has not yet been trained for longer time horizons, yet despite this temporary limitation, performance remains strong even at longer lead times.

In the following charts & tables, the units are percent improvement in RMSE: an improvement of 100% would represent a perfect forecast. Because the metric is defined as improvement, higher is better. Comparisons to ECMWF were done running WeatherMesh in ensemble mode with 51 members. While comparing WeatherMesh in ensemble mode to GraphCast would likely further strengthen WindBorne’s RMSE advantage, that comparison isn’t as apples-to-apples, so those comparisons were done running WeatherMesh in deterministic mode. All evaluations were done at 0.25 degree resolution.
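For clarity, the percent-improvement figures follow the standard relative-RMSE convention described above; the numbers in this snippet are made up purely to show the arithmetic.

```python
def pct_improvement(rmse_model: float, rmse_baseline: float) -> float:
    """Percent improvement in RMSE relative to a baseline; 100% would be a
    perfect (zero-error) forecast, and higher is better."""
    return 100.0 * (rmse_baseline - rmse_model) / rmse_baseline

# Hypothetical example: model RMSE 43 vs. baseline RMSE 50 -> 14% improvement.
print(pct_improvement(43.0, 50.0))  # 14.0
```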

Improvement to 500 hPa geopotential height
Improvement to mean sea level pressure
Improvement to 850 hPa temperature
Improvement to 2m temperature

Quite frankly, we were stunned by the magnitude of these improvements — especially knowing how many areas of our model can be further advanced with minimal work, e.g. extending autoregressive training. We’re excited to see what happens next.






1 We use Vision Transformers, first explored in 2020 by Dosovitskiy et al. for applying transformers to image recognition tasks; in particular, we build on Swin Transformers.

2 As in other applications of AI, baking domain knowledge into the model architecture is a common trap for smart researchers and is mostly a fool’s errand. Instead, we design the model for scalability, give it just the right amount of inductive bias, and then let it learn.