Haoxing is a Machine Learning Engineer working on WeatherMesh development. Her current focus is primarily on evaluations, and she's been having lots of fun "vibecoding" an evaluation dashboard web app. She is actively learning more about meteorology, both to support her work and to enhance her paragliding skills.
A few weeks ago, our commercial customers received the latest update to WeatherMesh, WeatherMesh-4 (WM-4). WM-4 is up to 42.4% more accurate than its predecessor, WM-3. Furthermore, the WM-4 ensemble outperforms the ECMWF ensemble (ENS) by up to 34.6%1, and retains an advantage of 4.6% at 15 days lead time. On the majority of the evaluated targets, WM-4 also surpasses GenCast, the much larger AI model developed by Google DeepMind and previously the leading model at 15-day lead time.
WeatherMesh-4 succeeds WeatherMesh-3, featured in our ICLR workshop Spotlight paper. We have also released the source code of WM-3 here. For WM-4, while we are not ready to share full architectural details at this time, we would like to highlight a few innovations.
Previous generations of WeatherMesh, like most weather models, physics- and AI-based alike, operated on a latitude-longitude grid, or a latlon grid. Latlon grids are intuitive and widely used because much of global weather data is already available in this format. However, they're an imperfect representation of Earth's spherical geometry. Longitude lines are widely spaced at the equator but become densely packed toward the poles. For instance, longitude lines 1 degree apart are approximately 112 km apart at the equator but only about 45 km apart at the Arctic Circle. At the poles, this density reaches an impractical extreme: an entire latitude ring of grid points, as many as span the whole equator, collapses onto a single location.
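As a quick sanity check on those numbers, the east-west span of one degree of longitude shrinks with the cosine of latitude. Below is a minimal sketch assuming a spherical Earth with a 6,371 km mean radius; the exact figures shift by a kilometer or so depending on which radius convention you use.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def lon_degree_spacing_km(lat_deg: float) -> float:
    """East-west distance covered by 1 degree of longitude at a given latitude."""
    return 2 * np.pi * EARTH_RADIUS_KM * np.cos(np.radians(lat_deg)) / 360.0

print(lon_degree_spacing_km(0.0))    # ~111 km at the equator
print(lon_degree_spacing_km(66.6))   # ~44 km at the Arctic Circle
print(lon_degree_spacing_km(89.9))   # well under 1 km near the pole
```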
In WeatherMesh-4, we opted for the HEALPix mesh, which represents the Earth’s spherical surface more faithfully while offering nice computational properties. HEALPix, an acronym for Hierarchical Equal Area isoLatitude Pixelation, was originally proposed in a 1999 preprint on astrophysics and developed primarily for astronomical applications requiring analysis of data distributed over a sphere, such as cosmic microwave background (CMB) analysis. It works nicely for data over the surface of the Earth as well, and we are not the first to apply it to ML-based numerical weather prediction (NWP). Its name highlights most of its useful properties: the mesh is hierarchical, so it can be refined or coarsened recursively; its pixels all cover equal areas; and pixel centers lie on rings of constant latitude.
For ML applications, HEALPix’s most attractive feature is that it provides a regular ordering of the mesh nodes, with each node having well-defined neighbors along 8 directions. This means that one can “flatten” the mesh and carry out computations on it efficiently2.
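As a small illustration of that regular structure, here is a sketch using healpy, the open-source reference HEALPix implementation. The library choice is ours, purely for illustration; this post does not say what WeatherMesh-4 uses internally.

```python
import healpy as hp

nside = 32                      # HEALPix resolution parameter
npix = hp.nside2npix(nside)     # 12 * nside**2 equal-area pixels

# Every pixel has up to 8 well-defined neighbors, returned in a fixed order
# (SW, W, NW, N, NE, E, SE, S); -1 marks the rare missing neighbor.
ipix = 123                      # any pixel index in [0, npix)
neighbors = hp.get_all_neighbours(nside, ipix, nest=True)
print(npix, neighbors)
```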
In WeatherMesh-4, we use a HEALPix mesh of depth 8 to represent the weather state at 0.25-degree resolution3, and a HEALPix mesh of depth 5 to represent the weather latent space, which corresponds to roughly 1.8-degree resolution.
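Assuming "depth" here refers to the HEALPix subdivision level, i.e. nside = 2^depth (our reading, which matches the resolutions quoted above), the pixel counts and approximate resolutions work out as follows:

```python
import numpy as np
import healpy as hp

for depth in (8, 5):
    nside = 2 ** depth                          # assumed mapping: nside = 2**depth
    npix = hp.nside2npix(nside)                 # 12 * nside**2 pixels
    resol = np.degrees(hp.nside2resol(nside))   # ~sqrt(pixel area), in degrees
    print(f"depth {depth}: nside={nside}, {npix} pixels, ~{resol:.2f} deg")

# depth 8: nside=256, 786432 pixels, ~0.23 deg
#   (compare ~1,038,240 points on a 0.25-degree latlon grid, 721 x 1440)
# depth 5: nside=32, 12288 pixels, ~1.83 deg
```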
AI weather models have long been criticized for producing excessively blurry forecasts. And the criticism is fair—when you place an AI-generated forecast next to a physics-based forecast, it will not be difficult to tell which is which. The AI forecasts often lack realism in their fine-scale details, such as the intricate, high-intensity patterns of convective precipitation, making them less practically useful compared to physics-based forecasts.
The reason behind this excessive blurring is simple: it is exactly what these models are trained to do. AI weather models are often trained to minimize mean squared error (MSE) between their predictions and ground truth data. When there is uncertainty in the outcome, the prediction that achieves the lowest possible MSE is the average of all the possible outcomes.
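This is the textbook property of squared error: for an uncertain outcome Y, the single prediction that minimizes the expected squared error is the mean of Y.

```latex
\operatorname*{arg\,min}_{\hat{y}} \ \mathbb{E}\!\left[(Y - \hat{y})^2\right]
  = \operatorname*{arg\,min}_{\hat{y}} \ \left( \mathbb{E}[Y^2] - 2\,\hat{y}\,\mathbb{E}[Y] + \hat{y}^2 \right)
  = \mathbb{E}[Y]
```

The expression in parentheses is a simple quadratic in the prediction, and its minimum sits exactly at the mean.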
To spell things out even more, consider the toy example below. Suppose the true weather state is a storm causing 2 mm of precipitation in a circular area, and the storm is equally likely to end up slightly to the left or slightly to the right. A model that commits to one of the two locations incurs an MSE loss of 0.498. However, predicting the average—a larger, blurrier patch of rain with half the intensity—yields a lower MSE loss of 0.249. From the perspective of achieving a low loss in training, the averaged, blurry forecast is objectively superior, even though it is less accurate in predicting the specific characteristics of a convective storm.
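The exact loss values depend on the 2-D geometry of the example, but the effect is easy to reproduce in a stripped-down 1-D sketch. The grid and storm sizes below are hypothetical, chosen only for illustration, which is why the losses come out at 0.50 and 0.25 rather than 0.498 and 0.249.

```python
import numpy as np

n = 1000                                      # 1-D domain with 1000 grid cells
left = np.zeros(n);  left[300:425] = 2.0      # storm (2 mm) shifted left
right = np.zeros(n); right[575:700] = 2.0     # storm (2 mm) shifted right
outcomes = [left, right]                      # two equally likely true states

def expected_mse(pred):
    """Expected MSE of a prediction over the equally likely outcomes."""
    return np.mean([np.mean((pred - truth) ** 2) for truth in outcomes])

sharp = left                     # commit to one plausible storm location
blurry = 0.5 * (left + right)    # average of outcomes: wider patch at 1 mm

print(expected_mse(sharp))   # 0.50 -- sharp, but badly wrong half the time
print(expected_mse(blurry))  # 0.25 -- blurry, and strictly better under MSE
```

However the sharp forecast is placed, it is penalized heavily whenever the storm lands elsewhere, while the halved-intensity average is never far wrong, so MSE training pushes the model toward the blur.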
This blurring has long been discussed as the number-one limitation of deterministic AI models, and considerable effort has been invested in addressing it. Initially, the WindBorne team shared this perspective. However, we have gradually come to the view that eliminating blur in deterministic AI models is a false premise—the “blur” is a feature, not a bug!
As illustrated above, deterministic AI models are trained to produce predictions representing the average outcome given specific initial conditions. In traditional NWP, understanding the average weather outcome is extremely valuable, and we obtain it by paying the heavy computational cost of running deterministic models on many perturbed initial conditions and then averaging their predictions: this is ensemble forecasting.
Although most AI models, including WeatherMesh, generate predictions with a single forward pass, it is a misnomer to call them “deterministic”. In reality, these "deterministic" AI models are emulating ensemble means, and they should be interpreted and evaluated accordingly. Nobody looks at an ensemble mean and criticizes it for being blurry or not capturing extremes!
The evolution of AI-based forecasts might be the inverse of the traditional NWP trajectory. In NWP, we initially developed deterministic models and later extended them to ensemble models to better represent uncertainty and variability. AI-based forecasts took the opposite path: they began by effectively predicting the ensemble mean, and people are now exploring methods (e.g. diffusion) to generate realistic, deterministic ensemble members.
We believe viewing deterministic AI models as inherently representing ensemble means is an important step forward. With architectural enhancements in WeatherMesh-4, we are better able to model ensemble-like uncertainty internally. Also as a result of this insight, we have chosen the ECMWF ensemble mean as the benchmark for our evaluations.
At the same time, we are tackling the problem of generating high-resolution, physically realistic forecasts. Thanks to the modular architecture of WeatherMesh, we are developing new components that share the same latent space as the main model but can generate high-resolution regional and point forecasts. We hope to share more updates on this research direction soon.
WeatherMesh-4 improves upon the accuracy of WeatherMesh-3, especially at longer lead times. At 15 days lead time, it’s up to 42.4% more accurate than WM-3, which already outperformed IFS HRES across the board. Impressively, at 15 days, WM-4 is 32.8% more accurate on 2-meter temperature, an important surface variable.
The “deterministic” WeatherMesh-4 outperforms the ECMWF ensemble mean on all 1,140 targets evaluated, across all surface and atmospheric variables, from 6 hours to 15 days of lead time. Previously, only Google DeepMind’s GenCast achieved comparable performance, and WM-4 surpasses GenCast’s ensemble mean on 67.4% of the 570 shared targets (GenCast only generates forecasts at 12-hour intervals). WM-4’s evaluation results for the year 2020 are available on WeatherBench2, placing it at the top of the 16 featured models. When initialized with ECMWF ensemble initial conditions, the WM-4 ensemble further improves upon the accuracy of the “deterministic” WM-4.
It appears to be standard practice in AI weather forecasting to validate AI models against ERA-5, while validating physics-based models against their own analyses. When comparing traditional physics-based models like GFS and IFS HRES against each other, validation against their respective analyses made sense, particularly since ERA-5 is produced with an older cycle of the IFS model. When AI models are in the picture, it is much less clear what the “fair” validation target is. Consistent with our stance in the WM-3 paper, we advocate validating all models—both physics-based and operational AI—against ERA-5, which is the best available ground truth approximation. Ultimately, forecasting accuracy should solely reflect how well models predict actual weather events; whether a prediction was close to its own system's analysis is irrelevant.
Even better than either analysis or reanalysis is validating forecasts against direct observations. As part of our effort to create AI-based forecasts that are maximally useful, we are developing an evaluations dashboard to easily visualize metrics and compare model performance. A database with millions of entries is growing daily, including metrics validated against observations both from ground weather stations and from WindBorne’s own Global Sounding Balloons (GSBs). Below is a preview of the “METAR table” from the dashboard, which lets the user explore performance metrics at over 4,000 tracked METAR stations. Data shown reflect performance on select dates in May 2025. We are working on releasing an external version so that you can play with the data yourself!