Japan's Weather AI Just Beat the Official Forecast — Using a Surprisingly Old-School Algorithm
A gradient-boosted tree model outperformed Japan's official post-processed weather guidance across temperature, wind, and rain — with 5,702 raw inputs trimmed t
A tree-based ML model beat Japan's official AI forecast at 18 sites across the country.
Weather forecasting has a dirty secret: even the best physics-based models, which simulate the atmosphere on supercomputers using equations describing fluid dynamics, thermodynamics, and radiation, still get things wrong in predictable, correctable ways. They can run systematically cool in mountain valleys, underestimate sea breezes on islands, or smooth away the sharp rainfall gradients that make the difference between a flooded rice paddy and a dry one. That's why meteorological agencies layer a second step on top: post-processing, which uses historical data to learn and correct those systematic errors. The Japan Meteorological Agency (JMA) has its own post-processing product, the MSM Guidance (MSMG), and it is already quite good.
Which makes the findings of Iwase and Takenawa (2026) all the more striking. Two researchers at Tokyo University of Marine Science and Technology built a machine learning post-processing model and found it outperformed the official JMA product — not at one lucky site, but generally across 18 locations spanning Japan's remarkable geographic diversity, from subtropical Okinawan plains to snowbound Hokkaido mountain stations, and for three different weather variables: temperature, wind speed, and precipitation.
The algorithm they used wasn't a transformer, a graph neural network, or anything from the frontier of deep learning. It was LightGBM, a gradient-boosted decision tree framework that Microsoft open-sourced in 2016 and described in a 2017 paper, and that remains a workhorse of competitive data science. That choice, and the principled way they fed it information, turns out to be the whole story.
The Science
The raw material is Japan's Mesoscale Model (MSM), a numerical weather prediction (NWP) system — the kind of physics simulation that runs on government supercomputers and outputs a four-dimensional picture of the future atmosphere. MSM runs at 5 km horizontal resolution, covering Japan and surrounding ocean areas, issuing forecasts every three hours out to 39 hours ahead. It outputs temperature, wind, moisture, cloud cover, solar radiation, and more, at both the surface and multiple atmospheric pressure levels.
Iwase and Takenawa selected 18 observation stations across Japan (shown above) — a deliberate mix of terrain types, because a good post-processing model needs to generalize across the messy geography of an archipelago. They trained their models to predict three-hourly accumulated precipitation, hourly temperature, and hourly wind speed. Observation data from 2019–2021 served as training data; 2022 for validation and hyperparameter tuning; 2023 as the held-out test set — a clean temporal split that avoids the data-leakage traps that plague some ML weather studies.
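A minimal sketch of that year-based split, assuming each sample carries a single timestamp. The article does not say whether the authors split on forecast initialization time or valid time, and the function name and toy records here are hypothetical.

```python
from datetime import datetime, timedelta

def split_by_year(records):
    """Assign each (timestamp, payload) record to train/val/test by calendar year."""
    splits = {"train": [], "val": [], "test": []}
    for ts, payload in records:
        if 2019 <= ts.year <= 2021:
            splits["train"].append((ts, payload))
        elif ts.year == 2022:
            splits["val"].append((ts, payload))
        elif ts.year == 2023:
            splits["test"].append((ts, payload))
    return splits

# Toy records: one per day from 2019-01-01 through 2023-12-31 (1,826 days)
start = datetime(2019, 1, 1)
records = [(start + timedelta(days=d), d) for d in range(1826)]
splits = split_by_year(records)
print({name: len(rows) for name, rows in splits.items()})
```

Because every training timestamp precedes every validation and test timestamp, nothing from the evaluation years can leak into the fitted model.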
The key architectural choice was spatial context. Rather than feeding the model only the MSM grid cell nearest to each station, they also fed in a surrounding grid of surface-level cells and a grid of pressure-level cells. This explodes the feature space: from 62 variables to 5,702. That's where feature selection comes in.
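To see why the feature count grows so quickly, here is a rough sketch of flattening a square neighborhood of grid cells into one feature vector. The grid size, the number of variables, and the neighborhood width are all hypothetical; the article does not give the actual grid dimensions the authors used.

```python
import numpy as np

def neighborhood_features(field, i, j, half_width):
    """Flatten the (2*half_width+1)^2 patch of grid cells centered on (i, j).

    field: array of shape (n_vars, ny, nx) holding one forecast time.
    Assumes the patch lies fully inside the grid (no edge handling).
    """
    patch = field[:, i - half_width:i + half_width + 1,
                     j - half_width:j + half_width + 1]
    return patch.reshape(-1)

rng = np.random.default_rng(0)
field = rng.normal(size=(4, 21, 21))   # 4 hypothetical variables on a 21x21 grid

nearest = neighborhood_features(field, 10, 10, half_width=0)   # nearest cell only
surround = neighborhood_features(field, 10, 10, half_width=2)  # 5x5 neighborhood
print(nearest.shape, surround.shape)  # (4,) (100,)
```

Every variable is repeated once per neighboring cell, so even a modest neighborhood multiplies the input width many times over, and many of those columns carry nearly the same information.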
Their approach used correlation analysis to prune redundant inputs, specifically the method from Bedi and Toshniwal (2018), which scans pairs of features and drops any new feature whose absolute Pearson correlation with an earlier-included feature exceeds a threshold of $\tau = 0.9$. Features from the nearest grid cell and calendar variables (lead time, month, hour) were exempted from pruning, protecting the most relevant local information. After this pass, three additional selection methods ranked remaining features by correlation with the target, mutual information, or LightGBM's own internal feature importance scores.
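The pruning pass described here can be sketched as a greedy scan. This is an illustration under stated assumptions, not the authors' code: the function name, the toy feature names, and the choice to scan protected features first are all hypothetical, and the sketch assumes no feature is constant (a constant column has no defined Pearson correlation).

```python
import numpy as np

def prune_correlated(X, names, protected, tau=0.9):
    """Greedy pairwise pruning: keep a feature only if its absolute Pearson
    correlation with every already-kept feature is <= tau.
    Protected features are always kept and are scanned first (an assumption)."""
    order = [k for k, n in enumerate(names) if n in protected] + \
            [k for k, n in enumerate(names) if n not in protected]
    corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-by-feature matrix
    kept = []
    for k in order:
        if names[k] in protected or all(corr[k, m] <= tau for m in kept):
            kept.append(k)
    return [names[k] for k in kept]

rng = np.random.default_rng(0)
n = 500
t_near = rng.normal(size=n)                  # nearest-cell value (protected)
t_east = t_near + 0.05 * rng.normal(size=n)  # neighboring cell, nearly identical
u_wind = rng.normal(size=n)                  # independent predictor

X = np.column_stack([t_east, u_wind, t_near])
names = ["t_east", "u_wind", "t_near"]
kept = prune_correlated(X, names, protected={"t_near"})
print(kept)  # ['t_near', 'u_wind']
```

The neighboring-cell feature is dropped because it is almost a copy of the protected nearest-cell feature, while the independent wind feature survives.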
The number of features dropped varied substantially by site, from 236 at Niijima (a small island where the atmosphere is relatively uniform) to 869 at Nobeyama (a high-altitude plateau where surrounding terrain creates a patchwork of correlated but redundant signals). See the table below for a sense of scale.
Features Dropped by Correlation Analysis, by Station
Number of features removed from the original 5,702-feature surrounding-grid input via pairwise correlation pruning (threshold τ = 0.9), for a selection of stations. Higher numbers indicate more redundancy in the spatial inputs.
| Station | Features dropped |
|---|---|
| Niijima | 236 |
| Saitama | 248 |
| Oma | 256 |
| Ishikari | 271 |
| Uchinoura | 296 |
| Sekigahara | 437 |
| Itoigawa | 516 |
| Kusatsu | 733 |
For comparison, the team also implemented a feedforward neural network and reproduced a CNN baseline following the approach of Kudo (2022), which had been specifically designed for MSM temperature post-processing in the Kanto region.
What They Found
The LightGBM model with surrounding-grid inputs and correlation-based feature selection was the strongest performer overall. Across most of the 18 stations and across most forecast lead times, it achieved lower RMSE (root mean squared error — the standard way to measure average forecast error, weighted to penalize large misses more heavily) than both the raw MSM output and the official MSMG product.
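For reference, the standard definition of RMSE over $N$ forecast and observation pairs, with $\hat{y}_i$ the forecast and $y_i$ the observation (the symbols are chosen here for illustration, not taken from the paper):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}
$$

As a small worked case, four forecasts with absolute errors of 1, 1, 1, and 5 have a mean absolute error of 2, but $\mathrm{RMSE} = \sqrt{(1 + 1 + 1 + 25)/4} = \sqrt{7} \approx 2.65$. The single large miss pulls the score up, which is the sense in which RMSE penalizes big errors more heavily.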
Training Dataset Size Across 18 Stations
Number of training samples (2019–2021) and test samples (2023) at each of the 18 stations. Sample counts reflect removal of missing observation records from the otherwise complete MSM input grids.
| Station | Sample count |
|---|---|
| Ashimine | 113,854 |
| Uchinoura | 113,854 |
| Minamiaso | 113,828 |
| Imabari | 113,802 |
| Sakai | 113,776 |
| Sekigahara | 113,568 |
| Nobeyama | 113,022 |
| Niijima | 113,477 |
The comparison with MSMG is the notable one. MSMG is not a naive benchmark: it represents the JMA's own investment in post-processing, using Kalman filtering, frequency bias correction, and neural networks. Beating it across diverse terrain with a relatively lean modeling approach suggests that incorporating rich spatial context (the surrounding grid) and eliminating redundancy (feature selection) can extract signal that the official product leaves on the table.
The LightGBM model also outperformed the reproduced CNN baseline for temperature — the variable the CNN was specifically designed for. This aligns with a growing body of evidence that tree-based models frequently outperform deep learning on structured, tabular weather data, even when the deep learning model was purpose-built for the task (Shwartz-Ziv and Armon, 2022; Grinsztajn et al., 2022; Hieta and Partio, 2025).
Precipitation, as always, was the hardest problem. Rain has what statisticians call a highly skewed distribution: in any given three-hour window, the most common outcome is zero rain. When you train a model to minimize squared error on this kind of data, it gets pulled toward predicting mild, average-ish values — because the thousands of zero-rain observations dominate the gradient. It rarely learns to shout "heavy rain coming."
To address this, Iwase and Takenawa tested two additional strategies. The first was a Tweedie loss function, a formulation designed for non-negative, zero-inflated data such as insurance claims or, in this case, rainfall. For variance powers between 1 and 2, the Tweedie distribution is a compound Poisson-gamma distribution: it puts a point mass on exactly zero and a continuous density on positive amounts, so a single model can account for both "will it rain?" and "how much?" The second strategy was event-weighted training, which artificially increases the penalty for mispredicting high-rainfall events, forcing the model to care more about the tail of the distribution.
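Both ideas can be illustrated with a few lines of NumPy. This is a sketch, not the authors' configuration: the variance power of 1.5, the 10 mm threshold, and the weight of 5 are hypothetical values picked for the example. In practice LightGBM exposes a `tweedie` objective with a `tweedie_variance_power` parameter and accepts per-sample weights at training time, but the article does not report which settings were used.

```python
import numpy as np

def tweedie_deviance(y, mu, p=1.5):
    """Mean Tweedie unit deviance for 1 < p < 2, with y >= 0 and mu > 0."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    dev = 2.0 * (np.power(y, 2 - p) / ((1 - p) * (2 - p))
                 - y * np.power(mu, 1 - p) / (1 - p)
                 + np.power(mu, 2 - p) / (2 - p))
    return float(dev.mean())

def event_weights(y, threshold=10.0, heavy_weight=5.0):
    """Per-sample weights that upweight observations at or above a threshold."""
    y = np.asarray(y, dtype=float)
    return np.where(y >= threshold, heavy_weight, 1.0)

rain = np.array([0.0, 0.0, 0.0, 0.0, 2.0, 30.0])  # toy 3-hour totals in mm

# Near-perfect predictions (mu must stay strictly positive, hence the floor)
perfect = tweedie_deviance(rain, np.maximum(rain, 1e-9))
# Predicting the climatological mean everywhere
flat = tweedie_deviance(rain, np.full_like(rain, rain.mean()))
weights = event_weights(rain)

print(perfect < flat)   # the always-average forecast scores worse
print(weights)
```

The deviance stays finite for exact zeros, which is what makes it usable on dry intervals, and the weight vector makes the single heavy-rain sample count as much as five ordinary ones.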
Both approaches improved event-based metrics, specifically the Threat Score ($\mathrm{TS}$), probability of detection ($\mathrm{POD}$), and false alarm rate ($\mathrm{FAR}$), particularly at higher rainfall thresholds. But the gains were site-dependent, and overall precipitation accuracy, as measured by RMSE, still fell slightly short of MSMG. Precipitation prediction is a harder, more nonlinear problem than temperature or wind, and the gap here likely reflects both the intrinsic difficulty and potentially MSMG's additional domain-specific tuning for Japanese rainfall patterns.
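These three scores come from a simple contingency count of hits, misses, and false alarms at a chosen rainfall threshold. The sketch below uses the conventional definitions and assumes FAR is computed as false alarms divided by all forecast events (often called the false alarm ratio); the function name and toy data are hypothetical.

```python
def event_scores(forecast, observed, threshold):
    """Threat Score, probability of detection, and false alarm ratio
    for the event 'value >= threshold'."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        f_event, o_event = f >= threshold, o >= threshold
        if f_event and o_event:
            hits += 1
        elif o_event:
            misses += 1
        elif f_event:
            false_alarms += 1

    def ratio(num, den):
        # Undefined when the event never occurs in the relevant category
        return num / den if den else float("nan")

    return {
        "TS": ratio(hits, hits + misses + false_alarms),
        "POD": ratio(hits, hits + misses),
        "FAR": ratio(false_alarms, hits + false_alarms),
    }

# Toy 3-hour rainfall totals in mm: 2 hits, 1 miss, 1 false alarm at 10 mm
observed = [0.0, 12.0, 25.0, 0.5, 15.0, 0.0]
forecast = [0.0, 11.0, 4.0, 10.5, 20.0, 0.0]
scores = event_scores(forecast, observed, threshold=10.0)
print(scores)
```

Correct forecasts of "no event" do not enter any of the three scores, which is why they stay informative at high thresholds where dry or light-rain intervals vastly outnumber heavy ones.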
Feature Space: Original vs. After Pruning (Selected Stations)
The surrounding-grid configuration starts with 5,702 features. Correlation-based pruning removes redundant spatial predictors, with the reduction varying substantially by terrain type.
| Station (terrain) | Features remaining after pruning |
|---|---|
| Niijima (island) | 5,466 |
| Saitama (plain) | 5,454 |
| Itoigawa (coastal) | 5,186 |
| Kusatsu (mountain) | 4,969 |
| Nobeyama (plateau) | 4,833 |
Why This Changes Things
The implications here run in several directions at once.
For forecasting practice: Post-processing is already standard. What this paper argues is that the specific design of the post-processing model — particularly the use of spatial context from surrounding grid cells, and principled redundancy removal — matters substantially. A model that sees only the nearest grid point is forecasting with tunnel vision. Weather systems are spatial phenomena. A cold front approaching from the northwest looks different three grid cells away than it does at the target point, and that upstream information contains predictive signal.
This is conceptually similar to why weather forecasters look at synoptic maps, not just the reading at the station they care about. Iwase and Takenawa have operationalized that spatial awareness into the feature engineering step rather than baking it into the model architecture.
For the deep learning debate: The result that LightGBM outperforms both a feedforward neural network and a purpose-built CNN on structured weather data is not, in 2026, a surprise — but it keeps needing to be said. Deep learning is transformative in domains with natural spatial or sequential structure that networks can learn from raw pixels or tokens. Weather model output, post-processed into feature tables, doesn't always have that character. Gradient-boosted trees handle mixed feature types, missing data, and non-linear interactions efficiently, with less hyperparameter sensitivity and far lower training cost. The message isn't that CNNs have no place in weather AI — modern end-to-end models like Pangu-Weather or GraphCast operate quite differently — but rather that for the specific task of post-processing gridded NWP output into station-level forecasts, the unglamorous tree model remains formidable.
For Japan specifically: Japan is an exceptionally challenging weather prediction environment. The archipelago stretches from subtropical in the south to subarctic in the north, rises sharply from sea level to 3,776-metre Mount Fuji within tens of kilometers, and sits at the intersection of Pacific and continental air masses. The 18 stations in this study weren't chosen at random — they include seaside plains, high-altitude plateaus at 1,350 m (Nobeyama) and 1,223 m (Kusatsu), small islands (Niijima), and remote northern towns (Kamikawa in Hokkaido). The fact that the LightGBM model generalized reasonably well across this diversity, rather than just excelling at easy lowland sites, speaks to the robustness of the approach.
For the rainfall problem: The Tweedie and event-weighted experiments deserve particular attention from practitioners. Improving average error (RMSE) on precipitation is a different objective than improving detection of heavy events. A city emergency manager deciding whether to issue a flood warning doesn't care about average error; they care about whether the model says "50+ mm in 3 hours" when that's actually coming. The finding that modified loss functions shift the model's performance profile toward event detection, even when they don't dramatically improve mean error, is exactly the kind of nuance that gets lost when research reports only RMSE. The authors' transparent accounting of this tradeoff (gains in $\mathrm{TS}$ and $\mathrm{POD}$ at high thresholds, modest RMSE improvement) is genuinely useful for downstream decision-making.
What's Next
Several open questions follow naturally from this work.
The study trains a separate model for each station. That's computationally tractable with 18 sites, but wouldn't scale to thousands of locations without rethinking the architecture. Transfer learning approaches — pre-training on many sites and fine-tuning on new ones — could extend this framework without retraining from scratch each time.
The feature selection methodology, while principled, is also relatively simple: pairwise correlation thresholding followed by ranking-based filters. More sophisticated approaches — recursive feature elimination, Shapley-value-based importance, or learned feature masks — might squeeze out additional signal, especially in mountainous regions where the spatial structure of relevant predictors is complex.
The CNN comparison was limited to one prior architecture for temperature only, and the authors are transparent about this scope. The landscape of deep learning for weather post-processing is moving fast. Models with attention mechanisms, or graph neural networks that explicitly represent station networks, may tell a different story. This paper is not an argument that LightGBM will always win — it's a demonstration of how far you can get with well-engineered inputs and an efficient classical model.
Perhaps most importantly, the precipitation problem remains unsolved in a satisfying way. The Tweedie and weighting approaches improved event detection but didn't fully close the gap with MSMG on overall accuracy. Probabilistic forecasting — predicting a distribution of outcomes rather than a single value — might be a more natural fit for rainfall than any deterministic regression approach. Japan experiences some of the world's most intense rainfall events, including Baiu front precipitation and typhoon-driven downpours. Better post-processing of heavy rain forecasts isn't an academic exercise: it translates directly into earlier warnings, better evacuation decisions, and lives saved.
What Iwase and Takenawa (2026) have shown, across 18 sites and three weather variables, is that thoughtful feature engineering — knowing what information to include and what to throw away — can outperform a government agency's carefully maintained system. The gap isn't magic. It's spatial context and correlation pruning, applied consistently. That's a recipe others can follow, adapt, and improve upon. In a field where forecast error costs are measured in lives and infrastructure, every recovered bit of predictive signal matters.
Among the LightGBM-based models, those using surrounding-grid information with correlation-based feature selection showed the lowest RMSE across many locations and forecast lead times.