ClimX: Extreme-aware climate model emulation

What is ClimX?

A benchmark built for the question that matters: how will extremes change?

Motivation

Earth System Models (ESMs) are our best tools to study climate futures, but they are computationally expensive. This limits how densely we can explore uncertainty (scenarios, initial conditions, and model structure) and makes it hard to answer policy-relevant questions about rare but high-impact events.

Climate emulators are lightweight surrogates that approximate ESM outputs, enabling rapid experimentation and risk assessment.

The task

Build a model that predicts daily 2D maps of climate variables at NorESM2-MM resolution, driven by forcing trajectories and, optionally, past climate state.

Formulation

$$ \begin{aligned} x_t = g(&f_t, f_{t-1}, f_{t-2}, \dots, f_{t-\tau}, \\ &x_{t-1}, x_{t-2}, \dots, x_{t-\tau}) \end{aligned} $$

$x_t$: climate state at time $t$; $f_t$: external forcing at time $t$; $g$: the emulator; $\tau$: the number of past time steps the emulator conditions on.
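
The formulation above can be read as an autoregressive rollout. The following is a minimal sketch, not the organizers' reference code: the function name `rollout`, the array layouts, and the toy emulator `toy_g` are assumptions made purely for illustration.

```python
import numpy as np

def rollout(g, forcings, init_states, tau):
    """Apply x_t = g(f_t..f_{t-tau}, x_{t-1}..x_{t-tau}) step by step.

    forcings:    array (T, F); forcings[t] is f_t (hypothetical layout)
    init_states: array (tau, ...) holding the first tau states
    Returns an array (T, ...); the first tau entries are the initial states.
    """
    forcings = np.asarray(forcings, dtype=float)
    states = [np.asarray(s, dtype=float) for s in init_states]
    if len(states) != tau:
        raise ValueError("need exactly tau initial states")
    for t in range(tau, len(forcings)):
        f_window = forcings[t - tau : t + 1][::-1]  # f_t, f_{t-1}, ..., f_{t-tau}
        x_window = np.stack(states[-tau:][::-1])    # x_{t-1}, ..., x_{t-tau}
        states.append(g(f_window, x_window))
    return np.stack(states)

def toy_g(f_window, x_window):
    # Toy stand-in for a trained emulator: damped persistence plus current forcing.
    return 0.5 * x_window[0] + f_window[0].mean()

out = rollout(toy_g, np.ones((5, 1)), np.zeros((1, 2, 3)), tau=1)
print(out.shape)  # (5, 2, 3)
```

A forcing-only emulator is the special case where `g` ignores `x_window`, which avoids error accumulation across the rollout at the cost of discarding state memory.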

Training: historical (1850–2014) + SSP1-2.6, SSP3-7.0, SSP5-8.5 (2015–2100). Testing: held-out SSP2-4.5 (2015–2100).

Data and access

Two-tier distribution: full-resolution training on Hugging Face, lightweight prototyping on Kaggle

Both Kaggle competitions, the deterministic main track and the separate UQ track, use the same benchmark data.

Hugging Face (full)

172GB, full resolution (NetCDF)

Best for: full-resolution training and final model development
Format: Zarr (streamable / chunked)
Includes: historical + training SSP targets and forcings; SSP2-4.5 test forcings (no targets)

Historical: (lat: 192, lon: 288, time: 60224)

Projections: (lat: 192, lon: 288, time: 31389)

Open dataset page

Kaggle (lite)

800 MB, 16× spatially coarsened (lite, for debugging)

Best for: fast prototyping and validating end-to-end pipelines
Format: competition “Data” bundle (lightweight exports)
Same task: same variables and split logic, at reduced spatial resolution
Competitions: deterministic submissions go to the main track, probabilistic submissions go to the UQ track

Historical: (lat: 12, lon: 18, time: 60224)

Projections: (lat: 12, lon: 18, time: 31389)
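
The lite grid shapes follow from the full grid by a factor of 16 per axis (192/16 = 12, 288/16 = 18). As a rough sketch of what such a coarsening could look like, the snippet below uses an unweighted block mean. That choice is an assumption: the source does not state how the lite data were produced, and `coarsen` is a hypothetical helper, not part of the starter kit.

```python
import numpy as np

def coarsen(field, factor=16):
    """Block-average the trailing (lat, lon) axes by `factor`.

    Assumption: plain unweighted block mean. An area-weighted (cos-latitude)
    mean would be a reasonable alternative.
    """
    field = np.asarray(field, dtype=float)
    *lead, nlat, nlon = field.shape
    if nlat % factor or nlon % factor:
        raise ValueError("grid size must be divisible by factor")
    blocks = field.reshape(*lead, nlat // factor, factor, nlon // factor, factor)
    return blocks.mean(axis=(-3, -1))  # average within each factor x factor block

full = np.arange(192 * 288, dtype=float).reshape(1, 192, 288)
lite = coarsen(full)
print(lite.shape)  # (1, 12, 18)
```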

Open main-track data · Open UQ track

Inputs

Forcing variables include greenhouse gases (global) and aerosols (spatial). Aerosol inputs are temporally sparse for some scenarios, and are interpolated to monthly values.
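
To make the interpolation step concrete, here is a small sketch that maps temporally sparse aerosol fields onto a monthly axis. Linear interpolation is an assumption (the text does not name the method), and `to_monthly` and its arguments are hypothetical names.

```python
import numpy as np

def to_monthly(years, values, start_year, end_year):
    """Linearly interpolate fields given at sparse years onto monthly steps.

    years:  increasing 1D sequence of time stamps (in years)
    values: array (len(years), ...) with one field per time stamp
    Returns (t_month, monthly) with monthly.shape == (n_months, ...).
    """
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    n_months = (end_year - start_year) * 12 + 1
    t_month = start_year + np.arange(n_months) / 12.0
    flat = values.reshape(len(years), -1)  # one column per grid cell
    cols = [np.interp(t_month, years, col) for col in flat.T]
    monthly = np.stack(cols, axis=1).reshape(n_months, *values.shape[1:])
    return t_month, monthly

sparse = np.stack([np.zeros((2, 2)), np.full((2, 2), 10.0)])  # fields at 2015 and 2020
t_month, monthly = to_monthly([2015, 2020], sparse, 2015, 2020)
print(monthly.shape)  # (61, 2, 2)
```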

Figure: Greenhouse gas forcing time series across scenarios.

Figure: Aerosol emission inputs.

Targets

Your emulator predicts daily 2D maps (192×288, lat × lon) of 7 variables, which are used for diagnostics and for computing the extreme indices:

Variable	Description	Units
`tas`	Near-surface air temperature	K
`tasmax`	Daily max near-surface air temperature	K
`tasmin`	Daily min near-surface air temperature	K
`pr`	Precipitation	kg/(m² s)
`huss`	Near-surface specific humidity	kg/kg
`psl`	Sea level pressure	Pa
`sfcWind`	Near-surface wind speed	m/s

See more visualizations

Submission rule: your model must emulate these daily target variables first. Direct prediction of the leaderboard indices is not allowed.

Benchmark target: the leaderboard score is computed on 15 extreme indices derived from daily temperature and precipitation.

Evaluation focused on extremes

Because averages hide risk: extremes determine impacts

Primary score

Models are scored on 15 derived extreme indices using the region-wise normalized Nash–Sutcliffe efficiency (nNSE). Cell-level $R^2$ is transformed via $\mathrm{nNSE}_{ij} = R^2_{ij}/(2-R^2_{ij})$, mapping $R^2$ to $(-1,1]$. Regional scores are area-weighted over AR6 land regions and then averaged uniformly:

$$ \mathrm{nNSE}_{kv} = \frac{ \sum_{(i,j)\in k \cap \mathcal{V}} \cos\phi_i \, \mathrm{nNSE}_{ij} }{ \sum_{(i,j)\in k \cap \mathcal{V}} \cos\phi_i } $$

$$ S = \frac{1}{|V|}\sum_{v \in V}\frac{1}{|K_v|}\sum_{k \in K_v}\mathrm{nNSE}_{kv} $$

$S=1$: perfect agreement, $S=0$: mean predictor, and $S\to -1$: pathological performance. Cells with negligible temporal variability are excluded.
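
The scoring chain can be sketched in a few lines of NumPy. This is an illustrative reading of the formulas, not the official evaluation code: it assumes $R^2_{ij} = 1 - \mathrm{SSE}/\mathrm{SST}$ over the time axis, reads $\mathcal{V}$ as the set of cells that pass the variability filter, and uses a hypothetical variance threshold `var_eps`. All function names are made up for this example.

```python
import numpy as np

def nnse_per_cell(pred, truth, var_eps=1e-12):
    """pred, truth: (time, lat, lon). Returns (lat, lon) nNSE, NaN where excluded."""
    sse = ((pred - truth) ** 2).sum(axis=0)
    sst = ((truth - truth.mean(axis=0)) ** 2).sum(axis=0)
    valid = sst > var_eps                      # drop cells with negligible variability
    r2 = np.full(sst.shape, np.nan)
    r2[valid] = 1.0 - sse[valid] / sst[valid]
    return r2 / (2.0 - r2)                     # maps R^2 in (-inf, 1] to (-1, 1]

def regional_nnse(nnse, lat_deg, region_mask):
    """cos(latitude)-weighted mean of nNSE over the valid cells of one region."""
    w = np.cos(np.deg2rad(lat_deg))[:, None] * np.ones(nnse.shape)
    keep = region_mask & np.isfinite(nnse)
    return (w[keep] * nnse[keep]).sum() / w[keep].sum()

def final_score(scores):
    """scores: {index_name: {region_name: nNSE_kv}} -> S (uniform means)."""
    return float(np.mean([np.mean(list(r.values())) for r in scores.values()]))

rng = np.random.default_rng(0)
truth = rng.normal(size=(10, 4, 6))
mean_pred = np.broadcast_to(truth.mean(axis=0), truth.shape)
print(np.allclose(nnse_per_cell(truth, truth), 1.0))      # perfect prediction
print(np.allclose(nnse_per_cell(mean_pred, truth), 0.0))  # mean predictor
```

Region masks for the AR6 land regions would have to come from the starter kit or a regions library; here `region_mask` is simply a boolean array of the same shape as the grid.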

The uncertainty-quantification track is hosted separately on Kaggle and evaluates probabilistic submissions with the analogous CRPS-based regional score.
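
For intuition about the UQ metric, the sketch below computes the standard empirical CRPS of an ensemble against a single reference value. Treating a probabilistic submission as an ensemble of members is an assumption, and the exact normalization and regional aggregation used by the UQ track are not specified here; `crps_ensemble` is a hypothetical helper.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS: E|X - y| - 0.5 * E|X - X'|.

    members: (M, ...) ensemble of predictions
    obs:     (...) reference values
    Returns an array of shape (...); lower is better.
    """
    members = np.asarray(members, dtype=float)
    obs = np.asarray(obs, dtype=float)
    spread_to_obs = np.abs(members - obs).mean(axis=0)
    pairwise = np.abs(members[:, None] - members[None, :]).mean(axis=(0, 1))
    return spread_to_obs - 0.5 * pairwise

print(crps_ensemble([[0.0], [2.0]], [1.0]))  # [0.5]
```

With a single member the second term vanishes and the CRPS reduces to the absolute error, which is why it is often described as a probabilistic generalization of MAE.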

Why indices?

Indices convert daily fields into impact-relevant summaries: how hot the hottest day gets, how long droughts persist, how much rain falls during the wettest multi-day event, and how the fraction of rainfall from extremes changes.

Why nNSE?

Using nNSE instead of a raw error ensures the score is bounded, physically interpretable, and comparable across indices with different units and scales.

The 15 indices and the questions they answer

Temperature extremes

TXx, TNn: hottest day / coldest night intensity (heatwaves, cold snaps)
SU, TR: frequency of hot days and hot nights (human thermal stress)
FD, ID: frost and ice days (ecosystems and agriculture)
WSDI, CSDI: warm/cold spell duration (persistence of extremes)
GSL: growing season length (shifts in crop calendars)

Precipitation extremes

Rx5day: intensity of multi-day rainfall events (flood risk)
CDD, CWD: dry/wet spell persistence (drought and prolonged wet periods)
R95pTOT: share of rainfall from very wet days (tail-dominated precipitation regimes)
R10mm, SDII: frequency and intensity of heavy rain (infrastructure design)
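
As an illustration of how daily fields turn into indices, the following sketch computes several of them for one grid cell and one year, using the usual ETCCDI-style thresholds (25 °C, 20 °C, 0 °C, 1 mm, 10 mm). It is a simplified assumption-laden example, not the official implementation: spells are not carried across year boundaries, percentile-based indices (WSDI, CSDI, R95pTOT) and GSL are omitted, and the helper names are hypothetical. Use the public evaluation code for actual submissions.

```python
import numpy as np

SECONDS_PER_DAY = 86400.0

def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best

def yearly_indices(tasmax_k, tasmin_k, pr_flux):
    """Daily series for one cell and one year -> dict of selected indices."""
    tx = np.asarray(tasmax_k, dtype=float) - 273.15                 # K -> degC
    tn = np.asarray(tasmin_k, dtype=float) - 273.15
    pr = np.asarray(pr_flux, dtype=float) * SECONDS_PER_DAY        # kg/(m2 s) -> mm/day
    wet = pr >= 1.0
    return {
        "TXx": tx.max(),                                            # hottest day
        "TNn": tn.min(),                                            # coldest night
        "SU": int((tx > 25.0).sum()),                               # summer days
        "TR": int((tn > 20.0).sum()),                               # tropical nights
        "FD": int((tn < 0.0).sum()),                                # frost days
        "ID": int((tx < 0.0).sum()),                                # ice days
        "Rx5day": np.convolve(pr, np.ones(5), mode="valid").max(),  # max 5-day total
        "CDD": longest_run(~wet),                                   # longest dry spell
        "CWD": longest_run(wet),                                    # longest wet spell
        "R10mm": int((pr >= 10.0).sum()),                           # heavy-rain days
        "SDII": pr[wet].mean() if wet.any() else float("nan"),      # mean wet-day rain
    }

tasmax = 273.15 + np.array([30, 26, 24, -1, 10, 10, 10, 10, 10, 10], dtype=float)
tasmin = 273.15 + np.array([21, 19, 5, -5, 0.5, 1, 1, 1, 1, 1], dtype=float)
pr = np.array([0, 0, 0, 12, 20, 5, 0.5, 0, 2, 0], dtype=float) / SECONDS_PER_DAY
idx = yearly_indices(tasmax, tasmin, pr)
print(idx["SU"], idx["FD"], idx["CDD"])  # 2 1 3
```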

Submission rules (summary)

Use only the data provided by the organizers for training (no external CMIP6 data; no models pre-trained on CMIP6).
Submit predictions via Kaggle: the main track is hosted at ClimX and the UQ track is hosted at ClimX UQ Track.
Teams are limited to 10 members (team merges allowed up to one month before the deadline).
Organizers may run validity checks and request training/inference code and weights; suspicious entries may be temporarily removed while being reviewed.
Your emulator must output the daily target variables first; the submitted indices must be computed from those outputs rather than predicted directly.
To be eligible for prizes and final ranking, top-ranked participants must open-source code and weights under an MIT or Apache-2.0 license.

Timeline

Proposed schedule for the NeurIPS 2026 ClimX challenge

May 1, 2026 — Dataset and starter kit release

June 15, 2026 — Kaggle launch and call for participation

July 15, 2026 — Registration deadline

October 15, 2026 — Submission deadline and leaderboard freeze

November 1, 2026 — Winner notification and finalist reproducibility checks

December 11–12, 2026 — NeurIPS 2026 workshop, invited talks, and awards

NeurIPS 2027 — Organizers submit a competition summary or demo paper

Tutorials and starter resources

Everything needed to get from data access to a valid submission

1. Starter kit

Clone the public repository for loaders, baselines, metrics, and submission templates for both tracks.

Open the GitHub starter kit

2. Kaggle walkthrough

Use the Kaggle competition page and notebooks for lite-data prototyping, submission formatting, and public leaderboard checks.

Open the main track

3. Full-data training

Use the Hugging Face dataset page for full-resolution training artifacts and access instructions.

Open the dataset page

Recommended onboarding path

Start with the lite dataset on Kaggle to validate loading, training, and submission code end to end.
Reproduce one of the provided baselines before scaling to the full dataset.
Use the public evaluation code to verify index computation and file formatting before your first submission.
Move to the Hugging Face full dataset only after the lite pipeline is stable.

FAQ

Common participant questions, with the operational details collected in one place

Quick links

Main track · UQ track · Full dataset · Starter kit · Visualizations · Model comparison · Climpact indices

Q: What exactly is the prediction task?

A: You build an emulator for daily NorESM2-MM surface fields at about 1-degree resolution. The model predicts 7 daily variables: tas, tasmax, tasmin, pr, huss, psl, and sfcWind. Those daily fields are then converted into the 15 climate-extreme indices used for ranking.

Q: What are the official train and test splits?

A: The official training data are historical (1850–2014) plus SSP1-2.6, SSP3-7.0, and SSP5-8.5 (2015–2100). The held-out test scenario is SSP2-4.5. Only the SSP2-4.5 forcings are released publicly; the targets are withheld for evaluation.

Q: Can I train directly on the 15 leaderboard indices?

A: No. Direct index prediction is not allowed. Your model must first output the daily target fields, and the submitted indices must be computed from those fields. This avoids metric-specific shortcuts and preserves spatial and cross-variable consistency.

Q: Can I use external climate or weather datasets or pretrained models?

A: No. External CMIP6 data and models pretrained on other climate or weather data are prohibited for the competition. Use only the data released for ClimX and standard open-source ML tooling.

Q: Which track should I enter?

A: Use the main track if your goal is to compete on the primary challenge ranking and win ClimX overall, since the official winning teams are determined from the main deterministic leaderboard. Use the UQ track if your model outputs predictive distributions and you specifically want to be ranked with the CRPS-based uncertainty metric.

Q: What is the recommended development workflow?

A: Start with the lite Kaggle dataset to debug your pipeline end to end, reproduce one of the provided baselines from the starter kit, validate your submission formatting with the public evaluation code, and only then move to the full-resolution dataset.

Q: How are submissions ranked?

A: The main track is ranked by region-wise normalized Nash–Sutcliffe efficiency (nNSE), averaged across 15 extreme indices and AR6 land regions. The UQ track follows the same aggregation structure but replaces MSE with CRPS. See the Climpact index definitions for the underlying climate metrics.

Q: Which indices are included?

A: ClimX uses 15 indices spanning temperature and precipitation extremes: FD, SU, ID, TR, GSL, TXx, TNn, WSDI, CSDI, Rx5day, CDD, CWD, R95pTOT, SDII, and R10mm. The intent is to reward models that capture impact-relevant extremes rather than only mean fields.

Q: What baseline models and references are available?

A: The repository includes Climatology, LPS, FFNN, and GNN baselines. You can browse generated reports from the Climatology, LPS, FFNN, and GNN pages, or compare them on the model comparison page.

Q: What will organizers request from top teams?

A: Finalists may be asked for inference code, training code, weights, environment details, and a short model card. The reproducibility package must run on at most one modern accelerator or equivalent CPU resources, use no more than 80GB of accelerator memory and 256GB of system RAM, and complete organizer-side inference within 24 hours.

Q: How do you prevent cheating or leaderboard overfitting?

A: The competition uses a held-out SSP2-4.5 scenario, restricted data usage, organizer-side review of suspicious entries, and re-evaluation of finalist submissions. Teams that cannot pass reproducibility checks are not eligible for prizes or final ranking.

Q: Are there prizes or workshop presentations?

A: Yes. ESA Phi-lab sponsors the challenge with awards for the top three main-track teams (€1,000, €500, and €300), partial travel support of up to €500 per winning team, certificates, and speaking slots at the NeurIPS 2026 workshop.

Q: Where should I ask questions or report issues?

A: Use GitHub issues for starter-kit bugs and documentation clarifications, the Kaggle discussion forum for competition questions, and email for private matters such as reproducibility or eligibility questions.

Contact

Organizer support channels for participants and reviewers

Email the organizers

oscar.pellicer@uv.es

Use email for private questions, reproducibility issues, or anything that should not be posted publicly.

Public discussion

GitHub issues

Use issues for starter-kit bugs, documentation fixes, and technical clarifications that can help other teams.

Competition updates

Kaggle discussion forum

Timeline changes, FAQ updates, and competition announcements will be mirrored on the website and Kaggle pages.

Baselines and results

Reference implementations and their performance reports

Get started

From zero to a valid submission

1. Download data

Use ClimX-lite on Kaggle for quick iteration or download the full dataset on Hugging Face for full-resolution training.

2. Train an emulator

Start from the provided baselines and improve speed, accuracy, and extreme fidelity.

3. Submit on Kaggle

Compute the required indices from your emulator outputs, then submit to the deterministic or UQ Kaggle competition.

ESA Phi-lab sponsors the challenge prizes and travel support for the winning teams.
