Deep Learning Might Have Been Lying to Us About Resolution

Tom Hippensteel

TL;DR:

  • The Observation: Physics models often show flat error rates even when tested at higher resolutions.
  • The Suspicion: A new paper argues this isn't "generalization," but rather a "frequency ceiling" (Scale Anchoring) from low-res training.
  • The Experiment: By aligning frequency embeddings across resolutions, the authors saw errors finally drop as resolution increased—behaving more like true numerical solvers.

Train a model on low-resolution data. Run inference at high resolution. The error stays basically the same.

For years, I’ve seen researchers point to this as evidence of successful generalization. The narrative is that the model learned the underlying physics, not just the grid, so it works at any resolution.

But after reading a new paper from MIT and South China University of Technology, I’m starting to question that assumption.

They call this phenomenon Scale Anchoring. And if their math holds up, a lot of "generalization" results might need a second look.

The Problem Nobody Named

From what I can gather, the issue comes down to frequency limits.

When you train on low-resolution data, there's a hard ceiling on what frequencies that data can represent. It's the Nyquist frequency (half the sampling rate). Mathematically, anything above that limit simply doesn't exist in your training set.

So when the model runs inference at a higher resolution later, it encounters high-frequency components it has never seen. It’s essentially flying blind.

The result? The error "anchors" to the performance ceiling of the low-res training data. Go higher in resolution, and the errors stay flat. The model isn't necessarily generalizing; it's just not failing any worse.
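The frequency ceiling is easy to see directly. Here is a minimal NumPy sketch (my own illustration, not from the paper): a sine wave at 40 cycles per domain is faithfully captured on a 256-point grid, but on a 32-point grid, whose Nyquist limit is 16, it aliases down to a lower frequency. Anything above the limit simply cannot appear in the coarse data.

```python
import numpy as np

# Nyquist limit: a grid with N points over a unit domain can only
# represent frequencies up to N/2 cycles per domain.
n_low, n_high = 32, 256
x_low = np.arange(n_low) / n_low
x_high = np.arange(n_high) / n_high

freq = 40  # cycles per domain -- above the low-res Nyquist limit of 16
signal_low = np.sin(2 * np.pi * freq * x_low)
signal_high = np.sin(2 * np.pi * freq * x_high)

# On the fine grid, the dominant FFT bin is the true frequency...
dominant_high = int(np.argmax(np.abs(np.fft.rfft(signal_high))))
# ...but on the coarse grid, it aliases: 40 mod 32 = 8.
dominant_low = int(np.argmax(np.abs(np.fft.rfft(signal_low))))

print(dominant_high)  # 40
print(dominant_low)   # 8
```

A model trained only on the coarse grid has literally never observed the true frequency, which is the "flying blind" problem described above.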

Every Architecture Hits the Same Wall

The researchers tested this across most major deep learning architectures used in spatiotemporal forecasting—GNNs, Transformers, CNNs, Diffusion models, Neural Operators, and Neural ODEs.

They measured RMSERatio—the error at high resolution divided by the error at low resolution.

  • For a numerical solver: This ratio should be <1.0. More resolution means less error. That's the whole point of increasing resolution.
  • For the DL models: The ratio stayed between 1.0 and 1.4.
Architecture      Method        RMSERatio (64× Super-Res)
GNN               Neural SPH    1.022
Transformer       DeepLag       1.021
CNN               PARCv2        1.060
Diffusion         DYffusion     1.041
Neural Operator   SFNO          1.017
Neural ODE        FNODE         1.338

To me, that looks less like generalization and more like a hard ceiling.
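The metric itself is simple to compute. Below is a toy sketch with synthetic fields (my own names and noise levels, not the paper's data or models) contrasting an "anchored" model, whose error stays at the low-res level, with a solver-like model, whose error shrinks with resolution.

```python
import numpy as np

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def rmse_ratio(pred_high, target_high, pred_low, target_low):
    """Error at high resolution divided by error at low resolution.
    < 1.0: errors shrink with resolution (solver-like behavior).
    >= 1.0: the model appears anchored to its low-res ceiling."""
    return rmse(pred_high, target_high) / rmse(pred_low, target_low)

rng = np.random.default_rng(0)
target_low = rng.normal(size=(64, 64))
target_high = rng.normal(size=(256, 256))
pred_low = target_low + 0.10 * rng.normal(size=(64, 64))

# An "anchored" model: high-res error stays at the low-res level.
pred_high_anchored = target_high + 0.10 * rng.normal(size=(256, 256))
# A solver-like model: error shrinks as resolution grows.
pred_high_solver = target_high + 0.02 * rng.normal(size=(256, 256))

print(rmse_ratio(pred_high_anchored, target_high, pred_low, target_low))  # ~1.0
print(rmse_ratio(pred_high_solver, target_high, pred_low, target_low))    # ~0.2
```

A ratio hovering just above 1.0, as in the table, is exactly the anchored case.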

Aligning the Frequencies

Their solution is called Frequency Representation Learning (FRL).

It took me a minute to wrap my head around the mechanism, but the core idea seems to be about creating a shared language for frequencies. They normalize frequency encodings to each resolution's Nyquist limit.

Basically, they constrain the model so that the same physical frequency receives the same numerical representation, regardless of grid size. In principle, this lets the model extrapolate spectrally rather than memorize scale-specific spatial patterns.
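To make the alignment idea concrete, here is a small sketch of my own (not the paper's exact FRL parameterization): a naive grid-relative frequency encoding assigns the same physical frequency different values at different resolutions, while an encoding in physical units (cycles per domain) gives it the same representation on every grid.

```python
import numpy as np

def grid_relative_freqs(n_grid):
    # Naive encoding: wavenumbers as a fraction of the sampling rate
    # (0 .. 0.5). The same physical frequency lands at different values
    # on different grids, so scale gets entangled with physics.
    return np.fft.rfftfreq(n_grid)

def physical_freqs(n_grid, domain_length=1.0):
    # Aligned encoding: wavenumbers in cycles per physical domain.
    # Bin k encodes frequency k on every grid, so the same physical
    # frequency always gets the same numerical representation.
    return np.fft.rfftfreq(n_grid, d=domain_length / n_grid)

# Physical frequency 16 cycles/domain, on a 64-grid and a 256-grid:
print(physical_freqs(64)[16], physical_freqs(256)[16])          # 16.0 16.0
print(grid_relative_freqs(64)[16], grid_relative_freqs(256)[16])  # 0.25 0.0625
```

With a consistent frequency axis, whatever the model learns about a given physical scale can transfer directly when the grid is refined.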

The Results

This is the part that really caught my attention. When they applied this fix, the behavior changed drastically.

Architecture      Baseline RMSERatio   With FRL   Improvement
GNN               1.018                0.175      5.7×
Transformer       1.021                0.188      5.4×
CNN               1.060                0.137      7.7×
Diffusion         1.041                0.221      4.7×
Neural Operator   1.017                0.135      7.6×

The RMSERatio dropped well below 1.0 across the board. Errors finally decreased as resolution increased. It’s the first time I’ve seen some of these architectures behave so much like traditional numerical solvers.

The computational overhead seems manageable, too: training time increased by 10-40%, while inference overhead was negligible.

What It Doesn't Solve

I appreciate that the authors are upfront about limitations.

It turns out FRL breaks down at extremely high Reynolds numbers (Re = 10⁵), where turbulence becomes strongly nonlinear. The smooth frequency relationships the method relies on seem to stop working at those chaotic scales, and Scale Anchoring creeps back in.

They suggest future work could incorporate Kolmogorov-type spectral constraints, but for now, it seems best suited for moderate Reynolds numbers and weather forecasting tasks.

My Takeaway

If Scale Anchoring is as widespread as this paper suggests, it changes how I look at "generalization" graphs.

A model maintaining similar error across resolutions might not be learning the physics as well as we thought. It might just be hitting a frequency ceiling we weren't looking for.

The fix proposed here is architecture-agnostic and seems to work well for standard use cases. But at the very least, it’s a reminder that a flat error curve isn't always a success story.


Sources

Supporting

  • arXiv:1806.08734 — "On the Spectral Bias of Neural Networks" (Rahaman et al., 2019) — Foundational work showing neural networks learn low frequencies first and struggle with high frequencies. Related but distinct from Scale Anchoring. https://arxiv.org/abs/1806.08734
  • ICLR 2021 — "Fourier Neural Operator for Parametric Partial Differential Equations" (Li et al., 2021) — The original FNO paper that introduced zero-shot super-resolution claims. This is what the Scale Anchoring paper is pushing back against. https://openreview.net/forum?id=c8P9NQVtmnO
  • arXiv:2409.13955 — "On the Effectiveness of Neural Operators at Zero-Shot Weather Downscaling" (2024) — Recent paper finding neural operators underperform non-neural methods at zero-shot weather downscaling. Independent evidence aligning with Scale Anchoring findings. https://arxiv.org/abs/2409.13955

Credibility Assessment

Paper: Breaking Scale Anchoring: Frequency Representation Learning for Accurate High-Resolution Inference from Low-Resolution Training
Authors: Wenshuo Wang, Fan Zhang (Under Anonymous Review for ICLR 2026)
Status: Preprint (arXiv 2512.05132) / Under Review at ICLR 2026

Author Verification
⚠ — Anonymous submission for double-blind review; cannot verify institutional affiliations. This is standard practice for ICLR submissions.

Institution Check
⚠ — Not disclosed due to anonymous review process (expected for ICLR 2026).

Citation Sampling
✓ — Verified Li et al. 2020 (Fourier Neural Operator, ICLR 2021), ERA5 weather dataset widely used in literature, multiple supporting references exist.

Methodology Specificity
✓ — Highly detailed: Explicit hyperparameters (Tsucc, Tfail, Niter), resolution specifications (32³, 64³, 128³), frequency response measurements, GPU specs (4× NVIDIA A100 80GB), training details (AdamW lr=1×10⁻³, batch size 8, epochs 100). Reproducible pseudocode in appendices.

Limitations Disclosed
✓ — Explicitly discusses failure modes for high-Reynolds turbulence, notes FRL degrades when local spectral relationships aren't smooth, acknowledges need for Kolmogorov scaling integration in future work.

Code/Data Availability
⚠ — Not specified in paper (acceptable for papers under review). Disclosure of LLM usage for writing/editing included per ICLR 2026 policy.

Peer Review Status
Under Review — ICLR 2026 (double-blind submission)

Overall Assessment: PASS

This is high-quality machine learning research with rigorous experimental validation. The paper demonstrates strong credibility indicators despite anonymous authorship: (1) highly specific methodology with reproducible parameters across multiple architectures (GNN, Transformer, CNN, Diffusion, Neural Operators), (2) honest disclosure of limitations and failure modes, (3) legitimate citations to established work (FNO, ERA5 dataset), (4) detailed frequency response analysis with concrete metrics (Anchoring Ratios 1.40-63.1).

The anonymous submission is standard for ICLR's double-blind review process and does not indicate credibility concerns. Technical depth (pseudo-spectral methods, Nyquist frequency analysis, normalized frequency representations) and comprehensive experimental scope (fluid simulation + weather forecasting across 4 resolutions) suggest genuine research work. Variable results across models (45-100% DSR) indicate real experimental data rather than fabricated uniformity.

The single caution flag on code availability is expected for papers under review—authors typically release code upon acceptance.

This assessment evaluates credibility indicators, not absolute authenticity. Evaluation assisted by Claude Sonnet 4.5. Reader discretion advised.