Data Assimilation Project: Uncertainty-aware inference for complex dynamical systems

This project develops scalable methods for uncertainty quantification (UQ) and data assimilation (DA) in large dynamical systems. The emphasis is on combining ensemble-based and variational approaches with physics-informed statistical models (Gaussian processes) to enable robust state/parameter inference and uncertainty-aware prediction.

Thrust: UQ & data assimilation
Methods: EnKF/EnKS, 4D-Var, inverse problems, Gaussian processes
Applications: Atmospheric chemistry, wind and climate variability

Motivation

Many inference problems in science can be framed as estimating an evolving state $x(t)$ from partial and noisy observations $y(t)$, given an imperfect dynamical model. In discrete time, a common abstraction is

\[ x_{k+1} = \mathcal{M}_k(x_k) + \eta_k, \qquad y_k = \mathcal{H}_k(x_k) + \epsilon_k, \]

where $\mathcal{M}_k$ is the model propagator, $\mathcal{H}_k$ maps a model state to observation space, and $(\eta_k,\epsilon_k)$ represent model and observation errors. DA methods compute estimates of $x_k$ (and often uncertain parameters) while tracking uncertainty in a way that is accurate enough for decision support, but also computationally feasible at scale.
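As a concrete (and deliberately minimal) instance of this abstraction, the sketch below simulates a linear-Gaussian state-space model. The function name and argument shapes are illustrative, not from the project's codebase.

```python
import numpy as np

def simulate(M, H, Q, R, x0, n_steps, rng):
    """Simulate x_{k+1} = M x_k + eta_k, y_k = H x_k + eps_k
    for a linear-Gaussian instance of the state-space abstraction."""
    n, m = M.shape[0], H.shape[0]
    xs, ys = [x0], []
    for _ in range(n_steps):
        x = xs[-1]
        ys.append(H @ x + rng.multivariate_normal(np.zeros(m), R))   # noisy observation
        xs.append(M @ x + rng.multivariate_normal(np.zeros(n), Q))   # propagate with model error
    return np.array(xs[:-1]), np.array(ys)
```

Nonlinear $\mathcal{M}_k,\mathcal{H}_k$ fit the same template by replacing the matrix products with function calls.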

Ensemble and variational data assimilation: complementary tools

Two widely used paradigms are ensemble-based and variational data assimilation. They have different computational tradeoffs and failure modes, and many practical workflows mix ideas from both.

Variational DA (4D-Var)

4D-Var estimates an initial condition (and possibly parameters) by solving an optimization problem over an assimilation window:

\[ J(x_0) = \tfrac{1}{2}\lVert x_0 - x_b \rVert_{\mathbf{B}^{-1}}^2 \;+\; \tfrac{1}{2}\sum_{k=0}^{N} \lVert y_k - \mathcal{H}_k(\mathcal{M}_{0\to k}(x_0)) \rVert_{\mathbf{R}_k^{-1}}^2, \]

where $x_b$ is a background (prior) state, and $\mathbf{B},\mathbf{R}_k$ are background and observation error covariances. Gradients (and often Hessian-vector products) are computed with tangent-linear/adjoint models.
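For linear $\mathcal{M}_k,\mathcal{H}_k$, the cost function above can be evaluated directly; the sketch below does so with dense numpy linear algebra (list arguments and the function name are illustrative assumptions, and a real 4D-Var system would never form or factor $\mathbf{B}$ densely).

```python
import numpy as np

def fourdvar_cost(x0, xb, B, Ms, Hs, ys, Rs):
    """Evaluate J(x0) for linear M_k, H_k.
    Ms[k] propagates x_k -> x_{k+1}; Hs[k], ys[k], Rs[k] cover times 0..N."""
    d = x0 - xb
    J = 0.5 * d @ np.linalg.solve(B, d)          # background term
    x = x0.copy()
    for k in range(len(ys)):
        r = ys[k] - Hs[k] @ x                    # innovation at time k
        J += 0.5 * r @ np.linalg.solve(Rs[k], r) # observation misfit term
        if k < len(Ms):
            x = Ms[k] @ x                        # propagate to time k+1
    return J
```

In practice the gradient of this sum is assembled by a single backward (adjoint) sweep rather than by differentiating term by term.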

Ensemble Kalman methods (EnKF/EnKS)

Ensemble Kalman methods approximate uncertainty using an ensemble and apply Kalman-type updates at observation times. In a linearized form,

\[ x_k^{a} = x_k^{f} + \mathbf{K}_k \bigl(y_k - \mathbf{H}_k x_k^{f}\bigr), \qquad \mathbf{K}_k = \mathbf{P}_k^{f}\mathbf{H}_k^{T}\bigl(\mathbf{H}_k\mathbf{P}_k^{f}\mathbf{H}_k^{T} + \mathbf{R}_k \bigr)^{-1}, \]

with the forecast covariance $\mathbf{P}_k^{f}$ estimated from the ensemble.
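A minimal sketch of one stochastic (perturbed-observation) EnKF analysis step follows; it forms $\mathbf{P}_k^{f}\mathbf{H}_k^{T}$ and $\mathbf{H}_k\mathbf{P}_k^{f}\mathbf{H}_k^{T}$ from ensemble anomalies so the full covariance is never built. The function name is an assumption, and localization/inflation are omitted for brevity.

```python
import numpy as np

def enkf_analysis(Xf, y, H, R, rng):
    """Stochastic EnKF analysis; Xf is the (n, Ne) forecast ensemble."""
    n, Ne = Xf.shape
    A = Xf - Xf.mean(axis=1, keepdims=True)   # ensemble anomalies
    HA = H @ A
    PfHt = A @ HA.T / (Ne - 1)                # Pf H^T from anomalies
    S = HA @ HA.T / (Ne - 1) + R              # H Pf H^T + R
    K = PfHt @ np.linalg.inv(S)               # Kalman gain
    # perturbed observations keep the analysis spread statistically consistent
    Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R, Ne).T
    return Xf + K @ (Y - H @ Xf)
```

Deterministic square-root variants replace the perturbed observations with an explicit transform of the anomalies.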

Advantages and limitations

Ensemble methods: naturally parallel, no adjoint required, and provide sample-based uncertainty; but they suffer from sampling error in high dimension (spurious long-range correlations), often requiring localization and inflation, and they are sensitive to model error specification.

Variational methods: scale well with state dimension (no explicit covariance matrices) and can enforce model and constraint structure; but they require tangent-linear/adjoint capabilities and robust optimization, and they can be sensitive to nonconvexity and poor preconditioning.

What this project contributes

  1. Flow-dependent prior/error models for ensemble-based chemistry DA, including autoregressive (AR) background covariance construction and analysis of ensemble size/localization effects.
  2. Scalable variational DA strategies (low-memory smoothing and space-time domain decomposition) for large inverse problems.
  3. Physics-informed covariance modeling for multi-output Gaussian processes and hidden-process inference.
  4. Probabilistic GP forecasting workflows with explicit uncertainty calibration for time-series prediction problems.

Topic 1: Chemical data assimilation with EnKF and flow-dependent priors

Atmospheric chemistry DA is challenging because (i) the state is large and coupled across species, (ii) observations are sparse/heterogeneous, and (iii) background error structure is strongly flow-dependent. In ensemble Kalman workflows, the background covariance plays a central role because it controls how observation increments propagate through the model state.

One approach developed in this work models the background covariance through an autoregressive (AR) structure that captures distance decay and flow dependence. A representative construction is

\[ \mathbf{B} = \mathbf{A}^{-1}\mathbf{S}^2\mathbf{A}^{-T}, \qquad \delta c^{\mathrm{B}}(e) = \mathbf{A}^{-1}\mathbf{S}\,\xi(e), \qquad \xi(e)\sim \mathcal{N}(0,\mathbf{I}), \]

where $\mathbf{S}$ encodes spatially varying standard deviations and $\mathbf{A}$ encodes correlations (parameterized through an AR process). This provides a computationally practical way to inject physically meaningful structure into ensemble initialization and into the implied background statistics.
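As a toy illustration of this construction (a 1-D grid with a first-order AR operator; the helper name and parameterization are illustrative, not the project's actual chemistry implementation), drawing $\delta c^{\mathrm{B}} = \mathbf{A}^{-1}\mathbf{S}\,\xi$ yields samples whose covariance is exactly $\mathbf{B} = \mathbf{A}^{-1}\mathbf{S}^2\mathbf{A}^{-T}$:

```python
import numpy as np

def ar1_perturbations(n, phi, sigmas, n_samples, rng):
    """Draw delta_c = A^{-1} S xi, with A encoding the AR(1) recursion
    z_i = phi * z_{i-1} + xi_i on a 1-D grid of n points."""
    A = np.eye(n) - phi * np.eye(n, k=-1)   # lower-bidiagonal AR(1) operator
    S = np.diag(sigmas)                     # spatially varying std deviations
    xi = rng.standard_normal((n, n_samples))
    return np.linalg.solve(A, S @ xi)       # A^{-1} S xi, no explicit B needed
```

Because only solves with $\mathbf{A}$ are required, the implied $\mathbf{B}$ never has to be stored, which is what makes the construction practical at scale.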

In addition, the chemistry DA studies in this project analyze sensitivity to practical algorithmic choices (ensemble size, localization/inflation strength, and error sources such as emissions and boundary conditions) and explore the use of model singular vectors as physically motivated perturbation directions for ensemble generation.

Chemistry DA: domain, observations, and forecast impact

Examples from an idealized chemistry transport model study: the domain and observation network, and a representative forecast improvement signal after assimilation.

Figure: Chemistry assimilation domain example (SE Asia). Representative domain used in chemistry DA experiments.
Figure: Observation network schematic for chemistry data assimilation. Illustrative observation footprint/network.
Figure: O3 forecast difference with assimilation versus a non-assimilated baseline. Assimilation can reduce forecast error for observed quantities.
Figure: Field snapshot from the chemistry DA study, illustrating spatial structure relevant to assimilation.

Advantages and limitations

Advantages: AR-style priors provide a compact, flow-aware covariance structure that improves ensemble realism and can reduce spurious long-range correlations.

Limitations: covariance modeling is still approximate (especially under strong nonlinearity), and practical EnKF deployments still require careful tuning of localization, inflation, and model error representation.


Topic 2: Variational DA and scalable PDE-constrained inverse problems

In variational DA, the dominant cost often comes from repeatedly solving large linear/nonlinear systems inside an optimization method. This motivates algorithms that reduce memory footprint and expose parallel structure in space and time.

Two representative directions in this project are:

  1. Low-memory best-state estimation for hidden Markov models with model error, which targets large-scale inference where naive smoothing formulations become memory-prohibitive.
  2. Space-time domain decomposition for 4D-Var, which decomposes the inference problem into coupled local subproblems, enabling scalable solvers for large regularized inverse problems.

Advantages and limitations

Advantages: variational formulations provide a principled way to incorporate dynamical constraints and correlated error models, while domain decomposition can expose parallelism and reduce memory pressure.

Limitations: performance hinges on high-quality linear/nonlinear solvers and preconditioners; strong nonlinearity can lead to nonconvex optimization, and adjoint/tangent-linear infrastructure remains a key requirement.


Topic 3: Physics-informed Gaussian processes for multi-output inference

Gaussian processes (GPs) provide a flexible, nonparametric route to inference with uncertainty quantification. A key practical challenge is specifying covariance structure that is both expressive and physically meaningful, especially for multiple outputs (co-kriging) and for settings where some processes are hidden/unobserved.

A physics-informed approach is to define outputs through linear operators acting on a latent GP $f$. If $y_i(x) = \mathcal{L}_i f(x)$ and $f \sim \mathcal{GP}(0,k)$, then the induced cross-covariance is

\[ \mathrm{Cov}\bigl(y_i(x),y_j(x')\bigr) = (\mathcal{L}_i)_x\,(\mathcal{L}_j)_{x'}\,k(x,x'). \]

This construction preserves positive definiteness by design, while enabling covariance models that reflect physical constraints (for example, relationships between pressure and wind derived from geostrophic assumptions).
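The derivative-of-a-latent-GP case is the simplest worked example of this construction. Below, output 1 is $f$ itself and output 2 is $\mathrm{d}f/\mathrm{d}x$, so the cross-covariance is the partial derivative of an RBF base kernel in its second argument (function names are illustrative):

```python
import numpy as np

def rbf(x, xp, ell=1.0):
    """Base kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2))."""
    return np.exp(-0.5 * (x - xp) ** 2 / ell**2)

def cross_cov(x, xp, ell=1.0):
    """Cov(y1(x), y2(x')) with y1 = f and y2 = df/dx:
    apply d/dx' to the second argument of the base kernel."""
    return (x - xp) / ell**2 * rbf(x, xp, ell)
```

Stacking the four blocks $(\mathcal{L}_i)_x(\mathcal{L}_j)_{x'}k$ gives the joint covariance of both outputs, which is positive definite by construction because it is a valid kernel pushed through linear operators.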

In space-time settings, GP models can also fuse deterministic numerical weather prediction (NWP) output with historical measurements to produce probabilistic forecasts and realistic scenario ensembles (with calibrated correlation structure).

Physics-informed covariance structure for multi-output GPs

The figures below move from model ingredient to model behavior: first a physics-induced cross-covariance surface, then a coupled multi-output regression example, and finally a wider space-time correlation comparison for wind scenario generation.

Figure: Physics-informed cross-covariance surface used in a multi-output GP. An off-diagonal cross-covariance surface $K_{12}$ induced by the governing relation between two outputs. This is often more informative than the base kernel itself because it shows how structure is transferred across variables in the coupled GP.
Figure: Multi-output GP regression example with uncertainty bands. The top and bottom panels show two coupled outputs; the dashed curve is the truth, circles mark noisy observations, the dark curve with gray band is an independent fit, and the blue fit with colored band is the physics-coupled regression that transfers information across outputs.
Figure: Space-time covariance comparison for wind scenario modeling. The upper-left triangular block shows empirical correlations from measurements; the fitted blocks compare the full space-time model against simplified alternatives. Diagonal blocks represent temporal correlations at each station, while off-diagonal blocks represent temporal cross-correlations between stations. The full model best preserves cross-station dependence needed for realistic scenarios.

Advantages and limitations

Advantages: operator-based covariance construction is interpretable and embeds physical structure directly into the statistical model.

Limitations: when relationships among outputs are strongly nonlinear, additional closure assumptions may be needed; and scalable GP inference may require approximations for large datasets.


Topic 4: Data-driven prediction with Gaussian processes (MJO forecasting)

Recent work also explores GP models as a data-driven route to probabilistic forecasting of climate variability. For the Madden–Julian Oscillation (MJO), GP models can be calibrated using empirical correlations and then corrected a posteriori to better match forecast uncertainty. This yields both point forecasts and confidence intervals and enables diagnostic analysis of when uncertainty estimates are overconfident or miscalibrated.

In numerical experiments, this strategy improves deterministic prediction skill at short lead times and, through the posterior covariance correction, substantially improves probabilistic coverage at longer lead times.
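A simple way to realize such a correction is per-lead-time moment matching on a validation set: inflate the predictive variance until the mean squared standardized error is one. This is a hedged stand-in for the a posteriori correction described above, not the study's exact scheme, and the function name is illustrative.

```python
import numpy as np

def variance_correction_factor(errors, variances):
    """Per-lead-time variance inflation from validation data.
    errors, variances: arrays of shape (n_cases, n_lead_times) holding
    forecast errors and the GP's predictive variances."""
    z2 = errors**2 / variances          # squared standardized errors
    # inflate so E[z^2] = 1 at each lead time; never deflate below 1
    return np.maximum(z2.mean(axis=0), 1.0)
```

Multiplying the predictive variance at each lead time by the returned factor widens intervals exactly where the validation set shows overconfidence.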

Gaussian process forecasting workflow (MJO)

These panels summarize three parts of the workflow: how the GP is calibrated and variance-corrected, how the RMM dataset is split into train/validation/test periods, and how deterministic and probabilistic forecast quality evolve with lead time.

Figure: Workflow schematic for GP-based MJO prediction. The training set is used to estimate empirical means and covariances, the validation set is used to correct posterior variance as lead time grows, and the test set is used for out-of-sample probabilistic forecasting.
Figure: RMM1 and RMM2 time series used in the study, with the long historical record partitioned into training, validation, and test windows. This split separates covariance calibration from uncertainty correction and final evaluation.
Figure: Forecast skill as a function of lead time. On the left, correlation and RMSE are shown for lag choices of 40 and 60 days against common skill thresholds; on the right, phase and amplitude errors diagnose how the GP forecast drifts and underestimates oscillation strength as lead time increases.

Advantages and limitations

Advantages: GP forecasts are probabilistic by construction and provide uncertainty estimates that can be analyzed and corrected, rather than being purely heuristic.

Limitations: model calibration is sensitive to covariance structure choices and stationarity assumptions; scaling to very large datasets may require sparse/approximate GP techniques.


Software

Several of the methods above are implemented and tested in production-grade scientific computing libraries.

DAPack (data assimilation toolkit): research code for UQ/DA workflows and inverse problems.

PETSc TAO / TSAdjoint (scalable optimization and adjoints): nonlinear optimization and adjoint sensitivity analysis used in variational DA and PDE-constrained inversion.

Funding