Data Assimilation Project: Uncertainty-aware inference for complex dynamical systems

This project develops scalable methods for uncertainty quantification (UQ) and data assimilation (DA) in large dynamical systems. The emphasis is on combining ensemble-based and variational approaches with physics-informed statistical models (Gaussian processes) to enable robust state/parameter inference and uncertainty-aware prediction.

Thrust: UQ & data assimilation
Methods: EnKF/EnKS, 4D-Var, inverse problems, Gaussian processes
Applications: Atmospheric chemistry, wind and climate variability

Motivation

Many inference problems in science can be framed as estimating an evolving state $x(t)$ from partial and noisy observations $y(t)$, given an imperfect dynamical model. In discrete time, a common abstraction is

\[ x_{k+1} = \mathcal{M}_k(x_k) + \eta_k, \qquad y_k = \mathcal{H}_k(x_k) + \epsilon_k, \]

where $\mathcal{M}_k$ is the model propagator, $\mathcal{H}_k$ maps a model state to observation space, and $(\eta_k,\epsilon_k)$ represent model and observation errors. DA methods compute estimates of $x_k$ (and often uncertain parameters) while tracking uncertainty in a way that is accurate enough for decision support, but also computationally feasible at scale.
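As a concrete (and deliberately minimal) instance of this abstraction, the sketch below simulates a linear-Gaussian state-space model. The function name and argument shapes are illustrative, not from the project's codebase.

```python
import numpy as np

def simulate(M, H, Q, R, x0, n_steps, rng):
    """Simulate x_{k+1} = M x_k + eta_k, y_k = H x_k + eps_k
    for a linear-Gaussian instance of the state-space abstraction."""
    n, m = M.shape[0], H.shape[0]
    xs, ys = [x0], []
    for _ in range(n_steps):
        x = xs[-1]
        ys.append(H @ x + rng.multivariate_normal(np.zeros(m), R))   # noisy observation
        xs.append(M @ x + rng.multivariate_normal(np.zeros(n), Q))   # propagate with model error
    return np.array(xs[:-1]), np.array(ys)
```

Nonlinear $\mathcal{M}_k,\mathcal{H}_k$ fit the same template by replacing the matrix products with function calls.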

Ensemble and variational data assimilation: complementary tools

Two widely used paradigms are ensemble-based and variational data assimilation. They have different computational tradeoffs and failure modes, and many practical workflows mix ideas from both.

Variational DA (4D-Var)

4D-Var estimates an initial condition (and possibly parameters) by solving an optimization problem over an assimilation window:

\[ J(x_0) = \tfrac{1}{2}\lVert x_0 - x_b \rVert_{\mathbf{B}^{-1}}^2 \;+\; \tfrac{1}{2}\sum_{k=0}^{N} \lVert y_k - \mathcal{H}_k(\mathcal{M}_{0\to k}(x_0)) \rVert_{\mathbf{R}_k^{-1}}^2, \]

where $x_b$ is a background (prior) state, and $\mathbf{B},\mathbf{R}_k$ are background and observation error covariances. Gradients (and often Hessian-vector products) are computed with tangent-linear/adjoint models.
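For linear $\mathcal{M}_k,\mathcal{H}_k$, the cost function above can be evaluated directly; the sketch below does so with dense numpy linear algebra (list arguments and the function name are illustrative assumptions, and a real 4D-Var system would never form or factor $\mathbf{B}$ densely).

```python
import numpy as np

def fourdvar_cost(x0, xb, B, Ms, Hs, ys, Rs):
    """Evaluate J(x0) for linear M_k, H_k.
    Ms[k] propagates x_k -> x_{k+1}; Hs[k], ys[k], Rs[k] cover times 0..N."""
    d = x0 - xb
    J = 0.5 * d @ np.linalg.solve(B, d)          # background term
    x = x0.copy()
    for k in range(len(ys)):
        r = ys[k] - Hs[k] @ x                    # innovation at time k
        J += 0.5 * r @ np.linalg.solve(Rs[k], r) # observation misfit term
        if k < len(Ms):
            x = Ms[k] @ x                        # propagate to time k+1
    return J
```

In practice the gradient of this sum is assembled by a single backward (adjoint) sweep rather than by differentiating term by term.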

Ensemble Kalman methods (EnKF/EnKS)

Ensemble Kalman methods approximate uncertainty using an ensemble and apply Kalman-type updates at observation times. In a linearized form,

\[ x_k^{a} = x_k^{f} + \mathbf{K}_k \bigl(y_k - \mathbf{H}_k x_k^{f}\bigr), \qquad \mathbf{K}_k = \mathbf{P}_k^{f}\mathbf{H}_k^{T}\bigl(\mathbf{H}_k\mathbf{P}_k^{f}\mathbf{H}_k^{T} + \mathbf{R}_k \bigr)^{-1}, \]

with the forecast covariance $\mathbf{P}_k^{f}$ estimated from the ensemble.
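A minimal sketch of one stochastic (perturbed-observation) EnKF analysis step follows; it forms $\mathbf{P}_k^{f}\mathbf{H}_k^{T}$ and $\mathbf{H}_k\mathbf{P}_k^{f}\mathbf{H}_k^{T}$ from ensemble anomalies so the full covariance is never built. The function name is an assumption, and localization/inflation are omitted for brevity.

```python
import numpy as np

def enkf_analysis(Xf, y, H, R, rng):
    """Stochastic EnKF analysis; Xf is the (n, Ne) forecast ensemble."""
    n, Ne = Xf.shape
    A = Xf - Xf.mean(axis=1, keepdims=True)   # ensemble anomalies
    HA = H @ A
    PfHt = A @ HA.T / (Ne - 1)                # Pf H^T from anomalies
    S = HA @ HA.T / (Ne - 1) + R              # H Pf H^T + R
    K = PfHt @ np.linalg.inv(S)               # Kalman gain
    # perturbed observations keep the analysis spread statistically consistent
    Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R, Ne).T
    return Xf + K @ (Y - H @ Xf)
```

Deterministic square-root variants replace the perturbed observations with an explicit transform of the anomalies.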

Advantages and limitations

Ensemble methods: naturally parallel, no adjoint required, and provide sample-based uncertainty; but they suffer from sampling error in high dimension (spurious long-range correlations), often requiring localization and inflation, and they are sensitive to model error specification.

Variational methods: scale well with state dimension (no explicit covariance matrices) and can enforce model and constraint structure; but they require tangent-linear/adjoint capabilities and robust optimization, and they can be sensitive to nonconvexity and poor preconditioning.

What this project contributes

  1. Flow-dependent prior/error models for ensemble-based chemistry DA, including autoregressive (AR) background covariance construction and analysis of ensemble size/localization effects.
  2. Scalable variational DA strategies (low-memory smoothing and space-time domain decomposition) for large inverse problems.
  3. Physics-informed covariance modeling for multi-output Gaussian processes and hidden-process inference.
  4. Probabilistic GP forecasting workflows with explicit uncertainty calibration for time-series prediction problems.

Topic 1: Chemical data assimilation with EnKF and flow-dependent priors

Atmospheric chemistry DA is challenging because (i) the state is large and coupled across species, (ii) observations are sparse/heterogeneous, and (iii) background error structure is strongly flow-dependent. In ensemble Kalman workflows, the background covariance plays a central role because it controls how observation increments propagate through the model state.

One approach developed in this work models the background covariance through an autoregressive (AR) structure that captures distance decay and flow dependence. A representative construction is

\[ \mathbf{B} = \mathbf{A}^{-1}\mathbf{S}^2\mathbf{A}^{-T}, \qquad \delta c^{\mathrm{B}}(e) = \mathbf{A}^{-1}\mathbf{S}\,\xi(e), \qquad \xi(e)\sim \mathcal{N}(0,\mathbf{I}), \]

where $\mathbf{S}$ encodes spatially varying standard deviations and $\mathbf{A}$ encodes correlations (parameterized through an AR process). This provides a computationally practical way to inject physically meaningful structure into ensemble initialization and into the implied background statistics.
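As a toy illustration of this construction (a 1-D grid with a first-order AR operator; the helper name and parameterization are illustrative, not the project's actual chemistry implementation), drawing $\delta c^{\mathrm{B}} = \mathbf{A}^{-1}\mathbf{S}\,\xi$ yields samples whose covariance is exactly $\mathbf{B} = \mathbf{A}^{-1}\mathbf{S}^2\mathbf{A}^{-T}$:

```python
import numpy as np

def ar1_perturbations(n, phi, sigmas, n_samples, rng):
    """Draw delta_c = A^{-1} S xi, with A encoding the AR(1) recursion
    z_i = phi * z_{i-1} + xi_i on a 1-D grid of n points."""
    A = np.eye(n) - phi * np.eye(n, k=-1)   # lower-bidiagonal AR(1) operator
    S = np.diag(sigmas)                     # spatially varying std deviations
    xi = rng.standard_normal((n, n_samples))
    return np.linalg.solve(A, S @ xi)       # A^{-1} S xi, no explicit B needed
```

Because only solves with $\mathbf{A}$ are required, the implied $\mathbf{B}$ never has to be stored, which is what makes the construction practical at scale.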

In addition, the chemistry DA studies in this project analyze sensitivity to practical algorithmic choices (ensemble size, localization/inflation strength, and error sources such as emissions and boundary conditions) and explore the use of model singular vectors as physically motivated perturbation directions for ensemble generation.

Chemistry DA: domain, observations, and forecast impact

Examples from an idealized chemistry transport model study: the domain and observation network, and a representative forecast improvement signal after assimilation.

Figure: Chemistry assimilation domain example (SE Asia). Representative domain used in chemistry DA experiments.
Figure: Observation network schematic for chemistry data assimilation. Illustrative observation footprint/network.
Figure: O3 forecast difference with assimilation versus a non-assimilated baseline. Assimilation can reduce forecast error for observed quantities.
Figure: Field snapshot from the chemistry DA study, illustrating spatial structure relevant to assimilation.

Advantages and limitations

Advantages: AR-style priors provide a compact, flow-aware covariance structure that improves ensemble realism and can reduce spurious long-range correlations.

Limitations: covariance modeling is still approximate (especially under strong nonlinearity), and practical EnKF deployments still require careful tuning of localization, inflation, and model error representation.


Topic 2: Variational DA and scalable PDE-constrained inverse problems

In variational DA, the dominant cost often comes from repeatedly solving large linear/nonlinear systems inside an optimization method. This motivates algorithms that reduce memory footprint and expose parallel structure in space and time.

Two representative directions in this project are:

  1. Low-memory best-state estimation for hidden Markov models with model error, which targets large-scale inference where naive smoothing formulations become memory-prohibitive.
  2. Space-time domain decomposition for 4D-Var, which decomposes the inference problem into coupled local subproblems, enabling scalable solvers for large regularized inverse problems.

Advantages and limitations

Advantages: variational formulations provide a principled way to incorporate dynamical constraints and correlated error models, while domain decomposition can expose parallelism and reduce memory pressure.

Limitations: performance hinges on high-quality linear/nonlinear solvers and preconditioners; strong nonlinearity can lead to nonconvex optimization, and adjoint/tangent-linear infrastructure remains a key requirement.


Topic 3: Physics-informed Gaussian processes for multi-output inference

Gaussian processes (GPs) provide a flexible, nonparametric route to inference with uncertainty quantification. A key practical challenge is specifying covariance structure that is both expressive and physically meaningful, especially for multiple outputs (co-kriging) and for settings where some processes are hidden/unobserved.

A physics-informed approach is to define outputs through linear operators acting on a latent GP $f$. If $y_i(x) = \mathcal{L}_i f(x)$ and $f \sim \mathcal{GP}(0,k)$, then the induced cross-covariance is

\[ \mathrm{Cov}\bigl(y_i(x),y_j(x')\bigr) = (\mathcal{L}_i)_x\,(\mathcal{L}_j)_{x'}\,k(x,x'). \]

This construction preserves positive definiteness by design, while enabling covariance models that reflect physical constraints (for example, relationships between pressure and wind derived from geostrophic assumptions).
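The derivative-of-a-latent-GP case is the simplest worked example of this construction. Below, output 1 is $f$ itself and output 2 is $\mathrm{d}f/\mathrm{d}x$, so the cross-covariance is the partial derivative of an RBF base kernel in its second argument (function names are illustrative):

```python
import numpy as np

def rbf(x, xp, ell=1.0):
    """Base kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2))."""
    return np.exp(-0.5 * (x - xp) ** 2 / ell**2)

def cross_cov(x, xp, ell=1.0):
    """Cov(y1(x), y2(x')) with y1 = f and y2 = df/dx:
    apply d/dx' to the second argument of the base kernel."""
    return (x - xp) / ell**2 * rbf(x, xp, ell)
```

Stacking the four blocks $(\mathcal{L}_i)_x(\mathcal{L}_j)_{x'}k$ gives the joint covariance of both outputs, which is positive definite by construction because it is a valid kernel pushed through linear operators.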

In space-time settings, GP models can also fuse deterministic numerical weather prediction (NWP) output with historical measurements to produce probabilistic forecasts and realistic scenario ensembles (with calibrated correlation structure).

Physics-informed covariance structure for multi-output GPs

The figures below move from model ingredient to model behavior: first a physics-induced cross-covariance surface, then a coupled multi-output regression example, and finally a wider space-time correlation comparison for wind scenario generation.

Figure: Physics-informed cross-covariance surface used in a multi-output GP. An off-diagonal cross-covariance surface $K_{12}$ induced by the governing relation between two outputs. This is often more informative than the base kernel itself because it shows how structure is transferred across variables in the coupled GP.
Figure: Multi-output GP regression example with uncertainty bands. The top and bottom panels show two coupled outputs; the dashed curve is the truth, circles mark noisy observations, the dark curve with gray band is an independent fit, and the blue fit with colored band is the physics-coupled regression that transfers information across outputs.
Figure: Space-time covariance comparison for wind scenario modeling. The upper-left triangular block shows empirical correlations from measurements; the fitted blocks compare the full space-time model against simplified alternatives. Diagonal blocks represent temporal correlations at each station, while off-diagonal blocks represent temporal cross-correlations between stations. The full model best preserves cross-station dependence needed for realistic scenarios.

Advantages and limitations

Advantages: operator-based covariance construction is interpretable and embeds physical structure directly into the statistical model.

Limitations: when relationships among outputs are strongly nonlinear, additional closure assumptions may be needed; and scalable GP inference may require approximations for large datasets.


Topic 4: Data-driven prediction with Gaussian processes (MJO forecasting)

Recent work also explores GP models as a data-driven route to probabilistic forecasting of climate variability. For the Madden–Julian Oscillation (MJO), GP models can be calibrated using empirical correlations and then corrected a posteriori to better match forecast uncertainty. This yields both point forecasts and confidence intervals and enables diagnostic analysis of when uncertainty estimates are overconfident or miscalibrated.

In numerical experiments, this strategy improves deterministic prediction skill at short lead times and, through the posterior covariance correction, substantially improves probabilistic coverage at longer lead times.
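A simple way to realize such a correction is per-lead-time moment matching on a validation set: inflate the predictive variance until the mean squared standardized error is one. This is a hedged stand-in for the a posteriori correction described above, not the study's exact scheme, and the function name is illustrative.

```python
import numpy as np

def variance_correction_factor(errors, variances):
    """Per-lead-time variance inflation from validation data.
    errors, variances: arrays of shape (n_cases, n_lead_times) holding
    forecast errors and the GP's predictive variances."""
    z2 = errors**2 / variances          # squared standardized errors
    # inflate so E[z^2] = 1 at each lead time; never deflate below 1
    return np.maximum(z2.mean(axis=0), 1.0)
```

Multiplying the predictive variance at each lead time by the returned factor widens intervals exactly where the validation set shows overconfidence.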

Gaussian process forecasting workflow (MJO)

These panels summarize three parts of the workflow: how the GP is calibrated and variance-corrected, how the RMM dataset is split into train/validation/test periods, and how deterministic and probabilistic forecast quality evolve with lead time.

Figure: Workflow schematic for GP-based MJO prediction. The training set is used to estimate empirical means and covariances, the validation set is used to correct posterior variance as lead time grows, and the test set is used for out-of-sample probabilistic forecasting.
Figure: RMM1 and RMM2 time series used in the study, with the long historical record partitioned into training, validation, and test windows. This split separates covariance calibration from uncertainty correction and final evaluation.
Figure: Forecast skill as a function of lead time. On the left, correlation and RMSE are shown for lag choices of 40 and 60 days against common skill thresholds; on the right, phase and amplitude errors diagnose how the GP forecast drifts and underestimates oscillation strength as lead time increases.

Advantages and limitations

Advantages: GP forecasts are probabilistic by construction and provide uncertainty estimates that can be analyzed and corrected, rather than being purely heuristic.

Limitations: model calibration is sensitive to covariance structure choices and stationarity assumptions; scaling to very large datasets may require sparse/approximate GP techniques.


Software

Several of the methods above are implemented and tested in production-grade scientific computing libraries.

DAPack (data assimilation toolkit): research code for UQ/DA workflows and inverse problems.

PETSc TAO / TSAdjoint (scalable optimization and adjoints): nonlinear optimization and adjoint sensitivity analysis used in variational DA and PDE-constrained inversion.

Funding