Volume 73, Issue 6 e70044
ORIGINAL ARTICLE
Open Access

Dimensionality Reduction in Full-Waveform Inversion Uncertainty Analysis

W. A. Mulder (corresponding author)

Shell Global Solutions International B.V., Den Haag, The Netherlands

Department of Geoscience & Engineering, Faculty of Civil Engineering and Geosciences, Delft University of Technology, Delft, The Netherlands

B. N. Kuvshinov

Shell Global Solutions International B.V., Den Haag, The Netherlands
First published: 09 July 2025
Funding: The authors received no specific funding for this work.

ABSTRACT

The uncertainty of model parameters obtained by full-waveform inversion can be determined from the Hessian of the least-squares error functional. A description of uncertainty characterisation is presented that takes the null space of the Hessian into account and does not rely on the Bayesian formulation. Because the Hessian is generally too costly to compute and too large to be stored, a segmented representation of perturbations of the reconstructed subsurface model in the form of geological units is proposed. This enables the computation of the Hessian and the related covariance matrix on a larger length scale. Synthetic two-dimensional isotropic elastic examples illustrate how conditional and marginal uncertainties can be estimated for the properties per geological unit by themselves and in relation to other units.

1 Introduction

Subsurface model reconstruction from seismic data with full-waveform inversion (FWI) has become a routine approach. To characterise the uncertainty of the model parameters, the classic approach involves a series expansion around the, hopefully, global minimum of the least-squares data misfit functional or one of its many variants (Backus and Gilbert 1970; Tarantola 2005). The locally quadratic cost function around the minimum is described by the Hessian, assuming the cost or loss function is sufficiently smooth. The uncertainty can be quantified as the region in model parameter space where this function stays below a certain threshold value. The latter is determined by the inversion accuracy and noise level of the data. The region is bounded by the ellipsoid where the paraboloid of the quadratic approximation cuts the threshold value. The principal axes of the ellipsoid are the eigenvectors of the Hessian, and the reciprocals of the lengths of the semi-axes are the square roots of the Hessian's eigenvalues.

A typical three-dimensional model for finite-difference modelling requires Gigabytes of storage and the associated Hessian Exabytes. Its direct computation is out of reach, except for smaller problems (Pratt et al. 1998) or small subsets of points (Hak and Mulder 2010; Mulder and Kuvshinov 2025; Plessix and Mulder 2004, for instance). There are many methods that find some approximation to the Hessian, for instance, by using the Lanczos algorithm (Minkoff 1996; Vasco et al. 2003), low-rank approximations (Bui-Thanh et al. 2012; Eckart and Young 1936; Liu and Peter 2019a; 2019b; Riffaud et al. 2024; H. Zhu et al. 2016) and Kalman filtering (Eikrem et al. 2019; Hoffmann et al. 2024; Huang et al. 2020; Thurin et al. 2019). Crude estimates can be obtained from methods that estimate ‘true-amplitude’ weights (Chen and Xie 2015; Rickett 2003; Riyanti et al. 2008, for instance) or checkerboard tests (Inoue et al. 1990; Lévêque et al. 1993). Examples of approaches that estimate the uncertainty without using the Hessian are null-space shuttles (Deal and Nolet 1996; Fichtner and Zunino 2019; Keating and Innanen 2021) and their generalisations (Meju 2009; Vasco 2007), the Markov-chain Monte Carlo method (Barbosa et al. 2020; Ely et al. 2018; Fichtner and van Leeuwen 2015; Guo et al. 2020; Martin et al. 2012; Piana Agostinetti et al. 2015; Ray et al. 2017, among others), the Hamiltonian Monte Carlo method (Betancourt 2018; Duane et al. 1987; Fichtner and Zunino 2019; Revelo Obando 2018; Sen and Biswas 2017; Zhao and Sen 2021), variational inference (Biswas et al. 2023; Izzatullah et al. 2023; Liu and Wang 2016; Wang et al. 2023; Zhang and Curtis 2020), and machine learning (Qu et al. 2024; Rizzuti et al. 2020; Siahkoohi et al. 2023; Sun et al. 2021), to mention just a few. Reviews on the subject can be found in the paper by Rawlinson et al. (2014), geared towards tomography and in the references mentioned above.

In FWI modelling, the grid spacing is always smaller than the characteristic wavelength, with typically 4–5 grid cells per wavelength. If a finite-difference scheme on a uniform grid is used, this value is attained in the parts of the model where the (shear) velocity is smallest, and much larger values are reached elsewhere. As a result, FWI is inherently uncertain at the grid-spacing scale, and small-scale perturbations of the model parameters typically fall within the null space of the full Hessian, having a negligibly small influence on the cost function. For this reason, the use of the full Hessian for perturbations of each model parameter in each point of a finite-difference grid is impractical, even if such a Hessian can be calculated.

To quantify uncertainty, we must assess it at larger scales. If we do not eliminate small scales, the uncertainty becomes formally infinite. However, filtering out small-scale variations makes the analysis subjective, because the final result depends on which scales we choose to eliminate. Thus, the question ‘what is FWI uncertainty?’ can only be meaningfully answered in a relative sense and specifically as FWI uncertainty with respect to a chosen class of perturbations. This issue is evident in the description of FWI uncertainty in terms of the associated Hessian matrix. The Hessian is singular or nearly singular, with eigenvalues spanning many orders of magnitude. The uncertainty is governed by the smallest eigenvalues, which correspond to poorly resolved features. When moving to larger scales, we effectively eliminate small eigenvalues, reducing the uncertainty. However, the final result depends on where we set the threshold. This reinforces the fundamental point: FWI uncertainty is not a single well-defined quantity but must always be considered relative to the scale and nature of perturbations under investigation.

One approach to specify larger subsurface blocks is to segment the model, resulting from FWI, into geological units that define parts of the subsurface of a similar rock type. The model subspace consists of model parameters that are defined as perturbations of the original model and are, in the simplest case, piecewise constant within each unit. This restriction to a lower-dimensional space results in a compression of the Hessian with a smaller size than the original one. The size of a unit influences the uncertainty estimates, and we examine its effect by a number of examples.

From a mathematical point of view, Poincaré's separation theorem (Gradshteyn and Ryzhik 2000, for instance) describes the relation between the eigenvalues of the Hessian before and after projection. Here, the projection is similar to the compression but mapped back to the original space. In this way, the projection operator still acts on the same subspace but is defined as a map from the original space to itself. In the numerical examples, the projection replaces perturbations by their average inside a unit and acts as a spatial high-cut filter removing shorter wavelengths from the model. Our approach is similar to a common operation in the multigrid method, originally designed as an optimal numerical algorithm for solving elliptic partial differential equations by using a sequence of discretisations on grids with different scales (Hackbusch 1985; Mulder 2021). It also bears some similarity to spectral coarse graining of graphs (Gfeller and Rios 2007). Our method involves dimensionality reduction and, in that sense, is similar to other approximation methods, referenced earlier, that do not construct the full Hessian.

Section 2 reviews the basics of the Hessian computation and uncertainty estimation. We treat the minimisation problem as a projection of the observed data on the hypersurface formed by the model range in data space. With a proper choice of the weight in the cost functional, the covariance matrix characterising uncertainties induced by noisy data is proportional to the pseudo-inverse of the Hessian. The covariance matrix characterising uncertainties that appear due to imperfect minimisation is proportional to the square of the pseudo-inverse of the Hessian. Both types of uncertainty can be analysed in the same way, and without loss of generality, we consider uncertainties of the former type only. The Hessian is calculated in the Gauss–Newton approximation. Appendix A explains how to find the Hessian using the adjoint-state method. Appendix B explains how to combine Hessians in the case where several independent datasets are available.

Our approach is different from the standard one, described by, for instance, Tarantola (2005), in two respects. First, we do not rely on the Bayesian formalism nor do we specify a particular shape of the data noise distribution, such as Gaussian. All calculations are made in terms of covariance matrices. This approach is motivated by the fact that, in practice, we want to know the region of uncertainty inside which the model parameters lie within a specified level of confidence, instead of the exact probability for model parameters to have certain prescribed values – which cannot be reliably evaluated anyway. The geometry of the confidence ellipsoid, which represents the uncertainty range, is primarily influenced by the covariance matrix derived from the noise distribution rather than the noise distribution itself. Secondly, our derivations are more general, compared to what can be found in the geophysical literature. The matrices we are dealing with are intrinsically singular, and we therefore use the Moore–Penrose pseudo-inverse rather than the inverse.

The intersection of the confidence ellipsoid with a hyperplane in the model-parameter space provides a lower-dimensional ellipsoid. This smaller ellipsoid is described by the compressed Hessian, which acts only on those components of model vectors that lie in the hyperplane. The compressed Hessian can be constructed by partitioning the original Hessian into two parts, as described in Appendix C. The construction of the compressed Hessian in the general case requires an operator that maps the solution from a fine to a coarse grid. As already mentioned, such an operator is called restriction, and the reverse operator is called prolongation. Section 3 explains how to choose the restriction operator, depending on the desired grouping of the model parameters. In particular, restriction operators can perform averaging over a set of model parameters, corresponding to high-cut filtering in the wavenumber domain.

In Section 4, we consider a series of two-dimensional models, starting with a 2D homogeneous acoustic model for which the Hessian can be found analytically, followed by a horizontally layered 2D isotropic elastic model with a numerically computed Hessian, and another 2D isotropic elastic model. In the examples, we consider a constant relative perturbation of each model parameter prescribed per geological unit. However, the background model obtained by FWI does not have to be constant inside the unit. Also, in the case of finite-difference modelling, grid points belonging to the same unit might be disconnected, even when the unit itself is a connected set: its interior grid points may still exhibit gaps near, for instance, dipped sharp pinch-outs. We examine the effect of the projection on the null-space components and overall uncertainty estimates, both for the conditional and the marginal cases.

2 Framework to Quantify Full-Waveform Inversion Uncertainties

We will analyse the full-waveform inversion (FWI) uncertainty based on the reconstructed model, assuming that FWI has converged to the global minimum, although in the numerical examples, we will instead use synthetic models created for the occasion. The uncertainty of a model parameter is evaluated from the condition that the norm of the perturbations in the modelled data due to model perturbations is the same as the norm of the expected noise in the observed data. This norm of the perturbed modelled data, which by themselves are supposed to be free of noise apart from unavoidable numerical noise, has a quadratic dependence on the model perturbations, and it is characterised by the Hessian. We review the role of the Hessian in uncertainty analysis and explain its relation with the covariance matrix. We then introduce the confidence ellipsoid and explain the geometrical meaning of the conditional and marginal uncertainties. Conditional uncertainties follow from subsets of the Hessian and marginal uncertainties from subsets of its pseudo-inverse, the covariance matrix.

2.1 Sources of Full-Waveform Inversion Inaccuracies

FWI reconstructs a subsurface model parameterised by a vector m ${\mathbf {m}}$ by minimising a cost functional
X ( m ) = 1 2 u ( m ) d obs W 2 , $$\begin{equation} \mathcal {X}({\mathbf {m}}) = \tfrac{1}{2}\Vert {\mathbf {u}}({\mathbf {m}})-{\mathbf {d}}_{\mathrm{obs}}\Vert _{\mathbf {W}}^2, \end{equation}$$ (1)
measuring the difference between modelled data ${\mathbf {u}}({\mathbf {m}})$ and observed data ${\mathbf {d}}_{\mathrm{obs}}$. The modelled data ${\mathbf {u}}$ are found by solving a partial differential equation
L ( v , m ) = f $$\begin{equation} \mathcal {L}({\mathbf {v}}, {\mathbf {m}}) = {\mathbf {f}}\end{equation}$$ (2)
for v ( m ) ${\mathbf {v}}({\mathbf {m}})$ , for a given model m ${\mathbf {m}}$ and source f ${\mathbf {f}}$ , and comparing u = S r v ${\mathbf {u}}={\mathbf {S}}_{\mathrm{r}}{\mathbf {v}}$ with d obs ${\mathbf {d}}_{\mathrm{obs}}$ at the times and positions where the data were acquired, using the sampling operator S r ${\mathbf {S}}_{\mathrm{r}}$ . The L 2 $L_2$ -norm in Equation (1) is defined by the inner product, u ( m ) d obs W 2 = [ u ( m ) d obs ] T W [ u ( m ) d obs ] $\Vert {\mathbf {u}}({\mathbf {m}})-{\mathbf {d}}_{\mathrm{obs}}\Vert ^2_{\mathbf {W}}= [{\mathbf {u}}({\mathbf {m}})-{\mathbf {d}}_{\mathrm{obs}}]^{\scriptscriptstyle \mathsf {T}}{\mathbf {W}}[{\mathbf {u}}({\mathbf {m}})-{\mathbf {d}}_{\mathrm{obs}}]$ , where W ${\mathbf {W}}$ is a weight matrix, accounting for weighting in time, frequency, offset, depth, etc., and the superscript ( · ) T $(\cdot)^{\scriptscriptstyle \mathsf {T}}$ denotes the transpose. The weight matrix W ${\mathbf {W}}$ is positive definite and plays the role of a metric tensor in the data space.

Let ${\mathbf {m}}_0$ be the parameters that represent the, hopefully, global minimum of the cost functional $\mathcal {X}$ in the absence of noise. The corresponding noiseless observed data are denoted by ${\mathbf {d}}_0$. Levels of constant $\tfrac{1}{2}\Vert {\mathbf {d}}_{\mathrm{obs}}-{\mathbf {d}}_0 \Vert ^2_{\mathbf {W}}$ form a family of nested ellipsoids in the data space. The model range, that is, all possible values of ${\mathbf {u}}({\mathbf {m}})={\mathbf {S}}_{\mathrm{r}}{\mathbf {v}}({\mathbf {m}})$ with ${\mathbf {v}}$ satisfying Equation (2), forms a hypersurface in the full data space. This hypersurface is tangent to one of the ellipsoids at the point ${\mathbf {u}}_0 ={\mathbf {u}}({\mathbf {m}}_0)$ that is obtained by projecting ${\mathbf {d}}_0$ on the model range ${\mathbf {u}}({\mathbf {m}})$. The remainder $\mathcal {X}({\mathbf {m}}_0) = \tfrac{1}{2}\Vert {\mathbf {u}}({\mathbf {m}}_0) -{\mathbf {d}}_0 \Vert ^2_{\mathbf {W}}$ involves data that the modelling operator ${\mathbf {u}}({\mathbf {m}})$ cannot explain. The inverted model parameters might differ from ${\mathbf {m}}_0$ due to reconstruction errors. Another reason is that actual observed data ${\mathbf {d}}_{\mathrm{obs}}$ contain ambient and instrumental noise, and their projection ${\mathbf {u}}_{\mathrm{obs}}$ on the model range is not the same as ${\mathbf {u}}_0$, as Figure 1 illustrates.

Figure 1.
Ellipses show points in the data space that are equidistant from noiseless data ${\mathbf {d}}_0$ with respect to some metric ${\mathbf {W}}$. The cost functional (1) with noiseless observed data ${\mathbf {d}}_0$ is minimised by projecting ${\mathbf {d}}_0$ on the range of the forward operator ${\mathbf {u}}$, which gives the least-squares solution ${\mathbf {u}}_0$ of the inversion problem for model parameters ${\mathbf {m}}_0$. The ellipse passing through ${\mathbf {u}}_0$ is tangent to the model hypersurface ${\mathbf {u}}$. If the observed data ${\mathbf {d}}_{\mathrm{obs}}$ contain noise, the solution of the minimisation problem shifts to the point ${\mathbf {u}}_{\mathrm{obs}}$. That point can be estimated by projecting ${\mathbf {d}}_{\mathrm{obs}}$ on the linearised model range, which is shown by the red dashed line and is given by ${\mathbf {u}}={\mathbf {u}}_0+{\mathbf {F}}({\mathbf {m}}-{\mathbf {m}}_0)$, with ${\mathbf {F}}= \nabla _{\mathbf {m}}{\mathbf {u}}({\mathbf {m}}_0)$ the Fréchet derivative of the data modelling operator.

2.2 Hessian and Covariance

In what follows, we consider the inversion problem in the vicinity of u 0 ${\mathbf {u}}_0$ , where the model can be linearised as u = u 0 + F ( m m 0 ) ${\mathbf {u}}= {\mathbf {u}}_0 + {\mathbf {F}}({\mathbf {m}}- {\mathbf {m}}_0)$ . Here, F = m u ( m 0 ) ${\mathbf {F}}= \nabla _{\mathbf {m}}{\mathbf {u}}({\mathbf {m}}_0)$ is the Fréchet derivative of the modelling operator, which can be computed by perturbing model parameters individually and recording the corresponding data perturbations for all shots and receivers. The cost functional and its derivative expand in the Taylor series around m 0 ${\mathbf {m}}_0$ as
X = X 0 + 1 2 ( m m 0 ) T H ( m m 0 ) , $$\begin{equation} \mathcal {X}= \mathcal {X}_0+ \tfrac{1}{2}({\mathbf {m}}- {\mathbf {m}}_0)^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}({\mathbf {m}}- {\mathbf {m}}_0), \end{equation}$$ (3)
m X = H ( m m 0 ) , $$\begin{equation} \nabla _{\mathbf {m}}\mathcal {X}= {\mathbf {H}}({\mathbf {m}}- {\mathbf {m}}_0), \end{equation}$$ (4)
where H ${\mathbf {H}}$ is the Hessian, describing the second derivatives of the cost functional with respect to the model parameters. In the Gauss–Newton approximation, the Hessian follows from weighted dot products of the full datasets for each perturbation, summing over all receivers for all sources, and it is equal to H = F T W F ${\mathbf {H}}= {\mathbf {F}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {W}}{\mathbf {F}}$ .

Appendix A describes the adjoint state method for calculating the Hessian in the more general case. For a given value of m X $\nabla _{\mathbf {m}}\mathcal {X}$ , the minimum-norm solution of Equation (4) is δ m = H m X $\delta {\mathbf {m}}= {\mathbf {H}}^\dagger \nabla _{\mathbf {m}}\mathcal {X}$ , where δ m = m m 0 $\delta {\mathbf {m}}= {\mathbf {m}}- {\mathbf {m}}_0$ and the superscript $\dagger$ denotes the Moore–Penrose pseudo-inverse. If the misfit function only depends on the data, possibly by ignoring penalty terms, the Fréchet derivative of the misfit function with respect to the model can be factored into u X $\nabla _{\mathbf {u}}\mathcal {X}$ and the F = m u ${\mathbf {F}}=\nabla _{\mathbf {m}}{\mathbf {u}}$ used here. The first will be required as input for the reverse-time part of an adjoint-state gradient computation in FWI, and should therefore be available. With these building blocks, the implementation of our method should be straightforward for more general cost functions that involve time adaptivity (Bharadwaj et al. 2016; Bozdaǧ et al. 2011; Jiao et al. 2015; van Leeuwen and Mulder 2008; Warner and Guasch 2016) or optimal transport (Engquist and Yang 2022; Métivier et al. 2018).

If random vectors Y ${\bf Y}$ and X ${\bf X}$ are related linearly as Y = L X ${\bf Y} = {\bf L} {\bf X}$ , their covariance matrices C X ${\mathbf {C}}_\mathrm{X}$ and C Y ${\mathbf {C}}_\mathrm{Y}$ satisfy the equation C Y = L C X L T ${\mathbf {C}}_\mathrm{Y} = {\bf L} {\mathbf {C}}_\mathrm{X} {\bf L}^{\scriptscriptstyle \mathsf {T}}$ . If the gradient of the cost functional does not vanish exactly for the inverted model parameters but has a random distribution with covariance matrix C X ${\mathbf {C}}_\mathcal {X}$ , the deviation δ m $\delta {\mathbf {m}}$ of the inverted model parameters from m 0 ${\mathbf {m}}_0$ has a random distribution with the covariance matrix
C m , X = ( H ) T C X H = σ X 2 ( H ) 2 = σ X 2 ( H 2 ) . $$\begin{equation} {\mathbf{C}}_{\mathrm{m},\mathcal{X}}={({\mathbf{H}}^{\ensuremath{\dag}})}^{\mathrm{T}}{\mathbf{C}}_{\mathcal{X}}{\mathbf{H}}^{\ensuremath{\dag}}={\sigma}_{\mathcal{X}}^{2}{({\mathbf{H}}^{\ensuremath{\dag}})}^{2}={\sigma}_{\mathcal{X}}^{2}{({\mathbf{H}}^{2})}^{\ensuremath{\dag}}. \end{equation}$$ (5)
Here, we have taken into account that the Hessian is symmetric and assumed that C X ${\mathbf {C}}_\mathcal {X}$ is a diagonal matrix with all elements equal to σ X 2 $\sigma _\mathcal {X}^2$ . We represent the weight matrix W ${\mathbf {W}}$ in the form W = V T V ${\mathbf {W}}= {\mathbf {V}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {V}}$ and use the normalisations u ̂ = V u $\hat{{\mathbf {u}}} = {\mathbf {V}}{\mathbf {u}}$ , d ̂ = V d $\hat{{\mathbf {d}}} = {\mathbf {V}}{\mathbf {d}}$ . The normalised data space has the Euclidean metric, where the L 2 $L_2$ norm is defined as u d 2 = ( u ̂ d ̂ ) T ( u ̂ d ̂ ) $\Vert {{\mathbf {u}}}-{\mathbf {d}}\Vert ^2 = (\hat{{\mathbf {u}}}-\hat{{\mathbf {d}}})^{\scriptscriptstyle \mathsf {T}}(\hat{{\mathbf {u}}}- \hat{{\mathbf {d}}})$ . The normalised difference between the inverted modelled data with and without noise δ u ̂ obs = V δ d obs $\delta {\hat{{\mathbf {u}}}}_{\mathrm{obs}}= {\mathbf {V}}\delta {\mathbf {d}}_{\mathrm{obs}}$ , where δ d obs = u obs u 0 $\delta {\mathbf {d}}_{\mathrm{obs}}={\mathbf {u}}_{\mathrm{obs}}- {\mathbf {u}}_0$ , is equal to the normal projection of δ d ̂ obs = V δ d obs $\delta {\hat{{\mathbf {d}}}}_{\mathrm{obs}}= {\mathbf {V}}\delta {\mathbf {d}}_{\mathrm{obs}}$ on the range of matrix F ̂ = V F $\hat{{\mathbf {F}}} = {\mathbf {V}}{\mathbf {F}}$ . Taking into account that the normal projection operator on the range of F ̂ $\hat{{\mathbf {F}}}$ is equal to F ̂ F ̂ $\hat{{\mathbf {F}}}\,\hat{{\mathbf {F}}}^\dagger$ , we obtain δ u ̂ obs = F ̂ F ̂ δ d ̂ obs $\delta {\hat{{\mathbf {u}}}}_{\mathrm{obs}}= \hat{{\mathbf {F}}}\, \hat{{\mathbf {F}}}^\dagger \delta {\hat{{\mathbf {d}}}}_{\mathrm{obs}}$ or F ̂ δ m = F ̂ F ̂ δ d ̂ obs $\hat{{\mathbf {F}}} \delta {\mathbf {m}}= \hat{{\mathbf {F}}}\,\hat{{\mathbf {F}}}^\dagger \delta {\hat{{\mathbf {d}}}}_{\mathrm{obs}}$ . The minimum norm solution of this equation is δ m = F ̂ F ̂ F ̂ δ d ̂ obs = F ̂ δ d ̂ obs = ( V F ) V δ d obs $\delta {\mathbf {m}}= \hat{{\mathbf {F}}}^\dagger \hat{{\mathbf {F}}}\, \hat{{\mathbf {F}}}^\dagger \delta {\hat{{\mathbf {d}}}}_{\mathrm{obs}}= \hat{{\mathbf {F}}}^\dagger \delta {\hat{{\mathbf {d}}}}_{\mathrm{obs}}= ({\mathbf {V}}{\mathbf {F}})^\dagger {\mathbf {V}}\delta {\mathbf {d}}_{\mathrm{obs}}$ . Similarly to Equation (5), we conclude that in the case where δ d obs $\delta {\mathbf {d}}_{\mathrm{obs}}$ is distributed with the covariance matrix C d ${\mathbf {C}}_{\mathrm{d}}$ , the value δ m $\delta {\mathbf {m}}$ is distributed with a covariance matrix
C m , d = V F V C d V T V F T . $$\begin{equation} {\mathbf {C}}_{\mathrm{m,d}}= {\left({{\mathbf {V}}} {{{\mathbf {F}}}}\right)}^\dagger \nobreakspace {{\mathbf {V}}} {\mathbf {C}}_{\mathrm{d}}{{\mathbf {V}}}^{\scriptscriptstyle \mathsf {T}}\nobreakspace {{\left({{\mathbf {V}}} {{{\mathbf {F}}}}\right)}^\dagger }^{\scriptscriptstyle \mathsf {T}}. \end{equation}$$ (6)
If the reconstruction inaccuracies due to variations of m X $\nabla _{\mathbf {m}}\mathcal {X}$ and noise of the observed data d obs ${\mathbf {d}}_{\mathrm{obs}}$ do not correlate with each other, the value δ m $\delta {\mathbf {m}}$ is distributed with a covariance matrix equal to the sum of covariances C m , X ${\mathbf {C}}_\mathrm{m, \mathcal {X}}$ and C m , d ${\mathbf {C}}_{\mathrm{m,d}}$ .

According to Aitken's (1935) generalised least-squares method, a best unbiased estimator is obtained when the weight matrix W ${\mathbf {W}}$ in the cost function is chosen in such a way that V C d V T = σ d 2 I ${{\mathbf {V}}} {\mathbf {C}}_{\mathrm{d}}{{\mathbf {V}}}^{\scriptscriptstyle \mathsf {T}}= \sigma _{\mathrm{d}}^2 {\bf I}$ , where I ${\mathbf {I}}$ is the identity matrix, and σ d $\sigma _{\mathrm{d}}$ is the proportionality coefficient. This condition is satisfied if C d ${\mathbf {C}}_{\mathrm{d}}$ is invertible and V = σ d C d 1 / 2 ${{\mathbf {V}}} = \sigma _{\mathrm{d}} {\mathbf {C}}_{\mathrm{d}}^{-1/2}$ . With this choice, we obtain C m , d = σ d 2 V F [ V F ] T = σ d 2 ( V F ) T ( V F ) = σ d 2 ( F T W F ) = σ d 2 H ${\mathbf {C}}_{\mathrm{m,d}}= \sigma _{\mathrm{d}}^2 \left({{\mathbf {V}}} {{\mathbf {F}}}\right)^\dagger [\left({{\mathbf {V}}} {{\mathbf {F}}}\right)^\dagger]^{\scriptscriptstyle \mathsf {T}}= \sigma _{\mathrm{d}}^2 \left[({{\mathbf {V}}} {{\mathbf {F}}})^{\scriptscriptstyle \mathsf {T}}({{\mathbf {V}}} {{\mathbf {F}}})\right]^\dagger = \sigma _{\mathrm{d}}^2 ({{\mathbf {F}}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {W}}{{\mathbf {F}}})^\dagger = \sigma _{\mathrm{d}}^2 {\mathbf {H}}^\dagger$ .

In what follows, we assume that the main contribution to FWI uncertainties comes from the noise in the observed data so that the covariance matrix is proportional to H ${\mathbf {H}}^\dagger$ . The case where the uncertainties are mostly due to inaccurate inversion can be analysed in the same way, using H 2 ${\mathbf {H}}^2$ instead of H ${\mathbf {H}}$ .
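To make the preceding relations concrete, the following minimal Python/NumPy sketch forms the Gauss–Newton Hessian and the associated covariance for a toy problem in which the Fréchet derivative can be stored explicitly; the sizes, the random matrix standing in for the Fréchet derivative and the noise level sigma_d are illustrative assumptions, not values from this paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: D data samples and n model parameters, small enough to store F.
D, n = 200, 10
F = rng.standard_normal((D, n))    # stand-in for the Frechet derivative
W = np.eye(D)                      # weight matrix (metric tensor in data space)

# Gauss-Newton Hessian H = F^T W F.
H = F.T @ W @ F

# Covariance of the model perturbations induced by data noise,
# C_m,d = sigma_d^2 H^+; the Moore-Penrose pseudo-inverse handles a
# possible null space of H.
sigma_d = 0.05
C_m = sigma_d**2 * np.linalg.pinv(H)
print(C_m.shape)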

2.3 Confidence Ellipsoid

Let H = U S U T ${\mathbf {H}}= {\mathbf {U}}{\mathbf {S}}{\mathbf {U}}^{\scriptscriptstyle \mathsf {T}}$ be the singular value decomposition of the Hessian, with singular values defined by the vector s ${\mathbf {s}}$ and S = diag ( s ) ${\mathbf {S}}=\mathrm{diag}({{\mathbf {s}}})$ , so that δ m $\delta {\mathbf {m}}$ is distributed with the covariance matrix C m = σ d 2 U S U T ${\mathbf {C}}_{\mathrm{m}} = \sigma _{\mathrm{d}}^2 {\mathbf {U}}{\mathbf {S}}^\dagger {\mathbf {U}}^{\scriptscriptstyle \mathsf {T}}$ . If H ${\mathbf {H}}$ has M $M$ non-zero singular values, the vectors s ${\mathbf {s}}$ and
δ m ̂ = σ d 1 S 1 / 2 U T δ m $$\begin{equation} \delta \hat{{\mathbf {m}}} = \sigma _{\mathrm{d}}^{-1} {\mathbf {S}}^{1/2} {\mathbf {U}}^{\scriptscriptstyle \mathsf {T}}\delta {\mathbf {m}}\end{equation}$$ (7)
have M $M$ non-zero components (or components with amplitude exceeding the numerical accuracy), which are distributed with a unit covariance matrix. Consider the scalar ζ = δ m T H δ m / σ d 2 = δ m ̂ T δ m ̂ = δ m ̂ 1 2 + + δ m ̂ M 2 $\zeta = {\delta {\mathbf {m}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}\,\delta {\mathbf {m}}} / \sigma _{\mathrm{d}}^2 = \delta \hat{{\mathbf {m}}}^{\scriptscriptstyle \mathsf {T}}\delta \hat{{\mathbf {m}}} = \delta \hat{m}_1^2 + \cdots \ + \delta \hat{m}_M^2$ . By construction, ζ $\zeta$ is the sum of squares of M $M$ independent random variables with zero average values and unit standard deviations σ j 2 = Var ( δ m ̂ j ) = δ m ̂ j 2 δ m ̂ j 2 = 1 $ \sigma _j^2 = {\rm Var} (\delta \hat{m}_j) = \left\langle \delta \hat{m}_j^2 \right\rangle - \left\langle \delta \hat{m}_j \right\rangle ^2 = 1$ , where the angular brackets denote averaging. Individual squares are distributed with the variances Var ( δ m ̂ j 2 ) = δ m ̂ j 4 δ m ̂ j 2 2 = κ σ j 4 ${\rm Var} (\delta \hat{m}_j^2) = \left\langle \delta \hat{m}_j^4 \right\rangle - \left\langle \delta \hat{m}_j^2 \right\rangle ^2 = \kappa \sigma _j^4$ . The proportionality coefficient κ $\kappa$ in the above equation depends on the actual distribution of the δ m ̂ j $\delta \hat{m}_j$ . Given the large number of model parameters, the central limit theorem implies that ζ $\zeta$ is distributed normally with mean value M $M$ and standard deviation ( κ M ) 1 / 2 $(\kappa M)^{1/2}$ . The probability that ζ $\zeta$ does not exceed the value ζ c $\zeta _c$ equals
p ( ζ ζ c ) = 1 2 1 + erf α ( ζ c / M ) 1 , α = M / ( 2 κ ) , $$\begin{equation} p(\zeta \le \zeta _c) = \tfrac{1}{2}{\left[ 1 + {\mathrm{erf}}\!{\left(\alpha {\left\lbrace ({\zeta _c}/{M})-1\right\rbrace} \right)} \right]},\ \ \alpha = \sqrt { M / (2 \kappa)}, \end{equation}$$ (8)
where erf ( · ) ${\rm erf(\cdot)}$ is the error function. For a normal distribution, where δ m ̂ j 4 = 3 σ j 4 $\left\langle \delta \hat{m}_j^4 \right\rangle = 3 \sigma _j^4$ and κ = 2 $\kappa = 2$ , Equation (8) follows from the chi-squared test in the limit M 1 $M \gg 1$ .

Figure 2 illustrates the behaviour of $p(\zeta \le \zeta _c)$ and shows that Equation (8) provides a good approximation to the exact distribution if $\alpha \ge 5$. The function $p(\zeta \le \zeta _c)$ exhibits a rapid transition when crossing the point $\zeta _c = M$, which becomes sharper with increasing $M$. For typical subsurface modelling, the solution of the equation $p(\zeta \le \zeta _c) = p_c$ can be approximated by $\zeta _c \simeq M$ for any $p_c$ that is not too close to 0 or 1. This happens because of the high dimensionality of the problem: if points are randomly distributed inside an $M$-dimensional ellipsoid with $M \gg 1$, most of them are located near the ellipsoid boundary.
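As a small numerical check of Equation (8), the sketch below compares the approximation with the chi-squared cumulative distribution; the choice M = 100 mirrors Figure 2, and SciPy is assumed to be available only for the reference curve of the normally distributed case.

import numpy as np
from math import erf, sqrt
from scipy.stats import chi2   # reference curve for normally distributed noise

def p_approx(zeta_c, M, kappa=2.0):
    # Equation (8); kappa = 2 corresponds to a normal distribution.
    alpha = sqrt(M / (2.0 * kappa))
    return 0.5 * (1.0 + erf(alpha * (zeta_c / M - 1.0)))

M = 100
for zc in np.linspace(0.5 * M, 1.5 * M, 5):
    print(f"zeta_c/M = {zc / M:4.2f}  Eq.(8): {p_approx(zc, M):6.4f}"
          f"  chi-squared CDF: {chi2.cdf(zc, M):6.4f}")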

Figure 2.
Solid lines show how the probability $p(\zeta \le \zeta _c)$ depends on $\zeta _c / M$ for $\alpha = 5, 10, 50$. For normal distributions, $p(\zeta \le \zeta _c) = \int _0^{\zeta _c} \chi _M^2(\zeta)\, {\text{d}}\zeta$, where $\chi _M^2(\zeta) = \zeta ^{M/2-1} e^{-\zeta /2} / [2^{M/2} \Gamma (M/2)]$ is the chi-squared distribution. The cumulative distribution function for $\chi ^2_M$ with $M = 100$ is shown by open circles, and it is well approximated by Equation (8) with $\alpha = 5$.
Let $D$ be the number of data samples used for FWI. The energy of the noise present in the data is $\tfrac{1}{2}\delta {\mathbf {d}}^{\scriptscriptstyle \mathsf {T}}\delta {\mathbf {d}}= \tfrac{1}{2}\sigma _{\mathrm{d}}^2 D$. Using this relation, we write the condition $\zeta \le \zeta _c \simeq M$ as
1 2 δ m T H δ m ε 0 , ε 0 = ε ( M / D ) E , $$\begin{equation} \tfrac{1}{2}\delta {\mathbf {m}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}\,\delta {\mathbf {m}}\le \varepsilon _0,\quad \varepsilon _0= \epsilon (M/D) {\mathcal {E}}, \end{equation}$$ (9)
where E = d T d / 2 ${\mathcal {E}} = {\mathbf {d}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {d}}/2$ is the energy of the measured signal and ε = δ d T δ d / d T d $\epsilon = \delta {\mathbf {d}}^{\scriptscriptstyle \mathsf {T}}\delta {\mathbf {d}}/ {\mathbf {d}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {d}}$ is the ratio of the noise energy to the signal energy. Inequality (9) can be interpreted as follows. If the noise is distributed uniformly over the data space, its energy per degree of freedom is equal to ε E / D $\epsilon {\mathcal {E}} / D$ . Since the modelled data form a subspace of dimension M $M$ in the data space, the energy of noise that is projected on the range of the linearised modelling operator (dashed line in Figure 1) is equal to ε 0 = ε ( M / D ) E $\varepsilon _0= \epsilon (M / D) {\mathcal {E}}$ . This part of the noise introduces uncertainty in reconstructed model parameters, which is described by condition (9). The remaining noise lies in the space complementary to the model range, and it changes the minimal residual of the cost function X X 0 $\mathcal {X}-\mathcal {X}_0$ , but not values of the model parameters where the residual is minimised.

Inequality (9) defines a confidence ellipsoid in the model parameter space, which contains viable solutions to the minimisation problem. The confidence ellipsoid provides a complete characterisation of uncertainty in the linear approximation. However, even if the Hessian is known, analysis of the corresponding ellipsoid is difficult because of the high dimensionality of the model space. This dimensionality can be reduced by projecting the ellipsoid on specific hyperplanes in the parameter space and by considering its intersections with these hyperplanes. Figure 3 illustrates the procedure. We introduce axes δ m 1 , δ m 2 , $\delta m_1, \delta m_2, \ldots$ in the parameter space. The components of a vector along these axes are equal to perturbations of the corresponding model parameter. The set of axes δ m 2 , δ m 3 , $\delta m_2, \delta m_3, \ldots$ that does not include the axis δ m 1 $\delta m_1$ is denoted by δ m 2 $\delta {\mathbf {m}}_2$ . Points lying on axis δ m 1 $\delta m_1$ satisfy the condition δ m 2 = 0 $\delta {\mathbf {m}}_2 = {\mathbf {0}}$ . The range between points A $A$ and B $B$ , where the axis δ m 1 $\delta m_1$ intersects the ellipsoid, represents the conditional uncertainty of the parameter δ m 1 $\delta m_1$ , that is, the uncertainty range of δ m 1 $\delta m_1$ under the condition that all the other model parameters are fixed.

The conditional uncertainty of δ m 1 $\delta m_1$ is described by the inequality δ m 1 H 11 δ m 1 / 2 ε 0 $\delta m_1 H_{11} \delta m_1 / 2 \le \varepsilon _0$ . It provides the smallest uncertainty bound for this parameter: | δ m 1 | ( 2 ε 0 / H 11 ) 1 / 2 $|\delta {m}_{1}|\le {(2{\varepsilon}_{0}/{H}_{11})}^{1/2}$ .

The largest uncertainty bound – the marginal uncertainty – describes the case where changes in the cost function associated with a given model parameter are maximally compensated by varying other model parameters. The marginal uncertainty is obtained by projecting the ellipsoid on the parameter axis considered. In the example illustrated by Figure 3, the marginal uncertainty of parameter m 1 $m_1$ is the range between points C $C$ and D $D$ , where the lines (actually, hyperplanes) δ m 1 = const $\delta m_1 = \text{const}$ . are tangent to the ellipse. At point E $E$ , δ m 1 $\delta m_1$ reaches its largest value.

In general, we can consider the problem of finding $\mu _i = \max (\delta m_i)$ subject to $\psi =\tfrac{1}{2}\delta {\mathbf {m}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}\,\delta {\mathbf {m}}=\varepsilon _0$. The Lagrangian $\mathcal {L}(\delta {\mathbf {m}},\lambda)=\delta m_i-\lambda (\psi -\varepsilon _0)$ has the derivative $\partial \mathcal {L}/\partial {\delta m_j}=\delta _{ij}-\lambda ({\mathbf {H}}\delta {\mathbf {m}})_j$, which should vanish. Defining ${\mathbf {v}}={\mathbf {H}}\delta {\mathbf {m}}$, this leads to $v_j=\delta _{ij}/\lambda$ and $\delta m_i=({\mathbf {C}}{\mathbf {v}})_i=c_{ii}/\lambda$, for the pseudo-inverse ${\mathbf {C}}={\mathbf {H}}^\dagger$. Then, $\psi =\tfrac{1}{2}\delta {\mathbf {m}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {v}}=\tfrac{1}{2}\delta m_i/\lambda =\tfrac{1}{2}c_{ii}/\lambda ^2=\varepsilon _0$. The positive solution for $\lambda$ yields $\mu _i=(2\varepsilon _0 c_{ii})^{1/2}$ for the maximum of $\delta m_i$, and the negative solution produces the minimum. The resulting marginal uncertainty range is $|\delta m_i|\le [2\varepsilon _0({\mathbf {H}}^\dagger)_{ii}]^{1/2}$.
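The two bounds can be evaluated directly from the Hessian and its pseudo-inverse; in the sketch below, the small symmetric matrix and the threshold eps0 are merely illustrative.

import numpy as np

def conditional_marginal_bounds(H, eps0):
    # Half-widths of the confidence ellipsoid 0.5 * dm^T H dm <= eps0:
    # conditional: |dm_i| <= sqrt(2*eps0 / H_ii)       (other parameters fixed)
    # marginal:    |dm_i| <= sqrt(2*eps0 * (H^+)_ii)   (other parameters free)
    C = np.linalg.pinv(H)
    cond = np.sqrt(2.0 * eps0 / np.diag(H))
    marg = np.sqrt(2.0 * eps0 * np.diag(C))
    return cond, marg

H = np.array([[4.0, 1.5, 0.5],
              [1.5, 3.0, 1.0],
              [0.5, 1.0, 2.0]])
cond, marg = conditional_marginal_bounds(H, eps0=1.0)
print("conditional:", cond)   # smallest bound per parameter
print("marginal:   ", marg)   # largest bound per parameter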

Instead of extracting a single model parameter m 1 $m_1$ , one can split the vector m ${\mathbf {m}}$ into two complementary parts m = ( m 1 , m 2 ) T ${\mathbf {m}}= ({\mathbf {m}}_1, {\mathbf {m}}_2)^{\scriptscriptstyle \mathsf {T}}$ . The Hessian is partitioned accordingly as
H = H 11 H 12 H 21 H 22 $$\begin{equation} {\mathbf {H}}= \def\eqcellsep{&}\begin{pmatrix} {\mathbf {H}}_{11} & {\mathbf {H}}_{12}\\ {\mathbf {H}}_{21} & {\mathbf {H}}_{22}\\ \end{pmatrix} \end{equation}$$ (10)
and acts on ( δ m 1 , δ m 2 ) T $(\delta {\mathbf {m}}_1,\delta {\mathbf {m}}_2)^{\scriptscriptstyle \mathsf {T}}$ . As is explained in Appendix C, the action of the Hessian on the vectors can then be represented by a sum of two terms, given by Equation (C.6). The Hessian block H 11 ${\bf H}_{11}$ describes the conditional uncertainties of the parameters m 1 ${\mathbf {m}}_1$ , assuming the parameters m 2 ${\mathbf {m}}_2$ are fixed. The combination of Hessian blocks H 22 H 21 H 11 H 12 ${\bf H}_{22} - {\bf H}_{21} {\bf H}_{11}^\dagger {\bf H}_{12}$ describes the marginal uncertainty of the parameters m 2 ${\mathbf {m}}_2$ , where one sets δ m 1 = H 11 H 12 δ m 2 $\delta {\mathbf {m}}_1 = - {\mathbf {H}}_{11}^{\dagger } {\mathbf {H}}_{12} \delta {\mathbf {m}}_2$ to compensate for changes in the cost function due to variations of parameters m 2 ${\mathbf {m}}_2$ .
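A short sketch, with a random symmetric positive-definite matrix standing in for the Hessian, verifies that the compensating choice of the first block of perturbations reduces the quadratic form to the Schur complement acting on the second block.

import numpy as np

rng = np.random.default_rng(1)

# Random symmetric positive-definite stand-in for the Hessian, partitioned in blocks.
A = rng.standard_normal((5, 5))
H = A @ A.T + 5.0 * np.eye(5)
n1 = 2
H11, H12 = H[:n1, :n1], H[:n1, n1:]
H21, H22 = H[n1:, :n1], H[n1:, n1:]

# Schur complement governing the marginal uncertainty of the m_2 block.
S = H22 - H21 @ np.linalg.pinv(H11) @ H12

# For any dm2, the compensating choice dm1 = -H11^+ H12 dm2 leaves the cost
# 0.5 * dm2^T S dm2.
dm2 = rng.standard_normal(5 - n1)
dm1 = -np.linalg.pinv(H11) @ H12 @ dm2
dm = np.concatenate([dm1, dm2])
print(0.5 * dm @ H @ dm, 0.5 * dm2 @ S @ dm2)   # the two numbers agree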
Figure 3.
The conditional uncertainty range of the parameter m 1 $m_1$ is the interval between points A $A$ and B $B$ , where the δ m 1 $\delta m_1$ -axis crosses the confidence ellipsoid. The marginal uncertainty range of parameter m 1 $m_1$ is the interval between points C $C$ and D $D$ , which is the projection on the δ m 1 $\delta m_1$ -axis of the confidence ellipsoid. The conditional and marginal uncertainty ranges of δ m 1 $\delta m_1$ are proportional to 1 / h 1 , 1 $\sqrt {1 / h_{1,1}}$ and to h 1 , 1 $\sqrt {h_{1,1}^\dagger }$ , respectively.

3 Restriction and Compression

3.1 General Formalism

Instead of a model with $n$ parameters, we can consider its restriction to a subspace with dimension $r<n$. An arbitrary restriction operator can be represented by an $r \times n$ matrix ${\mathbf {Q}}$, whose rows ${\mathbf {q}}_j$ with $j = 1, 2, \ldots, r$ are linearly independent vectors in the $n$-dimensional parameter space. The action of the Hessian on those components of ${\mathbf {m}}$ that do not lie in the space spanned by the vectors ${\mathbf {q}}_j$, that is, the range $\mathcal {R}({\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}})$ of ${\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}$, is ignored. This is equivalent to the condition that the components of the model parameter vectors ${\mathbf {m}}$ lying in the null space $\mathcal {N}({\mathbf {Q}})$ of ${\mathbf {Q}}$ are kept fixed.

Theorem 1. The projected Hessian ${\mathbf {H}}_\mathrm{p}= {\mathbf {P}}{\mathbf {H}}{\mathbf {P}}$ and the compressed Hessian ${\mathbf {H}}_\mathrm{c}={\mathbf {Q}}{\mathbf {H}}{\mathbf {Q}}^\dagger$ have the same eigenvalues. The corresponding eigenvectors can be mapped to each other by ${\mathbf {Q}}^\dagger$ and ${\mathbf {Q}}$, respectively, apart from null-space components.

The n × n $n\times n$ orthogonal projection operator P ${\mathbf {P}}$ , defined on the subspace with dimension r < n $r<n$ spanned by the row-vectors of Q ${\mathbf {Q}}$ , is equal to P = Q Q ${\mathbf {P}}= {\mathbf {Q}}^\dagger {\mathbf {Q}}$ .

The n × n $n \times n$ projected Hessian H p ${\mathbf {H}}_\mathrm{p}$ is defined by the condition that its action on parameter vectors m ${\mathbf {m}}$ is the same as the action of the original Hessian H ${\mathbf {H}}$ on projected parameter vectors P m ${\mathbf {P}}{\mathbf {m}}$ , that is, m T H p m = ( P m ) T H P m ${\mathbf {m}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}_\mathrm{p}{\mathbf {m}}= ({\mathbf {P}}{\mathbf {m}})^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}{\mathbf {P}}{\mathbf {m}}$ . This requirement implies H p = Q Q H Q Q = P H P ${\mathbf {H}}_\mathrm{p}= {\mathbf {Q}}^\dagger {\mathbf {Q}}{\mathbf {H}}{\mathbf {Q}}^\dagger {\mathbf {Q}}={\mathbf {P}}{\mathbf {H}}{\mathbf {P}}$ .

We also introduce the r × r $r \times r$ compressed Hessian H c ${\mathbf {H}}_\mathrm{c}$ such that m c T H c m c = m T H p m = m c T Q H Q m c ${\mathbf {m}}_\mathrm{c}^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}_\mathrm{c} {\mathbf {m}}_\mathrm{c} = {\mathbf {m}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}_\mathrm{p} {\mathbf {m}}= {\mathbf {m}}_\mathrm{c}^{\scriptscriptstyle \mathsf {T}}{\mathbf {Q}}{\mathbf {H}}{\mathbf {Q}}^\dagger {\mathbf {m}}_\mathrm{c}$ , where m c ${\mathbf {m}}_\mathrm{c}$ is an r $r$ -dimensional vector defined by the equation  m c = Q m ${\mathbf {m}}_\mathrm{c}= {\mathbf {Q}}{\mathbf {m}}$ . The minimum-norm solution of the above equation is m = Q m c ${\mathbf {m}}= {\mathbf {Q}}^\dagger {\mathbf {m}}_\mathrm{c}$ , which shows that Q ${\mathbf {Q}}^\dagger$ is the prolongation operator.

The compressed Hessian H c ${\mathbf {H}}_\mathrm{c}$ is expressed via H ${\mathbf {H}}$ and H p ${\mathbf {H}}_\mathrm{p}$ as H c = Q H Q = Q H p Q ${\mathbf {H}}_\mathrm{c}= {\mathbf {Q}}{\mathbf {H}}{\mathbf {Q}}^\dagger = {\mathbf {Q}}{\mathbf {H}}_\mathrm{p}{\mathbf {Q}}^\dagger$ , where we have used the following properties of pseudo-inverses: Q Q Q = Q ${\mathbf {Q}}{\mathbf {Q}}^\dagger {\mathbf {Q}}= {\mathbf {Q}}$ and Q Q Q = Q $ {\mathbf {Q}}^\dagger {\mathbf {Q}}{\mathbf {Q}}^\dagger = {\mathbf {Q}}^\dagger$ .

PROOF. An eigenvector ${\mathbf {v}}^\prime$ with eigenvalue $\lambda ^\prime$ of the compressed ${\mathbf {H}}_\mathrm{c}$ obeys ${\mathbf {Q}}{\mathbf {H}}{\mathbf {Q}}^\dagger {\mathbf {v}}^\prime =\lambda ^\prime {\mathbf {v}}^\prime$. Since ${\mathbf {P}}{\mathbf {Q}}^\dagger = {\mathbf {Q}}^\dagger$, we have ${\mathbf {Q}}^\dagger {\mathbf {Q}}{\mathbf {H}}{\mathbf {Q}}^\dagger {\mathbf {v}}^\prime ={\mathbf {H}}_\mathrm{p}({\mathbf {Q}}^\dagger {\mathbf {v}}^\prime)= \lambda ^\prime ({\mathbf {Q}}^\dagger {\mathbf {v}}^\prime)$. Conversely, consider ${\mathbf {H}}_\mathrm{p}{\mathbf {v}}= \lambda {\mathbf {v}}$ and note that ${\mathbf {v}}$ lies in the range of ${\mathbf {P}}$ and, therefore, of ${\mathbf {Q}}^\dagger$. Then, ${\mathbf {Q}}{\mathbf {H}}_\mathrm{p}{\mathbf {v}}=({\mathbf {Q}}{\mathbf {H}}{\mathbf {Q}}^\dagger)({\mathbf {Q}}{\mathbf {v}})={\mathbf {H}}_\mathrm{c}({\mathbf {Q}}{\mathbf {v}})=\lambda ({\mathbf {Q}}{\mathbf {v}})$. $\Box$
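Theorem 1 is easy to check numerically; the sketch below uses an arbitrary full-rank restriction matrix and a random symmetric positive-definite stand-in for the Hessian, and compares the non-zero eigenvalues of the projected and the compressed Hessian.

import numpy as np

rng = np.random.default_rng(2)

n, r = 6, 3
A = rng.standard_normal((n, n))
H = A @ A.T                        # symmetric positive-definite stand-in Hessian
Q = rng.standard_normal((r, n))    # arbitrary full-rank restriction matrix
Qp = np.linalg.pinv(Q)             # prolongation operator Q^+

P = Qp @ Q                         # orthogonal projector onto the row space of Q
Hp = P @ H @ P                     # projected Hessian
Hc = Q @ H @ Qp                    # compressed Hessian (r x r)

# The r non-zero eigenvalues of Hp coincide with the eigenvalues of Hc.
ev_p = np.sort(np.linalg.eigvals(Hp).real)[-r:]
ev_c = np.sort(np.linalg.eigvals(Hc).real)
print(np.allclose(ev_p, ev_c))     # True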

3.2 Compression With Semi-Orthogonal Matrices

As mentioned above, the operator ${\mathbf {P}}$ projects vectors on the space spanned by the rows of the restriction matrix ${\mathbf {Q}}$, which is the same as the range $\mathcal {R}({\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}})$ of the transposed matrix. The matrix ${\mathbf {Q}}$ can be constructed such that its rows ${\mathbf {q}}_j$ form an orthonormal basis in $\mathcal {R}({\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}})$. Then, ${\mathbf {Q}}^\dagger = {\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}$ and ${\mathbf {Q}}{\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}= {\mathbf {I}}_r$, where ${\mathbf {I}}_r$ is the $r \times r$ identity matrix. The corresponding $n \times n$ operator ${\mathbf {P}}= {\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {Q}}$ projects vectors on the same subspace of the model space as the original operator ${\mathbf {P}}$.

Theorem 2. Let ${\mathbf {A}}$ be a real symmetric matrix of size $n \times n$ with eigenvalues $\lambda _i$, $i=1,2,\ldots,n$, sorted in descending order. The semi-orthogonal $r\times n$ matrix ${\mathbf {Q}}$, with $r\le n$, defines its restriction to an $r$-dimensional linear subspace and obeys ${\mathbf {Q}}\, {\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}={\mathbf {I}}_r$, where ${\mathbf {I}}_r$ is the $r \times r$ identity matrix. Then, the eigenvalues $\mu _i$, $i=1, 2, \ldots, r$, in descending order, of ${\mathbf {B}}={\mathbf {Q}}{\mathbf {A}}{\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}$ obey

λ i μ i λ n r + i , i = 1 , 2 , , r . $$\begin{equation} \lambda _i\ge \mu _i\ge \lambda _{n - r + i},\quad i = 1, 2, \ldots, r. \end{equation}$$ (11)

The compression m c = Q m ${\mathbf {m}}_\mathrm{c}= {\mathbf {Q}}{\mathbf {m}}$ with semi-orthogonal matrix Q ${\mathbf {Q}}$ can be viewed as a map to a lower-dimensional space. The projected Hessian H p ${\mathbf {H}}_\mathrm{p}$ and the corresponding covariance matrix C p = H p ${\mathbf {C}}_\mathrm{p} = {\mathbf {H}}_\mathrm{p}^\dagger$ are transformed as H c = Q H p Q T ${\mathbf {H}}_\mathrm{c}= {\mathbf {Q}}{\mathbf {H}}_\mathrm{p}{\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}$ and C c = Q C p Q T ${\mathbf {C}}_\mathrm{c}= {\mathbf {Q}}{\mathbf {C}}_\mathrm{p}{\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}$ . By the definition of the Moore–Penrose pseudo-inverse, C p H p C p = C p ${\mathbf {C}}_\mathrm{p}{\mathbf {H}}_\mathrm{p}{\mathbf {C}}_\mathrm{p}= {\mathbf {C}}_\mathrm{p}$ , H p C p H p = H p ${\mathbf {H}}_\mathrm{p}{\mathbf {C}}_\mathrm{p}{\mathbf {H}}_\mathrm{p}= {\mathbf {H}}_\mathrm{p}$ , H p C p T = H p C p $\left({\mathbf {H}}_\mathrm{p}{\mathbf {C}}_\mathrm{p}\right)^{\scriptscriptstyle \mathsf {T}}= {\mathbf {H}}_\mathrm{p}{\mathbf {C}}_\mathrm{p}$ , C p H p T = C p H p $\left({\mathbf {C}}_\mathrm{p}{\mathbf {H}}_\mathrm{p}\right)^{\scriptscriptstyle \mathsf {T}}= {\mathbf {C}}_\mathrm{p}{\mathbf {H}}_\mathrm{p}$ . Also, C p H p = P ${\mathbf {C}}_\mathrm{p}{\mathbf {H}}_\mathrm{p}={\mathbf {P}}$ . It can be verified that the compressed matrices H c ${\mathbf {H}}_\mathrm{c}$ and C c ${\mathbf {C}}_\mathrm{c}$ also satisfy the above four properties. Hence, if Q ${\mathbf {Q}}$ is semi-orthogonal, the pseudo-inverse of the compressed Hessian is the same as the compression of the projected covariance matrix: H c = Q C p Q T ${\mathbf {H}}_\mathrm{c}^\dagger = {\mathbf {Q}}{\mathbf {C}}_\mathrm{p} {\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}$ .

Semi-orthogonal matrices allow for the application of Poincaré's separation theorem (Gradshteyn and Ryzhik 2000, for instance), which relates the eigenvalues of a given real symmetric matrix of size $n\times n$ to those of its compression to a subspace with dimension $r<n$.

The theorem implies that the eigenvalues of H p ${\mathbf {H}}_\mathrm{p}$ and H c ${\mathbf {H}}_\mathrm{c}$ are not larger than the eigenvalues of the original Hessian H ${\mathbf {H}}$ .

Poincaré's separation theorem is not applicable to the covariance matrix C p ${\mathbf {C}}_\mathrm{p}$ in relation to H ${\mathbf {H}}$ because the pseudo-inverse of the projected Hessian H p ${\mathbf {H}}_\mathrm{p}$ is not the same as the projection of the pseudo-inverse of the Hessian H ${\mathbf {H}}$ , that is, C p = H p P H P ${\mathbf {C}}_\mathrm{p}={\mathbf {H}}_\mathrm{p}^\dagger \ne {\mathbf {P}}{\mathbf {H}}^\dagger {\mathbf {P}}$ . This can also be understood from a heuristic point of view because restricting the types of possible perturbations also limits the opportunity to find a combination where such perturbations maximally compensate each other.
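Both statements can be verified numerically with a semi-orthogonal restriction matrix; in the sketch below, the rows of Q are obtained by orthonormalising random vectors, and the eigenvalue interlacing of Equation (11) as well as the inequality between the pseudo-inverse of the projected Hessian and the projected pseudo-inverse are checked on a random symmetric stand-in Hessian.

import numpy as np

rng = np.random.default_rng(3)
n, r = 8, 3

A = rng.standard_normal((n, n))
H = A @ A.T                                   # symmetric stand-in Hessian
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
Q = Q.T                                       # semi-orthogonal: Q Q^T = I_r

Hc = Q @ H @ Q.T                              # compressed Hessian
lam = np.sort(np.linalg.eigvalsh(H))[::-1]    # eigenvalues of H, descending
mu = np.sort(np.linalg.eigvalsh(Hc))[::-1]    # eigenvalues of Hc, descending

# Poincare separation theorem, Equation (11): lambda_i >= mu_i >= lambda_{n-r+i}.
print(np.all(lam[:r] >= mu) and np.all(mu >= lam[n - r:]))   # True

# The pseudo-inverse of the projected Hessian differs from the projection of
# the pseudo-inverse of H.
P = Q.T @ Q
Cp = np.linalg.pinv(P @ H @ P)
print(np.allclose(Cp, P @ np.linalg.pinv(H) @ P))            # False in general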

An alternative way to determine the marginal uncertainty of a parameter $m_1$ that represents a geological unit is the following. Choose a $\delta m_1=\Delta m_1$ and minimise the cost functional on the original model over all parameters while keeping $\delta m_1=\Delta m_1$ fixed. The minimum is reached at $\delta {\mathbf {m}}_{2,\min }=(\delta m_{2,\min },\delta m_{3,\min },\ldots)^{\scriptscriptstyle \mathsf {T}}$, and the value of the cost function at the minimum is $\mathcal {X}_{\min }=\mathcal {X}(\Delta m_1,\delta {\mathbf {m}}_{2,\min })$. If $\mathcal {X}_0=0$ in Equation (3), the quadratic behaviour of $\mathcal {X}(\delta {\mathbf {m}})=\tfrac{1}{2}\delta {\mathbf {m}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}\,\delta {\mathbf {m}}$ in $\delta m_1$ can be used to rescale $\mathcal {X}_{\min }$ to the threshold $\varepsilon _0$, resulting in a marginal uncertainty $|\delta m_1|\le (\varepsilon _0/\mathcal {X}_{\min })^{1/2}\,|\Delta m_1|$. In Figure 3, the ellipse described by $\mathcal {X}_{\min }=\mathcal {X}(\Delta m_1,\delta {\mathbf {m}}_{2,\min })$ would be an enlarged or shrunk version of the original ellipse given by $\mathcal {X}(\delta {\mathbf {m}})=\varepsilon _0$, and the rescaling would make them the same.
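The equivalence of this rescaling procedure with the marginal bound of Section 2.3 can be illustrated on a small synthetic Hessian; in the sketch below, eps0 and the trial perturbation Delta are arbitrary.

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
H = A @ A.T + np.eye(4)            # synthetic Hessian
eps0, Delta = 1.0, 0.1             # threshold and trial perturbation of m_1

# Minimise 0.5*dm^T H dm over the remaining parameters for fixed dm_1 = Delta.
H11, H12, H21, H22 = H[0, 0], H[0, 1:], H[1:, 0], H[1:, 1:]
dm2_min = -np.linalg.solve(H22, H21) * Delta
X_min = 0.5 * (H11 * Delta**2 + 2.0 * Delta * H12 @ dm2_min + dm2_min @ H22 @ dm2_min)

# Rescaled bound versus the marginal bound sqrt(2*eps0*(H^-1)_11) of Section 2.3.
bound_rescaled = np.sqrt(eps0 / X_min) * abs(Delta)
bound_marginal = np.sqrt(2.0 * eps0 * np.linalg.inv(H)[0, 0])
print(bound_rescaled, bound_marginal)   # identical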

3.3 Construction of Semi-Orthogonal Restriction Matrices

The simplest way to construct a semi-orthogonal restriction matrix ${\mathbf {Q}}$ is to specify its rows as ${\mathbf {q}}_1 = (1, 0, \ldots, 0)$, ${\mathbf {q}}_2 = (0, 1, \ldots, 0)$, and so on, so that the $j$-th vector ${\mathbf {q}}_j$ has a single non-zero $j$-th component equal to 1. The corresponding compressed Hessian ${\mathbf {H}}_\mathrm{c}$ is the $r \times r$ upper-left block of the original Hessian ${\mathbf {H}}$, which characterises the conditional uncertainties of the first $r$ model parameters. One can also choose ${\mathbf {q}}_j$ with $r$ arbitrary indices $j$, with $1\le j\le n$. In that case, the compressed Hessian consists of the elements of the original Hessian at the intersections of the rows and columns for the selected indices. More generally, a semi-orthogonal restriction matrix ${\mathbf {Q}}$ can be constructed from arbitrarily chosen rows of an orthogonal matrix.

Orthogonal matrices with unit determinants act as rotations. A rotation in n $n$ dimensions can be described by the pivot axis ξ ${\bm{\xi}}$ , and a unit ( n 1 ) $(n-1)$ -dimensional ‘pull’ vector η = η 1 , η 2 , , η n 1 T ${\bm{\eta}} = \left(\eta _1, \eta _2, \ldots, \eta _{n-1}\right)^{\scriptscriptstyle \mathsf {T}}$ in the plane perpendicular to ξ ${\bm{\xi}}$ (Hanson 1995, Equation 7). The coordinate axes are transformed in such a way that the pivot axis ξ ${\bm{\xi}}$ rotates towards η ${\bm{\eta}}$ by an angle θ $\theta$ in the plane formed by ξ ${\bm{\xi}}$ and η ${\bm{\eta}}$ . In the case where the pivot axis is the last coordinate axis, the matrix R T ${\mathbf {R}}^{\scriptscriptstyle \mathsf {T}}$ that performs the above rotation has the form (Hanson 1995)
$$\begin{equation} {\mathbf {R}}^{\scriptscriptstyle \mathsf {T}}= \begin{pmatrix} 1 - r_{1, 1} & -r_{1, 2} & \ldots & -r_{1, n - 1} & -s \eta _1\\ -r_{2, 1} & 1 - r_{2, 2} & \ldots & -r_{2, n - 1} & -s \eta _2\\ \ldots & \ldots & \ldots & \ldots & \ldots \\ -r_{n - 1, 1} & -r_{n - 1, 2} & \ldots & 1 - r_{n - 1, n - 1} & -s \eta _{n-1}\\ s \eta _1 & s \eta _2 & \ldots & s \eta _{n-1} & c \end{pmatrix}. \end{equation}$$ (12)
Here, r i j = ( 1 c ) η i η j $r_{ij} = (1 - c) \eta _i \eta _j$ , c = cos θ $c = \cos \theta$ and s = sin θ $s = \sin \theta$ . Rows of R T ${\mathbf {R}}^{\scriptscriptstyle \mathsf {T}}$ are coordinates of the new coordinate axes in the original coordinate systems. The transpose of matrix (12) is commonly called the rotation matrix R ${\mathbf {R}}$ (fixed coordinate system rotation), and it relates point coordinates m ${\mathbf {m}}^\prime$ in the rotated coordinate system to the original coordinates m ${\mathbf {m}}$ by m = R m ${\mathbf {m}}^\prime = {\mathbf {R}}{\mathbf {m}}$ .
As an example, consider a 2D coordinate system where the pivot axis ${\bm{\xi}}= {\mathbf {e}}_y$ rotates away from the 'pull' vector ${\bm{\eta}}= {\mathbf {e}}_x$ over an angle $\pi / 4$. Setting $\theta = - \pi / 4$ in Equation (12), we have
R T = 1 2 1 1 1 1 . $$\begin{equation} {\mathbf {R}}^{\scriptscriptstyle \mathsf {T}}=\frac{1}{\sqrt {2}}\def\eqcellsep{&}\begin{pmatrix} \phantom{-}1& 1\\ -1& 1 \end{pmatrix}. \end{equation}$$ (13)
If the matrix ${\mathbf {Q}}$ is chosen as the first row of ${\mathbf {R}}^{\scriptscriptstyle \mathsf {T}}$, the compressed Hessian ${\mathbf {H}}_\mathrm{c}$ acts along the first axis of the rotated coordinate system, where $\delta m_1 = \delta m_2$. Figure 4a provides an illustration of the cost function near the minimum if two parameters $m_1$ and $m_2$ are involved. If the noise level is set to $\mathcal {X}_{\mathrm{noise}}=1$, the cross section $\mathcal {X}=\mathcal {X}_{\mathrm{noise}}$ through the paraboloid defines an ellipse. Figure 4a displays the values $\mathcal {X}\le \mathcal {X}_{\mathrm{noise}}$. If one parameter is fixed, in this case $m_2$ at its value at the minimum of the cost function, the conditional probability distribution measures the width of $m_1$ inside the ellipse, whereas the marginal distribution describes the outer bounds of $m_1$ on the ellipse, as shown in Figure 4b. The line segment between the two small open circles in Figure 4c, which falls inside the original confidence ellipse, represents the compressed confidence range. This segment has a length between the shortest and longest axis, illustrating Poincaré's theorem.
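A small Python sketch of the rotation of Equation (12), with the pivot on the last coordinate axis; the 2D test reproduces Equation (13), and the choice of 'pull' vector in four dimensions is arbitrary.

import numpy as np

def rotation_RT(eta, theta):
    # R^T of Equation (12): tilts the last (pivot) axis towards the unit
    # (n-1)-dimensional 'pull' vector eta by an angle theta.
    eta = np.asarray(eta, dtype=float)
    n = eta.size + 1
    c, s = np.cos(theta), np.sin(theta)
    RT = np.eye(n)
    RT[:-1, :-1] -= (1.0 - c) * np.outer(eta, eta)   # delta_ij - r_ij block
    RT[:-1, -1] = -s * eta
    RT[-1, :-1] = s * eta
    RT[-1, -1] = c
    return RT

# 2D example: pivot e_y, pull e_x, theta = -pi/4 gives Equation (13).
print(rotation_RT([1.0], -np.pi / 4.0))
# Orthogonality check in four dimensions with a unit 'pull' vector.
RT = rotation_RT([0.6, 0.0, 0.8], 0.3)
print(np.allclose(RT @ RT.T, np.eye(4)))             # True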
Another approach to construct ${\mathbf {Q}}$, used in Section 4, is the following. The computational domain is partitioned into several disjoint subsets of model parameters. The number of model parameters in the $j$-th set is denoted as $M_j$, and the set of their indices is denoted as $S_j$. The $j$-th row of ${\mathbf {Q}}$ is defined as $q_{j, k} = 1 / \sqrt {M_j}$ if $k \in S_j$ and $q_{j, k} = 0$ if $k \not\in S_j$. For example, the selection of two sets of model parameters $\lbrace m_1, m_2, m_3 \rbrace$ and $\lbrace m_4, m_5\rbrace$ out of a larger set results in the restriction matrix
Q = 1 3 1 3 1 3 0 0 0 0 0 0 1 2 1 2 0 . $$\begin{equation} {\mathbf {Q}}= \def\eqcellsep{&}\begin{pmatrix} \tfrac{1}{\sqrt {3}}& \tfrac{1}{\sqrt {3}}& \tfrac{1}{\sqrt {3}}& 0& 0 &0 &\dots \\ 0 &0&0& \tfrac{1}{\sqrt {2}}& \tfrac{1}{\sqrt {2}} &0& \dots \end{pmatrix}. \end{equation}$$ (14)
In this case, the compressed Hessian represents perturbations for which the model parameters within the same set are changed by the same value.
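The construction of Equation (14) generalises readily to an arbitrary segmentation into geological units; the index sets and the stand-in Hessian in the following sketch are chosen purely for illustration.

import numpy as np

def unit_restriction(n, units):
    # Semi-orthogonal restriction matrix with rows 1/sqrt(M_j) on the index
    # set S_j of each unit, as in Equation (14).
    Q = np.zeros((len(units), n))
    for j, idx in enumerate(units):
        Q[j, idx] = 1.0 / np.sqrt(len(idx))
    return Q

# The example of Equation (14): units {m1, m2, m3} and {m4, m5} out of 7 parameters.
units = [[0, 1, 2], [3, 4]]
Q = unit_restriction(7, units)
print(np.allclose(Q @ Q.T, np.eye(2)))       # semi-orthogonal: True

# Compressed Hessian per unit, H_c = Q H Q^T (here Q^+ = Q^T).
rng = np.random.default_rng(4)
A = rng.standard_normal((7, 7))
H = A @ A.T                                  # stand-in Hessian
Hc = Q @ H @ Q.T
print(Hc)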
We can also construct a compressed Hessian that describes perturbations with certain spectral properties. In particular, filtering out small-scale perturbations bears some similarity to the homogenisation method (Cao et al. 2024; Capdeville and Métivier 2018; Cupillard and Capdeville 2018; Gibson et al. 2014; Owhadi and Zhang 2008). Another approach is to compose ${\mathbf {Q}}$ from rows of discrete cosine or sine Fourier transforms (Fichtner and Trampert 2011b). If $n$ is a power of 2, the Walsh–Hadamard matrices with entries equal to $\pm 1/\sqrt {n}$ can be used to define ${\mathbf {Q}}$ (Fino and Algazi 1976; Thompson 2017). Hessian compression with Hadamard matrices is conceptually the same as filtering in the wavelet-domain representation. If we denote the $\nu$-th column of the $2^q$-dimensional Walsh–Hadamard matrix by ${\mathbf {v}}_\nu ^{(q)}$, the column vectors ${\mathbf {v}}_\nu ^{(q)}$ can be constructed as follows (Ben-Artzi et al. 2007). Starting from a one-dimensional vector ${\mathbf {v}}_1^{(1)}$ whose single component is equal to 1, 2D Walsh–Hadamard vectors are obtained by joining the components of ${\mathbf {v}}_1^{(1)}$ with plus and minus signs: ${\mathbf {v}}_1^{(2)} = (1, 1)^{\scriptscriptstyle \mathsf {T}}$ and ${\mathbf {v}}_2^{(2)} = (1, -1)^{\scriptscriptstyle \mathsf {T}}$. From the $2^j$-dimensional vector ${\mathbf {v}}_\mu ^{(j)}$, two $2^{j+1}$-dimensional vectors are constructed as ${{\mathbf {v}}_{2\mu -1}^{(j+1)}}^{\scriptscriptstyle \mathsf {T}} = [({\mathbf {v}}_\mu ^{(j)})^{\scriptscriptstyle \mathsf {T}}, (-1)^{\mu -1} ({\mathbf {v}}_\mu ^{(j)})^{\scriptscriptstyle \mathsf {T}}]$ and ${{\mathbf {v}}_{2\mu }^{(j+1)}}^{\scriptscriptstyle \mathsf {T}}= [({\mathbf {v}}_\mu ^{(j)})^{\scriptscriptstyle \mathsf {T}}, (-1)^{\mu } ({\mathbf {v}}_\mu ^{(j)})^{\scriptscriptstyle \mathsf {T}}]$. We can then introduce the Walsh–Hadamard restriction matrices ${\mathbf {Q}}_{p, \nu }^{(q)}$ with dimensions $2^{p-q} \times 2^p$ as
Q p , ν ( q ) = 1 2 q / 2 v ν ( q ) T 0 0 0 v ν ( q ) T 0 0 0 v ν ( q ) T , $$\begin{equation} {\mathbf{Q}}_{p,\nu}^{(q)} = \frac{1}{2^{q/2}} \left( \def\eqcellsep{&}\begin{array}{llll} {\mathbf{v}}_\nu^{(q)^{\mathsf{T}}} & 0 & \cdots & 0\\ 0 & {\mathbf {v}}_\nu^{(q)^{\mathsf{T}}} & \cdots & 0\\ \cdots & \cdots & \cdots & \cdots \\ 0 & 0 & \cdots & {\mathbf {v}}_\nu^{(q)^{\mathsf{T}}}\\ \end{array} \right), \end{equation}$$ (15)
where v ν ( q ) ${{\mathbf {v}}_\nu ^{(q)}}$ is the Walsh–Hadamard vector constructed above. Restricting the 2 p × 2 p $2^p \times 2^p$ matrix H ${\mathbf {H}}$ with Q p , ν ( q ) ${\mathbf {Q}}_{p, \nu }^{(q)}$ is the same as (spatial) frequency filtering. The Fourier domain of H ${\mathbf {H}}$ is split into 2 p q $2^{p - q}$ blocks of equal length, and the action of H ${\mathbf {H}}$ is restricted to vectors lying in the ν $\nu$ -th Fourier block.
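A small sketch (ours) of the recursive construction of the Walsh–Hadamard vectors and of the block-diagonal restriction matrix of Equation (15); the helper names are hypothetical.

import numpy as np

def walsh_hadamard_vectors(levels):
    # Start from the single-component vector (1); at each level every vector v_mu
    # (mu counted from 1) spawns [v_mu, (-1)**(mu-1) v_mu] and [v_mu, (-1)**mu v_mu].
    vs = [np.array([1.0])]
    for _ in range(levels):
        nxt = []
        for mu, v in enumerate(vs, start=1):
            nxt.append(np.concatenate([v, (-1) ** (mu - 1) * v]))
            nxt.append(np.concatenate([v, (-1) ** mu * v]))
        vs = nxt
    return vs

def restriction_from_vector(v, blocks):
    # Block-diagonal restriction matrix of Equation (15): one copy of v^T per
    # block, scaled so that every row has unit norm.
    return np.kron(np.eye(blocks), v[None, :] / np.sqrt(v.size))

v = walsh_hadamard_vectors(2)[1]           # e.g. (1, 1, -1, -1)
Q = restriction_from_vector(v, blocks=4)   # dimensions 4 x 16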
FIGURE 4. (a) The cost functional X ( m 1 , m 2 ) $\mathcal {X}(m_1,m_2)$ for two parameters has the shape of a paraboloid near the minimum at ( m 0 , 1 , m 0 , 2 ) $(m_{0,1},m_{0,2})$ . Its map view (b) shows the conditional and marginal uncertainty for given m 2 = m 0 , 2 $m_2=m_{0,2}$ for a threshold value ε = 1 $\epsilon =1$ . (c) The eigenvalues of the Hessian describing the paraboloid correspond to the axes of the ellipsoid. The dotted lines indicate the coordinate axes after a 45 $45^\circ$ rotation. The projection on the subspace defined by m 1 m 0 , 1 = m 2 m 0 , 2 $m_1-m_{0,1}=m_2-m_{0,2}$ is the line segment between the two white dots, where the line intersects the ellipse. It has a length in between that of the shortest and longest axis.

4 Examples

4.1 Computation of Perturbation Data

The Gauss–Newton approximation of the Hessian provides the same result as the Hessian for a modelling operator based on its Born approximation. The computation of Born scattering data for each model perturbation with, for instance, a finite-difference code, involves the simultaneous solution of two (systems of) equations. The partial differential equation (2) can be split into L ( m 0 , u 0 ) = f $\mathcal {L}({\mathbf {m}}_0, {\mathbf {u}}_0) = {\mathbf {f}}$ for the background wavefield u 0 = u ( m 0 ) ${\mathbf {u}}_0 = {\mathbf {u}}({\mathbf {m}}_0)$ and L ( m 0 , δ u ) [ L ( m 0 , u 0 ) / m ] δ m $\mathcal {L}({\mathbf {m}}_0, \delta {\mathbf {u}}) \simeq - [{\partial \mathcal {L}({\mathbf {m}}_0, {\mathbf {u}}_0)}/{\partial {\mathbf {m}}}] \delta {\mathbf {m}}$ for the scattered field δ u $\delta {\mathbf {u}}$ , thereby doubling the compute cost. The Born approximation is usually applied to models that are split into a smooth background model that does not provide scattering in the frequency band of interest, and rough components that define the reflectors (Tarantola 1984; Østmo et al. 2002). In our case, the background model is assumed to be the full-waveform inversion (FWI) result and the perturbation data may contain free-surface and interbed multiples.

For the two-dimensional isotropic elastic examples shown later, we use a Taylor series approach, which requires the computation of L ( m 1 , u 1 ) = f $\mathcal {L}({\mathbf {m}}_1, {\mathbf {u}}_1) = {\mathbf {f}}$ with m 1 = m 0 + ε δ m ${\mathbf {m}}_1 = {\mathbf {m}}_0 + \varepsilon \,\delta {\mathbf {m}}$ and provides δ u u 1 u 0 / ε $\delta {\mathbf {u}}\simeq \left({\mathbf {u}}_1- {\mathbf {u}}_0 \right) / \varepsilon$ . In that case, the receiver data for the background wavefield S r u 0 ${\mathbf {S}}_{\mathrm{r}}{\mathbf {u}}_0$ , with sampling operator S r ${\mathbf {S}}_{\mathrm{r}}$ , only have to be computed once for all model perturbations. While this avoids the doubling of the cost, the choice of ε $\varepsilon$ is critical: if it is too small, the data will be severely affected by numerical noise and round-off errors; if it is too large, non-linear effects will appear. The elements of the Hessian follow from dot products between data for different perturbations (Mulder and Kuvshinov 2025).
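In sketch form (ours, not the authors' code), with forward(m) a placeholder for a modelling routine that returns the sampled receiver data, the procedure reads:

import numpy as np

def perturbation_data(forward, m0, perturbations, eps=1e-3):
    # First-order Taylor (finite-difference) approximation of the Born data:
    # delta_d ≈ [d(m0 + eps*dm) - d(m0)] / eps; the background data d(m0) are
    # computed once and reused for every model perturbation.
    d0 = forward(m0)
    return [(forward(m0 + eps * dm) - d0) / eps for dm in perturbations]

def gauss_newton_hessian(delta_d):
    # Hessian elements as dot products between perturbation data, flattened
    # over sources, receivers and time samples.
    D = np.stack([d.ravel() for d in delta_d])
    return D @ D.T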

According to Equation (13) of Huang (2023), the Born approximation in the scalar constant-density acoustic case produces scattering data of the form G j ( x , x ) G ( 0 ) ( x , x ) = Ω j d x G ( 0 ) ( x , x ) V j ( x ) G ( 0 ) ( x , x ) $G_j({\mathbf {x}},{\mathbf {x}}^{\prime })- G^{(0)}({\mathbf {x}},{\mathbf {x}}^{\prime })=\int _{\Omega _j} {\text{d}}{\mathbf {x}}^{\prime \prime } \, G^{(0)}({\mathbf {x}},{\mathbf {x}}^{\prime \prime }) V_j({\mathbf {x}}^{\prime \prime }) G^{(0)}({\mathbf {x}}^{\prime \prime },{\mathbf {x}}^{\prime })$ . In our setting, the full domain Ω = j = 1 m Ω j $\Omega =\bigcup _{j=1}^m \Omega _j$ is partitioned into m $m$ disjoint subsets Ω j $\Omega _j$ and the perturbation V j ( x ) $V_j({\mathbf {x}})$ has a unit amplitude inside Ω j $\Omega _j$ and is zero elsewhere. The Taylor series approach yields an approximation somewhere in between the Born approximation G ( 0 ) G ( 0 ) $G^{(0)} G^{(0)}$ and the G ( 0 ) G $G^{(0)} G$ of Equation (19) in Huang (2023), known as the Dyson equation in quantum mechanics or as the primary–secondary formulation in controlled-source electromagnetics. The differences between these three should be small for small perturbations, on the order of a percent, which implicitly assumes that the uncertainties are also small, of the same order of magnitude.

In practice, it is more convenient to work with relative perturbations of the form
m i = m 0 , i 1 + ( δ m ) i m 0 , i = m 0 , i ( 1 + δ log m i ) , $$\begin{equation} m_i=m_{0,i}{\left(1+\frac{(\delta m)_i}{m_{0,i}}\right)}=m_{0,i}(1+\delta \log m_i), \end{equation}$$ (16)
for all model parameters enumerated by i $i$ . This will rescale the Hessian by diag ( m 0 ) H diag ( m 0 ) $\mathrm{diag}({{\mathbf {m}}_0}){\mathbf {H}}\,{\mathrm{diag}}({{\mathbf {m}}_0})$ , with diag ( m 0 ) ${\mathrm{diag}}({{\mathbf {m}}_0})$ the diagonal matrix with values m 0 , i $m_{0,i}$ . Here and in what follows, we simplify the notation and will use m i $m_i$ to denote the relative perturbation δ log m i $\delta \log m_i$ . The Hessian is assumed to be scaled accordingly.
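The rescaling itself is a one-line operation; a minimal sketch (ours):

import numpy as np

def rescale_to_relative(H, m0):
    # diag(m0) H diag(m0): element-wise H_ij -> m0_i * H_ij * m0_j.
    return H * np.outer(m0, m0)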
FIGURE 5. Action of the Hessian on a scatterer at ( x , z ) = ( 0 , 1000 ) $(x,z)=(0,1000)$ m, for a perturbation m 1 = δ log [ 1 / ( ρ v p 2 ) ] $m_1=\delta \log [1/(\rho v_p^2)]$ , shown in (a) and (b), or a perturbation m 2 = δ log ( 1 / ρ ) $m_2=\delta \log (1/\rho)$ , shown in (c) and (d).
FIGURE 6. (a) Eigenvalues of the Hessian (black), of its projection on 2 × 2 $2\times 2$ (red dashed) and 4 × 4 $4\times 4$ points (blue dash-dotted). (b) A subset with the horizontal scale multiplied by 1, 1.035 and 1.88, respectively.
FIGURE 7. (a) 1D isotropic elastic model. (b) Eigenvalues of the Hessian on the modelling grid (black) and when coarsened per 2 (red) and per 4 (blue). (c) As (b) but with the larger eigenvalues and the horizontal axis scaled by 1, 1.06 and 1.3, respectively.
FIGURE 8. (a) Conditional uncertainty based on H p ${\mathbf {H}}_\mathrm{p}$ for single parameters and (b) for 3 × 3 $3\times 3$ blocks. (c) Marginal uncertainty. The drawn lines correspond to the 1D model grid, the dashed to combinations of 2 points into a single relative model perturbation and the dash-dotted lines to 4 points combined. For the marginals in (c), only the finest grid is shown.
FIGURE 9. Isotropic elastic model with (a) density, (b) P- and (c) S-wave velocity.
FIGURE 10. The index map (a) defines the piecewise constant values per layer for the model parameters in Figure 9. The negative indices correspond to 4 reservoirs. A coarser version (b) is obtained by pairwise combinations of layers, excluding seawater, top layer, and reservoirs.
FIGURE 11. Standard deviations on a logarithmic scale for the marginal distributions for components (a) δ log I p $\delta \log I_p$ , (b) δ log v p $\delta \log v_p$ , and (c) δ log ( v s / v p ) $\delta \log (v_s/v_p)$ . Values may be clipped at the extrema of the colour scale.
FIGURE 12. As Figure 11, but for the conditional distributions of (a) δ log I p $\delta \log I_p$ , (b) δ log v p $\delta \log v_p$ , and (c) δ log ( v s / v p ) $\delta \log (v_s/v_p)$ .
FIGURE 13. As the marginal distributions of Figure 11, but for coarser units.

4.2 Two-Dimensional Homogeneous Acoustic Problem

We start with a Hessian for the 2D constant-density acoustic wave equation, computed analytically in the frequency domain with the exact Green functions. The model has a density of ρ = 2 $\rho =2 $ g/cm 3 $^3$ and a P-wave velocity of v p = 1.5 $v_p=1.5 $ km/s. We choose a 15-Hz Ricker wavelet and only consider frequencies from 4 to 30 Hz at a 0.5-Hz interval. The Hessian is computed on a regular grid in a subdomain defined by x [ 250 , 250 ] $x\in [-250, 250] $ m and z [ 750 , 1250 ] $z\in [750, 1250] $ m with a 5-m spacing. Sources and receivers are located at zero depth with lateral positions x s [ 887.5 , 887.5 ] $x_s\in [-887.5, 887.5] $ m for the shots and x r [ 900 , 900 ] $x_r\in [-900, 900] $ m for the receivers, both with a 25-m spacing.

Figure 5 displays two ‘lines’ of the Hessian, the response of a scatterer at the centre of the domain, at x = 0 $x=0 $ m and z = 1000 $z=1000 $ m, for either a perturbation m 1 = δ log [ 1 / ( ρ v p 2 ) ] $m_1=\delta \log [1/(\rho v_p^2)]$ , in Figure 5a, b, or m 2 = δ log ( 1 / ρ ) $m_2=\delta \log (1/\rho)$ , in Figure 5c, d. The imprint of the Ricker wavelet is visible in the vertical direction, whereas longer wavelengths appear in the horizontal direction. As is clear from these images, the finer scales at the level of the 5-m grid spacing are not resolved, and we expect a large null space.

Figure 6a displays the eigenvalues of a subset of the Hessian as a black line, for 100 × 100 $100\times 100$ points instead of 101 × 101 $101\times 101$ , dropping the results for positions at the highest value of x $x$ and of z $z$ . The reason for taking a subset is that it is easier to build its projections by combining 2 × 2 $2\times 2$ or 4 × 4 $4\times 4$ points inside small squares of the grid. The resulting eigenvalues are shown as red and blue lines, respectively. All curves are scaled by the maximum eigenvalue for the original grid. On the latter, fewer than 3500 of the 20,000 eigenvalues are not zero, taking 10 16 $10^{-16}$ as a rather small threshold. When projected to groups of 2 × 2 $2\times 2$ points, about 3000 of the 5000 are not zero. For groups of 4 × 4 $4\times 4$ , about 1200 of the 1250 eigenvalues are not zero.

This shows that the projection helps to remove the null-space components, in particular, sub-resolution model features that cannot be reconstructed from the data and have an infinite uncertainty. We also observe that the projections do not increase the eigenvalues, in agreement with Poincaré's theorem. Alternatively, by stretching the horizontal axis for the compressed case, we obtain results that are closer to the original black curve, in particular for the 2 × 2 $2\times 2$ compression represented by the red curve in Figure 6b. This behaviour is expected as long as the group of grid points lies in a Fresnel zone and their responses add coherently.
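This eigenvalue behaviour is easily reproduced with a toy compression; the sketch below (ours) uses a random stand-in for the Hessian and checks Poincaré's separation theorem.

import numpy as np

rng = np.random.default_rng(1)
n = 16
A = rng.standard_normal((n, n))
H = A.T @ A                                                    # stand-in symmetric PSD Hessian

Q = np.kron(np.eye(n // 2), np.ones((1, 2)) / np.sqrt(2.0))    # average adjacent pairs
H_c = Q @ H @ Q.T

ev = np.sort(np.linalg.eigvalsh(H))[::-1]
ev_c = np.sort(np.linalg.eigvalsh(H_c))[::-1]
# Poincaré separation: the k-th compressed eigenvalue never exceeds the k-th original one.
print(np.all(ev_c <= ev[: n // 2] + 1e-9))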

4.3 Two-Dimensional Ocean Bottom Node Data, One-Dimensional Isotropic Elastic Model

Figure 7a displays a deep-water 1D isotropic elastic model, in terms of density ρ $\rho$ , P-wave velocity v p $v_p$ and S-wave velocity v s $v_s$ . For the computation of the Hessian, the water layer, down to a depth of 1400 m, is described by a constant density and water velocity. In reality, they vary with temperature, salinity and column pressure, and abrupt depth changes such as a thermocline may even produce reflections in the seismic bandwidth. For the deepest layer, beyond 4700 m, the three elastic parameters are also assumed to be constant with depth. The grid spacing for 2D finite-difference modelling was set at 10 m. The parameters for the water layer and the deepest layer were assumed to be known, leaving 3 × 330 = 990 $3\times 330=990$ model parameters on the 10-m modelling grid. A total of 161 shots were fired at a depth of 10 m and horizontal offsets from 0 to 8000 m at a 50-m interval, using an 8-Hz Ricker wavelet. For the receiver at the sea bottom, only the P-data were used, with a recording time of 10 s and a 4-ms sampling. Reciprocity was applied for modelling. A free-surface boundary condition was imposed.

Figure 7b shows the eigenvalues of the Hessian as a black line. As already mentioned, the water and deepest layer were ignored. The result was not scaled by the number of points in the x $x$ -coordinate. When the 330 points are coarsened by combining adjacent depth pairs, the compressed Hessian has 3 × 330 / 2 = 495 $3\times 330/2=495$ eigenvalues, drawn in red. When 4 points in depth are combined, the last group contains only 2 points, and there are 249 eigenvalues, drawn in blue. In the latter case, the null-space components have effectively been removed. Because the uniform finite-difference grid is much finer than the resolvable scales at larger depths, the null space is expected to have a substantial size. Figure 7c shows a subset of the same eigenvalues but with the horizontal axis stretched.

Figure 8 displays three types of uncertainty estimates. The first, in Figure 8a, is the conditional one, obtained by fixing all other parameters, selecting one value on the main diagonal of the projected Hessian H p ${\mathbf {H}}_\mathrm{p}$ and finding σ k $\sigma _k$ from 1 2 σ k h p , k , k σ k = ε X 0 $\tfrac{1}{2}\sigma _k h_{\mathrm{p},k,k}\sigma _k=\epsilon ^\prime \mathcal {X}_0$ with ε = 10 4 $\epsilon ^\prime =10^{-4}$ and X 0 $\mathcal {X}_0$ the data energy in the reference or background model.
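Solving this quadratic relation gives $\sigma_k=\sqrt{2\epsilon^\prime\mathcal{X}_0/h_{\mathrm{p},k,k}}$ per parameter; a one-line sketch (ours):

import numpy as np

def conditional_sigma(H_p, eps_prime, X0):
    # Solve (1/2) sigma_k * h_{p,k,k} * sigma_k = eps' * X0 for each k.
    return np.sqrt(2.0 * eps_prime * X0 / np.diag(H_p))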

We have plotted the results for the projected Hessian H p ${\mathbf {H}}_\mathrm{p}$ rather than the compressed H c ${\mathbf {H}}_\mathrm{c}$ , because the former is defined on the original model space. Appendix D offers a pictorial description of three ways to map uncertainties obtained from the compressed Hessian to the modelling grid, using Q T ${\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}$ , simple copies, or estimates from H p ${\mathbf {H}}_\mathrm{p}$ . The latter is more suitable when the geological units are small, and null-space components dominate, causing ellipsoids to be elongated in the direction perpendicular to that of the compression, as illustrated in Figure A1c.

With the chosen restriction operator, the projection with H p ${\mathbf {H}}_\mathrm{p}$ replaces the original model perturbations by their average in each segment. This can be easily seen by an example. Consider the compression operator from Equation (14) and a diagonal Hessian H ${\mathbf {H}}$ with diag ( H ) = a , a , a , b , b T $\mathrm{diag}({{\mathbf {H}}})=\left(a,a,a,b,b\right)^{\scriptscriptstyle \mathsf {T}}$ and zeros elsewhere. Then, H c ${\mathbf {H}}_\mathrm{c}$ is diagonal with diag ( H c ) = ( a , b ) T $\mathrm{diag}({{\mathbf {H}}_\mathrm{c}})=(a,b)^{\scriptscriptstyle \mathsf {T}}$ and
H p = a / 3 a / 3 a / 3 0 0 a / 3 a / 3 a / 3 0 0 a / 3 a / 3 a / 3 0 0 0 0 0 b / 2 b / 2 0 0 0 b / 2 b / 2 . $$\begin{equation*} {\mathbf {H}}_\mathrm{p}=\def\eqcellsep{&}\begin{pmatrix} a/3&a/3&a/3&0&0\\ a/3&a/3&a/3&0&0\\ a/3&a/3&a/3&0&0\\ 0&0&0&b/2&b/2\\ 0&0&0&b/2&b/2 \end{pmatrix}. \end{equation*}$$
The Hessians H ${\mathbf {H}}$ , H c ${\mathbf {H}}_\mathrm{c}$ and H p ${\mathbf {H}}_\mathrm{p}$ share the same eigenvalues a $a$ and b $b$ , but for H ${\mathbf {H}}$ they are repeated 3 and 2 times, respectively, whereas H p ${\mathbf {H}}_\mathrm{p}$ has three zeros because of the repeated entries in the matrix.
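This toy example is easily verified numerically (our sketch):

import numpy as np

a, b = 2.0, 3.0
H = np.diag([a, a, a, b, b])
Q = np.array([[1 / np.sqrt(3)] * 3 + [0.0, 0.0],
              [0.0, 0.0, 0.0] + [1 / np.sqrt(2)] * 2])
H_c = Q @ H @ Q.T                  # diag(a, b)
H_p = Q.T @ H_c @ Q                # blocks filled with a/3 and b/2

print(np.linalg.eigvalsh(H_c))     # approximately [2, 3]
print(np.linalg.eigvalsh(H_p))     # approximately [0, 0, 0, 2, 3]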

The second type of uncertainty estimate in Figure 8b is partially conditional, fixing parameters everywhere except at one depth, and plotting the diagonal of the local covariance matrix. This amounts to selecting one 3 × 3 $3\times 3$ block of the Hessian for each point and extracting the diagonal of its inverse. Figure 8c shows the marginal uncertainty, based on the diagonal of the covariance matrix for the full problem.
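A sketch (ours) of the partially conditional and marginal estimates, assuming the same noise scaling as in the conditional case:

import numpy as np

def partially_conditional_sigma(H, block, eps_prime, X0):
    # Fix all parameters outside `block` (e.g. the three components at one depth),
    # invert that block of the Hessian and take the diagonal of the inverse.
    C_loc = np.linalg.pinv(H[np.ix_(block, block)])
    return np.sqrt(2.0 * eps_prime * X0 * np.diag(C_loc))

def marginal_sigma(H, eps_prime, X0):
    # Marginal uncertainty from the diagonal of the (pseudo-inverse) covariance.
    return np.sqrt(2.0 * eps_prime * X0 * np.diag(np.linalg.pinv(H)))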

The uncertainty increases with depth, as expected, both on the original grid and after projection onto the lower-dimensional subspaces. The marginal uncertainty is very large but decreases after projection when null-space and near null-space components, in particular those related to unresolved features, are removed. We also observe a decrease towards the bottom boundary in Figure 8c, which is presently not understood.

A potential disadvantage of using H p ${\mathbf {H}}_\mathrm{p}$ instead of H c ${\mathbf {H}}_\mathrm{c}$ is its size. However, that storage can be avoided by working with the compressed Hessian H c ${\mathbf {H}}_\mathrm{c}$ , since the projected H p ${\mathbf {H}}_\mathrm{p}$ only contains duplicated entries of H c ${\mathbf {H}}_\mathrm{c}$ , as explained in Appendix E.

FIGURE 14. As the conditional distributions of Figure 12, but for coarser units.
FIGURE 15. Marginal covariance matrices for the reservoir with index 1 $-1$ in the fine (a) and coarse (b) case. For the conditional case (c), fine and coarse results are the same, by definition.
FIGURE 16. Uncertainty for a pair of parameters in two adjacent layers, for (a) δ log I p $\delta \log I_p$ , (b) δ log v p $\delta \log v_p$ and (c) δ log ( v s / v p ) $\delta \log (v_s/v_p)$ , scaled by X noise = ε X 0 $\mathcal {X}_{\mathrm{noise}}=\epsilon ^\prime \mathcal {X}_0$ . The white ellipses correspond to the value of σ $\sigma$ for the conditional uncertainty and the magenta ellipses to the marginal one. The projection to single parameters is shown as a line segment.

4.4 Two-Dimensional Marine Example

Figure 9 shows a 2D marine model, used earlier (Mulder and Kuvshinov 2023; 2025). Figure 10a displays the index map, where each index value denotes a geological unit. The four negative values refer to four reservoirs. In the model, we have assumed that elastic properties are constant inside each fine-grid unit, although that is not required for the method, as only the relative perturbations are assumed to be constant. Figure 10b depicts a coarser version, obtained by combining pairs of adjacent layers, excluding the seawater, the top layer down to a depth of 800 m, and the four reservoirs. Both index maps define projections, a finer and a coarser one, relative to the modelling grid that has a 10-m grid spacing.

For the acquisition, 199 shot positions range from x s = $x_s=-$ 2900 to 7000 m at a 50-m interval and a depth of 10 m. The source wavelet is a 15-Hz Ricker integrated twice in time, that is, a Gaussian with a standard deviation σ w = ( π 2 f peak ) 1 $\sigma _w=(\pi \sqrt {2}f_{\mathrm{peak}})^{-1}$ and f peak = 15 $f_{\mathrm{peak}}=15 $ Hz. Receivers at an 8-m depth have offsets at a 25-m interval from 100 to 6000 m or less when the rightmost boundary of the domain is reached, and 7 s of data were recorded and sampled at 4 ms. A free-surface boundary condition is imposed, suppressing low frequencies in the data.

The finest grid has points in the set V ( 0 ) $V^{(0)}$ . When compressed with a projection operator Q ( 0 ) ( 1 ) ${\mathbf {Q}}_{(0)}^{(1)}$ , the larger-scale geological units are elements of the set V ( 1 ) $V^{(1)}$ . A further compression produces a set V ( 2 ) $V^{(2)}$ . Then, Q ( 1 ) ( 2 ) = Q ( 0 ) ( 2 ) Q ( 0 ) ( 1 ) T ${\mathbf {Q}}_{(1)}^{(2)}={\mathbf {Q}}_{(0)}^{(2)} \left({\mathbf {Q}}_{(0)}^{(1)} \right)^{\scriptscriptstyle \mathsf {T}}$ . The resulting operator involves the following steps: undo the 1 / n f , j $1/\sqrt {n_{\mathrm{f},j}}$ scaling of the finer H c ( 1 ) ${\mathbf {H}}_\mathrm{c}^{(1)}$ , where n f , j $n_{\mathrm{f},j}$ is the number of points inside each finer geological unit j $j$ ; add the contributions of the finer units to the coarser ones; and apply the 1 / n c , k $1/\sqrt {n_{\mathrm{c},k}}$ scaling after summation, where n c , k $n_{\mathrm{c},k}$ is the number of points inside each larger geological unit after projection, assuming that each coarser unit is obtained by combining one or more finer ones. This describes the relation between the Hessian used for Figures 11 and 12 and the one used for Figures 13 and 14. These figures are based on the compressed Hessians H c ${\mathbf {H}}_\mathrm{c}$ and their pseudo-inverses, for the finer and coarser segmentation, and the uncertainty estimates are just copied to the modelling grid for display purposes (option 2 in Appendix D).
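A minimal sketch (ours, not the authors' code) of this finer-to-coarser assembly; n_fine and fine_to_coarse are hypothetical bookkeeping arrays giving the number of grid points per fine unit and the coarse unit to which each fine unit belongs.

import numpy as np

def coarsen_compressed_hessian(H_c_fine, n_fine, fine_to_coarse):
    # Undo the 1/sqrt(n_f,j) scaling of the finer compressed Hessian, sum the
    # contributions of fine units into their coarse unit, and rescale by
    # 1/sqrt(n_c,k), i.e. apply Q_(1)^(2) = Q_(0)^(2) (Q_(0)^(1))^T.
    n_fine = np.asarray(n_fine, dtype=float)
    fine_to_coarse = np.asarray(fine_to_coarse)
    m_coarse = fine_to_coarse.max() + 1
    A = np.zeros((m_coarse, len(n_fine)))
    A[fine_to_coarse, np.arange(len(n_fine))] = np.sqrt(n_fine)
    n_coarse = np.bincount(fine_to_coarse, weights=n_fine, minlength=m_coarse)
    Q12 = A / np.sqrt(n_coarse)[:, None]
    return Q12 @ H_c_fine @ Q12.T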

Figure 15a shows a subset of the covariance matrix, restricted to the reservoir with index 1 $-1$ and describing the marginal distribution. The result for the coarser projection is shown in Figure 15b and has somewhat smaller values. Figure 15c displays the result for the conditional case, assuming all parameters outside the reservoir are known. Since this part of the model is the same in the coarser projection, the corresponding matrix is also the same.

Figure 16 displays the conditional distribution with all parameters fixed except the P-impedances of the two units with index 41 and, below it, 44, corresponding to the central part of the model in between the two faults and the reservoirs with index 1 $-1$ and 2 $-2$ . The image depicts ( X unc / X noise ) 1 / 2 ${({\mathcal{X}}_{\mathrm{unc}}/{\mathcal{X}}_{\mathrm{noise}})}^{\hspace*{-0.16em}1/2}$ as a function of the two model parameters, ignoring all other model parameters. The white ellipse is the boundary of the uncertainty region for a noise energy X noise $\mathcal {X}_{\mathrm{noise}}$ taken as 10 8 $10^{-8}$ of the data energy X 0 $\mathcal {X}_0$ . It follows from the singular value decomposition H = U S U T ${\mathbf {H}}={\mathbf {U}}{\mathbf {S}}{\mathbf {U}}^{\scriptscriptstyle \mathsf {T}}$ , with singular values s = diag ( S ) ${\mathbf {s}}=\mathrm{diag}({{\mathbf {S}}})$ in the diagonal matrix S ${\mathbf {S}}$ , and setting X unc = 1 2 y H T y H T $\mathcal {X}_{\mathrm{unc}}=\tfrac{1}{2}{\mathbf {y}}_{\mathrm{H}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {y}}_{\mathrm{H}}^{\phantom{{\scriptscriptstyle \mathsf {T}}}}$ with y H = S 1 / 2 U T δ m ${\mathbf{y}}_{\mathrm{H}}=\mathbf{S}{}^{1/2}{\mathbf{U}}^{\mathsf{T}}\delta \mathbf{m}$ and δ m = U ( S ) 1 / 2 y H $\delta \mathbf{m}=\mathbf{U}({\mathbf{S}}^{\ensuremath{\dag}}){}^{\hspace*{-0.16em}1/2}{\mathbf{y}}_{\mathrm{H}}$ , similar to Equation (7). The ellipsoid is parameterised by y H ${\mathbf {y}}_{\mathrm{H}}$ on a high-dimensional sphere with a radius ( 2 X noise ) 1 / 2 $(2{\mathcal{X}}_{\mathrm{noise}}){}^{\hspace*{-0.16em}1/2}$ .

Similarly, the covariance matrix C = U S U T ${\mathbf {C}}={\mathbf {U}}{\mathbf {S}}^\dagger {\mathbf {U}}^{\scriptscriptstyle \mathsf {T}}$ , with S ${\mathbf {S}}^\dagger$ the pseudo-inverse of S ${\mathbf {S}}$ , determines ellipses defined by constant values of m T C m = y C T y C T ${\mathbf {m}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {C}}{\mathbf {m}}={\mathbf {y}}_{\mathrm{C}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {y}}_{\mathrm{C}}^{\phantom{{\scriptscriptstyle \mathsf {T}}}}$ , with y C = ( S ) 1 / 2 U T m ${\mathbf{y}}_{\mathrm{C}}=({\mathbf{S}}^{\ensuremath{\dag}}){}^{\hspace*{-0.16em}1/2}{\mathbf{U}}^{\mathsf{T}}\mathbf{m}$ on a hypersphere and m = U S 1 / 2 y C $\mathbf{m}=\mathbf{U}\mathbf{S}{}^{1/2}{\mathbf{y}}_{\mathrm{C}}$ . The magenta ellipse in Figure 16 corresponds to a 2 × 2 $2\times 2$ subset of C ${\mathbf {C}}$ and represents the marginal distribution of the two parameters. The latter has a slightly different orientation of the axes.
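A short sketch (ours) of how such an ellipse can be traced from the singular value decomposition of a 2 × 2 $2\times 2$ Hessian block; H2 and X_noise are placeholders.

import numpy as np

def confidence_ellipse(H2, X_noise, num=200):
    # Boundary of X_unc = X_noise: dm = U (S^+)^(1/2) y_H with y_H on a circle
    # of radius sqrt(2 X_noise); the columns of the result are points (dm_1, dm_2).
    U, s, _ = np.linalg.svd(H2)
    s_pinv = np.array([1.0 / x if x > 1e-12 * s.max() else 0.0 for x in s])
    t = np.linspace(0.0, 2.0 * np.pi, num)
    y = np.sqrt(2.0 * X_noise) * np.vstack([np.cos(t), np.sin(t)])
    return U @ (np.sqrt(s_pinv)[:, None] * y)

# The marginal (magenta) ellipse follows analogously from the 2x2 subset of the
# covariance matrix C = U S^+ U^T.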

In the example of Figure 16, units 41 and 44 below it can be taken together, leaving a single value for each of the three elastic model components. The projection operator in this case is a subset of Q ( 1 ) ( 2 ) ${\mathbf {Q}}_{(1)}^{(2)}$ that combines units 41 (upper) and 44 (lower). It can be expressed as the first row of the transposed rotation matrix
R T = 0.7037 0.7105 0.7105 0.7037 , $$\begin{equation*} {\mathbf {R}}^{\scriptscriptstyle \mathsf {T}}=\def\eqcellsep{&}\begin{pmatrix} \phantom{-}0.7037 & 0.7105\\ -0.7105 & 0.7037 \end{pmatrix}, \end{equation*}$$
related to an angle of 45 . 28 $45.28^\circ$ . The white line segments in Figures 16a–c describe the range of uncertainty for the conditional distribution. The line segment and ellipse provide another illustration of Poincaré's separation theorem. The line segment is a 1D subset of the area inside the ellipse, with a length between the short and long axes of the ellipse. In the conditional case, the endpoints coincide with points on the ellipse. The projection for the marginal distribution does not necessarily coincide with the ellipse, as the inverse of the projection is not the same as the projection of the inverse. Nevertheless, the endpoints of the line segment are very close to points on the magenta ellipse in Figures 16b and 16c, but not in Figure 16a.

5 Discussion and Conclusions

We have considered full-waveform inversion (FWI) uncertainty as the range of model parameters within which the sensitivity of the modelled data to parameter variations does not exceed the noise present in the observed data. This sensitivity is estimated with noise-free forward modelling and is characterised by the Hessian of a cost or loss function. Full characterisation of noise in the observed data requires specifying its covariance matrix. We show that for our purpose the overall noise energy level, which is assumed to be known and which does not influence the sensitivity estimation, is a sufficient proxy. Computation of the full Hessian is usually not feasible in practice. Moreover, due to a key feature of FWI – the use of grid spacings significantly smaller than the wavelengths of the modelled waves – the Hessian is inevitably singular. This leads to formally infinite uncertainty in directions corresponding to parameter perturbations that lie in the null space, that is, combinations that do not affect the modelled data. The null-space may occupy 80% or more of the full model space. To obtain meaningful and finite uncertainty estimates, it is necessary to project the Hessian onto a lower-dimensional subspace. As a consequence, FWI uncertainty estimates are not absolute but relative, depending on the choice of dimensionality reduction or projection approach.

We developed a formalism to find lower-dimensional projections of the Hessian, taking into account that its pseudo-inverse is generally required. A number of examples show that the projection removes a part or all of the null-space components. When a finite-difference method with a fixed grid spacing is used for modelling and inversion, the grid is typically too dense in the deeper parts of a model where the velocities are higher. Suppression of the null-space components related to sub-resolution structures is therefore necessary. If the scale length after projection is relatively small, the eigenvalues of the compressed Hessian are distributed following the same pattern as the eigenvalues of the original Hessian. However, if the length scale after projection becomes too large, the spectrum of the Hessian and the related uncertainty estimates will be distorted. Apparently, this happens at scales larger than the size of the Fresnel zone where the variations of parameters at the grid points that belong to the same geological unit influence the perturbation of the measured signal incoherently.

The examples show that the proposed approach provides reasonable estimates of the conditional uncertainty. When confined to a subset of the domain, partially conditional estimates can be useful to quantify relative uncertainties for multi-parameter inversion. The estimates of the marginal uncertainty, which are affected by the global effect of the model parameters, are less reliable because the inverse of the projected Hessian differs from the projection of the inverse Hessian, required for computing the marginal uncertainty.

We have only considered the simplest projection operator based on averaging and representing the relative model perturbations as piecewise constant per geological unit. The relation to the Walsh–Hadamard transform and its generalisations would facilitate the selection of additional components, other than long wavelength structures. Also, smoother representations replacing the current blocky choice could be useful (Loris et al. 2007; Simons et al. 2011, for instance).

We envisage a workflow that starts, possibly, with an initial velocity model. FWI on successively finer length scales and higher frequencies provides a subsurface model (Bunks et al. 1995). Seismic interpretation leads to a segmentation in geological units. Automatic horizon tracking can provide a space-filling partitioning into subsurface volumes and pattern recognition can assist in combining those into units of similar rock type. Nevertheless, segmentation of highly heterogeneous three-dimensional models obtained by FWI can be a challenge. Once accomplished, the Hessian follows from constant relative perturbations of each model parameter per unit, even if the parameters vary per unit, at the cost of a forward simulation of the seismic dataset. Since inversion requires O ( 100 ) $O(100)$ iterations for O ( 1 ) $O(1)$ model parameters per point or per set of points, this should be feasible for O ( 100 ) $O(100)$ geological units, either of rather large size for the full model, or much smaller in target-oriented applications, or a combination of both.

An alternative, elegant approach combines segmentation and inversion (Bodin et al. 2009; Burdick and Lekić 2017; Guo et al. 2020; Hawkins and Sambridge 2015; Malinverno and Leaney 2005; D. Zhu and Gibson 2018). An application for FWI (Ray et al. 2017) uses a quad tree to represent the 2D subsurface model by the Haar wavelet basis and a reversible-jump Markov-chain Monte Carlo method to sample the posterior model. Its disadvantages are compute cost and a blocky model representation. Other ways to reduce compute cost are operator upscaling (Stuart et al. 2019) and homogenisation (Cao et al. 2024; Capdeville and Métivier 2018; Cupillard and Capdeville 2018; Gibson et al. 2014; Owhadi and Zhang 2008).

The analysis based on the Hessian is limited to small perturbations around the global minimum. Uncertainty quantification for cases where the reference model selected for the analysis is not close to the global minimum requires other tools, several of which are mentioned in the Introduction.

Acknowledgements

This study benefited from discussions with Sijmen Gerritsen, Gautam Kumar, Wei Dai and René-Édouard Plessix. A short preliminary version of this paper was presented at the EAGE 2023 Annual Meeting (Mulder and Kuvshinov 2023) using different parameters. At the time of publication, the authors are no longer affiliated with Shell Global Solutions International B.V.

    APPENDIX A: Calculation of the Hessian

    We derive an expression for the Hessian in Hilbert space. To emphasise that we consider a general case that describes continuum or discretised scalar or vector fields and operators or matrices, we do not use boldface symbols as elsewhere in the paper.

    A functional X ( u , m ) $\mathcal {X}(u, m)$ , where u U $u \in U$ and m M $m \in M$ , is minimised under the constraint s = 0 $s = 0$ , where s S $s \in S$ is defined by the map F ( u , m ) ${\mathcal {F}} (u, m)$ from U × M S $U \times M \rightarrow S$ , and U $U$ , M $M$ and S $S$ are inner product (Hilbert) spaces. We assume that for each set of parameters m $m$ there exists a unique u ( m ) $u(m)$ that satisfies the above constraint. Then, the problem considered reduces to minimisation of the functional X ( m ) = X ( u ( m ) , m ) $\mathcal {X}(m) = \mathcal {X}(u(m), m)$ .

    Following the Lagrangian formalism, or the adjoint state method, we introduce the augmented functional
    X aug = X + v , s S , $$\begin{equation} \mathcal {X}_{\mathrm{aug}} = \mathcal {X} + {\left\langle v, s \right\rangle} _S, \end{equation}$$ (A.1)
    which coincides with X ${\mathcal {X}}$ if u = u ( m ) $u = u(m)$ . Here, v S $v\in S$ is the so-called adjoint field that plays the role of the Lagrangian multiplier. Unconstrained perturbations δ m $\delta m$ and δ u $\delta u$ cause perturbations δ X = u X δ u + m X δ m $\delta \mathcal {X} = \partial _u \mathcal {X}\delta u + \partial _m \mathcal {X}\delta m$ and δ s = u F δ u + m F δ m $\delta s = \partial _u \mathcal {F}\nobreakspace \delta u + \partial _m \mathcal {F} \delta m$ , where u $\partial _u$ and m $\partial _m$ denote Fréchet derivatives with respect to u $u$ and m $m$ , respectively. The Riesz representation theorem states that for each linear functional L ${\mathcal {L}}$ on a Hilbert space X $X$ there exists a unique element x L X $x_L \in X$ such that L x = x L , x X ${\mathcal {L}} x = \left\langle x_L, x \right\rangle _X$ for each x X $x \in X$ . Here, the angular brackets · , · $\left\langle \cdot, \cdot \right\rangle$ denote the inner product and the subscript ‘ X $X$ ’ indicates that this inner product is taken in X $X$ -space. Taking into account that the Fréchet derivative is a linear functional, applying the Riesz theorem and involving the adjoint relation, we obtain
    δ X aug = δ X δ u + ( u F ) * v , δ u U + δ X δ m + ( m F ) * v , δ m M . $$\begin{equation} \begin{split} \delta \mathcal {X}_{\mathrm{aug}} =& {\left\langle \frac{\delta \mathcal {X}}{\delta u} + (\partial _u \mathcal {F})^\ast v, \delta u \right\rangle} _U + \\ & {\left\langle \frac{\delta \mathcal {X}}{\delta m} + (\partial _m \mathcal {F})^\ast v, \nobreakspace \delta m \right\rangle} _M. \end{split} \end{equation}$$ (A.2)
    Here, δ X / δ u ${\delta \mathcal {X}} / {\delta u}$ and δ X / δ m ${\delta \mathcal {X}} / {\delta m}$ are Riesz representations of u X $\partial _u \mathcal {X}$ and m X $\partial _m \mathcal {X}$ and ‘ * $\ast$ ’ denotes the adjoint operator. With the choice
    ( u F ) * v = δ X δ u , $$\begin{equation} (\partial _u \mathcal {F})^\ast v= - \frac{\delta \mathcal {X}}{\delta u}, \end{equation}$$ (A.3)
    Equation (A.2) reduces to δ X aug = δ X / δ m , δ m M $\delta \mathcal {X}_{\mathrm{aug}} = \left\langle {\delta \mathcal {X}} / {\delta m}, \delta m \right\rangle _M$ , where
    δ X δ m = ( m F ) * v + δ X δ m . $$\begin{equation} \frac{\delta \mathcal {X}}{\delta m} = (\partial _m \mathcal {F})^\ast v+ \frac{\delta \mathcal {X}}{\delta m}. \end{equation}$$ (A.4)
    Since δ X aug = δ X $\delta \mathcal {X}_{\mathrm{aug}} = \delta \mathcal {X}$ for perturbations constrained by the condition F = 0 ${\mathcal {F}} = 0$ , the value δ X / δ m ${\delta \mathcal {X}} / {\delta m}$ given by Equation (A.4) is the Riesz representation of m X $\partial _m \mathcal {X}$ . Equations (A.3) and (A.4) constitute the first-order adjoint-state method. Specifying parameters m $m$ and solving the equation  F ( u , m ) = 0 ${\mathcal {F}} (u, m) = 0$ for u $u$ , one finds the derivatives of X $\mathcal {X}$ and F $\mathcal {F}$ on the right-hand sides of Equations (A.3) and (A.4). Equation (A.3) can be solved for v $v$ and then Equation (A.4) to find δ X / δ m ${\delta \mathcal {X}} / {\delta m}$ .
    Further perturbations of m $m$ by δ m $\delta m^\prime$ , which are independent of δ m $\delta m$ , cause second-order changes of X ${\mathcal {X}}$ equal to
    δ 2 X = m m X ( m , m ) = m δ X δ m δ m , δ m M . $$\begin{equation} \delta ^2\mathcal {X}= \partial _{m m} \mathcal {X} (m^\prime, m) = {\left\langle \partial _m {\left(\frac{\delta \mathcal {X}}{\delta m} \right)} \delta m^\prime,\nobreakspace \delta m \right\rangle} _M. \end{equation}$$ (A.5)
    The second Fréchet derivative m m X $\partial _{m m} \mathcal {X}$ is a bilinear operator, which is called the Hessian. The first term in the angular brackets in Equation (A.5) describes the action of the Hessian on δ m $\delta m^\prime$ in the Riesz representation, and it is the same as H δ m ${\mathbf {H}}\,\delta {\mathbf {m}}^\prime$ in the notation used elsewhere in the paper. Differentiation of Equation (A.4) provides
    m δ X δ m δ m = ( m F ) * δ v + u δ X δ m δ u + m δ X δ m δ m + m u F δ u * v + m m F δ m * v . $$\begin{equation} \begin{split} \partial _m {\left(\frac{\delta \mathcal {X}}{\delta m} \right)} \delta m^\prime = &\ (\partial _m \mathcal {F})^\ast \delta v^{\star \prime } + \partial _u {\left(\frac{\delta \mathcal {X}}{\delta m} \right)} \delta u^\prime \\ & + \partial _m {\left(\frac{\delta \mathcal {X}}{\delta m} \right)} \delta m^\prime + {\left(\partial _{m u} \mathcal {F} \nobreakspace \delta u^\prime \right)}^\ast v\\ & + {\left(\partial _{m m} \mathcal {F} \nobreakspace \delta m^\prime \right)}^\ast v. \end{split}\end{equation}$$ (A.6)
    Here, δ v = m v δ m $\delta v^{\star \prime } = \partial _m v^{\star }\nobreakspace \delta m^\prime$ and δ u = m u δ m $\delta u^\prime = \partial _m u\nobreakspace \delta m^\prime$ are changes in v $v^{\star }$ and u $u$ associated with δ m $\delta m^\prime$ , m m X = m ( δ X / δ m ) $\partial _{mm} \mathcal {X}= \partial _m (\delta \mathcal {X}/ \delta m)$ and m u X = m ( δ X / δ u ) $\partial _{mu} \mathcal {X}= \partial _m (\delta \mathcal {X}/ \delta u)$ . Differentiating Equation (A.3), one finds the governing equation for the secondary adjoint field δ v $\delta v^{\star \prime }$ ,
    u F * δ v = u δ X δ u δ u m δ X δ u δ m u m F δ m * v u F δ u * v . $$\begin{equation} \begin{split} {\left(\partial _u \mathcal {F} \right)}^\ast \delta v^{\star \prime } =& - {\partial _u} {\left(\frac{\delta \mathcal {X}}{\delta u} \right)} \delta u^\prime - {\partial _m} {\left(\frac{\delta \mathcal {X}}{\delta u} \right)} \delta m^\prime \\ & - {\left(\partial _{um} \mathcal {F} \nobreakspace \delta m^\prime \right)}^\ast v^{\star } - {\left(\partial _u \mathcal {F} \nobreakspace \delta u^\prime \right)}^\ast v^{\star }. \end{split}\end{equation}$$ (A.7)
    Equations (A.6) and (A.7) constitute the second-order adjoint-state method. Such equations have been previously derived by Fichtner and Trampert (2011a) and Métivier et al. (2013) assuming that X $\mathcal {X}$ does not depend on m $m$ and using the linear approximation for the operator F ${\mathcal {F}}$ . Equation (50) of Fichtner and Trampert (2011a) for the corresponding ‘Hessian kernel’ is recovered by substituting Equation (A.6) into Equation (A.5), involving the adjoint relation and specifying the inner product in the S $S$ -space as the integral over space and time. Petra and Sachs (2021) present a more general derivation, applicable to Banach spaces. In Hilbert spaces, their results reduce to Equations (A.5)–(A.7). The values δ X / δ u $\delta \mathcal {X}/ \delta u$ and v $v$ are small near the extrema of X $\mathcal {X}$ . Neglecting such terms in Equations (A.6) and (A.7) produces the Gauss–Newton approximation for the Hessian action.
    FIGURE A1. Ellipses defining the uncertainty regions in the 2-parameter case. Conditional uncertainties are indicated by blue line segments, ending at blue dots. The marginal uncertainty in each model parameter determines the bounding boxes, drawn as dashed black rectangles. The red line segments indicate the uncertainty after compression. The endpoints, marked by red dots, are the intersections between the ellipse and a line at 45 $45^\circ$ . To map the result back to the original parameters, three options can be considered: coordinates of the intersection point following the red dashed line segments, intersection points of a red dotted circle centred at the minimum with the horizontal and vertical lines through the minimum, or their intersections with a red dash-dotted line perpendicular to the one at 45 $45^\circ$ through the compression result. The first always ends up inside the bounding box of the original ellipse, and the other two may lie inside or outside. In terms of m 1 $m_1$ , the three estimates may all lie outside (d) or inside (e) the original ellipse, or some inside and some outside, and similarly for m 2 $m_2$ .

    APPENDIX B: Minimisation of Combined Cost Functions

    Consider a cost function consisting of two terms
    X ( m ) = 1 2 m m pr T H pr m m pr + 1 2 F m d T W F m d . $$\begin{equation} \begin{split} {\mathcal {X}}({\mathbf {m}}) =& \tfrac{1}{2}{\left({\mathbf {m}}- {\mathbf {m}}_\mathrm{pr}\right)}^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}_\mathrm{pr}{\left({\mathbf {m}}- {\mathbf {m}}_\mathrm{pr}\right)} + \\ & \tfrac{1}{2}{\left({\mathbf {F}}{\mathbf {m}}- {\mathbf {d}}\right)}^{\scriptscriptstyle \mathsf {T}}{\mathbf {W}}{\left({\mathbf {F}}{\mathbf {m}}- {\mathbf {d}}\right)}. \end{split} \end{equation}$$ (B.1)
    Here, m pr ${\mathbf {m}}_\mathrm{pr}$ is the prior value of the model parameter m ${\mathbf {m}}$ , H pr ${\mathbf {H}}_\mathrm{pr}$ is the prior precision matrix and the second term describes the misfit between the linearly modelled data F m ${\mathbf {F}}{\mathbf {m}}$ and the measured data d ${\mathbf {d}}$ . The matrices H pr ${\mathbf {H}}_\mathrm{pr}$ and W ${\mathbf {W}}$ are symmetric positive semi-definite, and hence they can be represented as H pr = U pr T U pr ${\mathbf {H}}_\mathrm{pr}= {\mathbf {U}}_\mathrm{pr}^{\scriptscriptstyle \mathsf {T}}{\mathbf {U}}_\mathrm{pr}$ and W = U w T U w ${\mathbf {W}}= {\mathbf {U}}_\mathrm{w}^{\scriptscriptstyle \mathsf {T}}{\mathbf {U}}_\mathrm{w}$ . The gradient of the cost function (B.1) has the form
    m X ( m ) = H Σ m m pr + F T W F m pr d , $$\begin{equation} \nabla _{\mathbf {m}}{{\mathcal {X}}} ({\mathbf {m}})= {\bf H}_{\Sigma } {\left({\mathbf {m}}- {\mathbf {m}}_\mathrm{pr}\right)} + {\mathbf {F}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {W}}{\left({\mathbf {F}}\,{\mathbf {m}}_\mathrm{pr}- {\mathbf {d}}\right)}, \end{equation}$$ (B.2)
    where H Σ = H pr + H I ${\bf H}_{\Sigma } = {\mathbf {H}}_\mathrm{pr}+ {\mathbf {H}}_\mathrm{I}$ and H I = F T W F ${\mathbf {H}}_\mathrm{I}= {\mathbf {F}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {W}}{\mathbf {F}}$ . The least-squares solution of the equation m X ( m Σ ) = 0 $\nabla _{\mathbf {m}}{{\mathcal {X}}}({\mathbf {m}}_\Sigma) = 0$ is
    m Σ = m pr K F m pr d , $$\begin{equation} {\mathbf {m}}_\Sigma = {\mathbf {m}}_\mathrm{pr}- {\mathbf {K}}{\left({\mathbf {F}}{\mathbf {m}}_\mathrm{pr}- {\mathbf {d}}\right)}, \end{equation}$$ (B.3)
    where
    K = H Σ F T W = H Σ U I T U w $$\begin{equation} {\mathbf {K}}= {\bf H}_{\Sigma }^\dagger \nobreakspace {\mathbf {F}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {W}}= {\bf H}_{\Sigma }^\dagger \nobreakspace {\mathbf {U}}_\mathrm{I}^{\scriptscriptstyle \mathsf {T}}{\mathbf {U}}_\mathrm{w}^{\phantom{\dagger }} \end{equation}$$ (B.4)
    is the Kalman gain and U I = U w F ${\mathbf {U}}_\mathrm{I}= {\mathbf {U}}_\mathrm{w}{\mathbf {F}}$ is the square root of matrix H I ${\mathbf {H}}_\mathrm{I}$ . In fact, m X $\nabla _{\mathbf {m}}{{\mathcal {X}}}$ vanishes exactly at m = m Σ ${\mathbf {m}}= {\mathbf {m}}_\Sigma$ . Since H pr ${\mathbf {H}}_\mathrm{pr}$ and H I ${\mathbf {H}}_\mathrm{I}$ are symmetric positive semi-definite, the null space of H Σ ${\mathbf {H}}_\Sigma$ lies inside the null space of H I ${\mathbf {H}}_\mathrm{I}$ : N H Σ N H I $\mathcal {N}\left({\bf H}_{\Sigma } \right) \subset \mathcal {N}\left({\mathbf {H}}_\mathrm{I}\right)$ . Hence, the range of H I ${\mathbf {H}}_\mathrm{I}$ lies inside the range of H Σ ${\mathbf {H}}_\Sigma$ : R H I R H Σ $\mathcal {R}\left({\mathbf {H}}_\mathrm{I}\right) \subset \mathcal {R}\left({\mathbf {H}}_{\Sigma } \right)$ . The ranges of matrices have the properties R A B R A $\mathcal {R}\left({\mathbf {A}}{\mathbf {B}}\right) \subset \mathcal {R}\left({\mathbf {A}}\right)$ and R A T A = R A T $\mathcal {R}\left({\mathbf {A}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {A}}\right) = \mathcal {R}\left({\mathbf {A}}^{\scriptscriptstyle \mathsf {T}}\right)$ . From these relations, it follows that R F T W R F T U w T = R H I R H Σ $\mathcal {R}\left({\mathbf {F}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {W}}\right) \subset \mathcal {R}\left({\mathbf {F}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {U}}_\mathrm{w}^{\scriptscriptstyle \mathsf {T}}\right) = \mathcal {R}\left({\mathbf {H}}_\mathrm{I}\right) \subset \mathcal {R}\left({\mathbf {H}}_{\Sigma } \right)$ . Since the range of F T W ${\mathbf {F}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {W}}$ lies inside the range of H Σ ${\mathbf {H}}_\Sigma$ , this matrix does not change after applying the projection operator P Σ = H Σ H Σ ${\mathbf {P}}_\Sigma = {\bf H}_{\Sigma } {\bf H}_{\Sigma }^\dagger$ onto R ( H Σ ) $\mathcal {R}({\mathbf {H}}_\Sigma )$ . Substitution of F T W = P Σ F T W ${\mathbf {F}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {W}}= {\mathbf {P}}_\Sigma {\mathbf {F}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {W}}$ in Equation (B.2) shows that m X ( m Σ ) = 0 $\nabla _{\mathbf {m}}{{\mathcal {X}}} ({\mathbf {m}}_\Sigma) = 0$ . We take into account that H I = U I T U I ${\mathbf {H}}_\mathrm{I}= {\mathbf {U}}_\mathrm{I}^{\scriptscriptstyle \mathsf {T}}{\mathbf {U}}_\mathrm{I}$ and assume that U I T ${\mathbf {U}}_\mathrm{I}^{\scriptscriptstyle \mathsf {T}}$ lies in the column space of H pr ${\mathbf {H}}_\mathrm{pr}$ , that is, R ( H I ) R ( H pr ) $\mathcal {R}({\mathbf {H}}_\mathrm{I}) \subset \mathcal {R}({\mathbf {H}}_\mathrm{pr})$ and H pr H pr U I T = U I T ${\mathbf {H}}_\mathrm{pr}{\mathbf {H}}_\mathrm{pr}^\dagger {\mathbf {U}}_\mathrm{I}^{\scriptscriptstyle \mathsf {T}}= {\mathbf {U}}_\mathrm{I}^{\scriptscriptstyle \mathsf {T}}$ . In that case, the formula of the pseudo-inverse of a sum of symmetric matrices (Pringle and Rayner 1971), which is a generalisation of the Sherman–Morrison–Woodbury matrix identity, is applicable, and we have
    H Σ = H pr + U I T U I = H pr H pr U I T I + U I H pr U I T 1 U I H pr $$\begin{equation} {\mathbf {H}}_\Sigma ^\dagger = {\left({\mathbf {H}}_\mathrm{pr}+ {\mathbf {U}}_\mathrm{I}^{\scriptscriptstyle \mathsf {T}}{\mathbf {U}}_\mathrm{I} \right)}^\dagger = {\mathbf {H}}_\mathrm{pr}^\dagger - {\mathbf {H}}_\mathrm{pr}^\dagger {\mathbf {U}}_\mathrm{I}^{\scriptscriptstyle \mathsf {T}}{\left({\mathbf {I}}+ {\mathbf {U}}_\mathrm{I} {\mathbf {H}}_\mathrm{pr}^\dagger {\mathbf {U}}_\mathrm{I}^{\scriptscriptstyle \mathsf {T}}\right)}^{-1} {\mathbf {U}}_\mathrm{I} {\mathbf {H}}_\mathrm{pr}^\dagger \end{equation}$$ (B.5)
    with identity matrix I ${\mathbf {I}}$ . Then H Σ U I T = H pr U I T I + U I H pr U I T 1 ${\mathbf {H}}_\Sigma ^\dagger {\mathbf {U}}_\mathrm{I}^{\scriptscriptstyle \mathsf {T}}= {\mathbf {H}}_\mathrm{pr}^\dagger {\mathbf {U}}_\mathrm{I}^{\scriptscriptstyle \mathsf {T}}\left({\mathbf {I}}+ {\mathbf {U}}_\mathrm{I} {\mathbf {H}}_\mathrm{pr}^\dagger {\mathbf {U}}_\mathrm{I}^{\scriptscriptstyle \mathsf {T}}\right)^{-1}$ and K = H pr U I T I + U I H pr U I T 1 U w ${\mathbf {K}}= {\mathbf {H}}_\mathrm{pr}^\dagger {\mathbf {U}}_\mathrm{I}^{\scriptscriptstyle \mathsf {T}}\left({\mathbf {I}}+ {\mathbf {U}}_\mathrm{I} {\mathbf {H}}_\mathrm{pr}^\dagger {\mathbf {U}}_\mathrm{I}^{\scriptscriptstyle \mathsf {T}}\right)^{-1} {\mathbf {U}}_\mathrm{w}$ . Equation (B.5) can be written in terms of the Kalman gain and the covariance matrices as
    C Σ = I K F C pr , $$\begin{equation} {\mathbf {C}}_\Sigma = {\left({\bf I} - {\bf K} {{\mathbf {F}}} \right)} {\mathbf {C}}_\mathrm{pr}, \end{equation}$$ (B.6)
    where C Σ = H Σ ${\mathbf {C}}_\Sigma = {\mathbf {H}}_\Sigma ^\dagger$ and C pr = H pr ${\mathbf {C}}_\mathrm{pr}= {\mathbf {H}}_\mathrm{pr}^\dagger$ . Assuming that W = U w T U w ${\mathbf {W}}= {\mathbf {U}}_\mathrm{w}^{\scriptscriptstyle \mathsf {T}}{\mathbf {U}}_\mathrm{w}$ is non-singular, the Kalman gain can be cast into the standard form
    K = C pr F T W 1 + F C pr F T 1 . $$\begin{equation} {\mathbf {K}}= {\mathbf {C}}_\mathrm{pr}{{\mathbf {F}}}^{\scriptscriptstyle \mathsf {T}}{\left({\mathbf {W}}^{-1} + {{\mathbf {F}}} {\mathbf {C}}_\mathrm{pr}{{\mathbf {F}}}^{\scriptscriptstyle \mathsf {T}}\right)}^{-1}. \end{equation}$$ (B.7)
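    As a quick numerical sanity check (our sketch, not part of the derivation), Equations (B.4) and (B.7) give the same Kalman gain when W ${\mathbf {W}}$ is non-singular and the prior precision has full rank, so that the range condition holds trivially:

import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 6
F = rng.standard_normal((d, n))              # linear forward operator
H_pr = np.eye(n)                             # full-rank prior precision
W = np.diag(rng.uniform(0.5, 2.0, d))        # non-singular data weighting

C_pr = np.linalg.pinv(H_pr)
K_info = np.linalg.pinv(H_pr + F.T @ W @ F) @ F.T @ W                    # Equation (B.4)
K_cov = C_pr @ F.T @ np.linalg.inv(np.linalg.inv(W) + F @ C_pr @ F.T)    # Equation (B.7)
print(np.allclose(K_info, K_cov))            # True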

    APPENDIX C: Partitioning of Symmetric, Positive Semi-definite Matrices

    Any symmetric, positive semi-definite matrix H ${\mathbf {H}}$ allows for the representation H = U T U ${\mathbf {H}}= {{\mathbf {U}}}^{\scriptscriptstyle \mathsf {T}}{{\mathbf {U}}}$ , where U ${{\mathbf {U}}}$ is a positive semi-definite matrix, which is called the square root of H ${\mathbf {H}}$ . Using the partitioning U = U 1 , U 2 ${{\mathbf {U}}} = \left({{\mathbf {U}}}_1, {{\mathbf {U}}}_2 \right)$ , we cast H ${\mathbf {H}}$ in the block form,
    H = H 11 H 12 H 21 H 22 = U 1 T U 1 U 1 T U 2 U 2 T U 1 U 2 T U 2 . $$\begin{equation} {\mathbf {H}}= \def\eqcellsep{&}\begin{pmatrix} {\mathbf {H}}_{11} & {\mathbf {H}}_{12}\\ {\mathbf {H}}_{21} & {\mathbf {H}}_{22}\\ \end{pmatrix} = \def\eqcellsep{&}\begin{pmatrix} {\mathbf {U}}_1^{\scriptscriptstyle \mathsf {T}}{{\mathbf {U}}}_1 & {{\mathbf {U}}}_1^{\scriptscriptstyle \mathsf {T}}{{\mathbf {U}}}_2\\ {{\mathbf {U}}}_2^{\scriptscriptstyle \mathsf {T}}{{\mathbf {U}}}_1 & {{\mathbf {U}}}_2^{\scriptscriptstyle \mathsf {T}}{{\mathbf {U}}}_2\\ \end{pmatrix}. \end{equation}$$ (C.1)
    Taking into account the properties of pseudo-inverse
    A T A = A A T , A T A A = A T , A A A = A , $$\begin{equation} {\left({\bf A}^{\scriptscriptstyle \mathsf {T}}{\bf A} \right)}^\dagger = {\bf A}^\dagger {\left({\bf A}^{{\scriptscriptstyle \mathsf {T}}} \right)}^{\dagger },\nobreakspace \nobreakspace \nobreakspace {\bf A}^{{\scriptscriptstyle \mathsf {T}}} {\bf A} {\bf A}^{\dagger } = {\bf A}^{{\scriptscriptstyle \mathsf {T}}},\nobreakspace \nobreakspace \nobreakspace {\bf A} {\bf A}^{\dagger } {\bf A} = {\bf A}, \end{equation}$$ (C.2)
    one finds H 11 H 11 = U 1 T U 1 ( U 1 T U 1 ) = U 1 T U 1 U 1 U 1 T = U 1 T U 1 T ${\mathbf {H}}_{11} {\mathbf {H}}_{11}^\dagger = {{\mathbf {U}}}_1^{\scriptscriptstyle \mathsf {T}}{{\mathbf {U}}}_1 ({{\mathbf {U}}}_1^{\scriptscriptstyle \mathsf {T}}{{\mathbf {U}}}_1)^\dagger = \left({{\mathbf {U}}}_1^{\scriptscriptstyle \mathsf {T}}{{\mathbf {U}}}_1 {{\mathbf {U}}}_1^{\dagger }\right) \left({{\mathbf {U}}}_1^{{\scriptscriptstyle \mathsf {T}}} \right)^{ \dagger } = {{\mathbf {U}}}_1^{\scriptscriptstyle \mathsf {T}}\left({{\mathbf {U}}}_1^{{\scriptscriptstyle \mathsf {T}}} \right)^{ \dagger }$ and (see Theorem 9.1.6 of Albert 1972)
    H 11 H 11 H 12 = U 1 T U 1 T U 1 T U 2 = U 1 T U 2 = H 12 . $$\begin{equation} {\mathbf {H}}_{11} {\mathbf {H}}_{11}^\dagger {\mathbf {H}}_{12} = {\left[ {{\mathbf {U}}}_1^{\scriptscriptstyle \mathsf {T}}{\left({{\mathbf {U}}}_1^{{\scriptscriptstyle \mathsf {T}}} \right)}^{ \dagger } {{\mathbf {U}}}_1^{\scriptscriptstyle \mathsf {T}}\right]}{{\mathbf {U}}}_2 = {{\mathbf {U}}}_1^{\scriptscriptstyle \mathsf {T}}{{\mathbf {U}}}_2 = {\mathbf {H}}_{12}. \end{equation}$$ (C.3)
    Equation (C.3) implies R ( H 12 ) R ( H 11 ) $\mathcal {R}({\mathbf {H}}_{12}) \subset \mathcal {R}({\mathbf {H}}_{11})$ . Using Equation (C.3) one checks that the matrix H ${\mathbf {H}}$ is block-diagonalised by the transformation
    H ¯ = S T H S = H 11 0 0 H 22 H 21 H 11 H 12 , $$\begin{equation} {{\overline{\bf H}}}= {\bf S}^{\scriptscriptstyle \mathsf {T}}{\bf H} {\bf S} = \def\eqcellsep{&}\begin{pmatrix} \bf H_{11} & 0\\ 0 & {\bf H}_{22} - {\bf H}_{21} {\bf H}_{11}^\dagger {\bf H}_{12} \end{pmatrix}, \end{equation}$$ (C.4)
    where S ${\bf S}$ is a non-singular matrix,
    S = I H 11 H 12 0 I , S 1 = I H 11 H 12 0 I . $$\begin{equation} {\bf S} = \def\eqcellsep{&}\begin{pmatrix} \bf I & -{\bf H}_{11}^\dagger {\bf H}_{12}\\ 0 & {\bf I} \end{pmatrix},\qquad {\bf S}^{-1} = \def\eqcellsep{&}\begin{pmatrix} \bf I & {\bf H}_{11}^\dagger {\bf H}_{12}\\ 0 & {\bf I} \end{pmatrix}. \end{equation}$$ (C.5)
    From Equation (C.4), it follows that m T H m = m ¯ T H ¯ m ¯ ${\mathbf {m}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}{\mathbf {m}}= {{\overline{{\mathbf {m}}}}}^{\scriptscriptstyle \mathsf {T}}\, {{\overline{\bf H}}}\,{{\overline{{\mathbf {m}}}}}$ , where m ¯ = S 1 m ${{\overline{{\mathbf {m}}}}}= {\bf S}^{-1} {\mathbf {m}}$ . Partitioning the vector m ${\mathbf {m}}$ as m = ( m 1 , m 2 ) T ${\mathbf {m}}= ({\mathbf {m}}_1, {\mathbf {m}}_2)^{\scriptscriptstyle \mathsf {T}}$ we get
    m T H m = m ¯ 1 T H 11 m ¯ 1 + m 2 T H ¯ 22 m 2 . $$\begin{equation} {\mathbf {m}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}{\mathbf {m}}= {{\overline{{\mathbf {m}}}}}_1^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}_{11} {{\overline{{\mathbf {m}}}}}_1 + {\mathbf {m}}_2^{\scriptscriptstyle \mathsf {T}}{{\overline{\bf H}}}_{22} {\mathbf {m}}_2. \end{equation}$$ (C.6)
    Here, m ¯ 1 = m 1 + H 11 H 12 m 2 ${{\overline{{\mathbf {m}}}}}_1 = {\mathbf {m}}_1 + {\mathbf {H}}_{11}^{\dagger } {\mathbf {H}}_{12} {\mathbf {m}}_2$ and H ¯ 22 = H 22 H 21 H 11 H 12 ${{\overline{\bf H}}}_{22} = {\bf H}_{22} - {\bf H}_{21} {\bf H}_{11}^\dagger {\bf H}_{12}$ is the pseudo-Schur complement of H ${\mathbf {H}}$ with respect to H 11 ${\bf H}_{11}$ .
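    The block-diagonalisation (C.4) can be verified numerically; a minimal sketch (ours) with a random rank-deficient H ${\mathbf {H}}$ :

import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((3, 5))
H = U.T @ U                                   # symmetric PSD, rank at most 3

H11, H12 = H[:2, :2], H[:2, 2:]
H21, H22 = H[2:, :2], H[2:, 2:]
S = np.block([[np.eye(2), -np.linalg.pinv(H11) @ H12],
              [np.zeros((3, 2)), np.eye(3)]])
H_bar = S.T @ H @ S
# Off-diagonal blocks vanish because H11 H11^+ H12 = H12 (Equation C.3); the
# lower-right block equals the pseudo-Schur complement H22 - H21 H11^+ H12.
print(np.allclose(H_bar[:2, 2:], 0.0), np.allclose(H_bar[2:, :2], 0.0))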

    APPENDIX D: Back to the Modelling Grid

    In general, there is no obvious relation between the conditional and marginal uncertainties obtained from the compressed Hessian and those from the Hessian for the modelling grid. Nevertheless, we may consider three options to map the compressed results back to the original modelling grid: based on Q T ${\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}$ , just copy, or using the projected H p ${\mathbf {H}}_\mathrm{p}$ . A pictorial description for the two-parameter case is provided to illustrate the difference between the resulting uncertainties.

    Figure A1 shows the ellipses described by X = 1 2 ( m m 0 ) T H ( m m 0 ) = ε 0 $\mathcal {X}=\tfrac{1}{2}({\mathbf {m}}-{\mathbf {m}}_0)^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}({\mathbf {m}}-{\mathbf {m}}_0)=\varepsilon _0$ for two parameters, m = m 1 , m 2 T ${\mathbf {m}}=\left(m_1,m_2\right)^{\scriptscriptstyle \mathsf {T}}$ , with ε 0 = 1 $\varepsilon _0=1$ , and with Hessians given by, respectively,
    H = 0.63 0.36 0.36 0.90 , H = 1.05 0.60 0.60 1.50 , H = 8.0 5.5 5.5 4.3 , H = 8.0 3.0 3.0 1.5 , H = 0.4 0.2 0.2 4.0 , H = 0.8 0 0 0.8 . $$\begin{equation} \begin{split} {\mathbf {H}}\!&=\!\def\eqcellsep{&}\begin{pmatrix} 0.63&0.36\\ 0.36&0.90 \end{pmatrix}\!,\ {\mathbf {H}}\!=\!\def\eqcellsep{&}\begin{pmatrix} 1.05&-0.60\\ -0.60&1.50 \end{pmatrix}\!,\ {\mathbf {H}}\!=\!\def\eqcellsep{&}\begin{pmatrix} 8.0&5.5\\ 5.5&4.3 \end{pmatrix}\!,\\ {\mathbf {H}}\!&=\!\def\eqcellsep{&}\begin{pmatrix} 8.0&-3.0\\ -3.0&1.5 \end{pmatrix},\ {\mathbf {H}}\!=\!\def\eqcellsep{&}\begin{pmatrix} 0.4& 0.2\\ 0.2&4.0 \end{pmatrix},\ {\mathbf {H}}\!=\!\def\eqcellsep{&}\begin{pmatrix} 0.8&0\\ 0&0.8 \end{pmatrix}. \end{split}\end{equation}$$ (D.1)
    The conditional uncertainties defined by X ε 0 $\mathcal {X}\le \varepsilon _0$ are | m j m 0 , j | [ 2 ε 0 / diag ( H ) ] 1 / 2 $|{m}_{j}-{m}_{0,j}|\le {[2{\varepsilon}_{0}/\mathrm{diag}(\mathbf{H})]}^{1/2}$ , j = 1 , 2 $j=1,2$ , and are drawn as blue line segments, ending at the blue dots. The marginal uncertainties are given by | m j m 0 , j | [ 2 ε 0 diag ( H ) ] 1 / 2 $|{m}_{j}-{m}_{0,j}|\le {[2{\varepsilon}_{0}\mathrm{diag}({\mathbf{H}}^{\ensuremath{\dag}})]}^{1/2}$ . The corresponding line segments are the horizontal and vertical lines through the minimum at the centre of the ellipse, bounded by the dashed black bounding box of the ellipse. The intersection points of the bounding box with the ellipse are marked by black dots.

    The compression matrix for 2 points is Q = 1 , 1 / 2 ${\mathbf {Q}}=\left(1,1\right)/\sqrt {2}$ . Its application to the model parameters, relative to the minimum, yields a line at 45 $45^\circ$ . The compressed Hessian is H c = Q H Q T ${\mathbf {H}}_\mathrm{c}={\mathbf {Q}}{\mathbf {H}}{\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}$ . Since it is a 1 × 1 $1\times 1$ matrix in this example, the related conditional and marginal uncertainty are the same: | m c m c , 0 | [ 2 ε 0 / diag ( H c ) ] 1 / 2 = [ 2 ε 0 diag ( H c ) ] 1 / 2 $|{m}_{c}-{m}_{c,0}|\le {[2{\varepsilon}_{0}/\mathrm{diag}({\mathbf{H}}_{\mathrm{c}})]}^{1/2}={[2{\varepsilon}_{0}\mathrm{diag}({\mathbf{H}}_{\mathrm{c}}^{\ensuremath{\dag}})]}^{1/2}$ . In the figures, they are indicated by the drawn red line segments. The endpoints always lie inside the bounding box defined by the marginal uncertainties.

    How can these results be mapped back to the original coordinates m 1 $m_1$ and m 2 $m_2$ ? The red endpoints of the line segments can originate from any ellipse passing through them with its centre at the midpoint of the segment and, therefore, this question cannot be answered uniquely. However, three options can be considered and are sketched as examples in Figure A1.

    (1) Take the coordinates of the compressed result, marked by the red dots: ${\mathbf {m}}-{\mathbf {m}}_0={\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}({\mathbf {m}}_\mathrm{c}-{\mathbf {m}}_{\mathrm{c},0})$. This is sketched by the dashed red lines ending at the red open circles. If the original ellipse happens to be identical to the line segment, with a short axis at $-45^\circ$ of zero length, these coordinates determine the marginal uncertainty and the conditional uncertainty vanishes.

    (2) Assume the ellipse to be a circle. This is sketched by the dotted circle, intersecting the horizontal and vertical lines through the minimum at the small red open circles. Not all intersections are drawn.

    (3) The conditional uncertainty can be based on ${\mathbf {H}}_\mathrm{p}={\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}_\mathrm{c}{\mathbf {Q}}={\mathbf {P}}{\mathbf {H}}{\mathbf {P}}$ with ${\mathbf {P}}={\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {Q}}$. This is represented by the dash-dotted red lines ending at the red open circles.

    The three maps result in endpoints of the conditional uncertainty ranges that are progressively larger. Those for the first option always lie inside the bounding box of the original ellipse. For the other two, they can be inside or outside. Figure A1 depicts several cases, the last one being a circle. The points may all end up outside the ellipse (d) or inside it (e). With only two parameters, consecutive estimates differ by a factor of $\sqrt {2}$. With $n_\mathrm{f}$ parameters, the factor is $\sqrt {n_\mathrm{f}}$, which becomes quite large for $n_\mathrm{f}\gg 1$.
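
    A sketch of the three maps, again for the first Hessian of Equation (D.1) and with our own variable names, reproduces this factor of $\sqrt{2}$ between consecutive options; it is intended only as an illustration under these assumptions.

        import numpy as np

        # Two-parameter example: compression onto the 45-degree line.
        H = np.array([[0.63, 0.36],
                      [0.36, 0.90]])
        eps0 = 1.0
        Q = np.array([[1.0, 1.0]]) / np.sqrt(2.0)   # 1 x 2 compression matrix
        Hc = Q @ H @ Q.T                            # 1 x 1 compressed Hessian
        sigma_c = np.sqrt(2.0 * eps0 / Hc[0, 0])    # compressed uncertainty (red segment)

        # Option 1: copy back with Q^T.
        opt1 = np.abs(Q.T[:, 0]) * sigma_c
        # Option 2: assume the ellipse is a circle of radius sigma_c.
        opt2 = np.full(2, sigma_c)
        # Option 3: conditional uncertainty from the projected Hessian H_p = P H P.
        P = Q.T @ Q
        Hp = P @ H @ P
        opt3 = np.sqrt(2.0 * eps0 / np.diag(Hp))

        print(opt1, opt2, opt3)  # each estimate is a factor sqrt(2) larger than the previous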

    APPENDIX E: Low-Storage Use of the Projected Hessian

    At first sight, the projected Hessian ${\mathbf {H}}_\mathrm{p}$ requires as much storage as the Hessian on the modelling grid. Here, we explain how this can be avoided by using only the compressed Hessian ${\mathbf {H}}_\mathrm{c}$. The description is for the single-parameter case, but the generalisation to the multi-parameter case is straightforward.

    The set of $n$ grid points is partitioned into $m$ segments or geological units. The indicator matrix $\mathbf {X}$ has $\mathrm{x}_{j,i}=1$ if point $i$ lies in segment $j$, for $i=1,\ldots,n$ and $j=1,\ldots,m$; otherwise, $\mathrm{x}_{j,i}=0$. The vector $\mathbf {n}_\mathrm{f}$ contains the number of grid points $n_{\mathrm{f},j}=\sum _{i=1}^n \mathrm{x}_{j,i}$ contained in each segment, with $\sum _{j=1}^m n_{\mathrm{f},j}=n$. The simplest prolongation operator is a piecewise constant interpolation, which amounts to copying and is represented by $\mathbf {X}^{\scriptscriptstyle \mathsf {T}}$. The related restriction operator is $\mathbf {R}=\mathbf {N}_\mathrm{f}^{-1} \mathbf {X}$, where $\mathbf {N}_\mathrm{f}$ is the diagonal matrix with $\mathrm{diag}(\mathbf {N}_\mathrm{f})=\mathbf {n}_\mathrm{f}$, representing the arithmetic mean. Its orthonormal version is denoted by ${\mathbf {Q}}=\mathbf {N}_\mathrm{f}^{-1/2} \mathbf {X}$.

    For an $n\times n$ non-negative symmetric matrix ${\mathbf {H}}$, the compressed version is defined as ${\mathbf {H}}_\mathrm{c}={\mathbf {Q}}{\mathbf {H}}{\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}$ and the projected version as ${\mathbf {H}}_\mathrm{p}={\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}_\mathrm{c}{\mathbf {Q}}={\mathbf {P}}{\mathbf {H}}{\mathbf {P}}$ with projection matrix ${\mathbf {P}}={\mathbf {Q}}^{\scriptscriptstyle \mathsf {T}}{\mathbf {Q}}$. With the current restriction and prolongation operators, the operator ${\mathbf {P}}$ applied to model parameters ${\mathbf {m}}$ in the single-parameter case replaces the $n_{\mathrm{f},j}$ values in unit $j$ by their average $n_{\mathrm{f},j}^{-1}\sum _{i(j)}^{n_{\mathrm{f},j}} m_{i}$, where $i(j)$ enumerates the $n_{\mathrm{f},j}$ grid points contained in segment $j$. In this way, the projection ${\mathbf {P}}$ acts as a spatial high-cut filter, removing the shorter wavelengths. If we define ${\mathbf {H}}_\mathrm{r}={\mathbf {R}}\,{\mathbf {H}}\,{\mathbf {R}}^{\scriptscriptstyle \mathsf {T}}=\mathbf {N}_\mathrm{f}^{-1}\,\mathbf {X}\,{\mathbf {H}}\,\mathbf {X}^{\scriptscriptstyle \mathsf {T}}\mathbf {N}_\mathrm{f}^{-1}=\mathbf {N}_\mathrm{f}^{-1/2}{\mathbf {H}}_\mathrm{c}\,\mathbf {N}_\mathrm{f}^{-1/2}$ with ${\mathbf {H}}_\mathrm{c}=\mathbf {N}_\mathrm{f}^{-1/2}\mathbf {X}\,{\mathbf {H}}\,\mathbf {X}^{\scriptscriptstyle \mathsf {T}}\mathbf {N}_\mathrm{f}^{-1/2}$, then ${\mathbf {H}}_\mathrm{p}=\mathbf {X}^{\scriptscriptstyle \mathsf {T}}{\mathbf {H}}_\mathrm{r}\,\mathbf {X}$, showing that ${\mathbf {H}}_\mathrm{p}$ consists of blocks, with block $(j,k)$ containing $n_{\mathrm{f},j}\times n_{\mathrm{f},k}$ copies of $({\mathbf {H}}_\mathrm{r})_{j,k}$.

    To find the conditional uncertainty for a single parameter, we fix all other parameters, select one value on the main diagonal of the Hessian and solve for $\sigma _i$ from $\tfrac{1}{2}\sigma _i h_{i,i}\sigma _i=\varepsilon ^{\prime }\mathcal {X}_0$, where $\mathcal {X}_0$ is the data energy in the reference or background model. Let the diagonal element $h_{i,i}$ of the original Hessian ${\mathbf {H}}$ be replaced by that of the projected version ${\mathbf {H}}_\mathrm{p}$. The diagonal of ${\mathbf {H}}_\mathrm{p}$ has groups of $n_{\mathrm{f},j}$ identical values, equal to $({\mathbf {H}}_\mathrm{r})_{j,j}$. We can, therefore, determine the above conditional $\sigma$-values for ${\mathbf {H}}_\mathrm{r}$ instead of ${\mathbf {H}}_\mathrm{p}$ and interpolate them to the original grid with $\mathbf {X}^{\scriptscriptstyle \mathsf {T}}$, that is, $\sigma _{i}=[2\varepsilon ^{\prime }\mathcal {X}_{0}/(\mathbf {X}^{\scriptscriptstyle \mathsf {T}}\mathrm{diag}({\mathbf {H}}_\mathrm{r}))_{i}]^{1/2}$. As $\mathrm{diag}({\mathbf {H}}_\mathrm{r})$ is a vector, the storage requirements for this operation are low.
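
    A minimal sketch of this low-storage route for a hypothetical single-parameter toy problem is given below. The function name segment_sigmas, the labels array and the random Hessian are our own illustrative choices; in practice, ${\mathbf {H}}_\mathrm{c}$ would be assembled from Hessian-vector products rather than from an explicitly stored ${\mathbf {H}}$.

        import numpy as np

        def segment_sigmas(H, labels, eps_prime, X0):
            # labels[i] gives the geological unit of grid point i (0, ..., m-1).
            n = H.shape[0]
            m = labels.max() + 1
            # Indicator matrix X (m x n): X[j, i] = 1 if grid point i lies in unit j.
            X = np.zeros((m, n))
            X[labels, np.arange(n)] = 1.0
            nf = X.sum(axis=1)                       # number of grid points per unit
            Q = X / np.sqrt(nf)[:, None]             # orthonormal restriction Q = N_f^{-1/2} X
            Hc = Q @ H @ Q.T                         # compressed Hessian (m x m)
            Hr = Hc / np.sqrt(np.outer(nf, nf))      # H_r = N_f^{-1/2} H_c N_f^{-1/2}
            # Conditional sigma per unit, copied back to the grid with X^T; H_p is never formed.
            sigma_r = np.sqrt(2.0 * eps_prime * X0 / np.diag(Hr))
            return X.T @ sigma_r

        # Tiny example: 6 grid points in 2 units, with a made-up symmetric Hessian.
        labels = np.array([0, 0, 0, 1, 1, 1])
        rng = np.random.default_rng(0)
        A = rng.standard_normal((6, 6))
        H = A @ A.T                                  # symmetric, non-negative definite
        print(segment_sigmas(H, labels, eps_prime=0.1, X0=1.0))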

    Data Availability Statement

    Data sharing is not applicable to this article as no new data were created or analysed in this study. However, if the computational results shown in the figures are considered as new data, then the authors elect not to share those.
