Volume 93, Issue 2 pp. 539-568
Original Articles
Open Access

Double Robust Bayesian Inference on Average Treatment Effects

Christoph Breunig (Corresponding Author), Department of Economics, University of Bonn

Ruixuan Liu, CUHK Business School, Chinese University of Hong Kong

Zhengfei Yu, Faculty of Humanities and Social Sciences, University of Tsukuba
First published: 29 March 2025
We thank the anonymous reviewers, as well as Xiaohong Chen, Yanqin Fan, Essie Maasoumi, Yichong Zhang, and numerous seminar and conference participants for helpful comments and illuminating discussions. Breunig gratefully acknowledges the support of the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC-2047/1 – 390685813. Yu gratefully acknowledges the support of JSPS KAKENHI Grant Number 21K01419. Funding Statement: Open Access funding enabled and organized by Projekt DEAL. WOA Institution: Rheinische Friedrich-Wilhelms-Universität Bonn Consortia Name: Projekt DEAL.

Abstract

We propose a double robust Bayesian inference procedure on the average treatment effect (ATE) under unconfoundedness. For our new Bayesian approach, we first adjust the prior distributions of the conditional mean functions, and then correct the posterior distribution of the resulting ATE. Both adjustments make use of pilot estimators motivated by the semiparametric influence function for ATE estimation. We prove asymptotic equivalence of our Bayesian procedure and efficient frequentist ATE estimators by establishing a new semiparametric Bernstein–von Mises theorem under double robustness; that is, the lack of smoothness of conditional mean functions can be compensated by high regularity of the propensity score and vice versa. Consequently, the resulting Bayesian credible sets form confidence intervals with asymptotically exact coverage probability. In simulations, our method provides precise point estimates of the ATE through the posterior mean and delivers credible intervals that closely align with the nominal coverage probability. Furthermore, our approach achieves a shorter interval length in comparison to existing methods. We illustrate our method in an application to the National Supported Work Demonstration following LaLonde (1986) and Dehejia and Wahba (1999).

1 Introduction

This paper proposes a double robust Bayesian approach for estimating the average treatment effect (ATE) under unconfoundedness, given a set of pretreatment covariates. Our new Bayesian procedure involves both prior and posterior adjustments. First, following Ray and van der Vaart ( 2020 ), we adjust the prior distributions of the conditional mean function using an estimator of the propensity score. Second, we use this propensity score estimator together with a pilot estimator of the conditional mean to correct the posterior distribution of the ATE. The adjustments in both steps are closely related to the functional form of the semiparametric influence function for ATE estimation under unconfoundedness. They not only shift the center but also change the shape of the posterior distribution. For our robust Bayesian procedure, we derive a new Bernstein–von Mises (BvM) theorem, which states that this posterior distribution, when centered at any efficient estimator, is asymptotically normal with the efficient variance in the semiparametric sense. The key innovation of our paper is that this result holds under double robust smoothness assumptions within the Bayesian framework.

Despite the recent success of Bayesian methods, the literature on ATE estimation remains predominantly frequentist. For the missing data problem specifically, it has been shown that conventional Bayesian approaches (i.e., those using uncorrected priors) can produce inconsistent estimates unless unnecessarily strong smoothness conditions are imposed on the underlying functions; see the results and discussion in Robins and Ritov ( 1997 ) or Ritov, Bickel, Gamst, and Kleijn ( 2014 ). By adjusting the prior distribution using a preestimated propensity score, Ray and van der Vaart ( 2020 ) recently established a novel semiparametric BvM theorem under weaker smoothness requirements for the propensity score function. However, a minimum differentiability of order p/2 is still required for the conditional mean function in the outcome equation, where p denotes the dimensionality of covariates. In this paper, we are interested in Bayesian inference under double robustness, which allows for a trade-off between the required levels of smoothness in the propensity score and the conditional mean functions.

Under double robust smoothness conditions, we show that Bayesian methods, which use propensity score adjusted priors as in Ray and van der Vaart ( 2020 ), satisfy the BvM theorem only up to a “bias term” depending on the unknown true conditional mean and propensity score functions. In this paper, our robust Bayesian approach accounts for this bias term in the BvM theorem by considering an explicit posterior correction. Both the prior adjustment and the posterior correction are based on functional forms that are closely related to the efficient influence function for the ATE in Hahn ( 1998 ). We show that the corrected posterior satisfies the BvM theorem under double robust smoothness assumptions. Our novel procedure combines the advantages of Bayesian methodology with the robustness features that are the strengths of frequentist procedures. Our credible intervals are Bayesianly justifiable in the sense of Rubin ( 1984 ), as the uncertainty quantification is conducted conditionally on the observed data, and they can also be interpreted as frequentist confidence intervals with asymptotically exact coverage probability. Our procedure is inspired by insights from the double machine learning (DML) literature, as well as the bias-corrected matching approach of Abadie and Imbens ( 2011 ), since our robustification of an initial procedure removes some nonnegligible bias and remains asymptotically valid under weaker regularity conditions. While the main part of our theoretical analysis focuses on the ATE of binary outcomes, also considered by Ray and van der Vaart ( 2020 ), we outline extensions of our methodology to continuous and multinomial cases, as well as to other causal parameters.

In both simulations and an empirical illustration using the National Supported Work Demonstration data, we provide evidence that our procedure performs well compared to existing Bayesian and frequentist approaches. In our Monte Carlo simulations, we find that our method results in improved empirical coverage probabilities, while maintaining very competitive lengths for confidence intervals. This finite sample advantage is also observed over Bayesian methods that rely solely on prior corrections. In particular, we note that our approach leads to more accurate uncertainty quantification and is less sensitive to estimated propensity scores being close to boundary values.

The BvM theorem for parametric Bayesian models is well established; see, for instance, van der Vaart ( 1998 ). Its semiparametric version is still being studied very actively when nonparametric priors are used ( Castillo ( 2012 ), Castillo and Rousseau ( 2015 ), Ray and van der Vaart ( 2020 )). To the best of our knowledge, our new semiparametric BvM theorem is the first one that possesses the double robustness property. Our paper is also connected to another active research area concerning Bayesian inference for parameters in econometric models, which is robust to partial or weak identification ( Chen, Christensen, and Tamer ( 2018 ), Giacomini and Kitagawa ( 2021 ), Andrews and Mikusheva ( 2022 )). The framework and the approach we take are different. Nonetheless, they share the same scope of tailoring the Bayesian inference procedure to new challenges in contemporary econometrics.

2 Setup and Implementation

This section provides the main setup of the average treatment effect (ATE). We motivate the new Bayesian methodology and detail the practical implementation.

2.1 Setup

We consider a family of probability distributions for some parameter space , where the (possibly infinite dimensional) parameter η characterizes the probability model. Let be the true value of the parameter and denote , which corresponds to the frequentist distribution generating the observed data.

For individual i, consider a treatment indicator . The observed outcome is determined by where are the potential outcomes of individual i associated with or 0. We now focus on the binary outcome case where both and take values in . An extension to multinomial or continuous outcomes is provided in Section 6 . The covariates for individual i are denoted by , a vector of dimension p, with the distribution and the density . Let denote the propensity score and the conditional mean. Suppose that the researcher observes independent and identically distributed (i.i.d.) observations of for . The joint density of is given by where
(2.1)
The parameter of interest is the ATE given by , where denotes the expectation under . For its identification, we impose the following standard assumption of unconfoundedness and overlap ( Rosenbaum and Rubin ( 1984 ), Imbens ( 2004 ), Imbens and Rubin ( 2015 )).

Assumption 1.(i) and (ii) there exists such that for all x in the support of .

We introduce additional notation from the Bayesian perspective, following a similar setup in Ray and van der Vaart ( 2020 ). For the purpose of assigning prior distributions to in the Bayesian procedure, it is convenient to transform them by a link function. We make use of the logistic link function here. Specifically, we consider the reparametrization of given by . We index the probability model as , in line with the notation introduced in the first paragraph of this section, where
(2.2)
Below, we write , , and to make the dependence on η explicit. Given any prior on the triplet , Bayesian inference on the ATE is achieved by deriving the posterior distribution of
(2.3) τ_η = E_η[ m_η(1, X) − m_η(0, X) ],
where denotes the expectation under . Our aim is to examine the large-sample behavior of the posterior of under the true probability distribution . In the same vein, the true parameter of interest becomes .
The construction of our double robust Bayesian procedure in Section 2.2 has a fundamental connection to the efficient influence function. For any parameter η, the efficient influence function ( Hahn ( 1998 ), Hirano, Imbens, and Ridder ( 2003 )) is
(2.4) ψ̃_η(z) = m_η(1, x) − m_η(0, x) − τ_η + γ_η(d, x){y − m_η(d, x)},
for the Riesz representer γ_η, which is given by
(2.5) γ_η(d, x) = d/π_η(x) − (1 − d)/(1 − π_η(x)).
We write m_0 = m_{η_0} and γ_0 = γ_{η_0}. Both the prior adjustment and posterior correction of our approach require a pilot estimator for γ_0. Under Assumption 1 , the true Riesz representer γ_0 is well-defined.
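As a concrete numerical aid, the Riesz representer for the ATE functional, γ(d, x) = d/π(x) − (1 − d)/(1 − π(x)), can be evaluated directly once propensity scores are available. The following is a minimal sketch; the function name and the trimming level `eps` are our own illustrative choices, with trimming reflecting the overlap condition in Assumption 1:

```python
import numpy as np

def riesz_representer(d, pi, eps=0.01):
    """ATE Riesz representer gamma(d, x) = d/pi(x) - (1 - d)/(1 - pi(x)).

    d   : binary treatment indicators (0/1)
    pi  : propensity scores pi(x) in (0, 1)
    eps : trimming level guarding against boundary values (overlap)
    """
    pi = np.clip(pi, eps, 1.0 - eps)
    return d / pi - (1 - d) / (1 - pi)

# A treated unit with pi = 0.25 receives weight 1/0.25 = 4,
# a control unit with the same pi receives weight -1/0.75 = -4/3.
gamma = riesz_representer(np.array([1, 0]), np.array([0.25, 0.25]))
```

The trimming step is purely a numerical safeguard; under Assumption 1 the true propensity score is bounded away from 0 and 1.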

2.2 Double Robust Bayesian Point Estimators and Credible Sets

We build upon the ATE expression in ( 2.3 ) to develop our doubly robust inference procedure. Our approach is based on nonparametric prior processes for and . For the latter, we consider the Dirichlet process, which is a default prior on spaces of probability measures. This choice is also convenient for posterior computation via the Bayesian bootstrap; see Remark 2.1 . For the former, we make use of Gaussian process priors, along with an adjustment that involves a preliminary estimator of . Gaussian process priors are also closely related to spline smoothing, as discussed in Wahba ( 1990 ). Their posterior contraction properties (see Ghosal and van der Vaart ( 2017 )), together with excellent finite sample behavior (see Rasmussen and Williams ( 2006 )), make Gaussian process priors popular in the related literature. Since does not depend on , the specification of a prior on the propensity score is not required.

We consider pilot estimators π̂ of the propensity score and m̂ of the conditional mean function, both of which are based on an auxiliary sample. We consider the plug-in estimator of the Riesz representer given by
(2.6) γ̂(d, x) = d/π̂(x) − (1 − d)/(1 − π̂(x)).
Below, let denote the sample average of the absolute value of , which we use for scale normalization in our prior adjustment (see Section 4.2 for details). The use of auxiliary data for the pilot estimators simplifies the technical analysis related to the propensity score adjusted priors; see Ray and van der Vaart ( 2020 ). It also provides an effective way to control certain negligible higher-order terms; see our Lemma C.2 in the Supplemental Material ( Breunig, Liu, and Yu ( 2025 )) and the related discussion of sample splitting in DML-type methods on page C6 of Chernozhukov et al. ( 2018 ). In practice, we use the full data twice and do not split the sample, as we have not observed any overfitting or loss of coverage thereby. Algorithm 1 describes our double robust Bayesian inference procedure.

Algorithm 1: Double Robust Bayesian Procedure.

Given the draws from the corrected posterior calculated in Algorithm 1 , we obtain the point estimate and credible set as follows. The Bayesian point estimator is . The credible set for the ATE parameter is given by
where denotes the ath quantile of .
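Computing the posterior-mean point estimator and the equal-tailed credible set from the draws of the corrected posterior reduces to averaging and taking empirical quantiles. A minimal sketch, with stand-in normal draws in place of the draws produced by Algorithm 1:

```python
import numpy as np

def point_and_credible(tau_draws, alpha=0.05):
    """Posterior mean and equal-tailed (1 - alpha) credible interval
    computed from posterior draws of the ATE."""
    tau_hat = tau_draws.mean()
    lo, hi = np.quantile(tau_draws, [alpha / 2, 1 - alpha / 2])
    return tau_hat, (lo, hi)

# Stand-in draws; in the actual procedure these come from Algorithm 1.
rng = np.random.default_rng(0)
draws = rng.normal(loc=1.8, scale=0.2, size=10_000)
tau_hat, ci = point_and_credible(draws)
```

By Corollary 3.1, an interval of this form also has asymptotically exact frequentist coverage.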

For the implementation of our pilot estimator given in ( 2.6 ), we recommend using propensity scores estimated by the logistic Lasso. For the implementation of the pilot estimator , we adopt the posterior mean of generated from a Gaussian process prior without adjustment, as in Ghosal and Roy ( 2006 ). Section 4.2 provides more implementation details. To approximate the posterior distribution, we make use of the Laplace approximation, but one can also resort to Markov chain Monte Carlo (MCMC) algorithms. The parameter controls the weight placed on the prior adjustment relative to the standard unadjusted prior on (e.g., a Gaussian prior with a squared exponential covariance function). Regarding the tuning parameter , we emphasize that our finite sample results are not sensitive to its choice, as shown in Supplemental Appendix H.

Remark 2.1. (Bayesian Bootstrap)Under unconfoundedness and the reparametrization in ( 2.2 ), the ATE can be written as . With independent priors on and , their posteriors also become independent. It is thus sufficient to consider the posterior for and separately. We place a Dirichlet process prior for with the base measure set to zero. Consequently, the posterior law of coincides with the Bayesian bootstrap introduced by Rubin ( 1981 ); also see Chamberlain and Imbens ( 2003 ). One key advantage of the Bayesian bootstrap is that it allows us to incorporate a broad class of data generating processes, whose posterior can be easily sampled. Replacing by the standard empirical cumulative distribution function does not provide sufficient randomization of , as it yields an underestimation of the asymptotic variance; see Ray and van der Vaart ( 2020 , p. 3008). In principle, one could consider other types of bootstrap weights; however, these generally do not correspond to the posterior of any prior distribution.
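The Bayesian bootstrap draws described in this remark amount to Dirichlet(1, …, 1) weights on the observations (Rubin ( 1981 )). A minimal sketch, with stand-in values for the conditional mean functions (in the actual procedure these come from draws of the adjusted Gaussian process posterior):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 500, 2000
x = rng.uniform(size=n)
# Stand-in conditional means m(1, x) and m(0, x); purely illustrative.
m1, m0 = 1.0 + x, x ** 2

tau_draws = np.empty(B)
for b in range(B):
    w = rng.dirichlet(np.ones(n))   # Bayesian bootstrap weights (Rubin, 1981)
    tau_draws[b] = w @ (m1 - m0)    # draw of the integral of m1 - m0 over F
```

Each draw replaces the empirical measure with a Dirichlet-weighted measure, which provides the randomization of the covariate distribution discussed above.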

3 Main Theoretical Results

In this section, we derive the Bernstein–von Mises (BvM) theorem, which establishes the asymptotic equivalence between our Bayesian procedure and frequentist-type semiparametric efficient procedures for the ATE. We consider an asymptotically efficient estimator with the following linear representation:
(3.1) τ̂ = τ_0 + (1/n) Σ_{i=1}^n ψ̃_{η_0}(Z_i) + o_{P_0}(n^{−1/2}),
where ψ̃_{η_0} is the efficient influence function given in ( 2.4 ). Below, we denote . By virtue of the BvM theorem, the two conditional distributions and are asymptotically equivalent. Another important consequence of the BvM theorem concerns the asymptotic normality and efficiency of the Bayesian point estimator. That is, is asymptotically normal with mean zero and variance . Thus, achieves the semiparametric efficiency bound of Hahn ( 1998 ).

3.1 Least Favorable Direction

Our prior correction through the Riesz representer is motivated by the least favorable direction of Bayesian submodels. We first provide such least favorable calculations, which are closely linked to the semiparametric efficiency. Consider the one-dimensional submodel defined by the path
(3.2)
for a given direction with . The difficulty of estimating the parameter along the submodels depends on the direction . Among them, let denote the least favorable direction, associated with the most difficult submodel; it yields the largest optimal asymptotic variance for estimating among all submodels. Let denote the joint density of Z depending on . Taking the derivative of the logarithmic density with respect to t and evaluating it at t = 0 gives the score operator:
(3.3)
where , , and . The least favorable direction is defined as the solution of the equation ; see Ghosal and van der Vaart ( 2017 , page 370). We immediately obtain the following.

Lemma 3.1. Consider the submodel ( 3.2 ). Let Assumption 1 hold for any η under consideration. Then the least favorable direction for estimating the ATE parameter in ( 2.3 ) is:

(3.4)
where the Riesz representer is given in ( 2.5 ).

Lemma 3.1 motivates the adjustment of the prior distribution as considered in our Bayesian procedure in Section 2.2 . Our prior correction, which takes the form of the (estimated) least favorable direction, provides an exact invariance under a shift of nonparametric components in this direction. It provides additional robustness against posterior inaccuracy in the “most difficult direction,” that is, the one inducing the largest bias in the ATE. We also note that Lemma 3.1 extends the result in Section 2.1 of Ray and van der Vaart ( 2020 ) for the missing data problem, which is equivalent to observing only one arm (either the treatment or control arm), to the context of ATE estimation that involves both arms.

3.2 Assumptions for Inference

We now provide additional notation and assumptions. The posterior distribution plays an important role in the following analysis and is given by
where denotes the conditional density of , obtained by dividing ( 2.1 ) by the marginal density of . We write for the marginal posterior distribution of . We focus on the case that has a prior that is independent of the prior for . Because the likelihood function ( 2.1 ) factorizes into separately, the posterior of is also independent of the posterior for . Due to the fact that does not depend on , it is unnecessary to further discuss a prior or posterior distribution on .

We first introduce high-level assumptions and discuss primitive conditions for them in the next section. Below, we consider measurable sets of functions such that . We also write when we index the conditional mean function by its subscript η. We introduce the notation for all , as well as the supremum norm . For two sequences a_n and b_n of positive numbers, we write a_n ≲ b_n if a_n/b_n remains bounded, and a_n ≍ b_n if both a_n ≲ b_n and b_n ≲ a_n.

Assumption 2. (Rates of Convergence)The estimators and , which are based on an auxiliary sample independent of , satisfy and for :

where and . Further, .

We adopt the standard empirical process notation as follows. For a function h of a random vector that follows distribution , we let , , and . Below, we make use of the notation and .

Assumption 3. (Complexity)For it holds and

(3.5)

Recall the propensity score-adjusted prior on m given by where . The restriction on λ is made through its hyperparameter .

Assumption 4. (Prior Stability)For , is a continuous stochastic process independent of the normal random variable , where , and that satisfies: (i) , for some deterministic sequence and (ii) for any .

Discussion of Assumptions

Assumption 2 imposes sufficiently fast convergence rates for the pilot estimators of the conditional mean function and the propensity score . When considering frequentist pilot estimators, these rate conditions can be justified by adopting the recent proposals of Chernozhukov, Newey, and Singh ( 2022a , b ). One can also use Bayesian point estimators, such as the posterior mean of the Gaussian process, for and . The posterior convergence rate for the conditional mean can be derived in the same spirit as Ray and van der Vaart ( 2020 ). The rate conditions in Assumption 2 also resemble conditions (i) and (ii) of Theorem 1 of Farrell ( 2015 ) in the context of frequentist estimation. Remark 4.1 illustrates that, under classical smoothness assumptions, this assumption is less restrictive than the corresponding conditions in Ray and van der Vaart ( 2020 ) or in other approaches to semiparametric estimation of ATEs, such as Chen, Hong, and Tarozzi ( 2008 ) or Farrell, Liang, and Misra ( 2021 ). Assumption 4 incorporates Conditions (3.9) and (3.10) from Theorem 2 in Ray and van der Vaart ( 2020 ) and is imposed to verify the invariance property of the adjusted prior distribution. These restrictions are mild and extend beyond the Gaussian processes considered in Section 4 for concreteness.

Assumption 3 restricts the functional class to form a -Glivenko–Cantelli class; see Section 2.4 of van der Vaart and Wellner ( 1996 ). It also imposes a new stochastic equicontinuity condition, as ( 3.5 ) imposes a product structure involving and , which further relaxes the corresponding condition from Ray and van der Vaart ( 2020 ), namely, . In the next section, we demonstrate that our formulation allows for double robustness under Hölder classes (see Remark 4.1 ). Hence, the complexity of the functional class can be compensated by sufficient regularity of the corresponding Riesz representer and vice versa. A condition similar to our Assumption 3 is also used in the frequentist literature; see Section 2 of Benkeser, Carone, van der Laan, and Gilbert ( 2017 ). Nonetheless, our technical argument differs substantially from these frequentist studies, because we mainly need condition ( 3.5 ) to control changes in the likelihood under perturbations along the estimated and true least favorable directions. This is unique to Bayesian analysis with nonparametric priors.

3.3 A Double Robust Bernstein–von Mises Theorem

We now present a new Bernstein–von Mises theorem, which establishes the asymptotic normality of the posterior distribution, modulo a “bias term.” In the next step, we show that the posterior correction proposed in our procedure eliminates this “bias term.” The asymptotic equivalence result is established using the bounded Lipschitz distance. For two probability measures P, Q defined on a metric space , we define the bounded Lipschitz distance as
(3.6) d_BL(P, Q) = sup_f | ∫ f dP − ∫ f dQ |,
where the supremum runs over all functions f satisfying
sup_x |f(x)| + sup_{x ≠ y} |f(x) − f(y)| / ‖x − y‖ ≤ 1.
Here, ‖·‖ denotes the vector norm.

Below is our main statement about the asymptotic behavior of the posterior distribution of . As is typical in the modern Bayesian paradigm, the exact posterior is rarely available in closed form, and one needs to rely on Monte Carlo methods, such as the implementation procedure in Section 2.2 , to approximate this posterior distribution, as well as the resulting point estimator and credible set.

Theorem 3.1. Let Assumptions 1–4 hold. Then we have

where .

We emphasize that the above BvM theorem is not feasible for applications, because it involves the “bias term” , which depends on the unknown conditional mean . Nonetheless, it provides an important theoretical benchmark. One can follow the existing literature on semiparametric BvM theorems and impose the so-called “no-bias” condition, but this generally leads to strong smoothness restrictions and may not be satisfied when the dimensionality of covariates is large relative to the smoothness of the underlying functions; see the discussion on page 395 of van der Vaart ( 1998 ).

This “bias term” in our context consists of two key components: the first involves the unknown true functions, and the second depends on the posterior of . We consider pilot estimators for the unknown functional parameters in . The correction term , as introduced in ( 2.8 ), results in a feasible Bayesian procedure that satisfies the BvM theorem under double robustness, as demonstrated below.

Theorem 3.2. Let Assumptions 1–4 hold. Then we have

We now show how Theorem 3.2 provides a frequentist justification for Bayesian methods of constructing point estimators and confidence sets. Recall that represents the posterior mean. Introduce a Bayesian credible set for , which satisfies for a given nominal level . The next result shows that also forms a confidence interval, in the frequentist sense, for the ATE parameter, whose coverage probability under converges to .

Corollary 3.1. Let Assumptions 1–4 hold. Then under , we have

(3.7)
Also, for any we have .

To the best of our knowledge, this is the first BvM theorem that entails double robustness. We now discuss the distinction from Theorem 2 in Ray and van der Vaart ( 2020 ). Their work laid the theoretical foundation for Bayesian inference based on propensity score adjusted priors. Specifically, under this prior adjustment, they established a BvM result under weak regularity conditions on the propensity score function, referring to this property as single robustness. Our analysis differs from Ray and van der Vaart ( 2020 ) in two crucial ways. First, we improve on their Lemma 3 by showing that it is possible to verify the prior stability condition for propensity score-adjusted priors under the product structure in Assumption 3 , modulo the “bias term” . This separation is essential to identify the source of restrictive conditions, such as the Donsker property on , which is mainly used to eliminate . Second, our proposal introduces an explicit debiasing step, borrowing key insights from recent developments in the DML literature.

Remark 3.1. (Connection With Frequentist Robust Estimation)In our BvM theorem, we do not restrict the centering estimator , as long as it admits the linear representation in ( 3.1 ). A popular frequentist estimator for the ATE that achieves double robustness is

(3.8) τ̂_DR = (1/n) Σ_{i=1}^n [ m̂(1, X_i) − m̂(0, X_i) + γ̂(D_i, X_i){Y_i − m̂(D_i, X_i)} ],
based on frequentist-type pilot estimators of the conditional mean function and of the Riesz representer ; see Robins and Rotnitzky ( 1995 ) and more recently Chernozhukov, Newey, and Singh ( 2022a , b ). The double robust or double machine learning estimator ( 3.8 ) recenters the plug-in type functional by an explicit correction factor that depends on the Riesz representer. Our main result establishes the asymptotic equivalence of our estimator and ( 3.8 ). This not only offers frequentist validity to our Bayesian procedure but also provides a Bayesian interpretation for doubly robust frequentist methods.
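The estimator in ( 3.8 ) is straightforward to compute from pilot estimates of the conditional means and the propensity score. The following is a hedged sketch with simulated data and oracle nuisance functions (all names and the data generating process are our own illustrative choices); the last line illustrates double robustness, since the estimator remains consistent with a misspecified outcome model as long as the propensity score is correct:

```python
import numpy as np

def aipw_ate(y, d, m1, m0, pi, eps=0.01):
    """Double robust (AIPW-type) ATE estimate with an influence-function-based
    standard error, given pilot estimates m1 = m(1,x), m0 = m(0,x), pi = pi(x)."""
    pi = np.clip(pi, eps, 1.0 - eps)
    md = np.where(d == 1, m1, m0)
    gamma = d / pi - (1 - d) / (1 - pi)      # plug-in Riesz representer
    psi = m1 - m0 + gamma * (y - md)         # per-observation influence terms
    tau = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(len(y))
    return tau, se

# Simulated example with known truth tau = 2 and oracle nuisances.
rng = np.random.default_rng(1)
n = 20_000
x = rng.uniform(size=n)
pi = 0.3 + 0.4 * x
d = rng.binomial(1, pi)
y = 2.0 * d + x + rng.normal(scale=0.5, size=n)

tau_dr, se = aipw_ate(y, d, 2.0 + x, x, pi)                 # both nuisances correct
tau_ipw, _ = aipw_ate(y, d, np.zeros(n), np.zeros(n), pi)   # outcome model misspecified
```

Our BvM theorem implies that the corrected posterior is asymptotically equivalent to the sampling distribution of such an estimator.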

Remark 3.2. (Parametric Bayesian Methods)A couple of recent papers propose doubly robust Bayesian recipes for ATE inference under parametric model restrictions. Saarela, Belzile, and Stephens ( 2016 ) considered a Bayesian procedure based on an analog of the double robust frequentist estimator given in Equation ( 3.8 ), replacing the empirical measure with the Bayesian bootstrap measure. However, no formal BvM theorem was presented therein. Another recent paper by Yiu, Goudie, and Tom ( 2020 ) explored Bayesian exponentially tilted empirical likelihood with a set of moment constraints of a double robust type. They proved a BvM theorem for the posterior constructed from the resulting exponentially tilted empirical likelihood under parametric specifications. Luo, Graham, and McCoy ( 2023 ) provided Bayesian results for ATE estimation in a partial linear model, which implies homogeneous treatment effects. They also assign parametric priors to the propensity score. Their BvM theorem allows for misspecification only in a parametric nonlinear component of the outcome equation. It is not clear how to extend their analysis to incorporate flexible nonparametric modeling strategies.

4 Illustration Using Squared Exponential Process Priors

We illustrate the general methodology by placing a particular Gaussian process prior on the conditional mean functions. Gaussian process regression has been used extensively in the machine learning community and has started to gain popularity among economists; see Kasy ( 2018 ). We provide primitive conditions for the main results of the previous section. In addition, we provide details on the implementation using Gaussian process priors and discuss data-driven choices of the tuning parameters.

4.1 Asymptotic Results Under Primitive Conditions

Let be a generic centered and homogeneous Gaussian random field with covariance function of the form , for a given continuous function . We consider as a Borel measurable map into the space of continuous functions on , equipped with the supremum norm . The Gaussian process is completely determined by its covariance function. For example, the covariance function of the squared exponential process is given by , as its name suggests. In this section, we focus on the squared exponential process prior, which is one of the most commonly used priors in applications; see Rasmussen and Williams ( 2006 ) and Murphy ( 2023 ). We also consider a rescaled Gaussian process for some . Intuitively, can be thought of as a bandwidth parameter. For a large (or, equivalently, a small bandwidth), the prior sample path is obtained by shrinking the long sample path . Thus, it incorporates more randomness and becomes suitable as a prior model for less regular functions; see van der Vaart and van Zanten ( 2008 , 2009 ).
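The effect of the rescaling can be seen by drawing sample paths. A minimal sketch, assuming the rescaled squared exponential covariance exp(−a²‖x − x′‖²) described above; a larger rescaling parameter (smaller bandwidth) produces rougher paths:

```python
import numpy as np

def sample_rescaled_se_gp(xs, a, n_draws=1, jitter=1e-6, seed=0):
    """Draw sample paths of the rescaled squared exponential process
    W^a_x = W_{a x}, with covariance exp(-a**2 * (x - x')**2).
    jitter stabilizes the Cholesky factorization numerically."""
    diff = xs[:, None] - xs[None, :]
    K = np.exp(-(a * diff) ** 2) + jitter * np.eye(len(xs))
    L = np.linalg.cholesky(K)
    rng = np.random.default_rng(seed)
    return L @ rng.standard_normal((len(xs), n_draws))

xs = np.linspace(0.0, 1.0, 200)
smooth = sample_rescaled_se_gp(xs, a=2.0)    # small a: long, smooth path
rough = sample_rescaled_se_gp(xs, a=50.0)    # large a, small bandwidth: wiggly path
```

This illustrates why the rescaled process is a suitable prior model for less regular functions: its increments are far more variable.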

Below, denotes a Hölder space with the smoothness index . Specifically, we illustrate our theory with the case where for . Given such a Hölder-type smoothness condition, we choose
(4.1)
Under ( 4.1 ), a rescaled Gaussian process induces the posterior contraction rate for the conditional mean function to be ; see Section 11.5 of Ghosal and van der Vaart ( 2017 ). This particular choice of mimics the corresponding bandwidth choice in kernel smoothing methods. Other choices of will generally yield slower convergence rates. Nonetheless, as long as the propensity score is estimated at a sufficiently fast rate, our BvM theorem still holds. The next proposition illustrates our general theory when we adopt the rescaled squared exponential process prior for the conditional mean function. We use the superscript m for the prior process to signify this relationship.

Proposition 4.1. Let Assumption 1 hold, and suppose the estimator satisfies and for some . Suppose for and some with . Also, . Consider the propensity score-dependent prior on m given by , where is the rescaled squared exponential process for , with its rescaling parameter of the order in ( 4.1 ) and for some deterministic sequence , and . Then the corrected posterior distribution for the ATE satisfies the BvM theorem in Theorem 3.2 .

Remark 4.1. (Double Robust Hölder Smoothness)Proposition 4.1 requires , which represents a trade-off between the smoothness requirements for and . This encapsulates double robustness; that is, a lack of smoothness of the conditional mean function can be mitigated by exploiting the regularity of the propensity score and vice versa. Referring to the Hölder class , its complexity measured by the bracketing entropy of size ε is of order for . One can show that the key stochastic equicontinuity assumption in Ray and van der Vaart ( 2020 ), that is, their condition ( 3.5 ), is violated, by invoking the Sudakov lower bound in Han ( 2021 ), when , or equivalently, when . In contrast, our framework accommodates this non-Donsker regime as long as , which enables us to exploit the product structure and a fast convergence rate for estimating the propensity score. Our methodology is not restricted to the case where the propensity score belongs to a Hölder class per se. For instance, under a parametric restriction (such as in logistic regression) or an additive model with an unknown link function, the possible range of the posterior contraction rate for the conditional mean function can be substantially enlarged. In the case , the bias term becomes asymptotically negligible, that is, . This allows for smoothness robustness only with respect to the propensity score and is also known as single robustness. In this case, no posterior correction is required; see Ray and van der Vaart ( 2020 ).

4.2 Implementation Details

We provide details on the Gaussian process prior placed on and its posterior computation. Algorithm 1 sets the adjusted prior as . In our implementation, we choose the first component to be a zero-mean Gaussian process with the commonly used squared exponential covariance function; see Rasmussen and Williams ( 2006 , p. 83). That is, where the hyperparameter is the kernel variance and are rescaling parameters that reflect the relevance of the treatment and each covariate in predicting . They are selected by maximizing the marginal likelihood. Conditional on the data used to obtain the propensity score estimator , the prior for has zero mean and the covariance kernel , which includes an additional term based on the estimated Riesz representer . It is given by ; cf. related constructions in Ray and Szabó ( 2019 ) and Ray and van der Vaart ( 2020 ). The parameter , representing the standard deviation of λ, controls the weight of the prior adjustment relative to the standard Gaussian process. The choice , where , as specified in Algorithm 1 , satisfies the conditions and in Assumption 4 with probability approaching one. It is similar to the choice suggested by Ray and Szabó ( 2019 , page 6), where is proportional to . The factor normalizes the second (adjustment) term of to the same scale as the unadjusted covariance K. Supplemental Appendix H shows that the finite-sample performance of the double robust Bayesian approach remains stable across different choices of .
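A sketch of this adjusted covariance kernel, assuming (for illustration only) a two-dimensional input v = (d, x), a constant estimated propensity score of 0.4, and hypothetical values for the kernel variance, length scales, and adjustment weight `sigma`:

```python
import numpy as np

def sq_exp_kernel(V1, V2, var, scales):
    """Squared exponential covariance with per-coordinate rescaling."""
    diff = V1[:, None, :] - V2[None, :, :]
    return var * np.exp(-0.5 * np.sum((diff / scales) ** 2, axis=-1))

def adjusted_kernel(V1, V2, var, scales, r_hat, sigma):
    """Covariance of the propensity-score-adjusted prior: the unadjusted
    kernel plus a rank-one term in the estimated Riesz representer r_hat,
    weighted by sigma**2 (the squared standard deviation of lambda)."""
    return sq_exp_kernel(V1, V2, var, scales) + sigma ** 2 * np.outer(r_hat(V1), r_hat(V2))

# hypothetical Riesz representer d/pi - (1-d)/(1-pi) with pi(x) = 0.4
r_hat = lambda V: V[:, 0] / 0.4 - (1 - V[:, 0]) / 0.6
V = np.column_stack([np.tile([0.0, 1.0], 3), np.repeat([0.1, 0.5, 0.9], 2)])
K = adjusted_kernel(V, V, var=1.0, scales=np.array([1.0, 0.5]), r_hat=r_hat, sigma=0.3)
```

The resulting matrix remains symmetric and positive semidefinite, since the adjustment only adds a rank-one positive semidefinite term to the base kernel.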

Utilizing Gaussian process priors with zero mean and covariance function , and incorporating the available data, we generate posterior draws of the vector for . This can be achieved through the Laplace approximation method detailed in Supplemental Appendix G.
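Schematically, posterior draws of the conditional mean at the observed covariates translate into ATE draws as below, with the covariate distribution integrated out via Bayesian-bootstrap (Dirichlet) weights, in the spirit of the Dirichlet posterior process appearing in the paper's proofs; all numbers here are synthetic placeholders:

```python
import numpy as np

def ate_draws(m1_draws, m0_draws, rng):
    """Turn posterior draws of the conditional mean evaluated at (1, X_i)
    and (0, X_i) (arrays of shape [B, n]) into B draws of the ATE,
    averaging over covariates with Bayesian-bootstrap (Dirichlet) weights."""
    B, n = m1_draws.shape
    weights = rng.dirichlet(np.ones(n), size=B)     # one weight vector per draw
    return np.sum(weights * (m1_draws - m0_draws), axis=1)

rng = np.random.default_rng(1)
B, n = 2000, 300
m1 = 0.6 + 0.02 * rng.standard_normal((B, n))       # synthetic posterior draws
m0 = 0.4 + 0.02 * rng.standard_normal((B, n))
draws = ate_draws(m1, m0, rng)
lo, hi = np.quantile(draws, [0.025, 0.975])          # 95% credible interval
```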

For the implementation of the pilot estimator given in ( 2.6 ), we recommend logistic Lasso for the propensity scores, with the penalty parameter chosen by cross-validation, following Friedman, Hastie, and Tibshirani ( 2010 ). As a pilot estimator in Algorithm 1 for posterior correction, we use the uncorrected posterior mean , where is calculated following Step (a) of posterior computation in Algorithm 1 , but with a Gaussian process prior without adjustment, that is, . When the rescaling parameter is as stated in Proposition 4.1 , the convergence rate of is . This can be shown by combining Theorems 11.22, 11.55, and 8.8 from Ghosal and van der Vaart ( 2017 ).
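A minimal sketch of a logistic Lasso fit for the propensity score; this is a bare-bones proximal-gradient stand-in for the glmnet routine, with a fixed (hypothetical) penalty level in place of cross-validation:

```python
import numpy as np

def logistic_lasso(X, d, lam, steps=500, lr=0.1):
    """Minimal proximal-gradient (ISTA) logistic Lasso, a simplified
    stand-in for the glmnet routine of Friedman, Hastie, and Tibshirani
    (2010); lam is a fixed l1 penalty (cross-validation omitted)."""
    n, p = X.shape
    b0, b = 0.0, np.zeros(p)
    for _ in range(steps):
        pi = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))
        b0 -= lr * np.mean(pi - d)                  # intercept: plain gradient step
        b -= lr * (X.T @ (pi - d) / n)              # gradient step on coefficients
        b = np.sign(b) * np.maximum(np.abs(b) - lr * lam, 0.0)  # soft-thresholding
    return b0, b

# synthetic check: the propensity score depends on the first covariate only
rng = np.random.default_rng(2)
X = rng.standard_normal((2000, 5))
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + X[:, 0]))))
b0, b = logistic_lasso(X, d, lam=0.02)
pi_hat = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))        # estimated propensity scores
```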

5 Numerical Results

In this section, we apply our method to one version of the Lalonde–Dehejia–Wahba data that contains a treated sample of 185 men from the National Supported Work (NSW) experiment and a control sample of 2490 men from the Panel Study of Income Dynamics (PSID). The data have been used by LaLonde ( 1986 ), Dehejia and Wahba ( 1999 ), Abadie and Imbens ( 2011 ), and Armstrong and Kolesár ( 2021 ), among others. We refer readers to LaLonde ( 1986 ) and Dehejia and Wahba ( 1999 ) for reviews of the data.

5.1 Simulations

In this section, we consider a simulation study where the observations are randomly drawn from a large sample generated by applying the Wasserstein Generative Adversarial Networks (WGAN) method to the Lalonde–Dehejia–Wahba data; see Athey, Imbens, Metzger, and Munro ( 2024 ). We view their simulated data as the population and repeatedly draw our simulation samples (each consisting of 185 treated observations and 2490 control observations) for each of the 1000 Monte Carlo replications. We slightly depart from previous studies by focusing on a binary outcome Y: the employment indicator for the year 1978, which is defined as an indicator for positive earnings. The treatment D is the participation in the NSW program. We are interested in the average treatment effect of the NSW program on the employment status. For the set of covariates, we follow Abadie and Imbens ( 2011 ) and include nine variables: age, education, black, Hispanic, married, earnings in 1974, earnings in 1975, unemployed in 1974, and unemployed in 1975. We implement our double robust Bayesian method (DR Bayes) following Algorithm 1 , using posterior draws and the pilot estimator and , as detailed at the end of Section 4.2 . We compare DR Bayes to two other Bayesian procedures: First, we consider the prior adjusted Bayesian method (PA Bayes) proposed by Ray and van der Vaart ( 2020 ), which constructs the point estimate and credible interval based on in (2.8). Second, we examine an unadjusted Bayesian method (Bayes), which is also based on but generated using Gaussian process priors without adjustment.

We also compare our method to frequentist estimators. Match/Match BC corresponds to the nearest neighbor matching estimator and its bias-corrected version by Abadie and Imbens ( 2011 ), which adjusts for differences in covariate values through regression. DR TMLE corresponds to the doubly robust targeted maximum likelihood estimator by Benkeser et al. ( 2017 ). DML refers to the double/debiased machine learning estimator from Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey ( 2017 ), where the nuisance functions and are estimated using random forests (which outperformed DML combined with other nuisance function estimators, such as Lasso, in our simulation setup). Since the job-training data contains a sizable proportion of units with propensity score estimates very close to 0 and 1, we follow Crump, Hotz, Imbens, and Mitnik ( 2009 ) and discard observations with the estimated propensity score outside the range , with the trimming threshold .
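The trimming rule can be expressed in a few lines; the propensity score values below are hypothetical:

```python
import numpy as np

def trim_mask(pi_hat, t):
    """Boolean mask keeping observations whose estimated propensity score
    lies inside [t, 1 - t], following Crump, Hotz, Imbens, and Mitnik (2009)."""
    return (pi_hat >= t) & (pi_hat <= 1 - t)

pi_hat = np.array([0.002, 0.05, 0.31, 0.64, 0.97, 0.999])
keep = trim_mask(pi_hat, t=0.05)   # drops the units near 0 and 1
```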

Table I presents the finite sample performance of the Bayesian and frequentist methods mentioned above. We use the full data twice in computing the prior/posterior adjustments and the posterior distribution of the conditional mean function. Supplemental Appendix H reports the performance of DR Bayes using sample splitting, which results in similar coverage but a larger credible interval length due to the halved sample size.

TABLE I. Simulation Results Using WGAN-Generated Data: trimming is based on the estimated propensity score within [t, 1 − t], with the average sample size after trimming; CP = coverage probability of the 95% credible/confidence interval, CIL = average length of the 95% credible/confidence interval. The three column groups correspond to the three trimming thresholds considered.

| Methods | Bias | CP | CIL | Bias | CP | CIL | Bias | CP | CIL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bayes | −0.040 | 0.683 | 0.147 | −0.010 | 0.841 | 0.149 | −0.006 | 0.911 | 0.120 |
| PA Bayes | −0.008 | 0.981 | 0.260 | 0.033 | 0.949 | 0.254 | 0.047 | 0.897 | 0.308 |
| DR Bayes | −0.024 | 0.983 | 0.223 | 0.014 | 0.970 | 0.221 | 0.023 | 0.952 | 0.258 |
| Match | 0.027 | 0.933 | 0.334 | 0.048 | 0.908 | 0.323 | 0.033 | 0.965 | 0.323 |
| Match BC | 0.040 | 0.880 | 0.347 | 0.065 | 0.816 | 0.334 | 0.083 | 0.804 | 0.339 |
| DR TMLE | 0.015 | 0.832 | 0.300 | 0.039 | 0.746 | 0.282 | 0.039 | 0.668 | 0.242 |
| DML | 0.045 | 0.927 | 0.524 | 0.052 | 0.870 | 0.393 | 0.054 | 0.918 | 0.522 |

Concerning the Bayesian methods for estimating the ATE, Table I reveals that unadjusted Bayes yields highly inaccurate coverage except for the case with trimming constant . If the prior is corrected using the propensity score adjustment, the results improve significantly. Nevertheless, our DR Bayes method offers two further improvements. First, DR Bayes yields shorter average interval lengths in each case while simultaneously improving the coverage probability. This can be attributed to a reduction in bias and/or more accurate uncertainty quantification via our posterior correction. Second, when the trimming threshold is small (i.e., ), propensity score estimators can be less accurate, leading to reduced coverage probabilities for PA Bayes. Our double robust Bayesian method, on the other hand, still provides accurate coverage probabilities. In other words, DR Bayes exhibits more stable performance than PA Bayes with respect to the trimming threshold.

DR Bayes also performs encouragingly compared with frequentist methods. It provides more accurate coverage than bias-corrected matching, DR TMLE, and DML. Compared to the matching estimator without bias correction, which achieves similarly good coverage, DR Bayes yields considerably shorter credible intervals.

5.2 An Empirical Illustration

We apply the Bayesian and frequentist methods considered above to the Lalonde-Dehejia-Wahba data. Similar to the simulation exercise, we consider a varying choice of the trimming threshold . The ATE point estimates and confidence intervals are presented in Table II . As a benchmark, the experimental data that uses both treated and control groups in NSW () yields an ATE estimate (treated-control mean difference) of 0.111 with a 95% confidence interval .

TABLE II. Estimates of ATE for the Lalonde-Dehejia-Wahba data: trimming is based on the estimated propensity score within [t, 1 − t], with the sample size after trimming; ATE = point estimate, 95% CI = 95% credible/confidence interval, CIL = 95% credible/confidence interval length. The three column groups correspond to the three trimming thresholds considered.

| Methods | ATE | 95% CI | CIL | ATE | 95% CI | CIL | ATE | 95% CI | CIL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bayes | 0.213 | [0.120, 0.301] | 0.181 | 0.214 | [0.132, 0.292] | 0.161 | 0.198 | [0.140, 0.251] | 0.112 |
| PA Bayes | 0.158 | [0.019, 0.288] | 0.270 | 0.170 | [0.045, 0.281] | 0.236 | 0.090 | [−0.078, 0.233] | 0.311 |
| DR Bayes | 0.178 | [0.061, 0.293] | 0.231 | 0.184 | [0.064, 0.294] | 0.230 | 0.121 | [−0.031, 0.250] | 0.281 |
| Match | 0.188 | [0.022, 0.355] | 0.333 | 0.140 | [−0.029, 0.309] | 0.338 | 0.079 | [−0.111, 0.269] | 0.380 |
| Match BC | 0.157 | [−0.006, 0.321] | 0.327 | 0.145 | [−0.021, 0.310] | 0.331 | 0.180 | [−0.004, 0.365] | 0.369 |
| DR TMLE | −0.023 | [−0.171, 0.125] | 0.296 | 0.073 | [−0.074, 0.220] | 0.294 | 0.071 | [−0.146, 0.289] | 0.435 |
| DML | 0.172 | [0.018, 0.327] | 0.308 | 0.150 | [−0.010, 0.310] | 0.320 | 0.258 | [−0.183, 0.699] | 0.882 |

As we see from Table II , the unadjusted Bayesian method yields larger estimates. The adjusted Bayesian methods (PA and DR Bayes), on the other hand, produce estimates comparable to the experimental estimate. PA Bayes finds that the job training program increased employment by 9.0 to 17.0 percentage points across the different trimming thresholds, and DR Bayes estimates the effect at 12.1 to 18.4 percentage points. Among the frequentist estimators, the matching estimator and its bias-corrected version produce estimates similar to PA and DR Bayes, but with wider confidence intervals. DR TMLE produces a negative estimate for one trimming threshold, whereas all other estimates are positive. For and 0.05, DML yields point estimates similar to PA and DR Bayes, but with less precision. In the case , where the overlap condition is nearly violated, its point estimate and confidence interval length become considerably larger than those of the other methods.

6 Extensions

This section extends the binary variable Y to encompass general cases, including continuous, counting, and multinomial outcomes. First, we examine the class of single-parameter exponential families, where the conditional density function is solely determined by the nonparametric conditional mean function. This covers continuous outcomes and counting variables. Second, we consider the "vector" case of exponential families for multinomial outcomes. For both classes, we derive the novel correction to the Bayesian procedure and delegate more technical discussions to Supplemental Appendices D and F. Additionally, we outline extensions to other causal parameters of interest.

6.1 A Single-Parameter Exponential Family

In this part, we assume that the distribution of conditional on and belongs to the “single-parameter” exponential family, where the unknown parameter is the nonparametric conditional mean function . The conditional density function is given by
(6.1)
where , and the function links the mean to the “natural parameter” of the exponential family. We also restrict the sufficient statistic to be linear in y.
The family ( 6.1 ) not only encompasses the Bernoulli distribution (with , , and ), as considered in the previous sections, but also allows for counting and continuous outcomes. For instance, when , the Poisson distribution corresponds to the choices , , and , while the exponential distribution is represented by , , and . Furthermore, the normal distribution with for some , is captured by , , , and . We emphasize that model ( 6.1 ) does not impose functional form assumptions on the conditional mean function m. The joint density of can be written as
(6.2)
We consider the same reparametrization of as in ( 2.2 ) except that now the second component of η uses the general link function q satisfying . We now state the least favorable direction for the exponential family case, which serves as motivation for the prior adjustment.
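For completeness, the standard natural-parameter links for the distributions listed after ( 6.1 ) can be summarized as follows (these are standard exponential-family facts, stated here in the generic notation q(m)):

```latex
% natural parameter q(m) linking the conditional mean m to (6.1)
\begin{aligned}
\text{Bernoulli:}    \quad & q(m) = \log\frac{m}{1-m}, \\
\text{Poisson:}      \quad & q(m) = \log m, \\
\text{Exponential:}  \quad & q(m) = -\frac{1}{m}, \\
\text{Normal (known } \sigma^2\text{):} \quad & q(m) = \frac{m}{\sigma^2}.
\end{aligned}
```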

Lemma 6.1.Let Assumption 1 hold for with any η under consideration. Then, for the joint distribution ( 6.2 ) and the submodel defined by the path with as defined in ( 3.2 ), the least favorable direction for estimating the ATE parameter in ( 2.3 ) is

(6.3)
where the Riesz representer is given in ( 2.5 ).

For the outcome family with , which includes the Bernoulli, Poisson, and exponential distributions, the least favorable direction for ATE estimation coincides with the one given in Lemma 3.1 . To implement the double robust Bayesian procedure for general outcomes, one can still follow Algorithm 1 , with the logistic function Ψ replaced by the inverse link function . For the normal (homoscedastic) outcome, where the prior adjustment in Algorithm 1 becomes , the hyperparameter a can be determined together with the other parameters of the Gaussian process by optimizing the marginal likelihood as in Ray and Szabó ( 2019 ). Proposition F.1 in the Supplemental Material provides primitive conditions for the BvM theorem to hold under double robust smoothness conditions.

6.2 Multinomial Outcomes

We now assume that the dependent variable takes values in a finite set, specifically . The ATE can then be written as , where the choice probabilities are with the multinomial logit specification:
for . The multinomial logit specification implies . We now provide the least favorable direction in the presence of multinomial outcomes and discuss its consequences for the prior adjustment below.
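Under this specification, the choice probabilities follow from a softmax normalization with a base category; a small sketch (index values hypothetical, assuming category 0 is normalized):

```python
import numpy as np

def choice_probs(H):
    """Multinomial-logit choice probabilities from index functions
    h_1(v), ..., h_J(v) stacked column-wise in H (shape [n, J]);
    category 0 is the base category with its index normalized to zero."""
    Z = np.column_stack([np.zeros(len(H)), H])
    Z -= Z.max(axis=1, keepdims=True)            # numerical stability
    P = np.exp(Z)
    return P / P.sum(axis=1, keepdims=True)      # rows sum to one

H = np.array([[0.0, 1.0], [2.0, -1.0]])          # two observations, J = 2
P = choice_probs(H)
```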

Lemma 6.2.Consider the submodel defined by the path , , with as defined in ( 3.2 ). Let Assumption 1 hold for with any η under consideration, then the least favorable direction for estimating the ATE parameter is

where the Riesz representer is given in ( 2.5 ).

We emphasize that the least favorable direction calculation is not a trivial extension of Hahn ( 1998 ) or Ray and van der Vaart ( 2020 ). This is because there are J nonparametric components involved in the conditional probability function of the multinomial outcomes given covariates, and we need to consider the perturbation of those J components jointly. Nonetheless, we show that the efficient influence function takes the same generic form as derived in Hahn ( 1998 ). In the proof of Lemma 6.2 , we compute the derivative of the parameter mapping along the path considered herein. We derive inner products involving the least favorable direction for each nonparametric component consisting of the conditional choice probabilities. To our knowledge, the extension to the multinomial case has not been considered in the literature, and it offers a result of independent interest.
Lemma 6.2 motivates the following modification of our double robust Bayesian estimator based on the propensity score-dependent prior on for :
where is a continuous stochastic process independent for . We may then follow the implementation as described in Section 2.2 using .

6.3 Other Causal Parameters

We now extend our procedure to general linear functionals of the conditional mean function. We do so only for binary outcomes, as the modification for other types of outcomes follows as above. Recall that the observable data consists of i.i.d. observations of . The causal parameter of interest is , where the function ψ is linear with respect to the conditional mean function . We introduce the Riesz representer satisfying . Let and be pilot estimators for the conditional mean and the Riesz representer, respectively, computed over an auxiliary sample. Our double robust Bayesian procedure can be extended by considering the corrected posterior distribution for as follows: , , where here . The derivations of the least favorable directions in the following two examples are provided in Supplemental Appendix E.
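Schematically, the correction step recenters each posterior draw by a sample average built from the pilot estimates. The sketch below illustrates the form of such a recentering; it is a hedged simplification, not the paper's exact formula (which involves quantities omitted here), and all inputs are hypothetical:

```python
import numpy as np

def correct_draws(tau_draws, y, m_check, r_check):
    """Recenter posterior draws of a linear functional by the sample mean
    of the estimated Riesz representer times the outcome residual; m_check
    and r_check are pilot estimates evaluated at the observations."""
    return tau_draws + np.mean(r_check * (y - m_check))

tau_draws = np.array([0.10, 0.20])          # hypothetical posterior draws
y = np.array([1.0, 0.0])
m_check = np.array([0.5, 0.5])              # pilot conditional-mean fit
r_check = np.array([2.0, -2.0])             # pilot Riesz representer values
corrected = correct_draws(tau_draws, y, m_check, r_check)
```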

Example 6.1. (Average Policy Effects)The policy effect from changing the distribution of X is , where the known distribution functions and have their supports contained in the support of the marginal covariate distribution . Following the general setup, with its Riesz representer , where and stand for the density function of and , respectively.

Example 6.2. (Average Derivative)For a continuous scalar (treatment) variable D, the average derivative is given by , where denotes the partial derivatives of m with respect to the continuous treatment D. Thus, we have with its Riesz representer given by , where here denotes the conditional density function of D given X.

  • 1 Strictly speaking, the main objective in Ray and van der Vaart ( 2020 ) concerns the mean response in a missing data model, which is equivalent to observing one arm (either the treatment or control) of the causal setup.
  • 2 If does not have a density, we can simply consider the conditional density of given instead of the joint density of .
  • 3 Another popular method in the statistics literature is the targeted learning approach ( van der Laan and Rose ( 2011 ), Benkeser et al. ( 2017 )).
  • 4 The data is available on Dehejia's website: http://users.nber.org/~rdehejia/nswdata2.html.
  • 5 Crump et al. ( 2009 ) suggested a simple rule of thumb with a threshold of , while Athey et al. ( 2024 ) used . Applying the optimal trimming rule proposed by Crump et al. ( 2009 ) to our simulated samples yields an average optimal trimming threshold 0.073.
  • 6 In additional simulations without trimming (), we find that all double robust methods, including DR Bayes, substantially undercover and/or inflate the length of their confidence intervals. This is consistent with Crump et al. ( 2009 ), who point out that propensity score estimates close to the boundaries tend to induce substantial bias and large variances in estimating the ATE. We also note that unadjusted Bayes severely undercovers in this case.
  • 7 Applying the optimal trimming rule proposed by Crump et al. ( 2009 ) yields an optimal threshold of 0.064.
  • Appendix A: Proofs of Main Results

Throughout this Appendix, denotes a generic constant whose value may change from line to line. We introduce additional subscripts when there are multiple constant terms in the same display. In the following, we denote the log-likelihood based on as
    where each term is the logarithm of the factors involving only π or m or f. Recall the definition of the measurable sets of functions such that . We introduce the conditional prior . The following posterior Laplace transform of given by
    (A.1)
plays a crucial role in establishing the BvM theorem ( Castillo ( 2012 ), Castillo and Rousseau ( 2015 ), Ray and van der Vaart ( 2020 )). Slightly abusing notation, we define a perturbation of along the least favorable direction, restricted to the components corresponding to π and m:
    (A.2)
    We explicitly write the perturbation of by . Recall that coincides with the Riesz representer by Lemma 3.1 . In addition, we introduce the following notation:
    (A.3)
    Also, recall the notation , which is used in the following. In the proofs below, we make use of Lemmas C.1–C.9 which can be found in Supplemental Appendix C.

    Proof of Theorem 3.1.Since the estimated least favorable direction is based on observations that are independent of , we may apply Lemma 2 of Ray and van der Vaart ( 2020 ). It suffices to handle the posterior distribution with set equal to a deterministic function . By Lemma 1 of Castillo and Rousseau ( 2015 ), it is sufficient to show that the Laplace transform given in ( A.1 ) satisfies

    (A.4)
    for every t in a neighborhood of 0, where the limit at the right-hand side of ( A.4 ) is the Laplace transform of the distribution. Note that we can write . Further, let , which satisfies ( 3.1 ).

    The Laplace transform can thus be written as

    The expansion in Lemma B.1 gives the following identity for all t in a sufficiently small neighborhood around zero and uniformly for :
    where we make use of the notation and the score operator defined through ( 3.3 ).

    Next, we plug this into the exponential part in the definition of , which then gives

    By Fubini's theorem, the double integral of the previous expression coincides with
    By the assumed -Glivenko–Cantelli property for in Assumption 3 , that is, , and the boundedness of , we apply Lemma C.4, which establishes the convergence of the Laplace transform for the Dirichlet posterior process. Specifically, it implies the convergence in probability of to uniformly over , using the notation and . Further, we may apply the convergence of imposed in Assumption 2 , so that the above display becomes
    We now analyze the empirical process term in the integral and examine its relationship with the bias term . To do so, we calculate
    where the last line follows from the definition of the bias term, that is, .

    Further, observe that and by the definition of the efficient influence function given in ( 2.4 ). As we insert these in the previous expression for , we obtain for all t in a sufficiently small neighborhood around zero and uniformly for :

where the last equality follows from the prior invariance condition established in Lemma B.2 . This implies ( A.4 ) using that by Lemma 3.1 . Q.E.D.

    Proof of Theorem 3.2.It is sufficient to show that , where and . We make use of the decomposition

    (A.5)
    Consider the first summand on the right-hand side of the previous equation. We have uniformly for :
    where the last equation follows from the following derivation:
    using the Cauchy–Schwarz inequality, Assumption 2 , and Assumption 3 . Consider the second summand on the right-hand side of ( A.5 ). From Lemma C.8, we infer
    Consequently, decomposition ( A.5 ) together with the asymptotic expansion of each summand yields
    where the last equation is due to the equation (C.6) in Supplemental Appendix C. Q.E.D.

    Proof of Corollary 3.1.The weak convergence of the Bayesian point estimator directly follows from our asymptotic characterization of the posterior and the argmax theorem; see the proof of Theorem 10.8 in van der Vaart ( 1998 ). The corrected Bayesian credible set satisfies for any . In particular, we have

    Now the definition of the estimator given in ( 3.1 ) yields . For any set A, we write . Theorem 3.1 implies
    We may thus write for some set satisfying . Therefore, the frequentist coverage of the Bayesian credible set is
    noting that is asymptotically normal with mean zero and variance under . Q.E.D.

    Proof of Proposition 4.1.Note that is based on an auxiliary sample, and hence, we can treat below as a deterministic function denoted by satisfying the rate restrictions and . Regarding the conditional mean functions, we consider the set , where for and some constant :

    (A.6)
    where in the first restriction for the Gaussian process is a regularity class of functions defined in the equation (C.7) in Supplemental Appendix C. We write .

We first verify Assumption 2 with . The posterior contraction rate is shown in our Lemma C.3. Regarding the product rate condition, that is, for : this is satisfied if , which can be rewritten as .

    We now verify Assumption 3 . It is sufficient to deal with the resulting empirical process . Note that the Cauchy–Schwarz inequality implies

    Consequently, from Lemma C.5 we infer
Note that if , from Lemma C.9 we infer . Thus it remains to consider the case . By the entropy bound presented in the proof of Lemma C.3, we have , modulo a logn term on the right-hand side of the bound. Because is monotone and Lipschitz, a set of ε-covers in for translates into a set of ε-covers for . In this case, the empirical process bound of Han ( 2021 , p. 2644) yields
where represents a term that diverges at a certain polynomial order of logn. Consequently, we obtain
    which is satisfied under the smoothness restriction or equivalently . This condition automatically holds given .

    Finally, it remains to verify Assumption 4 . By the univariate Gaussian tail bound, the prior mass of the set satisfies . Also, the Kullback–Leibler neighborhood around has prior probability at least . We may thus apply Lemma 4 of Ray and van der Vaart ( 2020 ), which yields , as imposed in Assumption 4 (i).

    Regarding Assumption 4 (ii), we need to show the posterior probability of the shifted version of is tending to one. Considering itself, the first set in the intersection of ( A.6 ) that defines is seen to have posterior probability tending to one by the result in (II) of Lemma C.3, combined with the univariate Gaussian tail probability bound

The second set in the intersection of ( A.6 ) has posterior probability tending to one by Lemma 17 of Ray and van der Vaart ( 2020 ). Hence, has posterior probability going to one. Next, we consider , for any . Slightly abusing notation, we write for in the sequel. By the Lipschitz continuity of the logistic link function, we have for . Therefore, we get with probability approaching one, where
    with and denotes the norm of the Reproducing Kernel Hilbert Space associated with the squared exponential process; see Supplemental Appendix C for a formal definition. Because and , the posterior probability of tends to one following similar arguments concerning the set , after replacing with a multiple of itself for . Hence, the posterior probability of is seen to tend to one, which completes the proof. Q.E.D.

    Appendix B: Key Lemmas

    We now present key lemmas used in the derivation of our BvM theorem. We introduce where
    (B.1)
    This defines a path from to . We also write , for , so that ; cf. the proof of Theorem 1 in Ray and van der Vaart ( 2020 ).

    Lemma B.1.Let Assumptions 1 and 2 hold. Then we have uniformly for :

    Proof.We start with the following decomposition:

    From the calculation in the proof of Lemma C.1, we have . Then we infer for the stochastic equicontinuity term that
    uniformly in . We can thus write uniformly in :
The rest of the proof involves a standard Taylor expansion for the third term on the right-hand side of the above equation. By the equation (C.4) in the proof of Lemma C.1, we get
    by the fact that and the definition of the Riesz representer in ( 2.5 ). Regarding the second-order term in the Taylor expansion in the equation (C.5) of the proof of Lemma C.1, we get
    Considering the score operator defined in ( 3.3 ), we have
    Consequently, by the unconfoundedness imposed in Assumption 1 (i) and the binary nature of Y, we have . We thus obtain
Then, employing Assumption 1 (ii), that is, for all x, yields uniformly for :
    where the last equation is due to the posterior contraction rate of the conditional mean function imposed in Assumption 2 . Consequently, we obtain, uniformly for ,
    which leads to the desired result. Q.E.D.

    The next lemma verifies the prior stability condition under our double robust smoothness conditions.

    Lemma B.2.Let Assumptions 14 hold. Then we have

    (B.2)
    for a sequence of measurable sets such that .

    Proof.Since is based on an auxiliary sample, it is sufficient to consider deterministic functions with the same rates of convergence as . Denote the corresponding propensity score by . By Assumption 4 , we have and

    (B.3)
    where denotes the probability density function of a random variable and the set is defined by where imposed in Assumption 4 and .

    Considering the log likelihood ratio of two normal densities together with the constraint , it is shown on page 3015 of Ray and van der Vaart ( 2020 ) that

    We show at the end of the proof that , uniformly for . Consequently, the numerator of this leading term in ( B.3 ) becomes
By a change of variables in the numerator and using the notation , the prior invariance property becomes
    The desired result would follow from and . The first convergence directly follows from Assumption 4 . The set is the intersection of these two conditions in Assumption 4 , except that the restriction on λ in is instead of . By construction, we have , so that .

    We complete the proof by establishing the following result:

    (B.4)
    We denote and . Consider the following decomposition of the log-likelihood:
    Next, we apply third-order Taylor expansions in Lemma C.1 separately to the two terms in the brackets of the above display making use of the notation :
    for some intermediate points ; cf. the equation ( B.1 ). Combining the previous calculation yields
    In order to control , we evaluate
    Note that the first term is centered, so it becomes . We apply Lemma C.2 to conclude that it is of smaller order. The middle term is negligible by our Assumption 3 . Referring to the last term, the Cauchy–Schwarz inequality yields
    where the last equality is due to Assumption 2 . We thus obtain uniformly in . Consider . We note that uniformly in . Hence, we obtain
    as in -norm by Assumption 2 . Thus, uniformly in . Finally, we control by evaluating uniformly in , which shows ( B.4 ). Q.E.D.

  • The replication package for this paper is available at https://doi.org/10.5281/zenodo.14015435. The Journal checked the data and codes included in the package for their ability to reproduce the results in the paper and approved online appendices.