Volume 93, Issue 2 pp. 539-568
Original Articles
Open Access

Double Robust Bayesian Inference on Average Treatment Effects

Christoph Breunig (Corresponding Author), Department of Economics, University of Bonn

Ruixuan Liu, CUHK Business School, Chinese University of Hong Kong

Zhengfei Yu, Faculty of Humanities and Social Sciences, University of Tsukuba
First published: 29 March 2025
We thank the anonymous reviewers, as well as Xiaohong Chen, Yanqin Fan, Essie Maasoumi, Yichong Zhang, and numerous seminar and conference participants for helpful comments and illuminating discussions. Breunig gratefully acknowledges the support of the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC-2047/1 – 390685813. Yu gratefully acknowledges the support of JSPS KAKENHI Grant Number 21K01419. Funding Statement: Open Access funding enabled and organized by Projekt DEAL. WOA Institution: Rheinische Friedrich-Wilhelms-Universität Bonn Consortia Name: Projekt DEAL.

Abstract

We propose a double robust Bayesian inference procedure on the average treatment effect (ATE) under unconfoundedness. For our new Bayesian approach, we first adjust the prior distributions of the conditional mean functions, and then correct the posterior distribution of the resulting ATE. Both adjustments make use of pilot estimators motivated by the semiparametric influence function for ATE estimation. We prove asymptotic equivalence of our Bayesian procedure and efficient frequentist ATE estimators by establishing a new semiparametric Bernstein–von Mises theorem under double robustness; that is, the lack of smoothness of conditional mean functions can be compensated by high regularity of the propensity score and vice versa. Consequently, the resulting Bayesian credible sets form confidence intervals with asymptotically exact coverage probability. In simulations, our method provides precise point estimates of the ATE through the posterior mean and delivers credible intervals that closely align with the nominal coverage probability. Furthermore, our approach achieves a shorter interval length in comparison to existing methods. We illustrate our method in an application to the National Supported Work Demonstration following LaLonde (1986) and Dehejia and Wahba (1999).

1 Introduction

This paper proposes a double robust Bayesian approach for estimating the average treatment effect (ATE) under unconfoundedness, given a set of pretreatment covariates. Our new Bayesian procedure involves both prior and posterior adjustments. First, following Ray and van der Vaart ( 2020 ), we adjust the prior distributions of the conditional mean function using an estimator of the propensity score. Second, we use this propensity score estimator together with a pilot estimator of the conditional mean to correct the posterior distribution of the ATE. The adjustments in both steps are closely related to the functional form of the semiparametric influence function for ATE estimation under unconfoundedness. They not only shift the center but also change the shape of the posterior distribution. For our robust Bayesian procedure, we derive a new Bernstein–von Mises (BvM) theorem, which states that this posterior distribution, when centered at any efficient estimator, is asymptotically normal with the efficient variance in the semiparametric sense. The key innovation of our paper is that this result holds under double robust smoothness assumptions within the Bayesian framework.

Despite the recent success of Bayesian methods, the literature on ATE estimation remains predominantly frequentist. For the missing data problem specifically, it has been shown that conventional Bayesian approaches (i.e., those using uncorrected priors) can produce inconsistent estimates unless unnecessarily strong smoothness conditions are imposed on the underlying functions; see the results and discussion in Robins and Ritov ( 1997 ) or Ritov, Bickel, Gamst, and Kleijn ( 2014 ). By adjusting the prior distribution using a preestimated propensity score, Ray and van der Vaart ( 2020 ) recently established a novel semiparametric BvM theorem under weaker smoothness requirements for the propensity score function. However, a minimum differentiability of order p/2 is still required for the conditional mean function in the outcome equation, where p denotes the dimensionality of covariates. In this paper, we are interested in Bayesian inference under double robustness, which allows for a trade-off between the required levels of smoothness in the propensity score and the conditional mean functions.

Under double robust smoothness conditions, we show that Bayesian methods, which use propensity score adjusted priors as in Ray and van der Vaart ( 2020 ), satisfy the BvM theorem only up to a “bias term” depending on the unknown true conditional mean and propensity score functions. In this paper, our robust Bayesian approach accounts for this bias term in the BvM theorem by considering an explicit posterior correction. Both the prior adjustment and the posterior correction are based on functional forms that are closely related to the efficient influence function for the ATE in Hahn ( 1998 ). We show that the corrected posterior satisfies the BvM theorem under double robust smoothness assumptions. Our novel procedure combines the advantages of Bayesian methodology with the robustness features that are the strengths of frequentist procedures. Our credible intervals are Bayesianly justifiable in the sense of Rubin ( 1984 ), as the uncertainty quantification is conducted conditionally on the observed data, and they can also be interpreted as frequentist confidence intervals with asymptotically exact coverage probability. Our procedure is inspired by insights from the double machine learning (DML) literature, as well as the bias-corrected matching approach of Abadie and Imbens ( 2011 ), since our robustification of an initial procedure removes some nonnegligible bias and remains asymptotically valid under weaker regularity conditions. While the main part of our theoretical analysis focuses on the ATE of binary outcomes, also considered by Ray and van der Vaart ( 2020 ), we outline extensions of our methodology to continuous and multinomial cases, as well as to other causal parameters.

In both simulations and an empirical illustration using the National Supported Work Demonstration data, we provide evidence that our procedure performs well compared to existing Bayesian and frequentist approaches. In our Monte Carlo simulations, we find that our method results in improved empirical coverage probabilities, while maintaining very competitive lengths for confidence intervals. This finite sample advantage is also observed over Bayesian methods that rely solely on prior corrections. In particular, we note that our approach leads to more accurate uncertainty quantification and is less sensitive to estimated propensity scores being close to boundary values.

The BvM theorem for parametric Bayesian models is well established; see, for instance, van der Vaart ( 1998 ). Its semiparametric version is still being studied very actively when nonparametric priors are used ( Castillo ( 2012 ), Castillo and Rousseau ( 2015 ), Ray and van der Vaart ( 2020 )). To the best of our knowledge, our new semiparametric BvM theorem is the first one that possesses the double robustness property. Our paper is also connected to another active research area concerning Bayesian inference for parameters in econometric models, which is robust to partial or weak identification ( Chen, Christensen, and Tamer ( 2018 ), Giacomini and Kitagawa ( 2021 ), Andrews and Mikusheva ( 2022 )). The framework and the approach we take are different. Nonetheless, they share the same scope of tailoring the Bayesian inference procedure to new challenges in contemporary econometrics.

2 Setup and Implementation

This section provides the main setup of the average treatment effect (ATE). We motivate the new Bayesian methodology and detail the practical implementation.

2.1 Setup

We consider a family of probability distributions for some parameter space , where the (possibly infinite dimensional) parameter η characterizes the probability model. Let be the true value of the parameter and denote , which corresponds to the frequentist distribution generating the observed data.

For individual i, consider a treatment indicator . The observed outcome is determined by where are the potential outcomes of individual i associated with or 0. We now focus on the binary outcome case where both and take values in . An extension to multinomial or continuous outcomes is provided in Section 6 . The covariates for individual i are denoted by , a vector of dimension p, with the distribution and the density . Let denote the propensity score and the conditional mean. Suppose that the researcher observes independent and identically distributed (i.i.d.) observations of for . The joint density of is given by where
(2.1)
The parameter of interest is the ATE given by , where denotes the expectation under . For its identification, we impose the following standard assumption of unconfoundedness and overlap ( Rosenbaum and Rubin ( 1984 ), Imbens ( 2004 ), Imbens and Rubin ( 2015 )).

Assumption 1.(i) and (ii) there exists such that for all x in the support of .

We introduce additional notation from the Bayesian perspective, following a similar setup in Ray and van der Vaart ( 2020 ). For the purpose of assigning prior distributions to in the Bayesian procedure, it is convenient to transform them by a link function. We make use of the logistic link function here. Specifically, we consider the reparametrization of given by . We index the probability model as , in line with the notation introduced in the first paragraph of this section, where
(2.2)
Below, we write , , and to make the dependence on η explicit. Given any prior on the triplet , Bayesian inference on the ATE is achieved by deriving the posterior distribution of
(2.3) τ_η = E_η[ m_η(1, X) − m_η(0, X) ],
where denotes the expectation under . Our aim is to examine the large-sample behavior of the posterior of under the true probability distribution . In the same vein, the true parameter of interest becomes .
The construction of our double robust Bayesian procedure in Section 2.2 has a fundamental connection to the efficient influence function. For any parameter η, the efficient influence function ( Hahn ( 1998 ), Hirano, Imbens, and Ridder ( 2003 )) is
(2.4) ψ̃_η(z) = m_η(1, x) − m_η(0, x) − τ_η + γ_η(d, x){y − m_η(d, x)},
for the Riesz representer γ_η, which is given by
(2.5) γ_η(d, x) = d/π_η(x) − (1 − d)/(1 − π_η(x)).
We write m_0 = m_{η_0} and γ_0 = γ_{η_0}. Both the prior adjustment and posterior correction of our approach require a pilot estimator for γ_0. Under Assumption 1 , the true Riesz representer γ_0 is well-defined.
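As a concrete numerical aid, the Riesz representer for the ATE functional, γ(d, x) = d/π(x) − (1 − d)/(1 − π(x)), can be evaluated directly once propensity scores are available. The following is a minimal sketch; the function name and the trimming level `eps` are our own illustrative choices, with trimming reflecting the overlap condition in Assumption 1:

```python
import numpy as np

def riesz_representer(d, pi, eps=0.01):
    """ATE Riesz representer gamma(d, x) = d/pi(x) - (1 - d)/(1 - pi(x)).

    d   : binary treatment indicators (0/1)
    pi  : propensity scores pi(x) in (0, 1)
    eps : trimming level guarding against boundary values (overlap)
    """
    pi = np.clip(pi, eps, 1.0 - eps)
    return d / pi - (1 - d) / (1 - pi)

# A treated unit with pi = 0.25 receives weight 1/0.25 = 4,
# a control unit with the same pi receives weight -1/0.75 = -4/3.
gamma = riesz_representer(np.array([1, 0]), np.array([0.25, 0.25]))
```

The trimming step is purely a numerical safeguard; under Assumption 1 the true propensity score is bounded away from 0 and 1.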

2.2 Double Robust Bayesian Point Estimators and Credible Sets

We build upon the ATE expression in ( 2.3 ) to develop our doubly robust inference procedure. Our approach is based on nonparametric prior processes for and . For the latter, we consider the Dirichlet process, which is a default prior on spaces of probability measures. This choice is also convenient for posterior computation via the Bayesian bootstrap; see Remark 2.1 . For the former, we make use of Gaussian process priors, along with an adjustment that involves a preliminary estimator of . Gaussian process priors are also closely related to spline smoothing, as discussed in Wahba ( 1990 ). Their posterior contraction properties (see Ghosal and van der Vaart ( 2017 )), together with excellent finite sample behavior (see Rasmussen and Williams ( 2006 )), make Gaussian process priors popular in the related literature. Since does not depend on , the specification of a prior on the propensity score is not required.

We consider pilot estimators π̂ of the propensity score and m̂ of the conditional mean function, both of which are based on an auxiliary sample. We consider the plug-in estimator of the Riesz representer given by
(2.6) γ̂(d, x) = d/π̂(x) − (1 − d)/(1 − π̂(x)).
Below, let denote the sample average of the absolute value of , which we use for scale normalization in our prior adjustment (see Section 4.2 for details). The use of auxiliary data for the pilot estimators simplifies the technical analysis related to the propensity score adjusted priors; see Ray and van der Vaart ( 2020 ). It also provides an effective way to control certain negligible higher-order terms; see our Lemma C.2 in the Supplemental Material ( Breunig, Liu, and Yu ( 2025 )) and the related discussion of sample splitting in DML-type methods on page C6 of Chernozhukov et al. ( 2018 ). In practice, we use the full data twice and do not split the sample, as we have not observed any overfitting or loss of coverage thereby. Algorithm 1 describes our double robust Bayesian inference procedure.

Algorithm 1: Double Robust Bayesian Procedure.

Given the draws from the corrected posterior calculated in Algorithm 1 , we obtain the point estimate and credible set as follows. The Bayesian point estimator is . The credible set for the ATE parameter is given by
where denotes the ath quantile of .
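Computing the posterior-mean point estimator and the equal-tailed credible set from the draws of the corrected posterior reduces to averaging and taking empirical quantiles. A minimal sketch, with stand-in normal draws in place of the draws produced by Algorithm 1:

```python
import numpy as np

def point_and_credible(tau_draws, alpha=0.05):
    """Posterior mean and equal-tailed (1 - alpha) credible interval
    computed from posterior draws of the ATE."""
    tau_hat = tau_draws.mean()
    lo, hi = np.quantile(tau_draws, [alpha / 2, 1 - alpha / 2])
    return tau_hat, (lo, hi)

# Stand-in draws; in the actual procedure these come from Algorithm 1.
rng = np.random.default_rng(0)
draws = rng.normal(loc=1.8, scale=0.2, size=10_000)
tau_hat, ci = point_and_credible(draws)
```

By Corollary 3.1, an interval of this form also has asymptotically exact frequentist coverage.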

For the implementation of our pilot estimator given in ( 2.6 ), we recommend using propensity scores estimated by the logistic Lasso. For the implementation of the pilot estimator , we adopt the posterior mean of generated from a Gaussian process prior without adjustment, as in Ghosal and Roy ( 2006 ). Section 4.2 provides more implementation details. To approximate the posterior distribution, we make use of the Laplace approximation, but one can also resort to Markov chain Monte Carlo (MCMC) algorithms. The parameter controls the weight placed on the prior adjustment relative to the standard unadjusted prior on (e.g., a Gaussian prior with a squared exponential covariance function). Regarding the tuning parameter , we emphasize that our finite sample results are not sensitive to its choice, as shown in Supplemental Appendix H.

Remark 2.1. (Bayesian Bootstrap)Under unconfoundedness and the reparametrization in ( 2.2 ), the ATE can be written as . With independent priors on and , their posteriors also become independent. It is thus sufficient to consider the posterior for and separately. We place a Dirichlet process prior for with the base measure set to zero. Consequently, the posterior law of coincides with the Bayesian bootstrap introduced by Rubin ( 1981 ); also see Chamberlain and Imbens ( 2003 ). One key advantage of the Bayesian bootstrap is that it allows us to incorporate a broad class of data generating processes, whose posterior can be easily sampled. Replacing by the standard empirical cumulative distribution function does not provide sufficient randomization of , as it yields an underestimation of the asymptotic variance; see Ray and van der Vaart ( 2020 , p. 3008). In principle, one could consider other types of bootstrap weights; however, these generally do not correspond to the posterior of any prior distribution.
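The Bayesian bootstrap draws described in this remark amount to Dirichlet(1, …, 1) weights on the observations (Rubin ( 1981 )). A minimal sketch, with stand-in values for the conditional mean functions (in the actual procedure these come from draws of the adjusted Gaussian process posterior):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 500, 2000
x = rng.uniform(size=n)
# Stand-in conditional means m(1, x) and m(0, x); purely illustrative.
m1, m0 = 1.0 + x, x ** 2

tau_draws = np.empty(B)
for b in range(B):
    w = rng.dirichlet(np.ones(n))   # Bayesian bootstrap weights (Rubin, 1981)
    tau_draws[b] = w @ (m1 - m0)    # draw of the integral of m1 - m0 over F
```

Each draw replaces the empirical measure with a Dirichlet-weighted measure, which provides the randomization of the covariate distribution discussed above.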

3 Main Theoretical Results

In this section, we derive the Bernstein–von Mises (BvM) theorem, which establishes the asymptotic equivalence between our Bayesian procedure and frequentist-type semiparametric efficient procedures for the ATE. We consider an asymptotically efficient estimator with the following linear representation:
(3.1) τ̂ = τ_0 + (1/n) Σ_{i=1}^n ψ̃_{η_0}(Z_i) + o_{P_0}(n^{−1/2}),
where ψ̃_{η_0} is the efficient influence function given in ( 2.4 ). Below, we denote . By virtue of the BvM theorem, the two conditional distributions and are asymptotically equivalent. Another important consequence of the BvM theorem concerns the asymptotic normality and efficiency of the Bayesian point estimator. That is, is asymptotically normal with mean zero and variance . Thus, achieves the semiparametric efficiency bound of Hahn ( 1998 ).

3.1 Least Favorable Direction

Our prior correction through the Riesz representer is motivated by the least favorable direction of Bayesian submodels. We first provide such least favorable calculations, which are closely linked to the semiparametric efficiency. Consider the one-dimensional submodel defined by the path
(3.2)
for a given direction with . The difficulty of estimating the parameter along the submodels depends on the direction . Among them, let denote the least favorable direction, associated with the most difficult submodel; it yields the largest optimal asymptotic variance for estimating among all submodels. Let denote the joint density of Z depending on . Taking the derivative of the logarithmic density with respect to t and evaluating it at t = 0 gives the score operator:
(3.3)
where , , and . The least favorable direction is defined as the solution of the equation ; see Ghosal and van der Vaart ( 2017 , page 370). We immediately obtain the following.

Lemma 3.1. Consider the submodel ( 3.2 ). Let Assumption 1 hold for any η under consideration. Then the least favorable direction for estimating the ATE parameter in ( 2.3 ) is:

(3.4)
where the Riesz representer is given in ( 2.5 ).

Lemma 3.1 motivates the adjustment of the prior distribution as considered in our Bayesian procedure in Section 2.2 . Our prior correction, which takes the form of the (estimated) least favorable direction, provides an exact invariance under a shift of nonparametric components in this direction. It provides additional robustness against posterior inaccuracy in the “most difficult direction,” that is, the one inducing the largest bias in the ATE. We also note that Lemma 3.1 extends the result in Section 2.1 of Ray and van der Vaart ( 2020 ) for the missing data problem, which is equivalent to observing only one arm (either the treatment or control arm), to the context of ATE estimation that involves both arms.

3.2 Assumptions for Inference

We now provide additional notation and assumptions. The posterior distribution plays an important role in the following analysis and is given by
where denotes the conditional density of , obtained by dividing ( 2.1 ) by the marginal density of . We write for the marginal posterior distribution of . We focus on the case that has a prior that is independent of the prior for . Because the likelihood function ( 2.1 ) factorizes into separately, the posterior of is also independent of the posterior for . Due to the fact that does not depend on , it is unnecessary to further discuss a prior or posterior distribution on .

We first introduce high-level assumptions and discuss primitive conditions for them in the next section. Below, we consider measurable sets of functions such that . We also write when we index the conditional mean function by its subscript η. We introduce the notation for all , as well as the supremum norm . For two sequences a_n and b_n of positive numbers, we write a_n ≲ b_n if a_n/b_n remains bounded, and a_n ≍ b_n if both a_n ≲ b_n and b_n ≲ a_n.

Assumption 2. (Rates of Convergence)The estimators and , which are based on an auxiliary sample independent of , satisfy and for :

where and . Further, .

We adopt the standard empirical process notation as follows. For a function h of a random vector that follows distribution , we let , , and . Below, we make use of the notation and .

Assumption 3. (Complexity)For it holds and

(3.5)

Recall the propensity score-adjusted prior on m given by where . The restriction on λ is made through its hyperparameter .

Assumption 4. (Prior Stability)For , is a continuous stochastic process independent of the normal random variable , where , and that satisfies: (i) , for some deterministic sequence and (ii) for any .

Discussion of Assumptions

Assumption 2 imposes sufficiently fast convergence rates for the pilot estimators of the conditional mean function and the propensity score . When considering frequentist pilot estimators, these rate conditions can be justified by adopting the recent proposals of Chernozhukov, Newey, and Singh ( 2022a , b ). One can also use Bayesian point estimators, such as the posterior mean of the Gaussian process, for and . The posterior convergence rate for the conditional mean can be derived in the same spirit as Ray and van der Vaart ( 2020 ). The rate conditions in Assumption 2 also resemble conditions (i) and (ii) of Theorem 1 of Farrell ( 2015 ) in the context of frequentist estimation. Remark 4.1 illustrates that, under classical smoothness assumptions, this assumption is less restrictive than the corresponding conditions in Ray and van der Vaart ( 2020 ) or in other approaches to semiparametric estimation of ATEs, such as Chen, Hong, and Tarozzi ( 2008 ) or Farrell, Liang, and Misra ( 2021 ). Assumption 4 incorporates Conditions (3.9) and (3.10) from Theorem 2 in Ray and van der Vaart ( 2020 ) and is imposed to verify the invariance property of the adjusted prior distribution. These restrictions are mild and extend beyond the Gaussian processes considered in Section 4 for concreteness.

Assumption 3 restricts the functional class to form a -Glivenko–Cantelli class; see Section 2.4 of van der Vaart and Wellner ( 1996 ). It also imposes a new stochastic equicontinuity condition, as ( 3.5 ) imposes a product structure involving and , which further relaxes the corresponding condition from Ray and van der Vaart ( 2020 ), namely, . In the next section, we demonstrate that our formulation allows for double robustness under Hölder classes (see Remark 4.1 ). Hence, the complexity of the functional class can be compensated by sufficient regularity of the corresponding Riesz representer and vice versa. A condition similar to our Assumption 3 is also used in the frequentist literature; see Section 2 of Benkeser, Carone, van der Laan, and Gilbert ( 2017 ). Nonetheless, our technical argument differs substantially from these frequentist studies, because we mainly need condition ( 3.5 ) to control changes in the likelihood under perturbations along the estimated and true least favorable directions. This is unique to Bayesian analysis with nonparametric priors.

3.3 A Double Robust Bernstein–von Mises Theorem

We now present a new Bernstein–von Mises theorem, which establishes the asymptotic normality of the posterior distribution, modulo a “bias term.” In the next step, we show that the posterior correction proposed in our procedure eliminates this “bias term.” The asymptotic equivalence result is established using the bounded Lipschitz distance. For two probability measures P, Q defined on a metric space , we define the bounded Lipschitz distance as
(3.6) d_BL(P, Q) = sup_f | ∫ f dP − ∫ f dQ |,
where the supremum runs over all functions f satisfying
sup_x |f(x)| + sup_{x ≠ y} |f(x) − f(y)| / ‖x − y‖ ≤ 1.
Here, ‖·‖ denotes the vector norm.

Below is our main statement about the asymptotic behavior of the posterior distribution of . As is typical in the modern Bayesian paradigm, the exact posterior is rarely available in closed form, and one needs to rely on Monte Carlo methods, such as the implementation procedure in Section 2.2 , to approximate this posterior distribution, as well as the resulting point estimator and credible set.

Theorem 3.1. Let Assumptions 1–4 hold. Then we have

where .

We emphasize that the above BvM theorem is not feasible for applications, because it involves the “bias term” , which depends on the unknown conditional mean . Nonetheless, it provides an important theoretical benchmark. One can follow the existing literature on semiparametric BvM theorems and impose the so-called “no-bias” condition, but this generally leads to strong smoothness restrictions and may not be satisfied when the dimensionality of covariates is large relative to the smoothness of the underlying functions; see the discussion on page 395 of van der Vaart ( 1998 ).

This “bias term” in our context consists of two key components: the first involves the unknown true functions, and the second depends on the posterior of . We consider pilot estimators for the unknown functional parameters in . The correction term , as introduced in ( 2.8 ), results in a feasible Bayesian procedure that satisfies the BvM theorem under double robustness, as demonstrated below.

Theorem 3.2. Let Assumptions 1–4 hold. Then we have

We now show how Theorem 3.2 provides a frequentist justification for Bayesian methods of constructing point estimators and confidence sets. Recall that represents the posterior mean. Introduce a Bayesian credible set for , which satisfies for a given nominal level . The next result shows that also forms a confidence interval, in the frequentist sense, for the ATE parameter, whose coverage probability under converges to .

Corollary 3.1. Let Assumptions 1–4 hold. Then under , we have

(3.7)
Also, for any we have .

To the best of our knowledge, this is the first BvM theorem that entails double robustness. We now discuss the distinction from Theorem 2 in Ray and van der Vaart ( 2020 ). Their work laid the theoretical foundation for Bayesian inference based on propensity score adjusted priors. Specifically, under this prior adjustment, they established a BvM result under weak regularity conditions on the propensity score function, referring to this property as single robustness. Our analysis differs from Ray and van der Vaart ( 2020 ) in two crucial ways. First, we improve on their Lemma 3 by showing that it is possible to verify the prior stability condition for propensity score-adjusted priors under the product structure in Assumption 3 , modulo the “bias term” . This separation is essential to identify the source of restrictive conditions, such as the Donsker property on , which is mainly used to eliminate . Second, our proposal introduces an explicit debiasing step, borrowing key insights from recent developments in the DML literature.

Remark 3.1. (Connection With Frequentist Robust Estimation)In our BvM theorem, we do not restrict the centering estimator , as long as it admits the linear representation in ( 3.1 ). A popular frequentist estimator for the ATE that achieves double robustness is

(3.8) τ̂_DR = (1/n) Σ_{i=1}^n [ m̂(1, X_i) − m̂(0, X_i) + γ̂(D_i, X_i){Y_i − m̂(D_i, X_i)} ],
based on frequentist-type pilot estimators of the conditional mean function and of the Riesz representer ; see Robins and Rotnitzky ( 1995 ) and more recently Chernozhukov, Newey, and Singh ( 2022a , b ). The double robust or double machine learning estimator ( 3.8 ) recenters the plug-in type functional by an explicit correction factor that depends on the Riesz representer. Our main result establishes the asymptotic equivalence of our estimator and ( 3.8 ). This not only offers frequentist validity to our Bayesian procedure but also provides a Bayesian interpretation for doubly robust frequentist methods.
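The estimator in ( 3.8 ) is straightforward to compute from pilot estimates of the conditional means and the propensity score. The following is a hedged sketch with simulated data and oracle nuisance functions (all names and the data generating process are our own illustrative choices); the last line illustrates double robustness, since the estimator remains consistent with a misspecified outcome model as long as the propensity score is correct:

```python
import numpy as np

def aipw_ate(y, d, m1, m0, pi, eps=0.01):
    """Double robust (AIPW-type) ATE estimate with an influence-function-based
    standard error, given pilot estimates m1 = m(1,x), m0 = m(0,x), pi = pi(x)."""
    pi = np.clip(pi, eps, 1.0 - eps)
    md = np.where(d == 1, m1, m0)
    gamma = d / pi - (1 - d) / (1 - pi)      # plug-in Riesz representer
    psi = m1 - m0 + gamma * (y - md)         # per-observation influence terms
    tau = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(len(y))
    return tau, se

# Simulated example with known truth tau = 2 and oracle nuisances.
rng = np.random.default_rng(1)
n = 20_000
x = rng.uniform(size=n)
pi = 0.3 + 0.4 * x
d = rng.binomial(1, pi)
y = 2.0 * d + x + rng.normal(scale=0.5, size=n)

tau_dr, se = aipw_ate(y, d, 2.0 + x, x, pi)                 # both nuisances correct
tau_ipw, _ = aipw_ate(y, d, np.zeros(n), np.zeros(n), pi)   # outcome model misspecified
```

Our BvM theorem implies that the corrected posterior is asymptotically equivalent to the sampling distribution of such an estimator.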

Remark 3.2. (Parametric Bayesian Methods)A couple of recent papers propose doubly robust Bayesian recipes for ATE inference under parametric model restrictions. Saarela, Belzile, and Stephens ( 2016 ) considered a Bayesian procedure based on an analog of the double robust frequentist estimator given in Equation ( 3.8 ), replacing the empirical measure with the Bayesian bootstrap measure. However, no formal BvM theorem was presented therein. Another recent paper by Yiu, Goudie, and Tom ( 2020 ) explored Bayesian exponentially tilted empirical likelihood with a set of moment constraints of a double robust type. They proved a BvM theorem for the posterior constructed from the resulting exponentially tilted empirical likelihood under parametric specifications. Luo, Graham, and McCoy ( 2023 ) provided Bayesian results for ATE estimation in a partial linear model, which implies homogeneous treatment effects. They also assign parametric priors to the propensity score. Their BvM theorem allows for misspecification only in a parametric nonlinear component of the outcome equation. It is not clear how to extend their analysis to incorporate flexible nonparametric modeling strategies.

4 Illustration Using Squared Exponential Process Priors

We illustrate the general methodology by placing a particular Gaussian process prior on the conditional mean functions. Gaussian process regression has been used extensively in the machine learning community and has started to gain popularity among economists; see Kasy ( 2018 ). We provide primitive conditions for the main results of the previous section. In addition, we provide details on the implementation using Gaussian process priors and discuss data-driven choices of the tuning parameters.

4.1 Asymptotic Results Under Primitive Conditions

Let be a generic centered and homogeneous Gaussian random field with covariance function of the form , for a given continuous function . We consider as a Borel measurable map into the space of continuous functions on , equipped with the supremum norm . The Gaussian process is completely determined by its covariance function. For example, the covariance function of the squared exponential process is given by , as its name suggests. In this section, we focus on the squared exponential process prior, which is one of the most commonly used priors in applications; see Rasmussen and Williams ( 2006 ) and Murphy ( 2023 ). We also consider a rescaled Gaussian process for some . Intuitively, can be thought of as a bandwidth parameter. For a large (or, equivalently, a small bandwidth), the prior sample path is obtained by shrinking the long sample path . Thus, it incorporates more randomness and becomes suitable as a prior model for less regular functions; see van der Vaart and van Zanten ( 2008 , 2009 ).
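The effect of the rescaling can be seen by drawing sample paths. A minimal sketch, assuming the rescaled squared exponential covariance exp(−a²‖x − x′‖²) described above; a larger rescaling parameter (smaller bandwidth) produces rougher paths:

```python
import numpy as np

def sample_rescaled_se_gp(xs, a, n_draws=1, jitter=1e-6, seed=0):
    """Draw sample paths of the rescaled squared exponential process
    W^a_x = W_{a x}, with covariance exp(-a**2 * (x - x')**2).
    jitter stabilizes the Cholesky factorization numerically."""
    diff = xs[:, None] - xs[None, :]
    K = np.exp(-(a * diff) ** 2) + jitter * np.eye(len(xs))
    L = np.linalg.cholesky(K)
    rng = np.random.default_rng(seed)
    return L @ rng.standard_normal((len(xs), n_draws))

xs = np.linspace(0.0, 1.0, 200)
smooth = sample_rescaled_se_gp(xs, a=2.0)    # small a: long, smooth path
rough = sample_rescaled_se_gp(xs, a=50.0)    # large a, small bandwidth: wiggly path
```

This illustrates why the rescaled process is a suitable prior model for less regular functions: its increments are far more variable.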

Below, denotes a Hölder space with the smoothness index . Specifically, we illustrate our theory with the case where for . Given such a Hölder-type smoothness condition, we choose
(4.1)
Under ( 4.1 ), a rescaled Gaussian process induces the posterior contraction rate for the conditional mean function to be ; see Section 11.5 of Ghosal and van der Vaart ( 2017 ). This particular choice of mimics the corresponding bandwidth choice in kernel smoothing methods. Other choices of will generally yield slower convergence rates. Nonetheless, as long as the propensity score is estimated at a sufficiently fast rate, our BvM theorem still holds. The next proposition illustrates our general theory when we adopt the rescaled squared exponential process prior for the conditional mean function. We use the superscript m for the prior process to signify this relationship.

Proposition 4.1. Let Assumption 1 hold, and suppose the estimator satisfies and for some . Suppose for and some with . Also, . Consider the propensity score-dependent prior on m given by , where is the rescaled squared exponential process for , with its rescaling parameter of the order in ( 4.1 ) and for some deterministic sequence , and . Then the corrected posterior distribution for the ATE satisfies the BvM theorem in Theorem 3.2 .

Remark 4.1. (Double Robust Hölder Smoothness)Proposition 4.1 requires , which represents a trade-off between the smoothness requirements for and . This encapsulates double robustness; that is, a lack of smoothness of the conditional mean function can be mitigated by exploiting the regularity of the propensity score and vice versa. Referring to the Hölder class , its complexity measured by the bracketing entropy of size ε is of order for . One can show that the key stochastic equicontinuity assumption in Ray and van der Vaart ( 2020 ), that is, their condition ( 3.5 ), is violated, by invoking the Sudakov lower bound in Han ( 2021 ), when , or equivalently, when . In contrast, our framework accommodates this non-Donsker regime as long as , which enables us to exploit the product structure and a fast convergence rate for estimating the propensity score. Our methodology is not restricted to the case where the propensity score belongs to a Hölder class per se. For instance, under a parametric restriction (such as in logistic regression) or an additive model with an unknown link function, the possible range of the posterior contraction rate for the conditional mean function can be substantially enlarged. In the case , the bias term becomes asymptotically negligible, that is, . This allows for smoothness robustness only with respect to the propensity score and is also known as single robustness. In this case, no posterior correction is required; see Ray and van der Vaart ( 2020 ).

4.2 Implementation Details

We provide details on the Gaussian process prior placed on and its posterior computation. Algorithm 1 sets the adjusted prior as . In our implementation, we choose the first component to be a zero-mean Gaussian process with the commonly used squared exponential covariance function; see Rasmussen and Williams ( 2006 , p. 83). That is, where the hyperparameter is the kernel variance and are rescaling parameters that reflect the relevance of the treatment and each covariate in predicting . They are selected by maximizing the marginal likelihood. Conditional on the data used to obtain the propensity score estimator , the prior for has zero mean and the covariance kernel , which includes an additional term based on the estimated Riesz representer . It is given by ; cf. related constructions in Ray and Szabó ( 2019 ) and Ray and van der Vaart ( 2020 ). The parameter , representing the standard deviation of λ, controls the weight of the prior adjustment relative to the standard Gaussian process. The choice , where , as specified in Algorithm 1 , satisfies the conditions and in Assumption 4 with probability approaching one. It is similar to the choice suggested by Ray and Szabó ( 2019 , page 6), where is proportional to . The factor normalizes the second (adjustment) term of to the same scale as the unadjusted covariance K. Supplemental Appendix H shows that the finite-sample performance of the double robust Bayesian approach remains stable across different choices of .
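A sketch of this adjusted covariance kernel, assuming (for illustration only) a two-dimensional input v = (d, x), a constant estimated propensity score of 0.4, and hypothetical values for the kernel variance, length scales, and adjustment weight `sigma`:

```python
import numpy as np

def sq_exp_kernel(V1, V2, var, scales):
    """Squared exponential covariance with per-coordinate rescaling."""
    diff = V1[:, None, :] - V2[None, :, :]
    return var * np.exp(-0.5 * np.sum((diff / scales) ** 2, axis=-1))

def adjusted_kernel(V1, V2, var, scales, r_hat, sigma):
    """Covariance of the propensity-score-adjusted prior: the unadjusted
    kernel plus a rank-one term in the estimated Riesz representer r_hat,
    weighted by sigma**2 (the squared standard deviation of lambda)."""
    return sq_exp_kernel(V1, V2, var, scales) + sigma ** 2 * np.outer(r_hat(V1), r_hat(V2))

# hypothetical Riesz representer d/pi - (1-d)/(1-pi) with pi(x) = 0.4
r_hat = lambda V: V[:, 0] / 0.4 - (1 - V[:, 0]) / 0.6
V = np.column_stack([np.tile([0.0, 1.0], 3), np.repeat([0.1, 0.5, 0.9], 2)])
K = adjusted_kernel(V, V, var=1.0, scales=np.array([1.0, 0.5]), r_hat=r_hat, sigma=0.3)
```

The resulting matrix remains symmetric and positive semidefinite, since the adjustment only adds a rank-one positive semidefinite term to the base kernel.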

Utilizing Gaussian process priors with zero mean and covariance function , and incorporating the available data, we generate posterior draws of the vector for . This can be achieved through the Laplace approximation method detailed in Supplemental Appendix G.
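Schematically, posterior draws of the conditional mean at the observed covariates translate into ATE draws as below, with the covariate distribution integrated out via Bayesian-bootstrap (Dirichlet) weights, in the spirit of the Dirichlet posterior process appearing in the paper's proofs; all numbers here are synthetic placeholders:

```python
import numpy as np

def ate_draws(m1_draws, m0_draws, rng):
    """Turn posterior draws of the conditional mean evaluated at (1, X_i)
    and (0, X_i) (arrays of shape [B, n]) into B draws of the ATE,
    averaging over covariates with Bayesian-bootstrap (Dirichlet) weights."""
    B, n = m1_draws.shape
    weights = rng.dirichlet(np.ones(n), size=B)     # one weight vector per draw
    return np.sum(weights * (m1_draws - m0_draws), axis=1)

rng = np.random.default_rng(1)
B, n = 2000, 300
m1 = 0.6 + 0.02 * rng.standard_normal((B, n))       # synthetic posterior draws
m0 = 0.4 + 0.02 * rng.standard_normal((B, n))
draws = ate_draws(m1, m0, rng)
lo, hi = np.quantile(draws, [0.025, 0.975])          # 95% credible interval
```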

For the implementation of the pilot estimator given in ( 2.6 ), we recommend logistic Lasso for the propensity scores, with the penalty parameter chosen by cross-validation, following Friedman, Hastie, and Tibshirani ( 2010 ). As a pilot estimator in Algorithm 1 for posterior correction, we use the uncorrected posterior mean , where is calculated following Step (a) of posterior computation in Algorithm 1 , but with a Gaussian process prior without adjustment, that is, . When the rescaling parameter is as stated in Proposition 4.1 , the convergence rate of is . This can be shown by combining Theorems 11.22, 11.55, and 8.8 from Ghosal and van der Vaart ( 2017 ).
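A minimal sketch of a logistic Lasso fit for the propensity score; this is a bare-bones proximal-gradient stand-in for the glmnet routine, with a fixed (hypothetical) penalty level in place of cross-validation:

```python
import numpy as np

def logistic_lasso(X, d, lam, steps=500, lr=0.1):
    """Minimal proximal-gradient (ISTA) logistic Lasso, a simplified
    stand-in for the glmnet routine of Friedman, Hastie, and Tibshirani
    (2010); lam is a fixed l1 penalty (cross-validation omitted)."""
    n, p = X.shape
    b0, b = 0.0, np.zeros(p)
    for _ in range(steps):
        pi = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))
        b0 -= lr * np.mean(pi - d)                  # intercept: plain gradient step
        b -= lr * (X.T @ (pi - d) / n)              # gradient step on coefficients
        b = np.sign(b) * np.maximum(np.abs(b) - lr * lam, 0.0)  # soft-thresholding
    return b0, b

# synthetic check: the propensity score depends on the first covariate only
rng = np.random.default_rng(2)
X = rng.standard_normal((2000, 5))
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + X[:, 0]))))
b0, b = logistic_lasso(X, d, lam=0.02)
pi_hat = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))        # estimated propensity scores
```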

5 Numerical Results

In this section, we apply our method to one version of the Lalonde–Dehejia–Wahba data that contains a treated sample of 185 men from the National Supported Work (NSW) experiment and a control sample of 2490 men from the Panel Study of Income Dynamics (PSID). The data have been used by LaLonde ( 1986 ), Dehejia and Wahba ( 1999 ), Abadie and Imbens ( 2011 ), and Armstrong and Kolesár ( 2021 ), among others. We refer readers to LaLonde ( 1986 ) and Dehejia and Wahba ( 1999 ) for reviews of the data.

5.1 Simulations

In this section, we consider a simulation study where the observations are randomly drawn from a large sample generated by applying the Wasserstein Generative Adversarial Networks (WGAN) method to the Lalonde–Dehejia–Wahba data; see Athey, Imbens, Metzger, and Munro ( 2024 ). We view their simulated data as the population and repeatedly draw our simulation samples (each consisting of 185 treated observations and 2490 control observations) for each of the 1000 Monte Carlo replications. We slightly depart from previous studies by focusing on a binary outcome Y: the employment indicator for the year 1978, which is defined as an indicator for positive earnings. The treatment D is the participation in the NSW program. We are interested in the average treatment effect of the NSW program on the employment status. For the set of covariates, we follow Abadie and Imbens ( 2011 ) and include nine variables: age, education, black, Hispanic, married, earnings in 1974, earnings in 1975, unemployed in 1974, and unemployed in 1975. We implement our double robust Bayesian method (DR Bayes) following Algorithm 1 , using posterior draws and the pilot estimator and , as detailed at the end of Section 4.2 . We compare DR Bayes to two other Bayesian procedures: First, we consider the prior adjusted Bayesian method (PA Bayes) proposed by Ray and van der Vaart ( 2020 ), which constructs the point estimate and credible interval based on in (2.8). Second, we examine an unadjusted Bayesian method (Bayes), which is also based on but generated using Gaussian process priors without adjustment.

We also compare our method to frequentist estimators. Match/Match BC corresponds to the nearest neighbor matching estimator and its bias-corrected version by Abadie and Imbens ( 2011 ), which adjusts for differences in covariate values through regression. DR TMLE corresponds to the doubly robust targeted maximum likelihood estimator by Benkeser et al. ( 2017 ). DML refers to the double/debiased machine learning estimator from Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey ( 2017 ), where the nuisance functions and are estimated using random forests (which outperformed DML combined with other nuisance function estimators, such as Lasso, in our simulation setup). Since the job-training data contains a sizable proportion of units with propensity score estimates very close to 0 and 1, we follow Crump, Hotz, Imbens, and Mitnik ( 2009 ) and discard observations with the estimated propensity score outside the range , with the trimming threshold .
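The trimming rule can be expressed in a few lines; the propensity score values below are hypothetical:

```python
import numpy as np

def trim_mask(pi_hat, t):
    """Boolean mask keeping observations whose estimated propensity score
    lies inside [t, 1 - t], following Crump, Hotz, Imbens, and Mitnik (2009)."""
    return (pi_hat >= t) & (pi_hat <= 1 - t)

pi_hat = np.array([0.002, 0.05, 0.31, 0.64, 0.97, 0.999])
keep = trim_mask(pi_hat, t=0.05)   # drops the units near 0 and 1
```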

Table I presents the finite sample performance of the Bayesian and frequentist methods mentioned above. We use the full data twice in computing the prior/posterior adjustments and the posterior distribution of the conditional mean function. Supplemental Appendix H reports the performance of DR Bayes using sample splitting, which results in similar coverage but a larger credible interval length due to the halved sample size.

TABLE I. Simulation Results Using WGAN-Generated Data: trimming is based on the estimated propensity score within [t, 1 − t], with the average sample size after trimming; CP = coverage probability of the 95% credible/confidence interval, CIL = average length of the 95% credible/confidence interval. The three column groups correspond to the three trimming thresholds considered.

| Methods | Bias | CP | CIL | Bias | CP | CIL | Bias | CP | CIL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bayes | −0.040 | 0.683 | 0.147 | −0.010 | 0.841 | 0.149 | −0.006 | 0.911 | 0.120 |
| PA Bayes | −0.008 | 0.981 | 0.260 | 0.033 | 0.949 | 0.254 | 0.047 | 0.897 | 0.308 |
| DR Bayes | −0.024 | 0.983 | 0.223 | 0.014 | 0.970 | 0.221 | 0.023 | 0.952 | 0.258 |
| Match | 0.027 | 0.933 | 0.334 | 0.048 | 0.908 | 0.323 | 0.033 | 0.965 | 0.323 |
| Match BC | 0.040 | 0.880 | 0.347 | 0.065 | 0.816 | 0.334 | 0.083 | 0.804 | 0.339 |
| DR TMLE | 0.015 | 0.832 | 0.300 | 0.039 | 0.746 | 0.282 | 0.039 | 0.668 | 0.242 |
| DML | 0.045 | 0.927 | 0.524 | 0.052 | 0.870 | 0.393 | 0.054 | 0.918 | 0.522 |

Concerning the Bayesian methods for estimating the ATE, Table I reveals that unadjusted Bayes yields highly inaccurate coverage except for the case with trimming constant . If the prior is corrected using the propensity score adjustment, the results improve significantly. Nevertheless, our DR Bayes method offers two further improvements. First, DR Bayes yields shorter average interval lengths in each case while simultaneously improving the coverage probability. This can be attributed to a reduction in bias and/or more accurate uncertainty quantification via our posterior correction. Second, when the trimming threshold is small (i.e., ), propensity score estimators can be less accurate, leading to reduced coverage probabilities for PA Bayes. Our double robust Bayesian method, on the other hand, still provides accurate coverage probabilities. In other words, DR Bayes exhibits more stable performance than PA Bayes with respect to the trimming threshold.

DR Bayes also performs encouragingly compared with frequentist methods. It provides more accurate coverage than bias-corrected matching, DR TMLE, and DML. Compared to the matching estimator without bias correction, which achieves similarly good coverage, DR Bayes yields considerably shorter credible intervals.

5.2 An Empirical Illustration

We apply the Bayesian and frequentist methods considered above to the Lalonde-Dehejia-Wahba data. Similar to the simulation exercise, we consider a varying choice of the trimming threshold . The ATE point estimates and confidence intervals are presented in Table II . As a benchmark, the experimental data that uses both treated and control groups in NSW () yields an ATE estimate (treated-control mean difference) of 0.111 with a 95% confidence interval .

TABLE II. Estimates of ATE for the Lalonde-Dehejia-Wahba data: trimming is based on the estimated propensity score within [t, 1 − t], with the sample size after trimming; ATE = point estimate, 95% CI = 95% credible/confidence interval, CIL = 95% credible/confidence interval length. The three column groups correspond to the three trimming thresholds considered.

| Methods | ATE | 95% CI | CIL | ATE | 95% CI | CIL | ATE | 95% CI | CIL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bayes | 0.213 | [0.120, 0.301] | 0.181 | 0.214 | [0.132, 0.292] | 0.161 | 0.198 | [0.140, 0.251] | 0.112 |
| PA Bayes | 0.158 | [0.019, 0.288] | 0.270 | 0.170 | [0.045, 0.281] | 0.236 | 0.090 | [−0.078, 0.233] | 0.311 |
| DR Bayes | 0.178 | [0.061, 0.293] | 0.231 | 0.184 | [0.064, 0.294] | 0.230 | 0.121 | [−0.031, 0.250] | 0.281 |
| Match | 0.188 | [0.022, 0.355] | 0.333 | 0.140 | [−0.029, 0.309] | 0.338 | 0.079 | [−0.111, 0.269] | 0.380 |
| Match BC | 0.157 | [−0.006, 0.321] | 0.327 | 0.145 | [−0.021, 0.310] | 0.331 | 0.180 | [−0.004, 0.365] | 0.369 |
| DR TMLE | −0.023 | [−0.171, 0.125] | 0.296 | 0.073 | [−0.074, 0.220] | 0.294 | 0.071 | [−0.146, 0.289] | 0.435 |
| DML | 0.172 | [0.018, 0.327] | 0.308 | 0.150 | [−0.010, 0.310] | 0.320 | 0.258 | [−0.183, 0.699] | 0.882 |

As we see from Table II , the unadjusted Bayesian method yields larger estimates. The adjusted Bayesian methods (PA and DR Bayes), on the other hand, produce estimates comparable to the experimental estimate. PA Bayes finds that the job training program increased employment by 9.0 to 17.0 percentage points across the different trimming thresholds, and DR Bayes estimates the effect at 12.1 to 18.4 percentage points. Among the frequentist estimators, the matching estimator and its bias-corrected version produce estimates similar to PA and DR Bayes, but with wider confidence intervals. DR TMLE produces a negative estimate for one trimming threshold, whereas all other estimates are positive. For and 0.05, DML yields point estimates similar to PA and DR Bayes, but with less precision. In the case , where the overlap condition is nearly violated, its point estimate and confidence interval length become considerably larger than those of the other methods.

6 Extensions

This section extends the binary variable Y to encompass general cases, including continuous, counting, and multinomial outcomes. First, we examine the class of single-parameter exponential families, where the conditional density function is solely determined by the nonparametric conditional mean function. This covers continuous outcomes and counting variables. Second, we consider the "vector" case of exponential families for multinomial outcomes. For both classes, we derive the novel correction to the Bayesian procedure and delegate more technical discussions to Supplemental Appendices D and F. Additionally, we outline extensions to other causal parameters of interest.

6.1 A Single-Parameter Exponential Family

In this part, we assume that the distribution of conditional on and belongs to the “single-parameter” exponential family, where the unknown parameter is the nonparametric conditional mean function . The conditional density function is given by
(6.1)
where , and the function links the mean to the “natural parameter” of the exponential family. We also restrict the sufficient statistic to be linear in y.
The family ( 6.1 ) not only encompasses the Bernoulli distribution (with , , and ), as considered in the previous sections, but also allows for counting and continuous outcomes. For instance, when , the Poisson distribution corresponds to the choices , , and , while the exponential distribution is represented by , , and . Furthermore, the normal distribution with for some , is captured by , , , and . We emphasize that model ( 6.1 ) does not impose functional form assumptions on the conditional mean function m. The joint density of can be written as
(6.2)
We consider the same reparametrization of as in ( 2.2 ) except that now the second component of η uses the general link function q satisfying . We now state the least favorable direction for the exponential family case, which serves as motivation for the prior adjustment.
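For completeness, the standard natural-parameter links for the distributions listed after ( 6.1 ) can be summarized as follows (these are standard exponential-family facts, stated here in the generic notation q(m)):

```latex
% natural parameter q(m) linking the conditional mean m to (6.1)
\begin{aligned}
\text{Bernoulli:}    \quad & q(m) = \log\frac{m}{1-m}, \\
\text{Poisson:}      \quad & q(m) = \log m, \\
\text{Exponential:}  \quad & q(m) = -\frac{1}{m}, \\
\text{Normal (known } \sigma^2\text{):} \quad & q(m) = \frac{m}{\sigma^2}.
\end{aligned}
```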

Lemma 6.1.Let Assumption 1 hold for with any η under consideration. Then, for the joint distribution ( 6.2 ) and the submodel defined by the path with as defined in ( 3.2 ), the least favorable direction for estimating the ATE parameter in ( 2.3 ) is

(6.3)
where the Riesz representer is given in ( 2.5 ).

For the outcome family with , which includes the Bernoulli, Poisson, and exponential distributions, the least favorable direction for ATE estimation coincides with the one given in Lemma 3.1 . To implement the double robust Bayesian procedure for general outcomes, one can still follow Algorithm 1 , with the logistic function Ψ replaced by the inverse link function . For the normal (homoscedastic) outcome, where the prior adjustment in Algorithm 1 becomes , the hyperparameter a can be determined together with the other parameters of the Gaussian process by optimizing the marginal likelihood as in Ray and Szabó ( 2019 ). Proposition F.1 in the Supplemental Material provides primitive conditions for the BvM theorem to hold under double robust smoothness conditions.

6.2 Multinomial Outcomes

We now assume that the dependent variable takes values in a finite set, specifically . The ATE can then be written as , where the choice probabilities are with the multinomial logit specification:
for . The multinomial logit specification implies . We now provide the least favorable direction in the presence of multinomial outcomes and discuss its consequences for the prior adjustment below.
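Under this specification, the choice probabilities follow from a softmax normalization with a base category; a small sketch (index values hypothetical, assuming category 0 is normalized):

```python
import numpy as np

def choice_probs(H):
    """Multinomial-logit choice probabilities from index functions
    h_1(v), ..., h_J(v) stacked column-wise in H (shape [n, J]);
    category 0 is the base category with its index normalized to zero."""
    Z = np.column_stack([np.zeros(len(H)), H])
    Z -= Z.max(axis=1, keepdims=True)            # numerical stability
    P = np.exp(Z)
    return P / P.sum(axis=1, keepdims=True)      # rows sum to one

H = np.array([[0.0, 1.0], [2.0, -1.0]])          # two observations, J = 2
P = choice_probs(H)
```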

Lemma 6.2.Consider the submodel defined by the path , , with as defined in ( 3.2 ). Let Assumption 1 hold for with any η under consideration, then the least favorable direction for estimating the ATE parameter is

where the Riesz representer is given in ( 2.5 ).

We emphasize that the least favorable direction calculation is not a trivial extension of Hahn ( 1998 ) or Ray and van der Vaart ( 2020 ). This is because there are J nonparametric components involved in the conditional probability function of the multinomial outcomes given covariates, and we need to consider the perturbation of those J components jointly. Nonetheless, we show that the efficient influence function takes the same generic form as derived in Hahn ( 1998 ). In the proof of Lemma 6.2 , we compute the derivative of the parameter mapping along the path considered herein. We derive inner products involving the least favorable direction for each nonparametric component consisting of the conditional choice probabilities. To our knowledge, the extension to the multinomial case has not been considered in the literature, and it offers a result of independent interest.
Lemma 6.2 motivates the following modification of our double robust Bayesian estimator based on the propensity score-dependent prior on for :
where is a continuous stochastic process independent for . We may then follow the implementation as described in Section 2.2 using .

6.3 Other Causal Parameters

We now extend our procedure to general linear functionals of the conditional mean function. We do so only for binary outcomes, as the modification for other types of outcomes follows as above. Recall that the observable data consists of i.i.d. observations of . The causal parameter of interest is , where the function ψ is linear with respect to the conditional mean function . We introduce the Riesz representer satisfying . Let and be pilot estimators for the conditional mean and the Riesz representer, respectively, computed over an auxiliary sample. Our double robust Bayesian procedure can be extended by considering the corrected posterior distribution for as follows: , , where here . The derivations of the least favorable directions in the following two examples are provided in Supplemental Appendix E.
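Schematically, the correction step recenters each posterior draw by a sample average built from the pilot estimates. The sketch below illustrates the form of such a recentering; it is a hedged simplification, not the paper's exact formula (which involves quantities omitted here), and all inputs are hypothetical:

```python
import numpy as np

def correct_draws(tau_draws, y, m_check, r_check):
    """Recenter posterior draws of a linear functional by the sample mean
    of the estimated Riesz representer times the outcome residual; m_check
    and r_check are pilot estimates evaluated at the observations."""
    return tau_draws + np.mean(r_check * (y - m_check))

tau_draws = np.array([0.10, 0.20])          # hypothetical posterior draws
y = np.array([1.0, 0.0])
m_check = np.array([0.5, 0.5])              # pilot conditional-mean fit
r_check = np.array([2.0, -2.0])             # pilot Riesz representer values
corrected = correct_draws(tau_draws, y, m_check, r_check)
```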

Example 6.1. (Average Policy Effects)The policy effect from changing the distribution of X is , where the known distribution functions and have their supports contained in the support of the marginal covariate distribution . Following the general setup, with its Riesz representer , where and stand for the density function of and , respectively.

Example 6.2. (Average Derivative)For a continuous scalar (treatment) variable D, the average derivative is given by , where denotes the partial derivatives of m with respect to the continuous treatment D. Thus, we have with its Riesz representer given by , where here denotes the conditional density function of D given X.

  • 1 Strictly speaking, the main objective in Ray and van der Vaart ( 2020 ) concerns the mean response in a missing data model, which is equivalent to observing one arm (either the treatment or control) of the causal setup.
  • 2 If does not have a density, we can simply consider the conditional density of given instead of the joint density of .
  • 3 Another popular method in the statistics literature is the targeted learning approach ( van der Laan and Rose ( 2011 ), Benkeser et al. ( 2017 )).
  • 4 The data is available on Dehejia's website: http://users.nber.org/~rdehejia/nswdata2.html.
  • 5 Crump et al. ( 2009 ) suggested a simple rule of thumb with a threshold of , while Athey et al. ( 2024 ) used . Applying the optimal trimming rule proposed by Crump et al. ( 2009 ) to our simulated samples yields an average optimal trimming threshold 0.073.
  • 6 In additional simulations without trimming (), we find that all double robust methods, including DR Bayes, substantially undercover and/or inflate the length of their confidence intervals. This is consistent with Crump et al. ( 2009 ), who point out that propensity score estimates close to the boundaries tend to induce substantial bias and large variances in estimating the ATE. We also note that unadjusted Bayes severely undercovers in this case.
  • 7 Applying the optimal trimming rule proposed by Crump et al. ( 2009 ) yields an optimal threshold of 0.064.
  • Appendix A: Proofs of Main Results

Throughout this Appendix, denotes a generic constant whose value may change from line to line. We introduce additional subscripts when there are multiple constant terms in the same display. In the following, we denote the log-likelihood based on as
    where each term is the logarithm of the factors involving only π or m or f. Recall the definition of the measurable sets of functions such that . We introduce the conditional prior . The following posterior Laplace transform of given by
    (A.1)
plays a crucial role in establishing the BvM theorem ( Castillo ( 2012 ), Castillo and Rousseau ( 2015 ), Ray and van der Vaart ( 2020 )). Slightly abusing notation, we define a perturbation of along the least favorable direction, restricted to the components corresponding to π and m:
    (A.2)
    We explicitly write the perturbation of by . Recall that coincides with the Riesz representer by Lemma 3.1 . In addition, we introduce the following notation:
    (A.3)
    Also, recall the notation , which is used in the following. In the proofs below, we make use of Lemmas C.1–C.9 which can be found in Supplemental Appendix C.

    Proof of Theorem 3.1.Since the estimated least favorable direction is based on observations that are independent of , we may apply Lemma 2 of Ray and van der Vaart ( 2020 ). It suffices to handle the posterior distribution with set equal to a deterministic function . By Lemma 1 of Castillo and Rousseau ( 2015 ), it is sufficient to show that the Laplace transform given in ( A.1 ) satisfies

    (A.4)
    for every t in a neighborhood of 0, where the limit at the right-hand side of ( A.4 ) is the Laplace transform of the distribution. Note that we can write . Further, let , which satisfies ( 3.1 ).

    The Laplace transform can thus be written as

    The expansion in Lemma B.1 gives the following identity for all t in a sufficiently small neighborhood around zero and uniformly for :
    where we make use of the notation and the score operator defined through ( 3.3 ).

    Next, we plug this into the exponential part in the definition of , which then gives

    By Fubini's theorem, the double integral of the previous expression coincides with
    By the assumed -Glivenko–Cantelli property for in Assumption 3 , that is, , and the boundedness of , we apply Lemma C.4, which establishes the convergence of the Laplace transform for the Dirichlet posterior process. Specifically, it implies the convergence in probability of to uniformly over , using the notation and . Further, we may apply the convergence of imposed in Assumption 2 , so that the above display becomes
    We now analyze the empirical process term in the integral and examine its relationship with the bias term . To do so, we calculate
    where the last line follows from the definition of the bias term, that is, .

    Further, observe that and by the definition of the efficient influence function given in ( 2.4 ). As we insert these in the previous expression for , we obtain for all t in a sufficiently small neighborhood around zero and uniformly for :

where the last equality follows from the prior invariance condition established in Lemma B.2 . This implies ( A.4 ) using that by Lemma 3.1 . Q.E.D.

    Proof of Theorem 3.2.It is sufficient to show that , where and . We make use of the decomposition

    (A.5)
    Consider the first summand on the right-hand side of the previous equation. We have uniformly for :
    where the last equation follows from the following derivation:
    using the Cauchy–Schwarz inequality, Assumption 2 , and Assumption 3 . Consider the second summand on the right-hand side of ( A.5 ). From Lemma C.8, we infer
    Consequently, decomposition ( A.5 ) together with the asymptotic expansion of each summand yields
    where the last equation is due to the equation (C.6) in Supplemental Appendix C. Q.E.D.

    Proof of Corollary 3.1.The weak convergence of the Bayesian point estimator directly follows from our asymptotic characterization of the posterior and the argmax theorem; see the proof of Theorem 10.8 in van der Vaart ( 1998 ). The corrected Bayesian credible set satisfies for any . In particular, we have

    Now the definition of the estimator given in ( 3.1 ) yields . For any set A, we write . Theorem 3.1 implies
    We may thus write for some set satisfying . Therefore, the frequentist coverage of the Bayesian credible set is
    noting that is asymptotically normal with mean zero and variance under . Q.E.D.

    Proof of Proposition 4.1.Note that is based on an auxiliary sample, and hence, we can treat below as a deterministic function denoted by satisfying the rate restrictions and . Regarding the conditional mean functions, we consider the set , where for and some constant :

    (A.6)
    where in the first restriction for the Gaussian process is a regularity class of functions defined in the equation (C.7) in Supplemental Appendix C. We write .

We first verify Assumption 2 with . The posterior contraction rate is shown in our Lemma C.3. Regarding the product rate condition, that is, for : this is satisfied if , which can be rewritten as .

    We now verify Assumption 3 . It is sufficient to deal with the resulting empirical process . Note that the Cauchy–Schwarz inequality implies

    Consequently, from Lemma C.5 we infer
Note that if , from Lemma C.9 we infer . Thus it remains to consider the case . By the entropy bound presented in the proof of Lemma C.3, we have , modulo a logn term on the right-hand side of the bound. Because is monotone and Lipschitz, a set of ε-covers in for translates into a set of ε-covers for . In this case, the empirical process bound of Han ( 2021 , p. 2644) yields
where represents a term that diverges at a certain polynomial order of logn. Consequently, we obtain
    which is satisfied under the smoothness restriction or equivalently . This condition automatically holds given .

    Finally, it remains to verify Assumption 4 . By the univariate Gaussian tail bound, the prior mass of the set satisfies . Also, the Kullback–Leibler neighborhood around has prior probability at least . We may thus apply Lemma 4 of Ray and van der Vaart ( 2020 ), which yields , as imposed in Assumption 4 (i).

    Regarding Assumption 4 (ii), we need to show the posterior probability of the shifted version of is tending to one. Considering itself, the first set in the intersection of ( A.6 ) that defines is seen to have posterior probability tending to one by the result in (II) of Lemma C.3, combined with the univariate Gaussian tail probability bound

The second set in the intersection of ( A.6 ) has posterior probability tending to one by Lemma 17 of Ray and van der Vaart ( 2020 ). Hence, has posterior probability going to one. Next, we consider , for any . Slightly abusing notation, we write for in the sequel. By the Lipschitz continuity of the logistic link function, we have for . Therefore, we get with probability approaching one, where
    with and denotes the norm of the Reproducing Kernel Hilbert Space associated with the squared exponential process; see Supplemental Appendix C for a formal definition. Because and , the posterior probability of tends to one following similar arguments concerning the set , after replacing with a multiple of itself for . Hence, the posterior probability of is seen to tend to one, which completes the proof. Q.E.D.

    Appendix B: Key Lemmas

    We now present key lemmas used in the derivation of our BvM theorem. We introduce where
    (B.1)
    This defines a path from to . We also write , for , so that ; cf. the proof of Theorem 1 in Ray and van der Vaart ( 2020 ).

    Lemma B.1.Let Assumptions 1 and 2 hold. Then we have uniformly for :

    Proof.We start with the following decomposition:

    From the calculation in the proof of Lemma C.1, we have . Then we infer for the stochastic equicontinuity term that
    uniformly in . We can thus write uniformly in :
The rest of the proof involves a standard Taylor expansion for the third term on the right-hand side of the above equation. By the equation (C.4) in the proof of Lemma C.1, we get
    by the fact that and the definition of the Riesz representer in ( 2.5 ). Regarding the second-order term in the Taylor expansion in the equation (C.5) of the proof of Lemma C.1, we get
    Considering the score operator defined in ( 3.3 ), we have
    Consequently, by the unconfoundedness imposed in Assumption 1 (i) and the binary nature of Y, we have . We thus obtain
Then, employing Assumption 1 (ii), that is, for all x, yields uniformly for :
    where the last equation is due to the posterior contraction rate of the conditional mean function imposed in Assumption 2 . Consequently, we obtain, uniformly for ,
    which leads to the desired result. Q.E.D.

    The next lemma verifies the prior stability condition under our double robust smoothness conditions.

    Lemma B.2.Let Assumptions 14 hold. Then we have

    (B.2)
    for a sequence of measurable sets such that .

    Proof.Since is based on an auxiliary sample, it is sufficient to consider deterministic functions with the same rates of convergence as . Denote the corresponding propensity score by . By Assumption 4 , we have and

    (B.3)
    where denotes the probability density function of a random variable and the set is defined by where imposed in Assumption 4 and .

    Considering the log likelihood ratio of two normal densities together with the constraint , it is shown on page 3015 of Ray and van der Vaart ( 2020 ) that

    We show at the end of the proof that , uniformly for . Consequently, the numerator of this leading term in ( B.3 ) becomes
By a change of variables in the numerator and using the notation , the prior invariance property becomes
    The desired result would follow from and . The first convergence directly follows from Assumption 4 . The set is the intersection of these two conditions in Assumption 4 , except that the restriction on λ in is instead of . By construction, we have , so that .

    We complete the proof by establishing the following result:

    (B.4)
    We denote and . Consider the following decomposition of the log-likelihood:
    Next, we apply third-order Taylor expansions in Lemma C.1 separately to the two terms in the brackets of the above display making use of the notation :
    for some intermediate points ; cf. the equation ( B.1 ). Combining the previous calculation yields
    In order to control , we evaluate
    Note that the first term is centered, so it becomes . We apply Lemma C.2 to conclude that it is of smaller order. The middle term is negligible by our Assumption 3 . Referring to the last term, the Cauchy–Schwarz inequality yields
    where the last equality is due to Assumption 2 . We thus obtain uniformly in . Consider . We note that uniformly in . Hence, we obtain
    as in -norm by Assumption 2 . Thus, uniformly in . Finally, we control by evaluating uniformly in , which shows ( B.4 ). Q.E.D.

  • The replication package for this paper is available at https://doi.org/10.5281/zenodo.14015435. The Journal checked the data and codes included in the package for their ability to reproduce the results in the paper and approved online appendices.