NCB and RC were supported by the MUR PRIN grant 2022EKNE5K (Learning in Markets and Society), by the FAIR (Future Artificial Intelligence Research) project, funded by the NextGenerationEU program within the PNRR-PE-AI scheme, and by the EU Horizon research and innovation action under grant agreement 101120237, project ELIAS (European Lighthouse of AI for Sustainability). Maximilian Kasy was supported by the Alfred P. Sloan Foundation, under the grant “Social foundations for statistics and machine learning.”

About

Sections

PDF

Tools

Share a link

Email
Wechat
Bluesky

Abstract

We consider the problem of repeatedly choosing policies to maximize social welfare. Welfare is a weighted sum of private utility and public revenue. Earlier outcomes inform later policies. Utility is not observed, but indirectly inferred. Response functions are learned through experimentation.

We derive a lower bound on regret, and a matching adversarial upper bound for a variant of the Exp3 algorithm. Cumulative regret grows at a rate of T^2/3. This implies that (i) welfare maximization is harder than the multiarmed bandit problem (with a rate of T^1/2 for finite policy sets), and (ii) our algorithm achieves the optimal rate. For the stochastic setting, if social welfare is concave, we can achieve a rate of T^1/2 (for continuous policy sets), using a dyadic search algorithm.

We analyze an extension to nonlinear income taxation, and sketch an extension to commodity taxation. We compare our setting to monopoly pricing (which is easier), and price setting for bilateral trade (which is harder).

1 Introduction

Consider a policymaker who aims to maximize social welfare, defined as a weighted sum of utility across individuals. The policymaker can choose a policy parameter such as a sales tax rate, an unemployment benefit level, a health-insurance copay rate, etc. The policymaker does not directly observe the welfare resulting from their policy choices. They do, however, observe behavioral outcomes such as consumption of the taxed good, labor market participation, or health care expenditures. They can revise their policy choices over time in light of observed outcomes. How should such a policymaker act? To address this question, we bring together insights from welfare economics (in particular optimal taxation, Ramsey (1927), Mirrlees (1971), Baily (1978), Saez (2001), Chetty (2009)) with insights from machine learning (in particular online learning and multiarmed bandits, see Slivkins (2019), Lattimore and Szepesvári (2020) for recent reviews, and Thompson (1933), Lai and Robbins (1985) for classic contributions).

In our baseline model, individuals arrive sequentially and make a single binary decision. In each period, the policymaker chooses a tax rate that applies to this binary decision, and then observes the individual's response. They do not observe the individual's private utility. Social welfare is given by a weighted sum of private utility and public revenue. Later, we extend our model to nonlinear income taxation, where welfare weights vary as a function of individual earnings capacity, and sketch an extension to commodity taxation, where individual decisions involve a continuous consumption vector.

Our goal is to give guidance to the policymaker. We propose algorithms to maximize cumulative social welfare, and we provide (adversarial and stochastic) guarantees for the performance of these algorithms. In doing so, we also show that welfare maximization is a harder learning problem than reward maximization in the multiarmed bandit setting. Private utility in our baseline model is equal to consumer surplus, which is given by the integral of demand. In order to learn this integral, we need to learn demand for counterfactual, suboptimal tax rates. This drives the difficulty of the learning problem.

Why Welfare, Why Adversarial Guarantees?

Our algorithms are designed to maximize social welfare, which is not directly observable, rather than maximizing outcomes that are directly observable. The definition of social welfare as an aggregation of individual utilities is at the heart of welfare economics in general, and of optimal tax theory in particular. The distinction between utility and observable outcomes is important in practice. To illustrate, consider the example of a policymaker who chooses the level of unemployment benefits, where the observable outcome is employment. The policymaker could use an algorithm that adaptively maximizes employment. The problem with this approach is that employment might be maximized by making the unemployed as miserable as possible. This is not normatively appealing. Such an algorithm would minimize the utility of the unemployed, rather than maximizing social welfare. Similar examples can be given for many domains of public policy, including health, education, and criminal justice. In contrast to observable outcomes such as employment, welfare is improved by increasing the choice sets of those affected, not by reducing these choice sets.

Our theoretical analysis provides not only stochastic but also adversarial guarantees, which hold for arbitrary sequences of preference parameters. Adversarial guarantees for algorithms promise robustness against deviations from the assumption that heterogeneity is independently and identically distributed over time. Possible deviations from this assumption include autocorrelation, trends, heteroskedasticity, more general nonstationarity, and other concerns of time-series econometrics. In the employment example, such deviations might for instance be due to the business cycle. One might fear that adversarial robustness is achieved at the price of worsened performance for the i.i.d. setting, relative to less robust algorithms. That this is not the case follows from our theoretical characterizations.

Lower and Upper Bounds on Regret

Our main theorems provide lower and upper bounds on cumulative regret. Cumulative regret is defined as the difference in welfare between the chosen sequence of policies and the best possible constant policy. We consider both stochastic and adversarial regret. A lower bound on stochastic regret satisfies that, for any algorithm, there exists some stationary distribution of preference parameters for which the algorithm has to suffer at least a certain amount of regret. An upper bound on adversarial regret has to hold for a given algorithm and any sequence of preference parameters.

For a given algorithm, stochastic regret, averaged over i.i.d. sequences of preference parameters, is always less or equal than adversarial regret, for the worst-case sequence. A lower bound on stochastic regret (for any algorithm) therefore implies a corresponding lower bound on adversarial regret, and an upper bound on adversarial regret (for a given algorithm) immediately implies an upper bound on stochastic regret. When an adversarial upper bound coincides with a stochastic lower bound, in terms of rates of regret, it follows that the proposed algorithm is rate efficient, for both stochastic and adversarial regret. It follows, furthermore, that the bounds are sharp.

A Lower Bound on Stochastic Regret

We first prove a stochastic (and thus also adversarial) lower bound on regret, for any possible algorithm in the welfare maximization problem. Our proof of this bound constructs a family of possible distributions for preferences. This family is such that there are two candidate policies, which are potentially optimal. The difference in welfare between these two policies depends on the integral of demand over intermediate policy values. In order to learn which of the two candidate policies is optimal, we need to learn behavioral responses for intermediate policies, which are strictly suboptimal. Because of the need to probe these suboptimal policies sufficiently often, we obtain a lower bound on regret, which grows at a rate of , even if we restrict our attention to settings with finite, known support for preference parameters and policies. This rate is worse than the worst-case rate for bandits of .

A Matching Upper Bound on Adversarial Regret for Modified Exp3

We next propose an algorithm for the adaptive maximization of social welfare. Our algorithm is a modification of the Exp3 algorithm (Auer, Cesa-Bianchi, Freund, and Schapire (2002)). Exp3 is based on an unbiased estimate of cumulative welfare for each policy. The probability of choosing a given policy is proportional to the exponential of this estimate of cumulative welfare, times some rate parameter. Relative to Exp3, we require two modifications for our setting. First, we need to discretize the continuous policy space. Second, and more interestingly, we need additional exploration of counterfactual policies, including some policies that are clearly suboptimal, in order to learn welfare for the policies, which are contenders for the optimum. This need for additional exploration again arises because of the dependence of welfare on the integral of demand over counterfactual policy choices. For our modified Exp3 algorithm, we prove an adversarial (and thus also stochastic) upper bound on regret. We show that, for an appropriate choice of tuning parameters, worst-case cumulative regret over all possible sequences of preference parameters grows at a rate of , up to a logarithmic term. The algorithm thus achieves the best possible rate. Since the rates for our stochastic lower and adversarial upper bound coincide, up to a logarithmic term, we have a complete characterization of learning rates for the welfare maximization problem.

Improved Stochastic Bounds for Concave Social Welfare

The proof of our lower bound on regret is based on the construction of a distribution of preferences which delivers a nonconcave social welfare function. If we restrict attention to the stochastic setting, where preferences are i.i.d. over time, and if we assume that social welfare is concave, then we can improve upon this bound on regret. We prove a lower bound on stochastic regret, under the assumption of concavity, which grows at the rate of . We then propose a dyadic search algorithm, which achieves this rate, up to logarithmic terms. This dyadic search algorithm maintains an “active interval,” containing the optimal policy with high probability, which is narrowed down over time. Only policies within the active interval are sampled.

Extensions to Nonlinear Income Taxation and to Commodity Taxation

Our discussion up to this point focuses on a minimal, stylized case of an optimal tax problem, where individual actions are binary, and the policy imposes a tax on this binary action. Our arguments generalize, however, to more complicated and practically relevant settings. This includes optimal nonlinear income taxation, as in Mirrlees (1971) and Saez (2001), and commodity taxation for a bundle of goods, as in Ramsey (1927). For nonlinear income taxation, different tax rates apply at different income levels, and welfare weights depend on individual earnings capacity. In Section 5, we discuss an extension of our tempered Exp3 algorithm to nonlinear income taxation, and characterize its regret. For commodity taxation, different tax rates apply to different goods, and consumption decisions are continuous vectors. In Section 6, we sketch an extension of our algorithm to commodity taxation, but leave its characterization for future research.

Roadmap

The rest of this paper proceeds as follows. We conclude this introduction with a discussion of some related work and relevant references. Section 2 introduces our setup, formally defines the adversarial and stochastic settings, and compares our setup to related learning problems. Section 3 provides lower and upper bounds on regret in the adversarial and stochastic settings. Section 4 restricts attention to the stochastic setting with concave social welfare, and provides improved regret bounds for this setting. Section 5 discusses an extension of our baseline model to nonlinear income taxation. Section 6 sketches another extension of our baseline model to commodity taxation. Section 7 concludes, and discusses some possible applications of our algorithm, as well as an alternative Bayesian approach to adaptive welfare maximization. The proofs of Theorem 1 and Theorem 2 can be found in Appendix A. The proofs of our remaining theorems and proofs of technical lemmas are discussed in the Online Supplement (Cesa-Bianchi, Colomboni, and Kasy (2025)).

1.1 Background and Literature

To put our work in context, it is useful to contrast our framework with the standard approach in public finance and optimal tax theory, and with the frameworks considered in machine learning and the multiarmed bandit literature.

Optimal Taxation

Optimal tax theory, and optimal policy theory more generally, is concerned with the maximization of social welfare, where social welfare is understood as a (weighted) sum of subjective utility across individuals (Ramsey (1927), Mirrlees (1971), Baily (1978), Saez (2001), Chetty (2009)). A key tradeoff in such models is between, first, redistribution to those with higher welfare weights, and second, the efficiency cost of behavioral responses to tax increases. Such behavioral responses might reduce the tax base.

Optimal tax problems are defined by normative parameters (such as welfare weights for different individuals), as well as empirical parameters (such as the elasticity of the tax base with respect to tax rates). The typical approach in public finance uses historical or experimental variation to estimate the relevant empirical parameters (causal effects, elasticities). These estimated parameters are then plugged into formulas for optimal policy choice, which are derived from theoretical models. The implied optimal policies are finally implemented, without further experimental variation.

Multiarmed Bandits

The standard approach of public finance, which separates elasticity estimation from policy choice, contrasts with the adaptive approach that characterizes decision-making in many branches of AI, including online learning, multi-armed bandits, and reinforcement learning. Multiarmed bandit algorithms, in particular, trade off exploration and exploitation over time (Slivkins (2019), Lattimore and Szepesvári (2020)). Exploration here refers to the acquisition of information for better future policy decisions, while exploitation refers to the use of currently available information for optimal policy decisions at the present moment. The goal of bandit algorithms is to maximize a stream of rewards, which requires an optimal balance between exploration and exploitation. Bandit algorithms for the stochastic setting are characterized by optimism in the face of uncertainty: Policies with uncertain payoff should be tried until their expected payoff is clearly suboptimal.

Bandit algorithms (and similarly, adaptive experimental designs for informing policy choice, as in Russo (2020), Kasy and Sautmann (2021)) are not directly applicable to social welfare maximization problems, such as those of optimal tax theory. The reason is that bandit algorithms maximize a stream of observed rewards. By contrast, social welfare as conceived in welfare economics is based on unobserved subjective utility.

Adversarial Decision-Making

Adversarial models for sequential decision-making find their roots in repeated game theory (Hannan (1957)), while related settings were independently studied in information theory (Cover (1965)) and computer science (Vovk (1990), Littlestone and Warmuth (1994)). Regret minimization, also in a bandit setting, was investigated as a tool to prove convergence of uncoupled dynamics to equilibria in N-person games (Hart and Mas-Colell (2000, 2001))—the exponential weighting scheme used by Exp3 is also known as smooth fictitious play in the game-theoretic literature (Fudenberg and Levine (1995)). Recent works (Seldin and Slivkins (2014), Zimmert and Seldin (2021)) show that simple variants of Exp3 simultaneously achieve essentially optimal regret bounds in adversarial, stochastic, and contaminated settings, without prior knowledge of the actual regime. This suggests that algorithms designed for adversarial environments can behave well in more benign settings, whereas the opposite is provably not true.

Bandit Approaches for Economic Problems

Bandit-type approaches have been applied to a number of other economic and financial scenarios in the literature where rewards are observable. These include monopoly pricing (Kleinberg and Leighton (2003)) (see also the survey by den Boer (2015)), second-price auctions (Weed, Perchet, and Rigollet (2016)), first-price auctions (Han, Zhou, and Weissman (2020), Han, Zhou, Flores, Ordentlich, and Weissman (2020), Achddou, Cappé, and Garivier (2021)); see also Kolumbus and Nisan (2022), Feng, Podimata, and Syrgkanis (2018), Feng, Guruganesh, Liaw, Mehta, and Sethi (2021), and combinatorial auctions (Daskalakis and Syrgkanis (2022)). Bandit-type approaches have also been applied to some settings where rewards are not directly observable, including bilateral trading (Cesa-Bianchi, Cesari, Colomboni, Fusco, and Leonardi (2024a, 2024b), and the newsvendor problem (Lugosi, Markakis, and Neu (2023)).

Bandit algorithms are widely used in online advertising and recommendation. Online learning methods are successfully used for tuning the bids made by autobidders (a service provided by advertising platforms) (Lucier, Pattathil, Slivkins, and Zhang (2024)). While these algorithms are analyzed in adversarial environments, the extent to which they are deployed in commercial products remains unclear.

2 Setup

At each time

, one individual arrives who is characterized by an unknown willingness to pay

. This individual is exposed to a tax rate

, and makes a binary decision

. The implied public revenue is

. The implied private welfare is

. We define social welfare as a weighted sum of public revenue and private welfare, with a weight

for the latter. Social welfare for time period i is therefore given by

(1)

After period i, we observe

and the tax rate

, but nothing else. In particular, we do not observe welfare

We can rewrite social welfare

as follows. Denote

, so that

. This is the individual demand function. Then private welfare can be written as

. That is, private welfare is given by integrated demand.¹ This representation of private welfare implies

(2)

We consider algorithms for the choice of

, which might depend on the observable history

, as well as possibly a randomization device.

Notation

For the adversarial setting, we will consider cumulative demand and welfare, denoted by blackboard bold letters, summing across

. In particular,

and

are cumulative demand and welfare for a counterfactual, fixed policy x.

, without an argument, is the cumulative welfare for the policies

actually chosen.

For the stochastic setting, we will analogously consider expected demand and expected welfare, denoted by boldface letters. The expectation is taken across some stationary distribution μ of

, where

is statistically independent of

, and of

for

. In particular,

2.1 Regret

The Adversarial Case

Following the literature, we consider regret for both the adversarial and the stochastic setting. In the adversarial setting, we allow for arbitrary sequences of willingness to pay,

. We compare the expected performance of any given algorithm for choosing

to the performance of the best possible constant policy x. This comparison yields cumulative expected regret, which is given by

(3)

The expectation in this expression is taken over any possible randomness in the tax rates

chosen by the algorithm; there is no other source of randomness.

The Stochastic Case

We also consider the stochastic setting. In this setting, we add structure by assuming that the

are i.i.d. draws from some distribution μ on

, with implied demand function

. This demand function is identified by the regression

The expectation in this expression is taken over the distribution of

, which is presumed to be statistically independent of the tax rate

. Expected welfare for this distribution of

is given by

Cumulative expected regret in the stochastic case equals

(4)

The expectation in this expression is taken over both any possible randomness in the tax rates

, and the i.i.d. draws

2.2 Comparison to Related Learning Problems

Before proceeding with our analysis of regret, we take a step back, and compare our learning problem to two related problems that have received some attention in the literature. The first of these is the adaptive monopoly pricing problem; see, for instance, Kleinberg and Leighton (2003). This problem is equivalent to our setting when we set

, interpret x as a price, and

as monopolist profits (neglecting production costs):

(5)

As in our adaptive taxation setting, the feedback received at the end of period i is

Another related problem is price setting for bilateral trade; see, for instance, Cesa-Bianchi et al. (2024a). In this problem, welfare

is given by the sum of seller and buyer welfare. Trade happens if and only if both sides agree to transact at the proposed price. Buyer willingness to pay is given by

, while the seller is willing to trade at prices above

(6)

Feedback in this case is a little richer: We observe both whether the buyer b would have accepted the posted price, and whether the seller would have accepted this price,

Lipschitzness and Information Requirements

The difficulty of the learning problem in each of these models critically depends on (i) the Lipschitz properties of the welfare function, and (ii) the information required to evaluate welfare at a point.

We say that a generic welfare function is one-sided Lipschitz if for all and all . One-sided Lipschitzness allows us to bound the approximation error of a learning algorithm operating on a finite subset of the set of policies. One-sided Lipschitzness is an intrinsic property of both the monopoly pricing and the optimal taxation problem; it is not an assumption that is additionally imposed. To see this for monopoly pricing, note that, for , . For social welfare, .

We say that learning requires only pointwise information if is a function of , and does not depend on otherwise. Pointwise information allows us to avoid exploring policies that are clearly suboptimal, when we aim to learn the optimal policy.

Table I summarizes the Lipschitz properties and information requirements in each of the three models; the following justifies the claims made in Table I:

1. For monopoly pricing, welfare is one-sided Lipschitz and only depends on pointwise.
2. For optimal taxation, welfare is one-sided Lipschitz and depends on both at the given x (pointwise), and on an integral of for a range of values of (nonpointwise).
3. For bilateral trade, welfare is not one-sided Lipschitz and depends on both and (pointwise), as well as the integrals of and (nonpointwise).

These properties suggest a ranking in terms of the difficulty of the corresponding learning problems, and in particular in terms of the rates of divergence of cumulative regret: The information requirements of optimal taxation are stronger than those of monopoly pricing, but its continuity properties are more favorable than those of bilateral trade. This intuition is correct, as shown by Table I. The rates for monopoly pricing and for bilateral trade are known (or can be easily adapted) from the literature. In this paper, we prove corresponding rates for optimal taxation.

TABLE I. Regret rates for different learning problems.

Model	Discrete	Continuous	Pointwise	One-Sided Lipschitz
	Policy Space		Objective Function
Monopoly price setting	T^1/2	T^2/3	Yes	Yes
Optimal taxation	T^2/3	T^2/3	No	Yes
Bilateral trade	T^2/3	T	No	No

Note: This table shows the efficient rates of regret for different learning problems. Rates are up to logarithmic terms, and apply to both the stochastic and the adversarial setting. Regret rates are shown for the discrete case, where the space of policies x is restricted to a finite set, and the continuous case, where x can take any value in . The columns on the right describe the properties of the objective function in each problem, which drive the differences in regret rates.Rates for the optimal taxation case are proven in this paper. Rates for the continuous monopoly price setting case are from Kleinberg and Leighton (2003); the discrete case reduces to a standard bandit problem. Rates for the continuous bilateral trade case are from Cesa-Bianchi et al. (2024a); the discrete case can be deduced by adapting the arguments in the same paper (for the stochastic i.i.d. case with independent sellers' and buyers' valuations), or by adapting the techniques in Cesa-Bianchi et al. (2024b) (for the adversarial case, allowing the learner to use weakly budget balanced mechanisms).

In comparing optimal taxation and monopoly pricing to conventional multiarmed bandits, it is worth emphasizing that there are two distinct reasons for the slower rate of convergence. First, the continuous support of x, as opposed to a finite number of arms, which is shared by optimal taxation and monopoly pricing. Second, the requirement of additional exploration of suboptimal policies for the optimal tax problem. As shown in Table I, the continuous support alone is enough to slow down convergence, with no extra penalty for the additional exploration requirement, in terms of rates. If, however, we restrict our attention to a discrete set of feasible policies x, then monopoly pricing reduces to a multiarmed bandit problem, with a minimax regret rate of . The optimal tax problem, by contrast, still has a rate of , even if we restrict our attention to the case of finite known support for v and x, as shown by the proof of Theorem 1 below.

Hannan Consistency

The cumulative regret of any nonadaptive algorithm necessarily grows at a rate of T. This includes, in particular, randomized experiments where the policy is chosen uniformly at random, from a fixed policy set, in every period. Algorithms for which adversarial regret (and thus also stochastic regret) grows at a rate less than T, so that per period regret goes to 0 as T increases, are known as Hannan consistent. Nonadaptive algorithms are not Hannan consistent. Table I implies that Hannan consistent algorithms exist in all settings considered, with the exception of Bilateral trade and continuous policy spaces.

3 Stochastic and Adversarial Regret Bounds

We now turn to our main theoretical results, lower and upper bounds on stochastic and adversarial regret for the problem of social welfare maximization. We first prove a lower bound on stochastic regret, which applies to any algorithm, and which immediately implies a lower bound on adversarial regret. We then introduce the algorithm Tempered Exp3 for Social Welfare. We show that, for an appropriate choice of tuning parameters, this algorithm achieves the rates of the lower bound on regret, up to a logarithmic term. Formal proofs of these bounds can be found in Appendix A.

3.1 Lower Bound

Theorem 1. (Lower Bound on Regret)Consider the setup of Section 2. There exists a constant such that, for any randomized algorithm for the choice of and any time horizon , the following holds:

1. There exists a distribution μ on with associated demand function G for which the stochastic cumulative expected regret is at least .
2. There exists a sequence for which the adversarial cumulative expected regret is at least .

The proof of Theorem 1 can be found in Appendix A. The adversarial lower bound follows immediately from the stochastic lower bound, since worst-case regret (over possible sequences of ) is bounded below by average regret (over i.i.d. draws of ), for any distribution of .

Sketch of Proof

To prove the stochastic lower bound, we construct a family of distributions

for

, indexed by a parameter

. The distributions in this family have four points of support,

. The probability of these points is given by

The values of a and b are chosen such that (i) the two middle points

are far from optimal, for any value of ϵ, and (ii) learning which of the two end points

is optimal requires sampling from the middle.² For each

, denote the demand function associated to

, and the expected social welfare associated to

. Property (ii) holds because of the integral term

, which shows up in

. This construction is illustrated in Figure 1. This figure shows plots of

and of

for

and

Details are in the caption following the image — **Figure 1**
Open in figure viewer PowerPoint

Construction for proving the lower bound on regret. *Notes*: This figure illustrates our construction for proving the lower bound on regret. The relative social welfare of policies 1 and .25 depends on the sign of ϵ. The solid line corresponds to ϵ = −1, the dashed line to ϵ = 1. In order to distinguish between these two, we must learn demand in the intermediate interval [0.5,0.75].

The difference in welfare of the two candidate optimal policies and 1 depends on the sign of ϵ. In order not to suffer expected regret that grows as , any learning algorithm needs to sample policies from points that are informative about the sign of ϵ. The only points that are informative are those in the region , where welfare is bounded away from optimal welfare.

More specifically, the learning algorithm has to sample on the order of times from the region , to be able to detect the sign of ϵ, incurring regret on the order of in the process. Any learning algorithm therefore incurs regret on the order of , which for , leads to the conclusion.

3.2 An Algorithm That Achieves the Lower Bound

We next introduce Algorithm 1, which allows us to essentially achieve the lower bound on regret, in terms of rates.

Conventional Exp3

Algorithm 1 is a modification of the Exp3 algorithm. Conventional Exp3 (Auer et al. (2002)) is designed to maximize the standard bandit objective, . Exp3 maintains an unbiased running estimate of the cumulative payoff of each arm k, calculated using inverse probability weighting, . In period i, arm k is chosen with probability , where η and γ are tuning parameters. is thus increasing in the estimated average performance of arm k in prior periods. Because is not normalized by the number of time periods i, more weight is given to the best performing arms over time, as estimation uncertainty for average performance decreases. In both these aspects, Exp3 is similar to the popular Upper Confidence Bound algorithm (UCB) for stochastic bandit problems (Lai (1987), Agrawal (1995), Auer, Cesa-Bianchi, and Fischer (2002)). In contrast to UCB, Exp3 is a randomized algorithm. Randomization is required for adversarial performance guarantees. This is analogous to the necessity of mixed strategies for zero-sum games.

Modifications Relative to Conventional Exp3

Relative to this algorithm, we require three modifications. First, we discretize the continuous support of x, restricting attention to the grid of policy values . Second, since welfare is not directly observed for the chosen policy x, we need to estimate it indirectly. In particular, we first form an estimate of cumulative demand for each of the policy values , using inverse probability weighting. We then use this estimated demand, interpolated using a step-function, to form estimates of cumulative social welfare, . Third, we require additional exploration, relative to Exp3. Since social welfare depends on demand for counterfactual policy choices, we need to explore policies that are away from the optimum, in order to learn the relative welfare of approximately optimal policy choices. The mixing weight γ, which determines the share of policies sampled from the uniform distribution, needs to be larger relative to conventional Exp3, to ensure sufficient exploration away from the optimum.

Theorem 2. (Adversarial Upper Bound on Regret of Tempered Exp3 for Social Welfare)Consider the setup of Section 2, and Algorithm 1. Assume that .

Then for any sequence expected regret is bounded above by

(7)

Suppose additionally that

are constants. Then there exists a constant

such that, if we set

, and

, the expected regret

is bounded above by

(8)

Corollary 1. (Stochastic Upper Bound on Regret of Tempered Exp3 for Social Welfare)Under the assumptions of Theorem 2, suppose additionally that is drawn i.i.d. from some distribution with associated demand function G. Then expected regret is bounded above by the same expressions as in Theorem 2.

The proof of Theorem 2 can again be found in Appendix A.

Tuning

The statement of the theorem leaves the constants

in the definition of the tuning parameters unspecified. Suppose we wish to choose the tuning parameters so as to optimize the upper bound obtained in Theorem 2. Ignoring the rounding of K, an approximate solution to this problem is given by

where

This solution is obtained by taking the upper bound in Equation (10), approximating

and

, and solving the first-order conditions with respect to the three tuning parameters. This approximation, and the tuning parameters specified above, yield an approximate upper bound on regret of

Unknown Time Horizon

Note that the proposed tuning depends crucially on knowledge of the time horizon T at which regret is to be evaluated. In order to extend our rate results to the case of unknown time horizons, we can use the so-called doubling trick; cf. Section 2.3 of Cesa-Bianchi and Lugosi (2006): Consider a sequence of epochs (intervals of time periods) of exponentially increasing length, and rerun Algorithm 1 for each time period separately, tuning the parameters over the current epoch length. This construction converts Algorithm 1 into an “anytime algorithm,” which enjoys the same regret guarantees of Theorem 2, up to a multiplicative constant factor. Another more efficient strategy to achieve the same goal is to modify Algorithm 1, allowing the parameters η and γ to change at each iteration, and splitting each bin associated with the discretization parameter K whenever more precision is required.

The Extra Term

There is a rate discrepancy between our our upper and lower bounds on regret, corresponding to the term in our upper bound. We conjecture the existence of an alternative algorithm that can eliminate this extra logarithmic term, albeit at the cost of reduced computational efficiency and a less transparent theoretical analysis. Our conjecture is based on known results for the standard multi-armed bandit problem with K arms. The Exp3 algorithm achieves an upper bound of order for this problem, which includes an extra logarithmic factor compared to the known lower bound of order . Exp3 is an instance of the Follow-The-Regularized-Leader (FTRL) algorithm with importance weighting and the negative entropy as the regularizer. It is known that using the -Tsallis entropy as the regularizer in the FTRL algorithm with importance weighting results in regret guarantees of order for the bandit problem (Lattimore and Szepesvári (2020)). However, unlike Exp3, FTRL with Tsallis entropy involves a more complex proof. Analogous statements might be true for our setting.

Numerical Example

For illustration, Figure 2 plots the cumulative average regret of Tempered Exp3 for Social Welfare for the case where is sampled uniformly at random from each time period. Initially, the performance of our algorithm is, by construction, equal to the performance of choosing a policy uniformly at random. Over time, however, the average regret of our algorithm drops by more than half, in this numerical example. Note that the rate at which cumulative regret declines in Figure 2 (for i.i.d. sampling from a fixed distribution) is unrelated to the regret rate of Theorem 2 (for the worst-case sequence of , for each time horizon T).

Alternative Algorithms

Theorem 2 shows that Tempered Exp3 for Social Welfare achieves the lower bound for adversarial regret. The same might be true for other algorithms. Any alternative algorithm that shares this property needs to be randomized. The need for randomization parallels the need for mixed strategies in both static and dynamic zero-sum games; it excludes deterministic algorithms such as UCB. For the bandit setting, the Tsallis-INF algorithm (Zimmert and Seldin (2021)), of which Exp3 is a special case, is furthermore the only algorithm known to be rate optimal in both stochastic and adversarial regimes.

For our adaptive welfare problem, any algorithm that achieves the optimal rate is not only required to randomize; any such algorithm also needs to sample suboptimal policies at a sufficient rate; cf. the proof of Theorem 1. Tempered Exp3 for Social Welfare does so by sampling policies uniformly at random, with probability γ. In the conclusion, we propose a similar modification for Thompson sampling.

A possible improvement to uniform sampling across all policies, as in Tempered Exp3 for Social Welfare, could be to only sample policies uniformly at random from the range of potentially optimal policies: Demand outside this range is irrelevant for welfare comparisons within this range. This idea is implemented in the algorithm that we introduce in Section 4 for the stochastic concave setting.

4 Stochastic Regret Bounds for Concave Social Welfare

Theorem 1 in Section 3 provides a lower bound proportional to for adversarial and stochastic regret in social welfare maximization. The proof of this lower bound constructs a distribution for the . This distribution is such that expected social welfare is nonconcave, as a function of x; two global optima are separated by a region of lower welfare. In order to learn which of two candidates for the globally optimal policy is actually optimal, it is necessary to sample policies in between. These intermediate policies yield lower welfare, and sampling them contributes to cumulative regret. This construction is illustrated in Figure 1.

Given that the construction relies on nonconcavity of expected social welfare, could we achieve lower regret if we knew that social welfare is actually concave? The answer turns out to be yes, for the stochastic setting (in the adversarial setting, cumulative welfare is necessarily nonconcave). One reason is that concavity ensures that the function is unimodal. To estimate the difference in social welfare between two policies, it therefore suffices to sample policies that lie in the interval between them. These in-between policies yield social welfare exceeding the minimum of the two boundary policies. A second reason is that concavity prevents unexpected spikes in social welfare. This property allows us to test carefully chosen triples of points for extended periods, to ensure that one of them is suboptimal, without incurring significant regret.

For the stochastic setting with concave social welfare, we present an algorithm that achieves a bound on regret of order , up to logarithmic terms. Before describing our proposed algorithm, Dyadic Search for Social Welfare, let us formally state the improved regret bounds. The proofs of these lower and upper bounds can be found in the Online Supplement.

Theorem 3. (Lower bound on regret for the concave case)Consider the setup of Section 2. There exists a constant such that, for any randomized algorithm for the choice of and any time horizon , the following holds.

There exists a distribution μ on with associated demand function G and concave social welfare function U, for which the stochastic cumulative expected regret is at least .

Theorem 4. (Stochastic Upper Bound on Regret of Dyadic Search for Social Welfare)Consider the stochastic setup of Section 2 and time horizon . If Algorithm 2 is run with confidence parameter , and if the social welfare function U is concave, then the expected regret is of order at most , up to logarithmic terms.

Dyadic Search

Our algorithm is based on a modification of dyadic search, as discussed in Bachoc, Cesari, Colomboni, and Paudice (2022a, 2022b). At any point in time, this algorithm maintains an active interval , which contains the optimal policy with high probability. Only policies within this interval are sampled going forward. As evidence accumulates, this interval is trimmed down, by excluding policies that are suboptimal with high probability.

The algorithm proceeds in epochs τ. At the start of each epoch, a subinterval is formed, with mid-point . The points are in a dyadic grid, that is, they are of the form . After sampling from , we calculate confidence intervals , , and for the welfare differences , , and , where .

If the confidence interval or lies above 0, concavity implies that the optimal policy cannot lie to the left of l; we can thus trim the active interval by dropping all points to the left of l. Symmetrically, if the confidence interval or lies below 0, we can trim by dropping all points to the right of r.

Confidence Intervals for Welfare Differences

This procedure requires the construction of confidence intervals for welfare differences of the form

(9)

At time t, we estimate demand

, for policies x chosen in previous periods, as³

We similarly estimate integrated demand

times the average of realized demand

for observations

in the open interval

. We have to be careful, however, to use a sample of

that is (approximately) uniformly distributed over this interval. This can be achieved for our dyadic search procedure, as specified in Algorithm 2, by truncating the time index used to estimate this average.⁴ Let

We define

At each round, Algorithm 2 maintains estimates for welfare differences among three points

(for left, center and right, respectively). The estimate of the welfare difference between

and

(or between

and

) is given by

(10)

while the estimate of the welfare difference between r and l is given by

(11)

To construct confidence intervals for

, we also need to quantify the uncertainty of our demand estimates. We use the following interval half-lengths for confidence intervals for tax revenue at x, and for the private welfare difference between

and x:

Using the shorthand

, our confidence interval for

, where

and

(or

and

) is given by

(12)

while our confidence interval for

is given by

(13)

With these preliminaries, we are now ready to state our algorithm, Dyadic Search for Social Welfare, in Algorithm 2.

Before concluding this section, we highlight two features of Algorithm 2. First, two of the three points , and the corresponding estimates of demand, are kept from each epoch to the next. Second, estimation of the integral term is performed by querying points following a fixed and balanced design on the dyadic grid—instead of, for example, using a randomized Monte Carlo procedure, which may lead to unbalanced exploration. This implies that the points queried to estimate the integral terms can be easily reused to obtain other integral estimates from each epoch to the next. These two features combined ensure that Algorithm 2 recycles information very efficiently to prune the active interval as quickly as possible, which leads to better regret.

5 Income Taxation

We discuss two extensions of the baseline model of optimal taxation that we introduced in Section 2. These extensions incorporate features that are important in more realistic models of optimal taxation. For both of these extensions, we propose a properly modified version of Algorithm 1. The first extension, discussed in this section, is a variant of the Mirrlees model of optimal income taxation (Mirrlees (1971), Saez (2001, 2002)). The second extension, discussed in Section 6 is a variant of the Ramsey model of commodity taxation (Ramsey (1927)).

Our model of income taxation generalizes our baseline model by allowing for heterogeneous wages , welfare weights , extensive-margin labor supply responses determined by the cost of participation , and nonlinear income taxes . Two simplifications are maintained in this model, relative to a more general model of income taxation. First, only extensive margin responses (participation decisions) by individuals are allowed; there are no intensive margin responses (hours adjustments). Second, as in the baseline model of Section 2, there are no income effects. In imposing these assumptions, our model mirrors the model of optimal income taxation discussed in Section II.2 of Saez (2002).

Setup

At each time , one individual arrives who is characterized by (i) a potential wage , and (ii) an unknown cost of participation . This individual makes a binary labor supply decision . If they participate in the labor market (), they earn , but pay a tax according to the tax rate on their earnings . They furthermore incur a nonmonetary cost of participation .

Their optimal labor supply decision is therefore given by , and private welfare equals . The implied public revenue is equal to the tax on earnings if , and 0 otherwise.

We define social welfare as a weighted sum of public revenue and private welfare, with a weight

for the latter. Typically, ω is a decreasing function of w, reflecting a preference for redistribution toward those with lower earnings potential; cf. Saez and Stantcheva (2016). Social welfare for time period i, as a function of the tax schedule

, is therefore given by

(14)

After period i, we observe

and the tax schedule

. If

, we also observe

. Nothing else is observed.⁵

Piecewise Constant Tax Schedules

We next construct a generalization of Algorithm 1 based on piecewise constant tax schedules, with tax rates changing at the grid-points

, where

. Formally, define

, rounding the wage w down to the nearest grid point in

,⁶ Denote

, and let

For

and any

, denote

so that

is the individual labor supply function, in monetary units, interacted with an indicator for whether the wage

falls into the tax bracket starting at w. With this notation, we can rewrite

For piecewise constant tax rates

, we get

(15)

Cumulative social welfare is given by

, and we correspondingly define cumulative expected regret, in the adversarial setting, as

The supremum here is taken over all tax schedules

that are piecewise constant between the grid points

Algorithm

Algorithm 3 generalizes Algorithm 1 to this setting. As before, we form an unbiased estimate of using inverse probability weighting, map this estimate into a corresponding estimate of , based on Equation (18), and cumulate across time periods to obtain . Note that is observed whenever . This implies that the estimate is in fact a function of observables, and the same holds for .

Algorithm 3 keeps track of estimated demand and social welfare for each bin (“tax bracket”), as defined by the grid points . The algorithm then constructs a distribution over tax rates given w, using the tempered Exp3 distribution. The tax schedule is sampled according to these (marginal) distributions of tax rates for each bracket. Though immaterial for the following theorem, we choose the perfectly correlated coupling, across brackets, of these marginal distributions, which is implemented using the random variable in Algorithm 3.

Theorem 5. (Adversarial Upper Bound on Regret of Tempered Exp3 for Optimal Income Taxation)Consider the setup of Section 5, and Algorithm 3. Assume that , and that for all w.

Then for any sequence expected regret is bounded above by

(16)

Suppose additionally⁷ that

, and

, for some constants

. Then expected regret

is bounded above by

(17)

for some constant

6 Commodity Taxation

In this section, we generalize our baseline model of optimal taxation to a model of commodity taxation with multiple goods and continuous demand functions , where is a vector of tax rates. We again assume that there are no income effects. Our setup is a version of the classic Ramsey model (Ramsey (1927)). We propose a generalization of Tempered Exp3 for Social Welfare to this setting. In the following, we use to denote the Euclidean inner product between x and y.

Setup

At each time

, one individual arrives who is characterized by a utility function

. This individual is exposed to a tax vector

, and makes a continuous consumption decision

. Public revenue is given by

. Private utility is given by

plus their consumption of a numeraire good, which has price normalized to 1 and enters utility additively. The individual consumption choice

costs

, where p is the (exogenously given) vector of pre-tax prices. This cost of purchasing

reduces the consumption of the numeraire good. The optimal individual decision is therefore given by

(18)

The implied private welfare is

where we have added a constant

, chosen such that

; this is just a normalization to simplify notation below.

We define social welfare as a weighted sum of public revenue and private welfare, with a weight λ for the latter. Social welfare for time period i, as a function of the tax vector x, is therefore given by

(19)

After period i, we observe

and the tax vector

. Nothing else is observed. Algorithm 4 adapts our approach to this setting. This algorithm requires a mapping Π from (estimated) demand to welfare.

Mapping Demand to Welfare

By the envelope theorem (Milgrom and Segal (2002)),

Let

be the set of differentiable functions v on

such that

, and such that

. Consider the following operator, mapping the demand function G into the corresponding indirect utility function v:

(20)

We can think of the operator Π as combining two operators. First, the function G is projected on the subspace of functions on

, which can be written as the gradient of some function v. Second, the projected G is then integrated to get

for any x. Integration here is along some curve in

from 0 to x. Given the first projection, the choice of curve does not matter for the resulting function v. A formal analysis of Tempered Exp3 for Commodity Taxation would need to prove existence of the projection. We leave such a formal analysis, including lower and upper regret bounds, for future research.

7 Conclusion

Possible Applications

The setup introduced in Section 2 was deliberately stylized, to allow for a clear exposition of the conceptual issues that arise when adaptively maximizing social welfare. The algorithm that we proposed for this setup, and the generalizations discussed later in the paper, are nonetheless directly practically relevant. They remain appropriate in economic settings that are considerably more general than the setting described by our model.

The reasons for this generality have been elucidated by the public finance literature, cf. Chetty (2009), building on the generality of the envelope theorem; cf. Milgrom and Segal (2002), Sinander (2022). By the envelope theorem, the welfare impact of a marginal tax change on private welfare can be calculated ignoring any behavioral responses to the tax change. This holds in generalizations of our setup that allow for almost arbitrary action spaces (including discrete and continuous, multidimensional, and dynamic actions), and for arbitrary preference heterogeneity. The expressions for social welfare that justify our algorithms remain unchanged under such generalizations. That said, the validity of these expressions does require the absence of income effects and of externalities. If there are income effects or externalities, the algorithms need to be modified.

Our approach is motivated by applications of algorithmic decision-making for public policy, where a policymaker cares about welfare, but also faces a government budget constraint. Possible application domains of our algorithm include the following. In public health and development economics, field experiments such as Cohen and Dupas (2010) vary the level of a subsidy for goods such as insecticide-treated bed nets (ITNs), estimating the impact on demand. Our algorithm could be used to find the optimal subsidy level quickly and apply it to experimental participants. A term capturing positive externalities of the use of ITNs could be added to social welfare, leaving the algorithm otherwise unchanged. In educational economics, many studies evaluate the impact of financial aid on college enrollment (Dynarski, Page, and Scott-Clayton (2023)). An adaptive experiment might vary the level of aid provided, where aid is conditional on college attendance and conditional on pre-determined criteria of need or merit. In such an experiment, a variant of our algorithm for optimal income taxation might be used, where the welfare weights ω are a function of need or merit, and the outcome y is college attendance. In environmental economics, many experiments (e.g., Lee, Miguel, and Wolfram (2020)) study the impact of electricity pricing on household electricity consumption. Once again, our baseline algorithm (for binary household decisions about connecting to the grid) or our algorithm for commodity taxation (for continuous household decisions about consumption levels) could be applied, to learn optimal prices, taking into account both distributional considerations and externalities.

These examples are all drawn from public policy, where there is an intrinsic concern for social welfare. This contrasts with commercial applications, where the goal is typically to maximize (directly observable) profits by monopolist pricing (den Boer (2015)), or more generally by reserve price setting in auctions (Nedelec, Calauzènes, El Karoui, and Perchet (2022). Adaptive pricing algorithms are used in applications such as online ad auctions. A concern for welfare might enter in such commercial settings if there is a participation constraint that needs to be satisfied for consumers. Suppose, for example, that consumers or service providers need to first sign up for a platform, say for e-commerce or for gig work, and then repeatedly engage in transactions on this platform. To sign up in the first place, their expected welfare needs to exceed their outside option. This constraint might then enter the platform provider's objective, in Lagrangian form, adding a term for welfare, and leading to objectives such as those maximized by our algorithms.

An Alternative Approach: Thompson Sampling

The main algorithm proposed in this paper, Tempered Exp3 for Social Welfare, is designed to perform well in the adversarial setting. In the construction of this algorithm, no probabilistic assumptions were made about the distribution of . In the stochastic framework, a sampling distribution is assumed, for instance, that the be i.i.d. over time. The Bayesian framework completes this by assuming a prior distribution over the parameters, which govern the sampling distribution.

One popular heuristic for adaptive policy choice in the Bayesian framework is Thompson sampling (Thompson (1933), Russo, Van Roy, Kazerouni, Osband, and Wen (2018)), also known as probability matching, which assigns a policy with probability equal to the posterior probability that this policy is optimal. In our setting, Thompson sampling could be implemented as follows. First, form a posterior for the demand function , based on all the data available from previous periods . Sample one draw from this posterior. Map this draw into a draw from the posterior for via . Find the maximizer . This is the policy recommended by Thompson sampling. We conjecture that this algorithm will outperform random assignment, but will underexplore relative to the optimal algorithm. Adding further forced exploration to this algorithm might improve cumulative welfare. A formal analysis of algorithms of this type is left for future research.

A natural class of priors for G are Gaussian process priors (Williams and Rasmussen (2006)). If outcomes y are conditionally normal (rather than binary, as in our baseline model), then the posterior for demand is available in closed form, and the posterior mean is equal to the best linear predictor given past outcomes . Furthermore, since social welfare is a linear transformation of demand, the posterior for U is then also linear and available in closed form. For details, see Kasy (2018).

1 This reflects the absence of income effects in our model, which implies that private utility, consumer surplus, compensating variation, and equivalent variation all coincide.

2 Specifically,

, and

. These two constants are strictly greater than zero, and satisfy

3 We use the convention

and

whenever

. Furthermore, every summation over an empty set of indices is understood to have value 0.

4 The sampling procedure in Algorithm 2 samples sequentially from the dyadic grid in the active interval, refining the grid in subsequent iterations.

provides a truncation of the time index such that one round of such dyadic sampling has been completed.

5 It should be noted that in this model we take the transfer

for individuals without other income as given. The effective tax owed by an employed individual equals

. The “unconditional basic income”

does not affect labor supply, given our assumption that there are no income effects, and it enters social welfare additively. It is therefore without loss of generality to omit

from our model.

6 Here, we use slightly nonstandard notation, where

denotes rounding down to the nearest grid-point, rather than the nearest integer.

7 for simplicity, we assume that in the following tuning K is an integer. If not, round K to the closest integer.

Appendix A: Proofs

A.1 Theorem 1 (Lower Bound on Regret)

Defining a Family of Distributions for v

Recall that, for each

, the probability distribution

is defined as the probability measure supported on

with masses

, where

Furthermore, for each

, recall that

and

are respectively the demand function and the expected social welfare associated to

(see Figure 1 for an illustration). Let

be the sequence of individual valuations. For each

, consider a distribution

such that the individual valuations

form a

-i.i.d. sequence (independent of the randomization used by the algorithm) with common distribution

Explicit Lower Bound on Regret That Will Be Proven

Define

We will prove that, for any randomized algorithm and any time horizon

, there exists

such that

where

(21)

Fix a randomized algorithm to choose the policies , and fix a time horizon .

Number of Mistakes and Lower Bound on Regret

We need to count the random number of times the algorithm has played in the regions

and

up to time T. This can be done relying on the following random variables:

Notice that since the intervals

and

form a partition of

, we have that

(22)

For each

, denote by

the expectation taken with respect to the distribution

. Notice that, for each

, the expected regret when the underlying distribution is

equals

(23)

Algebraic calculations show that, for each

(24)

(25)

Further calculations show also that

(26)

(27)

Equations (34), (35), (36), and (37) imply that

(28)

It follows that, if

(29)

Notice that inequality (39) quantifies how much regret the algorithm is going to suffer in terms of the expected number of times it plays in the wrong regions, when the demand function is

and

In the same way inequality (39) was proven, we can prove that, if

(30)

which again quantifies how much regret the algorithm is going to suffer in terms of the expected number of times it plays in the wrong regions, when the demand function is

and

Intuition for the Remainder of the Proof

At high level, inequalities (39) and (40) tell us that, if

is not negligible, the algorithm has to play a substantially different number of times in the region

, depending on the sign of ϵ, not to suffer significant regret when the demand function is

. The crucial idea is that the only way for the algorithm to present this different behavior is by playing in the only informative region about the sign of ϵ, that is, the region

. However, as shown in (39), selecting policies in this region comes at a cost in terms of regret. To relate quantitatively the number of times the algorithm has to play in this costly region with the difference in the expected number of times the algorithm selects policies in the region

is the last missing ingredient that we can obtain relying on information theoretic techniques: It can be proved (and a formal proof is provided in the Online Supplement, in Section B.1, that, for each

(31)

Now, if the algorithm is going to suffer low regret when

, then by (39) we have an upper bound on the number of times the algorithm plays in the region

and a lower bound on the number of times it plays in the region

, whenever

. In turn, by (41), this gives a lower bound on the number of times the algorithm plays in the suboptimal region

when

. Then, relying on (40), we have an explicit lower bound on how much regret the algorithm is going to suffer when

. We will now carry out this plan—and prove the theorem—as follows.

Low Regret Cannot Be Achieved for Both Positive and Negative ϵ

To get a contradiction, suppose that

(32)

It follows from (39) that, for each

(33)

This implies, relying also on (40) and (41), that for each

we have

(34)

Pick

. First, note that since

we have that

. Substituting this value of ϵ in (44) leads to

(35)

where the second to last inequality follows from

, while the last inequality follows from

. Rearranging inequality (45) leads to the contradiction

Since (42) leads to a contradiction, it follows that there exists

such that

. Given that the time horizon T and the randomized algorithm were arbitrarily fixed, the theorem is proved.

A.2 Theorem 2 (Adversarial Upper Bound on Regret)

The proof of this theorem builds upon the proof of Theorem 6.5 in Cesa-Bianchi and Lugosi (2006). Relative to this theorem, we need to additionally consider the discretization error introduced by Algorithm 1, and explicitly control the variance of estimated welfare.

Recall our notation and for realized cumulative welfare, and for cumulative welfare for the counterfactual, fixed policy x. We further abbreviate . Throughout this proof, the sequence is given and conditioned on in any expectations.

1.
Discretization

Recall that . Let
(this is rounded down to the next grid point ), and denote
as well as . Then it is immediate that ,
and and, therefore,
2.
Unbiasedness

At the end of period i, is an unbiased estimator of for all k. Therefore, for all i and k.
3.
Upper bound on optimal welfare

Define , and .

It is immediate that
Furthermore,
Given our initialization of the algorithm, .
4.
Lower bound on estimated welfare

Denote , where ,

so that , and .

By definition of ,
Since for all k, for all i and k and, therefore, (where the last inequality holds by assumption). Using for any yields
Therefore,
The second inequality follows from .
5.
Connecting the first-order term to welfare

Note that, by definition, . Therefore,
and thus
where we have used the fact that for all k, given our definition of Ũ, and the fact that is distributed according to , by construction.
6.
Bounding the second moment of estimated welfare

It remains to bound the term . As in the preceding item, we have

We can rewrite

Bounding immediately gives
and, therefore,
7.
Collecting inequalities

Combining the preceding items, we get

Multiplying by and dividing by η, adding to both sides and subtracting , bounding , and (from Item 1), yields
(36)
This proves the first claim of the theorem.
8.
Optimizing tuning parameters

Suppose now that we choose the tuning parameters as follows:
Substituting we get
The second claim of the theorem follows.

Supporting Information

References

Achddou, Juliette, Olivier Cappé, and Aurélien Garivier (2021): Fast Rate Learning in Stochastic First Price Bidding. Asian Conference on Machine Learning.
Google Scholar
Agrawal, Rajeev (1995): “Sample Mean Based Index Policies by Regret for the Multi-Armed Bandit Problem,” Advances in applied probability, 27 (4), 1054–1078.
10.2307/1427934
Google Scholar
Auer, Peter, Nicolo Cesa-Bianchi, and Paul Fischer (2002): “Finite-Time Analysis of the Multiarmed Bandit Problem,” Machine learning, 47, 235–256.
10.1023/A:1013689704352
Web of Science® Google Scholar
Auer, Peter, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire (2002): “The Nonstochastic Multiarmed Bandit Problem,” SIAM journal on computing, 32 (1), 48–77.
10.1137/S0097539701398375
Web of Science® Google Scholar
Bachoc, François, Tommaso Cesari, Roberto Colomboni, and Andrea Paudice (2022a): “ A Near-Optimal Algorithm for Univariate Zeroth-Order Budget Convex Optimization.”
Google Scholar
Bachoc, François, Tommaso Cesari, Roberto Colomboni, and Andrea Paudice (2022b): “ Regret Analysis of Dyadic Search,” arXiv preprint arXiv:2209.00885.
Google Scholar
Baily, Martin N. (1978): “Some Aspects of Optimal Unemployment Insurance,” Journal of Public Economics, 10 (3), 379–402.
10.1016/0047-2727(78)90053-1
Web of Science® Google Scholar
Cesa-Bianchi, Nicolo, and Gábor Lugosi (2006): Prediction, Learning, and Games. Cambridge University Press.
10.1017/CBO9780511546921
Google Scholar
Cesa-Bianchi, Nicolò, Tommaso Cesari, Roberto Colomboni, Federico Fusco, and Stefano Leonardi (2024a): “Bilateral Trade: A Regret Minimization Perspective,” Mathematics of Operations Research, 49 (1), 171–203.
10.1287/moor.2023.1351
Web of Science® Google Scholar
Cesa-Bianchi, Nicolò, Tommaso Cesari, Roberto Colomboni, Federico Fusco, and Stefano Leonardi (2024b): “Regret Analysis of Bilateral Trade With a Smoothed Adversary,” Journal of Machine Learning Research, 25 (234), 1–36.
Google Scholar
Cesa-Bianchi, Nicolò, Roberto Colomboni, and Maximilian Kasy (2025): “ Supplement to ‘Adaptive Maximization of Social Welfare’,” Econometrica Supplemental Material, 93, https://doi.org/10.3982/ECTA22351.
10.3982/ECTA22351
Google Scholar
Chetty, Raj (2009): “Sufficient Statistics for Welfare Analysis: A Bridge Between Structural and Reduced-Form Methods,” Annual Review of Economics, 1 (1), 451–488.
10.1146/annurev.economics.050708.142910
Web of Science® Google Scholar
Cohen, Jessica, and Pascaline Dupas (2010): “Free Distribution or Cost-Sharing? Evidence From a Randomized Malaria Prevention Experiment,” The Quarterly Journal of Economics, 125 (1), 1–45.
10.1162/qjec.2010.125.1.1
Google Scholar
Cover, Thomas (1965): “ Behavior of Sequential Predictors of Binary Sequences,” in Proceedings of the 4th Prague Conference on Information Theory, Statistical Decision Functions, Random Processes. Prague: Publishing House of the Czechoslovak Academy of Sciences, 263–272.
Google Scholar
Daskalakis, Constantinos, and Vasilis Syrgkanis (2022): “ Learning in Auctions: Regret Is Hard, Envy Is Easy,” Games and Economic Behavior.
Google Scholar
den Boer, Arnoud V. (2015): “ Dynamic Pricing and Learning: Historical Origins, Current Research, and New Directions,” Surveys in Operations Research and Management Science.
Google Scholar
Dynarski, Susan, Lindsay Page, and Judith Scott-Clayton (2023): “ College Costs, Financial Aid, and Student Decisions,” in Handbook of the Economics of Education, Vol. 7. Elsevier, 227–285.
Google Scholar
Feng, Zhe, Guru Guruganesh, Christopher Liaw, Aranyak Mehta, and Abhishek Sethi (2021): “ Convergence Analysis of No-Regret Bidding Algorithms in Repeated Auctions,” in Proceedings of the AAAI Conference on Artificial Intelligence, 5399–5406.
Google Scholar
Feng, Zhe, Chara Podimata, and Vasilis Syrgkanis (2018): “ Learning to Bid Without Knowing Your Value,” in Proceedings of the 2018 ACM Conference on Economics and Computation, 505–522.
10.1145/3219166.3219208
Google Scholar
Fudenberg, Drew, and David K. Levine (1995): “Consistency and Cautious Fictitious Play,” Journal of Economic Dynamics and Control, 19 (5–7), 1065–1089.
10.1016/0165-1889(94)00819-4
Web of Science® Google Scholar
Han, Yanjun, Zhengyuan Zhou, Aaron Flores, Erik Ordentlich, and Tsachy Weissman (2020): “ Learning to Bid Optimally and Efficiently in Adversarial First-Price Auctions,” arXiv preprint arXiv:2007.04568.
Google Scholar
Han, Yanjun, Zhengyuan Zhou, and Tsachy Weissman (2020): “ Optimal No-Regret Learning in Repeated First-Price Auctions,” arXiv preprint arXiv:2003.09795.
Google Scholar
Hannan, James (1957): “Approximation to Bayes Risk in Repeated Play,” Contributions to the Theory of Games, 3 (2), 97–139.
Google Scholar
Hart, Sergiu, and Andreu Mas-Colell (2000): “A Simple Adaptive Procedure Leading to Correlated Equilibrium,” Econometrica, 68 (5), 1127–1150.
10.1111/1468-0262.00153
Web of Science® Google Scholar
Hart, Sergiu, and Andreu Mas-Colell (2001): “A General Class of Adaptive Strategies,” Journal of Economic Theory, 98 (1), 26–54.
10.1006/jeth.2000.2746
Web of Science® Google Scholar
Kasy, Maximilian (2018): “ Optimal Taxation and Insurance Using Machine Learning—Sufficient Statistics and Beyond,” Journal of Public Economics, 167.
Google Scholar
Kasy, Maximilian, and Anja Sautmann (2021): “Adaptive Treatment Assignment in Experiments for Policy Choice,” Econometrica, 89 (1), 113–132.
10.3982/ECTA17527
Web of Science® Google Scholar
Kleinberg, Robert D., and Frank Thomson Leighton (2003): “ The Value of Knowing a Demand Curve: Bounds on Regret for Online Posted-Price Auctions,” in IEEE Symposium on Foundations of Computer Science, 594–605.
10.1109/SFCS.2003.1238232
Google Scholar
Kolumbus, Yoav, and Noam Nisan (2022): “ Auctions Between Regret-Minimizing Agents,” in Proceedings of the ACM Web Conference 2022, 100–111.
10.1145/3485447.3512055
Google Scholar
Lai, Tze Leung (1987): “ Adaptive Treatment Allocation and the Multi-Armed Bandit Problem,” The Annals of Statistics, 1091–1114.
Google Scholar
Lai, Tze Leung, and Herbert Robbins (1985): “Asymptotically Efficient Adaptive Allocation Rules,” Advances in applied mathematics, 6 (1), 4–22.
10.1016/0196-8858(85)90002-8
Web of Science® Google Scholar
Lattimore, Tor, and Csaba Szepesvári (2020): Bandit Algorithms. Cambridge University Press.
10.1017/9781108571401
Google Scholar
Lee, Kenneth, Edward Miguel, and Catherine Wolfram (2020): “Experimental Evidence on the Economics of Rural Electrification,” Journal of Political Economy, 128 (4), 1523–1565.
10.1086/705417
Web of Science® Google Scholar
Littlestone, Nick, and Manfred Warmuth (1994): “The Weighted Majority Algorithm,” Information and computation, 108 (2), 212–261.
10.1006/inco.1994.1009
Web of Science® Google Scholar
Lucier, Brendan, Sarath Pattathil, Aleksandrs Slivkins, and Mengxiao Zhang (2024): “ Autobidders With Budget and ROI Constraints: Efficiency, Regret, and Pacing Dynamics,” in The Thirty Seventh Annual Conference on Learning Theory, PMLR, 3642–3643.
Google Scholar
Lugosi, Gábor, Mihalis G. Markakis, and Gergely Neu (2023): “ On the Hardness of Learning From Censored and Nonstationary Demand,” INFORMS Journal on Optimization.
Google Scholar
Milgrom, Paul, and Ilya Segal (2002): “Envelope Theorems for Arbitrary Choice Sets,” Econometrica, 70 (2), 583–601.
10.1111/1468-0262.00296
Web of Science® Google Scholar
Mirrlees, James A. (1971): “ An Exploration in the Theory of Optimum Income Taxation,” The Review of Economic Studies, 175–208.
Google Scholar
Nedelec, Thomas, Clément Calauzènes, Noureddine El Karoui, Vianney Perchet et al. (2022): “Learning in Repeated Auctions,” Foundations and Trends® in Machine Learning, 15 (3), 176–334.
10.1561/2200000077
Web of Science® Google Scholar
Ramsey, Frank P. (1927): “A Contribution to the Theory of Taxation,” The economic journal, 37 (145), 47–61.
10.2307/2222721
Web of Science® Google Scholar
Russo, Daniel (2020): “Simple Bayesian Algorithms for Best-Arm Identification,” Operations Research, 68 (6), 1625–1647.
10.1287/opre.2019.1911
Web of Science® Google Scholar
Russo, Daniel J., Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen (2018): “A Tutorial on Thompson Sampling,” Foundations and Trends® in Machine Learning, 11 (1), 1–96.
10.1561/2200000070
Web of Science® Google Scholar
Saez, Emmanuel (2001): “Using Elasticities to Derive Optimal Income Tax Rates,” The Review of Economic Studies, 68 (1), 205–229.
10.1111/1467-937X.00166
Web of Science® Google Scholar
Saez, Emmanuel (2002): “Optimal Income Transfer Programs: Intensive versus Extensive Labor Supply Responses,” The Quarterly Journal of Economics, 117 (3), 1039–1073.
10.1162/003355302760193959
Web of Science® Google Scholar
Saez, Emmanuel, and Stefanie Stantcheva (2016): “Generalized Social Welfare Weights for Optimal Tax Theory,” American Economic Review, 106 (1), 24–45.
10.1257/aer.20141362
Web of Science® Google Scholar
Seldin, Yevgeny, and Aleksandrs Slivkins (2014): “ One Practical Algorithm for Both Stochastic and Adversarial Bandits,” in International Conference on Machine Learning, PMLR, 1287–1295.
Google Scholar
Sinander, Ludvig (2022): “The Converse Envelope Theorem,” Econometrica, 90 (6), 2795–2819.
10.3982/ECTA18119
Web of Science® Google Scholar
Slivkins, Aleksandrs (2019): “ Introduction to Multi-Armed Bandits,” arXiv preprint arXiv:1904.07272.
Google Scholar
Thompson, William R. (1933): “On the Likelihood That One Unknown Probability Exceeds Another in View of the Evidence of Two Samples,” Biometrika, 25 (3/4), 285–294.
10.2307/2332286
Google Scholar
Vovk, Volodimir G. (1990): “ Aggregating Strategies,” in Proceedings of the 3rd AnnualWworkshop on Computational Learning Theory, 371–386.
10.1016/B978-1-55860-146-8.50032-1
Google Scholar
Weed, Jonathan, Vianney Perchet, and Philippe Rigollet (2016): “ Online Learning in Repeated Auctions,” in Conference on Learning Theory, PMLR, 1562–1583.
Google Scholar
Williams, Christopher K. I., and Carl E. Rasmussen (2006): Gaussian Processes for Machine Learning. MIT Press.
Google Scholar
Zimmert, Julian, and Yevgeny Seldin (2021): “Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits,” Journal of Machine Learning Research, 22 (28), 1–49.
Google Scholar

The replication package for this paper is available at https://doi.org/10.5281/zenodo.15042114. The Journal checked the data and codes included in the package for their ability to reproduce the results in the paper and approved online appendices.

Citing Literature

Volume93, Issue3

May 2025

Pages 1073-1104

Adaptive Maximization of Social Welfare

Abstract

1 Introduction

Why Welfare, Why Adversarial Guarantees?

Lower and Upper Bounds on Regret

A Lower Bound on Stochastic Regret

A Matching Upper Bound on Adversarial Regret for Modified Exp3

Improved Stochastic Bounds for Concave Social Welfare

Extensions to Nonlinear Income Taxation and to Commodity Taxation

Roadmap

1.1 Background and Literature

Optimal Taxation

Multiarmed Bandits

Adversarial Decision-Making

Bandit Approaches for Economic Problems

2 Setup

Notation

2.1 Regret

The Adversarial Case

The Stochastic Case

2.2 Comparison to Related Learning Problems

Lipschitzness and Information Requirements

Hannan Consistency

3 Stochastic and Adversarial Regret Bounds

3.1 Lower Bound

Sketch of Proof

3.2 An Algorithm That Achieves the Lower Bound

Conventional Exp3

Modifications Relative to Conventional Exp3

Tuning

Unknown Time Horizon

The Extra Term

Numerical Example

Alternative Algorithms

4 Stochastic Regret Bounds for Concave Social Welfare

Dyadic Search

Confidence Intervals for Welfare Differences

5 Income Taxation

Setup

Piecewise Constant Tax Schedules

Algorithm

6 Commodity Taxation

Setup

Mapping Demand to Welfare

7 Conclusion

Possible Applications

An Alternative Approach: Thompson Sampling

Appendix A: Proofs

A.1 Theorem 1 (Lower Bound on Regret)

Defining a Family of Distributions for v

Explicit Lower Bound on Regret That Will Be Proven

Number of Mistakes and Lower Bound on Regret

Intuition for the Remainder of the Proof

Low Regret Cannot Be Achieved for Both Positive and Negative ϵ

A.2 Theorem 2 (Adversarial Upper Bound on Regret)

Supporting Information

References

Citing Literature

Figures

References

Related

Information