Volume 35, Issue 13 pp. 2117-2132
Research Article
Open Access

Evidence, eminence and extrapolation

Gerald Hlavin

Section for Medical Statistics, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Vienna, Austria

Correspondence to: Gerald Hlavin, Section for Medical Statistics, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Vienna, Austria.

E-mail: [email protected]

Franz Koenig

Section for Medical Statistics, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Vienna, Austria

Christoph Male

Department of Paediatrics, Medical University of Vienna, Vienna, Austria

Martin Posch

Section for Medical Statistics, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Vienna, Austria

Peter Bauer

Section for Medical Statistics, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Vienna, Austria

First published: 11 January 2016

Abstract

A full independent drug development programme to demonstrate efficacy may not be ethical and/or feasible in small populations such as paediatric populations or orphan indications. Different levels of extrapolation from a larger population to smaller target populations are widely used to support decisions in this situation. Regulatory guidance documents mention a weakening of the statistical rigour for trials in the target population as one option for dealing with this problem. To this end, we propose clinical trial designs that make use of prior knowledge on efficacy for inference. We formulate a framework based on prior beliefs in order to investigate when the significance level for the test of the primary endpoint in confirmatory trials can be relaxed (and thus the sample size can be reduced) in the target population while controlling a certain posterior belief in effectiveness after rejection of the null hypothesis in the corresponding confirmatory statistical test. We show that point-priors may be used in the argumentation because, under certain constraints, they have favourable limiting properties among other types of priors. The crucial quantity to be elicited is the prior belief in the possibility of extrapolation from a larger population to the target population. We try to illustrate an existing decision tree for extrapolation to paediatric populations within our framework. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.

1 Introduction

One of the most challenging tasks in medicine is clinical research in children. In this paper, we look at drug development in the paediatric population. For decades, it has been criticized that most medicines have not been authorized for use in children. Off-label use based on the individual responsibility of the treating paediatrician is often the only way children can benefit from medicines that are only authorized for adults 1. This relies on the questionable assumption that children are small adults. There are several reasons for this situation: clinical research in children is a sensitive area involving emotional and ethical challenges; there are methodological challenges, for example, the small numbers of children that can be recruited into trials; and costs are increased and may not be compensated by economic returns if the treated disease is rare in children. In order to improve the situation, new legal requirements have been created in the USA 2, 3 and in the European Union (EU) 4, 5. Essentially, these require companies to agree on a plan for developing a medicine in children with the regulatory authorities before authorization in adults. If studies in children performed according to the agreed plan are submitted and lead to authorization in children, patent exclusivity is prolonged as a reward for the extra effort of the drug developer.

The scope of such a paediatric investigation plan (PIP) may range from a full programme (including pre-clinical research, pharmacokinetics, pharmacodynamics, dose finding studies and two fully powered pivotal phase III studies) for diseases only existing in childhood at the upper end of the spectrum to, for example, a single (pharmacokinetic) case series in children at the lower end of the spectrum. The latter situation is obviously based on the assumption that data and results from adult patients can be extrapolated to children and that only very limited additional data from children are necessary before authorization of the treatment for children as well. Such extrapolation is only possible in situations where it may be assumed that children are reasonably similar to adults, which, as a general rule, is not acceptable, for example, because of differences related to growth and maturation. In order to give some structure to the decision process on whether and to what extent extrapolation from adults to children is appropriate, the Food and Drug Administration (FDA) has developed a paediatric study decision tree based on similarity of disease progression, similarity of response to treatment and similar concentration–response relationships 6.

The European Medicines Agency (EMA) has issued a concept paper on extrapolation 7 that – although referring also to extrapolation in other areas of drug development – has been mainly driven by its Paediatric Committee. Extrapolation in the regulatory context is defined by

Extending information and conclusions available from studies in one or more subgroups of the patient population (source population), or in related conditions or with related medicinal products, to make inferences for another subgroup of the population (target population), or condition or product, thus reducing the need to generate additional information (types of studies, design modifications, number of patients required) to reach conclusions for the target population, or condition or medicinal product.

In the same document, it is stated that the

primary rationale for extrapolation is to avoid unnecessary studies in the target population for ethical reasons, for efficiency, and to allocate resources to areas where studies are the most needed.

Different ways are mentioned to reduce the evidence required from the paediatric population(s), dependent on the degree of similarity to the source population (e.g. adults). Instead of a full development programme, only a reduced set of studies is required, for example: pharmacokinetic/pharmacodynamic studies only; dose-ranging or dose-titration studies; non-controlled descriptive efficacy and/or safety studies; controlled studies but arbitrary sample size; larger significance level; lower coverage probability of confidence intervals; acceptance of surrogate endpoints for the primary analysis; interpolation (bridging), for example, between age subgroups; and modelling prior information from existing data sets (Bayesian models and meta-analytic predictive). Some of these proposals have also been mentioned in the previous EMA guideline on clinical trials in small populations 8.

It is obvious that decisions on the extent of extrapolation possible, for example, from adults to children, are generally not conventional statistical decisions. Often it is even hard to find sufficient data in children on the control therapy with which the new treatment has to be compared. Generally, no data at all are available from systematic studies with the new drug in children, because we still encounter the argument that paediatric studies are not ethical before the drug has been successfully registered in adults. However, by European law the PIP should be laid down as soon as results from early studies in adults become available, hence, certainly before registration in adults. The rationale is that drug developers should provide an early commitment on what they are planning regarding development for children. Often clinical data on the efficacy and safety of the new treatment are very limited even in adults, so that decisions will have to be grounded predominantly on the opinion of experienced specialists in the disease area with corresponding expertise in the paediatric population. Clearly, historical data from different sources and of different relevance will in general play an important role in the expert judgement and decisions. As a matter of fact, decisions under uncertainty have to be taken in this area by experts in collaboration with statisticians. If methodologists refuse to deal with such an environment, paediatricians will decide individually and apply treatments to children off label without being able to refer to any systematic study results.

In this paper, we will try to structure the extrapolation process. We concentrate on softening the burden of evidence in paediatric populations by enlarging the significance level in a paediatric clinical trial. We introduce prior probabilities for non-applicability of extrapolation (‘scepticism’) and priors on the hypotheses to be tested. We show how single standard frequentist tests with an enlarged significance level correspond to Bayesian decision rules based on certain scepticism and priors. In Section 2, we develop the general framework for α-level adjustment by applying Bayesian arguments. In Section 3, we apply this framework to treatment–control comparisons assuming normally distributed outcome variables. In Section 4, we show, as an example, how the FDA decision tree for extrapolation may be roughly embedded in our framework. We close with a short discussion in Section 5. In the Appendix, we show the favourable properties of the two-point priors used in the argumentation.

2 A simple framework for α-level adjustment

In our approach, to incorporate prior knowledge in the hypothesis testing problem, two properties of the test function φ have to be considered.
  1. Given that H1 is true, the test function should reject H0 (which means that φ = 1) with a high probability. Thus, the power P(φ = 1|H1) has to be fixed adequately on a level 1 − β.
  2. Given that the sponsor reports φ = 1, the regulators want to be sure that H1 is true. Therefore, the probability P(H1|φ = 1) = 1 − γ (which may be interpreted as a ‘positive predictive value’ of a significant test result) has to be controlled.

Here, H0 and H1 indicate the null and alternative hypotheses in, for example, the comparison of the means between an experimental treatment and a control, or in the test of a dose–response relationship.

It should be mentioned that criterion (ii) is in contrast to the usual clinical trial approach, where both P(φ = 1|H1) and P(φ = 1|H0) are adequately controlled. In our framework, the direct control of the power and the positive predictive value of a significant test result is the crucial condition for comparing different tests. In the conventional testing set-up, the control of the latter value is indirectly aimed at by control of the type I error rate. Thus, our setting could be seen as a compromise between the classical framework and the control of both the positive and the negative predictive value as suggested in 9.

The choice of 1 − β is typically based on economic and ethical arguments, and values of β equal to 0.1 or 0.2 are traditionally taken in phase III studies. In the following, we will consider two scenarios, where the first scenario will help us to choose a reasonable value also for 1 − γ, which then will hold as a standard, whereas the second scenario will eventually motivate an adjustment of the type I error rate in order to maintain this standard.

2.1 Two scenarios to arrive at the same positive predictive value

2.1.1 The benchmark scenario

To motivate the choice of the positive predictive value 1 − γ, we consider the case that a full study programme is conducted only in the target population, and several early phase studies have been conducted with a positive result including a phase II proof of concept study. With each positive result, the belief in H1 may have strengthened, and when planning the phase III study, we have arrived at a prior probability for H1 of 1 − r. We will refer to this scenario as the benchmark scenario. In order to justify this designation, the type I and type II error rates on which $P_r(H_1 \mid \varphi = 1)$ depends are assumed to be the traditional values for phase III studies: α may be set to the one-sided level 0.025, or even 0.025² (representing the two pivotal studies paradigm), whereas β may be set to, for example, 0.1 or 1 − 0.9² = 0.19 (again for the two studies paradigm). Note that $P_r(H_1 \mid \varphi = 1)$ is given by Bayes's theorem, and the subscript r indicates that this benchmark belief depends on 1 − r. After choosing error rates that represent the common practice in the given setting, let φb denote a test that controls exactly these α and β levels.
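The benchmark belief can be evaluated numerically with Bayes's theorem. The following is a minimal sketch; the function name and the three values of r are illustrative choices, not part of the original text. It uses the two-pivotal-studies error rates quoted above (α = 0.025² and β = 1 − 0.9²).

```python
def positive_predictive_value(r, alpha, beta):
    """P(H1 | phi = 1) by Bayes's theorem, with prior P(H1) = 1 - r."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    true_positive = (1.0 - r) * (1.0 - beta)  # H1 holds and the test rejects
    false_positive = r * alpha                # H0 holds and the test rejects
    return true_positive / (true_positive + false_positive)


# Error rates of the two-pivotal-studies paradigm
ALPHA = 0.025 ** 2       # 0.000625
BETA = 1.0 - 0.9 ** 2    # approximately 0.19

for r in (0.25, 0.5, 0.75):
    ppv = positive_predictive_value(r, ALPHA, BETA)
    print(f"r = {r:.2f}: 1 - gamma = {ppv:.4f}")
```

With these inputs the sketch gives 1 − γ ≈ 0.9997, 0.9992 and 0.9977 for r = 0.25, 0.5 and 0.75, which are the values quoted with the panels of Table 1.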

A crucial parameter in this setting is the value of 1 − r, the regulators' belief in H1. As 1 − r represents the belief before phase III in a standard drug development programme is started, one may argue that enough evidence has been accumulated so that $1 - r \geq 0.5$. For the choice of 1 − r, it may be possible to derive an average subjective probability in the Bayesian sense for the truth of H1. Another possibility is the deduction of 1 − r in a frequentist framework using the law of total probability (see also 9): let φIII denote a binary function that indicates the success (φIII = 1) or failure (φIII = 0) of a phase III clinical trial. Here, as an example, we consider drug development in oncology. Investigations of the success rates of these phase III trials in oncology claim that approximately 55–60% of these drugs fail 10, 11. Taking this value into account with P(φIII = 1) = 0.40, we have
$$0.40 = P(\varphi_{III} = 1) = P(\varphi_{III} = 1 \mid H_1)\,(1 - r) + P(\varphi_{III} = 1 \mid H_0)\,r = (1 - \beta)(1 - r) + \alpha r, \tag{1}$$
from which, by simple algebra, it follows that
$$1 - r = \frac{0.40 - \alpha}{1 - \beta - \alpha} \approx 0.5. \tag{2}$$
The value 0.5 in the last equation follows if α is considered to be negligibly small and if β ≈ 0.2. If the success rate in phase III is higher, the aforementioned derivation will result in an increased prior probability 1 − r > 0.5.
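The frequentist deduction of 1 − r can be checked with a few lines of code. This is a sketch under the assumption that the phase III success rate decomposes by the law of total probability as P(φIII = 1) = (1 − β)(1 − r) + αr; the function name and the second and third input combinations are hypothetical illustrations.

```python
def prior_belief_from_success_rate(success_rate, alpha, beta):
    """Solve success_rate = (1 - beta) * (1 - r) + alpha * r for 1 - r."""
    if not 0.0 <= alpha < 1.0 - beta:
        raise ValueError("requires 0 <= alpha < 1 - beta")
    return (success_rate - alpha) / (1.0 - beta - alpha)


# alpha negligibly small and beta = 0.2, as in the text
print(f"{prior_belief_from_success_rate(0.40, 0.0, 0.2):.3f}")
# a non-negligible alpha lowers the implied prior belief slightly
print(f"{prior_belief_from_success_rate(0.40, 0.025, 0.2):.3f}")
# a higher success rate raises it
print(f"{prior_belief_from_success_rate(0.45, 0.025, 0.2):.3f}")
```

The first call corresponds to the case discussed in the text and yields a prior belief of about 0.5.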

2.1.2 The α-level adjustment scenario

After arriving at a benchmark, two different statistical models have to be considered: one for the source population and one for the target population. In the former model, an alternative hypothesis $H_1'$ is formulated. It is assumed that it is possible to translate the alternative $H_1'$ from the source population into a clinically relevant alternative hypothesis H1 for the statistical model in the target population with the following condition: if sufficient similarity between the source and the target population holds with regard to biological correspondence, disease progression, and so on, then all the evidence regarding the truth of $H_1'$ can be translated into evidence for the truth of H1. An example for such a translation could be a certain functional relation of the effect sizes in both subpopulations. For the statement that there is sufficient similarity, we write E (short for ‘full extrapolation is possible’, which we consider here as an equivalent formulation of the statement), whereas the opposite statement is denoted as $\bar{E}$.

To be specific, we consider the problem of extrapolating evidence from adults (source population) to children (target population) and assume that the alternative $H_1'$ is ‘proven’ in adults by the conduct of a sufficient study programme of clinical trials relying on statistical tests. This means that, due to the principle of statistical tests, there remains some uncertainty in the test decision. The assumption of a ‘proven’ $H_1'$ seems to be realistic because a failure in the proof of efficacy in the adult population in general would stop any further development in the paediatric population. Hence, formally the following arguments condition on the proof of efficacy in the adult population. Next, we model the beliefs in the truth of E and $\bar{E}$ as probabilities and write P(E) = 1 − s and $P(\bar{E}) = s$, where the latter probability will be denoted as the scepticism. Under the assumption that the way the adult trial was designed and conducted corresponds to the benchmark scenario outlined earlier and that full extrapolation can be applied, the probabilities of $H_1'$ and H1 are equal such that
$$P(H_1 \mid E) = P(H_1' \mid \varphi' = 1) = 1 - \gamma, \tag{3}$$
where φ′ is the indicator for a successful adult development programme (φ′ = 1 indicates proof of efficacy). The quantity γ quantifies the remaining uncertainty, and we choose 1 − γ to be equal to $P_r(H_1 \mid \varphi_b = 1)$ from the benchmark scenario. For the rest of the paper, we consider this as the basic extrapolation assumption.
The probability that H1 holds conditional on the impossibility of full extrapolation, $P(H_1 \mid \bar{E})$, is denoted by 1 − q. Now the probability of H1 can be written as
$$P(H_1) = P(H_1 \mid E)\,P(E) + P(H_1 \mid \bar{E})\,P(\bar{E}) \tag{4}$$
$$\phantom{P(H_1)} = (1 - \gamma)(1 - s) + (1 - q)\,s. \tag{5}$$

Note that s refers to the disbelief in the ‘similarity’ (scepticism) between the source and the target population, whereas q refers to the prior probability of no effect in the target population if similarity cannot be applied as an argument. Equivalently, 1 − q answers the question: how likely is the alternative, if it is found that extrapolation regarding efficacy cannot be applied? One may tend to choose q values close to 1 in such a situation, but certainty with regard to existing differences between the populations may not exclude that the drug is working in the subpopulation; hence, values of q < 1 may be reasonable. In particular, there may be some information from past use of the drug in the target population (e.g. from off-label use in the paediatric population). Under our proposed extrapolation assumption (equation 3), a lower boundary $q \geq \gamma$ seems reasonable for logical consistency. Otherwise, the belief in H1 would be higher if full extrapolation were regarded as not applicable.

With the aforementioned assumptions and fixed q and s, this describes the α-level adjustment scenario, where a paediatric trial is designed to be conducted after a positive result in the corresponding adult trial. With particular values of q and s, a prior probability of the alternative hypothesis is given by equation 5, which will be denoted $P_{q,s}(H_1)$ from now on. The corresponding positive predictive value, derived by Bayes's theorem, will be written in a similar fashion as $P_{q,s}(H_1 \mid \varphi = 1)$.
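The prior probability of H1 in the α-level adjustment scenario is a mixture of the belief under full extrapolation (1 − γ) and the belief without it (1 − q), weighted by the scepticism s. A minimal sketch follows; the function name and the numerical inputs are illustrative assumptions.

```python
def prior_h1_with_extrapolation(gamma, q, s):
    """Prior P(H1) = (1 - gamma) * (1 - s) + (1 - q) * s (mixture over E and not-E)."""
    for name, value in (("gamma", gamma), ("q", q), ("s", s)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return (1.0 - gamma) * (1.0 - s) + (1.0 - q) * s


GAMMA = 0.0008  # illustrative value, roughly the benchmark for r = 0.5

for s in (0.0, 0.1, 0.5, 1.0):
    print(f"s = {s:.1f}: P(H1) = {prior_h1_with_extrapolation(GAMMA, q=0.9, s=s):.4f}")
```

For s = 0 the prior equals 1 − γ (full extrapolation), for s = 1 it equals 1 − q, and for q = γ it equals 1 − γ whatever the scepticism.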

2.2 Evidence based α-level adjustment

We aim to derive a new test φadj with error rates αadj and βadj. This test should provide the same degree of belief $P_{q,s}(H_1 \mid \varphi_{adj} = 1)$ as a positive test result would provide in the benchmark scenario (= 1 − γ):
$$P_r(H_1 \mid \varphi_b = 1) = P_{q,s}(H_1 \mid \varphi_{adj} = 1). \tag{6}$$
If this equation holds, in our framework, we consider positive test results for φb and φadj to be equally convincing for regulators. By equation 6,
$$\frac{(1 - r)(1 - \beta)}{(1 - r)(1 - \beta) + r\,\alpha} = \frac{P_{q,s}(H_1)\,(1 - \beta_{adj})}{P_{q,s}(H_1)\,(1 - \beta_{adj}) + \bigl(1 - P_{q,s}(H_1)\bigr)\,\alpha_{adj}}, \tag{7}$$
where the left side of equation 7 refers to the benchmark scenario, $P_{q,s}(H_1)$ is the prior probability of H1 given by equation 5, and αadj and βadj are chosen such that the equation holds. Clearly, the type II error rate should not be different from what is usually chosen from the sponsor's (and regulators') perspective, because the main objective of the study is to obtain a positive result conditional on the true H1, that is, βadj = β. This means that αadj is the only remaining variable in 7, from which
$$\alpha_{adj} = \alpha \cdot \frac{r}{1 - r} \cdot \frac{P_{q,s}(H_1)}{1 - P_{q,s}(H_1)} \tag{8}$$
for r ≠ 1 and $P_{q,s}(H_1) \neq 1$ is now easy to derive. Thus, the α level is raised by a factor representing the ratio of prior odds in favour of H1 in the α-level adjustment scenario and in the benchmark.
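The step from equation 7 to equation 8 can be spelled out as follows. In this sketch, $p_b = 1 - r$ and $p_{adj}$ are shorthand introduced only here for the prior probabilities of H1 in the benchmark and in the α-level adjustment scenario, and βadj = β is assumed as stated above.

```latex
\begin{align*}
\frac{p\,(1-\beta)}{p\,(1-\beta) + (1-p)\,\alpha}
  &= \left(1 + \frac{1-p}{p}\cdot\frac{\alpha}{1-\beta}\right)^{-1}
  && \text{(positive predictive value for prior } p\text{)} \\
\frac{1-p_b}{p_b}\,\alpha
  &= \frac{1-p_{adj}}{p_{adj}}\,\alpha_{adj}
  && \text{(equal positive predictive values, same } \beta\text{)} \\
\alpha_{adj}
  &= \alpha\cdot\frac{1-p_b}{p_b}\cdot\frac{p_{adj}}{1-p_{adj}}
   = \alpha\cdot\frac{r}{1-r}\cdot\frac{p_{adj}}{1-p_{adj}}
\end{align*}
```

The positive predictive value depends on the prior and the significance level only through the product of the prior odds against H1 and α, which is why equal posterior beliefs translate into a simple rescaling of α.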
By using 5, equation 8 can be written as
$$\alpha_{adj} = \alpha \cdot \frac{r}{1 - r} \cdot \frac{(1 - \gamma)(1 - s) + (1 - q)\,s}{\gamma\,(1 - s) + q\,s}. \tag{9}$$

In the last equation, $\alpha_{adj}/\alpha$ is the α-level adjustment factor, which after fixing γ only depends on r, q and s. By using 1 − γ from the benchmark scenario, we see from equation 9 that for s→0, the α-level adjustment factor $\alpha_{adj}/\alpha$ approaches $\frac{r}{1-r}\cdot\frac{1-\gamma}{\gamma} = \frac{1-\beta}{\alpha}$, so that the maximum value for αadj is 1 − β. Note that a test with a level of 1 − β has a constant rejection probability of 1 − β irrespective of the null hypothesis or the alternative being true. Such a test refers to full extrapolation, because in theory a test with this property can be achieved with a sample size of zero by simply running a Bernoulli experiment with p = 1 − β. No evidence is necessary also for the case q = γ, as αadj again takes its maximum value 1 − β. For s→1, αadj approaches $\alpha\cdot\frac{r}{1-r}\cdot\frac{1-q}{q}$ and hence approaches α for q = r: in case of full scepticism about the similarity (s = 1), the only prior information on H1 to be used is the belief 1 − q. If 1 − q = 1 − r, we simply end up with the test in the benchmark scenario (φadj = φb). An interpretation for this particular case q = r is that early phase data in the paediatric population are available such that we are in the same situation as if we were starting a regular phase III drug development programme in the source population.
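The limiting cases described above can be verified numerically. The sketch below assumes that the adjusted level is α multiplied by r/(1 − r) and by the prior odds of H1 in the adjustment scenario, with γ taken from the benchmark positive predictive value; the function name is hypothetical.

```python
def alpha_adjusted(alpha, beta, r, q, s):
    """Adjusted significance level giving the benchmark positive predictive value."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    # gamma = 1 - benchmark positive predictive value
    gamma = r * alpha / ((1.0 - r) * (1.0 - beta) + r * alpha)
    # prior probability of H1 in the alpha-level adjustment scenario
    prior_h1 = (1.0 - gamma) * (1.0 - s) + (1.0 - q) * s
    if prior_h1 >= 1.0:
        raise ValueError("prior probability of H1 must be below 1")
    return alpha * (r / (1.0 - r)) * prior_h1 / (1.0 - prior_h1)


ALPHA = 0.025 ** 2
BETA = 1.0 - 0.9 ** 2

# No scepticism (s = 0): the level reaches its maximum 1 - beta
print(alpha_adjusted(ALPHA, BETA, r=0.5, q=1.0, s=0.0), 1.0 - BETA)
# Full scepticism (s = 1) with q = r: the benchmark level alpha is recovered
print(alpha_adjusted(ALPHA, BETA, r=0.5, q=0.5, s=1.0), ALPHA)
```

Both printed pairs agree up to floating-point error, in line with the two limits stated in the text.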

In Table 1, the values of the α-level adjustment factor $\alpha_{adj}/\alpha$ are calculated according to equation 9 for different values of s, r and q. With decreasing s, $\alpha_{adj}/\alpha$ increases. For fixed s and r, this factor also increases with decreasing q. When only r is allowed to vary, then an increase of r leads to an increase of $\alpha_{adj}/\alpha$. Note that the maximum inflation factor of 909.86 (r = 0.75, q = 0.1, s = 0.01 and γ = 0.0023) in Table 1 results in an increase of α = 0.025² to an αadj of 0.57, which in practice would mean no further study in the target population.

Table 1. Different values of scepticism s (left column) motivate different values of $\alpha_{adj}/\alpha$ for calculating $\alpha_{adj}$.
s q=0.1 q=0.2 q=0.3 q=0.4 q=0.5 q=0.6 q=0.7 q=0.8 q=0.9 q=1.0
0.9 10.10 4.55 2.70 1.78 1.22 0.85 0.59 0.39 0.23 0.11
0.8 11.48 5.24 3.16 2.12 1.50 1.08 0.79 0.56 0.39 0.25
0.7 13.24 6.13 3.76 2.57 1.86 1.38 1.04 0.78 0.59 0.43
0.6 15.58 7.31 4.55 3.16 2.33 1.78 1.38 1.08 0.85 0.67
0.5 18.85 8.96 5.65 3.99 2.99 2.33 1.85 1.50 1.22 1.00
0.4 23.71 11.43 7.30 5.23 3.99 3.16 2.57 2.12 1.77 1.50
0.3 31.74 15.52 10.04 7.30 5.64 4.54 3.75 3.16 2.70 2.33
0.2 47.50 23.62 15.50 11.40 8.94 7.29 6.11 5.23 4.54 3.98
0.1 92.51 47.32 31.58 23.57 18.73 15.48 13.15 11.39 10.03 8.93
0.08 113.82 58.85 39.47 29.57 23.56 19.53 16.63 14.45 12.75 11.39
0.04 209.96 113.41 77.49 58.74 47.22 39.42 33.79 29.54 26.22 23.55
0.02 361.90 209.28 147.03 113.21 91.97 77.40 66.77 58.68 52.32 47.18
0.01 566.12 360.89 264.72 208.94 172.51 146.86 127.81 113.11 101.42 91.91
(a) r = 0.5, 1 − γ = 0.9992
s q=0.1 q=0.2 q=0.3 q=0.4 q=0.5 q=0.6 q=0.7 q=0.8 q=0.9 q=1.0
0.9 3.37 1.52 0.90 0.59 0.41 0.28 0.20 0.13 0.08 0.04
0.8 3.83 1.75 1.06 0.71 0.50 0.36 0.26 0.19 0.13 0.08
0.7 4.42 2.05 1.25 0.86 0.62 0.46 0.35 0.26 0.20 0.14
0.6 5.21 2.44 1.52 1.05 0.78 0.59 0.46 0.36 0.28 0.22
0.5 6.32 3.00 1.89 1.33 1.00 0.78 0.62 0.50 0.41 0.33
0.4 7.97 3.83 2.44 1.75 1.33 1.05 0.86 0.71 0.59 0.50
0.3 10.71 5.21 3.36 2.44 1.89 1.52 1.25 1.05 0.90 0.78
0.2 16.16 7.96 5.20 3.82 2.99 2.44 2.04 1.75 1.52 1.33
0.1 32.25 16.14 10.69 7.95 6.30 5.20 4.41 3.82 3.36 2.99
0.08 40.14 20.20 13.42 10.01 7.95 6.58 5.59 4.86 4.28 3.82
0.04 78.16 40.09 26.88 20.18 16.13 13.41 11.47 10.00 8.86 7.95
0.02 147.68 78.06 52.98 40.06 32.18 26.87 23.06 20.18 17.93 16.13
0.01 265.36 147.51 102.09 78.01 63.10 52.96 45.61 40.05 35.68 32.17
(b) r = 0.25, 1 − γ = 0.9997
s q=0.1 q=0.2 q=0.3 q=0.4 q=0.5 q=0.6 q=0.7 q=0.8 q=0.9 q=1.0
0.9 30.25 13.65 8.10 5.33 3.66 2.55 1.76 1.17 0.70 0.33
0.8 34.28 15.70 9.48 6.36 4.49 3.24 2.35 1.68 1.16 0.75
0.7 39.44 18.32 11.24 7.69 5.55 4.13 3.11 2.35 1.76 1.28
0.6 46.24 21.81 13.58 9.45 6.97 5.31 4.13 3.24 2.55 1.99
0.5 55.65 26.66 16.85 11.91 8.94 6.96 5.54 4.48 3.65 2.99
0.4 69.49 33.86 21.71 15.59 11.90 9.43 7.66 6.33 5.30 4.47
0.3 91.89 45.69 29.75 21.67 16.79 13.52 11.18 9.42 8.04 6.95
0.2 134.32 68.69 45.51 33.65 26.46 21.62 18.15 15.54 13.50 11.86
0.1 245.37 132.88 90.52 68.30 54.61 45.33 38.62 33.55 29.58 26.39
0.08 293.30 162.52 111.83 84.91 68.22 56.85 48.61 42.37 37.47 33.53
0.04 479.54 290.63 208.01 161.68 132.03 111.43 96.28 84.68 75.50 68.06
0.02 700.68 475.98 360.05 289.30 241.63 207.33 181.46 161.26 145.05 131.75
0.01 909.86 696.89 564.50 474.22 408.73 359.04 320.05 288.65 262.81 241.17
(c) r = 0.75, 1 − γ = 0.9977
  • The error rates α and β from the benchmark scenario are considered to be 0.025² and 1 − 0.9² = 0.19, respectively.
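The entries of Table 1 can be recomputed from the adjustment formula. The following sketch assumes that the table columns correspond to q = 0.1, 0.2, …, 1.0, which is consistent with the tabulated numbers; the function and constant names are ours.

```python
ALPHA = 0.025 ** 2       # benchmark type I error rate (two pivotal studies)
BETA = 1.0 - 0.9 ** 2    # benchmark type II error rate

S_VALUES = (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.08, 0.04, 0.02, 0.01)
Q_VALUES = tuple(k / 10 for k in range(1, 11))


def adjustment_factor(r, q, s, alpha=ALPHA, beta=BETA):
    """Ratio alpha_adj / alpha for benchmark prior 1 - r, scepticism s and belief 1 - q."""
    gamma = r * alpha / ((1.0 - r) * (1.0 - beta) + r * alpha)
    prior_h1 = (1.0 - gamma) * (1.0 - s) + (1.0 - q) * s
    return (r / (1.0 - r)) * prior_h1 / (1.0 - prior_h1)


def print_panel(r):
    """Print one panel of the table: rows are s, columns are q."""
    labels = [f"q={q:.1f}" for q in Q_VALUES]
    print(f"{'s':>5}" + "".join(f"{label:>9}" for label in labels))
    for s in S_VALUES:
        cells = "".join(f"{adjustment_factor(r, q, s):9.2f}" for q in Q_VALUES)
        print(f"{s:>5}" + cells)


print_panel(0.5)  # panel (a)
```

Calling `print_panel(0.25)` or `print_panel(0.75)` gives the other two panels. Small differences in the last digit relative to the published table may occur because of rounding.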

2.3 Eminence and evidence

To summarize, there are two sources of information, on which the final belief in the truth of H1 depends:
  1. Eminence: Information on which regulatory experts (e.g. a division of the Paediatric Committee) base their choice of the design parameters r, q and the scepticism s. Note that at the time when the PIP has to be laid down, often no data are available from efficacy trials in the adult population. The value of r is derived from some general arguments on the prior belief that the drug has no relevant effect when a standard drug development programme has already passed phase II and has reached the planning of phase III. We have suggested a plausible choice from general regulatory experience. However, it may be advisable to choose different prior beliefs r depending on the type of disease and type of drug under investigation. An increasing transparency of data from the regulatory drug registration process 12 may in future help to choose appropriate prior beliefs. We have chosen the slightly provocative term ‘eminence’ for expert opinion on similarity, modes of action, age dependency, prior beliefs and so on to express our reservations about the possibility of eliciting all these types of information, given the potentially high variability of the information between experts. Choosing an appropriate value of s seems to be even more difficult, and special techniques of eliciting prior knowledge in Bayesian statistics may be applicable 13, 14.

    One way to simplify the arguments is to assume q = 1 throughout. This implies that if extrapolation between populations is not considered to be an option, then the belief in efficacy (H1) would be zero. The framework allows values q < 1 to be chosen, leaving a perspective for efficacy even if there is high certainty about relevant differences between the populations that rule out extrapolation arguments. A crucial assumption in our framework is the extrapolation assumption P(H1|E) = 1 − γ, meaning that if extrapolation is considered to be applicable, the proof of efficacy in the source population can be extrapolated to the target population, providing, after rejection in the target population, the same posterior belief in the alternative as in the adult population with registration according to the standard procedure.

  2. Evidence: Data from a trial in the target population designed to control the error rates αadj and β of the test of the primary outcome variable. As formulated earlier, this adjusted significance level results from all a priori knowledge, for example, from expert opinion and/or from trials in the source population, covered by the additional design parameters r, q and s. In our simplified scenario, αadj and β are the criteria of statistical evidence to be reached for the trial to be performed in the target population. Hence, from a regulatory perspective, αadj is used as the final decision criterion for registering a treatment in the target population.
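To give a feeling for how a relaxed significance level translates into a smaller trial, the following sketch compares per-group sample sizes for a one-sided two-sample z-test with known unit variance. This design, the standardized effect size and the adjustment factor used in the example are illustrative assumptions and are not taken from the text; the sample size formula is the usual normal approximation.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(alpha, beta, delta):
    """Per-group sample size of a one-sided two-sample z-test (unit variance).

    delta is the standardized treatment effect assumed under the alternative.
    """
    if not 0.0 < alpha < 1.0 - beta < 1.0:
        raise ValueError("requires 0 < alpha < 1 - beta < 1")
    if delta <= 0.0:
        raise ValueError("delta must be positive")
    z = NormalDist().inv_cdf
    return ceil(2.0 * (z(1.0 - alpha) + z(1.0 - beta)) ** 2 / delta ** 2)


ALPHA = 0.025 ** 2       # benchmark level (two pivotal studies)
BETA = 1.0 - 0.9 ** 2    # benchmark type II error rate
FACTOR = 3.99            # hypothetical adjustment factor, e.g. one Table 1 entry
DELTA = 0.5              # hypothetical standardized effect size

print("benchmark n per group:", n_per_group(ALPHA, BETA, DELTA))
print("adjusted n per group: ", n_per_group(FACTOR * ALPHA, BETA, DELTA))
```

As αadj approaches 1 − β, the required sample size shrinks towards zero, which mirrors the full-extrapolation limit discussed in Section 2.2.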

3 Extrapolation in normally distributed data

In this section, we propose a generalized framework for the test of one-sided hypotheses by introducing general prior distributions. Then, we motivate the application of two-point priors and focus on the special case of normally distributed outcome variables.

Let π denote a prior distribution on a parameter Δ and φ denote a test procedure testing H0: Δ ≤ Δ0 against H1: Δ > Δ0.

We redefine the test quality criteria from Section 2:
  • urn:x-wiley:sim:media:sim6865:sim6865-math-0066
  • urn:x-wiley:sim:media:sim6865:sim6865-math-0067,

where urn:x-wiley:sim:media:sim6865:sim6865-math-0068 and P(φ = 1|Δ) denotes the probability of a rejection given Δ. The left-hand side in the first inequality is the Bayesian power, defined as the average of the frequentist power across alternatives urn:x-wiley:sim:media:sim6865:sim6865-math-0069 according to the prior π. In the same sense, a Bayesian type I error rate can be defined as urn:x-wiley:sim:media:sim6865:sim6865-math-0070.
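The idea of averaging the rejection probability over a prior can be illustrated with a discrete prior. The sketch below is built on assumptions that are not in the text: a one-sided two-sample z-test with unit variance, Δ0 = 0, a hypothetical three-point prior, and averaging over all Δ > Δ0 for the Bayesian power (the region used in the paper's definition may be narrower).

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def rejection_probability(delta, n_per_group, alpha):
    """P(phi = 1 | delta) for a one-sided two-sample z-test with unit variance."""
    critical_value = _STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - _STD_NORMAL.cdf(critical_value - delta * sqrt(n_per_group / 2.0))


def prior_averaged_rejection(prior, in_region, n_per_group, alpha):
    """Average of P(phi = 1 | delta) over the prior restricted to a region."""
    mass = sum(w for d, w in prior.items() if in_region(d))
    if mass <= 0.0:
        raise ValueError("the prior puts no mass on the region")
    total = sum(w * rejection_probability(d, n_per_group, alpha)
                for d, w in prior.items() if in_region(d))
    return total / mass


DELTA_0 = 0.0
prior = {0.0: 0.3, 0.2: 0.2, 0.5: 0.5}   # hypothetical prior: effect -> probability

bayes_power = prior_averaged_rejection(prior, lambda d: d > DELTA_0, 64, 0.025)
bayes_type1 = prior_averaged_rejection(prior, lambda d: d <= DELTA_0, 64, 0.025)
print(f"Bayesian power: {bayes_power:.3f}, Bayesian type I error rate: {bayes_type1:.3f}")
```

For a two-point prior concentrated on Δ0 and a single alternative, both averages reduce to the usual frequentist type I error rate and power.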

Similar to our model in the last section, we now look at the two scenarios, the benchmark and the α-level adjustment scenario.

Benchmark scenario:

Let πb represent the prior belief in this scenario and urn:x-wiley:sim:media:sim6865:sim6865-math-0071 with error rates urn:x-wiley:sim:media:sim6865:sim6865-math-0072 and urn:x-wiley:sim:media:sim6865:sim6865-math-0073 denote a standard trial in phase III that is usually chosen to be equivalent to a Neyman–Pearson test. The posterior belief after rejection is then given by
urn:x-wiley:sim:media:sim6865:sim6865-math-0074
where urn:x-wiley:sim:media:sim6865:sim6865-math-0075 is the subset of H1 that is considered to be non-relevant for the Bayesian power calculation.
The question is now how to choose πb because different opinions about the effect Δ lead to different posterior knowledge of the H1. Our suggestion is as follows: similarly to the formulation of the benchmark scenario in the previous section, we first fix the prior probability 1 − r, but this time this value represents the belief on the parameter set urn:x-wiley:sim:media:sim6865:sim6865-math-0076. In the Appendix, we will show, that every prior πb fulfilling urn:x-wiley:sim:media:sim6865:sim6865-math-0077 leads to a positive predictive value with the property
urn:x-wiley:sim:media:sim6865:sim6865-math-0078
The right-hand side of this inequality can be thought of as the positive predictive value of the test urn:x-wiley:sim:media:sim6865:sim6865-math-0079 under a two-point prior, that is, a mixture of two Dirac distributions πNP(A) = r·1{A}(Δ0) + (1 − r)·1{A}(Δ1) (here, 1 denotes the indicator function). Furthermore, it holds that
urn:x-wiley:sim:media:sim6865:sim6865-math-0080
With the last two inequalities, we can conclude that after the choices of urn:x-wiley:sim:media:sim6865:sim6865-math-0081 and 1 − r, both the Bayesian power and the positive predictive value are controlled on a level urn:x-wiley:sim:media:sim6865:sim6865-math-0082 and on a level 1 − γb respectively with urn:x-wiley:sim:media:sim6865:sim6865-math-0083, for the set of all prior distributions πb with urn:x-wiley:sim:media:sim6865:sim6865-math-0084.

Note that in the benchmark scenario, the restriction on two points in the parameter space is widely used when planning a frequentist phase III study at a level α and a power of 1 − β, so that focusing on these two points is not an uncommon approach.
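
The two-point calculation can be illustrated numerically. The following is a minimal sketch, assuming only Bayes' rule for a prior with mass r on Δ0 and 1 − r on Δ1; the function and variable names are hypothetical and not taken from the paper. With r = 0.5 it reproduces the levels of evidence quoted in the caption of Figure 1.

```python
def positive_predictive_value(r, alpha, power):
    """Posterior probability of the alternative after rejection (two-point prior).

    r     : prior mass on the null point (hypothetical argument name)
    alpha : rejection probability at the null point
    power : rejection probability at the alternative point, i.e. 1 - beta
    """
    true_positive = (1.0 - r) * power   # alternative true and test rejects
    false_positive = r * alpha          # null true and test rejects
    return true_positive / (true_positive + false_positive)


# One trial at level 0.025 with power 0.9, and the "two trials" analogue
# at level 0.025**2 with power 0.81, both with prior odds of 1 (r = 0.5).
ppv_single = positive_predictive_value(r=0.5, alpha=0.025, power=0.9)
ppv_double = positive_predictive_value(r=0.5, alpha=0.025**2, power=0.81)
print(round(ppv_single, 3), round(ppv_double, 3))  # 0.973 0.999
```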

α-Level adjustment scenario:

Similarly to the restriction of possible prior distributions in the benchmark scenario to such priors πb that fulfil the constraint urn:x-wiley:sim:media:sim6865:sim6865-math-0085, we will restrict the possible prior distributions πadj in the α-level adjustment scenario. To this end, we repeat the approach from the last section as follows: if full extrapolation can be considered as possible, all the evidence can be taken from the source population; hence, urn:x-wiley:sim:media:sim6865:sim6865-math-0086. Furthermore, urn:x-wiley:sim:media:sim6865:sim6865-math-0087 has to be specified. After specifying the scepticism s, this again leads to
urn:x-wiley:sim:media:sim6865:sim6865-math-0088
as in equation 5, which we will use as a condition for the priors πadj in the α-level adjustment scenario.
By again defining a Neyman–Pearson test urn:x-wiley:sim:media:sim6865:sim6865-math-0089 with error rates urn:x-wiley:sim:media:sim6865:sim6865-math-0090 and urn:x-wiley:sim:media:sim6865:sim6865-math-0091, as in the benchmark scenario, it can be concluded that the Bayesian power is controlled at a level urn:x-wiley:sim:media:sim6865:sim6865-math-0092 and the positive predictive value
urn:x-wiley:sim:media:sim6865:sim6865-math-0093
is controlled at a level 1 − γadj with urn:x-wiley:sim:media:sim6865:sim6865-math-0094. Therefore, we again calculate the positive predictive value of the Neyman–Pearson test with prior probabilities coming from a two-point distribution with probabilities urn:x-wiley:sim:media:sim6865:sim6865-math-0095 and urn:x-wiley:sim:media:sim6865:sim6865-math-0096 to derive a lower bound for the positive predictive value under this constraint (see the Appendix).
Given our prior assumptions in both scenarios, we aim to control the positive predictive value and the Bayesian power in the α-level adjustment scenario at the level 1 − γb defined in the benchmark scenario. To that end, we equate the lower bounds of the positive predictive values in both scenarios, 1 − γb = 1 − γadj, which can be written as
urn:x-wiley:sim:media:sim6865:sim6865-math-0097
This is exactly of the form of equation 7, and hence, for equal type II error rates urn:x-wiley:sim:media:sim6865:sim6865-math-0098, all the results from Section 2 can be applied directly.
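
Equating the two lower bounds can be sketched in a few lines. This is an illustration under stated assumptions, not a reproduction of the paper's equations: it assumes that the two-point prior in the α-level adjustment scenario puts mass s·q on the null (scepticism s times the null probability q when extrapolation is not possible), that both scenarios share the same power, and that the resulting level is capped at 1. All function names are hypothetical.

```python
def ppv(prior_null, alpha, power):
    """Positive predictive value under a two-point prior (Bayes' rule)."""
    return (1.0 - prior_null) * power / ((1.0 - prior_null) * power + prior_null * alpha)


def adjusted_alpha(alpha, r, s, q=1.0):
    """Level that reproduces the benchmark PPV at equal power.

    ASSUMPTION: the adjusted two-point prior puts mass s * q on the null.
    """
    prior_null_adj = s * q
    if prior_null_adj <= 0.0:
        return 1.0  # no prior mass on the null: the PPV is 1 at any level
    unrestricted = alpha * (r / (1.0 - r)) * ((1.0 - prior_null_adj) / prior_null_adj)
    return min(1.0, unrestricted)  # ASSUMPTION: cap, a level cannot exceed 1


alpha, r, power = 0.025, 0.5, 0.9
for s in (0.1, 0.3, 0.5, 0.7, 0.9):
    a_adj = adjusted_alpha(alpha, r, s, q=1.0)
    # Third column: PPV of the adjusted test, equal to the benchmark PPV.
    print(s, round(a_adj, 4), round(ppv(s * 1.0, a_adj, power), 3))
```

Under these assumptions the adjustment factor does not depend on the common power, and with r = 0.5 and q = 1 the adjusted level equals α exactly at s = 0.5.
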
As an example, we now show how the previously developed framework can be applied in the setting of normally distributed outcome variables (variance known). We consider two groups of patients, where the first group receives an experimental treatment (T) and the second group receives either the standard clinical intervention or a placebo (C). For nT measurements in the experimental group and nC measurements in the control group, urn:x-wiley:sim:media:sim6865:sim6865-math-0099 and urn:x-wiley:sim:media:sim6865:sim6865-math-0100 denote independent estimators for the respective group mean. In this setting, the variances for the observations are assumed to be known and equal in both groups, and the measurements are given in units of this variance. Our interest lies in the effect of the experimental treatment Δ = μT − μC, which can be estimated in a natural way by urn:x-wiley:sim:media:sim6865:sim6865-math-0101, where urn:x-wiley:sim:media:sim6865:sim6865-math-0102. For our test problem, we consider the null hypothesis H0: Δ ≤ Δ0 and its alternative H1: Δ > Δ0. With the following rejection rule, φ = 1 if and only if
urn:x-wiley:sim:media:sim6865:sim6865-math-0103(10)
for specific α, n and where xn denotes a realization of Xn, we have the classical test that controls the type I error rate at level α. The power is controlled by an adequate selection of n.
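
A rejection rule of this kind can be sketched as a standard one-sided two-sample z-test with unit variance. The parameterisation below (standard error from nT and nC) is an assumption and may differ in form from equation (10); the function name and the example numbers are hypothetical.

```python
from statistics import NormalDist


def z_test_rejects(mean_t, mean_c, n_t, n_c, alpha, delta0=0.0):
    """One-sided z-test of H0: delta <= delta0 with known unit variance."""
    standard_error = (1.0 / n_t + 1.0 / n_c) ** 0.5   # sd of the mean difference
    z_statistic = (mean_t - mean_c - delta0) / standard_error
    critical_value = NormalDist().inv_cdf(1.0 - alpha)  # upper alpha quantile
    return z_statistic > critical_value


# Hypothetical data summaries: 50 patients per group.
print(z_test_rejects(0.6, 0.0, 50, 50, alpha=0.025))  # z is about 3.0
print(z_test_rejects(0.3, 0.0, 50, 50, alpha=0.025))  # z is about 1.5
```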

In Figure 1, the fractions urn:x-wiley:sim:media:sim6865:sim6865-math-0104 of the adjusted sample size based on αadj over the standard sample size are drawn for different values of s, q and r = 0.5. Here, the formulae urn:x-wiley:sim:media:sim6865:sim6865-math-0105 with A = αadj for nadj and A = α = 0.025 and 0.025² for n were used, with a targeted power of 1 − β = 0.9 and 1 − β = 0.81, respectively.
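
The ratio of sample sizes can be sketched as follows, assuming the usual normal-approximation formula in which the required sample size is proportional to (z_{1−A} + z_{1−β})²; constants such as the variance and the effect size cancel in the ratio. The function name is hypothetical, and returning 0 when the adjusted level already reaches the target power is an added convention, not something stated in the text.

```python
from statistics import NormalDist


def relative_sample_size(alpha_adj, alpha, power):
    """Ratio n_adj / n for one-sided z-tests with a common target power."""
    z = NormalDist().inv_cdf
    if alpha_adj >= power:
        # ASSUMPTION: a test at this level attains the power without data.
        return 0.0
    return ((z(1.0 - alpha_adj) + z(power)) / (z(1.0 - alpha) + z(power))) ** 2


# Reference design: alpha = 0.025 with power 0.9.
for a_adj in (0.01, 0.025, 0.1):
    print(a_adj, round(relative_sample_size(a_adj, 0.025, 0.9), 3))
```

A ratio below 1 corresponds to a relaxed level (αadj > α), and a ratio above 1 to a stricter one.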

Figure 1. Relative sample sizes depending on different scepticisms for r = 0.5. The horizontal lines indicate the relative size of 1, that is, the adjusted significance level αadj is equal to α; hence, the sample size of the adjusted test is equal to the sample size of the conventional test. The left figure refers to α = 0.025 and 1 − β = 0.9 (1 − γ = 0.973) and the right figure to α = 0.025² and 1 − β = 0.81 (1 − γ = 0.999). For a detailed description, see the text.

First, we observe that for decreasing q (more prior belief in the alternative if extrapolation is not assumed to be a possible option), the decrease in sample size becomes larger. Moreover, for q = 0.5, even large scepticisms (s > 0.5) may lead to a saving of sample size. Savings of sample size (values below the horizontal line) are possible up to similar scepticism values for high and low significance levels. Note that if the scepticism is high, that is, putting more prior belief in the non-applicability of extrapolation and/or in the lack of efficacy if extrapolation is not considered to be an option in our framework, a larger sample size than in the conventional test may be required to achieve the same high positive predictive value in the end. The possible savings, however, are larger for the higher significance level 0.025 (left panel in Figure 1). For the lower significance level (right panel), the sample size decrease is very steep for very small scepticisms.

3.1 Rejection probabilities

In Figure 2, curves for the rejection probabilities in the two-sample normal distribution scenario discussed above, with common σ = 1, are drawn as a function of the true effect Δ. The α level in the benchmark scenario is set equal to 0.025 in Figure 2(a) and 0.000625 in Figure 2(b). In these two figures, the relative sample sizes were calculated to reach a power of 0.9 and 0.81, respectively, at the alternative Δ = 1. The reference for calculating the relative sample size is a traditional parallel group comparison with α = 0.025 and 1 − β = 0.9 (Figure 2(a)) and α = 0.025² and 1 − β = 0.81 (Figure 2(b)), respectively. Curves are drawn for a scepticism s equal to 0.1 (dark grey), 0.3, 0.5, 0.7 and 0.9 (light grey). In both figures, 1 − r takes values of 0.75 (top line of plots), 0.5 (second line) and 0.25 (bottom line of plots). In the first, second and third columns, 1 − q is set to 0, 0.25 and 0.5, respectively.

Figure 2. Rejection probabilities for different parameters as a function of the true effect Δ. Descending grey values of the solid curves represent ascending scepticisms s ∈ {0.1, 0.3, 0.5, 0.7, 0.9} in each plot (thus for s = 0.1, the solid curve is dark grey). The dashed curve represents the traditional power curve outside of our proposed framework (α = 0.025 (a) and α = 0.025² (b)), which serves as a reference design. The numbers in the right grey area of the panels show the relative sample sizes in relation to this reference design for each value of s. The vertical solid line intersects the curves at their (adjusted) α level. The horizontal and vertical dotted lines represent the power and the predefined effect used for sample size calculation, respectively.

A decrease of q as well as an increase of r increases the α-level adjustment factor and therefore decreases the needed sample size in our approach: the relative sample size as compared with the reference design is <1. This is in complete accordance with the results already discussed for the framework of simple null and alternative hypotheses in Section 2.

Figure 2 also illustrates the risk of erroneously relying on extrapolation. Let us assume that for the paediatric population, there is no effect at all (H0 is true). Obviously, the probability of a false positive claim in the paediatric population when applying the adjusted test is αadj. This can be considerable for large 1 − q, the prior belief in efficacy of the paediatric drug if extrapolation is not deemed to be feasible, or for small scepticism s (high confidence in similarity). Somewhat counter-intuitive is the impact of 1 − r: the smaller the belief in the alternative in the benchmark scenario, the larger the false positive rate αadj in the adjusted test. This is because with increasing values of 1 − r, the level of evidence 1 − γ reached in the benchmark scenario is also increasing (see the left side of equation 6). If, on the other hand, the alternative is in fact true, then the power is 1 − β by design (Figure 2).
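
The shape of such rejection-probability curves can be sketched for a one-sided z-test whose sample size is calibrated to reach the target power at a prespecified effect. This is a minimal illustration assuming Δ0 = 0, a design alternative Δ1 = 1 and a level below the target power; the function name is hypothetical and the code does not reproduce the figure itself.

```python
from statistics import NormalDist


def rejection_probability(delta, alpha, power, delta1=1.0, delta0=0.0):
    """P(reject | true effect delta) for a z-test sized for `power` at delta1.

    Assumes alpha < power, so that a positive sample size is required.
    """
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1.0 - alpha)
    z_power = normal.inv_cdf(power)
    # Expected value of the z-statistic, scaled so that it equals
    # z_alpha + z_power at the design alternative delta1.
    drift = (delta - delta0) / (delta1 - delta0) * (z_alpha + z_power)
    return normal.cdf(drift - z_alpha)


# The curve passes through the level at delta = 0 and the power at delta = 1.
for level in (0.025, 0.1):
    print(level, [round(rejection_probability(d, level, 0.9), 3) for d in (0.0, 0.5, 1.0)])
```

At Δ = 0 the curve equals the (adjusted) level, which is the false positive risk discussed above, and at Δ = 1 it equals the target power.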

4 Example: the Food and Drug Administration decision tree

The FDA designed a decision tree in which knowledge of the similarities in both the disease progression and the response to intervention in the adult and in the paediatric population leads to reduced study programmes [6]. The magnitude of the reduction then depends on the similarity of the exposure–response relationships in the two populations. The tree distinguishes three levels of extrapolation and the corresponding levels of evidence to be supplied to the authority: full extrapolation, partial extrapolation and no extrapolation. Under partial extrapolation, a single well-controlled trial, uncontrolled efficacy and safety trials and single exposure–response trials (not powered for efficacy) may be required. However, for partial extrapolation, the decision tree does not specify the required evidence in detail. Therefore, for the sake of demonstration, we assume that under partial extrapolation a single confirmatory trial (instead of two) is required. This simplifies the decision tree to three levels:
  1. No additional trial (full extrapolation)
  2. A single confirmatory trial (partial extrapolation)
  3. Two confirmatory trials

In a recent review [15], it was shown that 68% of 166 paediatric products investigated used the concept of partial extrapolation. This indicates a high confidence of experts that adult data can be extrapolated to some extent to the paediatric population. Full extrapolation was only applied for 14% of the products, which corresponds to a scepticism equal to 0 in our framework. The supplementary material of the review contains tables listing the indications, the age groups and the products for which no extrapolation, partial extrapolation or full extrapolation has been applied. However, there is no quantitative description of the amount of scepticism leading to different study programme reductions. We tried to give a rough visualization of the decision tree in terms of our framework based on scepticism. In Figure 3, the horizontal lines indicate the three levels of evidence, and the lowest bar refers to two independent trials, both at a one-sided significance level of 0.025. To apply our previous framework, we assumed instead that a single trial with a significance level of 0.025² would be performed. The middle bar corresponds to a single trial at a one-sided significance level of 0.025; the highest point indicates full extrapolation (no clinical trial needed). To mimic the FDA decision tree, the lengths of the bars for the distinct significance levels have been chosen in a somewhat illustrative way such that the curve for the adjusted significance level crosses the central bar right in the middle. We believe that this is a fair approximation of the curve sharing a fixed initial point and two piecewise constant levels. It should be noted that our proposal could be interpreted as a continuous generalization of the three levels of the FDA decision tree, assuming that they correspond to three discrete levels of scepticism.

Figure 3. Translating the Food and Drug Administration (FDA) decision tree on extrapolation into our proposed methodology of α adjustment. The dashed line shows αadj as a function of the scepticism s in the simple extrapolation model. The piecewise constant solid lines represent the FDA extrapolation tree when expressed in terms of the framework of α adjustment and scepticism. Points of discontinuity are chosen such that the left bar is intersected by the continuous α level curve right in the middle.

Figure 3 shows a sharp increase of αadj with decreasing scepticism s in the relevant regions where the decision has to be made whether one or two pivotal studies have to be performed. Small differences in the assessment of s may have substantial consequences for the paediatric development programme. The figure also shows that to achieve the same posterior probability for a relevant effect size, full extrapolation (highest point at the left) is possible only if the experts are completely certain that extrapolation is fully applicable (s = 0). Even relaxing the drug development to a single trial at level 0.025 (middle bar) would, according to our framework and graphical approximation, require a very high belief in extrapolation (in terms of a small s). This seems to be the case for most of the products in the experience of the FDA [15]. If there is some belief in efficacy even when extrapolation is not considered to be possible (q < 1), slightly larger scepticisms may allow the full development programme (for the FDA, the two pivotal studies paradigm) to be avoided.

5 Discussion

In drug development, several standards have evolved over time. Most of the statistical standards refer to the planning and analysis of single trials (e.g. [16]). The important quality of reproducibility of trial results has long been accounted for, particularly at the FDA, which demands the two pivotal studies paradigm for phase III of drug development [17]. However, it is also possible to rely on a single adequate and well-controlled study of a drug if supported by additional evidence from other sources [17-19]. For drug development in small populations [8], keeping such standards may simply not be feasible. For subpopulations, such as the paediatric population, an additional problem arises because when developing a drug for children, in general, sufficient data are available from clinical trials having been performed for authorization of the drug in the adult population. In this situation, parents will be very cautious about allowing their children to participate in paediatric trials, particularly in trials using a placebo control, which will be excluded in life-threatening diseases anyway. There is a never-ending discussion on whether placebo-controlled trials may be performed at all in paediatric populations when the drug has been registered for adults. The consequence of all the ethical, feasibility and economic constraints in the past was that only a small proportion of drugs registered for adults have also been registered in children. Off-label use of the drug in children was the consequence if paediatric doctors believed that the drug would improve their patients' health. Shifting such decisions to the responsibility of the individual paediatrician without any access to systematic collection of efficacy data in the paediatric population is not an acceptable option from the legal and medical perspective. Hence, the legislation has been changed.
According to this new legislation, a new drug for adults is only registered at the FDA or EMA when a programme for drug development in children (at the EMA, a PIP) has been provided by the drug developer and has been approved by the regulators (at the EMA, by the Paediatric Committee). To avoid delaying registration in children, the paediatric development programme has to be proposed early in the development programme for adults. Another advantage of an early development plan for children is that at this time it could be integrated scientifically in the adult development by planning studies in adults that in turn provide specific data relevant for the paediatric development. This is the crucial problem: the earlier a paediatric programme is planned, the less information is available from adults. Hence, expert knowledge on the type of disease and type of drug plays an important role in deciding early on the design of a paediatric development programme to be accepted by regulators. Similar arguments may apply for very rare diseases where full programmes are infeasible or could withhold the beneficial use of a potent drug for a long time. Moreover, at the end of an overly long development programme, there could be no more interest in the drug because other potent therapeutic options have been established in the meantime.

We have tried to structure this procedure to decide on a drug development programme under uncertainty. Two quality criteria to compare different drug development programmes have been fixed: first, the power of detecting an effective drug in the end is prefixed at a certain value. This is in the interest of the developer. Second, the posterior probability that there is indeed a relevant positive treatment effect, after the test of the no effect null hypothesis has been rejected, is also fixed at a (large) value. This in our framework is in the public interest of the regulator who aims at a large positive predictive value of the final test decision. As a paradigm for standard drug development, we used a conventional clinical trial in phase III analysed by a statistical test with significance levels 0.025 or 0.025² and a certain prior belief (1 − r) in effectiveness based on earlier phases before starting phase III. In contrast, we looked at a test with an adapted significance level, where the adjustment of the significance level depends on the prior belief in the possibility of full extrapolation from another source (e.g. from another population). Not surprisingly, noticeable sample size savings are only possible if the prior belief in efficacy is fairly high. We also looked at how the results would change when we assume that there is still a positive belief (1 − q) in efficacy although extrapolation is not considered to be applicable. Obviously, the opening of a new track for a positive result by choosing q < 1 will increase the savings in sample size.

With regard to the FDA decision tree for extrapolation to paediatric populations (full extrapolation, a single trial and two pivotal studies), to apply full extrapolation, we need complete confidence in extrapolation. The reason is that a standard programme of drug development with a reasonably high power (e.g. 0.9) and plausible prior odds of 1 (r = 0.5) results in very large positive predictive values given the programme has succeeded in rejecting the null hypothesis of no efficacy (0.973 for a test at level 0.025 and 0.999 at level 0.025²). If no trial is run, there should be no doubt at all in the appropriateness of extrapolation in order to end up formally with a positive predictive value of the magnitude in the benchmark scenario. But even for relaxing the statistical rigour to the degree that only a single trial at a conventional significance level has to be run, the scepticism about the appropriateness of extrapolation has to be very small. It will be difficult to settle on such borderline prior beliefs in expert panels accounting for potential differences in eminence-based information. Moreover, as also pointed out in the reviewing process, the approach ‘is highly dependent on a number of assumptions related to the key parameters used to determine the adjusted level of α for the target population’. In Figure S1 of the Supporting Information, we present results of a sensitivity analysis, quantifying the impact of varying assumptions on the relationship between αadj and s within our simple framework. Here, we also look at the situation where the assumption of equal levels of evidence and equal power in the adult and paediatric programme is dropped. The relationship varies considerably with varying probability 1 − q of effectiveness without the option of extrapolation, and with varying levels of evidence 1 − γc to be reached in the paediatric study programme (dropping the assumption of equal level of evidence for adults and children in equation 3).
To understand the high dependency on 1 − γc, it is helpful to look at the level of evidence in terms of odds: assuming 1 − γc = 0.9, the corresponding odds in favour of the alternative against the null hypothesis are 9:1, whereas for 1 − γc = 0.9992 in the two-pivotal study programme, the odds are 1249:1. Less variability is observed when the power in the paediatric study programme 1 − βc is chosen differently to that in the adult population. Variation of the targeted level of evidence in the adult population 1 − γa also does not severely affect the relationship.
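
The conversion from a level of evidence to odds is a one-line calculation; the following small sketch (the function name is hypothetical) reproduces the two values quoted above.

```python
def odds_in_favour(probability):
    """Odds of the alternative against the null for a posterior probability."""
    return probability / (1.0 - probability)


# Levels of evidence 0.9 and 0.9992 expressed as odds (rounded to integers).
print(round(odds_in_favour(0.9)), round(odds_in_favour(0.9992)))  # 9 1249
```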

A common regulatory practice to ask for a single study in the paediatric population would, in our framework, correspond to low scepticism of experts about the appropriateness of extrapolation from the adult population. It therefore seems questionable, in particular with regard to the significance level 0.025² (which mimics the two pivotal studies paradigm of drug regulation), whether the α-level adjustment approach in small populations is a feasible way from the perspective of the responsible experts who have to decide at an early stage of the drug development process. In very small populations, this will have to be further relaxed for feasibility reasons, so that only smaller positive predictive values following a successful development programme will be achievable. Such a relaxation, in combination with post-marketing research, may be very reasonable in indications where no accepted efficacious therapy is available. One of the purposes of this paper is to bridge frequentist and Bayesian arguments and create a framework to compare different approaches and to weigh evidence from data in (smaller) trials against eminence (expert knowledge) in this specific environment of decisions under uncertainty in medicine. The importance of decisions on extrapolation, for example, in the paediatric population can be seen from the review [15].

A methodological spin-off in our framework is that simple two-point priors may be used in the argumentation as they, under some constraint, have some useful limiting properties among all other prior distributions. It should be noted that we tried to use our framework to portray existing decision structures in drug regulation. However, it could also be used in different contexts, for example, fixing the positive and negative predictive values of regulatory decisions, including utilities/losses or simply backward calculating a ‘virtual’ scepticism if only a small sample size in the target population is available and cannot be based on calculations with a targeted power. It has to be stressed that our framework refers to a very early phase of drug development in the target population: on the one hand, it is understandable that regulators aim at binding commitments of drug developers on which and how much evidence will be supplied for the registration of a new drug in children. In the present legislation, such a commitment is even a condition for registration of the drug in adults. On the other hand, the actual trials in the paediatric population often start only when, or even well after, the drug has been registered in adults. Hence, the environment of extrapolation is likely to change once data from adult studies become available. Consequently, by the logic of science, it is reasonable to consider adaptations of the agreed paediatric development programme. In the legislation, the request for modification of an approved PIP, in the EU to be dealt with by the Paediatric Committee of the EMA, is an appropriate way to deal with this learning-from-experience situation. Other Bayesian approaches using data from the source population [20-23] may be applied to adaptively modify the preplanned paediatric development programme. This may be achieved in practice by allowing for the option of an adaptive PIP as an example of an adaptive licensing approach [24].
It seems to be reasonable that more emphasis in research and application should be put also on this stage of developing new drugs in children.

Acknowledgements

We thank the editor, the associate editor and two reviewers for their constructive criticism, which helped us to improve the quality of the paper. The research leading to these results has received funding from the EU Seventh Framework Programme [FP7 2007-2013] under grant agreement no. 602552. Martin Posch was supported by EU FP7 HEALTH.2013.4.2-3 grant no. 603160.

    Appendix A

    The Neyman–Pearson test and the use of two-points prior in Bayesian context

    Let Θ denote the parameter space. By setting two points Δ0 and Δ1 with Δ0 < Δ1, three subsets of Θ can be distinguished, namely, the null hypothesis H0 = (−∞, Δ0], and furthermore, urn:x-wiley:sim:media:sim6865:sim6865-math-0106 and urn:x-wiley:sim:media:sim6865:sim6865-math-0107.

    We will prove the following result for tests applied on data x coming from distributions fulfilling the monotone likelihood ratio property in T(x) for some statistic T: Given urn:x-wiley:sim:media:sim6865:sim6865-math-0108, calculating the positive predictive value of a Neyman–Pearson test φNP (where φNP=1 again means rejection of the null hypothesis) by using a two-points prior results in a lower bound in the sense that for any other prior π with urn:x-wiley:sim:media:sim6865:sim6865-math-0109, the Bayesian averaged power for the rejection of the null hypothesis urn:x-wiley:sim:media:sim6865:sim6865-math-0110 and the posterior probability Pπ(H1|φNP=1) both will never be smaller. Moreover, the Bayesian averaged type I error rate Pπ(φNP=1|H0) will not exceed the corresponding frequentist type I error of the Neyman–Pearson test.

    This result will be proven in two steps:
    • I

      First, we show that for any test φ the posterior probability P(H1|φ = 1) can always be reduced by choosing urn:x-wiley:sim:media:sim6865:sim6865-math-0111 and urn:x-wiley:sim:media:sim6865:sim6865-math-0112.

    • II

      Then, we derive that in the class of priors with urn:x-wiley:sim:media:sim6865:sim6865-math-0113 and urn:x-wiley:sim:media:sim6865:sim6865-math-0114, by using a two-point prior together with a Neyman–Pearson test, we have a lower bound for urn:x-wiley:sim:media:sim6865:sim6865-math-0115 and Pπ(H1|φNP=1) and an upper bound for Pπ(φNP=1|H0).

    Let urn:x-wiley:sim:media:sim6865:sim6865-math-0116 denote a mixture of two Dirac distributions: urn:x-wiley:sim:media:sim6865:sim6865-math-0117 with an arbitrary p > 0. When φNP denotes a Neyman–Pearson test with error rates αNP and βNP, it holds that the positive predictive value is
    urn:x-wiley:sim:media:sim6865:sim6865-math-0118
    with urn:x-wiley:sim:media:sim6865:sim6865-math-0119. For the sake of completeness of our description, we note that the Neyman–Pearson test in our setting is described by the statistic T(x) that occurs in the monotone likelihood ratio condition, and a threshold kNP, which defines the rejection region {x:T(x) > kNP}.
    It is easy to show that calculating the positive predictive value of a Neyman–Pearson test by using the earlier defined two-points prior with urn:x-wiley:sim:media:sim6865:sim6865-math-0120 results in a lower bound for both the set of all possible positive predictive values and the Bayesian averaged power for all priors with fixed prior probability urn:x-wiley:sim:media:sim6865:sim6865-math-0121. First, we notice that generally the positive predictive value has the following form:
    urn:x-wiley:sim:media:sim6865:sim6865-math-0122
    The partial derivative of this function with respect to urn:x-wiley:sim:media:sim6865:sim6865-math-0123 is then
    urn:x-wiley:sim:media:sim6865:sim6865-math-0124
    which is positive. Therefore, decreasing urn:x-wiley:sim:media:sim6865:sim6865-math-0125 to its smallest possible value 0 minimizes the positive predictive value:
    urn:x-wiley:sim:media:sim6865:sim6865-math-0126
    This concludes part I of the proof.
    For part II of the proof, we now show that for any prior π with urn:x-wiley:sim:media:sim6865:sim6865-math-0127, it holds
    urn:x-wiley:sim:media:sim6865:sim6865-math-0128
    From part I, we have urn:x-wiley:sim:media:sim6865:sim6865-math-0129. If it can be shown that the inequalities urn:x-wiley:sim:media:sim6865:sim6865-math-0130 and Pπ(φNP=1|H0)≤αNP hold, then we obtain
    urn:x-wiley:sim:media:sim6865:sim6865-math-0131
    (note that the partial derivative of the left side with respect to urn:x-wiley:sim:media:sim6865:sim6865-math-0132 is positive and urn:x-wiley:sim:media:sim6865:sim6865-math-0133), and our main result is proven because the right side of the last inequality is equal to urn:x-wiley:sim:media:sim6865:sim6865-math-0134.
    We now prove Pπ(φNP=1|H0)≤αNP: by using the Fubini–Tonelli theorem, we have
    urn:x-wiley:sim:media:sim6865:sim6865-math-0135
    The inequality in the fifth line of the last display follows from the monotonicity of the power function as a direct result of the Neyman–Pearson lemma for one-sided tests.

    The second inequality urn:x-wiley:sim:media:sim6865:sim6865-math-0136 follows analogously by using again Fubini–Tonelli and the monotonicity of the power.
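
The two averaging arguments can be checked numerically for a simple case. The sketch below uses a one-sided z-test with Δ0 = 0 and Δ1 = 1 and two hypothetical discrete priors, one supported on the null hypothesis and one on effects of at least Δ1 (assumed here to be the relevant alternatives). It illustrates the monotonicity argument and is not part of the proof.

```python
from statistics import NormalDist


def power_function(delta, alpha=0.025, power=0.9, delta0=0.0, delta1=1.0):
    """Rejection probability of a z-test sized for `power` at delta1."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1.0 - alpha)
    z_power = normal.inv_cdf(power)
    drift = (delta - delta0) / (delta1 - delta0) * (z_alpha + z_power)
    return normal.cdf(drift - z_alpha)


def averaged_rejection(prior):
    """Average of the power function over a discrete prior [(delta, weight)]."""
    return sum(weight * power_function(delta) for delta, weight in prior)


# Hypothetical priors (weights sum to 1 within each set).
null_prior = [(-1.0, 0.2), (-0.3, 0.3), (0.0, 0.5)]       # support in H0
alternative_prior = [(1.0, 0.4), (1.5, 0.4), (2.5, 0.2)]  # support >= delta1

bayes_type_one_error = averaged_rejection(null_prior)
bayes_power = averaged_rejection(alternative_prior)
print(round(bayes_type_one_error, 4), round(bayes_power, 4))
```

Because the power function is increasing in Δ, the averaged type I error stays below 0.025 and the averaged power stays above 0.9 for these priors.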
