Clinical outcome endpoints in heart failure trials: a European Society of Cardiology Heart Failure Association consensus document
Abstract
Endpoint selection is a critically important step in clinical trial design. It poses major challenges for investigators, regulators, and study sponsors, and it also has important clinical and practical implications for physicians and patients. Clinical outcomes of interest in heart failure trials include all-cause mortality, cause-specific mortality, relevant non-fatal morbidity (e.g. all-cause and cause-specific hospitalization), composites capturing both morbidity and mortality, safety, symptoms, functional capacity, and patient-reported outcomes. Each of these endpoints has strengths and weaknesses that create controversies regarding which is most appropriate in terms of clinical importance, sensitivity, reliability, and consistency. Not surprisingly, a lack of consensus exists within the scientific community regarding the optimal endpoint(s) for both acute and chronic heart failure trials. In an effort to address these issues, the Heart Failure Association of the European Society of Cardiology (HFA-ESC) convened a group of expert heart failure clinical investigators, biostatisticians, regulators, and pharmaceutical industry scientists (Nice, France, 12–13 February 2012) to evaluate the challenges of defining heart failure endpoints in clinical trials and to develop a consensus framework. This report summarizes the group's recommendations for achieving common views on heart failure endpoints in clinical trials.
Introduction
Endpoint selection is one of the most critical components of clinical trial design. Large pivotal heart failure trials are designed to provide robust evidence that may support regulatory approval, extension of indications, consolidation or rejection of therapeutic strategies, and reimbursement claims. Thus, the efficacy endpoints for these studies usually reflect total and/or cause-specific mortality, morbidity, or clinician-interpreted (NYHA class) or patient-reported outcomes [Minnesota Living With Heart Failure Questionnaire (MLWHFQ) or dyspnoea], either alone or in combination (Figure 1A and B).1 Measures of functional status may be used as endpoints, and they may be adequate to support regulatory approval of a treatment, provided there are sufficient safety data. More than one trial replicating an effect on functional outcomes may be required for regulatory approval. However, endpoints that measure functional status (e.g. exercise tolerance or oxygen consumption) or reflect manifestations of disease pathophysiology (e.g. biomarkers or remodelling variables) are typically applied in earlier phases of drug or device development to support proof-of-concept, demonstrate dose responsiveness, and/or provide preliminary evidence of safety and efficacy. Mortality and morbidity endpoints for pivotal trials are the primary focus of this paper.
The therapeutic agent under investigation should have a mechanistically plausible effect on the chosen primary endpoint. The primary endpoint must be clinically relevant, important to both patients and healthcare providers, measurable, and responsive to therapeutic interventions such that it distinguishes between effective and ineffective therapies. It should be robust, with minimal bias or other confounding factors. The choice of endpoint is further influenced by the target patient population (e.g. acute vs. chronic heart failure) and treatment objective (e.g. reduction of morbidity and/or mortality vs. symptomatic improvement) (Figure 1A and B).1 The primary efficacy endpoint is a key determinant of sample size estimates, since these are determined by the expected event rate of the endpoint, its variability, and the expected effect size over standard care. Consensus has not been reached on the optimal phase III endpoints in acute or chronic heart failure trials, but this topic is a central priority as evidenced by the attention it has received in the medical literature.2–7

The Heart Failure Association of the European Society of Cardiology (HFA-ESC) convened a group of expert heart failure clinical trialists, biostatisticians, regulators, and pharmaceutical industry scientists (Nice, France, 12–13 February 2012) to evaluate the challenges of defining heart failure endpoints in clinical trials and to develop a consensus framework. This report summarizes the group's recommendations for moving towards consensus (Table 1), and it identifies areas of uncertainty where more research is needed (Table 2).
Endpoint | Consensus |
---|---|
Mortality endpoints |
|
Heart failure hospitalization |
|
Recurrent morbid event endpoints |
|
Endpoints other than hospitalization |
|
Symptom and patient-reported health outcomes |
|
Clinical composite endpoints |
|
Safety endpoints |
|
Recurrent morbid event endpoints |
|
Non-hospitalization endpoints |
|
Symptoms and patient-reported health outcomes |
|
Clinical composite endpoints |
|
Safety endpoints |
|
Specific endpoints used in phase III heart failure clinical trials
Heart failure is a syndrome with a wide spectrum ranging from asymptomatic LV dysfunction to end-stage heart failure. Within this spectrum, patients may either have heart failure with reduced ejection fraction (HF-REF) or heart failure with preserved ejection fraction (HF-PEF). They may also oscillate between periods of stability, where they are generally managed well as outpatients (i.e. chronic heart failure), and periods of decompensation requiring hospitalization [i.e. hospitalized heart failure or acute heart failure (AHF)]. Selection of the appropriate endpoint should take into account the unique differences (e.g. expected event rates) within these subsets of heart failure. For the purposes of this paper, endpoints are discussed within the general context of heart failure. Where relevant, considerations that are important for trials targeting a specific heart failure subset are also provided.
Mortality endpoints
All-cause death vs. cause-specific death
No drug can be expected to affect all causes of death (or hospitalization); rather, a treatment may reduce all-cause mortality by reducing the major cause or causes of death. The choice between all-cause events and cause-specific events depends on whether the objective is only to reflect the benefit of the drug, in which case the endpoint should be as specific as expectations permit, or if the objective is to reflect net benefit, in which case a non-specific (i.e. all-cause) endpoint clearly shows that the benefit is not obscured by adverse effects or noise from completely unrelated events. All-cause endpoints, especially all-cause mortality, should reduce bias in unblinded trials. For other endpoints, a blinded endpoint adjudication committee is key in all unblinded trials. Both total death and cause-specific death are important to quantify in chronic and acute heart failure trials to achieve a comprehensive evaluation of safety, efficacy, and health economics.
Recent evidence suggests that non-cardiovascular deaths are increasing among patients with heart failure, particularly in some subsets such as patients with HF-PEF.8 Therefore, cardiovascular death is a more efficient endpoint than all-cause mortality if non-cardiovascular death is expected to account for a substantial portion of deaths accrued in a trial, and if the effect of the therapy on non-cardiovascular mortality is expected to be neutral. Otherwise, if all-cause mortality is the primary endpoint, sample sizes will need to be increasingly large to account for the ‘random noise’ added by non-cardiovascular deaths.9 Even when a primary endpoint of cause-specific mortality is significantly reduced by an intervention, total mortality and non-cardiovascular deaths should be included as key safety endpoints to capture potential adverse effects. A directionally similar decrease in overall mortality (even if not statistically significant) should be demonstrated without an adverse increase in non-cardiovascular deaths. In acute heart failure trials, although long-term event rates are high, it may be difficult to demonstrate efficacy on a mortality endpoint since acute therapies primarily target symptoms and are administered for a short duration. Thus, employing total mortality as a key safety measure may be more appropriate than as a measure of efficacy in acute heart failure trials. It has been hypothesized that a short-term therapy could reduce long-term mortality if it ameliorated acute myocardial injury that may occur in the setting of worsening heart failure (similar to thrombolysis in acute coronary syndrome), but such an effect has not been demonstrated to date. Certainly the results of the RELAXin Acute Heart Failure (RELAX-AHF) trial10 suggest that such a possibility exists, although the mechanisms of intermediate-term (6 months) benefits after a short-term (48 h) serelaxin administration are yet to be understood.10
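The sample size penalty described above can be illustrated with a rough calculation. The sketch below is not taken from the consensus document. It assumes constant, independent cardiovascular and non-cardiovascular hazards, a therapy that leaves the non-cardiovascular hazard unchanged, 1:1 randomization, and the standard Schoenfeld approximation for the number of events required by a log-rank comparison. The function names, the hazard ratio of 0.80, and the cardiovascular fractions are hypothetical.

```python
import math
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.90):
    """Schoenfeld approximation: total events needed (1:1 allocation, two-sided alpha)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2


def all_cause_hazard_ratio(cv_hazard_ratio, cv_fraction):
    """All-cause hazard ratio when only the cardiovascular hazard is reduced.

    cv_fraction is the share of the control-arm death hazard that is
    cardiovascular; the non-cardiovascular share is assumed to be unaffected.
    """
    return cv_fraction * cv_hazard_ratio + (1.0 - cv_fraction)


cv_hr = 0.80  # hypothetical treatment effect on cardiovascular death
for cv_fraction in (1.0, 0.8, 0.6):  # hypothetical shares of cardiovascular death
    hr_all = all_cause_hazard_ratio(cv_hr, cv_fraction)
    deaths_needed = math.ceil(required_events(hr_all))
    print(f"CV fraction {cv_fraction:.1f}: all-cause HR {hr_all:.2f}, "
          f"deaths needed {deaths_needed}")
```

Under these assumptions, the all-cause hazard ratio moves towards 1 as the non-cardiovascular share grows, so the number of deaths required rises steeply. The comparison is in events, not patients: a cardiovascular death endpoint needs fewer events but also accrues them more slowly than an all-cause endpoint.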
Adjudication of cause-specific death
Although endpoints of cause-specific death or hospitalization are often evaluated by adjudication of protocol-specified definitions by a committee with the appropriate range of clinical and statistical expertise, it has been questioned whether adjudication adds value in cardiovascular trials.11,12 The site investigator often has access to more information than is captured on the case report forms. If the study is unblinded (by design or treatment effects) or the site investigators (or study personnel) lack expertise to adjudicate outcomes, adjudication may reduce noise by applying event definitions consistently and should be performed; otherwise, the benefit of adjudication is less clear, even when site-reported and adjudicated results differ. Adjudication is also expensive, reducing the resources that otherwise might contribute to a more robust (larger) study or to exploring more than one dose. Despite these limitations, data suggest that clinical event adjudication improves the precision of classifying events in a clinical trial.13–18 Hence, the need for adjudication remains unresolved (Table 2).
The mode of death in patients with heart failure is frequently difficult to determine. It is commonly agreed that heart failure deaths may follow progressive symptomatic worsening (‘pump failure death’) or occur suddenly (‘sudden cardiac death’). Sudden death is the primary cause of death for the majority of patients with mild to moderate heart failure; pump failure death is more common among patients with advanced symptoms.9,19–22 The determination of the cause of death in patients found dead is problematic, and it is debated whether these should be categorized as ‘unknown’ or ‘presumed cardiovascular’. Attributing ‘unknown’ cause of death to a specific cause (e.g. defaulting to sudden death or presumed cardiovascular death) is problematic when there are insufficient data and may explain why, in some trials, the numbers of sudden deaths and/or cardiovascular deaths may be overestimated. However, where a patient has had no recent contact with medical services, has no obvious cause of death (e.g. metastatic cancer), and is found dead at home, then this provides strong circumstantial evidence of sudden death. Place of death should be reported more often in trials as well as the events and patient status in the month prior to death.23
Morbidity and clinical composite endpoints
Heart failure hospitalization
Hospitalization for heart failure is clinically meaningful to patients, physicians, and regulators, and it correlates with disease progression and prognosis. Hospitalizations are relevant to payers because they create the majority of the health economic burden of heart failure.24–29
Despite the clinical relevance of hospitalizations for heart failure, their use as an endpoint in heart failure trials must take into account some limitations. The decision to hospitalize a patient with heart failure is often based on subjective criteria. The threshold for hospitalization is highly variable across (and within) regions of the world, which may affect the interpretability and applicability of study results in specific regions, particularly in global trials.30 As models of healthcare delivery evolve in response to economic pressures, care for many patients may shift from the hospital setting to short stay or observation unit settings, or patients may be treated with i.v. therapies in outpatient heart failure clinics,31,32 and this is recognized as part of the Standardized Data Collection for Cardiovascular Trials initiative definition.33 The decision to hospitalize can also be driven by external factors unrelated to the patient's clinical status (e.g. local availability of hospital facilities, day of the week or time of day the patient presents, or physician and patient local attitudes and traditions).
Local standards of care (such as length of stay or availability of out-of-hospital treatment resources) may differ across the sites and regions participating in a clinical trial. The mean length of stay ranges from 6 to >10 days in Europe, ∼21 days in Japan, and 4–6 days in the USA.34–40 Length of stay can also be influenced by non-clinical factors, such as provider staffing.41 For studies where patients are enrolled during a hospitalization, the length of the index admission influences the time available to accrue subsequent hospitalization events. Patients (or regions) with a long index length of stay may have lower rates of re-hospitalization during follow-up, thus confounding the interpretation of the hospitalization endpoint. Showing a consistent effect on the rate of HF hospitalization across geographical regions strengthens the results of any trial. Stratifying enrolment by region is one approach that may balance the standard of care within regions across treatment arms.
Worsening heart failure symptoms and signs during the index hospitalization (>24 h after study drug initiation through to discharge or 7 days) may be used as a component of the primary composite endpoint.10,42 In RELAX-AHF, this component was one of the most favourably affected by serelaxin treatment, compared with placebo.10 This approach may capture important non-fatal events that occur prior to discharge from an index hospitalization. However, adjudication of such events can be difficult, and specific criteria must be pre-defined to ensure that the events are captured consistently across sites.
Defining heart failure hospitalizations
Heart failure hospitalizations have been inconsistently defined across clinical trials.43 This practice limits the ability to interpret data and gain an in-depth understanding of treatment effects across trials. Harmonizing definitions for heart failure hospitalization endpoints would be useful to achieve consistency similar to that achieved with myocardial infarction, and it would minimize the influence of external factors (cultural or societal practices).44 Progression towards adopting specific criteria that define a heart failure hospitalization or equivalent event is imperative. Efforts are under way to standardize endpoint events in cardiovascular trials, including heart failure hospitalizations.33 Meanwhile, this group has reached specific recommendations (Table 3). These criteria may help to differentiate a heart failure hospitalization from a non-heart failure hospitalization (e.g. AF where BNP is often elevated).
Category | Definition |
---|---|
Heart failure hospitalization | A hospitalization requiring at least an overnight stay in-hospital caused by substantive worsening of heart failure symptoms (although, admittedly, the decision to admit a patient for worsening heart failure may vary subjectively across centres and more so across various healthcare systems) and/or signs requiring the augmentation (an increase in the dose or frequency of administration) of oral medications or new administration of i.v. heart failure therapy, including inotropes, diuretics, or vasodilators, ideally pre-defined in the critical events manual. |
(a) The Standardized Data Collection for Cardiovascular Trials initiative defines heart failure hospitalization as an admission to an inpatient unit or a visit to an emergency department that results in at least a 24 h stay (or a date change if the time of admission/discharge is not available; the required duration of stay should be flexible depending on the population and drug profile under study), AND at least one new or worsening clinical symptom of heart failure, AND at least two physical signs of heart failure, AND need for additional/increased therapy, AND no other non-cardiac or cardiac aetiology is identified.33 |
- a This definition was developed by the Standardized Data Collection for Cardiovascular Trials initiative, a working group of academicians, professional societies, Clinical Data Interchange Standards Consortium (CDISC), Health Level 7, Clinical Trials Transformation Initiative (CTTI), industry, and the Food and Drug Administration (FDA).
- b While experts agree there is value in standardizing definitions, some experts are concerned that the definition suggested by the Standardized Data Collection for Cardiovascular Trials initiative is too restrictive and may result in low sensitivity to detect heart failure events (i.e. lower event rates), and potential loss of safety signals.
Recurrent morbid events
Recurrent hospitalizations are a common occurrence in patients with heart failure, and they impose a substantial clinical and economic burden on patients, caregivers, physicians, and health systems. Despite their importance, repeat events are ignored in the majority of clinical trials in favour of ‘time to first event’ analyses.45 The primary ‘time to first event’ analysis in the Eplerenone in Mild Patients Hospitalization and Survival Study in Heart Failure (EMPHASIS-HF) did not consider second or subsequent heart failure hospitalizations, but these accounted for 42% of the total admissions for heart failure in the placebo group.46 In CHARM-Preserved, only 53% of heart failure hospitalizations and 57% of cardiovascular deaths were included in the primary conventional analysis. Methodologies accounting for repeat events that are clinically meaningful to patients and physicians may both achieve practical gains (e.g. increased statistical power with smaller sample sizes due to higher number of events) and better characterize and quantify the patient's journey throughout the follow-up period.47 For instance, analyses using recurrent events increase statistical power substantially, which could potentially halve the sample size compared with conventional time to first event analyses.48 Several limitations of this approach warrant consideration, including the influence of variations in standard practice patterns across the world (see above), event clustering in a small proportion of patients, and confounders related to mortality differences in those patients who are hospitalized vs. patients who are not.46 Regulatory authorities seem to be willing to consider the use of recurrent events as a primary endpoint in future heart failure trials.
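The information discarded by a ‘time to first event’ analysis can be made concrete with a small count. The sketch below uses invented toy data (the patient identifiers and hospitalization days are hypothetical) and only tallies events; it is not a recurrent event model, and it ignores deaths and differing follow-up.

```python
# Hypothetical toy data: days (from randomization) of each heart failure
# hospitalization for four patients. Not taken from any trial.
hospitalization_days = {
    "p1": [30, 95, 210],
    "p2": [],
    "p3": [120],
    "p4": [15, 40],
}


def event_counts(events_by_patient):
    """Return (events used by a time-to-first-event analysis, all events)."""
    first_events = sum(1 for days in events_by_patient.values() if days)
    all_events = sum(len(days) for days in events_by_patient.values())
    return first_events, all_events


first_events, all_events = event_counts(hospitalization_days)
# Share of hospitalizations that are second or later events for a patient
repeat_share = (all_events - first_events) / all_events
print(f"first events: {first_events}, all events: {all_events}, "
      f"repeat share: {repeat_share:.0%}")
```

In this toy data set, half of the hospitalizations are repeat events that a first-event analysis would not count.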
The ‘days alive and out of hospital’ endpoint incorporates the components of days in hospital (including days of the index hospitalization and repeat hospitalizations), days alive and not in hospital, and days dead into a single measure over a defined time frame (e.g. 30 or 60 days). This endpoint was developed to address the issue of repeat hospitalizations for all causes, but it is limited in its ability to weight the relative importance of deaths vs. repeat hospitalizations.49 With this approach, early deaths carry much greater weight than multiple recurrent hospitalizations followed by death very late in follow-up. The ‘patient journey’ concept creates a symptom-adjusted (and quality-adjusted if quality of life scores are collected) days alive and out of the hospital endpoint.50,51 This proposal creates a new space for research and debate, with the potential added value of information that might be useful to health economic interests.
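A minimal sketch of how a ‘days alive and out of hospital’ value might be computed for one patient is given below. The day-counting convention (day 0 is randomization, a stay occupies the days from admission up to but not including discharge), the function name, and the example patients are assumptions made for illustration, not definitions from this document.

```python
def days_alive_out_of_hospital(window_days, hospital_stays, death_day=None):
    """Days alive and out of hospital within a fixed window.

    window_days:    length of the assessment window (e.g. 30 or 60 days)
    hospital_stays: list of (admit_day, discharge_day) pairs; a stay occupies
                    days admit_day .. discharge_day - 1 (assumed convention)
    death_day:      day of death, or None if alive at the end of the window
    """
    alive_until = window_days if death_day is None else min(death_day, window_days)
    in_hospital = set()  # a set avoids double counting overlapping stays
    for admit, discharge in hospital_stays:
        in_hospital.update(range(max(admit, 0), min(discharge, alive_until)))
    return alive_until - len(in_hospital)


# Hypothetical patients over a 60-day window
survivor = days_alive_out_of_hospital(60, [(0, 7), (30, 35)])
early_death = days_alive_out_of_hospital(60, [(0, 7)], death_day=10)
late_death = days_alive_out_of_hospital(60, [(0, 7), (20, 27), (40, 47)], death_day=58)
print(survivor, early_death, late_death)
```

In this toy example the patient who dies on day 10 scores 3, whereas the patient with three admissions who dies on day 58 scores 37. This mirrors the weighting issue noted above: an early death dominates the measure far more than repeated hospitalizations followed by a late death.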
A novel method has recently been proposed to overcome the limitations associated with time to event analyses: the ‘win ratio’.45–47,52 Patients in the new and control treatment arms are formed into matched pairs based on their risk profiles. Consider a primary composite endpoint, e.g. cardiovascular death and heart failure hospitalization, in heart failure trials. For each matched pair, the study treatment patient is labelled a ‘winner’ or a ‘loser’ if it is known who had a cardiovascular death first. If that is not known, they are labelled as a ‘winner’ or ‘loser’ if it is known who had a heart failure hospitalization first. Otherwise, they are considered tied. The win ratio is the total number of winners divided by the total number of losers. A 95% confidence interval and P-value for the win ratio are readily obtained. If formation of matched pairs is impractical, then an alternative win ratio can be obtained by comparing all possible unmatched pairs. This methodology places a greater emphasis on death, but it is still able to consider time in the analysis. Multiple hospitalizations can also be considered in this approach. Experience with this approach is still limited, and it will require further validation to gain regulatory acceptance. Investigators should consider including the win ratio method as a planned supplementary analysis in conjunction with existing standard analytic methods so that additional experience with the win ratio method can be accrued.47 A consistent analytical strategy needs to be agreed upon among clinical trial experts in the field.
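The unmatched version of the win ratio described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation. Each pair is compared only over the follow-up time shared by both patients, cardiovascular death is checked before the first heart failure hospitalization, repeat hospitalizations are ignored, and no confidence interval is computed. The `Patient` fields, the function names, and the toy data are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass
class Patient:
    follow_up: float                   # days until death or end of observation
    cv_death: Optional[float] = None   # day of cardiovascular death, if any
    hf_hosp: Optional[float] = None    # day of first HF hospitalization, if any


def _earlier_event_loses(t_time, c_time):
    """+1 if the control had the event first, -1 if the treated patient did, else 0."""
    if t_time is None and c_time is None:
        return 0
    if t_time is None:
        return 1
    if c_time is None:
        return -1
    return (t_time > c_time) - (t_time < c_time)


def compare(treated, control):
    """Return +1 (treated wins), -1 (treated loses), or 0 (tie) for one pair."""
    common = min(treated.follow_up, control.follow_up)

    def within(time):
        # Keep only events observed while both patients were under follow-up
        return time if time is not None and time <= common else None

    # Hierarchy: cardiovascular death first, then HF hospitalization
    for t_time, c_time in ((treated.cv_death, control.cv_death),
                           (treated.hf_hosp, control.hf_hosp)):
        result = _earlier_event_loses(within(t_time), within(c_time))
        if result != 0:
            return result
    return 0


def win_ratio(treated_arm, control_arm):
    """Compare all unmatched pairs; return (ratio, wins, losses, ties)."""
    results = [compare(t, c) for t, c in product(treated_arm, control_arm)]
    wins, losses, ties = results.count(1), results.count(-1), results.count(0)
    if losses == 0:
        raise ValueError("win ratio is undefined when there are no losses")
    return wins / losses, wins, losses, ties


# Hypothetical toy data (days)
treated_arm = [Patient(365), Patient(365, hf_hosp=200), Patient(300, cv_death=300, hf_hosp=100)]
control_arm = [Patient(365, hf_hosp=150), Patient(180, cv_death=180, hf_hosp=90), Patient(365)]
ratio, wins, losses, ties = win_ratio(treated_arm, control_arm)
print(f"wins={wins} losses={losses} ties={ties} win ratio={ratio:.2f}")
```

With these nine pairs, the treated arm has 5 wins, 3 losses, and 1 tie, giving a win ratio of about 1.67. A matched-pairs analysis would replace the all-pairs comparison with risk-matched pairs, which requires a risk score that this sketch does not attempt to define.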
Non-heart failure hospitalizations
Diagnoses other than heart failure (e.g. cardiac dysrhythmia or acute coronary syndrome) are the primary reason for most hospitalizations in patients with heart failure.53 Many cardiovascular (and non-cardiovascular) factors can exacerbate heart failure and lead to hospital admission.54 Cardiovascular conditions other than heart failure that necessitate hospitalization may share many features (e.g. similar symptoms or elevated BNP) with heart failure. It is often difficult to identify the importance of heart failure as the reason for admission or prolongation of hospitalization. A well-designed case report form should record the importance of heart failure (e.g. primary, contributory, or non-contributory) both to admission and to length of stay. When an endpoint of heart failure hospitalization is reduced by an intervention, it is important to document that this was not accompanied by an increase in other admissions. A reduction in all-cause hospitalization is even more reassuring. Hospitalizations contribute to both safety and economic analyses.
Non-hospitalization endpoints
Worsening heart failure without hospitalization
Patients are often treated for symptoms of worsening heart failure in non-hospital settings.26 Thus, focusing on heart failure hospitalizations alone may fail to characterize the full spectrum of progressive heart failure. The outpatient management of worsening heart failure (in either the emergency department, observation units, or other outpatient settings) is expected to increase as the expected growth in the heart failure population is realized and reimbursement models shift to encourage lower rates of hospitalization or penalize recurrent admissions. Thus, the development of methods to capture these ‘hospitalization equivalent’ events is warranted (Table 2).
Similar to the challenges described for hospitalization endpoints, treatment practices for worsening heart failure vary substantially among individual physicians as well as regions of the world.30 Since most endpoints that capture worsening heart failure without hospitalization require a treatment component to define the event, heterogeneity in treatment practices creates substantial challenges for the analysis and interpretation of such data. Although standard definitions for heart failure requiring hospitalization33 and also for in-hospital worsening of heart failure55 have been proposed, the issue of worsening heart failure without hospitalization (or hospitalization equivalent events) has not been addressed.
One concern with considering outpatient heart failure events is that they may be less meaningful outcomes that increase the event rate but dilute the treatment effect. However, data suggest that the presence of non-hospitalization events still indicates a high risk population.56 If these events are used in a composite endpoint, weighting of the endpoints is key to avoid situations where the least important element drives the overall result and conclusion.57–59
Although these events may be substantially important to the clinical course and prognosis of patients with heart failure, their use may result in an unacceptable lack of specificity linked to the uncertain diagnosis, and they may be highly susceptible to observation bias. Adjudicating non-hospitalization heart failure events is one approach to address these problems, but it has not been widely used for this purpose in clinical trials to date, and its utility remains uncertain. It may be difficult to determine the underlying cause of the event for patients with a history of heart failure who present with dyspnoea in the setting of a recent respiratory illness (e.g. influenza or upper respiratory infection) or other concomitant medical condition. Adjudication committees are limited by the documentation available to them. If medical records are incomplete or lack sufficient detail, then the ability to determine the cause of the event precisely is limited. The physician caring for the patient may be better suited to make the judgement of worsening heart failure in such circumstances. More experience is needed with non-hospitalized heart failure endpoints before a conclusion can be drawn regarding the need, or lack thereof, for event adjudication.
Implantable cardioverter defibrillator shocks
Life-threatening arrhythmia [ventricular tachycardia (VT) or ventricular fibrillation (VF)] is a potential endpoint that has not been widely used in clinical heart failure trials to date. Implantable cardioverter defibrillators (ICDs) are indicated in many patients with heart failure, and they act to terminate arrhythmias that would otherwise be fatal. Using VT/VF as an endpoint was challenging prior to the widespread use of ICDs, since most episodes were undocumented. Recent registry and clinical trial data from the USA suggest that ∼40–50% of HF-REF patients have ICDs.60,61 In Europe, the rate of ICD use is widely variable across countries, but it is increasing among most countries where data have been collected.62 Data from the EURObservational Research Program ESC-HF pilot showed that 32.7% of patients with clinical characteristics suggesting they had an indication for an ICD received the device.35
The availability of detailed information through device interrogation has renewed interest in considering VT/VF as an endpoint in heart failure trials for several reasons. Arrhythmic events compete with pump failure as the primary mode of death in patients with heart failure. ICDs prevent fatal arrhythmic death, which may impact the overall magnitude of effect for a new therapy on total mortality, particularly if the therapy is more likely to reduce arrhythmic events than pump failure events. Additionally, many drugs have been found to be proarrhythmic, particularly in the heart failure population. Such an effect might be missed if arrhythmic events are not considered. Finally, VT/VF is a marker of worsening heart failure and may be a useful indicator of heart failure progression.
Limitations exist to the use of VT/VF episodes as endpoints. Importantly, not all episodes of VT are life threatening. The frequency of detected episodes depends on device programming, which differs across patients. Devices can be programmed to terminate fast VT by antitachycardia pacing instead of shocks. This therapy is delivered faster than shocks, and it is possible that some episodes of VT treated by antitachycardia pacing would not have required delivery of a shock (i.e. they would have been non-sustained even if the delivery of antitachycardia pacing were delayed).63 Inappropriate shocks occur, and the available evidence suggests that device programming can greatly influence shock rates.64 In the Multicenter Automatic Defibrillator Implantation II Trial (MADIT II), the likelihood of patients experiencing ≥1 inappropriate shocks was 13% after 2 years, and inappropriate shocks accounted for 31.2% of all shock events.65 Inappropriate shocks were associated with increased all-cause mortality during follow-up.65 Other observational studies have shown similar results.66 Adjudication of shocks is mandatory for any clinical trial using such events as surrogates for fatal arrhythmias to minimize the noise associated with inappropriate shocks. While the process is strengthened by the availability of remote monitoring and stored ICD ECGs, the quality of ECG recordings is generally poor and reviewers would be limited by the lack of clinical documentation surrounding such events (e.g. symptoms or vital signs). Discordance among reviewers may also be problematic in some scenarios.67 Finally, the global variation in ICD use limits the use of shocks as an endpoint in international trials.68
Symptoms and patient-reported outcomes
Dyspnoea
Dyspnoea is the most common symptom reported among patients presenting with AHF,34,37,38,69–71 and it has been widely adopted as an endpoint in the majority of AHF trials.2,72 Regulatory authorities may approve a drug if it has a favourable impact on symptoms in the absence of any adverse effect on outcomes. However, dyspnoea as an endpoint is associated with several limitations. Dyspnoea responds well to standard therapy in most patients. Most patients report improvement in dyspnoea by 6 h after initiation of standard therapy.70,73 Early patient identification and rapid drug administration may be required to demonstrate advantages of a new therapy over existing therapies. Clinical trials that require the presence of dyspnoea for inclusion, but allow patients to be enrolled 24–48 h after admission, may capture patients with refractory dyspnoea who are unlikely to respond to therapy.
Patient presentation for worsening heart failure is often heterogeneous. Dyspnoea may be due to volume overload, elevated pulmonary pressures, low cardiac output, deconditioning, co-morbid conditions, or some combination of these factors. Its baseline severity is also variable among patients. In ASCEND-HF, specific clinical variables including older age, oedema on chest radiograph, higher systolic blood pressure, respiratory rate, and natriuretic peptide level, and lower blood urea nitrogen, sodium, and haemoglobin were predictors of early dyspnoea improvement. In addition, substantial geographic variation was noted in dyspnoea relief.74 This heterogeneity makes it more difficult to demonstrate significant dyspnoea reductions in the majority of patients.
The ability to detect changes in dyspnoea may depend on the scale used to measure it or the patient's position when it is assessed (i.e. supine vs. sitting), and an agreed method has not yet been adopted or used consistently in clinical trials.70 Instruments used to assess dyspnoea in clinical trials include 5- or 7-point Likert scales and visual analogue scales (VAS). A Likert scale can be biased by patient recall of symptoms, the level of baseline symptom severity, or patient perception of administered treatments (if a study is unblinded). Intermediate levels of change (e.g. mild vs. moderate on a 7-point scale) may be difficult to differentiate given the large spontaneous variability in this measurement. On the other hand, some investigators propose that the Likert scale is more easily understood by patients than the VAS. However, in RELAX-AHF, the VAS detected the treatment effect better than the Likert scale, in agreement with earlier observations reported by Grant et al.75,76 One analysis showed that the 5-point Likert scale and the VAS have a high degree of agreement in terms of assessing baseline dyspnoea (correlation coefficient 0.891) or change in dyspnoea (correlation coefficient 0.8), whereas less agreement was observed between these instruments and the 7-point Likert scale [correlation coefficient 0.512 (with the 5-point Likert) and 0.500 (with the VAS)].70 The VAS has been effective in detecting dyspnoea improvement in two recent trials, RELAXin Acute Heart Failure (RELAX-AHF)10 and Clevidipine in the Treatment of Blood Pressure in Patients with Acute Heart Failure (PRONTO).77 In these and other studies,76 the VAS improved in the treatment arm compared with standard of care, while other dyspnoea measurement tools remained unchanged.
In PRONTO, improvement in the VAS was evident within a few hours after clevidipine administration compared with the standard of care arm, although it should be noted that it was an open-label trial and dyspnoea improvement was a secondary endpoint. Further studies are needed to confirm that the VAS is sensitive to detect small and/or early changes in dyspnoea and that these changes are clinically relevant. In the interim, using both scales in clinical trials will facilitate more validation research that might lead to evidence-based determination of the best scale for dyspnoea assessments. Importantly, symptoms must be measured on an absolute scale at baseline. Whether it is better then to measure change from baseline by repeated measurement on an absolute scale or asking the patient to report change is less clear. Analytical approaches also differ across trials (e.g. change from baseline to a single time point, change from baseline to multiple time points, or area under the curve). Consensus on the optimal analytic approach is desirable to achieve a common and consistent style of reporting across trials whenever possible.
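The analytical approaches listed above can be illustrated with a minimal sketch. It assumes a 0–100 mm VAS on which higher values indicate better breathing, and it uses the trapezoidal rule for the area under the change-from-baseline curve; the function names, the assessment times, and the patient values are hypothetical and are not taken from any trial.

```python
from typing import List, Sequence


def change_from_baseline(scores: Sequence[float]) -> List[float]:
    """Change from baseline at each later assessment; scores[0] is the baseline."""
    baseline = scores[0]
    return [s - baseline for s in scores[1:]]


def auc_change_from_baseline(hours: Sequence[float], scores: Sequence[float]) -> float:
    """Trapezoidal area under the change-from-baseline curve (units: mm x hours)."""
    if len(hours) != len(scores) or len(hours) < 2:
        raise ValueError("need matching time points and scores, at least two of each")
    baseline = scores[0]
    area = 0.0
    for i in range(1, len(hours)):
        width = hours[i] - hours[i - 1]
        if width <= 0:
            raise ValueError("assessment times must be strictly increasing")
        left = scores[i - 1] - baseline
        right = scores[i] - baseline
        area += 0.5 * (left + right) * width
    return area


# Hypothetical patient assessed at 0, 6, 12, and 24 h (assumed schedule).
hours = [0, 6, 12, 24]
vas = [40, 55, 60, 70]

print(change_from_baseline(vas))             # change at each time point
print(change_from_baseline(vas)[-1])         # change at a single time point (24 h)
print(auc_change_from_baseline(hours, vas))  # area under the curve
```

The three summaries answer different questions: the single time point ignores the trajectory, whereas the area under the curve rewards both earlier and larger improvement. This is one reason why results reported with different analytic approaches are difficult to compare across trials.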
The clinical community has not reached agreement on the meaning of changes in dyspnoea scores and what constitutes an important change. The meaningful change must first be determined; then the incremental level of dyspnoea improvement (both degree of and time to improvement) that a new therapy must demonstrate over standard therapy to be considered cost-effective needs to be established.
Health status and patient-reported health outcomes
Quality of life and patient-reported health outcomes, while traditionally viewed as ‘soft’ endpoints, provide insight into treatment effects from the patient's perspective.78 The therapeutic goal in heart failure patients is not limited to prolonging survival; improving the quality of life gained is equally important. The relative weight given to these outcomes may vary substantially from patient to patient. Thus, patient-reported data from clinical trials are very useful to physicians striving to inform their patients of the relative benefits and risks of specific treatments. Patient-reported outcomes can be used to support claims;79 however, for new heart failure drugs, they are generally not acceptable as the sole basis for approval because of the uncertain benefit to risk ratio (i.e. what level of risk needs to be excluded to achieve net benefit on a symptom-based claim). Patient-reported outcomes, health-related quality of life endpoints, and cost-effectiveness endpoints are also relevant endpoints to include in trials because they are important to payers and society, and they can impact the uptake of a therapy after regulatory approval, even if regulatory approval was not based on these endpoints.
As with dyspnoea assessment, patient-reported health outcome assessments are limited by adherence to instrument completion and patient recall, if the data are collected retrospectively. Real-time measurements may overcome these limitations, and employing modern technology, such as smart phones, may be one way to achieve real-time data collection. Many instruments are available to collect patient-reported outcome and health-related quality of life data. The instrument should demonstrate content validity, i.e. it should measure the specific concept or construct of interest that it is intended to assess.80 The MLWHFQ and the Kansas City Cardiomyopathy Questionnaire (KCCQ) have shown validity and responsiveness to change, but this may reflect the fact that these questionnaires measure the severity of symptoms rather than actual quality of life. The Chronic Heart Failure Questionnaire (CHFQ) has also been shown to be responsive to change and to differentiate between interventions, but it is not self-administered which may limit its usefulness in some trials.81 These instruments are heart failure specific, and they are preferred over generic instruments, at least when used as endpoints in heart failure clinical trials. However, generic instruments such as EQ5D or SF36 are needed for economic analyses and to provide a measure of general quality of life, which also provides insight into the impact of adverse effects or the burden of taking treatment. Individualization of quality of life assessment is important. Dynamic tools to measure health status from a patient's perspective are available for a variety of conditions. 
This approach is actively being investigated in the heart failure community (http://www.nihpromis.org).82 Regulatory agencies should be consulted before a patient-reported outcome instrument is chosen for a trial seeking to achieve a label claim on a patient-reported outcome endpoint, to ensure the instrument has been sufficiently validated in the context of its planned use.83,84
Although patient assessment of symptoms has been associated with subsequent mortality and hospitalization, patient-reported health outcomes are not reliable surrogates for mortality.85,86 Rather, they are important outcomes independent of mortality. Regulatory agencies recognize scientifically robust assessments of patient-reported health outcomes as sufficient to meet labelling requirements.83 However, the proper validation to develop an instrument is time consuming and detailed. The instruments must specifically measure the expected effect, and improvement must be evident in all domains. Concordance with other clinical measurements (e.g. 6 min walk test or symptoms) strengthens the confidence that the instrument is accurately measuring the outcomes.
Evaluation of patient-reported outcomes can reasonably be accomplished with substantially fewer patients than those needed for outcome studies. Indeed, an additional advantage of these evaluations is that they may capture the net effect of interventions, since both efficacy and safety aspects affect quality of life. However, these endpoints do introduce analytical challenges. The clinically important magnitude of change in the commonly used instruments is not known and is often chosen arbitrarily. More research is needed to determine the threshold that constitutes a clinically meaningful change. Additionally, new methodologies to minimize bias are needed for trials that cannot be blinded. Missing data are problematic because they may reflect a greater burden of disease (e.g. patients may be too sick to complete the instrument or they may have died). Thus, the result may be biased if the data only reflect patients who were well enough to complete the instrument. Several methods exist to deal with missing data (e.g. imputation), but all have limitations. Finally, assessing patient-reported outcomes as an endpoint is only possible in surviving patients, which also introduces bias. New methods should be developed to account for these important confounders when using patient-reported outcomes as an endpoint.
Clinical composite endpoints
Several types of clinical composite endpoints have been proposed that incorporate aspects of mortality, morbidity, and patient-reported outcomes. These endpoints address the limitations associated with traditional morbidity and mortality endpoints, and they may reduce the resources required to conduct clinical trials.87 However, these endpoints have their own challenges, as described in the following paragraphs, which limit their usefulness as pivotal endpoints for phase III trials.
Composite scores consider physician assessment of symptoms (NYHA class), patient global symptom assessment, and morbidity and mortality. Such scores classify patients as improved, unchanged, or worsened.88 Variations of this score approach have been adapted in a number of studies.89–92 One challenge with composite scores is the proper weighting of different outcomes. In A-HeFT, death, hospitalization for heart failure, and change in quality of life were weighted differently in calculating an outcome score for each patient. The appropriateness of the arbitrary values assigned to each of these endpoints could be debated, but the methodology can reduce the sample size needed to document efficacy as compared with using a mortality plus morbidity endpoint. More research is needed on methods to weight the components of composite endpoints appropriately and to quantify clinically relevant changes.
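The weighting problem can be made concrete with a minimal sketch of a per-patient weighted score. The weights, the quality of life cut-offs, the sign convention (positive change means improvement), and the rule that a patient who died receives the worst quality of life category are all assumptions made for illustration; they are not presented as the values or rules used in A-HeFT or in any other trial.

```python
from dataclasses import dataclass

# Hypothetical weights for illustration only.
DEATH_WEIGHT = -3
HF_HOSPITALIZATION_WEIGHT = -1
WORST_QOL_COMPONENT = -2


@dataclass
class PatientOutcome:
    died: bool
    hospitalized_for_hf: bool
    qol_change: float  # assumed convention: positive = improvement, in points


def qol_component(qol_change: float) -> int:
    """Map change in quality of life to an ordinal value from -2 to +2 (assumed cut-offs)."""
    if qol_change >= 10:
        return 2
    if qol_change >= 5:
        return 1
    if qol_change > -5:
        return 0
    if qol_change > -10:
        return -1
    return WORST_QOL_COMPONENT


def composite_score(patient: PatientOutcome) -> int:
    """Sum the weighted components into one score per patient."""
    score = 0
    if patient.died:
        score += DEATH_WEIGHT
    if patient.hospitalized_for_hf:
        score += HF_HOSPITALIZATION_WEIGHT
    # Assumed convention: death is assigned the worst quality of life category.
    score += WORST_QOL_COMPONENT if patient.died else qol_component(patient.qol_change)
    return score


patients = [
    PatientOutcome(died=False, hospitalized_for_hf=False, qol_change=12.0),
    PatientOutcome(died=False, hospitalized_for_hf=True, qol_change=-6.0),
    PatientOutcome(died=True, hospitalized_for_hf=True, qol_change=12.0),
]
print([composite_score(p) for p in patients])
```

With these assumed weights, a hospitalization carries the same penalty as a moderate decline in quality of life, and a death counts as three hospitalizations. Changing any weight changes the ranking of patients, which is precisely why the choice of weights is open to debate.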
Composite scores should ideally consist of objective clinical events, and the components should demonstrate directional concordance. The primary endpoint of the Calcium Upregulation by Percutaneous Administration of Gene Therapy in Cardiac Disease (CUPID) trial was a composite of seven efficacy variables in four domains.90,93 Concordant improvement in the seven efficacy variables was required without clinically significant worsening in any variable.
Incorporating mechanistic endpoints (changes in biomarkers or changes in LV volume) into composites that also reflect clinical outcomes is a concept that merits consideration, at least in phase II trials. However, the optimal assessment variables for mechanistic endpoints need to be established (e.g. size/volumes vs. sphericity or level of change in a biomarker that is clinically important). Imaging is expensive, impractical for large trials, and may be strongly operator dependent. It is also important to note that complex composites combining objective measures of mortality/morbidity with subjective measures of symptoms, quality of life, biochemical, functional measures, or changes in concomitant therapy are often difficult to interpret and are generally discouraged as primary endpoints by some regulatory bodies.
Data interpretation from clinical composite endpoints can be challenging, since they can have many components that may diverge in direction and/or magnitude, and the details are often difficult to ascertain. Because of concerns that combining efficacy and safety in a single ‘net clinical benefit’ composite may obscure safety signals, the two components should usually be examined separately.94,95 It is unlikely that a single drug or device will positively influence all components of a clinical composite, particularly as more components are added. A single endpoint has advantages over clinical composite endpoints, because the intent of therapy is clear and the results cannot be confounded by divergent effects. As multiple different composites are used, evaluating data across trials will be difficult since the endpoints are likely to differ in perhaps subtle, but important, ways.
Safety endpoints
Mortality
Several examples can be offered where drugs with positive phase II data and a solid scientific rationale increased mortality when they were studied in large, adequately powered, randomized trials. For this reason, if all-cause mortality is not assessed as an efficacy endpoint, then the study should still be reasonably powered to rule out excess mortality. This requirement is less appropriate for early phase trials that are underpowered to make any reasonable evaluation of mortality risk.
Pre-specifying non-inferiority boundaries for mortality (i.e. the new therapy does not increase mortality compared with the control by a pre-specified margin) may be considered. One problem is the arbitrary selection of the non-inferiority margin; the margin may have to be inappropriately liberal to achieve feasible sample size estimates.96 Specifically for AHF trials, the optimal time period to record mortality remains an unresolved issue. It is reported at 30 days in some studies and up to 180 days in others. An examination of reported event rates seems to indicate that mortality becomes linear after ∼60–90 days;42,97 thus, 60–90 days may be the optimal time point to assess death post-discharge for AHF trials.
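The tension between the non-inferiority margin and sample size can be illustrated with a minimal sketch. It assumes that the comparison is made on the absolute difference in mortality (new therapy minus control) and that a simple Wald confidence interval is adequate; the function names, event counts, and margin are hypothetical, and a real trial would pre-specify its own effect measure and method.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_upper_bound(deaths_new: int, n_new: int,
                                deaths_ctrl: int, n_ctrl: int,
                                alpha: float = 0.05) -> float:
    """Upper limit of the two-sided (1 - alpha) Wald interval for the risk difference."""
    p_new = deaths_new / n_new
    p_ctrl = deaths_ctrl / n_ctrl
    se = sqrt(p_new * (1 - p_new) / n_new + p_ctrl * (1 - p_ctrl) / n_ctrl)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (p_new - p_ctrl) + z * se


def is_non_inferior(upper_bound: float, margin: float) -> bool:
    """Non-inferiority is concluded if the upper limit lies below the margin."""
    return upper_bound < margin


MARGIN = 0.03  # hypothetical absolute margin: 3 percentage points of excess mortality

# Identical observed mortality (10%) in both arms, with two different trial sizes.
large_trial = risk_difference_upper_bound(100, 1000, 100, 1000)
small_trial = risk_difference_upper_bound(25, 250, 25, 250)

print(is_non_inferior(large_trial, MARGIN))
print(is_non_inferior(small_trial, MARGIN))
```

Although the observed mortality is identical in both scenarios, only the larger trial excludes the assumed margin. A smaller trial could reach the same conclusion only with a wider margin, which is the sense in which a feasible sample size may force an inappropriately liberal margin.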
Renal function
Both baseline renal function and change in renal function are independent predictors of outcome in patients with heart failure.98–103 Substantial attention has been paid to cardiorenal interactions and the importance of renal function in patients with heart failure. However, many uncertainties exist with regard to renal function as an endpoint in heart failure trials. It is unknown if a transient change in renal function is more or less important than a persistent change. Whether renal function performs best as a clinical or safety endpoint is debated. The ideal predictor for assessment of renal safety has also not been determined. Is it discharge serum creatinine (SCr)? Change in SCr by some threshold level? Decrease in estimated glomerular filtration rate (eGFR)? Should the endpoint be a composite of glomerular and tubular markers of renal damage?
Although renal function does predict outcome, it is a poor surrogate for clinical outcomes in patients with heart failure. It is documented that several life-saving therapies [ACE inhibitors, ARBs, and mineralocorticoid receptor antagonists (MRAs)] may increase SCr, but these therapies still improve outcomes despite the change in renal function.104 An analysis from the Eplerenone Post-Acute Myocardial Infarction Heart Failure Efficacy and Survival Study (EPHESUS) showed that eplerenone was associated with a lower risk of cardiovascular death or hospitalization compared with placebo despite a greater proportion of eplerenone-treated patients experiencing a decline in eGFR.105 Similar results from a post-hoc analysis of the Randomized Aldactone Evaluation Study (RALES) have recently been published.106 This example illustrates the caution that must be exercised when using change in renal function as either a safety or an efficacy endpoint, since worsening renal function does not always correlate with worse clinical outcomes, depending on the therapeutic intervention.
Biomarkers
Biomarkers are not acceptable surrogates of clinical outcome, but some may be useful indicators of safety.107 Increased troponin, serum creatinine, cystatin-C, or hepatic transaminases108 were associated with a higher risk of 6-month mortality, and larger decreases in NT-proBNP were associated with a lower risk of 6-month mortality in the phase III RELAX-AHF study.109 Patients randomized to serelaxin had significantly lower levels of serum creatinine, blood urea nitrogen, and uric acid within the first 5 days after randomization, and a lower level of hepatic transaminases within the first 3 days after randomization compared with patients randomized to placebo. These data are hypothesis generating but suggest that favourable effects on laboratory variables might be an indicator of better long-term outcome.109 Further research and validation of this approach, along with input from regulatory authorities, will be required to determine whether a single or composite biomarker endpoint could be used as a safety endpoint for phase III trials.
Conclusion
The selection of primary efficacy and safety endpoints in heart failure trials continues to challenge clinical trialists, regulators, and sponsors. As event rates decline in response to greater adoption of evidence-based, guideline-recommended therapies, the resources needed to conduct trials with traditional endpoints of mortality and morbidity may be prohibitive. In many circumstances, it will be necessary to tailor the endpoint to meet the needs of the population under study, since therapeutic goals and underlying event rates differ across the spectrum of patients with AHF, chronic heart failure, or HF-PEF. The endpoint chosen for a single pivotal trial is unlikely to address the needs of all relevant parties. This Working Group of the HFA-ESC identified several important areas of consensus related to endpoints in heart failure trials, as well as areas of uncertainty where more research and analysis are needed to progress the field. The points summarized in Table 1 should serve as a resource for investigators and sponsors who are actively planning clinical trials. Several of these points represent a ‘call to action’ for stakeholders to cooperate and jointly develop strategies or research initiatives to address the unmet needs of clinical trials in this therapeutic space (Table 2). Future collaborative and timely efforts are required in order to influence favourably the direction of heart failure research going forward, and, ultimately, patient outcomes.
Funding
The Heart Failure Association of the European Society of Cardiology (HFA-ESC)
Conflicts of interest: P.W.A. received research grants from Boehringer Ingelheim, Hoffman LaRoche, SanofiAventis Canada, Inc., Merck Sharp & Dohme, GlaxoSmithKline, Amylin Pharmaceuticals, and Merck & Company, Inc., and consulting fees from AstraZeneca, Eli Lilly, Merck & Company, Inc., F. Hoffman-La Roche Ltd, Axio/Orexigen, and GlaxoSmithKline. A.F.H. received research funding from Johnson & Johnson, Amylin, and Portola, and an honorarium from Corthera. J.K. is an employee of Merck Research Laboratories. S.K. is an employee of Takeda Pharmaceuticals. A.P.M. received consultancy fees for participation in Steering Committees or DSMB of studies sponsored by Novartis, Bayer, Abbott Vascular, Amgen, and Cardiorentis. M.M. received honoraria for consultancy and speaking from Abbott Vascular, Bayer, Corthera, Novartis, and Servier. C.N. is an employee of Bayer Pharma AG. T.S. is an employee and shareholder of Novartis Pharma AG. K.S. is an employee of Boston Scientific Corporation. W.G.S. received support from the Heart Failure Association of the European Society of Cardiology. L.T. is a consultant or committee member for LoneStar Heart, Servier, Saint Jude Medical, and Vifor Pharma. A.A.V. received consultancy fees and/or research grants from Alere, Bayer, Cardio3Biosciences, Celladon, Ceva, Novartis, Servier, Torrent, and Vifor, grant support from the European Commission (FP7-242209-BIOSTAT-CHF), and is a Clinical Established Investigator of the Dutch Heart Foundation (2006T37). S.M.W. is an employee of Amgen, Inc., and has received Amgen stock/stock options. H.W. is an employee of ResMed, Martinsried, Germany. A.Z. is an employee of Novartis Pharmaceuticals Corporation. All other authors have no conflicts of interest to declare.
Acknowledgements
We acknowledge Roger Mills, MD, who participated in and contributed to discussions that took place during the Third Heart Failure Clinical Trialists Workshop of the Heart Failure Association of the European Society of Cardiology, Nice, France, 12–13 February 2012.