Psychometric Equivalence of the OAB-q in Danish, German, Polish, Swedish, and Turkish
ABSTRACT
Objective: Patient-reported outcomes (PROs) are measures of patients' health status provided directly by patients. When PROs are used in multinational clinical trials, multiple language versions may be needed. The overactive bladder questionnaire (OAB-q), a self-administered PRO assessing symptom bother and health-related quality of life (HRQL) in patients with OAB, was developed in US English and has been translated into more than 40 languages. This analysis evaluated the psychometric equivalence of five language versions of the OAB-q.
Methods: The Disease Management Study (DMS) was a multicenter, double-blind, placebo-controlled, parallel group, randomized study in adults with OAB. Participants completed the OAB-q, 3-day micturition diaries, and the patient's perception of bladder condition (PPBC) at baseline and weeks 1 and 12 of treatment.
Results: Data from 398 patients from five countries were analyzed: Denmark (N = 71), Germany (N = 127), Poland (N = 60), Sweden (N = 94), and Turkey (N = 46). Participants were a mean of 57.4 years old; 31% were male; and almost all were Caucasian. Across all five languages, Cronbach alphas ranged from 0.71 to 0.83 for the OAB-q symptom bother subscale and from 0.82 to 0.94 for the HRQL subscales (concern, coping, sleep, and social). OAB-q subscales were significantly correlated with PPBC in all languages. Mean baseline to week 12 change scores ranged from −21.4 to −30.3 for symptom bother and from 5.2 to 36.0 for the HRQL subscales. Effect sizes ranged from 0.92 to 2.79 for the symptom bother subscale and from 0.21 to 1.30 for the HRQL subscales.
Conclusion: OAB-q language versions of Danish, German, Polish, Swedish, and Turkish demonstrated acceptable psychometric characteristics, including internal consistency reliability, construct validity, and responsiveness.
Introduction
Patient-reported outcomes (PROs) are measures of a patient's health status provided directly by the patient [1]. PROs can be used to assess such outcomes as health-related quality of life (HRQL), symptoms, patient satisfaction, and social, emotional, and physical functioning. As with any clinical assessment, the assessment of PROs relies on the use of questionnaires with demonstrated validity and reliability [1–4].
PROs utilized in multinational clinical trials may require multiple language versions of a questionnaire. Standard methodology for linguistic validation of questionnaires includes forward translation into the target language by two native language speakers, with comparison and reconciliation of the translations, and then at least one backward translation to the initial language, with comparison of the translations to the primary questionnaire. The translated versions are reviewed by lay persons and experts, and revisions are made if necessary. The final version is then tested with patients in the target language [5–8]. With the linguistic validation, conceptual equivalence is often also assessed in the various language versions of a questionnaire [9,10]. Thus, after completing this process, the questionnaire is considered to be linguistically and conceptually equivalent.
The next step in successful questionnaire translation is performing a psychometric evaluation of each new language version and determining the psychometric equivalence of the different language versions [1]. Psychometric evaluation includes examination of the instrument's validity and reliability, and occurs throughout the use of the questionnaire in the target patient population. The evaluation of psychometric equivalence involves the assessment of the psychometric properties of each language version of a questionnaire when utilized within the same target patient population [11–14]. Psychometric equivalence is the extent to which the psychometric properties of the different language versions of the questionnaire are similar, though not necessarily the same [9]. The determination of psychometric equivalence is paramount to determining if data from the different language versions can be combined for analyses in multinational clinical trials [7].
The overactive bladder questionnaire (OAB-q) is a 33-item, self-administered, disease-specific questionnaire to assess symptom bother and HRQL in patients with OAB [15]. The OAB-q was originally developed in English for the United States and has been translated and culturally validated into numerous languages using accepted translation methodology [5]. Although this translation methodology is designed to produce a culturally valid questionnaire, it does not ensure the psychometric validity of the translated versions. This post hoc analysis was performed to evaluate the psychometric equivalence of five language versions of the OAB-q.
Methods
Study Design
The Disease Management Study (DMS) was a multicenter, double-blind, placebo-controlled, parallel group, randomized study of adults with OAB. Patients were eligible for DMS if they were ≥18 years of age and reported OAB symptoms for ≥3 months before the screening visit, urinary frequency of ≥8 micturition episodes per 24 hours, and at least three episodes in 3 days of urgency or urgency incontinence. Eligible patients must also have rated the most bothersome OAB symptom as at least moderately bothersome on the OAB bother rating scale (BRS). Participants were randomized (2:1) to receive either tolterodine or placebo for 12 weeks. Four study visits occurred: screening, baseline, week 1, and week 12. Participants were instructed to take one capsule of study drug in the evening before going to bed for 12 weeks. No dose adjustment was allowed. The following questionnaires were completed:
- OAB-q: The self-administered OAB-q measures symptom bother (8 items) and HRQL (25 items) with the HRQL scale consisting of four subscales: coping, concern, sleep, and social interaction [15]. Scores range from 0 to 100; higher symptom bother scores indicate greater reports of symptom severity while higher HRQL scores indicate better HRQL. The OAB-q was developed in English based on patient input (i.e., focus groups and cognitive debriefing interviews), and has been validated in more than 5000 patients with continent and incontinent OAB, in both clinical and community samples. The OAB-q has consistently demonstrated good internal consistency and test-retest reliability, concurrent and discriminant validity, and responsiveness. The OAB-q was completed at baseline, week 1, and week 12 in 11 languages: Canadian English, Canadian French, Danish, German, Italian, Dutch, Norwegian, Polish, Spanish, Swedish, and Turkish.
- Patient's perception of bladder condition (PPBC): The PPBC is a validated, one-item, self-administered questionnaire that asks patients to describe their perception of their bladder condition [16]. Patients choose the statement that best describes their current bladder condition, with responses ranging from 1 (“does not cause me any problems at all”) to 6 (“causes me many severe problems”). The PPBC was completed in 11 languages at the baseline, week 1, and week 12 visits.
- Micturition diary: The numbers of daytime and nighttime urinations, urgency episodes, and urgency incontinence episodes were recorded in the micturition diary. Micturition diaries were completed in 11 languages for 3 days before the baseline, week 1, and week 12 visits.
Statistical Analyses and Analysis Sets
All patients in the safety population (i.e., all enrolled participants, regardless of treatment assignment) were included in the baseline-only analyses. Analyses were conducted only on language versions of the OAB-q for which data from at least 40 patients were available. Responsiveness and effect size analyses were conducted on the completers analysis set, that is, patients who had data at both baseline and week 12. No missing subscale PRO data were imputed for the responsiveness and effect size analyses. Because the purpose of this analysis was to examine the psychometric properties of the OAB-q within each language, treatment assignment was not considered in the analyses.
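The selection rules in this paragraph can be illustrated with a short sketch. It is not the SAS code used in the study; the record layout (a `language` field plus `baseline` and `week12` dictionaries of subscale scores) and the function names are hypothetical and serve only to make the two rules concrete: a language version needs at least 40 patients, and a completer needs a score at both baseline and week 12.

```python
from collections import Counter

MIN_PATIENTS = 40  # threshold stated in the text


def eligible_languages(records, min_patients=MIN_PATIENTS):
    """Return the language versions with at least `min_patients` patients."""
    counts = Counter(r["language"] for r in records)
    return {lang for lang, n in counts.items() if n >= min_patients}


def completers(records, subscale):
    """Keep patients with a non-missing subscale score at both visits.

    No imputation is performed: a patient missing either score is dropped.
    """
    return [
        r for r in records
        if r["baseline"].get(subscale) is not None
        and r["week12"].get(subscale) is not None
    ]


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    demo = [
        {"language": "German", "baseline": {"sleep": 50.0}, "week12": {"sleep": 65.0}},
        {"language": "German", "baseline": {"sleep": 40.0}, "week12": {}},
    ]
    print(len(completers(demo, "sleep")))
```

Keeping the two filters separate mirrors the text: the language threshold applies to every analysis, whereas the completers restriction applies only to the responsiveness and effect size analyses.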
SAS statistical software Version 9.1.3 (SAS Institute, Cary, NC) was used for all analyses. All statistical tests were two-tailed and were conducted with type I error probability fixed at 0.05. Summaries of categorical variables included the frequency and the percentage of patients within each category. Comparisons of categorical data were performed using chi-square tests (and Fisher's exact test for cases when the frequency per cell was less than 5). Continuous variables were summarized with the following measures of location and dispersion: mean, median, SD, minimum, and maximum.
To assess internal consistency reliability, Cronbach alphas were calculated for each OAB-q subscale score by country at baseline, week 1, and week 12. Coefficients were compared qualitatively across language versions. The mean baseline scores for each OAB-q subscale were examined by language version using general linear models (PROC GLM) to control for selected demographic and clinical characteristics (e.g., age, sex, years since OAB diagnosis, and bladder diary variables); adjustments were made using Scheffe's method of multiple comparisons. The relationship of the OAB-q subscales with other measures (e.g., BRS scores, micturition diary variables) was examined at baseline, week 1, and week 12 by language version using Spearman's correlations.
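For readers unfamiliar with the statistic, Cronbach's alpha for a subscale of k items can be computed from the item variances and the variance of the summed score. The following is a minimal sketch using the standard formula, not the SAS procedure used in the study; the function name and the input layout (one list of item responses per respondent) are assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for one subscale.

    item_scores: list of respondents, each a list of k item responses
    (complete cases only; this sketch does not handle missing items).
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    columns = list(zip(*item_scores))              # transpose to per-item columns
    item_var_sum = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)


if __name__ == "__main__":
    # Hypothetical responses from four respondents to a three-item subscale.
    demo = [[1, 2, 2], [3, 3, 4], [4, 5, 5], [2, 2, 3]]
    print(round(cronbach_alpha(demo), 3))
```

Alpha approaches 1 when the items vary together and falls toward 0 when they vary independently, which is why values above 0.70 are commonly read as adequate internal consistency for group-level data.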
Responsiveness is an important component of construct validity of a PRO measure and is critical for PRO end points used in clinical trials [2–4,17]. The responsiveness of the OAB-q was examined by language version, with change in the OAB-q subscales calculated as week 12 visit score − baseline visit score. Pairwise comparisons between differences in least squares means from a GLM model with age, sex, years since diagnosis, and baseline score as covariates by language were performed using Scheffe's test adjusting for multiple comparisons. Effect sizes were also calculated by language version. Effect size was calculated as the difference in mean score from baseline to week 12 divided by the SD of baseline scores for all subjects: (mean score at week 12 − mean score at baseline) ÷ SD of baseline scores. Effect size is characterized as small (0.20), moderate (0.50), or large (0.80) following the guidelines proposed by Cohen [18]. Responsiveness and effect size calculations utilized the completers analysis set among patients who had data at both baseline and week 12, regardless of treatment assignment.
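The effect size calculation can be illustrated with a brief sketch. It follows the change-score convention given above (week 12 minus baseline), so a reduction in symptom bother yields a negative value. The function names are hypothetical, and treating Cohen's benchmarks of 0.20, 0.50, and 0.80 as lower bounds of each category is an assumption made for this example.

```python
from statistics import mean, stdev


def effect_size(baseline, week12):
    """Mean change (week 12 minus baseline) divided by the baseline SD."""
    return (mean(week12) - mean(baseline)) / stdev(baseline)


def cohen_label(es):
    """Label the magnitude of an effect size using Cohen's benchmarks.

    Assumption: 0.20, 0.50 and 0.80 are treated as lower bounds.
    """
    size = abs(es)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "moderate"
    if size >= 0.20:
        return "small"
    return "below small"


if __name__ == "__main__":
    # Hypothetical symptom bother scores for three completers.
    base = [60.0, 70.0, 80.0]
    wk12 = [40.0, 50.0, 60.0]
    es = effect_size(base, wk12)
    print(es, cohen_label(es))
```

Because the denominator is the baseline SD, a language version with more homogeneous baseline scores will show a larger effect size for the same mean change.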
Results
Six hundred seventeen participants were enrolled from 66 centers in 11 countries; however, only five countries enrolled 40 or more patients. As such, data from 398 patients were analyzed from the following countries: Denmark (N = 71), Germany (N = 127), Poland (N = 60), Sweden (N = 94), and Turkey (N = 46). Demographic characteristics for each country are summarized in Table 1. Participants were a mean of 57.4 years of age and 31% were male. Of the 398 patients, 397 were Caucasian. Participants had received a diagnosis of OAB a mean of 5.0 years before enrollment in this study.
Characteristic | Language analysis population (N = 398) | Denmark (N = 71) | Germany (N = 127) | Poland (N = 60) | Sweden (N = 94) | Turkey (N = 46) |
---|---|---|---|---|---|---|
Age (mean, SD) | 57.4 (13.4) | 58.5 (14.0) | 57.9 (12.9) | 59.5 (12.4) | 58.3 (12.4) | 49.7 (14.7) |
Sex (% male) | 30.7 | 47.9 | 18.9 | 16.7 | 44.7 | 26.1 |
Race (N,%) | ||||||
Caucasian | 397 (99.8) | 70 (98.6) | 127 (100) | 60 (100) | 94 (100) | 46 (100) |
Other | 1 (0.3) | 1 (1.4) | 0 (0) | 0 (0) | 0 (0) | 0 (0) |
Country of Enrollment (N,%) | ||||||
Denmark | 71 (17.8) | |||||
Germany | 127 (31.9) | |||||
Poland | 60 (15.1) | |||||
Sweden | 94 (23.6) | |||||
Turkey | 46 (11.6) | |||||
Hormonal status (females only) (N,%) | ||||||
Premenopausal | 89 (32.3) | 11 (29.7) | 33 (32.0) | 12 (24.0) | 19 (36.5) | 14 (41.2) |
Postmenopausal | 187 (67.8) | 26 (70.3) | 70 (68.0) | 38 (76.0) | 33 (63.5) | 20 (58.8) |
Smoking status (N,%) | ||||||
Current smoker | 81 (20.4) | 27 (38.0) | 15 (11.8) | 11 (18.3) | 15 (16.0) | 13 (28.3) |
Ex-smoker | 104 (26.1) | 13 (18.3) | 37 (29.1) | 14 (23.3) | 34 (36.2) | 6 (13.0) |
Nonsmoker | 213 (53.5) | 31 (43.7) | 75 (59.1) | 35 (58.3) | 45 (47.9) | 27 (58.7) |
Years since OAB diagnosis (mean, SD) | 5.03 (6.52) | 8.04 (7.47) | 4.34 (6.36) | 5.19 (5.02) | 6.03 (6.91) | 0.06 (0.10) |
- OAB, overactive bladder.
Baseline Score Comparability
The mean OAB-q baseline scores were compared by country (Table 2). To control for OAB severity and other variations, the following covariates were used in the GLM model: age, sex, years since diagnosis, frequency of daytime urinations, nocturia episodes, urgency episodes, and incontinence episodes. Baseline symptom bother scores ranged from 63.5 (Swedish) to 73.0 (Polish), indicating moderate to high symptom bother. Baseline HRQL subscale scores ranged from 31.5 (concern: Turkish) to 81.6 (social interaction: Danish), indicating low to high HRQL impact. Social activities were impacted the least overall (51.6 [Turkish] to 81.6 [Danish]), while coping demonstrated the greatest impact (37.9 [Turkish] to 54.8 [Danish]). There were significant differences among some of the language versions for the symptom bother subscale and also for each of the HRQL subscales with the Danes consistently reporting highest HRQL and lowest symptom bother.
OAB-q subscale | Danish (N = 71), LS mean (SE) | German (N = 127), LS mean (SE) | Polish (N = 60), LS mean (SE) | Swedish (N = 94), LS mean (SE) | Turkish (N = 46), LS mean (SE) | Overall F-value | P-value (significant pairwise comparisons) |
---|---|---|---|---|---|---|---|
Symptom bother† | 63.7 (1.8) | 64.9 (1.4) | 73.0 (1.9) | 63.5 (1.5) | 71.9 (2.4) | 14.9*** | 2*,5*,8** |
Coping‡ | 54.8 (2.8) | 45.8 (2.1) | 39.7 (3.0) | 50.7 (2.3) | 37.9 (3.7) | 11.3*** | 2**,4* |
Concern‡ | 58.4 (2.5) | 49.2 (2.0) | 42.1 (2.7) | 55.7 (2.1) | 31.5 (3.4) | 15.0*** | 2***,4***,7***,8**,10*** |
Sleep‡ | 58.5 (2.7) | 54.5 (2.1) | 45.5 (3.0) | 54.2 (2.3) | 45.9 (3.7) | 17.5*** | 2* |
Social‡ | 81.6 (2.6) | 73.8 (2.0) | 59.9 (2.8) | 73.2 (2.2) | 51.6 (3.5) | 19.0*** | 2***,4***,5**, 7***,8**,10*** |
HRQL total‡ | 62.4 (2.2) | 54.1 (1.7) | 45.7 (2.4) | 57.3 (1.8) | 40.4 (3.0) | 20.2*** | 2***,4***,7**,8**,10*** |
- P-values: Pairwise comparisons between differences in least squares (LS) means from GLM model with age, sex, years since diagnosis, frequency of daytime urinations, nocturia episodes, urgency episodes, and incontinence episodes as covariates by language were performed using Scheffe's test adjusting for multiple comparisons. Comparisons not significant unless noted: 1 = Danish vs. German; 2 = Danish vs. Polish; 3 = Danish vs. Swedish; 4 = Danish vs. Turkish; 5 = German vs. Polish; 6 = German vs. Swedish; 7 = German vs. Turkish; 8 = Polish vs. Swedish; 9 = Polish vs. Turkish; and 10 = Swedish vs. Turkish. P-values are: *<0.05, **<0.01, ***<0.001.
- † Scores range from 0 to 100; higher scores indicate greater symptom bother.
- ‡ Scores range from 0 to 100; higher scores indicate better health-related quality of life (HRQL).
Internal Consistency Reliability
Cronbach's alphas were calculated for each OAB-q subscale score for each visit for each of the five countries (Table 3 presents baseline data). The OAB-q symptom bother subscale and HRQL subscales (concern, coping, sleep, and social interaction) demonstrated good internal consistency reliability across all five languages. Symptom bother Cronbach alphas ranged from 0.71 (Polish) to 0.83 (Turkish); Cronbach alphas for the HRQL subscales ranged from 0.82 (social interaction: Danish) to 0.94 (coping: Swedish). Cronbach's alphas for each of the subscales in each of the languages exceeded the generally accepted value of 0.70 for aggregate data. Cronbach's alphas for each of the subscales in each of the languages at weeks 1 and 12 were consistent with baseline and also exceeded 0.70 (data on file).
OAB-q subscale (Cronbach's alpha) | Danish | German | Polish | Swedish | Turkish |
---|---|---|---|---|---|
Symptom bother | 0.81 | 0.75 | 0.71 | 0.81 | 0.83 |
Coping | 0.92 | 0.91 | 0.91 | 0.94 | 0.87 |
Concern | 0.90 | 0.87 | 0.91 | 0.90 | 0.85 |
Sleep | 0.89 | 0.91 | 0.92 | 0.93 | 0.93 |
Social | 0.82 | 0.91 | 0.88 | 0.87 | 0.88 |
HRQL total | 0.94 | 0.95 | 0.96 | 0.96 | 0.96 |
- HRQL, health-related quality of life; OAB-q, overactive bladder questionnaire.
Correlations
Symptom bother scores at baseline were significantly correlated with the PPBC in all languages (r = 0.49–0.56; P < 0.001) and also with daytime urgency episodes in Danish, German, and Swedish (r = 0.23–0.36; P < 0.05). The HRQL sleep subscale was significantly correlated with nocturnal micturitions (r = 0.28–0.71; P < 0.001) in all languages except Turkish and with nocturnal urgency episodes (r = 0.40–0.63; P < 0.005) in all languages except Turkish and German. All HRQL subscales were significantly correlated with PPBC in all languages except Danish (r = 0.29–0.69; P < 0.01). At week 12, the correlations for all subscales followed a similar pattern but were of greater magnitude than at baseline. These correlations were expected, given that both instruments assess aspects of the patient's perception of the impact of OAB.
Correlations among the HRQL subscales and the micturition diary variables were small, with the exception of the coping subscale in Polish with nighttime urinations and the sleep subscale in Danish, Polish, and Swedish with both nighttime urinations and nighttime urge episodes. Given that the OAB-q and micturition diaries measure two different aspects of OAB, these correlations were in the expected range. These results demonstrate good construct validity of the OAB-q in these languages. The pattern and magnitude of these correlations are consistent with previous research [15].
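As background to the coefficients reported in this section, Spearman's correlation is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below illustrates that definition only; it is not the SAS code used in the study, and the function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation between two equal-length sequences."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical symptom bother scores and PPBC ratings for five patients.
    bother = [55.0, 70.0, 62.5, 80.0, 45.0]
    ppbc = [3, 5, 4, 5, 2]
    print(round(spearman_rho(bother, ppbc), 3))
```

A rank-based coefficient suits these data because the PPBC is a six-point ordinal scale and diary counts are often skewed.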
Responsiveness and Effect Size
The OAB-q was highly responsive across all languages (Table 4). Mean change from baseline to week 12 for symptom bother ranged from −20.1 (Swedish) to −30.3 (Polish). HRQL subscale mean changes ranged from 5.2 (social interaction: Danish) to 36.0 (concern: Turkish). Significant differences (P < 0.0001) were present in change scores among the language versions, with Poland and Turkey consistently reporting the greatest change.
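OAB-q subscale | Danish (N = 52), LS mean (SE) change | German (N = 88), LS mean (SE) change | Polish (N = 49), LS mean (SE) change | Swedish (N = 59), LS mean (SE) change | Turkish (N = 33), LS mean (SE) change | Overall F-value | P-value (significant pairwise comparisons) |
---|---|---|---|---|---|---|---|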
Symptom bother† | −21.4 (2.0) | −24.2 (1.6) | −30.3 (2.1) | −20.1 (1.8) | −29.8 (2.8) | 42.5*** | 2*,8** |
Coping‡ | 13.3 (2.9) | 22.1 (2.3) | 24.7 (3.0) | 15.5 (2.6) | 29.3 (4.0) | 15.6*** | 4* |
Concern‡ | 15.0 (2.8) | 21.7 (2.2) | 26.4 (2.9) | 14.5 (2.6) | 36.0 (3.9) | 21.1*** | 4**,7*,10*** |
Sleep‡ | 9.1 (2.9) | 15.0 (2.3) | 21.3 (3.0) | 12.7 (2.7) | 25.5 (4.1) | 15.7*** | 4* |
Social‡ | 5.2 (2.8) | 8.6 (2.2) | 20.4 (2.9) | 9.4 (2.6) | 28.8 (3.9) | 9.6*** | 2**,4***,5*,7***,10** |
HRQL total‡ | 11.0 (2.4) | 17.7 (1.9) | 24.2 (2.5) | 13.7 (2.2) | 30.3 (3.4) | 20.4*** | 2**,4***,7*,8*,10** |
- P-values: Pairwise comparisons between differences in least squares (LS) means from ANCOVA model with age, sex, years since diagnosis, baseline score, and frequency of daytime urinations, nocturia episodes, urgency episodes, and incontinence episodes as covariates by language were performed using Scheffe's test adjusting for multiple comparisons. Comparisons not significant unless noted: 1 = Danish vs. German; 2 = Danish vs. Polish; 3 = Danish vs. Swedish; 4 = Danish vs. Turkish; 5 = German vs. Polish; 6 = German vs. Swedish; 7 = German vs. Turkish; 8 = Polish vs. Swedish; 9 = Polish vs. Turkish; and 10 = Swedish vs. Turkish. P-values are: *<0.05, **<0.01, ***<0.001.
- † Scores range from 0 to 100; higher scores indicate greater symptom bother.
- ‡ Scores range from 0 to 100; higher scores indicate better health-related quality of life (HRQL).
Effect sizes were also calculated by language version (Table 5). Effect sizes were large for the symptom bother subscale across language versions, ranging from 1.02 (Danish) to 2.79 (Polish). HRQL subscale effect sizes ranged from 0.25 (sleep: Danish) to 1.66 (concern: Turkish).
OAB-q subscale (effect size) | Danish (N = 52) | German (N = 88) | Polish (N = 49) | Swedish (N = 59) | Turkish (N = 33) |
---|---|---|---|---|---|
Symptom bother* | −1.02 | −1.98 | −2.79 | −1.07 | −2.11 |
Coping† | 0.58 | 1.14 | 1.36 | 0.59 | 1.30 |
Concern† | 0.70 | 1.29 | 1.33 | 0.59 | 1.66 |
Sleep† | 0.25 | 0.76 | 1.03 | 0.36 | 1.22 |
Social† | 0.30 | 0.55 | 0.86 | 0.35 | 0.95 |
HRQL total† | 0.64 | 1.17 | 1.31 | 0.57 | 1.45 |
- * Scores range from 0 to 100; higher scores indicate greater symptom bother.
- † Scores range from 0 to 100; higher scores indicate better health-related quality of life.
- HRQL, health-related quality of life; OAB-q, overactive bladder questionnaire.
Discussion
This evaluation of the psychometric equivalence of five language versions of the OAB-q found them all to demonstrate adequate psychometric properties. Baseline scores across all language versions were comparable to scores from other studies of patients with OAB [19], and overall worse than normal controls [20]. Interestingly, Danes reported the lowest symptom bother and highest HRQL on all subscales, while Turks and Poles reported high symptom bother and the lowest HRQL.
OAB-q score changes from baseline to 12 weeks were also consistent with other studies of patients who undergo treatment for OAB [21]. Effect sizes in this study were generally consistent with those found in previous studies [19,21].
Although there were statistically significant differences among the baseline scores of various countries, the scores were still within acceptable ranges. The differences in scores could be due to cultural variations in reporting of symptoms, and attitudes toward OAB and urinary issues. Questionnaires are usually initially developed and validated in a single language. When multiple language versions are needed, that primary questionnaire is translated word-for-word or phrase-for-phrase into the target languages. Although the translations may be technically correct, the topic may not be appropriate or applicable in the target culture. In addition, cultural customs and norms are entrenched in patients' responses to questionnaires. Some cultures value stoicism while others are more open. These cultural beliefs can affect patients' responses to their medical conditions and also their questionnaire responses. Although a questionnaire may have undergone full translation and be psychometrically valid, mean scores for one country could be vastly different from mean scores for another country because of these cultural variations. A questionnaire on a topic as sensitive as OAB and urinary incontinence could potentially be greatly affected by these cultural variations. This effect is demonstrated in the significant differences in 6 of the 10 comparisons between the language versions on the social subscale. Coping, which one would not expect to be as affected by cultural norms as social functioning, had more comparable scores at baseline. Despite these issues, all scores are within acceptable ranges, indicating good psychometric properties and equivalence.
The authors acknowledge the limitations of these analyses, including the extreme variations in sample sizes across the countries. The patients in Turkey had received a diagnosis of OAB only shortly before enrollment in the study (a mean of 0.06 years); patients in the other countries had received a diagnosis a mean of 4 to 8 years before enrollment. Although the models comparing mean scores used years since OAB diagnosis as a covariate to adjust for this difference, it could still have affected the comparisons. Also, the Turkish language analyses utilized a relatively small sample of 46 patients, possibly affecting the comparisons across language versions. Country was used as a proxy for language version, because we presumed that only one questionnaire translation was utilized for each country. Although the OAB-q was completed in 11 languages, only five of the languages were included in this psychometric equivalence study because of the small sample sizes in the remaining languages. Further evaluation of the remaining language versions is warranted.
Conclusions
The OAB-q language versions of Danish, German, Polish, Swedish, and Turkish demonstrated acceptable internal consistency reliability, good construct validity, and small to large effect sizes across language versions. Results indicate that the OAB-q is psychometrically valid in Danish, German, Polish, Swedish, and Turkish.
Source of financial support: Funding for this article was provided by Pfizer.