Risk of bias in systematic reviews of tendinopathy management: Are we comparing apples with oranges?
Funding information
This work was funded by grants from the Medical Research Council UK (MR/R020515/1).
Abstract
We aimed to provide an overview of the use of risk of bias (RoB) assessment tools in systematic reviews (SRs) in tendinopathy management given increased scrutiny of the SR literature in clinical decision making. A search was conducted in Medline from inception to June 2020 for all SRs of randomized controlled trials (RCTs) assessing the effectiveness of any intervention(s) on any location(s) of tendinopathy. Included SRs had to use one of (a) Cochrane Collaboration tool, (b) PEDro scale, or (c) revised Cochrane Collaboration tool (RoB 2) for their RoB assessment. A total of 46 SRs were included. Around half of SRs (46%) did not use an RoB assessment in data synthesis, and only 30% used it to grade the certainty of evidence. The RoB 2 tool was the most likely to determine “overall high RoB” (52%) followed by the Cochrane Collaboration tool (34.6%) and the PEDro scale (18.6%) as determined by the authors of the SRs. We have demonstrated substantial problems associated with the use of RoB assessments in tendinopathy SRs. The universal use of a single RoB assessment tool should be promoted by journals and SR guidance documents.
1 INTRODUCTION
The constant emergence of new treatment modalities for tendinopathy over the last few decades and the absence of robust evidence for their effectiveness has led to an increasing number of randomized controlled trials (RCTs). Systematic reviews (SRs) of RCTs constitute the strongest level of evidence and can therefore inform clinical practice, both at a policy level and an individual physician level. An SR should be transparent and reproducible, and subjectivity should be kept to a minimum.1 Unfortunately, firm guidance on conducting an SR does not exist and several parameters are left to the judgment of the authors. Moreover, recent debate in the Lancet argues that the findings of SRs may be flawed as they often include poor-quality studies that should not have been published in the first place.2
One of these parameters is risk of bias (RoB) assessment; not only is it a subjective process by nature, but the existence of several RoB assessment tools further decreases reproducibility by introducing inconsistency. RoB assessment plays an integral role in SRs, and it is an essential part of data synthesis and the reporting of the results. It can be used in one of two ways in an SR, either for subgroup analyses (ie, including only RCTs with low risk of bias) or in determining the strength of evidence for each result in conjunction with other limitations of the included evidence that arise as a result of combining the findings of different studies (consistency, imprecision, etc).3
The Cochrane Collaboration tool4 for assessing internal validity (RoB), which was introduced in 2008 and is the tool most frequently used in SRs of RCTs, consists of 7 components/questions, each of which can be rated as “low” risk, “unclear” risk, or “high” risk of bias. Over the last decade, its use has been associated with considerable confusion, low inter-rater reliability, and incorrect implementation in SRs.5 Additionally, the creators did not specify how the tool should be used to determine overall RoB for each assessed RCT; instead, they advised an overall judgment of the result at a domain and not study level, which is both impractical and very subjective. The second most commonly used tool, the PEDro scale,6, 7 is a scoring system that can be used to determine overall RoB for each study based on the overall score out of 10. It includes all the domains of the Cochrane tool and some additional items and, unlike the Cochrane tool, it is less subjective as the assessor only has two possible answers for each item/question: “yes” or “no.” The main disadvantage of its simplicity, however, is that methodological aspects of the assessed RCT that are not described clearly in the article are automatically scored with a “no,” whereas the Cochrane tool has an “unclear” option, although it is again not clear how this option should be used in the determination of the overall RoB.
The Cochrane group has recently published a revised RoB assessment tool, the RoB 2,5 which, according to the authors, is less subjective, more reproducible, and has more direct implementations in data synthesis. It is made up of 5 items/questions, and each one has a number of signaling questions, which help the author reach a final conclusion about the RoB in each item according to a pre-defined formula. This can either be “low” risk, “high” risk, or “some concerns.” The creators, having realized the importance of determining overall RoB for each study for practical and reproducible implementation of the RoB assessment in data synthesis, have also described how decisions on overall RoB for each study should be reached. Finally, they highlight that RoB should be assessed on an outcome level for each included RCT.5
The introduction of the new RoB assessment tool, regardless of whether it is more effective or not than other tools at predicting the actual RoB, is expected to further increase inconsistency across different SRs. This has the potential to lead to conflicting conclusions between SRs assessing and comparing the same interventions with regard to the strength of evidence of the results and can cause confusion in the translation of the findings and their implementation in clinical practice.
The aims of the present review were (a) to provide an overview of the use of RoB assessment tools in SRs of RCTs in tendinopathy through a scoping review and (b) to assess inter-tool reliability among the Cochrane Collaboration tool, the revised Cochrane Collaboration tool (RoB 2), and the PEDro scale at determining overall RoB in tendinopathy SRs. Finally, we provide recommendations at an RCT level, SR level, and journal level with the ultimate objective of making RoB assessment and its use in data syntheses as understandable, transparent, objective, and reproducible as possible.
2 METHODS
2.1 Eligibility
SRs were eligible if they assessed the effectiveness of any intervention(s) on any location(s) of tendinopathy in patients over 16 years of age, included only RCTs, and used one of the following RoB assessment tools: Cochrane Collaboration tool, PEDro scale, RoB 2 tool (revised Cochrane Collaboration tool). Exclusion criteria included SRs including a mixture of randomized and non-randomized studies and a mixture of participants with tendinopathy and other conditions. SRs in languages other than English were also excluded. No criteria were used regarding the following parameters: publication date, journal type, type of tendinopathy and intervention, outcome measures, and length of follow-up.
2.2 Search strategy—Screening
A literature search was conducted by the first author via Medline in June 2020 with the following search terms and Boolean operators in “All Fields”: “((systematic review) OR (meta-analysis) AND (tendin*) AND (randomi*))”.
For all eligible articles, the reference lists and PubMed's “similar articles” list were screened to identify potentially eligible articles that may have been missed at the initial search. Figure 1 (PRISMA flowchart) illustrates the article screening process.

The initial search returned a total of 208 articles. After exclusion of non-eligible articles according to our pre-defined criteria and inclusion of articles identified from reference screening, 46 SRs were included in our review.
2.3 Data Extraction—Handling
2.3.1 Scoping review
The included SRs were read by the first author, and data were extracted in a Microsoft Word table regarding the following: (a) general SR characteristics (number of included RCTs, location(s) of tendinopathy, intervention(s) assessed, key findings), (b) RoB assessment tool used, (c) whether an overall RoB was determined for each assessed RCT, (d) whether RoB assessment was performed on a study or outcome level, and (e) how RoB assessment was used in data syntheses.
2.3.2 Assessment of consistency of risk of bias assessment
In order to assess for disparity of tools determining overall RoB, we used two separate methods. Firstly, we calculated the proportion of RCTs assessed in all included SRs being determined as of “high overall RoB” for each one of the 3 tools separately and the mean proportion for each tool. Where overall RoB was determined by the authors of the original SR for each RCT, this was used. We also used our own pre-defined criteria (see below) to determine overall RoB for each RCT based on the RoB assessment results reported by the SR authors. Inter-tool reliability was not evaluated formally with statistical tests for this method as the RCTs assessed by each tool were not the same; instead, our purpose was to give a general impression on the likelihood of each tool to determine “high overall RoB” for RCTs and investigate for inter-rater inconsistencies when different criteria are used for the same studies.
Secondly, in light of the newly published RoB 2 tool by the Cochrane Collaboration and its use by the most recently published SR of RCTs in Achilles tendinopathy by van der Vlist et al,8 we assessed RoB of its 29 included RCTs using the two other RoB assessment tools, the Cochrane Collaboration tool and the PEDro scale. We then compared the reliability among the three tools (Cochrane Collaboration and PEDro as performed by the authors of the present review and RoB 2 by the authors of the original SR) at determining overall RoB. We only tested inter-tool reliability for overall RoB determination and not specific domains of the tools as only the former is directly associated with implementation of RoB assessment in data synthesis.
Inter-tool reliability was only assessed for determining “high overall RoB,” which is the aspect of RoB assessment with direct application in data syntheses. “High overall RoB” RCTs determine downgrading of the quality of the evidence, and they are the studies removed for subgroup/sensitivity analyses. For the purposes of the statistical tests, the 29 assessed RCTs were divided in two categories, “high overall RoB” and “other” (“low overall RoB”/”unclear RoB”/”some concerns”), and each category represented each one of the two possible outcomes in the Cohen's kappa formulas.
Overall RoB determination (our criteria)
The RoB 2 tool provides clear, specific instructions on how the overall RoB for each study should be determined5; therefore, we only used the SR authors' assessment.
With regard to the PEDro scale, its final score is traditionally interpreted as 8-10 “excellent quality” and 6-7 “good quality”; therefore, we first used ≥6 as a cutoff to divide high and low overall RoB (or low and high study quality, respectively), as this is the criterion most commonly used by SR authors (PEDro ≥ 6). We also used ≥8 as a cutoff to see which score gives results more similar to those of the other tools (PEDro ≥ 8). As the majority of authors use the PEDro scale for “study quality” and not RoB assessment, for the purposes of this review “high overall RoB” was synonymous with “moderate” or “poor” study quality.
For the Cochrane Collaboration tool, RCTs were considered as “high overall RoB” if they had: (a) high RoB in any of “random sequence generation,” “allocation concealment,” “blinding of patients and staff,” or “blinding of outcome measures” or (b) high RoB in 2 or more of the remaining 3 items (“completeness of outcome data,” “selective reporting,” and “other”) or (c) high RoB in one of the 3 remaining domains if the authors felt the RoB introduced through that domain was significant enough to affect the results of the study. “Unclear overall RoB” was assigned to studies with 3 or more unclear RoB in individual domains not fulfilling the criteria for “high overall RoB,” and “low overall RoB” in those not fulfilling the criteria for high and unclear overall RoB. These criteria, especially for the Cochrane tool and to a lesser extent for the PEDro scale, have been specified by the authors of the present review based on advice deriving from the creators of the Cochrane tool and other researchers9-11; they do not represent the “appropriate” criteria as the creators themselves did not specify any; however, we use them to emphasize the extent of inconsistency and subjectivity.
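The criteria above can be written out as a short decision rule. The following Python sketch is illustrative only: the function names, the dictionary layout, and the `single_domain_judged_significant` flag (which stands in for the reviewer judgment required by criterion (c)) are assumptions made for this example and are not part of either tool.

```python
# Domain names follow the wording of the criteria above; the data layout
# (a dict mapping domain -> rating) is an assumption made for illustration.
KEY_DOMAINS = (
    "random sequence generation",
    "allocation concealment",
    "blinding of patients and staff",
    "blinding of outcome measures",
)
OTHER_DOMAINS = (
    "completeness of outcome data",
    "selective reporting",
    "other",
)
ALL_DOMAINS = KEY_DOMAINS + OTHER_DOMAINS


def cochrane_overall_rob(ratings, single_domain_judged_significant=False):
    """Return 'high', 'unclear' or 'low' overall RoB for one RCT.

    ratings maps each of the 7 domains to 'low', 'unclear' or 'high'.
    single_domain_judged_significant is a hypothetical flag for criterion (c),
    which is a reviewer judgment and cannot be derived from the ratings.
    """
    missing = [d for d in ALL_DOMAINS if d not in ratings]
    if missing:
        raise ValueError(f"missing domains: {missing}")

    high_key = sum(ratings[d] == "high" for d in KEY_DOMAINS)
    high_other = sum(ratings[d] == "high" for d in OTHER_DOMAINS)
    unclear_total = sum(ratings[d] == "unclear" for d in ALL_DOMAINS)

    if high_key >= 1:  # criterion (a)
        return "high"
    if high_other >= 2:  # criterion (b)
        return "high"
    if high_other == 1 and single_domain_judged_significant:  # criterion (c)
        return "high"
    if unclear_total >= 3:  # 3 or more unclear domains, not already high
        return "unclear"
    return "low"


def pedro_overall_rob(score, cutoff=6):
    """Dichotomise a PEDro score (0-10); scores below the cutoff count as high RoB."""
    if not 0 <= score <= 10:
        raise ValueError("PEDro score must be between 0 and 10")
    return "low" if score >= cutoff else "high"


# Hypothetical RCT: high RoB only for blinding of outcome measures.
example = dict.fromkeys(ALL_DOMAINS, "low")
example["blinding of outcome measures"] = "high"
print(cochrane_overall_rob(example))    # high, via criterion (a)
print(pedro_overall_rob(7, cutoff=6))   # low
print(pedro_overall_rob(7, cutoff=8))   # high
```

The last two calls show why the choice of PEDro cutoff matters: the same score of 7 is classified differently under the two cutoffs.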
2.4 Statistical analysis
Cohen's kappa statistic was used to assess inter-tool reliability at determining “high overall RoB.” According to the value of the statistic (range 0-1), the strength of agreement can be: equivalent to chance (0), slight (0.01-0.2), fair (0.21-0.4), moderate (0.41-0.6), substantial (0.61-0.8), near perfect (0.81-0.99), or perfect (1).
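As an illustration of the calculation, the sketch below computes Cohen's kappa for two tools that each classify the same RCTs as “high overall RoB” or “other.” It is a minimal example under stated assumptions: the function names are hypothetical, the ratings are invented and do not come from any SR, and any value above 0 and up to 0.2 is treated as “slight.”

```python
def cohen_kappa(tool_a_high, tool_b_high):
    """Cohen's kappa for two binary classifications of the same RCTs.

    Each argument is a list of booleans: True = 'high overall RoB',
    False = 'other' (low / unclear / some concerns).
    """
    if len(tool_a_high) != len(tool_b_high) or not tool_a_high:
        raise ValueError("both tools must rate the same, non-empty set of RCTs")
    n = len(tool_a_high)
    # Observed agreement: proportion of RCTs given the same category.
    p_observed = sum(a == b for a, b in zip(tool_a_high, tool_b_high)) / n
    # Agreement expected by chance, from each tool's marginal proportions.
    p_a = sum(tool_a_high) / n
    p_b = sum(tool_b_high) / n
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_expected == 1:
        raise ValueError("kappa is undefined when both tools use a single category")
    return (p_observed - p_expected) / (1 - p_expected)


def strength_of_agreement(kappa):
    """Map a kappa value to the descriptive bands used in this section."""
    if kappa <= 0:
        return "equivalent to chance"
    if kappa <= 0.2:
        return "slight"
    if kappa <= 0.4:
        return "fair"
    if kappa <= 0.6:
        return "moderate"
    if kappa <= 0.8:
        return "substantial"
    if kappa < 1:
        return "near perfect"
    return "perfect"


# Invented ratings for 29 RCTs: 12 rated high by both tools, 3 by tool A only,
# 4 by tool B only, and 10 by neither.
tool_a = [True] * 12 + [True] * 3 + [False] * 4 + [False] * 10
tool_b = [True] * 12 + [False] * 3 + [True] * 4 + [False] * 10
kappa = cohen_kappa(tool_a, tool_b)
print(round(kappa, 2), strength_of_agreement(kappa))  # 0.52 moderate
```

In this invented example the two tools agree on 22 of 29 RCTs (76%), yet kappa is only about 0.52 because roughly half of that agreement would be expected by chance.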

3 RESULTS
3.1 Scoping review
Table 1 summarizes the key characteristics of the eligible SRs.8, 12-56 Of the 46 included SRs, 31 used the Cochrane Collaboration tool, 13 the PEDro scale, 2 the revised Cochrane Collaboration tool (RoB 2), and 2 both the Cochrane Collaboration tool and the PEDro scale. Modified versions of the PEDro scale and the Cochrane Collaboration tool were used by two and one SRs, respectively. RoB was assessed on an outcome and not study level in only 3 SRs (6.5%). An overall RoB for each assessed RCT/outcome was determined in 17 SRs (37%; n = 7 PEDro scale, n = 2 RoB 2 tool, n = 8 Cochrane Collaboration tool). A total of 21 SRs (46%) did not use the results of their RoB assessment anywhere in data synthesis; the remaining 25 used them for subgroup/sensitivity analyses excluding “high overall RoB”/“low-quality” studies (n = 9; 36%), for grading the quality of the evidence (n = 14; 56%), or for both (n = 1; 4%). Where the quality of the evidence was graded, tools used included the GRADE tool3 (n = 6; 43%), the Cochrane BRG tool9 (n = 5; 36%), and the NHMRC tool1 (n = 1; 7%), while the authors of 3 SRs (21%) graded the evidence arbitrarily without a pre-specified method.
Authors | Tendinopathy | Number of included studies | Intervention Assessed | Summary of Findings | RoB Assessment Tool | RoB Assessment on study or outcome level | Method for determining overall RoB | Use of RoB in data synthesis |
---|---|---|---|---|---|---|---|---|
Arirachakaran et al (2016) | Lateral Elbow | 10 | PRP, Autologous blood, corticosteroid injection | PRP can improve pain and has fewer complications. Autologous blood can improve pain, function, and pain pressure thresholds but has higher complication rates. | Cochrane | Study | Overall RoB not determined | None |
Arirachakaran et al (2017) | Shoulder calcific | 7 | ESWT, US-guided lavage, corticosteroid injection, and combined treatment | US-guided lavage is the treatment of choice | Cochrane | Study | Overall RoB not determined | None |
Bannuru et al (2014) | Shoulder calcific | 28 | ESWT | High-energy ESWT is effective at improving pain and function | Cochrane | Study | Overall RoB not determined | Subgroup analysis “including high-quality studies” |
Bjordal et al (2008) | Lateral Elbow | 18 | Laser therapy | Laser therapy administered with optimal doses can provide short-term pain relief and improve disability | PEDro | Study | “Good quality” ≥6 | Subgroup analysis “excluding low-quality studies” |
Boudreault et al (2014) | Shoulder | 12 | Oral NSAIDs | Oral NSAIDs effective at reducing short-term pain but not function | Cochrane | Study | “Good quality” >70% (scoring system used) | Evidence grading (arbitrary) |
Catapano et al (2020) | Shoulder | 5 | Dextrose Prolotherapy | Prolotherapy is potentially useful adjunct to physical therapy | Cochrane | Study | Overall RoB not determined | None |
Challoumas et al (2019a) | All | 12 | Surgery | Surgery superior to no treatment/placebo but not sham surgery or physiotherapy | Cochrane | Study | Combined assessment of overall RoB, external validity, and precision | Evidence grading (Cochrane BRG) |
Challoumas et al (2019b) | All | 10 | Topical GTN | Topical GTN superior to placebo in medium term | Cochrane | Study | Combined assessment of overall RoB, external validity, and precision | Evidence grading (Cochrane BRG) |
Chen et al (2019) | Patellar | 11 | Non-surgical treatments | LR-PRP is most effective non-surgical treatment | PEDro | Study | Overall RoB not determined | None |
Coombes et al (2010) | All | 41 | Corticosteroid and other injections | Corticosteroid injections are effective in the short-term, other injections may provide long-term benefit for lateral elbow tendinopathy | Modified PEDro | Study | “Good quality” score >6/13 | Only “high-quality studies” included in SR |
Dan et al (2019) | Patellar | 2 | Surgery | Inconclusive due to low quality of evidence; surgery likely no more effective than eccentric exercise | Cochrane | Outcome | Overall RoB not determined | Evidence grading (GRADE) |
de Vos et al (2014) | Lateral Elbow | 6 | PRP | PRP not effective | PEDro | Study | “Good quality” ≥6 | Evidence grading (Cochrane BRG) |
Desjardins-Charbonneau et al (2015a) | Shoulder | 10 | Taping | Inconclusive due to low quality of evidence | Cochrane | Study | Overall RoB not determined* | None |
Desjardins-Charbonneau et al (2015b) | Shoulder | 21 | Manual therapy | Manual therapy may decrease pain, but it is unclear if it improves function | Cochrane | Study | Overall RoB not determined* | None |
Desmeules et al (2016a) | Shoulder | 10 | Exercise | Exercise is effective at treating workers and promotes return to work | Cochrane | Study | Overall RoB not determined* | None |
Desmeules et al (2016b) | Shoulder | 6 | TENS | Inconclusive due to low quality of evidence | Cochrane | Study | Overall RoB not determined* | None |
Desmeules et al (2015) | Shoulder | 11 | Therapeutic US | Therapeutic US administered with exercise no more superior than exercise alone. Compared to laser treatment it is less effective at alleviating pain | Cochrane | Study | Overall RoB not determined* | None |
Dong et al (2015) | Shoulder | 33 | All | Exercise-based treatments and acupuncture ideal for early disease. Surgery recommended for long-term disease. Corticosteroid injections and laser treatment discouraged. | Cochrane | Study | “High overall RoB” if <3 “low RoB” domains | Subgroup analysis “excluding low-quality studies” |
Dong et al (2016) | Lateral Elbow | 27 | Injection therapies | Some injection therapies can be effective (eg, BOTOX and PRP) but not corticosteroids. Hyaluronate and prolotherapy need more research. | Cochrane | Study | Method not described | Subgroup analysis “excluding low-quality studies” |
Fitzpatrick et al (2017) | All | 18 | PRP | Good evidence to support single injection of PRP under US guidance | Modified Cochrane | Study | High risk if >3 high-risk domains | Subgroup analysis “excluding high RoB studies” |
Haslerud et al (2015) | Shoulder | 17 | Laser therapy | Laser therapy can offer clinically relevant pain relief and improvement in symptoms alone and in combination with physiotherapy | PEDro | Study | “Low quality” if <5 | Evidence grading (arbitrary) |
Ioppolo et al (2013) | Shoulder calcific | 6 | ESWT | ESWT effective in terms of pain, function and resorption of calcific deposits | PEDro | Study | Overall RoB not determined | None |
Lafrance et al (2019) | Shoulder calcific | 3 | US-guided lavage | US-guided lavage is more effective than shockwave therapy or a corticosteroid injection alone | Cochrane | Study | Overall RoB not determined* | None |
Lee et al (2011) | Shoulder calcific | 9 | ESWT | Inconclusive due to low quality of evidence | PEDro | Study | “Low risk” if ≥7 | Evidence grading (NHMRC) |
Li et al (2019) | Lateral Elbow | 7 | PRP, corticosteroid injection | Corticosteroid injection superior to PRP in short-term but PRP more effective in long-term | Cochrane | Study | Overall RoB not determined | None |
Liao et al (2018) | Lower limb tendinopathies | 29 | ESWT | ESWT is effective for pain and function | PEDro | Study | “Good or excellent quality” ≥6 | None |
Lin et al (2020) | Shoulder | 5 | PRP | PRP may be beneficial for long-term pain | Cochrane | Study | Overall RoB not determined | Subgroup analysis “excluding low-quality studies” |
Lin et al (2019) | Shoulder | 7 | Injection therapies | Corticosteroid effective in short but not long-term, PRP and prolotherapy superior in the long-term | Cochrane | Outcome | Method not described | Subgroup analysis “excluding low-quality studies” |
Lin et al (2018) | Lateral Elbow | 6 | Botulinum toxin injection (BOTOX) | BOTOX injections superior to placebo and as effective as corticosteroid injections (though less effective for short-term pain) | Cochrane | Study | Overall RoB not determined | Evidence grading (arbitrary) |
Louwerens et al (2014)* | Shoulder calcific | 20 | Minimally invasive therapies | High-energy ESWT safe and effective in short- and mid-term | Cochrane | Study | Overall RoB not determined | Evidence grading (GRADE) |
Martimbianco et al (2020) | Achilles | 4 | Laser therapy | Inconclusive due to low quality of evidence | Cochrane | Study | Overall RoB not determined | Subgroup analysis “excluding low-quality studies” and evidence grading (GRADE) |
Mendonca et al (2020) | Patellar | 9 | Conservative treatment | Inconclusive due to low quality of evidence | PEDro | Study | “High risk” <5 | Evidence grading (GRADE) |
Miller et al (2017) | All | 16 | PRP | PRP more efficacious than control | Cochrane | Study | Overall RoB not determined | None |
Mohamadi et al (2017) | Shoulder | 14 | Corticosteroid injections | Corticosteroid injections provide minimal transient pain relief in a small number of patients | Cochrane, Jadad | Study | Overall RoB not determined | None for Cochrane tool |
Murphy et al (2019) | Achilles | 7 | Heavy eccentric calf training (HECT) | HECT may be superior to no treatment and traditional physiotherapy but inferior to other exercise interventions | RoB 2 | Study | According to tool instructions | Evidence grading (GRADE) |
Ortega-Castillo & Medina-Porqueres (2016) | Shoulder & Lateral elbow | 12 | Eccentric exercise | Eccentric exercise effective for pain and strength but its effectiveness compared to other treatments remains questionable | PEDro | Study | Overall RoB not determined | Evidence grading (Cochrane BRG) |
Sussmilch-Leitch et al (2012) | Achilles | 19 | Physical therapies | Eccentric exercise recommended as first line with or without laser therapy. ESWT may be equally effective | Modified PEDro, Cochrane | Study | “High risk” if <3 “low RoB” domains of Cochrane tool | Subgroup analysis “excluding low-quality studies” |
Tsikopoulos et al (2016) | All | 5 | PRP | PRP provided no more clinical benefit than placebo or dry needling | Cochrane | Study | Overall RoB not determined | None |
Toliopoulos et al (2014) | Shoulder | 15 | Surgery | Surgery no more effective than exercises. Arthroscopic surgery may be superior to open for some outcome measures | Cochrane | Study | Overall RoB not determined* | None |
Van der Vlist et al (2020) | Achilles | 29 | All | No clinically relevant difference among treatments at 3 or 12 mo follow-up | RoB 2 | Outcome | According to tool instructions | Evidence grading (GRADE) |
Wasielewski & Kotsko (2007) | Lower Limb tendinopathies | 11 | Eccentric exercise | Eccentric exercise may improve pain and strength | PEDro | Study | Overall RoB not determined | None |
Woodley et al (2007) | All | 11 | Eccentric exercise | Inconclusive due to low quality of evidence | PEDro, Cochrane BRG | Study | “High quality” if ≥6 | Evidence grading (Cochrane BRG) |
Wu et al (2017) | Shoulder calcific | 14 | Non-operative treatments | US-guided needling and ESWT (radial and high-energy focused) alleviate pain and achieve complete resolution of calcium deposits | Cochrane, PEDro | Study | Overall RoB not determined | None |
Xiong et al (2019) | Lateral Elbow | 4 | ESWT vs Corticosteroid | ESWT may be superior to corticosteroids | Jadad, Cochrane | Study | Overall RoB not determined | None |
Yan et al (2019) | Lateral Elbow | 5 | US therapy and ESWT | ESWT superior to US therapy up to 6 mo for pain and pain-free grip strength | Modified Jadad, Cochrane | Study | Overall RoB not determined | None |
Zhang et al (2019) | Shoulder calcific | 8 | US-guided lavage | US-guided lavage may be superior to ESWT in pain relief and calcification clearance | Cochrane | Study | Overall RoB not determined | None |
- Abbreviations: BRG, back review group; ESWT, extracorporeal shock wave therapy; GRADE, grading of recommendations, assessment, development and evaluations; GTN, glyceryl trinitrate; LR-PRP, leukocyte-rich platelet-rich plasma; NHMRC, national health and medical research council; NSAIDs, non-steroidal anti-inflammatory drugs; PEDro, physiotherapy evidence database scale; PRP, platelet-rich plasma; RoB, risk of bias; TENS, transcutaneous electrical nerve stimulation; US, ultrasound.
- * Scoring system used to calculate mean score of all RCTs but cutoffs for high and low risk not specified
3.1.1 Overall RoB determination
- RoB 2: according to the instructions of the tool (n = 2)
- Cochrane Collaboration tool: (a) “overall high RoB” where <3 domains had low RoB (n = 2) or where >3 domains had high RoB (n = 1); (b) “overall low RoB” where the total score of the study was >70% (out of 16; low RoB scored 2, unclear RoB 1, and high RoB 0; n = 1); (c) “good quality study” where no more than 1 of the tool's domains, precision, and external validity was rated high RoB (n = 2); (d) method not described (n = 2)
- PEDro: (a) “overall good quality/low RoB” where total score ≥6/10 (n = 4), ≥7/10 (n = 1; Lee et al), or ≥7/13 for modified PEDro (n = 1); (b) “overall low quality/high RoB” where total score <5/10 (n = 2)
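To make the practical consequence of these differing criteria concrete, the sketch below applies three of the Cochrane-tool rules listed above to one set of domain ratings. The ratings are hypothetical and the function names are assumptions introduced for this illustration; the rules themselves are transcribed from the list above.

```python
# Points used by the scoring rule: low RoB = 2, unclear RoB = 1, high RoB = 0.
POINTS = {"low": 2, "unclear": 1, "high": 0}


def high_if_fewer_than_3_low(ratings):
    """'Overall high RoB' where fewer than 3 domains have low RoB."""
    return sum(r == "low" for r in ratings) < 3


def high_if_more_than_3_high(ratings):
    """'Overall high RoB' where more than 3 domains have high RoB."""
    return sum(r == "high" for r in ratings) > 3


def low_if_score_above_70_percent(ratings):
    """'Overall low RoB' where the summed score exceeds 70% of the maximum."""
    score = sum(POINTS[r] for r in ratings)
    return score / (2 * len(ratings)) > 0.70


# Hypothetical RCT: 3 low, 3 high, and 1 unclear domain.
ratings = ["low", "low", "low", "high", "high", "high", "unclear"]

print(high_if_fewer_than_3_low(ratings))       # False: not classed as high RoB
print(high_if_more_than_3_high(ratings))       # False: not classed as high RoB
print(low_if_score_above_70_percent(ratings))  # False: 7/14 = 50%, not low RoB
```

Under the two counting rules this hypothetical RCT escapes the “high overall RoB” label even though three of its seven domains are rated high, while the scoring rule does not accept it as “low overall RoB” either.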
3.2 Assessment of consistency of risk of bias assessment
Table 2 shows the proportion of “overall high RoB” RCTs as determined by (a) the authors of the original SRs where performed, using their own “high overall RoB” criteria and (b) the first author of the present review (DC) based on the RoB assessment performed by the SR authors using our pre-defined “high overall RoB” criteria for each tool. Mean percentages were calculated for each tool.
Tool | SR | SR authors' “high overall RoB” | DC “high overall RoB” (Cochrane Collaboration) | DC “high overall RoB” (PEDro ≥6/10) | DC “high overall RoB” (PEDro ≥8/10) |
---|---|---|---|---|---|
PEDro | Bjordal et al (2008) | 1/18 (6%) | - | NA | NA |
PEDro | Chen et al (2019) | ND | - | 2/11 (18%) | 4/11 (36%) |
PEDro | Coombes et al (2010) | 23/64 (36%) | - | 29/64 (45%) | 46/64 (72%) |
PEDro | de Vos et al (2014) | 2/6 (33%) | - | 2/6 (33%)* | 4/6 (66%) |
PEDro | Haslerud et al (2015) | 0/17 (0%) | - | 3/17 (18%) | 14/17 (82%) |
PEDro | Ioppolo et al (2013) | ND | - | NA | NA |
PEDro | Lee et al (2011) | 3/9 (33%) | - | 3/9 (33%)* | 6/9 (66%) |
PEDro | Liao et al (2018) | 0/29 (0%) | - | 0/29 (0%)* | 13/29 (45%) |
PEDro | Mendonca et al (2020) | 2/9 (22%) | - | 3/9 (33%) | 5/9 (56%) |
PEDro | Ortega-Castillo & Medina-Porqueres (2016) | ND | - | 2/12 (17%) | 10/12 (83%) |
PEDro | Wasielewski & Kotsko (2007) | ND | - | 5/11 (45%) | 9/11 (82%) |
PEDro | Wu et al (2017) | ND | - | NA | NA |
PEDro | Mean Proportion | 18.6% | - | 29.2% | 65.4% |
Cochrane Collaboration | Arirachakaran et al (2016) | ND | 7/10 (70%) | - | - |
Cochrane Collaboration | Arirachakaran et al (2017) | ND | 3/7 (43%) | - | - |
Cochrane Collaboration | Bannuru et al (2014) | ND | NA | - | - |
Cochrane Collaboration | Boudreault et al (2014) | ND | 7/12 (58%) | - | - |
Cochrane Collaboration | Catapano et al (2020) | ND | 3/6 (50%) | - | - |
Cochrane Collaboration | Challoumas et al (2019a) | ND | 9/12 (75%) | - | - |
Cochrane Collaboration | Challoumas et al (2019b) | ND | 6/10 (60%) | - | - |
Cochrane Collaboration | Dan et al (2019) | ND | 2/2 (100%) | - | - |
Cochrane Collaboration | Desjardins-Charbonneau et al (2015a) | ND | 10/10 (100%) | - | - |
Cochrane Collaboration | Desjardins-Charbonneau et al (2015b) | 16/21 (76%) | 20/21 (95%) | - | - |
Cochrane Collaboration | Desmeules et al (2016a) | 8/10 (80%) | 10/10 (100%) | - | - |
Cochrane Collaboration | Desmeules et al (2016b) | ND | 6/6 (100%) | - | - |
Cochrane Collaboration | Desmeules et al (2015) | ND | 9/11 (82%) | - | - |
Cochrane Collaboration | Dong et al (2015) | 1/33 (3%) | 24/33 (73%) | - | - |
Cochrane Collaboration | Dong et al (2016) | 1/27 (4%) | 10/27 (37%) | - | - |
Cochrane Collaboration | Fitzpatrick et al (2017) | 0/18 (0%) | 13/18 (72%) | - | - |
Cochrane Collaboration | Lafrance et al (2019) | 2/3 (66%) | 2/3 (66%) | - | - |
Cochrane Collaboration | Li et al (2019) | ND | 4/7 (57%) | - | - |
Cochrane Collaboration | Lin et al (2020) | ND | 2/5 (40%) | - | - |
Cochrane Collaboration | Lin et al (2019) | 0/7 (0%) | NA | - | - |
Cochrane Collaboration | Lin et al (2018) | 0/6 (0%) | 0/6 (0%) | - | - |
Cochrane Collaboration | Louwerens et al (2014) | ND | 0/20 (0%) | - | - |
Cochrane Collaboration | Martimbianco et al (2020) | 4/4 (100%) | 1/4 (25%) | - | - |
Cochrane Collaboration | Miller et al (2017) | ND | 13/16 (81%) | - | - |
Cochrane Collaboration | Mohamadi et al (2017) | ND | 4/14 (29%) | - | - |
Cochrane Collaboration | Sussmilch-Leitch et al (2012) | 4/23 (17%) | - | 11/23 (48%)*** | 15/23 (65%)*** |
Cochrane Collaboration | Tsikopoulos et al (2016) | ND | 4/5 (80%) | - | - |
Cochrane Collaboration | Toliopoulos et al (2014) | ND | 7/15 (47%) | - | - |
Cochrane Collaboration | Xiong et al (2019) | ND | 0/4 (0%) | - | - |
Cochrane Collaboration | Yan et al (2019) | ND | 0/5 (0%) | - | - |
Cochrane Collaboration | Zhang et al (2019) | ND | 0/8 (0%) | - | - |
Cochrane Collaboration | Mean Proportion | 34.6% | 55% | - | - |
RoB 2 | Murphy et al (2019) | 2/7 (29%) | NP** | - | - |
RoB 2 | Van der Vlist et al (2020) | 21/28 (75%) | NP** | - | - |
RoB 2 | Mean Proportion | 52% | - | - | - |
- Abbreviations: NA, not available; ND, not determined; SR, systematic review; RoB, risk of bias.
- * Systematic review authors and author of present review (DC) used same criteria.
- ** Not performed as tool includes instructions on determination of overall risk of bias.
- *** Systematic review authors presented results of modified PEDro scale but assessed overall risk of bias based on Cochrane Collaboration tool.
3.2.1 Consistency among tools
Based on the overall RoB assessments reported by the authors of the original SRs, the RoB 2 tool was the most likely to determine a “high overall RoB” (mean proportion of high RoB RCTs 52%), followed by the Cochrane Collaboration tool (mean proportion 34.6%). The PEDro scale was associated with the lowest mean proportion of “high overall RoB” RCTs (18.6%).
When the pre-defined criteria of the authors of the present review were applied, the PEDro ≥ 8 was associated with the highest proportion of high RoB studies (65.4%), followed by the Cochrane Collaboration tool (55%), and finally the PEDro ≥ 6 (29.2%).
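The mean proportions quoted here are unweighted averages of the per-SR proportions, so a small SR counts as much as a large one. As a worked check, the sketch below recomputes the 18.6% figure for the PEDro scale from the SR authors' counts listed in Table 2 (SRs marked ND are left out). The function and variable names are assumptions made for this illustration.

```python
# (number of "high overall RoB" RCTs, number of RCTs assessed) per SR,
# taken from the SR authors' column of Table 2 for the PEDro scale.
pedro_sr_authors = {
    "Bjordal et al (2008)": (1, 18),
    "Coombes et al (2010)": (23, 64),
    "de Vos et al (2014)": (2, 6),
    "Haslerud et al (2015)": (0, 17),
    "Lee et al (2011)": (3, 9),
    "Liao et al (2018)": (0, 29),
    "Mendonca et al (2020)": (2, 9),
}


def mean_high_rob_percent(counts):
    """Unweighted mean of per-SR proportions of 'high overall RoB' RCTs, in %."""
    proportions = [high / total for high, total in counts.values()]
    return 100 * sum(proportions) / len(proportions)


print(round(mean_high_rob_percent(pedro_sr_authors), 1))  # 18.6
```

Because each SR contributes equally regardless of how many RCTs it assessed, this figure describes the typical SR and not the pooled set of RCTs.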
3.2.2 Consistency when different criteria used (SR authors vs authors of present review)
Where we determined “high overall RoB” using our criteria based on the RoB assessment results of the SR authors, the mean proportion of “high overall RoB” studies was substantially higher compared to that of the SR authors for the Cochrane Collaboration tool (55% vs 34.6%) and for the PEDro ≥ 8 (65.4% vs 18.6%). For the PEDro ≥ 6, the difference was smaller (29.2% vs 18.6%) as the majority of SR authors using the PEDro chose a ≥6 cutoff too. The highest variability for individual SRs between the proportion of studies with “high overall RoB” of the SR authors and ours was observed in the Cochrane tool (eg, 3% vs 73% for Dong et al29; 0% vs 72% for Fitzpatrick et al31) and the PEDro ≥ 8 (eg, 0% vs 82% for Haslerud et al32).
3.2.3 Inter-tool reliability in example systematic review
Tables 3a and 3b show the RoB assessment that we performed for the 29 RCTs of the van der Vlist et al8 SR using the Cochrane Collaboration tool (Table 3a) and PEDro scale (≥6 and ≥ 8) (Table 3b) with our criteria. Table 3c shows the RoB assessment as performed by van der Vlist et al8 using the RoB 2 tool and the results of the overall RoB assessment from the other two tools as derived from Tables 3a and 3b, highlighting the generally poor inter-tool reliability. The only comparison that produced substantial reliability (k = 0.76) was that between the Cochrane tool and the PEDro ≥ 8. Fair reliability was found for the comparisons between the Cochrane tool and the PEDro ≥ 6 (k = 0.36), the Cochrane and the RoB 2 (k = 0.29), and the RoB 2 and PEDro ≥ 8 (k = 0.26). Finally, inter-tool reliability between the RoB 2 and the PEDro ≥ 6 was only slight (k = 0.03).
Internal validity (Cochrane Collaboration tool for assessing risk of bias)

First Author (y) | Random sequence generation (selection bias) | Allocation concealment (selection bias) | Blinding of patients and staff (performance bias) | Blinding of outcome measures (detection bias) | Completeness of outcome data (attrition bias) | Selective reporting (reporting bias) | Other | Overall RoB |
---|---|---|---|---|---|---|---|---|
Balius et al (2016) | Low | ? | High | High | Low | Low | Low | High |
Bell et al (2013) | Low | Low | Low | Low | Low | Low | Low | Low |
Beyer et al (2015) | Low | ? | High | High | ? | Low | ? | High |
Boesen et al (2017) | Low | Low | Low | Low | Low | Low | Low | Low |
De Jonge et al (2010) | ? | Low | High | High | High | Low | Low | High |
De Jonge et al (2011) | Low | Low | Low | Low | Low | Low | Low | Low |
Ebbesen et al (2017) | ? | Low | Low | Low | Low | Low | High | Low |
Heinemeier et al (2017) | Low | Low | Low | Low | ? | Low | ? | Low |
Herrington & McCulloch (2007) | High | ? | High | High | Low | Low | Low | High |
Hutchison et al (2013) | Low | Low | Low | Low | Low | Low | Low | Low |
Krogh et al (2016) | Low | Low | Low | High | Low | Low | Low | High |
Lynen et al (2017) | Low | Low | High | High | Low | Low | Low | High |
Morrison et al (2017) | Low | Low | Low | High | Low | Low | Low | High |
Munteanu et al (2015) | Low | Low | Low | Low | Low | Low | Low | Low |
Njawaya et al (2018) | Low | Low | High | High | Low | High | High | High |
Pearson et al (2012) | Low | ? | High | High | High | High | High | High |
Rompe et al (2008) | Low | Low | High | High | Low | Low | Low | High |
Rompe et al (2009) | Low | Low | High | High | Low | Low | High | High |
Rompe et al (2007) | Low | Low | High | High | Low | Low | Low | High |
Roos et al (2004) | Low | ? | High | High | High | Low | High | High |
Silbernagel et al (2001) | ? | ? | High | High | High | High | High | High |
Silbernagel et al (2007) | Low | Low | High | High | ? | High | Low | High |
Stevens & Tan (2014) | ? | Low | High | High | High | Low | Low | High |
Tumilty et al (2016) | Low | Low | Low | Low | High | Low | Low | Low |
Tumilty et al (2012) | Low | Low | Low | Low | Low | Low | Low | Low |
Usuelli et al (2018) | ? | Low | High | High | Low | High | Low | High |
Yelland et al (2009) | Low | ? | High | High | Low | Low | High | High |
Zhang et al (2013) | Low | Low | High | High | Low | Low | Low | High |
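
Table 3b. RoB assessment of the same RCTs using the PEDro scale (items 1-10), with overall RoB at cutoffs of ≥6 and ≥8 (criteria of the present review)

Study | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Total score | Overall RoB (cutoff ≥ 6) | Overall RoB (cutoff ≥ 8) |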
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Balius et al (2016) | Yes | No | Yes | No | No | No | Yes | Yes | Yes | Yes | 6 | Low | High |
Bell et al (2013) | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | 9 | Low | Low |
Beyer et al (2015) | Yes | No | Yes | No | No | No | No | No | Yes | Yes | 4 | High | High |
Boesen et al (2017) | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | 9 | Low | Low |
De Jonge et al (2010) | Yes | Yes | Yes | No | No | No | No | Yes | Yes | No | 5 | High | High |
De Jonge et al (2011) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | 10 | Low | Low |
Ebbesen et al (2017) | No | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | 7 | Low | High |
Heinemeier et al (2017) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | 10 | Low | Low |
Herrington & McCulloch (2007) | No | No | Yes | No | No | No | Yes | Yes | Yes | No | 4 | High | High |
Hutchison et al (2013) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | 10 | Low | Low |
Krogh et al (2016) | Yes | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes | 8 | Low | Low |
Lynen et al (2017) | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | Yes | 7 | Low | High |
Morrison et al (2017) | Yes | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes | 8 | Low | Low |
Munteanu et al (2015) | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | 9 | Low | Low |
Njawaya et al (2018) | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | No | 6 | Low | High |
Pearson et al (2012) | Yes | No | Yes | No | No | No | No | Yes | No | No | 3 | High | High |
Rompe et al (2008) | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | Yes | 7 | Low | High |
Rompe et al (2009) | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | Yes | 7 | Low | High |
Rompe et al (2007) | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | Yes | 7 | Low | High |
Roos et al (2004) | Yes | No | No | No | No | No | No | Yes | Yes | Yes | 4 | High | High |
Silbernagel et al (2001) | Yes | No | No | No | No | No | No | Yes | No | Yes | 3 | High | High |
Silbernagel et al (2007) | Yes | Yes | Yes | No | No | No | No | Yes | No | Yes | 5 | High | High |
Stevens & Tan (2014) | No | Yes | Yes | No | No | No | No | Yes | Yes | Yes | 5 | High | High |
Tumilty et al (2016) | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | 8 | Low | Low |
Tumilty et al (2012) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | 10 | Low | Low |
Usuelli et al (2018) | No | Yes | Yes | No | No | No | Yes | Yes | Yes | No | 5 | High | High |
Yelland et al (2009) | Yes | Yes | No | No | No | No | Yes | Yes | Yes | Yes | 6 | Low | High |
Zhang et al (2013) | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | Yes | 7 | Low | High |
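
Table 3c. RoB assessment as performed by van der Vlist et al7 using the RoB 2 tool, alongside the overall RoB ratings derived from Tables 3a and 3b

Study | Randomization | Deviations from protocol | Missing data | Measurement of outcome | Selection of result | Overall RoB (RoB 2) | Cochrane (DC) | PEDro ≥ 6 (DC) | PEDro ≥ 8 (DC) |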
---|---|---|---|---|---|---|---|---|---|
Balius et al (2016) | High | Some concerns | Low | High | High | High | High | Low | High |
Bell et al (2013) | Low | Low | Low | Low | Some concerns | Some concerns | Low | Low | Low |
Beyer et al (2015) | Some concerns | Some concerns | High | High | High | High | High | High | High |
Boesen et al (2017) | High | Some concerns | Low | Low | Some concerns | High | Low | Low | Low |
De Jonge et al (2010) | Some concerns | Some concerns | Low | High | Some concerns | High | High | High | High |
De Jonge et al (2011) | Some concerns | Low | Low | Low | Some concerns | Some concerns | Low | Low | Low |
Ebbesen et al (2017) | High | High | Some concerns | Low | High | High | Low | Low | High |
Heinemeier et al (2017) | Some concerns | Low | Low | Low | Some concerns | Some concerns | Low | Low | Low |
Herrington & McCulloch (2007) | Some concerns | Low | Low | High | Some concerns | High | High | High | High |
Hutchison et al (2013) | High | High | High | Low | High | High | Low | Low | Low |
Krogh et al (2016) | Some concerns | High | High | Low | Some concerns | High | High | Low | Low |
Lynen et al (2017) | Low | Low | Some concerns | High | High | High | High | Low | High |
Morrison et al (2017) | High | Low | Low | Low | Some concerns | High | High | Low | Low |
Munteanu et al (2015) | Low | High | Low | Low | Low | High | Low | Low | Low |
Njawaya et al (2018) | Some concerns | Low | Some concerns | High | High | High | High | Low | High |
Pearson et al (2012) | Some concerns | Some concerns | High | High | Some concerns | High | High | High | High |
Rompe et al (2008) | Low | High | High | High | Some concerns | High | High | Low | High |
Rompe et al (2009) | Low | Some concerns | High | High | Some concerns | High | High | Low | High |
Rompe et al (2007) | Low | Low | Some concerns | High | Some concerns | High | High | Low | High |
Roos et al (2004) | Some concerns | High | Some concerns | Low | Some concerns | High | High | High | High |
Silbernagel et al (2001) | Some concerns | High | High | Some concerns | Some concerns | High | High | High | High |
Silbernagel et al (2007) | Some concerns | Some concerns | Some concerns | Low | Some concerns | Some concerns | High | High | High |
Stevens & Tan (2014) | Some concerns | Some concerns | Low | Some concerns | Some concerns | Some concerns | High | High | High |
Tumilty et al (2016) | Some concerns | Some concerns | Some concerns | High | Low | High | Low | Low | Low |
Tumilty et al (2012) | Low | Low | Some concerns | Low | Some concerns | Some concerns | Low | Low | Low |
Usuelli et al (2018) | Some concerns | Low | Low | High | Some concerns | High | High | High | High |
Yelland et al (2009) | Some concerns | Low | Some concerns | Some concerns | Some concerns | Some concerns | High | Low | High |
Zhang et al (2013) | Some concerns | Some concerns | Some concerns | High | Some concerns | High | High | Low | High |
Total overall RoB | - | - | - | - | - | 0 Low, 7 Some concerns, 21 High | 9 Low, 19 High, 0 Unclear | 19 Low, 9 High | 10 Low, 18 High |
Note
- DC, as determined by first author of present review.
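
The agreement statistics reported in Section 3.2.3 can be checked against the dichotomous columns of Table 3c. The following is a minimal sketch, assuming that k denotes an unweighted Cohen's kappa computed over the RCTs listed in the table; the function name `cohen_kappa` and the H/L string encoding are illustrative choices and not part of the original analysis. The RoB 2 column is left out because the way its three categories were handled in the comparison is not stated here.

```python
from collections import Counter

# Overall ratings transcribed from Table 3c, in row order (Balius ... Zhang).
# H = high overall RoB, L = low overall RoB.
COCHRANE = "HLHLHLLLHLHHHLHHHHHHHHHLLHHH"
PEDRO_6 = "LLHLHLLLHLLLLLLHLLLHHHHLLHLL"
PEDRO_8 = "HLHLHLHLHLLHLLHHHHHHHHHLLHHH"


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length rating sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("rating sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: share of studies rated identically by both tools
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the marginal category counts of each tool
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / n**2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    print(f"Cochrane vs PEDro >= 8: {cohen_kappa(COCHRANE, PEDRO_8):.2f}")
    print(f"Cochrane vs PEDro >= 6: {cohen_kappa(COCHRANE, PEDRO_6):.2f}")
```

With the ratings as transcribed, the Cochrane tool and the PEDro ≥ 8 agree on 25 of the listed RCTs, which gives a kappa of 0.76 and matches the reported value. The same calculation for the Cochrane tool and the PEDro ≥ 6 gives approximately 0.37, close to the reported k = 0.36.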
4 DISCUSSION
We have demonstrated several problems relating to the use of RoB assessment in SRs of tendinopathy management that need the attention of the research community. In our scoping review, we found that almost half of the included SRs did not use their RoB assessment in data synthesis. Additionally, only 6.5% of SRs assessed RoB on an outcome level rather than a study level, and only 30% of all SRs used their RoB assessment for evidence grading, which is the primary purpose of performing a RoB assessment. In light of the substantial subjectivity and the lack of transparency and reproducibility that govern the conduct of SRs, we strongly recommend that future SR authors determine overall RoB for each study (on an outcome level) using clear and reproducible pre-defined criteria.
Whether overall RoB should be determined for each RCT is a controversial question, and this controversy is apparent in the tools themselves. The creators of the original Cochrane Collaboration tool4 advised against rating overall RoB for each study and recommended determining overall RoB on a domain level instead; however, this recommendation was neither explained further with clear, reproducible instructions nor applicable in practice for evidence grading. The revised Cochrane Collaboration tool (RoB 2)5 published last year includes instructions on determining overall RoB for each study; however, the creators highlight that this needs to be done on an outcome level. Finally, the PEDro scale,6, 7 which its creators define as “a scale to measure the quality of reports of RCTs,” does not define specific criteria or score cutoffs and is often incorrectly labeled as a “quality assessment” tool rather than an “RoB” tool. In addition to internal validity (RoB), measures of study quality include external validity (generalizability) and precision (freedom from random error), neither of which is covered by the 10-item scale. This is also acknowledged by the creators themselves.7
The comparison of the likelihood of each of the three tools rating an RCT as “high overall risk” demonstrated clearly that the PEDro, as used by the SR authors, was overly generous, rating the majority of assessed RCTs (81.7%) as “low overall RoB”/“good overall quality.” It is implausible that such a substantial proportion of tendinopathy RCTs are truly at “low overall RoB”: many of them are not double-blinded (owing to their nature), and the other two RoB assessment tools identified greater proportions of “high overall RoB” RCTs. Finally, inter-tool reliability among the three tools was generally poor except for the comparison of the Cochrane Collaboration tool and the PEDro ≥ 8, which reinforces the need for the PEDro to be used with stricter criteria.
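
How strongly the PEDro overall rating depends on the chosen cutoff can be illustrated with the item-level answers from Table 3b. The snippet below is a minimal sketch, assuming that the total score is the number of “Yes” answers across the ten items and that a total at or above the cutoff is read as “low overall RoB”, as the table's columns suggest; the function name `pedro_overall` is hypothetical.

```python
def pedro_overall(items, cutoff):
    """Return (total score, overall RoB rating) for ten PEDro item answers."""
    # Total score: one point per item answered "Yes"
    total = sum(answer == "Yes" for answer in items)
    # A total at or above the cutoff is treated as low overall RoB
    rating = "Low" if total >= cutoff else "High"
    return total, rating


# Item answers for Balius et al (2016), transcribed from Table 3b
balius = ["Yes", "No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]

if __name__ == "__main__":
    for cutoff in (6, 8):
        total, rating = pedro_overall(balius, cutoff)
        print(f"cutoff >= {cutoff}: total {total}, overall RoB {rating}")
```

For this trial the total score is 6, so the same set of item answers is rated “Low” at a cutoff of ≥6 and “High” at a cutoff of ≥8.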
When we compared our own pre-defined criteria with those used by the SR authors, substantial discrepancies were apparent, especially for the Cochrane Collaboration tool. One might argue that our strict criteria resulted in a very low threshold for rating an RCT as “high overall RoB”; however, the recently published RoB 2 is very close to our criteria in that respect, as all it takes for a “high overall RoB” is high RoB in a single domain. These marked disparities reflect the considerable effects that subjectivity, inconsistency, and lack of reproducibility can have on the results of the same SRs with regard to grading the quality of evidence. If inconsistencies of this size arise merely from applying different criteria to the RoB assessment results reported by the SR authors, one can imagine how much more substantial these disparities can be when the same RCTs are assessed by different people, with different tools, using different criteria for each tool. A naturally arising question is therefore “how much bias is enough to distort the true result of an RCT?”; unfortunately, this and other similarly subjective judgments are needed for the conduct and reporting of all SRs.
The ideal RoB assessment tool does not exist. Subjectivity can never be removed completely from RoB assessment; however, it needs to be kept to a minimum and be complemented by transparency and reproducibility. These are exactly the aims of the revised Cochrane Collaboration tool, the creators of which state that they expect the new tool to be more likely to rate studies as “low overall RoB.”5 This was clearly not the case with the example SR used in the present review by van der Vlist et al8 who rated none of the 29 RCTs as “low risk.” Possible reasons include the actual presence of bias in all the included RCTs, strict thresholds used by the SR authors, or poor performance of the tool itself. The same tool applied in the other SR46 included in this review identified a much higher proportion of “low overall RoB” RCTs (4/7). Despite attempts by the creators to make the tool more user friendly and reproducible,4 there is still significant subjectivity in some of its signaling questions (eg, “could assessment have been influenced by knowledge of intervention?” or “likely that missingness depended on true value”). Importantly, however, the tool includes clear instructions on determining both RoB for each individual domain and overall RoB for each study, and this is why we advocate its use by all future SR authors.
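
The way domain-level RoB 2 judgments combine into an overall rating can be sketched as follows. This is a simplified illustration rather than the full RoB 2 algorithm: it assumes that any “High” domain gives a “high overall RoB” and that “Some concerns” in at least one domain (with no “High” domain) gives “Some concerns”, and it deliberately leaves out the judgment-based option of escalating several “Some concerns” domains to “High”. The function name `rob2_overall` is hypothetical, and the example rows are transcribed from Table 3c.

```python
def rob2_overall(domains):
    """Simplified overall RoB 2 rating from the five domain-level judgments."""
    if "High" in domains:
        # A single high-risk domain is enough for a high overall rating
        return "High"
    if "Some concerns" in domains:
        return "Some concerns"
    return "Low"


# Domain judgments transcribed from Table 3c, in the order: randomization,
# deviations from protocol, missing data, measurement of outcome, selection of result
examples = {
    "Bell et al (2013)": ["Low", "Low", "Low", "Low", "Some concerns"],
    "Munteanu et al (2015)": ["Low", "High", "Low", "Low", "Low"],
    "Silbernagel et al (2007)": ["Some concerns", "Some concerns", "Some concerns",
                                 "Low", "Some concerns"],
}

if __name__ == "__main__":
    for study, domains in examples.items():
        print(study, "->", rob2_overall(domains))
```

For these three rows the simplified rule returns “Some concerns”, “High”, and “Some concerns”, which agrees with the overall ratings shown in Table 3c.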
4.1 Recommendations
In order to minimize inconsistency in RoB assessment and its use in data synthesis, we suggest the consistent use of RoB assessment across all journals publishing SRs. This could be achieved through the use of a single RoB assessment tool, specified in the “Instructions for authors” section of each journal's website or even in the PRISMA statement57 and other SR guidance documents. Additionally, to keep subjectivity and lack of transparency to a minimum, RCT authors could include a RoB assessment of their own study (with justifications), which would remove the need for authors' judgments at an SR level. Similarly, this could be achieved by the consistent use of the same tool across publishing journals and its introduction in RCT guidance documents (eg, CONSORT).58 Finally, journals and reviewers should apply more stringent criteria before accepting low-quality RCTs and SRs with inadequate transparency and reproducibility.
5 CONCLUSION
In the present review, we demonstrate several issues regarding the use of RoB assessment in tendinopathy SRs, relating both to the tools themselves and to their use by authors. Most importantly, there appears to be a lack of understanding of the appropriate use of RoB assessment and its incorporation in data syntheses. We recommend the consistent use of a single RoB assessment tool across all publishing journals and guidance documents and the application of more stringent criteria when both RCTs and SRs are assessed for publication.
CONFLICT OF INTERESTS
The authors declare no competing financial interests.
AUTHOR CONTRIBUTIONS
DC and NLM conceived and designed the study, performed analysis, and wrote the manuscript. All authors analyzed the data.
Open Research
DATA AVAILABILITY STATEMENT
DC has access to all the data, and data are available upon request.