Predicting Functional Outcome Using 24-Hour Post-Treatment Characteristics: Application of Machine Learning Algorithms in the STRATIS Registry
Abstract
Summary for Social Media
@AliciaCastongu2, @FazalZaidi9, @oozaidat, @Mouhammad_Jumaa
Objective
Machine learning (ML) algorithms have emerged as powerful predictive tools in the field of stroke. Here, we examine the accuracy of ML models for predicting functional outcomes using 24-hour post-treatment characteristics in the Systematic Evaluation of Patients Treated With Neurothrombectomy Devices for Acute Ischemic Stroke (STRATIS) Registry.
Methods
Seven ML models (adaptive boost, random forest [RF], classification and regression trees [CART], C5.0 decision tree [C5.0], support vector machine [SVM], least absolute shrinkage and selection operator [LASSO], and logistic regression [LR]) and traditional LR models were used to predict 90-day functional outcome (modified Rankin Scale score 0–2). The 24-hour National Institutes of Health Stroke Scale (NIHSS) score was examined as a continuous or dichotomous variable in all models. Model accuracy was assessed using the area under the receiver operating characteristic curve (AUC).
Results
The 24-hour NIHSS score was a top predictor of functional outcome in all models. ML models using the continuous 24-hour NIHSS score showed moderate-to-good predictive performance (range of mean AUC: 0.76–0.92); RF (AUC: 0.92 ± 0.028) outperformed all other ML models except LASSO (AUC: 0.89 ± 0.023, p = 0.0958). Importantly, RF demonstrated a significantly higher predictive value than LR (AUC: 0.87 ± 0.031, p = 0.048) and traditional LR (AUC: 0.85 ± 0.06, p = 0.035) when using the continuous 24-hour NIHSS score. Predictive accuracy was similar between the dichotomous and continuous 24-hour NIHSS score ML models.
Interpretation
In this substudy, we found similar predictive accuracy for functional outcome when using the 24-hour NIHSS score as a continuous or dichotomous variable in ML models. ML models had moderate-to-good predictive accuracy, with RF outperforming LR models. External validation of these ML models is warranted. ANN NEUROL 2023;93:40–49
Over the past decade, machine learning (ML) technology has been successfully applied in the field of acute ischemic stroke (AIS).1-6 Current applications include ML algorithms that assist in stroke diagnosis, several of which are commercially available and aid in the detection of large vessel occlusion (LVO) and the identification of patients with mismatch on perfusion imaging.7-9
As ML constitutes a diverse class of methods that can model linear and non-linear interactions within a dataset, application of ML algorithms in AIS may hold promise in the prediction of patient outcomes.6, 10, 11 Previous studies using ML models to predict outcomes in AIS patients after mechanical thrombectomy (MT) have largely reported similar predictive accuracy between traditional logistic regression methods and ML algorithms.6, 11 Although the ML models in these studies were trained on both baseline and treatment variables, the studies differed in their variable selection methods and in the variables available.
Several studies have demonstrated the association of early neurological status (measured by 24-hour NIHSS) with 90-day functional outcome.9, 12-14 As such, the 24-hour NIHSS score remains a potential surrogate marker for long-term functional outcome in AIS patients treated with endovascular therapy (EVT). Recently, Mistry et al investigated the 24-hour NIHSS score as a predictor of 90-day functional outcome in patients from the multicenter Blood Pressure After Endovascular Stroke Therapy (BEST) study.12 The 24-hour NIHSS score, when adjusted for baseline NIHSS, was the strongest predictor of both dichotomous and ordinal 90-day functional outcomes.
In this study, we examine the predictive accuracy of multiple ML algorithms trained on 24-hour NIHSS score versus traditional logistic regression (T-LR) for predicting functional outcomes in the prospective Systematic Evaluation of Patients Treated With Neurothrombectomy Devices for Acute Ischemic Stroke (STRATIS) Registry.
Methods
Study Population
STRATIS Registry
We included patients from the STRATIS Registry,15 a multicenter, prospective, nonrandomized, observational study that evaluated the use of the Solitaire Revascularization Device (Solitaire) and Mindframe Capture Low profile Revascularization (Mindframe) (https://www.clinicaltrials.gov. Unique identifier: NCT02239640) in patients with anterior circulation occlusions at 55 centers. Key inclusion criteria were: age ≥ 18 years; confirmed, symptomatic intracranial LVO; National Institutes of Health Stroke Scale (NIHSS) score of 8–30; use of a Medtronic market-released neurothrombectomy device as the initial device; premorbid modified Rankin Scale (mRS) score ≤1; and treatment within 8 h of stroke symptom onset. Ethics approval was obtained from the institutional review board at each center, and subjects provided written informed consent prior to enrollment into the registry. Details and final results of the STRATIS Registry are published elsewhere.15
Data Processing
Supplemental Figure 1 shows the data pre-processing workflow. Of the 984 STRATIS Registry patients, we excluded those with a baseline mRS ≥ 2, a posterior circulation occlusion, or missing IV-rtPA status, final mTICI score, 24-hour NIHSS score, or 90-day mRS.
A total of 70 variables were used as input to multiple ML algorithms (Tables 1 and 2). Missing values were imputed with mean or median values, depending on the skewness of the distribution.16 Univariate imputation was implemented using SimpleImputer in Python. Data were scaled with a min-max scaler prior to dimensionality reduction so that gradient descent converges more quickly and smoothly toward the minimum,17 which matters specifically for the ML models that are trained by gradient descent.18
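The imputation and scaling steps can be sketched as follows. This is a minimal illustration rather than the registry pipeline: the `skew_threshold` cutoff, the helper names, and the toy data are assumptions, since the article does not state the skewness threshold that decided between mean and median imputation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler


def skewness(column):
    """Sample skewness of the observed (non-NaN) values in one column."""
    observed = column[~np.isnan(column)]
    centered = observed - observed.mean()
    spread = observed.std()
    return 0.0 if spread == 0 else float(np.mean(centered ** 3) / spread ** 3)


def impute_and_scale(X, skew_threshold=1.0):
    """Impute each column with its mean or median, then min-max scale to [0, 1].

    skew_threshold is a hypothetical cutoff: columns whose absolute skewness
    exceeds it are treated as skewed and imputed with the median.
    """
    X = np.asarray(X, dtype=float)
    imputed = np.empty_like(X)
    for j in range(X.shape[1]):
        strategy = "median" if abs(skewness(X[:, j])) > skew_threshold else "mean"
        imputer = SimpleImputer(strategy=strategy)
        imputed[:, j] = imputer.fit_transform(X[:, [j]]).ravel()
    return MinMaxScaler().fit_transform(imputed)


# Toy data: column 0 is symmetric (mean imputation), column 1 is right-skewed
# (median imputation). Each column has one missing value.
X_toy = np.array([
    [60.0, 1.0],
    [70.0, 1.0],
    [np.nan, 2.0],
    [80.0, np.nan],
    [90.0, 40.0],
])
X_ready = impute_and_scale(X_toy)
```

In a cross-validated workflow the imputer and scaler would normally be fit on the training fold only and then applied to the test fold, so that no information from the test data leaks into pre-processing.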
Features | Missing data | All N = 620 (N, %) | mRS 0–2 N = 423 (N, %) | mRS 3–6 N = 197 (N, %) | p |
---|---|---|---|---|---|
Baseline | | | | | |
Age | | 73.0 ± 14.8 | 72.0 ± 15.0 | 75.0 ± 14.3 | 0.02 |
Gender | | | | | 0.46 |
Female | | 276 (44.5) | 184 (43.5) | 92 (46.7) | |
Male | | 344 (55.5) | 239 (56.5) | 105 (53.3) | |
NIHSS (median, IQR) | | 17 [12–21] | 16 [12–20] | 18 [15–22] | 0.005 |
mRS score | | | | | 0.23 |
0 | | 499 (80.5) | 346 (81.8) | 153 (77.7) | |
1 | | 121 (19.5) | 77 (18.2) | 44 (22.3) | |
Acute ischemic stroke subtype | | | | | 0.66 |
Cardioembolic | | 511 (82.4) | 350 (82.7) | 161 (81.7) | |
Large artery | | 109 (17.6) | 73 (17.3) | 36 (18.3) | |
Inpatient stroke | | 54 (8.7) | 37 (8.7) | 17 (8.6) | 0.07 |
Transfer | | 273 (44.0) | 172 (40.7) | 101 (51.3) | 0.01 |
Intravenous thrombolysis | | 416 (67.1) | 286 (67.6) | 130 (66) | 0.69 |
Co-morbidities | | | | | |
Hypertension | | 443 (71.5) | 285 (67.4) | 158 (80.2) | <0.001 |
Diabetes | | 146 (23.5) | 92 (21.7) | 54 (27.4) | 0.12 |
Atrial fibrillation | | 229 (36.9) | 150 (35.5) | 79 (40.1) | 0.27 |
Myocardial disease/coronary artery disease | | 156 (25.2) | 109 (25.8) | 47 (39.9) | 0.61 |
Hyperlipidemia | | 266 (42.9) | 185 (43.7) | 81 (41.1) | 0.54 |
Carotid artery disease | | 47 (7.6) | 36 (8.5) | 11 (5.6) | 0.20 |
Peripheral artery disease | | 15 (2.4) | 10 (2.4) | 5 (2.5) | 0.90 |
Smoking | | 356 (57.4) | 228 (53.9) | 128 (65.0) | <0.001 |
Previous history | | | | | |
No pre-existing conditions | | 47 (7.6) | 39 (9.2) | 8 (4.1) | 0.02 |
No neurological disorders | | 433 (69.8) | 294 (69.5) | 139 (70.6) | 0.79 |
Ischemic stroke | | 78 (12.6) | 51 (12.1) | 27 (13.7) | 0.56 |
Hemorrhagic stroke | | 2 (0.3) | 1 (0.2) | 1 (0.5) | 0.58 |
Transient ischemic attack | | 35 (5.6) | 23 (5.4) | 12 (6.1) | 0.74 |
Brain aneurysm | | 3 (0.5) | 2 (0.5) | 1 (0.5) | 0.95 |
Imaging | | | | | |
ASPECTS (median, IQR) | 293 | 9 [9–9] | 9 [8–10] | 9 [8–10] | 0.59 |
Initial infarct of >1/3 MCA | | 68 (11) | 43 (10.2) | 25 (12.7) | 0.30 |
Site of occlusion | | | | | |
Cervical carotid | | 13 (2.1) | 7 (1.7) | 6 (3.1) | 0.06 |
Carotid T | | 33 (5.3) | 15 (3.5) | 18 (9.1) | 0.004 |
ICA | | 117 (18.9) | 74 (17.5) | 43 (26.9) | 0.20 |
M1 | | 422 (68.1) | 284 (67.1) | 138 (70.1) | 0.43 |
M2 | | 151 (24.4) | 108 (25.5) | 43 (21.8) | 0.36 |
M3 | | 11 (1.8) | 8 (1.9) | 3 (1.5) | 0.32 |
ACA | | 7 (1.1) | 3 (0.7) | 4 (2.0) | 0.31 |
Time metrics | | | | | |
Time of onset to revascularization (minutes) (mean, SD) | | 306.1 ± 209.8 | 286.1 ± 189.5 | 349.0 ± 242.8 | 0.001 |
Time of onset to groin puncture (minutes) (mean, SD) | | 222.0 ± 105.7 | 214.9 ± 191.0 | 237.3 ± 106.1 | 0.01 |
Features | Missing data | All N = 620 (N, %) | mRS 0–2 N = 423 (N, %) | mRS 3–6 N = 197 (N, %) | p |
---|---|---|---|---|---|
Procedural | | | | | |
General anesthesia | | 151 (24.4) | 91 (21.5) | 60 (30.5) | 0.005 |
Sedation | | 395 (63.7) | 278 (65.7) | 117 (59.4) | 0.063 |
IA-tPA | | 90 (14.5) | 60 (14.2) | 30 (15.2) | 0.12 |
Glycoprotein IIb/IIIa | | 14 (2.3) | 10 (2.4) | 4 (2.0) | 0.80 |
Rescue therapy | | 44 (4.1) | 26 (6.1) | 18 (9.1) | 0.35 |
Lowest BP reading during procedure-systolic | 36 | 116.9 ± 21.6 | 117.5 ± 22.4 | 115.5 ± 21.9 | 0.32 |
Lowest BP reading during procedure-diastolic | 35 | 65.2 ± 16.2 | 65.8 ± 16.3 | 63.96 ± 17.56 | 0.22 |
Stenting and/or angioplasty | | 68 (11) | 44 (10.4) | 24 (12.2) | 0.51 |
Mechanical device | | 58 (9.4) | 32 (7.6) | 26 (13.2) | 0.03 |
Balloon guide catheter | | 101 (16.3) | 73 (17.3) | 28 (14.2) | 0.34 |
Balloon guide and distal aspiration | | 37 (6.0) | 26 (6.1) | 11 (5.6) | 0.78 |
Device passes | | 1 [1–2] | 1 [1–2] | 2 [1–3] | <0.001 |
Angiographic | | | | | |
Successful recanalization (mTICI≥2b) | | 507 (81.8) | 366 (86.5) | 141 (71.6) | <0.001 |
Final mTICI | | | | | <0.001 |
0 | | 41 (6.6) | 19 (4.5) | 22 (11.2) | |
1 | | 11 (1.8) | 5 (1.2) | 6 (3.0) | |
2a | | 61 (9.8) | 33 (7.8) | 28 (14.2) | |
2b | | 178 (28.7) | 117 (27.9) | 61 (31.0) | |
2c | | 22 (3.5) | 18 (4.3) | 4 (2.0) | |
3 | | 307 (49.5) | 231 (54.6) | 76 (38.6) | |
Embolization into new territory | | 21 (3.4) | 10 (2.4) | 11 (5.6) | 0.11 |
Distal emboli | | 76 (12.3) | 53 (12.5) | 23 (11.7) | 0.50 |
Clinical outcomes | | | | | |
Lowest BP within 24 h post procedure-systolic | 26 | 104.7 ± 16.9 | 14.5 ± 16.9 | 105.3 ± 18.0 | 0.62 |
Lowest BP within 24 h post procedure-diastolic | 26 | 58.4 ± 14.7 | 58.4 ± 14.9 | 58.4 ± 15.2 | 0.98 |
Highest BP within 24 h post procedure-systolic | 26 | 154.7 ± 23.4 | 153.3 ± 23.2 | 157.9 ± 25.4 | 0.031 |
Highest BP within 24 h post procedure-diastolic | 26 | 84.22 ± 21.9 | 83.53 ± 21.7 | 86.20 ± 23.8 | 0.20 |
24-hour NIHSS score | | 6 [2–13] | 4 [2–8] | 13 [8–18] | <0.001 |
Neurological deterioration | | 27 (4.4) | 5 (1.2) | 22 (11.2) | <0.001 |
sICH | | 2 (0.3) | 0 | 2 (1.0) | 0.12 |
New ischemic stroke | | 11 (1.8) | 7 (1.7) | 4 (2.0) | 0.74 |
Expansion of index infarct | | 35 (5.6) | 21 (5.0) | 14 (7.1) | 0.28 |
SAH2 | | 21 (3.4) | 10 (2.4) | 11 (5.6) | 0.04 |
PHI | | 18 (2.9) | 6 (1.4) | 12 (6.1) | 0.001 |
PH2 | | 5 (0.8) | 1 (0.2) | 4 (2.0) | 0.020 |
RIH | | 9 (1.5) | 4 (0.9) | 5 (2.5) | 0.12 |
IVH | | 6 (1.0) | 2 (0.5) | 4 (2.0) | 0.07 |
Malignant cerebral edema | | 3 (0.5) | 2 (0.5) | 1 (0.5) | 0.95 |
Mass effect | | 19 (3.1) | 4 (0.9) | 15 (7.6) | <0.001 |
Normal follow-up imaging | | 186 (30.0) | 156 (36.9) | 30 (15.2) | <0.001 |
Infarct in new vascular territory | | 31 (5.0) | 22 (5.2) | 9 (4.6) | 0.74 |
Radiographic mass effect | | 47 (7.6) | 14 (3.3) | 33 (16.8) | <0.001 |
The primary outcome was defined as mRS ≤ 2 at 90 days. The 24-hour NIHSS score was examined as either a continuous or dichotomous variable in all models.
Variable Selection for ML Models
Seven ML models were used to measure the effect of the 24-hour NIHSS on predicting the 90-day functional outcome in this study. As random forest (RF), logistic regression (LR), adaptive boost, classification and regression trees (CART), C5.0 decision tree (C5.0), least absolute shrinkage and selection operator (LASSO), and support vector machine (SVM) are commonly used algorithms in clinical data analysis, these 7 algorithms were selected for this study.
To identify how many of the highest-scoring variables would improve the performance of each ML model, sequential forward selection (SFS) was applied to each model and the variables were ordered from the highest score to the lowest. SFS iterated 69 times, and in each iteration it added the variable with the highest score to its output list. SFS was fit on the training subset, and the test subset was transformed during the evaluation process. The best number of input variables for RF, LR, adaptive boost, CART, C5.0, LASSO, and SVM was 36, 10, 38, 11, 29, 50, and 20, respectively (Supplemental Table 1). The 24-hour NIHSS score was selected as one of the top input variables in all the models and was examined as both a continuous and dichotomous variable, along with the other top input variables, to predict the primary outcome of functional outcome (mRS 0–2) at 90 days.
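Forward selection of this kind can be sketched with scikit-learn's `SequentialFeatureSelector`. The article does not name the SFS implementation, so the library choice, the synthetic data from `make_classification`, the AUC scoring, and the target of 3 features are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the registry data (8 features, binary outcome).
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Forward selection: start from no features and greedily add the feature
# that most improves the cross-validated AUC, until 3 are selected.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    scoring="roc_auc",
    cv=5,
)
selector.fit(X_train, y_train)            # fit on the training subset only
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)   # the test subset is only transformed

selected_columns = selector.get_support(indices=True)
```

Repeating the fit with different values of `n_features_to_select` and comparing held-out performance is one way to arrive at a model-specific "best number" of inputs, as reported for each algorithm above.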
Model Deployment
A 5-fold cross-validation method was implemented to split the dataset into train and test subsets.19 In this method, the performance of each model is evaluated in 5 iterations. In each iteration, the dataset is split into train and test subsets; the test subset is the unseen portion of the dataset, used to confirm that the ML algorithms were trained effectively. To reduce the variance of the performance estimates, stratified sampling20 was used along with the cross-validation method.21 The Grid Search method22 was used to tune the hyper-parameters. To evaluate the effect of 24-hour NIHSS on 90-day outcome (mRS ≤ 2), the 24-hour NIHSS score was fed into the models (along with the selected top variables from SFS) as either a continuous or dichotomous variable. For the dichotomous 24-hour NIHSS score, each ML model was trained 35 times with the pre-processed training subset. Following training, the models were evaluated on the test subsets (Supplemental Figure 1).
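A stratified 5-fold evaluation with grid-search tuning might look like the sketch below. The hyper-parameter grid, the synthetic data, and the decision to nest the grid search inside each training fold are assumptions; the article does not report the grid or whether tuning was nested.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the registry data.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           weights=[0.32, 0.68], random_state=0)

# Stratification keeps the outcome proportions similar in every fold.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}  # hypothetical grid

fold_aucs = []
for train_idx, test_idx in outer_cv.split(X, y):
    # Tune hyper-parameters using the training fold only.
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        scoring="roc_auc",
        cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    )
    search.fit(X[train_idx], y[train_idx])
    # Score the refit best model on the unseen test fold.
    probs = search.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[test_idx], probs))

mean_auc = sum(fold_aucs) / len(fold_aucs)
```

The mean and standard deviation of `fold_aucs` correspond to the "mean ± SD" style of reporting used for the AUC values in this article.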
Predictive Performance
The predictive performance of all 7 ML models and the traditional LR models was assessed using sensitivity, specificity, accuracy, and the area under the receiver operating characteristic curve (AUC). Model performance was compared using the method proposed by DeLong et al,23 which uses the identity between the AUC and the Wilcoxon-Mann–Whitney test statistic to estimate the correlation between the AUCs. All p-values were calculated in Python 3.10.0. A threshold of ≤0.05 was considered significant.
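The four metrics, and the AUC identity that the DeLong method builds on, can be illustrated on a small example. This is a sketch on made-up labels and scores; it does not implement the DeLong variance estimate or the p-value calculation, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 1 = good outcome."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity_specificity_accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


def auc_mann_whitney(y_true, scores):
    """AUC as the Mann-Whitney statistic divided by the number of pairs.

    Equals the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count one half).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for 8 patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec, acc = sensitivity_specificity_accuracy(y_true, y_pred)
auc = auc_mann_whitney(y_true, scores)
```

On this toy example the sensitivity is 0.80, the specificity is 0.67, the accuracy is 0.75, and the AUC is 0.90.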
Model Calibration and Sample Size
Calibration curves were used to assess model calibration for regression-based and ML algorithms with clinical knowledge-based or SFS feature selection methods (Supplemental Figure 2A and B; Supplemental Table 2). In perfectly calibrated models, all points align on a 45° diagonal line in relation to the x and y axes. Poor model calibration is denoted by curve deviation from this 45° line. Most models derived from SFS-selected variables were better calibrated than those with clinical knowledge-based variables. The goodness of fit of the models in predicting the 90-day mRS score was assessed with the Hosmer–Lemeshow test (Supplemental Table 3). The minimum sample size required for 90-day mRS score prediction was estimated using NX Cross Validation method24 (Supplemental Figure 3).
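The points of a calibration curve can be computed by binning predicted probabilities and comparing each bin's mean prediction with its observed event rate. The sketch below assumes equal-width bins and uses made-up predictions; the binning scheme used for the supplemental figures is not stated in the article.

```python
def calibration_points(y_true, probs, n_bins=5):
    """Return (mean predicted probability, observed event rate) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for outcome, p in zip(y_true, probs):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, outcome))
    points = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(o for _, o in members) / len(members)
            points.append((mean_pred, observed))
    return points


# Made-up, perfectly calibrated example: of 10 patients given a probability
# of 0.1, one has the event; of 10 given 0.9, nine have the event.
probs = [0.1] * 10 + [0.9] * 10
y_true = [1] + [0] * 9 + [1] * 9 + [0]

points = calibration_points(y_true, probs)
# Largest vertical distance from the 45-degree diagonal across bins.
max_deviation = max(abs(pred - obs) for pred, obs in points)
```

Here both points fall on the diagonal, so `max_deviation` is zero up to floating-point error; a poorly calibrated model would produce points that drift away from it.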
Statistical analyses were performed using SPSS v27. The multivariate analysis (traditional logistic regression) was done with multinomial logistic regression and the univariate analysis was done with binary logistic regression. A threshold of <0.05 was considered significant.
Results
Study Population Characteristics
Baseline
A total of 620 patients were included in the final dataset and analysis, of whom 423 (68.2%) had a 90-day mRS ≤ 2 and 197 (31.8%) had an mRS > 2 (Table 1). Mean age (72.0 ± 15.0 vs 75.0 ± 14.2, p = 0.018) and median baseline NIHSS (16 vs 18, p = 0.005) were significantly lower in the mRS ≤ 2 cohort than in patients with mRS > 2. No difference in baseline ASPECTS was observed between the groups. Patients in the mRS ≤ 2 cohort had lower rates of hypertension (67.4% vs 80.2%, p < 0.001) and smoking history (53.9% vs 65.0%, p < 0.001). The majority of patients in both groups had an M1 occlusion (67.1% vs 70.1%, p = 0.43); however, patients with mRS > 2 had a higher rate of carotid terminus occlusions (9.1% vs 3.5%, p = 0.004).
Procedural
General anesthesia was used more frequently in patients with mRS > 2 (30.5% vs 21.5%, p = 0.005) (Table 2). Those in the mRS ≤ 2 group had shorter times from symptom onset to groin puncture (214.9 ± 191.0 vs 237.3 ± 106.1 minutes, p = 0.014) and to revascularization (286.1 ± 189.5 vs 349.0 ± 242.8 minutes, p = 0.001). Successful revascularization (mTICI ≥ 2b) was achieved more frequently in patients with mRS ≤ 2 (86.5% vs 71.6%, p < 0.001). The median total number of passes was also significantly lower in the mRS ≤ 2 cohort (1 [IQR 1–2] vs 2 [IQR 1–3], p < 0.001).
Outcomes
At 24 h, the median NIHSS score was higher in the mRS > 2 group (13 [IQR 8–18] vs 4 [IQR 2–8], p < 0.001) (Table 2). No difference was seen in rates of sICH between the groups (0% vs 1.0%, p = 0.12). Patients with mRS > 2 had a higher rate of mass effect than those with good functional outcome (7.6% vs 0.9%, p < 0.001).
Comparison of Machine Learning Algorithms
Variable Selection
Supplemental Figure 1 describes the overall process for data pre-processing and model training and testing. For each of the 7 ML models (RF, C5.0, LASSO, SVM, CART, adaptive boost, and LR), variable importance for prediction of 90-day functional outcome was calculated and variables were selected by SFS (Supplemental Table). Only the top variables selected for a given ML model were fed into that model during training and testing. All the models, except C5.0, ranked the 24-hour NIHSS score as the top predictor of 90-day functional outcome.
ML Model Training and Testing
Continuous 24-Hour NIHSS Score
Seven ML models were trained and tested on the continuous 24-hour NIHSS score (Supplemental Figure 1). All models had moderate to good predictive value (mean range AUC 0.76–0.92), moderate to good sensitivity (mean range: 0.85–0.93), and poor to moderate specificity (mean range: 0.3–0.65) (Supplemental Figure 4a, Table 3). Of the ML models trained and tested on the continuous 24-hour NIHSS score, RF (AUC 0.92 ± 0.03) outperformed all other models except LASSO (AUC 0.89 ± 0.02, p = 0.0958) (Table 4). Importantly, RF had significantly better predictive accuracy than LR (AUC 0.87 ± 0.03, p = 0.048) and traditional LR (T-LR) (AUC 0.85 ± 0.06, p = 0.035).
Models | Sensitivity (continuous) | Sensitivity (dichotomous) | Specificity (continuous) | Specificity (dichotomous) | Accuracy (continuous) | Accuracy (dichotomous) | AUC (continuous)* | AUC (dichotomous)* | 95% CI (continuous) | 95% CI (dichotomous) | p-value* |
---|---|---|---|---|---|---|---|---|---|---|---|
RF | 0.93 ± 0.03 | 0.91 ± 0.05 | 0.65 ± 0.12 | 0.60 ± 0.11 | 0.85 ± 0.04 | 0.79 ± 0.03 | 0.92 ± 0.03 | 0.90 ± 0.026 | [0.895–0.945] | [0.877–0.923] | 0.31 |
LASSO | 0.92 ± 0.07 | 0.90 ± 0.07 | 0.58 ± 0.10 | 0.66 ± 0.15 | 0.81 ± 0.04 | 0.82 ± 0.04 | 0.89 ± 0.02 | 0.91 ± 0.040 | [0.870–0.910] | [0.875–0.945] | 0.19 |
SVM | 0.90 ± 0.05 | 0.90 ± 0.09 | 0.63 ± 0.01 | 0.53 ± 0.13 | 0.81 ± 0.01 | 0.79 ± 0.03 | 0.87 ± 0.02 | 0.86 ± 0.043 | [0.849–0.891] | [0.822–0.898] | 0.42 |
C5.0 | 0.87 ± 0.05 | 0.87 ± 0.04 | 0.67 ± 0.13 | 0.62 ± 0.13 | 0.80 ± 0.04 | 0.79 ± 0.02 | 0.86 ± 0.03 | 0.85 ± 0.034 | [0.835–0.885] | [0.820–0.880] | 0.50 |
Adaptive | 0.85 ± 0.05 | 0.82 ± 0.10 | 0.63 ± 0.13 | 0.57 ± 0.17 | 0.78 ± 0.04 | 0.75 ± 0.02 | 0.84 ± 0.05 | 0.82 ± 0.027 | [0.793–0.887] | [0.796–0.844] | 0.76 |
CART | 0.97 ± 0.04 | 0.81 ± 0.08 | 0.30 ± 0.24 | 0.68 ± 0.10 | 0.74 ± 0.07 | 0.80 ± 0.04 | 0.76 ± 0.04 | 0.84 ± 0.046 | [0.721–0.799] | [0.800–0.880] | 0.02 |
LR | 0.89 ± 0.08 | 0.89 ± 0.08 | 0.60 ± 0.12 | 0.60 ± 0.14 | 0.80 ± 0.05 | 0.80 ± 0.04 | 0.87 ± 0.03 | 0.87 ± 0.038 | [0.843–0.897] | [0.837–0.903] | 0.89 |
T-LR | 0.90 ± 0.03 | 0.89 ± 0.06 | 0.52 ± 0.10 | 0.58 ± 0.09 | 0.78 ± 0.03 | 0.79 ± 0.02 | 0.85 ± 0.06 | 0.87 ± 0.04 | [0.802–0.898] | [0.837–0.903] | 0.35 |
- Continuous and dichotomous refer to how the 24-hour NIHSS score was entered into the model.
- Adaptive = adaptive boost; C5.0 = decision tree; CART = classification and regression tree; LASSO = least absolute shrinkage and selection operator; LR = logistic regression; RF = random forest; SVM = support vector machine; T-LR = traditional logistic regression.
- * p-value for comparison of AUC.
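The article does not state how the 95% confidence intervals in Table 3 were derived. As an observation only, a normal-approximation interval around the mean of the 5 cross-validation folds (mean ± 1.96 × SD / √5), using the three-decimal SDs reported in Table 4, reproduces the continuous-model intervals for RF and LASSO. The sketch below works through that arithmetic; the formula is an assumption, not the authors' stated method.

```python
import math


def fold_mean_ci(mean_auc, sd_auc, n_folds=5, z=1.96):
    """Normal-approximation CI for a mean AUC over n_folds folds (assumed formula)."""
    half_width = z * sd_auc / math.sqrt(n_folds)
    return mean_auc - half_width, mean_auc + half_width


# RF, continuous 24-hour NIHSS: AUC 0.92 with SD 0.028.
rf_low, rf_high = fold_mean_ci(0.92, 0.028)

# LASSO, continuous 24-hour NIHSS: AUC 0.89 with SD 0.023.
lasso_low, lasso_high = fold_mean_ci(0.89, 0.023)

print(f"RF:    [{rf_low:.3f}-{rf_high:.3f}]")
print(f"LASSO: [{lasso_low:.3f}-{lasso_high:.3f}]")
```

Not every row matches this formula exactly (the T-LR interval, for example, is slightly narrower than it predicts), so it should be read as a plausible reconstruction rather than a confirmed description.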
Continuous 24-hour NIHSS models (p for row vs column model) | RF | LR | C5.0 | SVM | Adaptive | CART | LASSO | T-LR |
---|---|---|---|---|---|---|---|---|
Model AUC | 0.92 ± 0.028 | 0.87 ± 0.031 | 0.86 ± 0.029 | 0.87 ± 0.024 | 0.84 ± 0.054 | 0.76 ± 0.044 | 0.89 ± 0.023 | 0.85 ± 0.06 |
RF (0.92 ± 0.028) | 1 | 0.048 | 0.019 | 0.035 | 0.023 | 0.002 | 0.096 | 0.035 |
LR (0.87 ± 0.031) | 0.048 | 1 | 0.25 | 0.9 | 0.33 | 0.01 | 0.31 | 0.62 |
C5.0 (0.86 ± 0.029) | 0.019 | 0.25 | 1 | 0.39 | 0.38 | 0.033 | 0.067 | 0.98 |
SVM (0.87 ± 0.024) | 0.035 | 0.9 | 0.39 | 1 | 0.37 | 0.021 | 0.36 | 0.59 |
Adaptive (0.84 ± 0.054) | 0.023 | 0.33 | 0.38 | 0.37 | 1 | 0.062 | 0.092 | 0.92 |
CART (0.76 ± 0.044) | 0.002 | 0.01 | 0.033 | 0.021 | 0.062 | 1 | 0.0081 | 0.059 |
LASSO (0.89 ± 0.023) | 0.096 | 0.31 | 0.067 | 0.36 | 0.092 | 0.0081 | 1 | 0.097 |
T-LR (0.85 ± 0.06) | 0.035 | 0.62 | 0.98 | 0.59 | 0.92 | 0.059 | 0.097 | 1 |
Dichotomous (≤6) 24-hour NIHSS models (p for row vs column model) | RF | LR | C5.0 | SVM | Adaptive | CART | LASSO | T-LR |
---|---|---|---|---|---|---|---|---|
Model AUC | 0.90 ± 0.026 | 0.87 ± 0.038 | 0.85 ± 0.034 | 0.86 ± 0.043 | 0.82 ± 0.027 | 0.84 ± 0.046 | 0.91 ± 0.04 | 0.87 ± 0.04 |
RF (0.90 ± 0.026) | 1 | 0.31 | 0.051 | 0.056 | 0.012 | 0.058 | 0.24 | 0.39 |
LR (0.87 ± 0.038) | 0.31 | 1 | 0.03 | 0.046 | 0.01 | 0.18 | 0.52 | 0.61 |
C5.0 (0.85 ± 0.034) | 0.051 | 0.03 | 1 | 0.7 | 0.15 | 0.71 | 0.03 | 0.049 |
SVM (0.86 ± 0.043) | 0.056 | 0.046 | 0.7 | 1 | 0.11 | 0.77 | 0.03 | 0.04 |
Adaptive (0.82 ± 0.027) | 0.012 | 0.01 | 0.15 | 0.11 | 1 | 0.66 | 0.008 | 0.029 |
CART (0.84 ± 0.046) | 0.058 | 0.18 | 0.71 | 0.77 | 0.66 | 1 | 0.03 | 0.22 |
LASSO (0.91 ± 0.04) | 0.24 | 0.52 | 0.03 | 0.03 | 0.008 | 0.03 | 1 | 0.21 |
T-LR (0.87 ± 0.04) | 0.39 | 0.61 | 0.049 | 0.04 | 0.029 | 0.22 | 0.21 | 1 |
- Adaptive = adaptive boost; C5.0 = decision tree; CART = classification and regression tree; LASSO = least absolute shrinkage and selection operator; LR = logistic regression; RF = random forest; SVM = support vector machine; T-LR = traditional logistic regression.
Dichotomous 24-Hour NIHSS Score
To test if a dichotomous 24-hour NIHSS score would have better predictive accuracy than continuous 24-hour NIHSS score, we examined different cutoffs. A cutoff of ≤6 was chosen for the dichotomous 24-hour NIHSS, as this cutoff showed the best predictive accuracy for the models.
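Dichotomizing the score and scanning candidate cutoffs can be sketched as below. The authors selected the cutoff by model performance; this example substitutes a much simpler criterion (the accuracy of the rule "NIHSS at or below the cutoff predicts mRS 0–2") and uses made-up data constructed so that 6 comes out best, so it illustrates the procedure only and does not reproduce the finding.

```python
def dichotomize_nihss(score, cutoff=6):
    """Return 1 if the 24-hour NIHSS score is at or below the cutoff, else 0."""
    return 1 if score <= cutoff else 0


def accuracy_at_cutoff(nihss_scores, good_outcome, cutoff):
    """Accuracy of predicting good outcome (1) from the dichotomized score alone."""
    hits = sum(1 for s, g in zip(nihss_scores, good_outcome)
               if dichotomize_nihss(s, cutoff) == g)
    return hits / len(nihss_scores)


def best_cutoff(nihss_scores, good_outcome, candidates):
    """Return the candidate cutoff with the highest accuracy (first one wins ties)."""
    return max(candidates,
               key=lambda c: accuracy_at_cutoff(nihss_scores, good_outcome, c))


# Made-up 24-hour NIHSS scores and 90-day outcomes (1 = mRS 0-2).
nihss = [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 18]
good = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

chosen = best_cutoff(nihss, good, candidates=range(2, 13))
chosen_accuracy = accuracy_at_cutoff(nihss, good, chosen)
```

In the study's setting, the dichotomized variable produced by `dichotomize_nihss` would replace the continuous score as a model input alongside the other selected variables.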
The models were trained and tested on the dichotomous 24-hour NIHSS score (≤6). All models had moderate to good predictive value (mean range AUC 0.82–0.91), moderate to good sensitivity (mean range: 0.81–0.91), and poor to moderate specificity (mean range: 0.53–0.68) (Supplemental Figure 4b, Table 3). LASSO (AUC 0.91 ± 0.04) outperformed C5.0 (AUC 0.85 ± 0.034, p = 0.03), SVM (AUC 0.86 ± 0.043, p = 0.03), Adaptive (AUC 0.82 ± 0.027, p = 0.008), and CART (AUC 0.84 ± 0.046, p = 0.03) but had a similar predictive value to RF (AUC 0.90 ± 0.026, p = 0.24), LR (AUC 0.87 ± 0.038, p = 0.52), and T-LR (AUC 0.87 ± 0.04, p = 0.21) (Supplemental Figure 4b, Table 4).
Overall, when comparing similar ML models of continuous and dichotomous 24-hour NIHSS score, no difference in AUC was observed, with the exception of dichotomous CART (AUC 0.84 ± 0.046), which outperformed the continuous CART model (AUC 0.76 ± 0.044, p = 0.02) (Table 3).
ML versus Traditional Logistic Regression
No difference was observed when comparing our traditional LR models (i.e. known clinically relevant variables, Table 5) with ML LR models (automated variable selection) (Tables 3 and 4).
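Features | Continuous 24-hour NIHSS score (a): 95% CI lower | Continuous: 95% CI upper | Continuous: OR | Continuous: log odds | Continuous: p | Dichotomized (≤6) 24-hour NIHSS score (a): 95% CI lower | Dichotomized: 95% CI upper | Dichotomized: OR | Dichotomized: log odds | Dichotomized: p |
---|---|---|---|---|---|---|---|---|---|---|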
Age | 0.961 | 1.003 | 0.981 | −0.019 | 0.076 | 0.975 | 1.006 | 0.990 | −0.010 | 0.232 |
Baseline NIHSS | 0.961 | 1.067 | 1.013 | 0.013 | 0.636 | 0.941 | 1.023 | 0.981 | −0.019 | 0.378 |
Hypertension (HTN) | 0.198 | 0.776 | 0.392 | −0.936 | 0.007 | 0.319 | 0.969 | 0.556 | −0.587 | 0.038 |
Time of onset to revascularization | 0.999 | 1.001 | 1.000 | 0.000 | 0.896 | 0.999 | 1.001 | 1.000 | 0.000 | 0.376 |
Time of onset to groin puncture | 0.997 | 1.002 | 0.999 | −0.001 | 0.560 | 0.997 | 1.002 | 1.000 | 0.000 | 0.771 |
Successful recanalization | 0.563 | 6.727 | 1.946 | 0.666 | 0.293 | 0.945 | 7.037 | 2.579 | 0.947 | 0.064 |
24-Hour NIHSS score (a) | 0.776 | 0.857 | 0.815 | −0.205 | <0.001 | 3.654 | 9.346 | 5.844 | 1.765 | <0.001 |
Carotid terminus occlusion | 0.275 | 0.997 | 0.523 | −0.648 | 0.049 | 0.284 | 0.873 | 0.498 | −0.697 | 0.015 |
No pre-existing conditions | 0.315 | 3.765 | 1.089 | 0.085 | 0.893 | 0.423 | 3.564 | 1.228 | 0.205 | 0.706 |
Tobacco use | 0.578 | 1.218 | 0.839 | −0.176 | 0.355 | 0.536 | 1.039 | 0.746 | −0.293 | 0.083 |
Presence of neurological deterioration | 0.456 | 7.730 | 1.878 | 0.630 | 0.383 | 0.118 | 1.110 | 0.361 | −1.019 | 0.075 |
Highest systolic BP within 24 h post procedure | 1.001 | 1.025 | 1.013 | 0.013 | 0.041 | 0.993 | 1.012 | 1.002 | 0.002 | 0.635 |
SAH2 | 0.277 | 4.668 | 1.138 | 0.129 | 0.858 | 0.272 | 2.491 | 0.823 | −0.195 | 0.730 |
PHI | 0.039 | 0.582 | 0.151 | −1.890 | 0.006 | 0.074 | 0.806 | 0.244 | −1.411 | 0.021 |
PH2 | 0.042 | 6.543 | 0.526 | −0.642 | 0.617 | 0.028 | 4.337 | 0.348 | −1.056 | 0.412 |
Normal follow-up imaging | 1.216 | 4.077 | 2.226 | 0.800 | 0.010 | 1.269 | 3.584 | 2.133 | 0.758 | 0.004 |
Radiographic mass effect | 0.139 | 0.829 | 0.340 | −1.079 | 0.018 | 0.162 | 0.743 | 0.347 | −1.058 | 0.006 |
Total pass number | 0.661 | 1.131 | 0.865 | −0.145 | 0.289 | 0.673 | 1.024 | 0.830 | −0.186 | 0.082 |
Mechanical device | 0.187 | 1.058 | 0.445 | −0.810 | 0.067 | 0.301 | 1.222 | 0.607 | −0.499 | 0.162 |
General anesthesia | 0.452 | 1.138 | 0.717 | −0.333 | 0.158 | 0.588 | 0.953 | 0.748 | −0.290 | 0.019 |
Transfer | 0.598 | 1.814 | 1.041 | 0.040 | 0.887 | 0.662 | 1.715 | 1.065 | 0.063 | 0.795 |
- a Models for the prediction of 90-day functional outcome included either dichotomized or continuous 24-hour NIHSS score, as noted.
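The "Log odds" column in Table 5 is the natural logarithm of the odds ratio (the logistic regression coefficient), so the two columns carry the same information. The short check below recomputes the log odds for a few rows of the continuous model from their reported odds ratios; the dictionary and variable names are illustrative.

```python
import math

# Odds ratios reported for the continuous 24-hour NIHSS model (Table 5).
odds_ratios = {
    "Hypertension": 0.392,
    "Successful recanalization": 1.946,
    "Normal follow-up imaging": 2.226,
}

# Log odds (regression coefficient) = ln(odds ratio).
log_odds = {name: math.log(or_value) for name, or_value in odds_ratios.items()}

for name, value in log_odds.items():
    print(f"{name}: OR = {odds_ratios[name]:.3f}, log odds = {value:.3f}")
```

An odds ratio below 1 gives a negative coefficient (lower odds of mRS 0–2), and an odds ratio above 1 gives a positive one; exponentiating a coefficient recovers the odds ratio.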
Discussion
In this study, which used ML models to predict 90-day outcomes in the STRATIS Registry, the 24-hour NIHSS score was found to be a top predictor of functional outcome in all ML models. Although ML models using the continuous 24-hour NIHSS score showed moderate to good predictive performance (range of mean AUC 0.76–0.92), RF outperformed all other ML models except LASSO. Importantly, RF demonstrated a significantly higher predictive value than LR and T-LR. When applying a dichotomous 24-hour NIHSS score, only small differences in predictive accuracy were observed (with the exception of CART) compared with the continuous 24-hour NIHSS score ML models. There was no difference in the predictive performance of traditional logistic regression and ML logistic regression models.
Previous studies investigating ML algorithms for predicting 90-day functional outcomes after EVT in AIS patients demonstrated similar predictive accuracy between ML and LR models.6, 11 Van Os et al, using data from the Multicenter Randomized Clinical Trial of Endovascular Treatment in the Netherlands (MR CLEAN) Registry, applied several machine learning algorithms (RF, SVM, Neural Network, and Super Learner) to predict 90-day functional outcomes after EVT, and found negligible differences in predictive performance (range of mean AUC 0.88–0.91) compared to logistic regression.6 Similar to our results, their RF model had the highest predictive value (mean AUC 0.91) of all ML models tested; however, the selected variables differed between their study and the present study. As variable selection is based on the variables contained in a given dataset, these differences may be attributed to variable availability in the STRATIS Registry15 and the MR CLEAN Registry.25 Variables identified as important in their RF model (Glasgow Coma Scale, creatinine, C-reactive protein, thrombocyte count) were not captured in the STRATIS Registry. Another study, which explored the use of ML algorithms (RF, CART, C5.0) to predict 90-day functional impairment risk after EVT (mRS > 2) using data from PROVE-IT and INTERRSeCT, found no difference in predictive accuracy between logistic regression and ML models (AUC range 0.65–0.72).11
In the present study, we examined both continuous and dichotomous 24-hour NIHSS score in our ML models for prediction of functional outcome. Several studies have demonstrated the association of early neurological status (measured by 24-hour NIHSS) with 90-day functional outcomes.9, 12-14, 26, 27 Mistry et al investigated the 24-hour NIHSS score as a predictor of 90-day outcome and observed that the 24-hour NIHSS score, when adjusted for baseline NIHSS, was the strongest predictor of both dichotomous and ordinal 90-day functional outcomes. For prediction of mRS 0–2, the optimal threshold for 24-hour NIHSS was ≤7 (sensitivity 80.1%, specificity 80.4%, p < 0.0001).12 Another study, which sought to externally validate 24-hour NIHSS as a predictor of long-term outcome (mRS 0–2), identified a cutoff of NIHSS ≤ 8 at 24 h after MT as an independent predictor of favorable outcome.28 The results of our study further highlight the importance of the 24-hour NIHSS as a predictor of functional outcome and suggest that the continuous 24-hour NIHSS score may serve to improve stroke outcome prediction in ML models. These results highlight the potential role of ML models in the prediction of functional outcome based on a patient's 24-hour NIHSS score and the utility of the 24-hour NIHSS as a potential surrogate marker for long-term functional outcome in AIS patients treated with EVT.
Limitations
Our study has several important limitations. As the STRATIS Registry was an observational registry, our ML models were limited to the variables that were collected within the registry.15 In addition, our population was restricted to patients who were treated with MT within 8 hours of symptom onset and who had a pre-morbid mRS of 0–1 and an anterior circulation occlusion, which limits the generalizability of our models. For our analysis, we excluded patients with missing 24-hour NIHSS and 90-day mRS, which may have introduced selection bias. Furthermore, missing data for certain variables, such as ASPECTS, were imputed with the median score. Although we used a 5-fold cross-validation method for our models, this study lacks an external validation cohort. We used 7 different ML algorithms for prediction of 90-day outcome. In our study, RF outperformed LR when trained on the selected variables in our dataset. RF, which is an ensemble learning method, has been shown to accommodate most datasets, even those with missing data.29 However, we did not explore other algorithms, such as deep neural networks, which may also have utility for prediction of 90-day outcome.
Conclusion
In this substudy of the STRATIS Registry, when using the 24-hour NIHSS score as a continuous variable, we found that RF had higher predictive accuracy than LR and T-LR models for the prediction of 90-day functional outcome. External validation of these ML models is warranted in larger datasets.
Acknowledgements
We acknowledge Oscar Bolanos (Medtronic) for his editorial support. This study was sponsored by Medtronic.
Author Contributions
Conception and design of the study (A.C.C., Z.Z., M.A.J.).
Acquisition and analysis of data (A.C.C., Z.Z., M.A.J.).
Drafting a significant portion of the manuscript or figures (A.C.C., Z.Z., O.O.Z., R.B., S.F.Z., D.L., N.M.-K., M.A.J.).
Potential Conflicts of Interest
Dr. Jumaa reports research grant support from Medtronic. Dr. Zaidat is a consultant and speaker for Medtronic, and reports research grant support from Medtronic Neurovascular (modest); honoraria from Medtronic Neurovascular (modest); and is a consultant/advisory board member at Medtronic Neurovascular. Dr. Mueller-Kronast is a modest consultant for Medtronic Neurovascular. Dr. Liebeskind serves as a consultant as Imaging and Angiography Core Lab. The other authors report no conflicts.
Open Research
Data Availability
All supporting data from this study are available within the article and corresponding online-only data.