Volume 93, Issue 1 pp. 40-49
Research Article
Open Access

Predicting Functional Outcome Using 24-Hour Post-Treatment Characteristics: Application of Machine Learning Algorithms in the STRATIS Registry

Alicia C. Castonguay PhD

Department of Neurology, University of Toledo, Toledo, OH

Zeinab Zoghi MS

ProMedica Stroke Network, ProMedica Toledo Hospital, Toledo, OH

Osama O. Zaidat MD, MS

Neuroscience, St. Vincent Mercy Hospital, Toledo, OH

Richard E. Burgess MD, PhD

Department of Neurology, University of Toledo, Toledo, OH

Syed F. Zaidi MD

Department of Neurology, University of Toledo, Toledo, OH

ProMedica Stroke Network, ProMedica Toledo Hospital, Toledo, OH

Nils Mueller-Kronast MD

Neuroscience, Advanced Neuroscience Network, Tenet, FL

David S. Liebeskind MD

Department of Neurology, University of California Los Angeles, Los Angeles, CA

Mouhammad A. Jumaa MD

Corresponding Author

Department of Neurology, University of Toledo, Toledo, OH

ProMedica Stroke Network, ProMedica Toledo Hospital, Toledo, OH

Address correspondence to Jumaa, MD, Professor, Neurology, 2130 Central Ave, Suite 201, Toledo, OH 43606. E-mail: [email protected]

First published: 10 October 2022

Abstract

Summary for Social Media

@AliciaCastongu2, @FazalZaidi9, @oozaidat, @Mouhammad_Jumaa

Objective

Machine learning (ML) algorithms have emerged as powerful predictive tools in the field of stroke. Here, we examine the accuracy of ML models for predicting functional outcomes using 24-hour post-treatment characteristics in the Systematic Evaluation of Patients Treated With Neurothrombectomy Devices for Acute Ischemic Stroke (STRATIS) Registry.

Methods

ML models (adaptive boost, random forest [RF], classification and regression trees [CART], C5.0 decision tree [C5.0], support vector machine [SVM], least absolute shrinkage and selection operator [LASSO], and logistic regression [LR]) and traditional LR models were used to predict 90-day functional outcome (modified Rankin Scale score 0–2). The 24-hour National Institutes of Health Stroke Scale (NIHSS) score was examined as a continuous or dichotomous variable in all models. Model accuracy was assessed using the area under the receiver operating characteristic curve (AUC).

Results

The 24-hour NIHSS score was a top predictor of functional outcome in all models. ML models using the continuous 24-hour NIHSS score showed moderate-to-good predictive performance (range of mean AUC: 0.76–0.92); however, RF (AUC: 0.92 ± 0.028) outperformed all ML models, except LASSO (AUC: 0.89 ± 0.023, p = 0.0958). Importantly, RF demonstrated a significantly higher predictive value than LR (AUC: 0.87 ± 0.031, p = 0.048) and traditional LR (AUC: 0.85 ± 0.06, p = 0.035) when using the 24-hour continuous NIHSS score. Predictive accuracy was similar between the 24-hour NIHSS score dichotomous and continuous ML models.

Interpretation

In this substudy, we found similar predictive accuracy for functional outcome when using the 24-hour NIHSS score as a continuous or dichotomous variable in ML models. ML models had moderate-to-good predictive accuracy, with RF outperforming LR models. External validation of these ML models is warranted. ANN NEUROL 2023;93:40–49

Over the past decade, machine learning (ML) technology has been successfully applied in the field of acute ischemic stroke (AIS).1-6 Current applications include ML algorithms that assist in stroke diagnosis, several of which are commercially available and aid in the detection of large vessel occlusion (LVO) and the identification of patients with mismatch on perfusion imaging.7-9

As ML constitutes a diverse class of methods and can process linear and non-linear interactions within a dataset, application of ML algorithms in AIS may hold promise in the prediction of patient outcomes.6, 10, 11 Previous studies using ML models to predict outcomes in AIS patients after mechanical thrombectomy (MT) have largely reported similar predictive accuracy between traditional logistic regression methods and ML algorithms.6, 11 Although ML models were trained on both baseline and treatment variables, variable selection methods and variable availability differed across these studies.

Several studies have demonstrated the association of early neurological status (measured by 24-hour NIHSS) with 90-day functional outcome.9, 12-14 As such, the 24-hour NIHSS score remains a potential surrogate marker for long-term functional outcome in AIS patients treated with endovascular therapy (EVT). Recently, Mistry et al investigated the 24-hour NIHSS score as a predictor of 90-day functional outcome in patients from the multicenter Blood Pressure After Endovascular Stroke Therapy (BEST) study.12 The 24-hour NIHSS score, when adjusted for baseline NIHSS, was the strongest predictor of both dichotomous and ordinal 90-day functional outcomes.

In this study, we examine the predictive accuracy of multiple ML algorithms trained on 24-hour NIHSS score versus traditional logistic regression (T-LR) for predicting functional outcomes in the prospective Systematic Evaluation of Patients Treated With Neurothrombectomy Devices for Acute Ischemic Stroke (STRATIS) Registry.

Methods

Study Population

STRATIS Registry

We included patients from the STRATIS Registry,15 a multicenter, prospective, nonrandomized, observational study (https://www.clinicaltrials.gov. Unique identifier: NCT02239640) that evaluated the use of the Solitaire Revascularization Device (Solitaire) and Mindframe Capture Low Profile Revascularization (Mindframe) in patients with anterior circulation occlusions at 55 centers. Key inclusion criteria were: age ≥ 18 with confirmed, symptomatic intracranial LVO with associated symptoms; National Institutes of Health Stroke Scale (NIHSS) score of 8–30; use of a Medtronic market-released neurothrombectomy device as the initial device; premorbid modified Rankin Scale (mRS) of ≤1; and treatment ≤8 h from stroke symptom onset. Ethics approval was obtained from the institutional review board at each center, and subjects provided written informed consent prior to enrollment into the registry. Details and final results of the STRATIS Registry are published elsewhere.15

Data Processing

Supplemental Figure 1 shows the data pre-processing workflow. Of the 984 STRATIS Registry patients, we excluded those who had a baseline mRS ≥ 2, a posterior circulation occlusion, or missing IV-rtPA status, final mTICI score, 24-hour NIHSS score, or 90-day mRS.

A total of 70 variables were used as input to multiple ML algorithms (Tables 1 and 2). Missing values were imputed with mean or median values based on the skewness of the distribution.16 Univariate variable imputation was implemented using SimpleImputer in Python. Data were scaled with a min-max scaler prior to dimensionality reduction to ensure that gradient descent converges more quickly and smoothly toward the minima,17 specifically for the ML models that are trained with gradient descent.18
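The imputation and scaling steps described above can be sketched as follows. This is a minimal standard-library illustration, not the authors' pipeline (which used SimpleImputer): the skewness threshold of 1.0, the function names, and the example values are all assumptions made for this sketch.

```python
import statistics

def skewness(values):
    """Population skewness: mean of cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def impute_column(column, skew_threshold=1.0):
    """Fill None entries with the median if the observed values are skewed,
    otherwise with the mean. skew_threshold is a hypothetical cut-off."""
    observed = [v for v in column if v is not None]
    if abs(skewness(observed)) > skew_threshold:
        fill = statistics.median(observed)
    else:
        fill = statistics.fmean(observed)
    return [fill if v is None else v for v in column]

def min_max_scale(column):
    """Rescale a column linearly so its values span [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Made-up ASPECTS-like values with two missing entries (illustrative only).
aspects = [9, 8, None, 10, 9, 2, None, 9]
imputed = impute_column(aspects)   # left-skewed, so the median is used
scaled = min_max_scale(imputed)
print(imputed)
print(scaled)
```

In this toy column the single low value makes the distribution skewed, so the median rather than the mean fills the gaps.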

TABLE 1. Baseline characteristics in the overall population and mRS 0–2 versus mRS 3–6 populations in the STRATIS Registry
Features Missing data All N = 620 (N, %) mRS 0–2 N = 423 (N, %) mRS 3–6 N = 197 (N, %) p
Baseline
Age 73.0 ± 14.8 72.0 ± 15.0 75.0 ± 14.3 0.02
Gender 0.46
Female 276 (44.5) 184 (43.5) 92 (46.7)
Male 344 (55.5) 239 (56.5) 105 (53.3)
NIHSS (median, IQR) 17 [12–21] 16 [12–20] 18 [15–22] 0.005
mRS score 0.23
0 499 (80.5) 346 (81.8) 153 (77.7)
1 121 (19.5) 77 (18.2) 44 (22.3)
Acute ischemic stroke subtype 0.66
Cardioembolic 511 (82.4) 350 (82.7) 161 (81.7)
Large artery 109 (17.6) 73 (17.3) 36 (18.3)
Inpatient stroke 54 (8.7) 37 (8.7) 17 (8.6) 0.07
Transfer 273 (44.0) 172 (40.7) 101 (51.3) 0.01
Intravenous thrombolysis 416 (67.1) 286 (67.6) 130 (66) 0.69
Co-morbidities
Hypertension 443 (71.5) 285 (67.4) 158 (80.2) <0.001
Diabetes 146 (23.5) 92 (21.7) 54 (27.4) 0.12
Atrial fibrillation 229 (36.9) 150 (35.5) 79 (40.1) 0.27
Myocardial disease/coronary artery disease 156 (25.2) 109 (25.8) 47 (23.9) 0.61
Hyperlipidemia 266 (42.9) 185 (43.7) 81 (41.1) 0.54
Carotid artery disease 47 (7.6) 36 (8.5) 11 (5.6) 0.20
Peripheral artery disease 15 (2.4) 10 (2.4) 5 (2.5) 0.90
Smoking 356 (57.4) 228 (53.9) 128 (65.0) <0.001
Previous history
No pre-existing conditions 47 (7.6) 39 (9.2) 8 (4.1) 0.02
No neurological disorders 433 (69.8) 294 (69.5) 139 (70.6) 0.79
Ischemic stroke 78 (12.6) 51 (12.1) 27 (13.7) 0.56
Hemorrhagic stroke 2 (0.3) 1 (0.2) 1 (0.5) 0.58
Transient ischemic attack 35 (5.6) 23 (5.4) 12 (6.1) 0.74
Brain aneurysm 3 (0.5) 2 (0.5) 1 (0.5) 0.95
Imaging
ASPECTS (median, IQR) 293 9 [9–9] 9 [8–10] 9 [8–10] 0.59
Initial infarct of >1/3 MCA 68 (11) 43 (10.2) 25 (12.7) 0.30
Site of occlusion
Cervical carotid 13 (2.1) 7 (1.7) 6 (3.1) 0.06
Carotid T 33 (5.3) 15 (3.5) 18 (9.1) 0.004
ICA 117 (18.9) 74 (17.5) 43 (21.8) 0.20
M1 422 (68.1) 284 (67.1) 138 (70.1) 0.43
M2 151 (24.4) 108 (25.5) 43 (21.8) 0.36
M3 11 (1.8) 8 (1.9) 3 (1.5) 0.32
ACA 7 (1.1) 3 (0.7) 4 (2.0) 0.31
Time metrics
Time of onset to revascularization (minutes) (mean, SD) 306.1 ± 209.8 286.1 ± 189.5 349.0 ± 242.8 0.001
Time of onset to groin puncture (minutes) (mean, SD) 222.0 ± 105.7 214.9 ± 191.0 237.3 ± 106.1 0.01
TABLE 2. Procedural characteristics and angiographic and clinical outcomes in the overall population and mRS 0–2 versus mRS 3–6 populations in the STRATIS Registry
Features Missing data All N = 620 (N, %) mRS 0–2 N = 423 (N, %) mRS 3–6 N = 197 (N, %) p
Procedural
General anesthesia 151 (24.4) 91 (21.5) 60 (30.5) 0.005
Sedation 395 (63.7) 278 (65.7) 117 (59.4) 0.063
IA-tPA 90 (14.5) 60 (14.2) 30 (15.2) 0.12
Glycoprotein IIb/IIIa 14 (2.3) 10 (2.4) 4 (2.0) 0.80
Rescue therapy 44 (7.1) 26 (6.1) 18 (9.1) 0.35
Lowest BP reading during procedure-systolic 36 116.9 ± 21.6 117.5 ± 22.4 115.5 ± 21.9 0.32
Lowest BP reading during procedure-diastolic 35 65.2 ± 16.2 65.8 ± 16.3 63.96 ± 17.56 0.22
Stenting and/or angioplasty 68 (11) 44 (10.4) 24 (12.2) 0.51
Mechanical device 58 (9.4) 32 (7.6) 26 (13.2) 0.03
Balloon guide catheter 101 (16.3) 73 (17.3) 28 (14.2) 0.34
Balloon guide and distal aspiration 37 (6.0) 26 (6.1) 11 (5.6) 0.78
Device passes 1 [1–2] 1 [1–2] 2 [1–3] <0.001
Angiographic
Successful recanalization (mTICI≥2b) 507 (81.8) 366 (86.5) 141 (71.6) <0.001
Final mTICI <0.001
0 41 (6.6) 19 (4.5) 22 (11.2)
1 11 (1.8) 5 (1.2) 6 (3.0)
2a 61 (9.8) 33 (7.8) 28 (14.2)
2b 178 (28.7) 117 (27.9) 61 (31.0)
2c 22 (3.5) 18 (4.3) 4 (2.0)
3 307 (49.5) 231 (54.6) 76 (38.6)
Embolization into new territory 21 (3.4) 10 (2.4) 11 (5.6) 0.11
Distal emboli 76 (12.3) 53 (12.5) 23 (11.7) 0.50
Clinical outcomes
Lowest BP within 24 h post procedure-systolic 26 104.7 ± 16.9 104.5 ± 16.9 105.3 ± 18.0 0.62
Lowest BP within 24 h post procedure-diastolic 26 58.4 ± 14.7 58.4 ± 14.9 58.4 ± 15.2 0.98
Highest BP within 24 h post procedure- systolic 26 154.7 ± 23.4 153.3 ± 23.2 157.9 ± 25.4 0.031
Highest BP within 24 h post procedure- diastolic 26 84.22 ± 21.9 83.53 ± 21.7 86.20 ± 23.8 0.20
24-hour NIHSS score 6 [2–13] 4 [2–8] 13 [8–18] <0.001
Neurological deterioration 27 (4.4) 5 (1.2) 22 (11.2) <0.001
sICH 2 (0.3) 0 2 (1.0) 0.12
New ischemic stroke 11 (1.8) 7 (1.7) 4 (2.0) 0.74
Expansion of index infarct 35 (5.6) 21 (5.0) 14 (7.1) 0.28
SAH2 21 (3.4) 10 (2.4) 11 (5.6) 0.04
PH1 18 (2.9) 6 (1.4) 12 (6.1) 0.001
PH2 5 (0.8) 1 (0.2) 4 (2.0) 0.020
RIH 9 (1.5) 4 (0.9) 5 (2.5) 0.12
IVH 6 (1.0) 2 (0.5) 4 (2.0) 0.07
Malignant cerebral edema 3 (0.5) 2 (0.5) 1 (0.5) 0.95
Mass effect 19 (3.1) 4 (0.9) 15 (7.6) <0.001
Normal follow-up imaging 186 (30.0) 156 (36.9) 30 (15.2) <0.001
Infarct in new vascular territory 31 (5.0) 22 (5.2) 9 (4.6) 0.74
Radiographic mass effect 47 (7.6) 14 (3.3) 33 (16.8) <0.001

The primary outcome was defined as mRS ≤ 2 at 90 days. The 24-hour NIHSS score was examined as either a continuous or dichotomous variable in all models.

Variable Selection for ML Models

Seven ML models were used to measure the effect of the 24-hour NIHSS on predicting the 90-day functional outcome in this study. As random forest (RF), logistic regression (LR), adaptive boost, classification and regression trees (CART), C5.0 decision tree (C5.0), least absolute shrinkage and selection operator (LASSO), and support vector machine (SVM) are commonly used algorithms in clinical data analysis, these 7 algorithms were selected for this study.

To identify how many of the highest-scoring variables were needed to improve the performance of the ML models, sequential forward selection (SFS) was applied to each model and the variables were ordered from the highest score to the lowest. SFS iterated 69 times, and in each iteration it added the variable with the highest score to its output list. The SFS was fit on the training subset, and the test subset was transformed during the evaluation process. The best number of input variables for RF, LR, adaptive boost, CART, C5.0, LASSO, and SVM were 36, 10, 38, 11, 29, 50, and 20, respectively (Supplemental Table 1). The 24-hour NIHSS score was selected as one of the top input variables in all the models and was examined as both a continuous and dichotomous variable along with the other top input variables to predict the primary outcome, functional outcome (mRS 0–2) at 90 days.
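As a rough illustration of the greedy procedure, the sketch below ranks candidate variables by repeatedly adding the one that gives the best score and then keeps the best-scoring prefix. The scoring function, the variable names, and the per-variable gains are hypothetical stand-ins; in the study the score would come from fitting each ML model on the training subset.

```python
def sequential_forward_selection(features, score):
    """Greedy forward selection: at each step add the feature whose inclusion
    gives the highest score. Returns the (subset, score) pair after each step."""
    selected, history = [], []
    remaining = list(features)
    while remaining:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), score(selected)))
    return history

def best_subset(history):
    """Pick the step whose subset achieved the highest score."""
    return max(history, key=lambda item: item[1])[0]

# Hypothetical additive gains standing in for a cross-validated model score.
toy_gains = {"age": 0.05, "nihss_24h": 0.30, "noise": -0.02, "hypertension": 0.03}

def toy_score(subset):
    return 0.5 + sum(toy_gains[f] for f in subset)

history = sequential_forward_selection(list(toy_gains), toy_score)
ranking = [subset[-1] for subset, _ in history]
print(ranking)               # order in which variables were added
print(best_subset(history))  # prefix with the highest score
```

The length of the best prefix plays the role of the "best number of input variables" reported per model.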

Model Deployment

A five-fold cross-validation method was implemented to split the dataset into train and test subsets.19 In this method, the performance of each model is evaluated in 5 iterations. In each iteration, the dataset is split into train and test subsets. The test subset is the unseen portion of the dataset and is used to confirm that the ML algorithms were trained effectively. To reduce the variance of the performance estimates, stratified sampling20 was used along with the cross-validation method.21 The Grid Search method22 was used to tune the hyper-parameters. To evaluate the effect of 24-hour NIHSS on 90-day outcome (mRS ≤ 2), the 24-hour NIHSS score was fed into the models (along with the selected top variables from SFS) as either a continuous or dichotomous variable. For the dichotomous 24-hour NIHSS score, each ML model was trained 35 times with the pre-processed training subset. Following the training, the models were evaluated on the test subsets (Supplemental Figure 1).
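A stratified five-fold split can be sketched with the standard library alone. The function name and the random seed are assumptions of this sketch; the class counts (423 vs 197) are the cohort sizes reported in Table 1, used here only to show that each fold keeps roughly the overall outcome ratio.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs in which every fold
    preserves the class proportions of `labels` as closely as possible."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        yield train_idx, sorted(test_idx)

# 1 = mRS 0-2, 0 = mRS 3-6, using the cohort sizes from Table 1.
labels = [1] * 423 + [0] * 197
splits = list(stratified_k_fold(labels))
for train_idx, test_idx in splits:
    good_rate = sum(labels[i] for i in test_idx) / len(test_idx)
    print(len(train_idx), len(test_idx), round(good_rate, 3))
```

Each test fold holds about one fifth of the patients, and no patient appears in both the train and test subset of the same iteration.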

Predictive Performance

The predictive performance of all 7 ML models and traditional LR models was assessed using sensitivity, specificity, accuracy, and the area under the receiver operating characteristic curve (AUC). The performance of the models was compared by the method proposed by DeLong et al,23 which uses the identity between the AUC and the Wilcoxon–Mann–Whitney test statistic to estimate the correlation between the AUCs. All p-values were calculated in Python 3.10.0. A threshold of ≤0.05 was considered significant.
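The identity mentioned above can be made concrete: the AUC equals the proportion of (positive, negative) pairs in which the positive case receives the higher predicted score, with ties counted as one half. The sketch below computes only this point estimate; it does not implement the DeLong variance or the test itself, and the scores are made up.

```python
def auc_mann_whitney(scores_pos, scores_neg):
    """AUC as the normalized Mann-Whitney U statistic: the fraction of
    positive/negative pairs ranked correctly, counting ties as 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical predicted probabilities of a good outcome (mRS 0-2).
good_outcome_scores = [0.9, 0.8, 0.4]
poor_outcome_scores = [0.7, 0.3]
auc = auc_mann_whitney(good_outcome_scores, poor_outcome_scores)
print(round(auc, 3))  # 5 of the 6 pairs are ordered correctly
```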

Model Calibration and Sample Size

Calibration curves were used to assess model calibration for regression-based and ML algorithms with clinical knowledge-based or SFS feature selection methods (Supplemental Figure 2A and B; Supplemental Table 2). In perfectly calibrated models, all points align on a 45° diagonal line in relation to the x and y axes. Poor model calibration is denoted by curve deviation from this 45° line. Most models derived from SFS-selected variables were better calibrated than those with clinical knowledge-based variables. The confidence of models in predicting the 90-day mRS score was measured by the Hosmer–Lemeshow test (Supplemental Table 3). The minimum sample size required for 90-day mRS score prediction was estimated using NX Cross Validation method24 (Supplemental Figure 3).
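To show what a calibration curve and the Hosmer–Lemeshow statistic measure, the sketch below groups patients by predicted probability, compares the mean prediction with the observed event rate in each group, and sums the standardized squared differences. The grouping into equal-size bins, the function names, and the toy data are assumptions; the p-value (a chi-square tail probability) is deliberately left out.

```python
def calibration_bins(y_true, y_prob, n_bins=10):
    """Sort by predicted probability, split into n_bins groups of near-equal
    size, and return (mean predicted, observed rate, group size) per group."""
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    bins = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        bins.append((mean_pred, obs_rate, len(chunk)))
    return bins

def hosmer_lemeshow_statistic(bins):
    """Sum over groups of (observed - expected)^2 / (n * p * (1 - p))."""
    stat = 0.0
    for mean_pred, obs_rate, size in bins:
        expected = mean_pred * size
        observed = obs_rate * size
        denom = expected * (1.0 - mean_pred)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat

# Toy, perfectly calibrated predictions: 2/10 events at p=0.2, 8/10 at p=0.8.
y_prob = [0.2] * 10 + [0.8] * 10
y_true = [1, 1] + [0] * 8 + [1] * 8 + [0, 0]
bins = calibration_bins(y_true, y_prob, n_bins=2)
hl = hosmer_lemeshow_statistic(bins)
print(bins, hl)
```

Points where the mean prediction equals the observed rate sit on the 45° line, and the statistic grows as groups move away from it.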

Statistical analyses were performed using SPSS v27. The multivariate analysis (traditional logistic regression) was done with multinomial logistic regression and the univariate analysis was done with binary logistic regression. A threshold of <0.05 was considered significant.

Results

Study Population Characteristics

Baseline

A total of 620 patients were included in the final dataset and analysis, of which 423 (68.2%) had a 90-day mRS ≤ 2 and 197 (31.8%) had mRS > 2 (Table 1). Mean age (72.0 ± 15.0 vs 75.0 ± 14.2, p = 0.018) and median baseline NIHSS (16 vs 18, p = 0.005) were significantly lower in the mRS ≤ 2 cohort compared to patients with mRS > 2. No difference in baseline ASPECTS was observed between the groups. In the mRS ≤ 2 cohort, patients had lower rates of hypertension (67.4% vs 80.2%, p < 0.001) and smoking history (53.9% vs 65.0%, p < 0.001). The majority of patients had M1 occlusion (67.1% vs 70.1%, p = 0.43); however, patients with mRS > 2 had a higher rate of carotid terminus occlusions (9.1% vs 3.5%, p = 0.004).

Procedural

General anesthesia was used more frequently in patients with mRS > 2 (30.5% vs 21.5%, p = 0.005) (Table 2). Those in the mRS ≤ 2 group had shorter times from symptom onset to groin puncture (214.9 ± 191.0 vs 237.3 ± 106.1, p = 0.014) and revascularization (286.1 ± 189.5 vs 349.0 ± 242.8, p = 0.001). Successful revascularization (mTICI ≥ 2b) was achieved more frequently in patients with mRS ≤ 2 (86.5% vs 71.6%, p < 0.001). Median total number of passes was also significantly lower in the mRS ≤ 2 cohort (1 [IQR 1–2] vs 2 [IQR 1–3], p < 0.001).

Outcomes

At 24 h, the median NIHSS score was higher in the mRS > 2 group (13 [IQR 8–18] vs 4 [IQR 2–8], p < 0.001) (Table 2). No difference was seen in rates of sICH between the groups (0% vs 1.0%, p = 0.12). Patients with mRS > 2 had a higher rate of mass effect compared to those with good functional outcome (7.6% vs 0.9%, p < 0.001).

Comparison of Machine Learning Algorithms

Variable Selection

Supplemental Figure 1 describes the overall process for data pre-processing and model training and testing. For each of the 7 ML models (RF, C5.0, LASSO, SVM, CART, adaptive boost, and LR), variable importance for prediction of 90-day functional outcome was calculated and variables were selected by SFS (Supplemental Table 1). Only the top variables selected for a given ML model were fed into that model in the training and testing process. All the models, except C5.0, ranked 24-hour NIHSS score as the top predictor of 90-day functional outcome.

ML Model Training and Testing

Continuous 24-Hour NIHSS Score

Seven ML models were trained and tested on the continuous 24-hour NIHSS score (Supplemental Figure 1). All models had moderate to good predictive value (mean range AUC 0.76–0.92), moderate to good sensitivity (mean range: 0.85–0.93), and poor to moderate specificity (mean range: 0.3–0.65) (Supplemental Figure 4a, Table 3). Of the ML models trained and tested on the continuous 24-hour NIHSS score, RF (AUC 0.92 ± 0.03) outperformed all models except LASSO (AUC 0.89 ± 0.02, p = 0.0958) (Table 4). Importantly, RF had significantly better predictive accuracy than LR (AUC 0.87 ± 0.03, p = 0.048) and traditional LR (T-LR) (AUC 0.85 ± 0.06, p = 0.035).

TABLE 3. Predictive accuracy of ML models and logistic regression using continuous versus dichotomous 24-hour NIHSS score
Model | Sensitivity (continuous) | Sensitivity (dichotomous) | Specificity (continuous) | Specificity (dichotomous) | Accuracy (continuous) | Accuracy (dichotomous) | AUC (continuous) | AUC (dichotomous) | 95% CI (continuous) | 95% CI (dichotomous) | p-value*
RF | 0.93 ± 0.03 | 0.91 ± 0.05 | 0.65 ± 0.12 | 0.60 ± 0.11 | 0.85 ± 0.04 | 0.79 ± 0.03 | 0.92 ± 0.03 | 0.90 ± 0.026 | [0.895–0.945] | [0.877–0.923] | 0.31
LASSO | 0.92 ± 0.07 | 0.90 ± 0.07 | 0.58 ± 0.10 | 0.66 ± 0.15 | 0.81 ± 0.04 | 0.82 ± 0.04 | 0.89 ± 0.02 | 0.91 ± 0.040 | [0.870–0.910] | [0.875–0.945] | 0.19
SVM | 0.90 ± 0.05 | 0.90 ± 0.09 | 0.63 ± 0.01 | 0.53 ± 0.13 | 0.81 ± 0.01 | 0.79 ± 0.03 | 0.87 ± 0.02 | 0.86 ± 0.043 | [0.849–0.891] | [0.822–0.898] | 0.42
C5.0 | 0.87 ± 0.05 | 0.87 ± 0.04 | 0.67 ± 0.13 | 0.62 ± 0.13 | 0.80 ± 0.04 | 0.79 ± 0.02 | 0.86 ± 0.03 | 0.85 ± 0.034 | [0.835–0.885] | [0.820–0.880] | 0.50
Adaptive | 0.85 ± 0.05 | 0.82 ± 0.10 | 0.63 ± 0.13 | 0.57 ± 0.17 | 0.78 ± 0.04 | 0.75 ± 0.02 | 0.84 ± 0.05 | 0.82 ± 0.027 | [0.793–0.887] | [0.796–0.844] | 0.76
CART | 0.97 ± 0.04 | 0.81 ± 0.08 | 0.30 ± 0.24 | 0.68 ± 0.10 | 0.74 ± 0.07 | 0.80 ± 0.04 | 0.76 ± 0.04 | 0.84 ± 0.046 | [0.721–0.799] | [0.800–0.880] | 0.02
LR | 0.89 ± 0.08 | 0.89 ± 0.08 | 0.60 ± 0.12 | 0.60 ± 0.14 | 0.80 ± 0.05 | 0.80 ± 0.04 | 0.87 ± 0.03 | 0.87 ± 0.038 | [0.843–0.897] | [0.837–0.903] | 0.89
T-LR | 0.90 ± 0.03 | 0.89 ± 0.06 | 0.52 ± 0.10 | 0.58 ± 0.09 | 0.78 ± 0.03 | 0.79 ± 0.02 | 0.85 ± 0.06 | 0.87 ± 0.04 | [0.802–0.898] | [0.837–0.903] | 0.35
  • Adaptive = adaptive boost; C5.0 = decision tree; CART = classification and regression tree; LASSO = least absolute shrinkage and selection operator; LR = logistic regression; RF = random forest; SVM = support vector machine; T-LR = traditional logistic regression.
  • * p-value for comparison of AUC between the continuous and dichotomous models.
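As a reading aid for the 95% CI columns in Table 3: the article does not state how the intervals were computed, but most of them are numerically consistent with a normal-approximation interval around the mean AUC of the five cross-validation folds, mean ± 1.96 × SD / √5. The sketch below reproduces the RF interval under that assumption, using the SD of 0.028 reported in Table 4; the function name is hypothetical.

```python
import math

def fold_mean_ci(mean, sd, n_folds=5, z=1.96):
    """Normal-approximation interval for a mean over n_folds values.
    Assumed formula; the article does not specify its CI method."""
    half_width = z * sd / math.sqrt(n_folds)
    return mean - half_width, mean + half_width

low, high = fold_mean_ci(0.92, 0.028)  # RF, continuous 24-hour NIHSS
print(f"[{low:.3f}-{high:.3f}]")       # prints [0.895-0.945]
```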
TABLE 4. Comparison of ML models and traditional logistic regression (continuous and dichotomous (≤6) 24-hour NIHSS score); cells are p-values for pairwise comparison of AUC

Continuous 24-hour NIHSS models
Model | AUC | RF | LR | C5.0 | SVM | Adaptive | CART | LASSO | T-LR
RF | 0.92 ± 0.028 | 1 | 0.048 | 0.019 | 0.035 | 0.023 | 0.002 | 0.096 | 0.035
LR | 0.87 ± 0.031 | 0.048 | 1 | 0.25 | 0.9 | 0.33 | 0.01 | 0.31 | 0.62
C5.0 | 0.86 ± 0.029 | 0.019 | 0.25 | 1 | 0.39 | 0.38 | 0.033 | 0.067 | 0.98
SVM | 0.87 ± 0.024 | 0.035 | 0.9 | 0.39 | 1 | 0.37 | 0.021 | 0.36 | 0.59
Adaptive | 0.84 ± 0.054 | 0.023 | 0.33 | 0.38 | 0.37 | 1 | 0.062 | 0.092 | 0.92
CART | 0.76 ± 0.044 | 0.002 | 0.01 | 0.033 | 0.021 | 0.062 | 1 | 0.0081 | 0.059
LASSO | 0.89 ± 0.023 | 0.096 | 0.31 | 0.067 | 0.36 | 0.092 | 0.0081 | 1 | 0.097
T-LR | 0.85 ± 0.06 | 0.035 | 0.62 | 0.98 | 0.59 | 0.92 | 0.059 | 0.097 | 1

Dichotomous (≤6) 24-hour NIHSS models
Model | AUC | RF | LR | C5.0 | SVM | Adaptive | CART | LASSO | T-LR
RF | 0.90 ± 0.026 | 1 | 0.31 | 0.051 | 0.056 | 0.012 | 0.058 | 0.24 | 0.39
LR | 0.87 ± 0.038 | 0.31 | 1 | 0.03 | 0.046 | 0.01 | 0.18 | 0.52 | 0.61
C5.0 | 0.85 ± 0.034 | 0.051 | 0.03 | 1 | 0.7 | 0.15 | 0.71 | 0.03 | 0.049
SVM | 0.86 ± 0.043 | 0.056 | 0.046 | 0.7 | 1 | 0.11 | 0.77 | 0.03 | 0.04
Adaptive | 0.82 ± 0.027 | 0.012 | 0.01 | 0.15 | 0.11 | 1 | 0.66 | 0.008 | 0.029
CART | 0.84 ± 0.046 | 0.058 | 0.18 | 0.71 | 0.77 | 0.66 | 1 | 0.03 | 0.22
LASSO | 0.91 ± 0.04 | 0.24 | 0.52 | 0.03 | 0.03 | 0.008 | 0.03 | 1 | 0.21
T-LR | 0.87 ± 0.04 | 0.39 | 0.61 | 0.049 | 0.04 | 0.029 | 0.22 | 0.21 | 1
  • Adaptive = adaptive boost; C5.0 = decision tree; CART = classification and regression tree; LASSO = least absolute shrinkage and selection operator; LR = logistic regression; RF = random forest; SVM = support vector machine; T-LR = traditional logistic regression.

Dichotomous 24-Hour NIHSS Score

To test if a dichotomous 24-hour NIHSS score would have better predictive accuracy than continuous 24-hour NIHSS score, we examined different cutoffs. A cutoff of ≤6 was chosen for the dichotomous 24-hour NIHSS, as this cutoff showed the best predictive accuracy for the models.
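The idea of scanning candidate cutoffs can be illustrated with a deliberately simplified sketch. Here the dichotomized score alone is used as the prediction and plain accuracy is the selection criterion, whereas the authors compared cutoffs by the predictive accuracy of their ML models. The data, function names, and candidate range are hypothetical.

```python
def dichotomize(nihss_scores, cutoff):
    """Return 1 where the 24-hour NIHSS score is at or below the cutoff."""
    return [1 if s <= cutoff else 0 for s in nihss_scores]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the observed labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def best_cutoff(nihss_scores, good_outcome, candidates):
    """Choose the cutoff whose dichotomized score best matches the outcome."""
    return max(
        candidates,
        key=lambda c: accuracy(good_outcome, dichotomize(nihss_scores, c)),
    )

# Made-up 24-hour NIHSS scores and 90-day outcomes (1 = mRS 0-2).
nihss = [1, 2, 4, 5, 6, 6, 7, 8, 9, 13, 15, 18]
good = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
chosen = best_cutoff(nihss, good, range(2, 13))
print(chosen, accuracy(good, dichotomize(nihss, chosen)))
```

In this toy dataset a cutoff of 6 misclassifies a single patient, fewer than any other candidate.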

The models were trained and tested on the dichotomous 24-hour NIHSS score (≤6). All models had moderate to good predictive value (mean range AUC 0.82–0.91), moderate to good sensitivity (mean range: 0.81–0.91), and poor to moderate specificity (mean range: 0.53–0.68) (Supplemental Figure 4b, Table 3). LASSO (AUC 0.91 ± 0.04) outperformed C5.0 (AUC 0.85 ± 0.034, p = 0.03), SVM (AUC 0.86 ± 0.043, p = 0.03), Adaptive (AUC 0.82 ± 0.027, p = 0.008), and CART (AUC 0.84 ± 0.046, p = 0.03) but had a similar predictive value to RF (AUC 0.90 ± 0.026, p = 0.24), LR (AUC 0.87 ± 0.038, p = 0.52), and T-LR (AUC 0.87 ± 0.04, p = 0.21) (Supplemental Figure 4b, Table 4).

Overall, when comparing similar ML models of continuous and dichotomous 24-hour NIHSS score, no difference in AUC was observed, with the exception of dichotomous CART (AUC 0.84 ± 0.046), which outperformed the continuous CART model (AUC 0.76 ± 0.044, p = 0.02) (Table 3).

ML versus Traditional Logistic Regression

No difference was observed when comparing our traditional LR models (i.e., known clinically relevant variables, Table 5) with ML LR models (automated variable selection) (Tables 3 and 4).

TABLE 5. Multivariate models for prediction of 90-day functional outcome (a)
Features | Continuous 24-hour NIHSS score: 95% CI | OR | Log odds | p | Dichotomized (≤6) 24-hour NIHSS score: 95% CI | OR | Log odds | p
Age | 0.961–1.003 | 0.981 | −0.019 | 0.076 | 0.975–1.006 | 0.990 | −0.010 | 0.232
Baseline NIHSS | 0.961–1.067 | 1.013 | 0.013 | 0.636 | 0.941–1.023 | 0.981 | −0.019 | 0.378
Hypertension (HTN) | 0.198–0.776 | 0.392 | −0.936 | 0.007 | 0.319–0.969 | 0.556 | −0.587 | 0.038
Time of onset to revascularization | 0.999–1.001 | 1.000 | 0.000 | 0.896 | 0.999–1.001 | 1.000 | 0.000 | 0.376
Time of onset to groin puncture | 0.997–1.002 | 0.999 | −0.001 | 0.560 | 0.997–1.002 | 1.000 | 0.000 | 0.771
Successful recanalization | 0.563–6.727 | 1.946 | 0.666 | 0.293 | 0.945–7.037 | 2.579 | 0.947 | 0.064
24-Hour NIHSS score | 0.776–0.857 | 0.815 | −0.205 | <0.001 | 3.654–9.346 | 5.844 | 1.765 | <0.001
Carotid terminus occlusion | 0.275–0.997 | 0.523 | −0.648 | 0.049 | 0.284–0.873 | 0.498 | −0.697 | 0.015
No pre-existing conditions | 0.315–3.765 | 1.089 | 0.085 | 0.893 | 0.423–3.564 | 1.228 | 0.205 | 0.706
Tobacco use | 0.578–1.218 | 0.839 | −0.176 | 0.355 | 0.536–1.039 | 0.746 | −0.293 | 0.083
Presence of neurological deterioration | 0.456–7.730 | 1.878 | 0.630 | 0.383 | 0.118–1.110 | 0.361 | −1.019 | 0.075
Highest systolic BP within 24 h post procedure | 1.001–1.025 | 1.013 | 0.013 | 0.041 | 0.993–1.012 | 1.002 | 0.002 | 0.635
SAH2 | 0.277–4.668 | 1.138 | 0.129 | 0.858 | 0.272–2.491 | 0.823 | −0.195 | 0.730
PH1 | 0.039–0.582 | 0.151 | −1.890 | 0.006 | 0.074–0.806 | 0.244 | −1.411 | 0.021
PH2 | 0.042–6.543 | 0.526 | −0.642 | 0.617 | 0.028–4.337 | 0.348 | −1.056 | 0.412
Normal follow-up imaging | 1.216–4.077 | 2.226 | 0.800 | 0.010 | 1.269–3.584 | 2.133 | 0.758 | 0.004
Radiographic mass effect | 0.139–0.829 | 0.340 | −1.079 | 0.018 | 0.162–0.743 | 0.347 | −1.058 | 0.006
Total pass number | 0.661–1.131 | 0.865 | −0.145 | 0.289 | 0.673–1.024 | 0.830 | −0.186 | 0.082
Mechanical device | 0.187–1.058 | 0.445 | −0.810 | 0.067 | 0.301–1.222 | 0.607 | −0.499 | 0.162
General anesthesia | 0.452–1.138 | 0.717 | −0.333 | 0.158 | 0.588–0.953 | 0.748 | −0.290 | 0.019
Transfer | 0.598–1.814 | 1.041 | 0.040 | 0.887 | 0.662–1.715 | 1.065 | 0.063 | 0.795
  • a Models for the prediction of 90-day functional outcome included either dichotomized or continuous 24-hour NIHSS score, as noted.
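For readers relating the OR and log odds columns of Table 5: the odds ratio is the exponential of the logistic regression coefficient (the log odds). The short check below applies that standard relationship to two rows of the continuous model; the helper name is an assumption of this sketch.

```python
import math

def odds_ratio(log_odds):
    """Convert a logistic regression coefficient (log odds) to an odds ratio."""
    return math.exp(log_odds)

# Coefficients taken from the continuous 24-hour NIHSS model in Table 5.
or_nihss_24h = odds_ratio(-0.205)      # 24-hour NIHSS score, per point
or_hypertension = odds_ratio(-0.936)   # hypertension
print(round(or_nihss_24h, 3), round(or_hypertension, 3))
```

Read this way, each additional NIHSS point at 24 hours multiplies the odds of mRS 0–2 by about 0.815 in the continuous model, that is, roughly an 18% reduction in odds per point.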

Discussion

In this study, which used ML models to predict 90-day outcomes in the STRATIS Registry, the 24-hour NIHSS score was found to be a top predictor of functional outcome in all ML models. Although ML models using the continuous 24-hour NIHSS score showed moderate to good predictive performance (range of mean AUC 0.76–0.92), RF outperformed all ML models except LASSO. Importantly, RF demonstrated a significantly higher predictive value than LR and T-LR. When applying a dichotomous 24-hour NIHSS score, only small differences were observed in predictive accuracy (with the exception of CART) when compared to continuous 24-hour NIHSS score ML models. There was no difference in the predictive performance of traditional logistic regression and ML logistic regression models.

Previous studies investigating ML algorithms for predicting 90-day functional outcomes after EVT in AIS patients demonstrated similar predictive accuracy between ML and LR models.6, 11 Van Os et al, using data from the Multicenter Randomized Clinical Trial of Endovascular Treatment in the Netherlands (MR CLEAN) Registry, applied several ML algorithms (RF, SVM, Neural Network, and Super Learner) to predict 90-day functional outcomes after EVT, and found negligible differences in predictive performance (range of mean AUC 0.88–0.91) compared to logistic regression.6 Similar to our results, their RF model had the highest predictive value (mean AUC 0.91) of all ML models tested; however, variable selection differed between their study and the present study. As variable selection is based on variables contained in a given dataset, these differences may be attributed to variable availability between the STRATIS Registry15 and the MR CLEAN Registry.25 Variables identified as important in their RF model (Glasgow Coma Scale, creatinine, C-reactive protein, thrombocyte count) were not captured in the STRATIS Registry. Another study, which explored the use of ML algorithms (RF, CART, C5.0) to predict 90-day functional impairment risk after EVT (mRS > 2) using data from PROVE-IT and INTERRSeCT, found no difference in predictive accuracy between logistic regression and ML models (AUC range 0.65–0.72).11

In the present study, we examined both continuous and dichotomous 24-hour NIHSS score in our ML models for prediction of functional outcome. Several studies have demonstrated the association of early neurological status (measured by 24-hour NIHSS) with 90-day functional outcomes.9, 12-14, 26, 27 Mistry et al investigated the 24-hour NIHSS score as a predictor of 90-day outcome and observed that the 24-hour NIHSS score, when adjusted for baseline NIHSS, was the strongest predictor of both dichotomous and ordinal 90-day functional outcomes in the study. For prediction of mRS 0–2, the optimal threshold for 24-hour NIHSS was ≤7 (sensitivity 80.1%, specificity 80.4%, p < 0.0001).12 Another study, which sought to externally validate 24-hour NIHSS as a predictor of long-term outcome (mRS 0–2), identified a cutoff of NIHSS ≤ 8 at 24 h after MT as an independent predictor of favorable outcome.28 The results of our study further highlight the importance of the 24-hour NIHSS as a predictor of functional outcome, and suggest that the continuous 24-hour NIHSS score may serve to improve stroke outcome prediction in ML models. These results highlight the potential role of ML models in the prediction of functional outcome based on a patient's 24-hour NIHSS score and the utility of the 24-hour NIHSS as a potential surrogate marker for long-term functional outcome in AIS patients treated with EVT.

Limitations

Our study has several important limitations. As the STRATIS Registry was an observational registry, our ML models were limited to the variables that were collected within the registry.15 As such, our population was restricted to patients who were treated with MT within 8 hours of symptom onset and who had a pre-morbid mRS of 0–1 and an anterior circulation occlusion, which limits the generalizability of our models. For our analysis, we excluded patients with missing 24-hour NIHSS and 90-day mRS, which may lead to selection bias in this study. Furthermore, missing data for certain variables, such as ASPECTS, were imputed with the median score. Although we used a 5-fold cross-validation method for our models, this study lacks an external validation cohort. We used 7 different ML algorithms for prediction of 90-day outcome. In our study, RF outperformed LR when trained on the selected variables in our dataset. RF, which is an ensemble learning method, has been shown to accommodate most datasets, even those with missing data.29 However, we did not explore other algorithms, such as deep neural networks, which may also have utility for prediction of 90-day outcome.

Conclusion

In this substudy of the STRATIS Registry, when using 24-hour NIHSS score as a continuous variable, we found that RF had higher predictive accuracy than LR models and T-LR for the prediction of 90-day functional outcome. External validation of these ML models is warranted in larger datasets.

Acknowledgements

We acknowledge Oscar Bolanos (Medtronic) for his editorial support. This study was sponsored by Medtronic.

    Author Contributions

    Conception and design of the study (A.C.C., Z.Z., M.A.J.).

    Acquisition and analysis of data (A.C.C., Z.Z., M.A.J.).

    Drafting a significant portion of the manuscript or figures (A.C.C., Z.Z., O.O.Z., R.B., S.F.Z., D.L., N.M.-K., M.A.J.).

    Potential Conflicts of Interest

    Dr. Jumaa reports research grant support from Medtronic. Dr. Zaidat is a consultant and speaker for Medtronic, and reports research grant support from Medtronic Neurovascular (modest); honoraria from Medtronic Neurovascular (modest); and is a consultant/advisory board member at Medtronic Neurovascular. Dr. Mueller-Kronast is a modest consultant for Medtronic Neurovascular. Dr. Liebeskind serves as a consultant as Imaging and Angiography Core Lab. The other authors report no conflicts.

    Data Availability

    All supporting data from this study are available within the article and corresponding online-only data.
