Volume 33, Issue 4 e14127
RESEARCH ARTICLE
Open Access

Multi-centre arousal scoring agreement in the Sleep Revolution

Henna Pitkänen

Corresponding Author

Department of Technical Physics, University of Eastern Finland, Kuopio, Finland

Diagnostic Imaging Center, Kuopio University Hospital, Kuopio, Finland

Correspondence

Henna Pitkänen, Department of Technical Physics, University of Eastern Finland, Kuopio, Finland.

Email: [email protected]

Contribution: Conceptualization, Investigation, Methodology, Funding acquisition, Writing - original draft, Writing - review & editing, Visualization, Formal analysis

Sami Nikkonen

Department of Technical Physics, University of Eastern Finland, Kuopio, Finland

Diagnostic Imaging Center, Kuopio University Hospital, Kuopio, Finland

Contribution: Data curation, Supervision, Writing - review & editing, Methodology

Marika Rissanen

Department of Technical Physics, University of Eastern Finland, Kuopio, Finland

Department of Clinical Neurophysiology, Seinäjoki Central Hospital, Seinäjoki, Finland

Contribution: Writing - review & editing

Anna Sigridur Islind

Department of Computer Science, Reykjavik University, Reykjavik, Iceland

Reykjavik University Sleep Institute, School of Technology, Reykjavik University, Reykjavik, Iceland

Contribution: Writing - review & editing

Heidur Gretarsdottir

Reykjavik University Sleep Institute, School of Technology, Reykjavik University, Reykjavik, Iceland

Contribution: Writing - review & editing, Data curation

Erna Sif Arnardottir

Reykjavik University Sleep Institute, School of Technology, Reykjavik University, Reykjavik, Iceland

Contribution: Writing - review & editing, Project administration, Funding acquisition

Timo Leppänen

Department of Technical Physics, University of Eastern Finland, Kuopio, Finland

Diagnostic Imaging Center, Kuopio University Hospital, Kuopio, Finland

School of Electrical Engineering and Computer Science, The University of Queensland, Brisbane, Queensland, Australia

Contribution: Funding acquisition, Writing - review & editing, Project administration, Supervision

Henri Korkalainen

Department of Technical Physics, University of Eastern Finland, Kuopio, Finland

Diagnostic Imaging Center, Kuopio University Hospital, Kuopio, Finland

Contribution: Supervision, Funding acquisition, Writing - review & editing, Methodology

First published: 26 December 2023
Citations: 8

Summary

We investigated arousal scoring agreement within full-night polysomnography in a multi-centre setting. Ten expert scorers from seven centres annotated 50 polysomnograms using the American Academy of Sleep Medicine guidelines. The agreement between arousal indexes (ArIs) was investigated using intraclass correlation coefficients (ICCs). Moreover, kappa statistics were used to evaluate the second-by-second agreement in whole recordings and in different sleep stages. Finally, arousal clusters, that is, periods with overlapping arousals by multiple scorers, were extracted. The overall similarity of the ArIs was fair (ICC = 0.41), varying from poor to excellent between the scorer pairs (ICC = 0.04–0.88). The ArI similarity was better in respiratory (ICC = 0.65) than in spontaneous (ICC = 0.23) arousals. The overall second-by-second agreement was fair (Fleiss’ kappa = 0.40), varying from poor to substantial depending on the scorer pair (Cohen's kappa = 0.07–0.68). Fleiss’ kappa increased from light to deep sleep (0.45, 0.45, and 0.53 for stages N1, N2, and N3, respectively), was moderate in the rapid eye movement stage (0.48), and was the lowest in the wake stage (0.25). Over half of the arousal clusters were scored by one or two scorers, and less than a third by at least five scorers. In conclusion, the scoring agreement varied depending on the arousal type, sleep stage, and scorer pair, but was overall relatively low. The most uncertain areas were related to spontaneous arousals and arousals scored in the wake stage. These results indicate that manual arousal scoring is generally not reliable, and that changes are needed in the assessment of sleep fragmentation for clinical and research purposes.

1 INTRODUCTION

Recording physiological activity during sleep is of paramount importance in the diagnosis of various sleep disorders (Troester et al., 2023). In-laboratory polysomnography (PSG) is the gold standard of sleep recordings, including electroencephalogram (EEG), chin electromyogram (EMG), and electrooculogram (EOG), which are crucial in the assessment of the sleep structure. In the clinical practice of sleep medicine, PSGs are mostly scored manually via visual inspection. This inspection often includes sleep stages (wake stage W, non-rapid eye movement stages N1, N2, and N3, and rapid eye movement stage R), arousals from sleep, respiratory, limb movement (LM), periodic limb movement (PLM), and bruxism events, among others (Troester et al., 2023). Arousals are often identified from the PSG to calculate the arousal index (ArI) and to score hypopneas not associated with a ≥3% blood oxygen desaturation (Troester et al., 2023).

According to the arousal scoring rules in the American Academy of Sleep Medicine (AASM) Manual for the Scoring of Sleep and Associated Events, Version 3 (2023) (Troester et al., 2023), an arousal should be scored in stages N1, N2, N3, or R if there is “an abrupt shift of EEG frequency including alpha, theta, and/or frequencies greater than 16 Hz (but not spindles) that lasts at least 3 s, with at least 10 s of stable sleep preceding the change”. Moreover, arousals occurring in stage R must be accompanied by “a concurrent increase in submental EMG lasting at least 1 s” (Troester et al., 2023).

Interrater variations in arousal scoring are known to exist. However, interrater agreements studied in previous literature vary and are conflicting at times. Interrater agreement in arousal scoring has been studied for example by investigating the intraclass correlation coefficients (ICCs) of the number of arousals or ArIs between scorers (Loredo et al., 1999; Magalang et al., 2013; Smurra et al., 2001; Whitney et al., 1998). Loredo et al. (1999) studied the ArI ICCs between two scorers using five different arousal scoring criteria in 40 PSGs. Between the scoring criteria, the agreement varied from slight to near perfect (ICC = 0.19–0.92), being the lowest when including <3 s arousals into the criteria, and the highest when including arousals accompanied by increases in chin EMG activity, associated with leg movements or, especially, respiratory events (Loredo et al., 1999). When the American Sleep Disorders Association (now the AASM) 1992 arousal definition (Bonnet et al., 1992) was used as the scoring criteria, the ICC between two scorers was 0.84 (Loredo et al., 1999). In addition, Smurra et al. (2001) found near perfect agreement (ArI ICC = 0.96) between two scorers in 20 PSGs using the AASM 1992 arousal criteria (Bonnet et al., 1992). However, as a part of the Sleep Heart Health Study (SHHS) (Quan et al., 1997), Whitney et al. (1998) studied the ArIs of three scorers in 30 PSGs using the same AASM 1992 arousal criteria (Bonnet et al., 1992) and found only a moderate to good agreement (ICC = 0.54–0.72), depending on the level of training and experience of the scorers. In addition, as the SHHS progressed over the years, the ArI ICCs increased to 0.72–0.78 (Ding et al., 2004). Moreover, Magalang et al. (2013) found a moderate agreement (ArI ICC = 0.68) between nine scorers in 15 PSGs, using the AASM 2007 (Iber et al., 2007) guidelines.

Besides studying the ICC of the ArIs between the scorers, arousal scoring agreement has been investigated on an event-by-event basis (Drinnan et al., 1998; Thomas, 2003). In a study by Drinnan et al. (1998), 14 scorers classified 90 pre-selected EEG segments into the 1992 AASM-defined arousals (Bonnet et al., 1992) or stable sleep. The agreement between the scorers was moderate, with a Fleiss’ kappa of 0.47. When the segments were grouped based on the sleep stage they occurred in, the Fleiss’ kappa values were 0.28 and 0.60 for Rechtschaffen and Kales (R&K)-defined (Rechtschaffen & Kales, 1968) sleep stages 1+2 and 3+4, respectively, and 0.52 for the rapid eye movement stage. In a study by Thomas (2003), two scorers classified respiratory event terminations from 17 PSGs into six different categories, one of which was the 1992 AASM-defined arousal (Bonnet et al., 1992). The agreement between the scorers, involving all categories, was 90.5% when both EEG and respiratory signals were used in the scoring process, but dropped to 58.7% when using the EEG signal only.

Going beyond arousal numbers and the agreement in selected EEG segments, Ruehland et al. (2011, 2015) studied the temporal arousal scoring agreement during whole-night recordings by using a kappa value modified for continuous measurements (Conger, 1985). In addition to the temporal agreement, they investigated the proportion of specific agreement (PSA) (Fleiss et al., 2003) based on whether the scored arousals overlapped. When they used the AASM 1992 rules (Bonnet et al., 1992) between four scorers in 15 PSGs, the kappa value was 0.53 and the PSA was 66% (Ruehland et al., 2015). When using the AASM 2007 rules (Iber et al., 2007) between three scorers in 10 PSGs, the kappa value was 0.42 and the PSA was 58% (Ruehland et al., 2011).

Although the interrater agreement in arousal scoring has been studied, the focus has mostly been on the number of arousals and selected EEG segments, rather than on the temporal agreement between the scorers throughout the whole recording. A detailed analysis scrutinising arousal types and durations has not been conducted. Moreover, most of the previous studies have had either a small number of scorers or a small number of recordings. In addition, most studies including more than two scorers (Drinnan et al., 1998; Magalang et al., 2013; Ruehland et al., 2011; Ruehland et al., 2015; Smurra et al., 2001) have only reported one parameter for all scorers, without investigating the variation in agreement between individual scorer pairs. Only the SHHS papers (Ding et al., 2004; Whitney et al., 1998) reported additional ICC values between just the two most experienced of the three scorers. Furthermore, there are some limitations when using ICC (Shrout & Fleiss, 1979) as a descriptive measure, as guidelines for choosing and interpreting different ICC models are not widely known (Koo & Li, 2016; McGraw & Wong, 1996). Out of the studies investigating ICCs (Loredo et al., 1999; Magalang et al., 2013; Smurra et al., 2001; Whitney et al., 1998), only one (Loredo et al., 1999) disclosed which ICC model was used.

In the present study, we aimed to conduct an in-depth investigation into the interrater agreement in arousal scoring based on the AASM Scoring Manual v2.6 (Berry et al., 2020) between 10 scorers from seven sleep centres participating in the Sleep Revolution (Arnardottir et al., 2022). In addition, we aimed to investigate both number-wise and temporal agreements and to highlight grey areas in arousal scoring, that is, the areas with the most uncertainty between scorers. Based on the previous literature, we hypothesised the agreement to be better in N3 and R stages compared with N1 and N2 (Drinnan et al., 1998), and spontaneous arousals to have poorer agreement compared with arousals associated with a cue event, such as respiratory arousals (Loredo et al., 1999). Furthermore, we hypothesised that the agreement would be higher between scorers from the same sleep centres.

2 METHODS

2.1 Dataset

The dataset consisted of 50 type II PSGs recorded with Nox A1 devices (Nox Medical, Reykjavik, Iceland) between February and June 2021. The PSG setup was hooked up at Reykjavik University Sleep Institute and the subjects slept at home. The PSG recordings included AASM recommended derivations (Troester et al., 2023) of EEG, EOG, chin and leg EMG, and EKG, as well as signals of airflow, respiratory effort, oxygen saturation, and body position. Data collection was approved by the National Bioethics Committee of Iceland (21-070, 16/3/2021) and all subjects gave informed written consent. The study population (Table 1) consisted of individuals with an increased risk for restless legs syndrome (RLS), insomnia, or obstructive sleep apnea (OSA), as well as a group of healthy individuals. The risk assessment was conducted by screening the subjects using the STOP-Bang questionnaire (Chung et al., 2012), the Insomnia Severity Index (Morin et al., 2011), and the International Restless Legs Syndrome Study Group Questionnaire (IRLS) (Horiguchi et al., 2003), using the RLS criteria from a previous study (Benediktsdottir et al., 2010) in the latter. The aim was to obtain a similar ratio of subjects from each group.

TABLE 1. Demographic information on the 50 subjects.
Parameter | n
Male | 29
Female | 21
Obstructive sleep apnea risk | 29
Insomnia riskᵃ | 17
Restless legs syndrome risk | 11
Healthy subjects | 9
Parameter | Mean (SD)
Age (years) | 42.9 (13.7)
Body mass index (kg/m²) | 27.3 (5.8)
Epworth sleepiness scale | 9.2 (4.8)
Parameter | Mean (SD)ᵇ
Apnea–hypopnea index (events/h) | 14.9 (14.0)
Arousal index (events/h) | 17.8 (6.8)
Oxygen desaturation index (events/h) | 14.1 (14.8)
Total sleep time (h) | 7.1 (1.0)
N1 (% of total sleep time) | 9.0 (4.8)
N2 (% of total sleep time) | 47.5 (6.8)
N3 (% of total sleep time) | 22.2 (7.5)
R (% of total sleep time) | 21.3 (4.6)
  • Note: Healthy subjects did not have obstructive sleep apnea, restless legs syndrome, or insomnia risk.
  • Abbreviation: SD, standard deviation.
  • ᵃ Insomnia risk questionnaire unanswered by three subjects.
  • ᵇ Mean and SD of the scorers’ subject-wise means within the subject population.

The PSGs were scored by 10 trained sleep specialists from seven different sleep centres (Table 2) using Noxturnal Research software version 6.1.0.30257 (Nox Medical, Reykjavik, Iceland). Scorer pairs 1&2, 4&5, and 6&7 were from the same sleep centres. The scorers were blinded to the questionnaire outcomes and asked to analyse the recordings in accordance with the AASM Scoring Manual v2.6 (Berry et al., 2020), without centre-specific scoring habits. The scorers were given the same detailed instructions on how to use the scoring software, and to score in the following order: sleep stages and arousals, respiratory events and desaturations, LMs, forming PLMs from the LMs where appropriate, and lastly classifying the arousals to respiratory, LM, or PLM related, or spontaneous. The PLM formation and arousal classification were instructed to be conducted with automatic tools in the scoring software. Before using the arousal classification tool, the scorers could manually change the default time limits to link events to arousals (in the case of respiratory arousals, for which the AASM does not define a time limit, the default limit was <5 s). In addition to the automatic tools, PLMs and arousals could be labelled manually. Furthermore, the scorers were initially given an option to run the automatic PSG and Respiratory Analysis in the scoring software, which included arousal scoring, before manually scoring, double-checking, and correcting the events where needed. This option was given to replicate everyday clinical practice as closely as possible.

TABLE 2. Number of scorers from the participating sleep centres.
Sleep centre | Number of scorers
Charité – Universitätsmedizin Berlin, Berlin, Germany | 2
Reykjavík University, Reykjavík, Iceland | 2
University of Lisbon, Lisbon, Portugal | 2
Istituti Clinici Scientifici Maugeri Spa Società Benefit, Pavia, Italy | 1
Princess Alexandra Hospital, Brisbane, Australia | 1
Turku University Central Hospital, Turku, Finland | 1
University of Gothenburg, Gothenburg, Sweden | 1

Respiratory effort-related arousals (RERAs) were excluded from the analyses because, contrary to what the name implies, RERAs are respiratory events scored based on the respiratory signals, not cortical arousals scored based on the EEG (Troester et al., 2023). In discussions held after all 50 subjects had been scored, the scorers indicated that not all of them had scored RERAs, and that some may have placed the RERAs on the respiratory signals and some on the EEG. Therefore, in this study, RERAs are only reported as raw numbers per scorer, separately from the arousals (Table 4).

2.2 Data preparation

For each of the 50 PSGs analysed by 10 different scorers, a majority sleep stage score was formed based on the most scored sleep stage in each epoch. If there was a tie between two or more stages, the stage with the highest priority was chosen in the priority order of W, N1, N2, N3, and R. The majority sleep stage score is described in more detail by Nikkonen et al. (2024). To cut excess wake from the recordings, stage W epochs without arousals from any scorers were excluded from the beginning and end of the recordings. In other words, total recording time (TRT) was defined to span from the onset of the first 30 s epoch either scored as sleep or containing an arousal (by any scorer), to the termination of the last 30 s epoch either scored as sleep or containing an arousal (by any scorer).
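The majority sleep stage rule described above can be sketched as follows. This is a minimal illustration rather than the study's actual code: the function names and the list-based hypnogram format are assumptions, and only the tie-break order (W, N1, N2, N3, R) is taken from the text.

```python
from collections import Counter

# Tie-break priority as stated in the text: W first, then N1, N2, N3, R.
PRIORITY = ["W", "N1", "N2", "N3", "R"]


def majority_stage(epoch_scores):
    """Return the most frequently scored stage for one epoch.

    epoch_scores: list of stage labels, one per scorer.
    Assumes every label is one of PRIORITY. Ties are resolved in favour
    of the stage that comes first in PRIORITY.
    """
    counts = Counter(epoch_scores)
    top = max(counts.values())
    tied = [stage for stage in PRIORITY if counts.get(stage, 0) == top]
    return tied[0]


def majority_hypnogram(scorings):
    """Epoch-wise majority over several equal-length hypnograms (hypothetical format)."""
    return [majority_stage(list(epoch)) for epoch in zip(*scorings)]
```

For instance, an epoch scored N1 by five scorers and N2 by the other five would be assigned N1 under this rule, because N1 precedes N2 in the priority order.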

2.3 Similarity of arousal indexes

ArIs were calculated based on the scorer-specific sleep stages instead of the majority sleep stages. Bruxism arousals were left out of the ArI investigation, as they were only scored by 2 out of 10 scorers, thus providing limited insight into the interrater agreement of such arousals. The similarity of the ArIs was evaluated with ICC(A,1) (McGraw & Wong, 1996), a two-way mixed effects model describing the level of absolute agreement between the scorers.
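For readers unfamiliar with ICC(A,1), the sketch below computes it from the two-way ANOVA mean squares, following the single-measure absolute-agreement formula of McGraw and Wong (1996): (MSR − MSE) / (MSR + (k − 1)·MSE + (k/n)·(MSC − MSE)), where n is the number of subjects and k the number of scorers. The function name and the input layout (one row per subject, one column per scorer) are assumptions made for illustration, and the sketch does not compute confidence intervals.

```python
import numpy as np


def icc_a1(x):
    """ICC(A,1): two-way model, absolute agreement, single measurement.

    x: array of shape (n_subjects, k_scorers), e.g. one ArI per subject and scorer.
    """
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)  # per-subject means
    col_means = x.mean(axis=0)  # per-scorer means

    # Sums of squares of the two-way layout without replication.
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_total = np.sum((x - grand) ** 2)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-subject mean square
    msc = ss_cols / (k - 1)              # between-scorer mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))
```

Because the between-scorer mean square appears in the denominator, a scorer who systematically scores more (or fewer) arousals than the others lowers this coefficient even if the subjects are ranked identically.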

2.4 Second-by-second agreement

The second-by-second analysis was calculated using two separate approaches: binary classification (each second was classified either as arousal or no arousal), and arousal type-wise classification (each second was classified either as no arousal, respiratory, PLM, LM, spontaneous, bruxism, or unlabelled arousal). Moreover, a majority arousal score was formed based on the second-wise binary classification. A second in the majority score was set as arousal if five or more scorers had scored an arousal during that second. Otherwise, the second was set as no arousal.
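The binary second-wise representation and the majority arousal score can be illustrated with the sketch below. The five-scorer threshold comes from the text. The input format, the function names, and the rule for seconds that an arousal covers only partly are assumptions made here, since the text does not specify them.

```python
import math
import numpy as np


def arousal_mask(arousals, n_seconds):
    """Binary per-second vector for one scorer.

    arousals: list of (onset_s, end_s) tuples in seconds from recording start.
    Assumption: a second counts as arousal if the event covers any part of it.
    """
    mask = np.zeros(n_seconds, dtype=int)
    for onset, end in arousals:
        start = max(0, math.floor(onset))
        stop = min(n_seconds, math.ceil(end))
        mask[start:stop] = 1
    return mask


def majority_mask(masks, threshold=5):
    """A second is arousal when at least `threshold` scorers marked it."""
    votes = np.sum(masks, axis=0)
    return (votes >= threshold).astype(int)
```

With ten scorers and the default threshold, a second marked by exactly five scorers is set as arousal, and a second marked by four is not.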

Second-by-second agreement between the scorers was evaluated using Fleiss’ kappa (1971) for multi-rater agreement and Cohen's kappa (1960) for pairwise comparisons. Fleiss’ kappa values were computed using both binary and type-wise classifications, and also separately for different majority sleep stages using the binary classification. Pairwise comparisons were conducted between each scorer pair, as well as between each scorer and the majority score, using the binary classification.
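As a hedged illustration of the two statistics, the sketch below implements the standard textbook definitions of Cohen's and Fleiss' kappa with NumPy. It is not the authors' code. The function names and the encoding of categories as integers 0, 1, ... are assumptions, and both functions are undefined (division by zero) when the expected agreement equals 1, for example when every second has the same label.

```python
import numpy as np


def cohen_kappa(a, b):
    """Cohen's kappa for two label vectors of equal length."""
    a, b = np.asarray(a), np.asarray(b)
    labels = np.union1d(a, b)
    p_o = np.mean(a == b)  # observed agreement
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in labels)
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa_counts(counts):
    """Fleiss' kappa from an (items x categories) table of rating counts.

    Every row must sum to the same number of raters.
    """
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    p_item = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_item.mean(), np.sum(p_cat ** 2)
    return (p_bar - p_e) / (1 - p_e)


def fleiss_kappa(labels, n_categories=2):
    """labels: (n_seconds x n_scorers) array of integer category codes."""
    labels = np.asarray(labels)
    counts = np.stack([(labels == c).sum(axis=1) for c in range(n_categories)], axis=1)
    return fleiss_kappa_counts(counts)
```

In the binary case each row of `labels` would hold the ten scorers' 0/1 values for one second. For the type-wise analysis, `n_categories` would be raised to cover the arousal types plus "no arousal".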

2.5 Arousal numbers and durations

Arousals were investigated in raw numbers, by calculating the scorer-wise number of arousals in total, as well as in groups by arousal type (respiratory, PLM, LM, spontaneous, bruxism, or unlabelled arousal), duration (<3 s, 3–5 s, 5–10 s, 10–15 s, 15–30 s, or >30 s), and sleep stage where the arousal onset was scored (stages W, N1, N2, N3, or R; investigated separately in scorer-specific and majority sleep scores). Moreover, arousal densities in the sleep stages were computed. In the case of the scorer-specific sleep scores, the sum of arousals from all scorers in each scorer-specific sleep stage was divided by the sum of the given sleep stage epochs from all scorers. In the case of the majority sleep score, the sum of arousals from all scorers in each majority sleep stage was divided by 10 times the sum of the given majority sleep stage epochs. Furthermore, arousal durations between the scorers were investigated in a cumulative manner in all recordings.
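The two arousal density definitions differ only in their denominators, which the following sketch makes explicit. The function and argument names are hypothetical, and the inputs are assumed to be per-scorer counts for a single sleep stage.

```python
def density_scorer_specific(arousals_per_scorer, epochs_per_scorer):
    """Arousals per epoch using each scorer's own staging.

    Sum of arousals over scorers divided by the sum of that stage's
    epochs over scorers.
    """
    return sum(arousals_per_scorer) / sum(epochs_per_scorer)


def density_majority(arousals_per_scorer, majority_epochs):
    """Arousals per epoch using the majority staging.

    Sum of arousals over scorers divided by (number of scorers x number
    of majority epochs of that stage); with 10 scorers this is the
    "10 times" denominator described in the text.
    """
    n_scorers = len(arousals_per_scorer)
    return sum(arousals_per_scorer) / (n_scorers * majority_epochs)
```

Both values can be read as the average number of arousals one scorer places in one epoch of the given stage.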

2.6 Arousal clusters

The ground truth of the arousals was estimated by grouping coincidingly scored arousals by multiple scorers into clusters. An arousal cluster was defined to start at the onset of the first arousal by any scorer after a period with no arousals, and to end at the termination of the last arousal by any scorer before the next period with no arousals. Consequently, every arousal in a cluster overlapped with at least one other arousal in the same cluster (Figure 1), unless the cluster consisted of only one arousal, that is, if there was only one scorer involved in the cluster. The within-cluster agreement of the scorers was then evaluated by investigating the number of scorers in each cluster, as well as computing the pairwise accuracies of those scorers (the duration where the scorers agreed inside the cluster as a proportion of the duration of the whole cluster). The start- and end-time differences of the arousals were also calculated between each scorer pair in the clusters. If a scorer had scored more than one arousal in the cluster, the start time was chosen as the onset of the first arousal, and the end time was chosen as the termination of the last arousal from that scorer.
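The cluster definition and the pairwise accuracy above can be sketched as follows. This is an illustrative reading of the description, not the study's implementation. The tuple format and function names are hypothetical. Arousals that merely touch (one ends exactly where the next begins) are assumed not to merge, and "agreement" is assumed to include time inside the cluster where neither scorer of the pair marked an arousal.

```python
def build_clusters(arousals):
    """Group overlapping arousals from all scorers into clusters.

    arousals: list of (scorer_id, onset_s, end_s) tuples.
    Returns a list of dicts with the cluster start, end, and member arousals.
    """
    clusters = []
    for scorer, onset, end in sorted(arousals, key=lambda a: a[1]):
        if clusters and onset < clusters[-1]["end"]:
            # Overlaps the running cluster: extend it.
            cluster = clusters[-1]
            cluster["end"] = max(cluster["end"], end)
            cluster["members"].append((scorer, onset, end))
        else:
            # A gap with no arousals from any scorer starts a new cluster.
            clusters.append({"start": onset, "end": end,
                             "members": [(scorer, onset, end)]})
    return clusters


def n_scorers(cluster):
    """Number of distinct scorers contributing to a cluster."""
    return len({scorer for scorer, _, _ in cluster["members"]})


def pairwise_accuracy(cluster, scorer_a, scorer_b):
    """Share of the cluster duration during which two scorers agree."""
    def covers(scorer, t):
        return any(s == scorer and on <= t < off
                   for s, on, off in cluster["members"])

    # Split the cluster at every onset/end and test each segment's midpoint.
    edges = sorted({cluster["start"], cluster["end"],
                    *[t for _, on, off in cluster["members"] for t in (on, off)]})
    agreed = 0.0
    for left, right in zip(edges, edges[1:]):
        mid = (left + right) / 2
        if covers(scorer_a, mid) == covers(scorer_b, mid):
            agreed += right - left
    return agreed / (cluster["end"] - cluster["start"])
```

For two arousals spanning 10–15 s and 12–18 s, the cluster runs from 10 to 18 s and the two scorers agree for 3 of its 8 s, giving a pairwise accuracy of 0.375.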

FIGURE 1. An example of two consecutive arousal clusters (red rectangles) formed based on the arousals scored by ten scorers (blue shaded areas). The first cluster contains arousals scored by four scorers, and the second cluster contains arousals scored by eight scorers. In this example Scorer 9 has scored two separate arousals in the second cluster.

3 RESULTS

3.1 Arousal index

The ICC of the ArIs was 0.41 between all scorers (Figure 2, Table 3). The pairwise ICC varied between 0.04 and 0.88, with a mean of 0.46 (Figure 3). The pairwise ICC was higher than average in two out of three scorer pairs from the same sleep centres. When excluding arousals scored in stage W and only including arousals scored in sleep, the overall ArI ICC increased to 0.55. Furthermore, there was almost no correlation in arousals scored only in stage W (ArI ICC = 0.09). In the type-wise ArIs, the ICCs were the highest in respiratory and PLM arousals (0.65 and 0.54, respectively) and the lowest in spontaneous and LM arousals (0.23 and 0.14, respectively).

FIGURE 2. Arousal indexes (ArIs) of 50 subjects scored by 10 scorers as boxplots. The intraclass correlation coefficient of the ArIs was 0.41.
TABLE 3. Arousal index mean and standard deviation, and their intraclass correlation coefficients with 95% confidence interval in different categories.
Category | ArI mean (SD) | ICC (95% CI)
All arousals | 17.7 (10.2) | 0.41 (0.30–0.55)
Arousals during sleepᵃ | 14.3 (7.9) | 0.55 (0.41–0.68)
Arousals during wakeᵇ | 10.3 (14.2) | 0.09 (0.04–0.17)
Respiratory arousals | 6.8 (8.0) | 0.65 (0.51–0.76)
LM arousals | 4.1 (3.3) | 0.14 (0.07–0.23)
PLM arousals | 1.1 (1.8) | 0.54 (0.43–0.65)
Spontaneous arousals | 5.5 (4.2) | 0.23 (0.14–0.34)
  • Note: ArIs were calculated using scorer-specific total sleep times in all categories except for arousals during wake, where scorer-specific times in stage W were used. ArI mean and SD were computed from all 10 ArIs in all 50 subjects (i.e., from 500 different ArIs).
  • Abbreviations: ArI, arousal index; CI, confidence interval; ICC, intraclass correlation coefficient; LM, limb movement; PLM, periodic limb movement; SD, standard deviation.
  • ᵃ During scorer-specific non-rapid eye movement (stages N1, N2, and N3), and rapid eye movement (stage R).
  • ᵇ During scorer-specific wake (stage W).
FIGURE 3. Pairwise second-by-second Cohen's kappa values (lower left triangle in blue), and pairwise intraclass correlation coefficients of arousal indexes (upper right triangle in red). Values between the scorers from the same sleep centres are emphasised with rectangles.

3.2 Second-by-second agreement

The second-by-second Fleiss’ kappa of all arousals was 0.40 when the binary classification was used. When the seconds were further classified according to the arousal type, the Fleiss’ kappa decreased to 0.34. The binary second-by-second Fleiss’ kappa values for arousals in stages W, N1, N2, N3, and R were 0.25, 0.45, 0.45, 0.53, and 0.48, respectively.

In the pairwise comparison between the scorers, using the binary classification, the Cohen's kappa varied between 0.07 and 0.68, with a mean of 0.38 (Figure 3). The pairwise Cohen's kappa values were higher than average in all scorer pairs from the same sleep centres.

3.3 Arousal numbers

A total of 61,954 arousals were scored in the 50 recordings by the 10 scorers. Arousal durations varied mostly between 3 and 10 s (Figure 4), and the average arousal duration was 8.4 s (SD 5.4 s, min 0.3 s, max 47.5 s). Respiratory, spontaneous, and LM arousals were scored the most (n = 23,243, n = 19,563, and n = 14,395, respectively), followed by PLM and bruxism arousals (n = 3785 and n = 79, respectively) (Figure 5).

FIGURE 4. Number of scored arousals in different duration groups according to each scorer. The average arousal duration was 8.36 s (standard deviation 5.4 s, minimum 0.3 s, maximum 47.5 s).
FIGURE 5. Number of scored arousal types according to each scorer. In the scoring, arousals were labelled as respiratory (Resp), periodic limb-movement (PLM), limb-movement (LM), spontaneous (Spont), or bruxism (Brux) -related.

Within the scorer-specific sleep stages, arousals were mostly scored in stage N2 (n = 22,685), followed by stages N1 and W (n = 13,220 and n = 11,770, respectively), and the least in stages R and N3 (n = 10,438 and n = 3841, respectively) (Figure 6a). As in the scorer-specific stages, in the majority stages arousals were mostly scored in stage N2 (n = 24,359), and the least in N3 (n = 3400). However, in contrast to the scorer-specific stages, where stage N1 had the second most arousals, majority stage N1 had the second fewest (n = 9803). Instead, the second and third most arousals were scored in majority stages W and R (n = 13,705 and n = 10,687, respectively) (Figure 6b). However, the arousal densities (i.e., the average number of arousals in one epoch) were similar in both sleep staging approaches, being the highest in stage N1 and decreasing towards stages N2, R, W, and N3.

FIGURE 6. (a) Number of arousals scored in scorer-specific stages wake (W), non-rapid eye movement (N1, N2, and N3), and rapid eye movement (R). Corresponding numbers in the majority sleep stages are presented in (b). In addition, below the arousal numbers are the arousal densities in the sleep stages, i.e., the average number of scored arousals by one scorer in (a) the scorer-specific stages, and (b) the majority stages.

3.4 Arousal clusters

A total of 19,192 groups of coinciding arousals, that is, arousal clusters, were formed from the scored arousals. The average cluster duration was 9.5 s (SD 7.0 s), and nearly all clusters (n = 19,087, 99.5% of the clusters) comprised a maximum of one arousal per scorer. The rest of the clusters (n = 105, 0.5%) comprised a maximum of two arousals per scorer. The average number of scorers scoring arousals in a cluster was 3.2 (SD 2.8). The majority agreement, that is, ≥5 scorers in a cluster, was reached in less than a third (n = 5281, 27.5%) of the clusters. Most often (n = 11,319, 58.9%), the clusters only comprised arousals from one or two scorers (Figure 7a).

FIGURE 7. (a) Number of arousal clusters grouped based on the number of scorers in the clusters. Pairwise accuracies of the scorers in these clusters (i.e., the duration where the scorers agreed inside the cluster as a proportion of the duration of the whole cluster) are shown in (b) as boxplots. Outliers in the boxplots are shown as blue dots but due to their high number and proximity they may not be independently resolvable.

The median accuracy of the scorer pairs in the clusters was 74.8%, varying between 69.4% and 91.9% depending on the number of scorer pairs in the clusters (Figure 7b). Moreover, the average arousal start time difference between the scorers in all clusters was 1.7 s (SD 2.5 s), and the average end time difference was 3.1 s (SD 3.8 s).

3.5 Cumulative arousal times

The TRT of all 50 recordings was 29534.5 min. Within the TRT, the total arousal time (TAT), that is, the timespan when arousals were scored by at least one scorer, was 3025.0 min (10.2% of TRT). However, the scorer-specific arousal timespans varied only between 315.5 and 1239.7 min (1.1%–4.2% of TRT) (Table 4). Most of the TAT (1816.4 min, 60.1% of TAT) was scored with one or two scorers agreeing. The majority agreement, that is, ≥5 scorers scoring arousals at the same time points, was reached for 642.7 min (21.3% of TAT).

TABLE 4. Scorer-specific numbers of scored arousals, cumulative arousal times, and arousal indexes of each scorer, as well as second-by-second Cohen's kappa values between each scorer and the majority score. The second-by-second majority arousal score used in the Cohen's kappa computation is presented on the last row. Separately from the arousals, the scorer-specific numbers of respiratory effort-related arousals are presented.
Scorer | Number of arousals | AT [min (% of TRT)] | ArI mean (SD) | κ | Number of RERAs
1 | 6379 | 1086.6 (3.7%) | 18.4 (8.4) | 0.67 | 180
2 | 7223 | 1205.0 (4.1%) | 20.9 (8.6) | 0.60 | 570
3 | 2737 | 378.3 (1.3%) | 7.6 (3.9) | 0.56 | 0
4 | 6986 | 875.4 (3.0%) | 19.5 (8.0) | 0.52 | 0
5 | 8679 | 1228.1 (4.2%) | 25.1 (12.7) | 0.51 | 0
6 | 6029 | 788.7 (2.7%) | 17.4 (13.2) | 0.57 | 4
7 | 4632 | 600.4 (2.0%) | 13.2 (6.4) | 0.71 | 0
8 | 7037 | 907.0 (3.1%) | 18.8 (7.6) | 0.67 | 0
9 | 6098 | 315.5 (1.1%) | 17.5 (11.0) | 0.16 | 1688
10 | 6154 | 1239.7 (4.2%) | 18.5 (9.5) | 0.60 | 53
Majority | 5213 | 717.6 (2.4%) | 14.4 (6.6) | - | 0
  • Note: ArIs are presented as mean and SD within all 50 subjects. κ values were calculated second-wise, classifying each second as arousal or no arousal. The majority arousal score was computed so that a second was classified as arousal if ≥5 scorers had scored an arousal during that second. The majority ArI was computed based on the second-by-second majority arousal score and the total sleep time from the majority sleep score.
  • Abbreviations: ArI, arousal index; AT, arousal time; κ, Cohen's kappa; RERA, respiratory effort-related arousal; SD, standard deviation; TRT, total recording time.

4 DISCUSSION

In this study, we investigated the interrater agreement in arousal scoring between 10 scorers from seven centres in 50 PSGs. We found both the correlation of the ArIs and the second-by-second agreement between all scorers to be fair (ArI ICC = 0.41 and second-by-second Fleiss’ kappa = 0.40). As hypothesised, the agreement of the scored arousals increased from light to deep sleep. Moreover, respiratory arousals had better agreement compared to spontaneous arousals, that is, arousals without any related cue events. Furthermore, all three scorer pairs from the same sleep centres had higher than average Cohen's kappa values, but only two out of these three scorer pairs had higher than average ArI ICCs.

4.1 Arousal durations

The durations of the arousals were most often 5–10 s in 9 out of 10 scorers, whereas one scorer (Scorer 9) mostly scored 3–5 s arousals (Figure 4). These arousals were often precisely 3 s long, that is, the minimum duration to score an arousal according to the scoring rules (Berry et al., 2020; Troester et al., 2023). As arousals are required to score hypopneas without a ≥3% blood oxygen desaturation and to calculate the ArI, both of which only regard the presence or number of arousals, some scorers may consider scoring the duration of the arousals irrelevant in the current clinical practice. However, information on arousal durations would be beneficial for estimating the magnitude of sleep fragmentation more accurately and for developing alternative metrics to describe arousals and sleep fragmentation, for example, arousal intensity (Amatoury et al., 2016; Azarbarzin et al., 2014), arousal burden (Shahrbabaki et al., 2021), and arousal-AUC (area under the curve) (Malatantis-Ewert et al., 2022). In addition, information on the arousal duration is crucial for linking arousals to other events, especially in a PLM series, where the arousal should be associated with a limb movement if they overlap or have <0.5 s in between, regardless of which one occurs first (Troester et al., 2023).

4.2 Arousal types

The two most frequently scored arousal types were respiratory and spontaneous arousals. Consistent with previous literature (Loredo et al., 1999), the agreement between respiratory ArIs (ICC = 0.65) was the highest of all arousal types, whereas spontaneous ArIs had only a slight agreement (ICC = 0.23). As a respiratory arousal is required to score some hypopneas (Berry et al., 2020; Troester et al., 2023), the attention towards identifying arousals in the EEG may increase around these hypopneas. In contrast, the focus on spontaneous arousals may stay lower as they are not associated with any other events. The next most frequently scored arousal types were LM and PLM arousals. Although both arousal types are related to limb movements, the similarity of the ArIs was considerably higher in PLM arousals (ICC = 0.54) than in LM arousals (ICC = 0.14). This difference in agreement may have two origins. Firstly, LMs are not as clinically significant as PLMs, which may result in scorers spending less time and effort on scoring individual LMs outside a PLM sequence. Secondly, the PLM scoring rules in the AASM manual include a description of a PLM-related arousal, but the manual does not define an LM-related arousal outside a PLM sequence (Troester et al., 2023). These two aspects may decrease the agreement in LM-related arousal scoring. Furthermore, 2 out of 10 scorers had scored bruxism arousals, although one of these two scorers had scored only one bruxism arousal. These arousals were manually classified as bruxism-related, as scoring them was not mentioned in the instructions given to the scorers, nor did the arousal classification tool label bruxism arousals.

Since the AASM Scoring Manual v2.6 (Berry et al., 2020), used in this study, v3 has been published (Troester et al., 2023). In v3, a note was added to the arousal rules suggesting that “classifying arousals as related to respiratory or leg movement events, or occurring spontaneously, may be informative”. However, in v2.6, arousal classification was only mentioned regarding respiratory and PLM events in their respective sections, without any mention of time limits related to respiratory arousals or of associating arousals with LM events outside a PLM series. This lack of clear guidelines for arousal classification, which persists in v3, most likely affects the interscorer reliability.

4.3 Effect of sleep stage

In both the majority and scorer-specific sleep stages, arousals were most densely scored in stage N1, with decreasing scoring density towards stages N2 and N3. Arousals are known to be more likely to occur in stage N1 than in stages N2 and N3, as the arousal threshold, related to, for example, airway occlusion, is known to increase towards deeper sleep (i.e., towards stage N3) (Berry et al., 1998; Berry & Gleeson, 1997).

The temporal agreement between the scorers was the lowest in majority stages N1 and N2 and increased towards stage N3 (Fleiss’ kappas 0.45, 0.45, and 0.53, respectively). This is consistent with the literature, where the interscorer agreement in EEG segment classification into arousals or stable sleep has been lower in R&K (Rechtschaffen & Kales, 1968) stages 1+2 than 3+4 (Drinnan et al., 1998). As EEG high-frequency content related to arousals is known to increase more when the arousals occur in deep sleep compared to light sleep (Pitkänen et al., 2022), the EEG frequency shift indicating an arousal may be more easily distinguishable in deeper sleep.
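As an illustration of how a second-by-second Fleiss' kappa of this kind can be obtained, the following Python sketch converts scored arousals into one binary label per second and applies the standard Fleiss' kappa formula. The helper names, the whole-second `(start, end)` representation, and the toy data are assumptions made for this example; they do not reproduce the analysis pipeline of this study.

```python
def to_seconds(arousals, n_seconds):
    """Turn (start, end) arousals, in whole seconds, into a 0/1 label per second."""
    labels = [0] * n_seconds
    for start, end in arousals:
        for t in range(start, min(end, n_seconds)):
            labels[t] = 1
    return labels


def fleiss_kappa(per_scorer_labels):
    """Fleiss' kappa for binary labels; one equal-length label list per scorer.

    Undefined (division by zero) if every scorer gives the same label everywhere.
    """
    n_raters = len(per_scorer_labels)
    items = list(zip(*per_scorer_labels))  # one tuple of labels per second
    n_items = len(items)
    # Number of scorers marking "no arousal" (0) and "arousal" (1) in each second.
    counts = [(item.count(0), item.count(1)) for item in items]
    # Observed agreement: mean share of agreeing scorer pairs per second.
    p_bar = sum(
        (c0 * c0 + c1 * c1 - n_raters) / (n_raters * (n_raters - 1))
        for c0, c1 in counts
    ) / n_items
    # Chance agreement from the overall label proportions.
    share_1 = sum(c1 for _, c1 in counts) / (n_items * n_raters)
    p_e = share_1 ** 2 + (1 - share_1) ** 2
    return (p_bar - p_e) / (1 - p_e)


# Toy example: three hypothetical scorers marking one arousal in a 60 s segment.
scorings = [
    to_seconds([(10, 16)], 60),
    to_seconds([(11, 17)], 60),
    to_seconds([(10, 13)], 60),
]
kappa = fleiss_kappa(scorings)
```

To obtain stage-specific values, the same function could be applied only to the seconds belonging to a given sleep stage.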

The largest discrepancies existed in stage W, where 3 out of 10 scorers scored systematically more arousals than the other scorers. Similar discrepancies were observed in both the majority and scorer-specific sleep stages, both temporally (Fleiss’ kappa = 0.25) and number-wise (ArI ICC = 0.09). These differences may have originated from the fact that the scoring window (i.e., “lights out” and “lights on” times) was not specified in the recordings; instead, the scorers were asked to adjust the scoring window themselves. The arousals scored during stage W were, in fact, often located at the beginning and/or end of the recordings, where not all scorers had started scoring yet or some had already ended their scoring process. Moreover, these arousals in stage W epochs at the beginning and end of the recordings may originate from the initial automatic scoring, as some scorers may simply have failed to adjust the scoring window to omit any automatically scored events before and after the chosen window. However, even omitting the segments where not all scorers were scoring did not meaningfully increase the agreement, as the Fleiss’ kappa in stage W only increased from 0.25 to 0.26 and the ArI ICC in stage W decreased from 0.09 to 0.06.

Here, again, we point out that the scoring guideline related to arousals preceding a stage W transition has been clarified since the analyses of this study. This guideline was previously a note in the AASM Scoring Manual v2.6 (Berry et al., 2020) and is now a rule in the AASM Scoring Manual v3 (Troester et al., 2023). Moreover, the wording of this guideline changed from “arousal may still be scored” to a stricter “score arousal” preceding a stage W transition, in which case both the arousal and the stage W transition will be scored. In addition to the arousal rules related to scoring stage W, the AASM Manual also states that N1 should be scored if an arousal interrupts stage N2 or R and certain other EEG characteristics apply (Troester et al., 2023). Therefore, the scoring of stages N1, N2, and R is directly influenced by arousal scoring, emphasising the importance of arousal scoring agreement for correct sleep staging.

4.4 Pairwise variation in agreement

In the pairwise agreement analyses between the scorers, Cohen's kappa varied from no agreement to substantial agreement (0.07–0.68), and the correlation of the ArIs varied from poor to substantial (ICC = 0.04–0.88) (Figure 3). Hence, both temporal (Cohen's kappa) and number-wise (ArI ICC) scoring agreements depended strongly on the scorer pair. The second-by-second Cohen's kappa values increased slightly when each scorer was compared with the majority score instead of with the other scorers. This was expected, as the majority score is formed from the individual scorings and therefore correlates with them. Moreover, Cohen's kappa values were higher than average between scorer pairs from the same sleep centres, which is logical, as these scorer pairs may possess similar tacit knowledge on aspects of the scoring rules that leave room for interpretation.
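The pairwise and majority-based comparisons discussed above can be sketched in Python as follows. The function names are hypothetical, and the strict-majority rule used here (a second counts as an arousal only if more than half of the scorers marked it) is an assumption for illustration; it is not necessarily the exact definition of the majority score used in this study.

```python
def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(a)
    # Observed agreement: share of seconds with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each scorer's own label proportions.
    labels = set(a) | set(b)
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if p_e == 1:
        return float("nan")  # undefined when both scorers use one identical label
    return (p_o - p_e) / (1 - p_e)


def majority_score(scorings):
    """Per-second majority label (assumed rule: more than half of the scorers)."""
    n_scorers = len(scorings)
    return [1 if 2 * sum(column) > n_scorers else 0 for column in zip(*scorings)]


# Toy example with three hypothetical scorers over eight seconds.
scorings = [
    [0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
]
majority = majority_score(scorings)
kappas_vs_majority = [cohen_kappa(s, majority) for s in scorings]
```

Since each scorer contributes to the majority label, agreement with the majority is partly agreement with oneself, which is one reason such values tend to exceed the pairwise ones.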

Some scorers stood out from the rest in the pairwise analyses. Firstly, Scorer 9 had consistently low pairwise Cohen's kappa values with the other scorers (0.07–0.13, Figure 3). This likely originates from Scorer 9 mostly scoring 3 s arousals, in contrast to the longer arousals scored by the other scorers (Figure 4), as duration differences inevitably decrease the second-by-second agreement. Scorer 9 had also scored the most RERAs (Table 4), which were excluded from the analyses. In contrast to the mainly 3 s arousals scored by Scorer 9 (Figure 4), the RERAs scored by Scorer 9 mostly ranged between 10 and 30 s. As Scorer 9 turned out to be an outlier in terms of Cohen's kappa (Figure 3) when excluding RERAs, we computed additional Fleiss’ and Cohen's kappas including RERAs. The Cohen's kappa values of Scorer 9 still remained the lowest, ranging between 0.10 and 0.20, and the Fleiss’ kappa remained at 0.40. However, when computing the Fleiss’ kappa excluding Scorer 9 (and RERAs), the value increased to 0.45. Secondly, Scorer 3 had consistently low pairwise ArI ICCs with the other scorers (0.04–0.26, Figure 3). These differences are likely due to the relatively low number of spontaneous and, especially, respiratory arousals scored by Scorer 3, as they still scored LM and PLM arousals in similar numbers to the other scorers (Figure 5).

Although all scorers were instructed to analyse the recordings without implementing any in-house rules, different habitual scoring practices and tacit knowledge may persist, for example, when scoring these different event-related arousals. The present study illustrates the need for more standardised practices that leave less room for in-house customs, as well as the importance of implementing training and method alignment programmes within and between sleep centres.

4.5 Arousal clusters

A total of 19,192 arousal clusters existed in the recordings. However, most of these clusters were scored as an arousal by only one or two scorers at a time (Figure 7a), and in fact, no individual scorer scored more than 8679 arousals in the recordings (Table 4). Hence, despite using the same scoring rules, individual biases exist regarding what looks like an arousal in the EEG and where arousals tend to occur during the night. Based on the present study, uncertainties can exist, for example, around sleep–wake transitions and in relation to different events. Out of the 10,661 clusters where multiple scorers were scoring at the same time, the vast majority (10,556 clusters) comprised a maximum of one arousal per scorer, and only 105 clusters comprised a maximum of two arousals per scorer. In these 105 clusters, the typical composition was two short arousals from some scorers, whereas others had scored one long arousal spanning both of the short ones. Moreover, the accuracy of the scored arousals was good within all clusters with multiple scorers, that is, on average, the arousals overlapped each other for most of their duration (Figure 7b). Therefore, the present study demonstrates that despite individual biases, some arousal-like EEG characteristics do exist that are generally agreed on.
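The cluster compositions described above can be illustrated with a short Python sketch. It assumes that a cluster is a group of arousals, from any scorers, that are connected through temporal overlap; this definition, the function names, and the toy data are assumptions for illustration and may differ from the clustering used in this study.

```python
from collections import Counter


def cluster_arousals(arousals):
    """Group (scorer, start, end) arousals into clusters of overlapping events."""
    clusters = []
    current, current_end = [], None
    for arousal in sorted(arousals, key=lambda a: a[1]):
        _, start, end = arousal
        if current and start < current_end:
            # Overlaps the running cluster: extend it.
            current.append(arousal)
            current_end = max(current_end, end)
        else:
            # No overlap: close the previous cluster and start a new one.
            if current:
                clusters.append(current)
            current, current_end = [arousal], end
    if current:
        clusters.append(current)
    return clusters


def summarise(cluster):
    """Return (number of scorers in the cluster, maximum arousals per scorer)."""
    per_scorer = Counter(scorer for scorer, _, _ in cluster)
    return len(per_scorer), max(per_scorer.values())


# Toy data: scorer A marks two short arousals, scorer B one long arousal
# spanning both, and scorer C an isolated arousal later in the night.
arousals = [("A", 10, 13), ("A", 14, 18), ("B", 9, 19), ("C", 40, 45)]
summaries = [summarise(c) for c in cluster_arousals(arousals)]
```

In this toy example, the first cluster contains two scorers with a maximum of two arousals per scorer, which mirrors the typical composition of the 105 clusters mentioned above.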

4.6 Limitations

This study has certain limitations. Firstly, not all scorers were equally familiar with the scoring software used in the study. Moreover, since some of the scorers used the opportunity to utilise the automatic tool in the software prior to manual scoring, some arousals from these scorers should be considered semi-manually scored. This type of semi-manual scoring corresponds to typical clinical practice, which we wanted to stay true to in our study. However, the differences between clinical and research settings should be acknowledged, as arousal scoring can still be more diligent in the latter. Secondly, not all scorers handled segments with poor signal quality or improper pulse oximeter connection similarly. Some scorers marked these segments as invalid periods, while most scorers did not. Therefore, for the former group, sleep stage and event annotations were omitted during these segments. In the present analyses, these segments were simply considered to have been scored as stage W with no arousals. However, these segments only covered a minute part of the recordings, so their effect on the total agreement can be considered negligible.

Thirdly, there was no fixed time window for the scoring process in the type II PSGs, so the scorers started and ended their analyses at different time points. The resulting unscored segments at the beginnings and ends of the recordings were similarly considered to have been scored as stage W with no arousals. This lack of a fixed monitoring window is a limitation in terms of generalisability to type I PSGs, where the monitoring time is more precisely determined. However, retroactively standardising the monitoring window for all scorers, that is, removing all segments where not all scorers were scoring, only resulted in a Fleiss’ kappa of 0.43 and an ArI ICC of 0.53, revealing that these unscored segments had only minor effects on the results.

Lastly, the ArIs between scorers can vary not only due to uncertainties in the arousal scoring itself, but also due to uncertainties in the sleep stage scoring (Nikkonen et al., 2024), as the latter affects the TST required in ArI computation. Approaches that do not rely on manually scored discrete sleep stages, for example, odds ratio product (Younes et al., 2015) and Cyclic Alternating Pattern (CAP) (Mendonça et al., 2023; Parrino et al., 2012), can provide novel insight into the assessment of sleep depth and level of arousal.
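The dependence of the ArI on the scored TST can be made concrete with a small worked example in Python. The function name and the numbers are hypothetical and chosen only to illustrate the arithmetic; the ArI is taken here as the number of arousals per hour of sleep.

```python
def arousal_index(n_arousals, tst_hours):
    """Arousal index: number of arousals per hour of total sleep time (TST)."""
    if tst_hours <= 0:
        raise ValueError("TST must be positive")
    return n_arousals / tst_hours


# Hypothetical case: two scorers mark the same 120 arousals but stage
# different amounts of sleep in the same recording.
ari_scorer_a = arousal_index(120, 6.0)  # 20.0 arousals/h
ari_scorer_b = arousal_index(120, 7.5)  # 16.0 arousals/h
```

Even with identical arousal annotations, the two ArIs differ by 4 arousals/h in this example solely because of the difference in the scored TST.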

5 CONCLUSION

Overall, the agreement between the scorers was relatively low but varied depending on the arousal type, sleep stage, and the scorer pair. The most uncertain areas were related to spontaneous arousals and arousals scored in stage W. The disagreements between scorers can be due to, for example, scorer-wise biases and sleep centre-specific scoring practices, as well as uncertainties in the rules, which are left open to interpretation by the scorers themselves.

We argue that standardising the arousal scoring rules, carefully utilising automatic scoring tools developed with high-quality datasets alongside manual scoring, and adopting inter-scorer reliability programmes within and between sleep centres are paramount for improving the diagnostics of sleep-related disorders. Given the moderate agreement observed here, current arousal scoring practice may not be sufficiently reliable for clinical or research purposes. Specifying the rules for monitoring time, arousal duration, and arousal classification, moving from the traditional epoch-based scoring towards a more continuous scoring, and implementing other signals, such as the electrocardiogram and photoplethysmogram, can help to assess arousals and sleep fragmentation in a novel manner.

AUTHOR CONTRIBUTIONS

Henna Pitkänen: Conceptualization; investigation; methodology; funding acquisition; writing – original draft; writing – review and editing; visualization; formal analysis. Sami Nikkonen: Data curation; supervision; writing – review and editing; methodology. Marika Rissanen: Writing – review and editing. Anna Sigridur Islind: Writing – review and editing. Heidur Gretarsdottir: Writing – review and editing; data curation. Erna Sif Arnardottir: Writing – review and editing; project administration; funding acquisition. Timo Leppänen: Funding acquisition; writing – review and editing; project administration; supervision. Henri Korkalainen: Supervision; funding acquisition; writing – review and editing; methodology.

ACKNOWLEDGEMENTS

We would like to thank all scorers who participated in this study, as well as all participating Sleep Revolution centres.

FUNDING INFORMATION

European Union's Horizon 2020 research and innovation programme (grant agreement 965417), NordForsk (NordSleep project 90458) via Business Finland (5133/31/2018), the Icelandic Research Fund, the Research Committee of the Kuopio University Hospital Catchment Area for the State Research Funding (5041794, 5041803, 5041804, 5041809, and 5041812), the Competitive State Research Funding of Expert Responsibility Area of Tampere University Hospital (VTR7319), Seinäjoki Central Hospital (7746), the Vilho, Yrjö and Kalle Väisälä Foundation of the Finnish Academy of Science and Letters, Kuopio Area Respiratory Foundation, and Tampere Tuberculosis Foundation.

DATA AVAILABILITY STATEMENT

Research data are not shared.
