Indirect immunofluorescence in autoimmune diseases: Assessment of digital images for diagnostic purpose
Abstract
Background: The recommended method for antinuclear antibodies (ANA) detection is indirect immunofluorescence (IIF). To pursue high image quality without artefacts and to reduce interobserver variability, this study evaluates the reliability of automatically acquired digital images of IIF slides for diagnostic purposes. Methods: Ninety-six sera were screened for ANA by IIF on HEp-2 cells. Two expert physicians performed a blind study, evaluating fluorescence intensity and staining pattern both at the fluorescence microscope and on digital images shown on a computer monitor. Cohen's kappa was used to measure agreement between methods and between experts. Results: Considering fluorescence intensity, there was substantial agreement between microscope and monitor analysis for both physicians. Agreement between physicians was substantial at the microscope and almost perfect at the monitor. Considering IIF pattern, agreement between microscope and monitor analysis was substantial for one physician and moderate for the other. Kappa between physicians was substantial both at the microscope and at the monitor. Conclusions: These preliminary results suggest that digital media are a reliable tool to help physicians in detecting autoantibodies in IIF. Our data represent a first step toward validating the use of digital images, thus offering an opportunity for standardizing and automating the detection of ANA by IIF. © 2007 Clinical Cytometry Society
Connective tissue diseases (CTD) are autoimmune disorders characterized by a chronic inflammatory process involving different organs. Antinuclear Antibodies (ANA) directed against a variety of nuclear antigens are detectable in the serum of patients with many rheumatic and nonrheumatic diseases (1). The usefulness of ANA tests depends on the clinical situation. If the clinical history and physical examination reveal symptoms or signs suggestive of CTD, a positive ANA test contributes to the diagnosis. In addition, as many CTD have common clinical manifestations, the laboratory may play a fundamental role in formulating the correct diagnosis.
- the lack of resources and adequately trained personnel (3);
- the low level of standardization (5);
- the photobleaching effect, which significantly bleaches biological tissues stained with fluorescent dyes within a few seconds (6);
- interobserver variability, which limits the reproducibility of IIF readings (7);
- the lack of automated procedures.
To date, the highest level of automation in IIF tests is the preparation of slides with robotic devices performing dilution, dispensation, and washing operations (8, 9). Although this greatly helps in speeding up the routine part of the tests and in improving the standardization level, it does not affect most of the earlier problems.
These observations suggest that the development of a Computer-Aided Diagnosis (CAD) system supporting the IIF diagnostic procedure would be beneficial in many respects. Being able to determine the presence of autoantibodies in IIF automatically would enable easier and faster test execution and result reporting, increase test repeatability, and lower costs.
The first issue that should be considered in a systematic way to provide effective and viable CAD solutions is the introduction of digital images in IIF practice and full validation of their use both in manual and assisted diagnosis.
Preliminary results on the evaluation of IIF image acquisition have been reported previously (10). In the present study we focus on the use of automatically acquired digital images for diagnostic purposes, i.e., on whether physicians may reliably use digital IIF images in place of direct microscope observation to carry out the diagnosis. Specifically, we present data about diagnoses performed by experts looking at the fluorescence microscope and at digital images on the computer screen. Our goals are twofold. On the one hand, we aim to verify that the use of digital images neither introduces artefacts nor leads to losses of useful information that significantly change the results of IIF tests. On the other hand, we aim to evaluate the improvement that could be expected from using automatically acquired digital images in place of direct inspection of IIF samples at the fluorescence microscope.
MATERIALS AND METHODS
Slide Preparation, Acquisition, and Diagnosis Procedure
The samples included in this study were obtained for diagnostic purposes and routine testing from consecutive outpatients and inpatients of the Campus Biomedico, University Hospital of Rome, Italy. Ninety-six sera (73 F; 23 M) were screened for ANAs by IIF on HEp-2 cells (ATCC-CCL 23). Mean age was 49 years (range 15–73 years). Clinical diagnoses are reported in Table 1.
Table 1. Clinical diagnoses

Diagnosis | No. of samples | % |
---|---|---|
Healthy | 44 | 45.8 |
Connective tissue diseases (CTDs) | | |
Rheumatoid arthritis | 4 | 4.2 |
Seronegative arthritis | 4 | 4.2 |
SLE | 6 | 6.3 |
Sjögren syndrome | 9 | 9.2 |
Undifferentiated connective tissue diseases | 3 | 3.1 |
Other CTDs | 4 | 4.2 |
Other autoimmune disorders | | |
PAPS | 1 | 1 |
Autoimmune thyroid disorders | 6 | 6.3 |
Autoimmune thrombocytopenia | 1 | 1 |
Raynaud | 4 | 4.2 |
Autoimmune gastrointestinal disorders (celiac disease, IBD) | 4 | 4.2 |
Viral hepatitis | 6 | 6.3 |
Total | 96 | |
Sera were diluted 1:80 in 0.01 M phosphate buffered saline (PBS), pH 7.2 (50 μl of serum in 1,950 μl of PBS). A positive and a negative reference control were tested on each slide for quality control.
Prediluted sera were overlaid on fixed HEp-2 cells (The Binding Site, UK) for 30 min at room temperature. Slides were washed twice for 5 min each with PBS, overlaid with fluorescently labeled conjugate (goat antihuman IgG, heavy and light chains; The Binding Site, UK), and incubated for an additional 30 min. After the slide was washed twice again, a cover slip was placed over it with mounting medium (The Binding Site, UK).
- 4+ brilliant green (maximal fluorescence);
- 3+ less brilliant green fluorescence;
- 2+ defined pattern but dim fluorescence;
- 1+ very subdued fluorescence.

- Homogeneous: diffuse staining of the interphase nuclei and staining of the chromatin of mitotic cells;
- Peripheral nuclear (Rim): solid staining, primarily around the outer region of the nucleus, with weaker staining toward the center of the nucleus;
- Nucleolar: large coarse speckled staining within the nucleus, less than six in number per cell;
- Speckled: a fine or coarse granular nuclear staining of the interphase cell nuclei;
- No pattern: unclassifiable pattern.
The samples were blindly classified by two physicians, experts in IIF, working at the fluorescence microscope. The diagnoses (fluorescence intensity and staining pattern) were performed independently and at a short temporal distance from each other to minimize the effect of fluorescence decay. During the reading phase at the microscope, one of the two experts, randomly chosen, selected three different zones of the well under examination, based on their clinical significance for the diagnosis. These areas were then acquired with an acquisition unit consisting of the fluorescence microscope and a monochrome CCD camera with square pixels of 6.45 μm side. The exposure time of slides to incident light was 0.4 s. The images have a resolution of 1,024 × 1,344 pixels and a color depth of 8 bits; they are stored in TIFF format (Fig. 1).

Figure 1. An example of IIF images: a positive serum is shown on the left (a), and two different negative sera are shown on the right (b and c). Sample (a) is labelled 2+ with respect to serum (b) and 3+ with respect to serum (c).
At a different time, the same two experts repeated the diagnosis procedure described earlier on a 19″ flat monitor (HP L1940), looking at the previously acquired digital images (with the corresponding positive and negative controls). Monitor settings were 1,024 × 1,280 pixels with a refresh rate of 60 Hz. These second diagnoses were also performed independently, and neither physician had any information on the previous diagnoses.
At the monitor, both physicians examined the same regions of the well, whereas at the microscope they could observe the whole well. The reason for this choice is that it is not possible to manually acquire images of the whole well at the chosen resolution at the microscope, although we are developing an automated comprehensive system to acquire images of the whole well (12).
We held some training sessions before starting the tests because the physicians were not accustomed to looking at IIF digital images.
Simplified CDC Criteria
We decided to use a simplified version of CDC guidelines for fluorescence intensity to get a ground truth reliable enough to develop a CAD. On the one hand, this modified version should maintain the diagnostic meaning of IIF test, and, on the other hand, it should allow getting a well-founded data set.
We observed two kinds of disagreement between physicians. In the first, physicians assign the sample to different classes (i.e., one physician to positive, the other to negative). In the second, physicians disagree about the subgroup to which a positive sample has to be assigned, i.e., they label it with a different number of pluses. On closer examination, it appears that physicians always agree with each other when the sample is marked with two pluses or more, or when it is definitely negative.
According to these observations, we decided to use a simplified classification of data samples into three classes (i.e., negative, positive, and border zone). A sample is assigned to the negative class if both physicians classify it as negative, whereas it is labeled positive if both physicians mark it with two pluses or more. Finally, a sample is assigned to the border zone class when either of the two types of disagreement described earlier happens or when both physicians mark it with one plus.
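The three-class rule described above can be sketched in a few lines of Python. This is only an illustration, not the authors' software: the function name `consensus_class` and the integer encoding of grades (0 for negative, 1 to 4 for the number of pluses) are assumptions made here. The sketch also assumes that two grades of 2+ or more count as positive even when the number of pluses differs, which is one reading of the rule.

```python
from typing import Literal

Consensus = Literal["negative", "positive", "border zone"]


def consensus_class(grade_a: int, grade_b: int) -> Consensus:
    """Combine two physicians' intensity grades into the simplified class.

    Grades are encoded (by assumption) as 0 = negative and 1..4 = number
    of pluses.
    """
    for grade in (grade_a, grade_b):
        if not 0 <= grade <= 4:
            raise ValueError(f"grade must be in 0..4, got {grade}")
    # Both physicians report a negative sample.
    if grade_a == 0 and grade_b == 0:
        return "negative"
    # Both physicians report two pluses or more (assumed to hold even if
    # the exact number of pluses differs).
    if grade_a >= 2 and grade_b >= 2:
        return "positive"
    # Everything else: positive/negative disagreement, a disagreement
    # involving a 1+ grade, or both physicians reporting 1+.
    return "border zone"


if __name__ == "__main__":
    for pair in [(0, 0), (2, 4), (1, 1), (0, 2), (1, 3)]:
        print(pair, "->", consensus_class(*pair))
```

Encoding the rule as a pure function makes it easy to apply the same relabeling to microscope and monitor readings before computing agreement.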
Unlike the fluorescence intensity scale, the staining pattern classes are defined by well-distinguished features, as reported earlier. Therefore, we chose to consider only four main classes without modifying them.
Statistical Analysis
The agreement between multiple ratings is indicative of the reliability of a single rating. Our agreement analysis covers both the classification of the fluorescence intensity and the description of the staining pattern. Since in all cases our data basically consist of two independent ratings per subject with respect to a dichotomous outcome, we decided to use the degree of agreement between ratings as our main indicator, adopting for this purpose Cohen's kappa, which is the most widely used index in the literature among the many nonequivalent ones proposed (13). Its estimate, k, is expressible as a function of observed frequencies. Although the true parameter value varies from a lower bound of −1 to an upper bound of 1, the usual region of interest is k > 0. In the literature, the following guidelines for interpreting kappa values are used: 0 < k < 0.2 implies slight agreement; 0.2 < k < 0.4 implies fair agreement; 0.4 < k < 0.6 implies moderate agreement; 0.6 < k < 0.8 implies substantial agreement; and 0.8 < k < 1 implies almost perfect agreement (14).
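As an illustration of how k follows from observed frequencies, the sketch below computes Cohen's kappa as k = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e the agreement expected by chance from each rater's marginal frequencies. The function names and the toy labels are hypothetical. Because the guideline intervals above are open, the sketch assumes each upper boundary belongs to the lower band; the label returned for k ≤ 0 is also an assumption, since the guidelines quoted above only cover k > 0.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_1: Sequence[Hashable], rater_2: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same subjects."""
    if len(rater_1) != len(rater_2) or len(rater_1) == 0:
        raise ValueError("both raters must label the same non-empty set of subjects")
    n = len(rater_1)
    # Observed agreement: fraction of subjects given identical labels.
    p_o = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    # Chance agreement: product of the raters' marginal frequencies per class.
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    p_e = sum(counts_1[label] * counts_2[label] for label in counts_1) / n**2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single class")
    return (p_o - p_e) / (1.0 - p_e)


def interpret_kappa(k: float) -> str:
    """Map k to the guideline bands quoted in the text (boundaries assumed)."""
    if k <= 0.0:
        return "no agreement beyond chance"  # label chosen here, not from the text
    if k <= 0.2:
        return "slight"
    if k <= 0.4:
        return "fair"
    if k <= 0.6:
        return "moderate"
    if k <= 0.8:
        return "substantial"
    return "almost perfect"


if __name__ == "__main__":
    # Toy example with made-up labels for four samples.
    microscope = ["pos", "pos", "neg", "neg"]
    monitor = ["pos", "neg", "neg", "neg"]
    k = cohen_kappa(microscope, monitor)
    print(f"k = {k:.2f} ({interpret_kappa(k)})")
```

In the toy example p_o = 0.75 and p_e = 0.5, so k = 0.5, which falls in the moderate band.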
RESULTS
To validate the use of digital images in IIF diagnosis, we populated a database of IIF images.
We then analyzed the agreement between pairs of diagnoses, reporting data on: (i) the agreement between the microscope and the monitor for each expert and (ii) the agreement between experts for each procedure.
The first kind of data evaluates the reliability of using digital images in place of direct observation at the fluorescence microscope to carry out IIF diagnoses. The second kind of data evaluates the advantages or disadvantages of using digital images in place of traditional instrumentation.
The agreement analysis regards both the classification of the fluorescence intensity and the description of the staining pattern.
Fluorescence Intensity Evaluation
Tables 2 and 3 report Cohen's kappa and the related confidence interval (P < 0.05). The agreement between traditional and digitally supported diagnosis differs between the physicians (Table 2). The computed Cohen's kappa suggests substantial agreement for Physician 1 (k = 0.66 ± 0.12) and moderate agreement, close to the substantial range, for Physician 2 (k = 0.56 ± 0.14).
Table 2. Cohen's kappa for fluorescence intensity classified in five classes

Comparison | Kappa, k | Confidence interval (±) |
---|---|---|
Physician 1 looking at the microscope and at the monitor | 0.66 | 0.12 |
Physician 2 looking at the microscope and at the monitor | 0.56 | 0.13 |
Between experts at the microscope | 0.46 | 0.13 |
Between experts at the monitor | 0.65 | 0.12 |

Table 3. Cohen's kappa for fluorescence intensity classified in three classes

Comparison | Kappa, k | Confidence interval (±) |
---|---|---|
Physician 1 looking at the microscope and at the monitor | 0.78 | 0.11 |
Physician 2 looking at the microscope and at the monitor | 0.66 | 0.13 |
Between experts at the microscope | 0.62 | 0.13 |
Between experts at the monitor | 0.84 | 0.09 |
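The tables report each kappa together with an interval half-width, but the text does not state how the interval was obtained. The sketch below shows one commonly used large-sample approximation, SE(k) ≈ sqrt(p_o (1 − p_o) / (n (1 − p_e)²)), with a 95% interval of k ± 1.96 · SE. Treat it as an assumption for illustration rather than the authors' procedure; the function name and the input proportions in the example are hypothetical.

```python
import math
from typing import Tuple


def kappa_with_interval(p_o: float, p_e: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Return (kappa, half-width) using a simple large-sample standard error.

    p_o: observed proportion of agreement
    p_e: proportion of agreement expected by chance
    n:   number of rated subjects
    z:   normal quantile (1.96 for a two-sided 95% interval)
    """
    if n <= 0 or not 0.0 <= p_o <= 1.0 or not 0.0 <= p_e < 1.0:
        raise ValueError("need n > 0, 0 <= p_o <= 1 and 0 <= p_e < 1")
    kappa = (p_o - p_e) / (1.0 - p_e)
    standard_error = math.sqrt(p_o * (1.0 - p_o) / (n * (1.0 - p_e) ** 2))
    return kappa, z * standard_error


if __name__ == "__main__":
    # Hypothetical proportions for a study with 96 samples.
    k, half_width = kappa_with_interval(p_o=0.80, p_e=0.40, n=96)
    print(f"k = {k:.2f} ± {half_width:.2f}")
```

With these made-up inputs the half-width comes out at about 0.13, the same order of magnitude as the values in the tables, which is what one would expect for roughly a hundred samples.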
The agreement between the two experts depends markedly on the method used to carry out the diagnosis. The measured kappa is almost 30% lower at the microscope than at the monitor: it implies moderate agreement at the microscope (k = 0.46 ± 0.13) and substantial agreement at the monitor (k = 0.65 ± 0.12).
If we classify the fluorescence intensity in three classes instead of five, the measured kappa rises (Table 3). Specifically, with respect to the five-class classification, the kappa increases by 15.85% and 13.87% for Physician 1 and Physician 2, respectively. The computed Cohen's kappa now suggests substantial agreement for both Physician 1 (k = 0.78 ± 0.11), near the upper end of the substantial range, and Physician 2 (k = 0.66 ± 0.13).
Between the two experts, the measured kappa is nearly 26% lower at the microscope than at the monitor, implying substantial agreement at the microscope (k = 0.62 ± 0.13) and almost perfect agreement at the monitor (k = 0.84 ± 0.09). Hence, with respect to the five-class intensity classification, the kappa increases by 25.38% and 22.60% at the microscope and at the workstation monitor, respectively.
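The percentage comparisons in this section can be checked against the rounded kappas in Tables 2 and 3. The text does not say which value serves as the base of each percentage; the sketch below assumes the difference is divided by the larger of the two kappas, a reading that reproduces the microscope versus monitor figures of roughly 30% and 26%. The helper name `relative_gap` is hypothetical, and small deviations from the quoted percentages are to be expected because the tables list kappas rounded to two decimals.

```python
def relative_gap(lower: float, higher: float) -> float:
    """Difference between two kappas as a percentage of the larger one."""
    if higher <= 0.0 or lower > higher:
        raise ValueError("expected 0 < higher and lower <= higher")
    return 100.0 * (higher - lower) / higher


# Between-expert kappas, rounded values taken from Tables 2 and 3.
five_classes = {"microscope": 0.46, "monitor": 0.65}
three_classes = {"microscope": 0.62, "monitor": 0.84}

if __name__ == "__main__":
    # Microscope versus monitor, for each classification scheme.
    for name, table in (("five classes", five_classes), ("three classes", three_classes)):
        gap = relative_gap(table["microscope"], table["monitor"])
        print(f"{name}: microscope vs. monitor gap = {gap:.1f}%")
    # Five versus three classes, for each reading method.
    for method in ("microscope", "monitor"):
        gain = relative_gap(five_classes[method], three_classes[method])
        print(f"{method}: five vs. three classes gap = {gain:.1f}%")
```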
Staining Pattern Evaluation
The observed frequencies of the homogeneous, rim, speckled, nucleolar, and unclassifiable pattern classes were 21, 1, 29, 1, and 48%, respectively. These data were computed by averaging each class rate over the four readings of the same sample.
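The averaging step can be sketched as follows. The sketch assumes that the four readings are the two physicians' diagnoses at the microscope and at the monitor, and that each reading is a list with one pattern label per sample; the function name and the toy labels are hypothetical and are not the study data.

```python
from collections import Counter
from typing import Dict, List, Sequence

PATTERNS = ("homogeneous", "rim", "speckled", "nucleolar", "unclassifiable")


def mean_class_rates(readings: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Average the percentage of each pattern class over several readings."""
    if not readings or any(len(labels) == 0 for labels in readings):
        raise ValueError("need at least one non-empty reading")
    totals = {pattern: 0.0 for pattern in PATTERNS}
    for labels in readings:
        counts = Counter(labels)
        for pattern in PATTERNS:
            # Class rate within this reading, as a percentage.
            totals[pattern] += 100.0 * counts[pattern] / len(labels)
    return {pattern: total / len(readings) for pattern, total in totals.items()}


if __name__ == "__main__":
    # Four toy readings of the same four samples (made-up labels).
    toy_readings: List[List[str]] = [
        ["homogeneous", "speckled", "speckled", "unclassifiable"],
        ["homogeneous", "homogeneous", "speckled", "unclassifiable"],
        ["homogeneous", "speckled", "speckled", "unclassifiable"],
        ["homogeneous", "speckled", "speckled", "unclassifiable"],
    ]
    for pattern, rate in mean_class_rates(toy_readings).items():
        print(f"{pattern}: {rate:.2f}%")
```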
An analysis similar to that for intensity was performed for the staining pattern description (Table 4). Once more, the agreement between traditional and digitally supported diagnosis differs between the physicians, suggesting substantial agreement for Physician 1 (k = 0.66 ± 0.13) and moderate agreement for Physician 2 (k = 0.56 ± 0.14).
Table 4. Cohen's kappa for staining pattern classification

Comparison | Kappa, k | Confidence interval (±) |
---|---|---|
Physician 1 looking at the microscope and at the monitor | 0.66 | 0.13 |
Physician 2 looking at the microscope and at the monitor | 0.56 | 0.14 |
Between experts at the microscope | 0.61 | 0.13 |
Between experts at the monitor | 0.68 | 0.12 |
Again, the agreement between the two experts depends on the method used to carry out the diagnosis. The measured kappa is nearly 10% lower at the microscope than at the monitor, although Cohen's kappa implies substantial agreement both at the microscope (k = 0.61 ± 0.13) and at the monitor (k = 0.68 ± 0.12).
DISCUSSION
IIF is a subjective, semiquantitative test that presents high analytical variability for the fluorescence intensity as well as for the staining pattern (15). Despite the recent introduction of numerous diagnostic ELISA kits, the literature agrees that these cannot yet replace the IIF procedure, which can still be considered the reference method for ANA detection.
To our knowledge, the highest level of automation in IIF tests is the preparation of slides with robotic devices. Even if this improves the standardization level, it does not affect most of the problems mentioned in the introduction.
- (i) the possibility of preselecting the cases to be examined, both allowing the physician to concentrate his/her attention only on relevant cases and saving time;
- (ii) the possibility of serving as a second reader, thus augmenting the physician's capabilities to reduce mistakes;
- (iii) the possibility of working as a tool for training and education of medical personnel.
Recently, some papers in the literature have proposed CAD systems to automate the classification of HEp-2 staining patterns (18, 19). While these efforts confirm the potential interest in developing a comprehensive CAD system for IIF tests, they provide only preliminary and partial results, leaving most problems unsolved.
Although we reported preliminary results on a CAD system to support the fluorescence intensity classification (20, 21), in our opinion the first issue that should be considered in a systematic way to provide effective and viable CAD solutions is the introduction of digital images in IIF practice and full validation of their use both in manual and assisted diagnosis. In other words, we suggest that procedures to automatically acquire digital images of IIF slides should be defined and validated. These images can then be used either by human experts, who perform the IIF diagnosis by looking at them on a computer monitor in place of direct observation at the immunofluorescence microscope, or as input to a CAD system. Note that this achievement would hopefully provide significant advantages over current practice in the short term, since it should greatly reduce the negative effects of both photobleaching and interobserver variability. Moreover, the availability of reliable digital images would enable a wide spectrum of useful features, ranging from easy repeatability of diagnosis over time to integration of IIF exams into Electronic Patient Records.
Commercial products for IIF image acquisition have recently been released to the market (22). Nevertheless, to our knowledge, this is the first work on the validation of the use of digital images in IIF tests. As the following discussion of the results shows, digital images can be profitably used by human experts to perform diagnosis at a computer monitor.
Regarding the fluorescence intensity evaluation, the measured kappa between microscope and monitor readings implies substantial agreement for Physician 1 and moderate agreement for Physician 2 (Table 2). On further analysis, nearly 85% of the disagreements for both physicians occur on samples whose fluorescence intensity lies in the classes from 0 to 2+. Samples in these classes are intrinsically hard to classify because they lie on the borderline between the positive and negative classes.
It is worth noting that the two diagnoses performed by each physician on the same sample should be considered independent of each other. These observations suggest that diagnosing the fluorescence intensity of a sample using digital media is at least as reliable as the classification carried out in the traditional way (i.e., at the fluorescence microscope).
The measured kappa for the agreement between the two physicians shows moderate and substantial agreement at the microscope and at the monitor, respectively. These data suggest that performing the diagnosis by looking at digital images on a workstation screen allows the physicians to concentrate better on sample examination. Indeed, digital images on the one hand avoid the photobleaching effect and, on the other hand, allow a more careful intensity evaluation, since the sample under examination can be easily compared with negative and positive controls.
Such an agreement, however, may not be satisfactory enough to develop a reliable CAD. In this respect, it is important to reliably label a data set with its true category. In the field of pattern recognition, a labeled data set is named ground truth. In IIF application, the ground truth is made by labeled images with both fluorescence intensity and staining pattern classification. IIF is a subjective test and no objective independent test could be used to assess the human expert diagnosis. Based on these observations we utilize two different and independent diagnoses for each sample to get ground truth. Clearly its reliability depends on the degree of agreement between physicians.
To overcome such a limitation, the original fluorescence intensity classification problem on five classes is simplified to a classification problem on three classes, as described in Materials and Methods. While the motivation for this class revision is the ability to get a better ground truth, it is worth noting that these three classes maintain the clinical significance of the test (i.e., positive, negative, and border zone test).
To evaluate the effectiveness of this simplified classification into three classes, we adopted the previous approach. Obviously, the agreement between the diagnoses of the same expert and between the two experts depends on the number of categories into which they must classify the fluorescence intensity of the examined sample. Indeed, using the classification in three classes, Cohen's kappa suggests substantial agreement for both physicians, with Physician 1 near the upper end of the substantial range. Moreover, with respect to the classification into five classes, the agreement between experts at the microscope and at the monitor rises from moderate and substantial to substantial and almost perfect, respectively. These data suggest that a classification in three classes leads to a more reliable ground truth, useful to develop a CAD supporting the classification of fluorescence intensity.
Concerning the staining pattern evaluation, the measured kappa suggests substantial agreement for Physician 1 and moderate agreement for Physician 2. Since each expert's diagnosis concerns both the fluorescence intensity and the staining pattern, the difference between the physicians is the same as that measured for the fluorescence intensity. As with the grading of the fluorescence intensity, variability in staining pattern classification is reported in the literature (5, 15). Our data show a reduction in this variability, suggesting that classifying the pattern using digital images is more effective than looking at the microscope.
The disagreements clustered around the choices speckled vs. homogeneous and speckled vs. unclassifiable pattern; in all cases such disagreements ranged from 3 to 8% of the total number of sera. It is worth noting that most of them occur on samples whose fluorescence intensity was weakly positive, as is usual in daily clinical practice.
Analyzing the agreement between the two experts, Cohen's kappa implies substantial agreement both at the microscope and at the monitor. Disagreements between experts mainly occur when the sample exhibits multiple patterns, or when one of the two experts reports an unclassifiable pattern. In this respect, it is worth noting that CAD systems have the potential to distinguish not only the predominant pattern but also the minor one.
Unlike fluorescence intensity grading, which should be considered a continuous variable, the classes of staining pattern are defined by well-distinguished features or, to use a term from the pattern recognition field, are well separated.
Our results suggest that observing the pattern by looking at digital images on a workstation monitor allows the physician to concentrate better on sample examination, e.g., to observe fine details carefully without having to worry about photobleaching effects.
As a final consideration, the physicians were initially not accustomed to diagnosing samples at the workstation monitor, whereas they were well skilled in carrying out the diagnosis at the microscope. Potentially, the results on digital image classification could improve markedly as the experts' familiarity with this kind of diagnostic procedure increases.
Acknowledgements
We thank Dario Malosti for his constant encouragement, support, and valuable advice.