On Marginal Dependencies of the 2 × 2 Kappa
Abstract
Cohen’s kappa is a standard tool for the analysis of agreement in a 2 × 2 reliability study. Researchers are frequently interested only in the kappa-value of a sample. Various authors have observed that if two pairs of raters have the same amount of observed agreement, the pair whose marginal distributions are more similar to each other may have a lower kappa-value than the pair with more divergent marginal distributions. Here we present exact formulations of some of these properties. The results provide a better understanding of the 2 × 2 kappa for situations where it is used as a sample statistic.
1. Introduction
Results from experimental studies and research studies can often be summarized in a 2 × 2 table [1]. An example is a reliability study in which two observers rate the same sample of subjects on the presence/absence of a trait or an ability [2, 3]. In this example the four cells of the 2 × 2 table are the proportion of times the observers agreed on the presence of the trait, the proportion of times a trait was present according to the first observer but absent according to the second observer, the proportion of times a trait was absent according to the first observer but present according to the second observer, and the proportion of times the observers agreed on the absence of the trait.
To assess the quality of the ratings, the agreement between the ratings is taken as an indicator of the quality of the category definitions and the observers’ ability to apply them. A standard tool for estimating agreement in a 2 × 2 reliability study is Cohen’s kappa [4–8]. Its value is 1 when there is perfect agreement, 0 when agreement is equal to that expected under independence, and negative when agreement is less than expected by chance.
Several authors have presented population models for Cohen’s kappa [2, 7]. Under these models kappa can be interpreted as an association coefficient. However, kappa is also frequently used as a sample statistic [4, 8–11], for example, when calculating kappa for a sample of subjects is one step in a series of research steps. In this case, researchers are merely interested in the agreement in the sample, not that of a population.
As a sample statistic, kappa is known to be marginal or prevalence dependent, since it takes into account the marginal totals, that is, how often the raters used the rating categories [12–14]. The value of kappa depends on the prevalence of the condition being diagnosed: values of kappa can be quite low if the condition is very common or very rare. Various authors have shown that if two pairs of observers have the same amount of observed agreement, the pair whose marginal distributions are more similar to each other may have a lower kappa-value than the pair with more divergent marginal distributions [4, 9, 15, 16]. Since observers with similar marginal distributions usually have a higher amount of agreement expected to occur by chance, a fixed amount of observed agreement will then lead to a lower kappa-value due to the definition of the statistic [14].
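As a brief numerical aside (an illustrative sketch, not part of the original argument; the cell proportions and the function name kappa are hypothetical choices made here), the prevalence effect can be reproduced in a few lines of Python:

```python
# Illustration of the prevalence effect: high raw agreement, rare trait, modest kappa.

def kappa(a, b, c, d):
    """Cohen's kappa for a 2 x 2 table of cell proportions a, b, c, d."""
    p_e = (a + b) * (a + c) + (c + d) * (b + d)   # chance-expected agreement
    return (a + d - p_e) / (1 - p_e)

# 92% observed agreement, but the trait occurs in only about 6% of the subjects
print(round(kappa(.02, .04, .04, .90), 2))   # roughly 0.29
```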
The marginal dependence of Cohen’s kappa has almost exclusively been demonstrated and described by means of examples of 2 × 2 tables [4, 8–11, 16]. However, for better understanding of the behavior of the 2 × 2 kappa as a sample statistic, it is desirable to have exact formulations of its marginal dependence. Such formulations are presented in this paper. The paper is organized as follows. Section 2 is used to introduce notation and define Cohen’s kappa. In Section 3 several concepts for a 2 × 2 table are presented. The main results are presented in Section 4. The results show that the 2 × 2 kappa may exhibit several different forms of marginal dependence. The results do not necessarily suggest that Cohen’s kappa should be discarded as an agreement measure. Instead, the exact formulations provide a better understanding of the 2 × 2 statistic. Section 5 contains several conclusions.
2. Notation and Kappa
In this section we introduce notation and define the 2 × 2 kappa coefficient. Suppose two fixed observers independently rate the same set of subjects using the same two categories 1 and 0. For example, 1 = presence and 0 = absence of a trait. For a sample of subjects, let a, b, c, and d denote, respectively, the proportion classified in category 1 by both observers, the proportion classified by the first observer in category 1 and by the second observer in category 0, the proportion classified by the first observer in category 0 and by the second observer in category 1, and the proportion classified in category 0 by both observers.
A general 2 × 2 table of observed proportions, denoted by P, is presented in Table 1. The row and column totals are the marginal totals obtained by summing the cell proportions. We denote them by p1 and q1 for observer 1 and by p2 and q2 for observer 2. They reflect how often the observers used the two categories.
Table 1: A general 2 × 2 table of observed proportions P.

| Observer 1 | Observer 2: 1 | Observer 2: 0 | Total |
|---|---|---|---|
| 1 | a | b | p1 |
| 0 | c | d | q1 |
| Total | p2 | q2 | 1 |

Cohen’s kappa for Table 1 is defined as

κ = (a + d − p1p2 − q1q2) / (1 − p1p2 − q1q2), (1)

where a + d is the proportion of observed agreement and p1p2 + q1q2 is the proportion of agreement expected under chance, that is, under statistical independence of the two ratings.
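As an illustrative aside, coefficient (1) is straightforward to compute from the four cell proportions. The Python sketch below is not part of the original text; the function name kappa_2x2 is merely illustrative.

```python
def kappa_2x2(a: float, b: float, c: float, d: float) -> float:
    """Coefficient (1) for the cell proportions of Table 1 (a + b + c + d = 1)."""
    p1, q1 = a + b, c + d            # marginal totals of observer 1
    p2, q2 = a + c, b + d            # marginal totals of observer 2
    p_o = a + d                      # proportion of observed agreement
    p_e = p1 * p2 + q1 * q2          # proportion of chance-expected agreement
    return (p_o - p_e) / (1 - p_e)

# For instance, the proportions of Table 2(a) below give a kappa-value of about .52.
print(round(kappa_2x2(.60, .10, .10, .20), 2))
```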
3. Concepts for a 2 × 2 Table
In this section we introduce several concepts for a 2 × 2 table P. The 2-tuples (p1, q1) and (p2, q2) contain the marginal distributions.
Definition 1. Two tuples (p1, q1) and (p2, q2) are said to be
- (i) similarly arranged if both increase (i.e., p1 < q1 and p2 < q2) or both decrease (i.e., p1 > q1 and p2 > q2);
- (ii) oppositely arranged if one increases and the other decreases.
In the following definition we use the concepts from Definition 1 to define terminology that will be used to formalize the marginal dependencies of kappa.
Definition 2. A 2 × 2 table P is said to be
- (i) strongly marginal symmetric if p1 = p2 ≠ 1/2 (and hence q1 = q2);
- (ii) weakly marginal symmetric if (p1, q1) and (p2, q2) are similarly arranged;
- (iii) balanced if p1 = q1 = 1/2 or p2 = q2 = 1/2, that is, if at least one of the marginal distributions is uniform;
- (iv) marginal asymmetric if (p1, q1) and (p2, q2) are oppositely arranged.
Note that strong marginal symmetry implies weak marginal symmetry. Furthermore, strong marginal symmetry coincides with the usual definition of a symmetric matrix.
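For concreteness, the classification in Definition 2 can be coded directly from the marginal totals. The sketch below is ours and not from the paper; the function name classify is illustrative, and exact floating-point comparisons are used for brevity (a tolerance would be needed in practice).

```python
def classify(a: float, b: float, c: float, d: float) -> str:
    """Classify a 2 x 2 table of cell proportions according to Definition 2."""
    p1, q1 = a + b, c + d
    p2, q2 = a + c, b + d
    if (p1 - q1) * (p2 - q2) > 0:        # similarly arranged, Definition 1(i)
        # Definition 2(i) additionally requires p1 != 1/2; that corner case is ignored here
        return "strongly marginal symmetric" if p1 == p2 else "weakly marginal symmetric"
    if (p1 - q1) * (p2 - q2) < 0:        # oppositely arranged, Definition 1(ii)
        return "marginal asymmetric"
    return "balanced"                    # at least one marginal distribution is uniform

print(classify(.60, .10, .10, .20))      # strongly marginal symmetric
print(classify(.40, .20, .00, .40))      # marginal asymmetric
```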
The following result relates some of the concepts in Definition 2 to the value of the chance-expected agreement p1p2 + q1q2. Lemma 3 is used in the proof of Theorem 5.
Lemma 3. For a 2 × 2 table P with marginal distributions (p1, q1) and (p2, q2) the following equivalences hold.
- (1) P is weakly marginal symmetric ⇔ p1p2 + q1q2 > 1/2;
- (2) P is balanced ⇔ p1p2 + q1q2 = 1/2;
- (3) P is marginal asymmetric ⇔ p1p2 + q1q2 < 1/2.
Proof. We prove equivalence 1. The other equivalences follow from using similar arguments.
(⇒) If P is weakly marginal symmetric, (p1, q1) and (p2, q2) are similarly arranged and we have (p1 − q1)(p2 − q2) > 0, or equivalently p1p2 + q1q2 > p1q2 + p2q1. Since (p1 + q1)(p2 + q2) = 1, the right-hand side equals 1 − (p1p2 + q1q2), and the inequality becomes 2(p1p2 + q1q2) > 1, that is, p1p2 + q1q2 > 1/2.
(⇐) If p1p2 + q1q2 > 1/2, then reversing the steps in (⇒) shows that (p1 − q1)(p2 − q2) > 0 must hold. Hence (p1, q1) and (p2, q2) are similarly arranged, and it follows that P is weakly marginal symmetric.
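The equivalences of Lemma 3 are easy to check numerically. The following sketch (ours, not from the paper; the function name expected_agreement is illustrative) evaluates the chance-expected agreement for three hypothetical pairs of marginal distributions.

```python
def expected_agreement(p1: float, p2: float) -> float:
    """Chance-expected agreement p1*p2 + q1*q2 with q1 = 1 - p1 and q2 = 1 - p2."""
    return p1 * p2 + (1 - p1) * (1 - p2)

print(expected_agreement(.70, .60) > .5)                 # similarly arranged: True
print(abs(expected_agreement(.50, .80) - .5) < 1e-12)    # one uniform marginal: True
print(expected_agreement(.60, .40) < .5)                 # oppositely arranged: True
```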
4. Main Results
In this section we present several marginal dependencies of Cohen’s kappa (Theorems 5, 7, and 10). The following lemma will be used repeatedly.
Lemma 4. For a fixed value of the observed agreement a + d < 1, coefficient (1) is strictly decreasing in p1p2 + q1q2.
Proof. The first order partial derivative of coefficient (1) with respect to p1p2 + q1q2 is (a + d − 1)/(1 − p1p2 − q1q2)², which is strictly negative because a + d < 1. Hence kappa is strictly decreasing in p1p2 + q1q2.
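A quick numerical illustration of Lemma 4 (our sketch, not part of the paper): holding the observed agreement fixed at a + d = .80 and increasing the expected agreement produces a strictly decreasing sequence of kappa-values.

```python
def kappa_from_agreements(p_o: float, p_e: float) -> float:
    """Coefficient (1) written in terms of observed and expected agreement."""
    return (p_o - p_e) / (1 - p_e)

values = [kappa_from_agreements(.80, p_e) for p_e in (.48, .50, .58, .68)]
print([round(v, 2) for v in values])                   # roughly .62, .60, .52, .38
print(all(x > y for x, y in zip(values, values[1:])))  # True: strictly decreasing
```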
The following result is a slightly stronger version of a theorem in Warrens [16] for a rating scale with two categories. Theorem 5 shows that, for a fixed value of the proportion of observed agreement a + d, 2 × 2 tables that possess weak marginal symmetry produce lower values of kappa than tables that are marginal asymmetric.
Theorem 5. Let P1, P2, and P3 be 2 × 2 tables with the same value of a + d < 1 that are, respectively, weakly marginal symmetric, balanced, and marginal asymmetric. Furthermore, let κ1, κ2, and κ3 denote the associated values of kappa. Then κ1 < κ2 < κ3.
Proof. Lemma 4 shows that κ is strictly decreasing in p1p2 + q1q2. The result then follows from application of Lemma 3.
Example 6 illustrates Theorem 5.
Example 6. Consider the three hypothetical 2 × 2 tables in Table 2.
Each table has the same proportion of observed agreement a + d = .80. Table 2(a) is strongly marginal symmetric, Table 2(b) is balanced, and Table 2(c) is marginal asymmetric. We have the double inequality κ1 < κ2 < κ3, which illustrates Theorem 5.
Table 2(a)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .60 | .10 | .70 | a + d = .80 |
| 0 | .10 | .20 | .30 | p1p2 + q1q2 = .58 |
| Total | .70 | .30 | 1 | κ1 = .52 |

Table 2(b)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .45 | .05 | .50 | a + d = .80 |
| 0 | .15 | .35 | .50 | p1p2 + q1q2 = .50 |
| Total | .60 | .40 | 1 | κ2 = .60 |

Table 2(c)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .40 | .20 | .60 | a + d = .80 |
| 0 | .00 | .40 | .40 | p1p2 + q1q2 = .48 |
| Total | .40 | .60 | 1 | κ3 = .62 |
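The kappa-values reported in Table 2 can be reproduced with the sketch below (our code; the helper kappa simply restates coefficient (1)).

```python
def kappa(a, b, c, d):
    p_e = (a + b) * (a + c) + (c + d) * (b + d)
    return (a + d - p_e) / (1 - p_e)

tables = {"2(a)": (.60, .10, .10, .20),   # strongly marginal symmetric
          "2(b)": (.45, .05, .15, .35),   # balanced
          "2(c)": (.40, .20, .00, .40)}   # marginal asymmetric
for name, cells in tables.items():
    print(name, round(kappa(*cells), 2))
# kappa increases from 2(a) to 2(c): about .52 < .60 < .62, as Theorem 5 predicts
```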
Theorem 5 also considers 2 × 2 tables that have asymmetric marginals. However, asymmetric tables may not be relevant in practice. If the classifications are hard to make, researchers will often make use of expert observers. Furthermore, novice observers usually receive some training before the actual classifications have to be made. Asymmetric tables are therefore rarely encountered in practice.
Theorem 7 shows that 2 × 2 tables that are strongly marginal symmetric may have lower kappa-values than tables with unequal marginal distributions.
Theorem 7. Let P1 be a weakly marginal symmetric table with marginals (p1, q1) and (p2, q2) and value κ1, and let P2 be a strongly marginal symmetric table with marginals (p3, q3) and value κ2. Furthermore, suppose that P1 and P2 have the same proportion of observed agreement a + d < 1. If

max(p1, q1), max(p2, q2) < max(p3, q3), (5)

then κ2 < κ1.
Proof. Due to the symmetries of the proportion of expected agreement p1p2 + q1q2, we may assume, without loss of generality, that p1 > q1, p2 > q2, and p3 > q3. It then follows from inequality (5) that p3 > p1, p2. Furthermore, since p3 > q3, we have p3 > 1/2, or 2p3 > 1, and since p1 > q1, we also have 2p1 > 1.
Since 2p3 − 1 > 0, multiplying both sides of the inequality p3 > p1 by 2p3 − 1 yields 2p3² − p3 > 2p1p3 − p1. Similarly, since 2p1 − 1 > 0 and p3 > p2, we have (2p1 − 1)(p3 − p2) > 0, that is, 2p1p3 − p3 > 2p1p2 − p2. Combining the two inequalities gives 2p3² − 2p3 > 2p1p2 − p1 − p2, and adding 1 to both sides yields p3² + q3² > p1p2 + q1q2, since q3 = 1 − p3, q1 = 1 − p1, and q2 = 1 − p2. Because P1 and P2 have the same proportion of observed agreement, it follows from Lemma 4 that κ2 < κ1.
Example 8 illustrates Theorem 7.
Example 8. Consider the two hypothetical 2 × 2 tables in Table 3.
Both tables have the same proportion of observed agreement a + d = .80. Table 3(a) is weakly marginal symmetric, whereas Table 3(b) is strongly marginal symmetric. Since max(p1, q1) = .65 and max(p2, q2) = .75 are both smaller than max(p3, q3) = .80, condition (5) is satisfied, and indeed κ2 = .38 < .53 = κ1, which illustrates Theorem 7.
Table 3(a)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .60 | .05 | .65 | a + d = .80 |
| 0 | .15 | .20 | .35 | p1p2 + q1q2 = .58 |
| Total | .75 | .25 | 1 | κ1 = .53 |

Table 3(b)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .70 | .10 | .80 | a + d = .80 |
| 0 | .10 | .10 | .20 | p3p3 + q3q3 = .68 |
| Total | .80 | .20 | 1 | κ2 = .38 |
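As a check (our sketch, not part of the paper), the two kappa-values of Table 3 and condition (5) can be verified numerically.

```python
def kappa(a, b, c, d):
    p_e = (a + b) * (a + c) + (c + d) * (b + d)
    return (a + d - p_e) / (1 - p_e)

k1 = kappa(.60, .05, .15, .20)   # Table 3(a): marginals (.65, .35) and (.75, .25)
k2 = kappa(.70, .10, .10, .10)   # Table 3(b): marginals (.80, .20) for both observers
print(max(.65, .35) < .80 and max(.75, .25) < .80)   # condition (5) holds: True
print(k2 < k1)                                       # True: k2 is about .38, k1 about .53
```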
Using similar arguments as in the proof of Theorem 7 we may obtain the following result.
Theorem 9. Let P1 be a weakly marginal symmetric table with marginals (p1, q1) and (p2, q2) and κ-value κ1, and let P2 be a strongly marginal symmetric table with marginals (p3, q3) and κ-value κ2. Furthermore, suppose that P1 and P2 have the same proportion of observed agreement a + d < 1. If max(p1, q1), max(p2, q2) > max(p3, q3), then κ1 < κ2.
If we have p1 > q1, p2 > q2, and p3 > q3, then Theorems 7 and 9 cover the cases p1, p2 < p3 and p1, p2 > p3, respectively. The cases p1 > p3 > p2 and p2 > p3 > p1 turn out to be more complicated.
Another marginal dependence of kappa is presented in Theorem 10. The theorem shows that, for a constant value of the proportion of observed agreement a + d, 2 × 2 tables that exhibit weak marginal symmetry may produce higher kappa-values than tables with strong marginal symmetry. Theorem 10 is similar to Theorem 7, but conditions (5) and (15) are different requirements.
Theorem 10. Let P1 be a weakly marginal symmetric table with marginals (p1, q1) and (p2, q2) and value κ1, and let P2 be a strongly marginal symmetric table with marginals (p3, q3) and value κ2. Furthermore, suppose that P1 and P2 have the same proportion of observed agreement a + d < 1. If

max(p1, q1) · max(p2, q2) < (max(p3, q3))², (15)

then κ2 < κ1.
Proof. Due to the symmetries of the proportion of expected agreement p1p2 + q1q2, we may assume, without loss of generality, that p1p2 > q1q2 and p3 > q3. It then follows from inequality (15) that p3² > max(p1, q1) · max(p2, q2) ≥ p1p2, that is, p3 > √(p1p2). Furthermore, since P1 is weakly marginal symmetric we must have p1 > q1 and p2 > q2. It follows that p1, p2 > 1/2, and thus √(p1p2) > 1/2.
For p ∈ (0, 1) the function p ↦ p(1 − p) is concave with a maximum at p = 1/2, and hence strictly decreasing on (1/2, 1). Since p3 > √(p1p2) > 1/2, we have p3q3 = p3(1 − p3) < √(p1p2)(1 − √(p1p2)). Together with the arithmetic-geometric mean inequality 2√(p1p2) ≤ p1 + p2 this gives 2p3q3 < 2√(p1p2) − 2p1p2 ≤ p1 + p2 − 2p1p2 = p1q2 + p2q1. Hence p3² + q3² = 1 − 2p3q3 > 1 − p1q2 − p2q1 = p1p2 + q1q2, and since P1 and P2 have the same proportion of observed agreement, Lemma 4 yields κ2 < κ1.
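The inequality used in the proof can also be checked numerically. The sketch below (ours, not from the paper) samples marginals with p1, p2 > 1/2 and p3 > √(p1p2) and confirms that the strongly marginal symmetric table then has the larger expected agreement; the small tolerance only guards against floating-point ties.

```python
import random

random.seed(0)
for _ in range(100_000):
    p1, p2 = random.uniform(.5, 1), random.uniform(.5, 1)
    p3 = random.uniform((p1 * p2) ** .5, 1)      # condition (15) after the WLOG step
    weak = p1 * p2 + (1 - p1) * (1 - p2)         # expected agreement of P1
    strong = p3 * p3 + (1 - p3) * (1 - p3)       # expected agreement of P2
    assert strong > weak - 1e-12, (p1, p2, p3)
print("no counterexample found")
```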
Example 11 illustrates Theorem 10 and a corollary of Theorem 10.
Example 11. Consider the four hypothetical 2 × 2 tables in Table 4.
Each table has the same proportion of observed agreement a + d = .80. Tables 4(a) and 4(d) are strongly marginal symmetric, whereas Tables 4(b) and 4(c) are weakly marginal symmetric. For the largest marginals of Tables 4(a) and 4(b) we have (.70)(.70) > (.60)(.80) and κ1 < κ2, which illustrates Theorem 10.
For many 2 × 2 tables from the literature the converse of Theorem 10 also holds. However, Tables 4(a) and 4(c) provide a counterexample showing that the converse does not hold in general: we have κ1 < κ3, although (.70)(.70) = .49 is not greater than (.62)(.80) ≈ .50.
Finally, Tables 4(a) and 4(d) illustrate a special application of Theorem 10. If two 2 × 2 tables are strongly marginal symmetric and have the same proportion of observed agreement a + d, then the table with the most skewed (unbalanced) marginals (Table 4(a)) has the lowest value of kappa. This is illustrated by the fact that κ1 < κ4.
Table 4(a)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .60 | .10 | .70 | a + d = .80 |
| 0 | .10 | .20 | .30 | p1p2 + q1q2 = .58 |
| Total | .70 | .30 | 1 | κ1 = .52 |

Table 4(b)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .60 | .00 | .60 | a + d = .80 |
| 0 | .20 | .20 | .40 | p1p2 + q1q2 = .56 |
| Total | .80 | .20 | 1 | κ2 = .55 |

Table 4(c)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .61 | .01 | .62 | a + d = .80 |
| 0 | .19 | .19 | .38 | p1p2 + q1q2 = .57 |
| Total | .80 | .20 | 1 | κ3 = .53 |

Table 4(d)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .50 | .10 | .60 | a + d = .80 |
| 0 | .10 | .30 | .40 | p1p2 + q1q2 = .52 |
| Total | .60 | .40 | 1 | κ4 = .58 |
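Finally, the values in Table 4 and the role of condition (15) can be verified with the following sketch (our code, not part of the paper).

```python
def kappa(a, b, c, d):
    p_e = (a + b) * (a + c) + (c + d) * (b + d)
    return (a + d - p_e) / (1 - p_e)

k1 = kappa(.60, .10, .10, .20)   # 4(a): strongly marginal symmetric, largest marginal .70
k2 = kappa(.60, .00, .20, .20)   # 4(b): weakly marginal symmetric, largest marginals .60, .80
k3 = kappa(.61, .01, .19, .19)   # 4(c): weakly marginal symmetric, largest marginals .62, .80
k4 = kappa(.50, .10, .10, .30)   # 4(d): strongly marginal symmetric, largest marginal .60

print(.70 * .70 > .60 * .80, k1 < k2)   # True True:  condition (15) holds and k1 < k2
print(.70 * .70 > .62 * .80, k1 < k3)   # False True: the converse of Theorem 10 fails
print(k1 < k4)                          # True: the more skewed symmetric table has lower kappa
```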
5. Conclusions
Cohen’s kappa is presently a standard tool for the analysis of agreement in a 2 × 2 reliability study. The statistic is frequently used as a sample statistic. Various authors have observed in this context that if two pairs of raters have the same amount of observed agreement, the pair whose marginal distributions are more similar to each other may have a lower kappa-value than the pair with more divergent marginal distributions. These properties of Cohen’s kappa have almost exclusively been demonstrated and described by means of examples of 2 × 2 tables [4, 8–11, 16]. In this paper we presented exact formulations and proved several marginal dependencies of this type (Theorems 5, 7, and 10). In general, they show that, for 2 × 2 tables with the same value of observed agreement, tables with marginal distributions that are more similar may have lower associated kappa-values than tables with marginal distributions that are less similar. Each result was illustrated by an example with hypothetical 2 × 2 tables. The results provide a better understanding of the 2 × 2 kappa when it is used as a sample statistic.
Theorem 5 considers 2 × 2 tables that have asymmetric marginals. Although several authors have provided examples with asymmetric marginals, asymmetric tables may not be relevant in practice. If the classifications are hard to make, researchers will often make use of expert observers. Furthermore, novice observers usually receive some training before the actual classifications have to be made. Asymmetric tables are therefore rarely encountered in practice.
Vach [14] emphasizes that kappa should not simply be interpreted as a measure of agreement but that Cohen’s kappa expresses the degree to which observed agreement exceeds the agreement that was expected by chance. The marginal dependencies are a direct consequence of the definition of kappa and its aim to adjust the observed agreement with respect to the expected amount of agreement under chance conditions [14, p. 659]. This dependence is therefore not a reason for discarding Cohen’s kappa.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The author thanks two anonymous reviewers for their helpful comments and valuable suggestions on an earlier version of this paper. This research is part of Veni project 451-11-026 funded by the Netherlands Organisation for Scientific Research.