On Marginal Dependencies of the 2 × 2 Kappa
Abstract
Cohen’s kappa is a standard tool for the analysis of agreement in a 2 × 2 reliability study. Researchers are frequently interested only in the kappa-value of a sample. Various authors have observed that if two pairs of raters have the same amount of observed agreement, the pair whose marginal distributions are more similar to each other may have a lower kappa-value than the pair with more divergent marginal distributions. Here we present exact formulations of some of these properties. The results provide a better understanding of the 2 × 2 kappa for situations where it is used as a sample statistic.
1. Introduction
Results from experimental studies and research studies can often be summarized in a 2 × 2 table [1]. An example is a reliability study in which two observers rate the same sample of subjects on the presence/absence of a trait or an ability [2, 3]. In this example the four cells of the 2 × 2 table are the proportion of times the observers agreed on the presence of the trait, the proportion of times a trait was present according to the first observer but absent according to the second observer, the proportion of times a trait was absent according to the first observer but present according to the second observer, and the proportion of times the observers agreed on the absence of the trait.
To assess the quality of the ratings, the agreement between the ratings is taken as an indicator of the quality of the category definitions and the observers’ ability to apply them. A standard tool for estimating agreement in a 2 × 2 reliability study is Cohen’s kappa [4–8]. Its value is 1 when there is perfect agreement, 0 when agreement is equal to that expected under independence, and negative when agreement is less than expected by chance.
Several authors have presented population models for Cohen’s kappa [2, 7]. Under these models kappa can be interpreted as an association coefficient. However, kappa is also frequently used as a sample statistic [4, 8–11], for example, when calculating kappa for a sample of subjects is one step in a series of research steps. In this case, researchers are merely interested in the agreement in the sample, not that of a population.
As a sample statistic, kappa is known to be marginal or prevalence dependent, since it takes into account the marginal totals, that is, how often the raters used the rating categories [12–14]. The value of kappa depends on the prevalence of the condition being diagnosed: values of kappa can be quite low if the condition is very common or very rare. Various authors have shown that if two pairs of observers have the same amount of observed agreement, the pair whose marginal distributions are more similar to each other may have a lower kappa-value than the pair with more divergent marginal distributions [4, 9, 15, 16]. Since observers with similar marginal distributions usually have a higher amount of agreement expected to occur by chance, a fixed amount of observed agreement will then lead to a lower kappa-value due to the definition of the statistic [14].
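As a brief numerical aside (an illustrative sketch, not part of the original argument; the cell proportions and the function name kappa are hypothetical choices made here), the prevalence effect can be reproduced in a few lines of Python:

```python
# Illustration of the prevalence effect: high raw agreement, rare trait, modest kappa.

def kappa(a, b, c, d):
    """Cohen's kappa for a 2 x 2 table of cell proportions a, b, c, d."""
    p_e = (a + b) * (a + c) + (c + d) * (b + d)   # chance-expected agreement
    return (a + d - p_e) / (1 - p_e)

# 92% observed agreement, but the trait occurs in only about 6% of the subjects
print(round(kappa(.02, .04, .04, .90), 2))   # roughly 0.29
```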
The marginal dependence of Cohen’s kappa has almost exclusively been demonstrated and described by means of examples of 2 × 2 tables [4, 8–11, 16]. However, for better understanding of the behavior of the 2 × 2 kappa as a sample statistic, it is desirable to have exact formulations of its marginal dependence. Such formulations are presented in this paper. The paper is organized as follows. Section 2 is used to introduce notation and define Cohen’s kappa. In Section 3 several concepts for a 2 × 2 table are presented. The main results are presented in Section 4. The results show that the 2 × 2 kappa may exhibit several different forms of marginal dependence. The results do not necessarily suggest that Cohen’s kappa should be discarded as an agreement measure. Instead, the exact formulations provide a better understanding of the 2 × 2 statistic. Section 5 contains several conclusions.
2. Notation and Kappa
In this section we introduce notation and define the 2 × 2 kappa coefficient. Suppose two fixed observers independently rate the same set of subjects using the same two categories 1 and 0. For example, 1 = presence and 0 = absence of a trait. For a sample of subjects, let a, b, c, and d denote, respectively, the proportion classified in category 1 by both observers, the proportion classified by the first observer in category 1 and by the second observer in category 0, the proportion classified by the first observer in category 0 and by the second observer in category 1, and the proportion classified in category 0 by both observers.
A general 2 × 2 table of observed proportions, denoted by P, is presented in Table 1. The row and column totals are the marginal totals obtained by summing the cell proportions. We denote them by p1 and q1 for observer 1 and by p2 and q2 for observer 2. They reflect how often the observers used the two categories.
Table 1: A general 2 × 2 table of observed proportions P.

| Observer 1 | Observer 2: 1 | Observer 2: 0 | Total |
|---|---|---|---|
| 1 | a | b | p1 |
| 0 | c | d | q1 |
| Total | p2 | q2 | 1 |

Cohen’s kappa for Table 1 is defined as

κ = (a + d − p1p2 − q1q2) / (1 − p1p2 − q1q2), (1)

where a + d is the proportion of observed agreement and p1p2 + q1q2 is the proportion of agreement expected under chance, that is, under statistical independence of the two ratings.
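As an illustrative aside, coefficient (1) is straightforward to compute from the four cell proportions. The Python sketch below is not part of the original text; the function name kappa_2x2 is merely illustrative.

```python
def kappa_2x2(a: float, b: float, c: float, d: float) -> float:
    """Coefficient (1) for the cell proportions of Table 1 (a + b + c + d = 1)."""
    p1, q1 = a + b, c + d            # marginal totals of observer 1
    p2, q2 = a + c, b + d            # marginal totals of observer 2
    p_o = a + d                      # proportion of observed agreement
    p_e = p1 * p2 + q1 * q2          # proportion of chance-expected agreement
    return (p_o - p_e) / (1 - p_e)

# For instance, the proportions of Table 2(a) below give a kappa-value of about .52.
print(round(kappa_2x2(.60, .10, .10, .20), 2))
```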
3. Concepts for a 2 × 2 Table
In this section we introduce several concepts for a 2 × 2 table P. The 2-tuples (p1, q1) and (p2, q2) contain the marginal distributions.
Definition 1. Two tuples (p1, q1) and (p2, q2) are said to be
- (i) similarly arranged if both increase (i.e., p1 < q1 and p2 < q2) or both decrease (i.e., p1 > q1 and p2 > q2);
- (ii) oppositely arranged if one increases and the other decreases.
In the following definition we use the concepts from Definition 1 to define terminology that will be used to formalize the marginal dependencies of kappa.
Definition 2. A 2 × 2 table P is said to be
- (i) strongly marginal symmetric if p1 = p2 ≠ 1/2 (and hence q1 = q2);
- (ii) weakly marginal symmetric if (p1, q1) and (p2, q2) are similarly arranged;
- (iii) balanced if p1 = q1 = 1/2 or p2 = q2 = 1/2, that is, if at least one of the marginal distributions is uniform;
- (iv) marginal asymmetric if (p1, q1) and (p2, q2) are oppositely arranged.
Note that strong marginal symmetry implies weak marginal symmetry. Furthermore, strong marginal symmetry coincides with the usual definition of a symmetric matrix.
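For concreteness, the classification in Definition 2 can be coded directly from the marginal totals. The sketch below is ours and not from the paper; the function name classify is illustrative, and exact floating-point comparisons are used for brevity (a tolerance would be needed in practice).

```python
def classify(a: float, b: float, c: float, d: float) -> str:
    """Classify a 2 x 2 table of cell proportions according to Definition 2."""
    p1, q1 = a + b, c + d
    p2, q2 = a + c, b + d
    if (p1 - q1) * (p2 - q2) > 0:        # similarly arranged, Definition 1(i)
        # Definition 2(i) additionally requires p1 != 1/2; that corner case is ignored here
        return "strongly marginal symmetric" if p1 == p2 else "weakly marginal symmetric"
    if (p1 - q1) * (p2 - q2) < 0:        # oppositely arranged, Definition 1(ii)
        return "marginal asymmetric"
    return "balanced"                    # at least one marginal distribution is uniform

print(classify(.60, .10, .10, .20))      # strongly marginal symmetric
print(classify(.40, .20, .00, .40))      # marginal asymmetric
```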
The following result relates some of the concepts in Definition 2 to the value of the chance-expected agreement p1p2 + q1q2. Lemma 3 is used in the proof of Theorem 5.
Lemma 3. For a 2 × 2 table P with marginal distributions (p1, q1) and (p2, q2) the following equivalences hold.
- (1) P is weakly marginal symmetric ⇔ p1p2 + q1q2 > 1/2;
- (2) P is balanced ⇔ p1p2 + q1q2 = 1/2;
- (3) P is marginal asymmetric ⇔ p1p2 + q1q2 < 1/2.
Proof. We prove equivalence 1. The other equivalences follow from using similar arguments.
(⇒) If P is weakly marginal symmetric, (p1, q1) and (p2, q2) are similarly arranged and we have (p1 − q1)(p2 − q2) > 0, or equivalently p1p2 + q1q2 > p1q2 + p2q1. Since (p1 + q1)(p2 + q2) = 1, the right-hand side equals 1 − (p1p2 + q1q2), and the inequality becomes 2(p1p2 + q1q2) > 1, that is, p1p2 + q1q2 > 1/2.
(⇐) If p1p2 + q1q2 > 1/2, then reversing the steps in (⇒) shows that (p1 − q1)(p2 − q2) > 0 must hold. Hence (p1, q1) and (p2, q2) are similarly arranged, and it follows that P is weakly marginal symmetric.
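The equivalences of Lemma 3 are easy to check numerically. The following sketch (ours, not from the paper; the function name expected_agreement is illustrative) evaluates the chance-expected agreement for three hypothetical pairs of marginal distributions.

```python
def expected_agreement(p1: float, p2: float) -> float:
    """Chance-expected agreement p1*p2 + q1*q2 with q1 = 1 - p1 and q2 = 1 - p2."""
    return p1 * p2 + (1 - p1) * (1 - p2)

print(expected_agreement(.70, .60) > .5)                 # similarly arranged: True
print(abs(expected_agreement(.50, .80) - .5) < 1e-12)    # one uniform marginal: True
print(expected_agreement(.60, .40) < .5)                 # oppositely arranged: True
```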
4. Main Results
In this section we present several marginal dependencies of Cohen’s kappa (Theorems 5, 7, and 10). The following lemma will be used repeatedly.
Lemma 4. For a fixed value of the observed agreement a + d < 1, coefficient (1) is strictly decreasing in p1p2 + q1q2.
Proof. The first order partial derivative of coefficient (1) with respect to p1p2 + q1q2 is (a + d − 1)/(1 − p1p2 − q1q2)², which is strictly negative because a + d < 1. Hence kappa is strictly decreasing in p1p2 + q1q2.
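A quick numerical illustration of Lemma 4 (our sketch, not part of the paper): holding the observed agreement fixed at a + d = .80 and increasing the expected agreement produces a strictly decreasing sequence of kappa-values.

```python
def kappa_from_agreements(p_o: float, p_e: float) -> float:
    """Coefficient (1) written in terms of observed and expected agreement."""
    return (p_o - p_e) / (1 - p_e)

values = [kappa_from_agreements(.80, p_e) for p_e in (.48, .50, .58, .68)]
print([round(v, 2) for v in values])                   # roughly .62, .60, .52, .38
print(all(x > y for x, y in zip(values, values[1:])))  # True: strictly decreasing
```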
The following result is a slightly stronger version of a theorem in Warrens [16] for a rating scale with two categories. Theorem 5 shows that, for a fixed value of the proportion of observed agreement a + d, 2 × 2 tables that possess weak marginal symmetry produce lower values of kappa than tables that are marginal asymmetric.
Theorem 5. Let P1, P2, and P3 be 2 × 2 tables with the same value of a + d < 1 that are, respectively, weakly marginal symmetric, balanced, and marginal asymmetric. Furthermore, let κ1, κ2, and κ3 denote the associated values of kappa. Then κ1 < κ2 < κ3.
Proof. Lemma 4 shows that κ is strictly decreasing in p1p2 + q1q2. The result then follows from application of Lemma 3.
Example 6 illustrates Theorem 5.
Example 6. Consider the three hypothetical 2 × 2 tables in Table 2.
Each table has the same proportion of observed agreement a + d = .80. Table 2(a) is strongly marginal symmetric, Table 2(b) is balanced, and Table 2(c) is marginal asymmetric. We have the double inequality κ1 < κ2 < κ3, which illustrates Theorem 5.
Table 2(a)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .60 | .10 | .70 | a + d = .80 |
| 0 | .10 | .20 | .30 | p1p2 + q1q2 = .58 |
| Total | .70 | .30 | 1 | κ1 = .52 |

Table 2(b)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .45 | .05 | .50 | a + d = .80 |
| 0 | .15 | .35 | .50 | p1p2 + q1q2 = .50 |
| Total | .60 | .40 | 1 | κ2 = .60 |

Table 2(c)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .40 | .20 | .60 | a + d = .80 |
| 0 | .00 | .40 | .40 | p1p2 + q1q2 = .48 |
| Total | .40 | .60 | 1 | κ3 = .62 |
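The kappa-values reported in Table 2 can be reproduced with the sketch below (our code; the helper kappa simply restates coefficient (1)).

```python
def kappa(a, b, c, d):
    p_e = (a + b) * (a + c) + (c + d) * (b + d)
    return (a + d - p_e) / (1 - p_e)

tables = {"2(a)": (.60, .10, .10, .20),   # strongly marginal symmetric
          "2(b)": (.45, .05, .15, .35),   # balanced
          "2(c)": (.40, .20, .00, .40)}   # marginal asymmetric
for name, cells in tables.items():
    print(name, round(kappa(*cells), 2))
# kappa increases from 2(a) to 2(c): about .52 < .60 < .62, as Theorem 5 predicts
```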
Theorem 5 also considers 2 × 2 tables that have asymmetric marginals. However, asymmetric tables may not be relevant in practice. If the classifications are hard to make, researchers will often make use of expert observers. Furthermore, novice observers usually receive some training before the actual classifications have to be made. Asymmetric tables are therefore rarely encountered in practice.
Theorem 7 shows that 2 × 2 tables that are strongly marginal symmetric may have lower kappa-values than tables with unequal marginal distributions.
Theorem 7. Let P1 be a weakly marginal symmetric table with marginals (p1, q1) and (p2, q2) and value κ1, and let P2 be a strongly marginal symmetric table with marginals (p3, q3) and value κ2. Furthermore, suppose that P1 and P2 have the same proportion of observed agreement a + d < 1. If

max(p1, q1), max(p2, q2) < max(p3, q3), (5)

then κ2 < κ1.
Proof. Due to the symmetries of the proportion of expected agreement p1p2 + q1q2, we may assume, without loss of generality, that p1 > q1, p2 > q2, and p3 > q3. It then follows from inequality (5) that p3 > p1, p2. Furthermore, since p3 > q3, we have p3 > 1/2, or 2p3 > 1, and since p1 > q1, we also have 2p1 > 1.
Since 2p3 − 1 > 0, multiplying both sides of the inequality p3 > p1 by 2p3 − 1 yields 2p3² − p3 > 2p1p3 − p1. Similarly, since 2p1 − 1 > 0 and p3 > p2, we have (2p1 − 1)(p3 − p2) > 0, that is, 2p1p3 − p3 > 2p1p2 − p2. Combining the two inequalities gives 2p3² − 2p3 > 2p1p2 − p1 − p2, and adding 1 to both sides yields p3² + q3² > p1p2 + q1q2, since q3 = 1 − p3, q1 = 1 − p1, and q2 = 1 − p2. Because P1 and P2 have the same proportion of observed agreement, it follows from Lemma 4 that κ2 < κ1.
Example 8 illustrates Theorem 7.
Example 8. Consider the two hypothetical 2 × 2 tables in Table 3.
Both tables have the same proportion of observed agreement a + d = .80. Table 3(a) is weakly marginal symmetric, whereas Table 3(b) is strongly marginal symmetric. Since max(p1, q1) = .65 and max(p2, q2) = .75 are both smaller than max(p3, q3) = .80, condition (5) is satisfied, and indeed κ2 = .38 < .53 = κ1, which illustrates Theorem 7.
Table 3(a)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .60 | .05 | .65 | a + d = .80 |
| 0 | .15 | .20 | .35 | p1p2 + q1q2 = .58 |
| Total | .75 | .25 | 1 | κ1 = .53 |

Table 3(b)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .70 | .10 | .80 | a + d = .80 |
| 0 | .10 | .10 | .20 | p3p3 + q3q3 = .68 |
| Total | .80 | .20 | 1 | κ2 = .38 |
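As a check (our sketch, not part of the paper), the two kappa-values of Table 3 and condition (5) can be verified numerically.

```python
def kappa(a, b, c, d):
    p_e = (a + b) * (a + c) + (c + d) * (b + d)
    return (a + d - p_e) / (1 - p_e)

k1 = kappa(.60, .05, .15, .20)   # Table 3(a): marginals (.65, .35) and (.75, .25)
k2 = kappa(.70, .10, .10, .10)   # Table 3(b): marginals (.80, .20) for both observers
print(max(.65, .35) < .80 and max(.75, .25) < .80)   # condition (5) holds: True
print(k2 < k1)                                       # True: k2 is about .38, k1 about .53
```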
Using similar arguments as in the proof of Theorem 7 we may obtain the following result.
Theorem 9. Let P1 be a weakly marginal symmetric table with marginals (p1, q1) and (p2, q2) and κ-value κ1, and let P2 be a strongly marginal symmetric table with marginals (p3, q3) and κ-value κ2. Furthermore, suppose that P1 and P2 have the same proportion of observed agreement a + d < 1. If max(p1, q1), max(p2, q2) > max(p3, q3), then κ1 < κ2.
If we have p1 > q1, p2 > q2, and p3 > q3, then Theorems 7 and 9 cover the cases p1, p2 < p3 and p1, p2 > p3, respectively. The cases p1 > p3 > p2 and p2 > p3 > p1 turn out to be more complicated.
Another marginal dependence of kappa is presented in Theorem 10. The theorem shows that, for a constant value of the proportion of observed agreement a + d, 2 × 2 tables that exhibit weak marginal symmetry may produce higher kappa-values than tables with strong marginal symmetry. Theorem 10 is similar to Theorem 7, but conditions (5) and (15) are different requirements.
Theorem 10. Let P1 be a weakly marginal symmetric table with marginals (p1, q1) and (p2, q2) and value κ1, and let P2 be a strongly marginal symmetric table with marginals (p3, q3) and value κ2. Furthermore, suppose that P1 and P2 have the same proportion of observed agreement a + d < 1. If

max(p1, q1) · max(p2, q2) < (max(p3, q3))², (15)

then κ2 < κ1.
Proof. Due to the symmetries of the proportion of expected agreement p1p2 + q1q2, we may assume, without loss of generality, that p1p2 > q1q2 and p3 > q3. It then follows from inequality (15) that p3² > max(p1, q1) · max(p2, q2) ≥ p1p2, that is, p3 > √(p1p2). Furthermore, since P1 is weakly marginal symmetric we must have p1 > q1 and p2 > q2. It follows that p1, p2 > 1/2, and thus √(p1p2) > 1/2.
For p ∈ (0, 1) the function p ↦ p(1 − p) is concave with a maximum at p = 1/2, and hence strictly decreasing on (1/2, 1). Since p3 > √(p1p2) > 1/2, we have p3q3 = p3(1 − p3) < √(p1p2)(1 − √(p1p2)). Together with the arithmetic-geometric mean inequality 2√(p1p2) ≤ p1 + p2 this gives 2p3q3 < 2√(p1p2) − 2p1p2 ≤ p1 + p2 − 2p1p2 = p1q2 + p2q1. Hence p3² + q3² = 1 − 2p3q3 > 1 − p1q2 − p2q1 = p1p2 + q1q2, and since P1 and P2 have the same proportion of observed agreement, Lemma 4 yields κ2 < κ1.
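The inequality used in the proof can also be checked numerically. The sketch below (ours, not from the paper) samples marginals with p1, p2 > 1/2 and p3 > √(p1p2) and confirms that the strongly marginal symmetric table then has the larger expected agreement; the small tolerance only guards against floating-point ties.

```python
import random

random.seed(0)
for _ in range(100_000):
    p1, p2 = random.uniform(.5, 1), random.uniform(.5, 1)
    p3 = random.uniform((p1 * p2) ** .5, 1)      # condition (15) after the WLOG step
    weak = p1 * p2 + (1 - p1) * (1 - p2)         # expected agreement of P1
    strong = p3 * p3 + (1 - p3) * (1 - p3)       # expected agreement of P2
    assert strong > weak - 1e-12, (p1, p2, p3)
print("no counterexample found")
```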
Example 11 illustrates Theorem 10 and a corollary of Theorem 10.
Example 11. Consider the four hypothetical 2 × 2 tables in Table 4.
Each table has the same proportion of observed agreement a + d = .80. Tables 4(a) and 4(d) are strongly marginal symmetric, whereas Tables 4(b) and 4(c) are weakly marginal symmetric. For the largest marginals of Tables 4(a) and 4(b) we have (.70)(.70) > (.60)(.80) and κ1 < κ2, which illustrates Theorem 10.
For many 2 × 2 tables from the literature the converse of Theorem 10 also holds. However, Tables 4(a) and 4(c) provide a counterexample showing that the converse does not hold in general: we have κ1 < κ3, although (.70)(.70) = .49 is not greater than (.62)(.80) ≈ .50.
Finally, Tables 4(a) and 4(d) illustrate a special application of Theorem 10. If two 2 × 2 tables are strongly marginal symmetric and have the same proportion of observed agreement a + d, then the table with the most skewed (unbalanced) marginals (Table 4(a)) has the lowest value of kappa. This is illustrated by the fact that κ1 < κ4.
Table 4(a)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .60 | .10 | .70 | a + d = .80 |
| 0 | .10 | .20 | .30 | p1p2 + q1q2 = .58 |
| Total | .70 | .30 | 1 | κ1 = .52 |

Table 4(b)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .60 | .00 | .60 | a + d = .80 |
| 0 | .20 | .20 | .40 | p1p2 + q1q2 = .56 |
| Total | .80 | .20 | 1 | κ2 = .55 |

Table 4(c)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .61 | .01 | .62 | a + d = .80 |
| 0 | .19 | .19 | .38 | p1p2 + q1q2 = .57 |
| Total | .80 | .20 | 1 | κ3 = .53 |

Table 4(d)

| | 1 | 0 | Total | |
|---|---|---|---|---|
| 1 | .50 | .10 | .60 | a + d = .80 |
| 0 | .10 | .30 | .40 | p1p2 + q1q2 = .52 |
| Total | .60 | .40 | 1 | κ4 = .58 |
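Finally, the values in Table 4 and the role of condition (15) can be verified with the following sketch (our code, not part of the paper).

```python
def kappa(a, b, c, d):
    p_e = (a + b) * (a + c) + (c + d) * (b + d)
    return (a + d - p_e) / (1 - p_e)

k1 = kappa(.60, .10, .10, .20)   # 4(a): strongly marginal symmetric, largest marginal .70
k2 = kappa(.60, .00, .20, .20)   # 4(b): weakly marginal symmetric, largest marginals .60, .80
k3 = kappa(.61, .01, .19, .19)   # 4(c): weakly marginal symmetric, largest marginals .62, .80
k4 = kappa(.50, .10, .10, .30)   # 4(d): strongly marginal symmetric, largest marginal .60

print(.70 * .70 > .60 * .80, k1 < k2)   # True True:  condition (15) holds and k1 < k2
print(.70 * .70 > .62 * .80, k1 < k3)   # False True: the converse of Theorem 10 fails
print(k1 < k4)                          # True: the more skewed symmetric table has lower kappa
```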
5. Conclusions
Cohen’s kappa is presently a standard tool for the analysis of agreement in a 2 × 2 reliability study. The statistic is frequently used as a sample statistic. Various authors have observed in this context that if two pairs of raters have the same amount of observed agreement, the pair whose marginal distributions are more similar to each other may have a lower kappa-value than the pair with more divergent marginal distributions. These properties of Cohen’s kappa have almost exclusively been demonstrated and described by means of examples of 2 × 2 tables [4, 8–11, 16]. In this paper we presented exact formulations and proved several marginal dependencies of this type (Theorems 5, 7, and 10). In general, they show that, for 2 × 2 tables with the same value of observed agreement, tables with marginal distributions that are more similar may have lower associated kappa-values than tables with marginal distributions that are less similar. Each result was illustrated by an example with hypothetical 2 × 2 tables. The results provide a better understanding of the 2 × 2 kappa when it is used as a sample statistic.
Theorem 5 considers 2 × 2 tables that have asymmetric marginals. Although several authors have provided examples with asymmetric marginals, asymmetric tables may not be relevant in practice. If the classifications are hard to make, researchers will often make use of expert observers. Furthermore, novice observers usually receive some training before the actual classifications have to be made. Asymmetric tables are therefore rarely encountered in practice.
Vach [14] emphasizes that kappa should not simply be interpreted as a measure of agreement but that Cohen’s kappa expresses the degree to which observed agreement exceeds the agreement that was expected by chance. The marginal dependencies are a direct consequence of the definition of kappa and its aim to adjust the observed agreement with respect to the expected amount of agreement under chance conditions [14, p. 659]. This dependence is therefore not a reason for discarding Cohen’s kappa.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The author thanks two anonymous reviewers for their helpful comments and valuable suggestions on an earlier version of this paper. This research is part of Veni project 451-11-026 funded by the Netherlands Organisation for Scientific Research.