Assessing five common measures of interobserver reliability, proposing new refined measures
Chmil, Shawn M.
Researchers often need to determine the extent of agreement between two raters when the data are measured on an ordinal scale. Five common measures of interobserver reliability are the overall proportion of agreement, Cohen's kappa, weighted kappa, the disagreement rate and the concordance between raters. A number of studies have assessed interobserver reliability, including some that express reservations about these measures and others that identify several paradoxes. Chance-corrected measures of agreement are known to produce paradoxical and counter-intuitive results. Moreover, if measures are to be adjusted for chance agreement, the guessing mechanism must be specified properly and precisely, because the current assumption that all observations are guessed is impractical. The inadequacies of these measures are discussed and, in light of their deficiencies, new measures are proposed. The assumption that some but not all observations are guessed is used to develop three new measures of interobserver reliability: the partial-chance proportion, the partial-chance kappa and the expected-chance proportion. Simulations are used to compare the finite-sample performance of these measures. In the simulations, the concordance between raters produced the best results in terms of bias, efficiency and the empirical distributions of critical ratios, closely followed by the partial-chance proportion, the expected-chance proportion and the partial-chance kappa. The recommended measures of interobserver reliability are therefore the concordance between raters, the partial-chance proportion, the expected-chance proportion and the partial-chance kappa. Although the concordance between raters is highly recommended, it should be used with caution because it rests on assumptions that are impractical in clinical practice.
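As an illustration of the chance-corrected measures discussed above, the following minimal sketch computes the overall proportion of agreement, Cohen's kappa and a weighted kappa from a two-rater contingency table. It uses the standard textbook definitions, not the new measures proposed in this work, which the abstract does not define. The function name `agreement_measures` and the choice of linear weights are assumptions made for this example only.

```python
def agreement_measures(table):
    """Return (p_o, kappa, weighted_kappa) for a square contingency table.

    table[i][j] is the number of subjects placed in category i by rater A
    and category j by rater B. Categories are assumed to be ordered.
    Hypothetical helper for illustration; linear weights are an assumption.
    """
    k = len(table)
    n = sum(sum(row) for row in table)

    # Marginal proportions for each rater.
    row_marg = [sum(table[i]) / n for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    # Observed agreement and agreement expected if both raters guessed
    # independently according to their marginals.
    p_o = sum(table[i][i] for i in range(k)) / n
    p_e = sum(row_marg[i] * col_marg[i] for i in range(k))
    kappa = (p_o - p_e) / (1 - p_e)

    # Linear agreement weights: 1 on the diagonal, 0 at maximal distance.
    def weight(i, j):
        return 1 - abs(i - j) / (k - 1)

    p_o_w = sum(weight(i, j) * table[i][j]
                for i in range(k) for j in range(k)) / n
    p_e_w = sum(weight(i, j) * row_marg[i] * col_marg[j]
                for i in range(k) for j in range(k))
    kappa_w = (p_o_w - p_e_w) / (1 - p_e_w)

    return p_o, kappa, kappa_w


# Skewed 2x2 table: the raters agree on 90 of 100 subjects.
skewed = [[90, 5],
          [5, 0]]
p_o, kappa, kappa_w = agreement_measures(skewed)
print(f"p_o={p_o:.3f}  kappa={kappa:.3f}  weighted kappa={kappa_w:.3f}")
```

For the skewed table the observed agreement is 0.90 while kappa is slightly negative (about -0.05), an instance of the kind of counter-intuitive behaviour that chance-corrected measures can exhibit when the marginal distributions are highly unbalanced.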