How was inter-annotator agreement measured?

Measurement context

IL0-to-IL1 annotation.

10 annotators.

Each annotator annotated 72 texts (6 languages, 6 articles per language, 2 translations per article).

Each text contained about 250 words, so 72 texts contained about 18,000 words.

Of those roughly 18,000 words, 1,268 (about 7%) were to be annotated: the nouns, verbs, adjectives, and adverbs.
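
A quick check of the arithmetic above, in Python (only the counts come from the source):

    texts_per_annotator = 6 * 6 * 2                   # languages * articles * translations = 72
    words_per_annotator = texts_per_annotator * 250   # about 18,000
    annotated_fraction = 1268 / words_per_annotator   # about 0.07, i.e. 7%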

Explicit agreement measure

Consider any pair of annotators and any token in the corpus.

Call each annotator's "total count" the count of categories assigned to the token.

Call each annotator's "shared count" the count of the categories in that annotator's assignment to the token that were also in the other annotator's assignment to the token.

Then the agreement of that pair of annotators on that token is the sum of the 2 shared counts divided by the sum of the 2 total counts.

The agreement of that pair of annotators on the corpus is the mean of the pair's agreements over all the annotated tokens.

The explicit agreement measure is the mean of all pairs' agreements on the corpus.
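
A minimal sketch of the explicit measure in Python (the data layout and all names below are my own assumptions, not from the source): each annotator's annotation of a token is a set of categories, per-token agreement is the ratio described above, and results are averaged first over tokens and then over annotator pairs.

    from itertools import combinations

    def token_agreement(cats_a, cats_b):
        # Sum of the two "shared counts" is 2 * |intersection|;
        # sum of the two "total counts" is |cats_a| + |cats_b|.
        if not cats_a and not cats_b:
            return 1.0  # assumption: two empty assignments count as full agreement
        return 2 * len(cats_a & cats_b) / (len(cats_a) + len(cats_b))

    def pair_agreement(tokens_a, tokens_b):
        # Mean per-token agreement of one annotator pair; tokens_a and
        # tokens_b are lists of category sets in the same token order.
        scores = [token_agreement(a, b) for a, b in zip(tokens_a, tokens_b)]
        return sum(scores) / len(scores)

    def explicit_agreement(annotations_by_annotator):
        # Mean of the corpus-level agreements over all annotator pairs.
        pairs = list(combinations(annotations_by_annotator.values(), 2))
        return sum(pair_agreement(a, b) for a, b in pairs) / len(pairs)

For example, two annotators who assign {money, river} and {money} to the same token get a per-token agreement of 2/3.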

Implicit agreement measure

More cryptically described. Apparently, it is the fraction of all categories (e.g., 110,000 word senses) that the two members of an annotator pair treat the same way for a given word: both assign the category to the word, or both withhold it. Thus, if you assign only "buffalo" and I assign only "whisky" to "bank" and there are 110,000 senses, our agreement on "bank" is 109,998/110,000.
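
Under that reading, per-token implicit agreement is one minus the size of the pair's symmetric difference divided by the size of the category inventory. A sketch under that assumption (names are illustrative):

    def implicit_token_agreement(cats_a, cats_b, n_categories):
        # Categories the pair treats differently: assigned by one
        # annotator but not the other.
        disagreements = len(cats_a ^ cats_b)
        return (n_categories - disagreements) / n_categories

    # The "bank" example: {"buffalo"} vs. {"whisky"} with 110,000 senses
    # gives (110000 - 2) / 110000 = 109998/110000.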

Derived measures

Kappa statistic derived from each agreement measure.
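
The source does not say how chance agreement was estimated, so presumably this is just the standard kappa correction applied to each observed agreement value. A sketch of the formula only:

    def kappa(observed_agreement, expected_agreement):
        # How far observed agreement exceeds chance agreement,
        # relative to the maximum possible excess over chance.
        return (observed_agreement - expected_agreement) / (1 - expected_agreement)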