Hello, All.
This query is mainly about classification "reliability" (for lack of a better term -- accuracy? agreement? consistency?). Below I first summarize the problem then elaborate for anyone interested. I'd appreciate your thoughts about potential strategies, relevant concepts or terminology, or resources to investigate (e.g., publications, software, authors).
SUMMARY: I'm seeking useful quantitative ways to compare two or more classifiers' assignments of sampled objects to four unordered categories. Potential complications are that (a) the sample to classify may contain at most a few hundred objects, (b) we'd like to make statistical inferences (e.g., interval estimates), (c) the two most important categories are rather rare, (d) discrepancy between classifiers is more problematic for some pairs of categories than others, and (e) some classifiers are more authoritative than others.
ELABORATION: For one of my nonprofit clients I help compile a bibliographic database of methodological resources for research synthesis, such as articles, guidance, and grey literature about meta-analysis. We've begun sharing some of its approximately 20k resources' bibliographic records with a major U.S. government partner organization (PO) whose better visibility and infrastructure could markedly increase potential users' awareness and access. Part of this sharing process entails partitioning the records into four categories: G (guidance), R (review), S (study), and N (no) based on each resource's topic(s) and other attributes. We share those in G, R, and S with the PO, including their G/R/S classification and other metadata.
That G/R/S/N classification task is challenging, especially under time constraints with limited information about each resource (e.g., title and abstract), and may be less reliable than we'd like. To assess and potentially improve it, such as with better category definitions and supporting documentation, we're planning a study in which Teams U (usual process), C (client consensus), and P (PO consensus) classify the same sample of records. Besides inspecting discrepancies qualitatively to understand reasons for unreliability -- based on each team's assigned categories, confidence ratings, and open-ended comments -- we'd like to compare the teams' classifications quantitatively. For example, we might express reliability as a scalar quantity for either all teams, categories, and records or a subset of teams, categories, or records. Similarly, we might compare individual team members' classifications quantitatively, such as before they reach consensus.
Below are five complicating issues we should probably consider when choosing a strategy to quantify reliability:
1. Number of Sampled Records: Due to constraints on time and other resources, each team will probably classify between about 100 and 300 records. If that sample is too small in important ways, we might consider other designs in which at least one team classifies only a subset of records (e.g., incomplete block).
2. Statistical Inference: One reason to quantify reliability is to construct an interval estimate for, or test a hypothesis about, the chosen quantity. For instance, we may want to decide whether the three teams' reliability -- or that for two teams -- exceeds an a priori adequacy cut-off. We'd consider frequentist, Bayesian, resampling, or other approaches.
3. Rare Important Categories: The categories G and R are arguably most important (e.g., more valuable to more database users), but previous classification of about 10k records indicates G and R are rare -- each about 1% (vs. about 50% each for S and N). To increase their representation in the sample, we've considered oversampling G and R records based on that previous classification, but we don't want that to distort our quantification of reliability in the "natural" population where G and R are rare. This also relates to blinding teams to each category's percentage.
4. More Problematic Discrepancies: Discrepancy between teams is arguably more problematic for some pairs of categories than others, and for some pairs one direction of misclassification is more detrimental than the other. For instance, misclassifying a G or R record as N may be most costly (e.g., not sharing an important type of record with the PO), while misclassifying an S record as R may be less problematic.
5. More Authoritative Teams: We view Teams C and P as more "authoritative" than Team U. While the latter's classification represents operational performance under realistic time and other constraints, each of the other two will ideally entail careful consensus among multiple experts with the luxury of more time and access to information (e.g., full text). Although classifications from Teams C and P are more like gold standards than Team U's, none of them is infallible.
With all that in mind, which indexes, statistical models, or other quantitative strategies are good candidates for the above situation?
As a crude approach that may address Issues #2, #3, and #4 above, we could oversample from G and R, estimate proportions in the three teams' 4 x 4 x 4 table (or its 4 x 4 marginals) with an oversampling adjustment, compute a weighted kappa statistic from those proportion estimates, and construct a bootstrap confidence interval for kappa. Also, to partly address Issue #1 we could run simulations of that approach to investigate the influence of sample size, oversampling scheme, weighting scheme, and other choices.
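For concreteness, here is a minimal sketch of that crude approach in Python with NumPy, restricted to two teams. Everything specific in it is an assumption for illustration only: the function names, the simulated classifications, the stratum shares (1%, 1%, 49%, 49%), the per-stratum sample sizes, and the cost values in the weighting scheme are all made up, and the percentile bootstrap is just one of several interval methods we could use. Records are weighted by (population share) / (sample share) of their previous-classification stratum to undo the oversampling, and the bootstrap resamples within strata.

```python
import numpy as np

CATS = ["G", "R", "S", "N"]  # category codes 0..3

def weighted_table(a, b, w, k=4):
    """Design-weighted k x k table of joint proportions for two raters."""
    table = np.zeros((k, k))
    np.add.at(table, (a, b), w)  # accumulate each record's sampling weight
    return table / table.sum()

def weighted_kappa(p, cost):
    """Weighted kappa from joint proportions p and a disagreement-cost matrix."""
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))  # chance agreement
    denom = (cost * expected).sum()
    if denom == 0:  # degenerate table: kappa is undefined
        return np.nan
    return 1.0 - (cost * p).sum() / denom

def bootstrap_ci(a, b, stratum, w, cost, n_boot=1000, level=0.95, seed=0):
    """Percentile interval from a bootstrap that resamples within strata."""
    rng = np.random.default_rng(seed)
    groups = [np.flatnonzero(stratum == s) for s in np.unique(stratum)]
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = np.concatenate(
            [rng.choice(g, size=g.size, replace=True) for g in groups]
        )
        stats[i] = weighted_kappa(weighted_table(a[idx], b[idx], w[idx]), cost)
    alpha = 1.0 - level
    return np.nanquantile(stats, [alpha / 2, 1 - alpha / 2])

# --- Hypothetical data: strata come from the previous classification ---
rng = np.random.default_rng(1)
pop_share = np.array([0.01, 0.01, 0.49, 0.49])  # assumed population shares
n_per_stratum = np.array([40, 40, 60, 60])      # assumed oversampling of G, R
stratum = np.repeat(np.arange(4), n_per_stratum)
w = (pop_share / (n_per_stratum / n_per_stratum.sum()))[stratum]

def noisy(labels, p_err):
    """Simulated team: replace a label with a random category w.p. p_err."""
    flip = rng.random(labels.size) < p_err
    return np.where(flip, rng.integers(0, 4, labels.size), labels)

team_c = noisy(stratum, 0.10)  # stand-in for a more careful team
team_u = noisy(stratum, 0.25)  # stand-in for the usual process

# Asymmetric costs: rows = Team C's category, columns = Team U's category.
cost = np.ones((4, 4)) - np.eye(4)
cost[[0, 1], 3] = 3.0  # C says G or R, U says N: treated as most costly
cost[2, 1] = 0.5       # C says S, U says R: treated as less costly

kappa = weighted_kappa(weighted_table(team_c, team_u, w), cost)
lo, hi = bootstrap_ci(team_c, team_u, stratum, w, cost)
print(f"weighted kappa = {kappa:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Wrapping the data-generating part in a loop over sample sizes, oversampling schemes, and cost matrices would give the kind of simulation mentioned above. Note that an asymmetric cost matrix only makes sense when the two teams play distinct roles (rows vs. columns), which is the case here.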
However, that approach doesn't address Issue #5, and I'm somewhat familiar with criticisms of kappa. Are there trustworthy approaches in pertinent literatures, such as diagnostic accuracy (e.g., based on sensitivity and specificity), interrater reliability, or loglinear models for categorical data? I'll leave it at that for now. Thanks in advance for ideas.
Cheers,
Adam
------------------------------
Adam Hafdahl
Owner & Principal Consultant
ARCH Statistical Consulting, LLC
------------------------------