Detection and Correction of Errors
in Corpus Annotation

Project Description

Project Summary

The success of data-driven approaches and stochastic modeling in computational linguistic research and applications is rooted in the availability of electronic natural language corpora. Despite the central role that annotated corpora play in computational linguistic research and applications, the question of how errors in corpus annotation can be detected and corrected has received little attention. The project is designed to address this important gap by exploring an error detection and correction method with potential applicability to a wide range of corpus annotations.

Corpora are essential for training algorithms in tagging and morphological analysis, parsing, term and name identification, word sense disambiguation, anaphora resolution, and other tasks, and they are of use in human language technology applications such as machine translation, document classification, and information retrieval. The supervised models generally used require training on corpora annotated with the linguistic properties to be learned, such as morphological, syntactic, semantic, and discourse distinctions. Annotated corpora are also essential as gold standards for testing the performance of human language technology, regardless of whether it is statistical or rule-based in nature. Van Halteren and other researchers have shown, however, that language technology has advanced to a point where the current quality of gold-standard annotation can seriously undermine the use of such corpora for training and evaluation. Detecting and correcting errors in gold-standard annotation is thus crucial for enabling further progress.

Intellectual Merit

The project will explore and generalize a new error detection approach introduced by the co-PIs Detmar Meurers and Markus Dickinson at EACL'03, where it was shown to detect, with high precision, a large number of annotation errors in the part-of-speech annotation of the Wall Street Journal corpus of the Penn Treebank. The approach automatically detects variation in the annotation of comparable n-grams in the corpus and classifies these variations into ambiguities and annotation errors. The variation detection approach is particularly well-suited for important gold-standard corpora, since variation is often introduced during the manual annotation or correction stages used to produce such annotation.
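The detection step described above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the project's actual implementation: it scans a POS-tagged corpus for word n-grams that recur with differing tags at some position (a "variation nucleus"); the function name, the fixed window size, and the toy tagset are all assumptions for the sketch.

```python
from collections import defaultdict

def variation_ngrams(tagged_sents, n=3):
    """Find word n-grams that recur with differing tags at some position.

    tagged_sents: list of sentences, each a list of (word, tag) pairs.
    Returns a dict mapping (word n-gram, position) to the set of tags
    observed there, keeping only entries with more than one tag, i.e.
    the variation nuclei that are candidates for annotation errors.
    """
    seen = defaultdict(set)
    for sent in tagged_sents:
        for i in range(len(sent) - n + 1):
            window = sent[i:i + n]
            words = tuple(w for w, _ in window)
            for j, (_, tag) in enumerate(window):
                seen[(words, j)].add(tag)
    return {key: tags for key, tags in seen.items() if len(tags) > 1}

# Toy example: "ward" occurs in the same trigram context with two tags.
corpus = [
    [("to", "TO"), ("ward", "VB"), ("off", "RP")],
    [("to", "TO"), ("ward", "NN"), ("off", "RP")],
]
nuclei = variation_ngrams(corpus, n=3)
# One nucleus: position 1 of ("to", "ward", "off") with tags {"VB", "NN"}
```

Separating the detected variation into genuine ambiguity and annotation error is the harder classification step; one heuristic from the EACL'03 work is that variation within a long identical context is more likely to be an error than an ambiguity.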

The variation detection method is in principle language-independent and applicable to a wide range of annotation types. The project will explore how the method can be applied effectively to detect errors in different types of annotation; a paper presented by the co-PIs at ACL'05 confirms the feasibility of this undertaking. The project will extend the approach to linguistic dependency annotation and explore the relation of the method to fraud and anomaly detection approaches developed outside of computational linguistics. To determine how the error recall rate can be increased, the project will investigate generalizing the definition of comparable n-grams. Finally, the project will explore an approach to automatically correct the errors detected by the variation n-gram method; testing this approach will include an investigation of the effect of cleaned training data on human language technology.

Broader Impact

The project is expected to have an important theoretical and practical impact. Developing and testing methods for detecting and correcting errors in a wide range of corpus annotations will improve the quality of annotated corpora as well as the annotation schemata at the heart of current research and application development in human language technology.

The potential broader impact beyond language technology is significant. The error detection methodology developed by the project is in principle applicable to any collection of data that encodes judgments or classifications of repeated data subunits. To connect the method to approaches outside of computational linguistics, the project will include an exploration of its relation to application domains such as fraud detection and medical diagnosis validation.

Extended Project Description with References (pdf)