TU Berlin

Machine Learning


Autonomous Document Cleaning – A Generative Approach to Reconstruct Strongly Corrupted Scanned Texts
Citation key Dai2014
Author Dai, Z. and Lücke, J.
Pages 1950–1962
Year 2014
DOI 10.1109/TPAMI.2014.2313126
Journal IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume 36
Number 10
Abstract We study the task of cleaning scanned text documents that are strongly corrupted by dirt such as manual line strokes, spilled ink, etc. We aim at autonomously removing such corruptions from a single letter-size page based only on the information the page contains. Our approach first learns character representations from document patches without supervision. For learning, we use a probabilistic generative model parameterizing pattern features, their planar arrangements and their variances. The model's latent variables describe pattern position and class, and feature occurrences. Model parameters are efficiently inferred using a truncated variational EM approach. Based on the learned representation, a clean document can be recovered by identifying, for each patch, pattern class and position while a quality measure allows for discrimination between character and non-character patterns. For a full Latin alphabet we found that a single page does not contain sufficiently many character examples. However, even if heavily corrupted by dirt, we show that a page containing a lower number of character types can efficiently and autonomously be cleaned solely based on the structural regularity of the characters it contains. In different example applications with different alphabets, we demonstrate and discuss the effectiveness, efficiency and generality of the approach.
