A popular-science article on the state of the art of historical OCR.
Estrarre parole dalle immagini nell'era digitale: alcune osservazioni sull'OCR storico [Extracting Words from Images in the Digital Era: Some Observations on Historical OCR]
Federico Boschetti
2017
Abstract
This article discusses techniques and practices for extracting textual content from images of printed editions. Optical Character Recognition (OCR) applied to scholarly editions of classical texts or to early printed editions is a challenging task, owing both to material issues, such as the poor quality of paper damaged by time, and to linguistic issues, such as the lack of linguistic models suited to a specific linguistic variety. The article illustrates some common strategies for improving historical OCR accuracy, such as aligning the textual sequences generated by different OCR engines and incrementally enriching suitable linguistic models. Finally, some practices of collaborative OCR proofreading are described and discussed.
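The alignment strategy mentioned in the abstract can be illustrated with a minimal sketch. It is not the method used in the article: it assumes just two OCR outputs, already available as plain strings, and uses the character-level matcher from Python's standard `difflib` module as a stand-in for whatever alignment algorithm a real pipeline would use. The function names `align_ocr_outputs` and `disagreements` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def align_ocr_outputs(text_a, text_b):
    """Align two OCR readings of the same line, character by character.

    Returns a list of (tag, segment_a, segment_b) tuples, where tag is one of
    'equal', 'replace', 'delete' or 'insert' (the difflib opcode names).
    """
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # characters in long sequences, which could distort the alignment.
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    return [
        (tag, text_a[i1:i2], text_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def disagreements(text_a, text_b):
    """Keep only the spans where the two readings differ."""
    return [
        (seg_a, seg_b)
        for tag, seg_a, seg_b in align_ocr_outputs(text_a, text_b)
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Hypothetical outputs of two engines for the same printed line.
    engine_a = "the quick brovvn fox"
    engine_b = "the quick brown fox"
    for seg_a, seg_b in disagreements(engine_a, engine_b):
        print(f"engine A read {seg_a!r}, engine B read {seg_b!r}")
```

Spans on which the engines agree can be accepted with more confidence, while the disagreeing spans are natural candidates for a dictionary or language-model check, or for manual proofreading.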