A general-audience article on the state of the art in historical OCR.

Estrarre parole dalle immagini nell'era digitale: alcune osservazioni sull'OCR storico (Extracting Words from Images in the Digital Age: Some Observations on Historical OCR)

Federico Boschetti
2017

Abstract

This article discusses techniques and practices for extracting textual content from images of printed editions. Optical Character Recognition (OCR) applied to scholarly editions of classical texts or to early printed editions is a challenging task, both for material reasons, such as the poor quality of paper damaged by time, and for linguistic reasons, such as the lack of language models suited to a specific linguistic variety. The article illustrates some common strategies for improving historical OCR accuracy, such as aligning the text sequences generated by different OCR engines and incrementally enriching suitable language models. Finally, some practices of collaborative OCR proofreading are described and discussed.
DC field Value Language
dc.authority.ancejournal LA RIVISTA DI ENGRAMMA -
dc.authority.people Federico Boschetti it
dc.collection.id.s b3f88f24-048a-4e43-8ab1-6697b90e068e *
dc.collection.name 01.01 Articolo in rivista *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.date.accessioned 2024/02/20 23:58:58 -
dc.date.available 2024/02/20 23:58:58 -
dc.date.issued 2017 -
dc.description.abstractita articolo divulgativo sullo stato dell'arte dell'OCR storico. -
dc.description.affiliations CNR-ILC -
dc.description.allpeople Boschetti, Federico -
dc.description.allpeopleoriginal Federico Boschetti -
dc.description.fulltext none en
dc.description.numberofauthors 1 -
dc.identifier.uri https://hdl.handle.net/20.500.14243/338981 -
dc.identifier.url http://www.engramma.it/eOS/index.php?id_articolo=3228 -
dc.language.iso ita -
dc.relation.volume 150 -
dc.subject.keywords ocr storico -
dc.subject.singlekeyword ocr storico *
dc.title Estrarre parole dalle immagini nell'era digitale: alcune osservazioni sull'OCR storico en
dc.type.driver info:eu-repo/semantics/article -
dc.type.full 01 Contributo su Rivista::01.01 Articolo in rivista it
dc.type.miur 262 -
dc.type.referee No -
dc.ugov.descaux1 382433 -
iris.orcid.lastModifiedDate 2024/04/04 14:55:38 *
iris.orcid.lastModifiedMillisecond 1712235338863 *
iris.sitodocente.maxattempts 1 -
Appears in the following types: 01.01 Articolo in rivista (journal article)
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/338981