CATMuS Medieval: A Multilingual Large-Scale Cross-Century Dataset in Latin Script for Handwritten Text Recognition and Beyond

Boschetti, Federico;
2024

Abstract

The surge in digitisation initiatives by Cultural Heritage institutions has made numerous historical manuscripts accessible online. However, a substantial portion of these documents exists solely as images, lacking machine-readable text. Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting these images into machine-readable formats, enabling researchers and scholars to analyse vast collections efficiently. Despite significant technological progress, establishing consistent ground truth across projects for HTR tasks remains challenging, particularly for complex and heterogeneous historical sources like medieval manuscripts in Latin scripts (8th to 15th century CE).
DC Field Value Language
dc.authority.anceserie LECTURE NOTES IN COMPUTER SCIENCE en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Clérice, Thibault en
dc.authority.people Pinche, Ariane en
dc.authority.people Vlachou-Efstathiou, Malamatenia en
dc.authority.people Chagué, Alix en
dc.authority.people Camps, Jean-Baptiste en
dc.authority.people Levenson, Matthias Gille en
dc.authority.people Brisville-Fertin, Olivier en
dc.authority.people Boschetti, Federico en
dc.authority.people Fischer, Franz en
dc.authority.people Gervers, Michael en
dc.authority.people Boutreux, Agnès en
dc.authority.people Manton, Avery en
dc.authority.people Gabay, Simon en
dc.authority.people O'Connor, Patricia en
dc.authority.people Haverals, Wouter en
dc.authority.people Kestemont, Mike en
dc.authority.people Vandyck, Caroline en
dc.authority.people Kiessling, Benjamin en
dc.collection.id.s 8c50ea44-be95-498f-946e-7bb5bd666b7c *
dc.collection.name 02.01 Contributo in volume (Capitolo o Saggio) *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.date.accessioned 2024/12/03 17:05:05 -
dc.date.available 2024/12/03 17:05:05 -
dc.date.firstsubmission 2024/10/11 15:50:36 *
dc.date.issued 2024 -
dc.date.submission 2024/10/11 15:50:36 *
dc.description.abstracteng The surge in digitisation initiatives by Cultural Heritage institutions has facilitated online accessibility to numerous historical manuscripts. However, a substantial portion of these documents exists solely as images, lacking machine-readable text. Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting these images into machine-readable formats, enabling researchers and scholars to analyse vast collections efficiently. Despite significant technological progress, establishing consistent ground truth across projects for HTR tasks, particularly for complex and heterogeneous historical sources like medieval manuscripts in Latin scripts (8th-15th century CE), remains nonetheless challenging. -
dc.description.allpeople Clérice, Thibault; Pinche, Ariane; Vlachou-Efstathiou, Malamatenia; Chagué, Alix; Camps, Jean-Baptiste; Levenson, Matthias Gille; Brisville-Fertin, Olivier; Boschetti, Federico; Fischer, Franz; Gervers, Michael; Boutreux, Agnès; Manton, Avery; Gabay, Simon; O'Connor, Patricia; Haverals, Wouter; Kestemont, Mike; Vandyck, Caroline; Kiessling, Benjamin -
dc.description.allpeopleoriginal Clérice, Thibault; Pinche, Ariane; Vlachou-Efstathiou, Malamatenia; Chagué, Alix; Camps, Jean-Baptiste; Levenson, Matthias Gille; Brisville-Fertin, Olivier; Boschetti, Federico; Fischer, Franz; Gervers, Michael; Boutreux, Agnès; Manton, Avery; Gabay, Simon; O'Connor, Patricia; Haverals, Wouter; Kestemont, Mike; Vandyck, Caroline; Kiessling, Benjamin en
dc.description.fulltext partially_open en
dc.description.numberofauthors 18 -
dc.identifier.doi 10.1007/978-3-031-70543-4_11 en
dc.identifier.isbn 9783031705427 en
dc.identifier.isbn 9783031705434 en
dc.identifier.scopus 2-s2.0-85204572971 en
dc.identifier.source crossref *
dc.identifier.uri https://hdl.handle.net/20.500.14243/506902 -
dc.identifier.url https://link.springer.com/book/10.1007/978-3-031-70543-4 en
dc.language.iso eng en
dc.publisher.name Springer en
dc.relation.allauthors Barney Smith, Elisa H.; Liwicki, Marcus; Liangrui, Peng en
dc.relation.firstpage 174 en
dc.relation.ispartofbook Document Analysis and Recognition – ICDAR 2024 en
dc.relation.lastpage 194 en
dc.relation.numberofpages 21 en
dc.relation.volume 14806 LNCS en
dc.subject.keywords historical sources; medieval manuscripts; Latin scripts; benchmarking dataset; multilingual; handwritten text recognition -
dc.subject.singlekeyword historical sources *
dc.subject.singlekeyword medieval manuscripts *
dc.subject.singlekeyword Latin scripts *
dc.subject.singlekeyword benchmarking dataset *
dc.subject.singlekeyword multilingual *
dc.subject.singlekeyword handwritten text recognition *
dc.title CATMuS Medieval: A Multilingual Large-Scale Cross-Century Dataset in Latin Script for Handwritten Text Recognition and Beyond en
dc.type.driver info:eu-repo/semantics/bookPart -
dc.type.full 02 Contributo in Volume::02.01 Contributo in volume (Capitolo o Saggio) it
dc.type.miur 268 -
iris.mediafilter.data 2025/04/12 03:39:11 *
iris.orcid.lastModifiedDate 2025/01/22 12:17:59 *
iris.orcid.lastModifiedMillisecond 1737544679357 *
iris.scopus.extIssued 2024 -
iris.scopus.extTitle CATMuS Medieval: A Multilingual Large-Scale Cross-Century Dataset in Latin Script for Handwritten Text Recognition and Beyond -
iris.sitodocente.maxattempts 1 -
iris.unpaywall.bestoahost repository *
iris.unpaywall.bestoaversion submittedVersion *
iris.unpaywall.doi 10.1007/978-3-031-70543-4_11 *
iris.unpaywall.hosttype repository *
iris.unpaywall.isoa true *
iris.unpaywall.journalisindoaj false *
iris.unpaywall.landingpage https://inria.hal.science/hal-04453952 *
iris.unpaywall.license cc-by *
iris.unpaywall.metadataCallLastModified 27/01/2026 03:58:37 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1769482717467 -
iris.unpaywall.oastatus green *
iris.unpaywall.pdfurl https://inria.hal.science/hal-04453952/document *
scopus.authority.anceserie LECTURE NOTES IN COMPUTER SCIENCE###0302-9743 *
scopus.category 2614 *
scopus.category 1700 *
scopus.contributor.affiliation Inria -
scopus.contributor.affiliation CNRS -
scopus.contributor.affiliation EPHE -
scopus.contributor.affiliation EPHE -
scopus.contributor.affiliation ÉNC - École nationale des chartes -
scopus.contributor.affiliation École Normale Supérieure de Lyon -
scopus.contributor.affiliation École Normale Supérieure de Lyon -
scopus.contributor.affiliation Ca’Foscari -
scopus.contributor.affiliation Ca’Foscari -
scopus.contributor.affiliation University of Toronto -
scopus.contributor.affiliation University of Toronto -
scopus.contributor.affiliation University of Toronto -
scopus.contributor.affiliation UNIGE - Université de Genève -
scopus.contributor.affiliation CJM - Centre Jean Mabillon -
scopus.contributor.affiliation Princeton University -
scopus.contributor.affiliation Antwerp University -
scopus.contributor.affiliation Antwerp University -
scopus.contributor.affiliation EPHE -
scopus.contributor.afid 60013373 -
scopus.contributor.afid 60108312 -
scopus.contributor.afid 60027946 -
scopus.contributor.afid 60027946 -
scopus.contributor.afid 60108669 -
scopus.contributor.afid 60005667 -
scopus.contributor.afid 60005667 -
scopus.contributor.afid 131713121 -
scopus.contributor.afid 131713121 -
scopus.contributor.afid 60016849 -
scopus.contributor.afid 60016849 -
scopus.contributor.afid 60016849 -
scopus.contributor.afid 60004718 -
scopus.contributor.afid 60108669 -
scopus.contributor.afid 60003269 -
scopus.contributor.afid 60012937 -
scopus.contributor.afid 60012937 -
scopus.contributor.afid 60027946 -
scopus.contributor.auid 56543639100 -
scopus.contributor.auid 57221684848 -
scopus.contributor.auid 58286393600 -
scopus.contributor.auid 57479333300 -
scopus.contributor.auid 55966858000 -
scopus.contributor.auid 59454898300 -
scopus.contributor.auid 59337841800 -
scopus.contributor.auid 36081634700 -
scopus.contributor.auid 58939372300 -
scopus.contributor.auid 6506551372 -
scopus.contributor.auid 59337085500 -
scopus.contributor.auid 59337841900 -
scopus.contributor.auid 57212619963 -
scopus.contributor.auid 59337535600 -
scopus.contributor.auid 57210274021 -
scopus.contributor.auid 36450632300 -
scopus.contributor.auid 59337693200 -
scopus.contributor.auid 57211666085 -
scopus.contributor.country France -
scopus.contributor.country France -
scopus.contributor.country France -
scopus.contributor.country France -
scopus.contributor.country France -
scopus.contributor.country France -
scopus.contributor.country France -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Canada -
scopus.contributor.country Canada -
scopus.contributor.country Canada -
scopus.contributor.country Switzerland -
scopus.contributor.country France -
scopus.contributor.country United States -
scopus.contributor.country Belgium -
scopus.contributor.country Belgium -
scopus.contributor.country France -
scopus.contributor.dptid 128356945 -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid 131714123 -
scopus.contributor.dptid 131714123 -
scopus.contributor.dptid 113636298 -
scopus.contributor.dptid 113636298 -
scopus.contributor.dptid 113636298 -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.name Thibault -
scopus.contributor.name Ariane -
scopus.contributor.name Malamatenia -
scopus.contributor.name Alix -
scopus.contributor.name Jean-Baptiste -
scopus.contributor.name Matthias Gille -
scopus.contributor.name Olivier -
scopus.contributor.name Federico -
scopus.contributor.name Franz -
scopus.contributor.name Michael -
scopus.contributor.name Agnès -
scopus.contributor.name Avery -
scopus.contributor.name Simon -
scopus.contributor.name Patricia -
scopus.contributor.name Wouter -
scopus.contributor.name Mike -
scopus.contributor.name Caroline -
scopus.contributor.name Benjamin -
scopus.contributor.subaffiliation ALMAnaCH - Automatic Language Modelling and Analysis and Computational Humanities; -
scopus.contributor.subaffiliation CIHAM–UMR 5648; -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation VeDPH - Venice Centre for Digital and Public Humanities; -
scopus.contributor.subaffiliation VeDPH - Venice Centre for Digital and Public Humanities; -
scopus.contributor.subaffiliation UToronto - Department of History; -
scopus.contributor.subaffiliation UToronto - Department of History; -
scopus.contributor.subaffiliation UToronto - Department of History; -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.surname Clérice -
scopus.contributor.surname Pinche -
scopus.contributor.surname Vlachou-Efstathiou -
scopus.contributor.surname Chagué -
scopus.contributor.surname Camps -
scopus.contributor.surname Levenson -
scopus.contributor.surname Brisville-Fertin -
scopus.contributor.surname Boschetti -
scopus.contributor.surname Fischer -
scopus.contributor.surname Gervers -
scopus.contributor.surname Boutreux -
scopus.contributor.surname Manton -
scopus.contributor.surname Gabay -
scopus.contributor.surname O’Connor -
scopus.contributor.surname Haverals -
scopus.contributor.surname Kestemont -
scopus.contributor.surname Vandyck -
scopus.contributor.surname Kiessling -
scopus.date.issued 2024 *
scopus.description.abstracteng The surge in digitisation initiatives by Cultural Heritage institutions has facilitated online accessibility to numerous historical manuscripts. However, a substantial portion of these documents exists solely as images, lacking machine-readable text. Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting these images into machine-readable formats, enabling researchers and scholars to analyse vast collections efficiently. Despite significant technological progress, establishing consistent ground truth across projects for HTR tasks, particularly for complex and heterogeneous historical sources like medieval manuscripts in Latin scripts (8th-15th century CE), remains nonetheless challenging. We introduce the Consistent Approaches to Transcribing Manuscripts (CATMuS) dataset for medieval manuscripts, which offers (1) a uniform framework for annotation practices for medieval manuscripts, a benchmarking environment (2) for evaluating automatic text recognition models across multiple dimensions thanks to rich metadata (century of production, language, genre, script, etc.), (3) for other tasks (such as script classification or dating approaches), (4) and finally for exploratory work pertaining to computer vision and digital paleography around line-based tasks, such as generative approaches. Developed through collaboration among various institutions and projects, CATMuS provides an inter-compatible dataset spanning more than 200 manuscripts and incunabula in 10 different languages, comprising over 160,000 lines of text and 5 million characters spanning from the 8th century to the 16th. The dataset’s consistency in transcription approaches aims to mitigate challenges arising from the diversity in standards for medieval manuscript transcriptions, providing a comprehensive benchmark for evaluating HTR models on historical sources. *
scopus.description.allpeopleoriginal Clerice T.; Pinche A.; Vlachou-Efstathiou M.; Chague A.; Camps J.-B.; Levenson M.G.; Brisville-Fertin O.; Boschetti F.; Fischer F.; Gervers M.; Boutreux A.; Manton A.; Gabay S.; O'Connor P.; Haverals W.; Kestemont M.; Vandyck C.; Kiessling B. *
scopus.differences scopus.publisher.name *
scopus.differences scopus.subject.keywords *
scopus.differences scopus.description.allpeopleoriginal *
scopus.differences scopus.description.abstracteng *
scopus.differences scopus.identifier.isbn *
scopus.differences scopus.relation.volume *
scopus.document.type cp *
scopus.document.types cp *
scopus.identifier.doi 10.1007/978-3-031-70543-4_11 *
scopus.identifier.eissn 1611-3349 *
scopus.identifier.isbn 9783031705427 *
scopus.identifier.pui 645315420 *
scopus.identifier.scopus 2-s2.0-85204572971 *
scopus.journal.sourceid 25674 *
scopus.language.iso eng *
scopus.publisher.name Springer Science and Business Media Deutschland GmbH *
scopus.relation.conferencedate 2024 *
scopus.relation.conferencename 18th International Conference on Document Analysis and Recognition, ICDAR 2024 *
scopus.relation.conferenceplace grc *
scopus.relation.firstpage 174 *
scopus.relation.lastpage 194 *
scopus.relation.volume 14806 *
scopus.subject.keywords benchmarking dataset; handwritten text recognition; Historical sources; Latin scripts; medieval manuscripts; multilingual; *
scopus.title CATMuS Medieval: A Multilingual Large-Scale Cross-Century Dataset in Latin Script for Handwritten Text Recognition and Beyond *
scopus.titleeng CATMuS Medieval: A Multilingual Large-Scale Cross-Century Dataset in Latin Script for Handwritten Text Recognition and Beyond *
Appears in types: 02.01 Contributo in volume (Capitolo o Saggio) [Contribution in volume (chapter or essay)]
Files in this record:

clerice_et_al_Springer978-3-031-70543-4.pdf
Access: authorised users only
Licence: not public (private/restricted access)
Size: 2.24 MB
Format: Adobe PDF

ICDAR24___CATMUS_Medieval-1.pdf
Access: open access
Licence: Creative Commons
Size: 2.61 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/506902
Citations
  • Scopus 8