Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models

Ciaccio C.; Sartor M.; Miaschi A.; Dell'Orletta F.
2025

Abstract

Correctly identifying characters and substrings of words should be a basic but essential ability of any Language Model that aims to proficiently understand and produce language. Despite this, most Pre-trained Language Models (PLMs) are "character-blind" and struggle with spelling tasks, although they still seem to acquire some character knowledge during pre-training, a phenomenon dubbed the Spelling Miracle. To shed light on this phenomenon, we systematically evaluate a range of PLMs of different parameter sizes using a controlled binary substring identification task. Through a series of experiments, we present the first comprehensive investigation of where, when, and how PLMs develop awareness of characters and substrings, with a particular linguistic focus on morphemic units such as prefixes, suffixes, and roots.
DC Field Value Language
dc.authority.anceserie PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Ciaccio C. en
dc.authority.people Sartor M. en
dc.authority.people Miaschi A. en
dc.authority.people Dell'Orletta F. en
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contributo in Atti di convegno *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.accessioned 2026/03/03 14:43:31 -
dc.date.available 2026/03/03 14:43:31 -
dc.date.firstsubmission 2026/03/02 18:29:56 *
dc.date.issued 2025 -
dc.date.submission 2026/03/02 18:29:56 *
dc.description.allpeople Ciaccio, C.; Sartor, M.; Miaschi, A.; Dell'Orletta, F. -
dc.description.allpeopleoriginal Ciaccio C.; Sartor M.; Miaschi A.; Dell'Orletta F. en
dc.description.fulltext open en
dc.description.international no en
dc.description.numberofauthors 4 -
dc.identifier.doi 10.18653/v1/2025.findings-acl.593 en
dc.identifier.scopus 2-s2.0-105028561206 en
dc.identifier.source scopus *
dc.identifier.uri https://hdl.handle.net/20.500.14243/570461 -
dc.language.iso eng en
dc.publisher.name Association for Computational Linguistics (ACL) en
dc.relation.conferencedate 2025 en
dc.relation.conferencename 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 en
dc.relation.firstpage 11361 en
dc.relation.ispartofbook Proceedings of the Annual Meeting of the Association for Computational Linguistics en
dc.relation.lastpage 11372 en
dc.relation.numberofpages 12 en
dc.subject.keywordseng Large Language Models (LLMs) -
dc.subject.keywordseng Interpretability -
dc.subject.singlekeyword Large Language Models (LLMs) *
dc.subject.singlekeyword Interpretability *
dc.title Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Contributo in convegno::04.01 Contributo in Atti di convegno it
dc.type.miur 273 -
iris.mediafilter.data 2026/03/04 02:52:30 *
iris.orcid.lastModifiedDate 2026/03/03 14:43:31 *
iris.orcid.lastModifiedMillisecond 1772545411798 *
iris.scopus.extIssued 2025 -
iris.scopus.extTitle Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models -
iris.sitodocente.maxattempts 1 -
iris.unpaywall.bestoaversion publishedVersion *
iris.unpaywall.doi 10.18653/v1/2025.findings-acl.593 *
iris.unpaywall.isoa true *
iris.unpaywall.landingpage https://doi.org/10.18653/v1/2025.findings-acl.593 *
iris.unpaywall.license cc-by *
iris.unpaywall.metadataCallLastModified 04/03/2026 04:34:00 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1772595240983 -
iris.unpaywall.oastatus gold *
iris.unpaywall.pdfurl https://aclanthology.org/2025.findings-acl.593.pdf *
scopus.authority.anceserie PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING###0736-587X *
scopus.category 1203 *
scopus.category 3310 *
scopus.category 1706 *
scopus.contributor.affiliation ItaliaNLP Lab -
scopus.contributor.affiliation ItaliaNLP Lab -
scopus.contributor.affiliation ItaliaNLP Lab -
scopus.contributor.affiliation ItaliaNLP Lab -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60008941 -
scopus.contributor.auid 59504212000 -
scopus.contributor.auid 59207233400 -
scopus.contributor.auid 57211678681 -
scopus.contributor.auid 57540567000 -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.dptid 114087935 -
scopus.contributor.dptid 114087935 -
scopus.contributor.dptid 114087935 -
scopus.contributor.dptid 114087935 -
scopus.contributor.name Cristiano -
scopus.contributor.name Marta -
scopus.contributor.name Alessio -
scopus.contributor.name Felice -
scopus.contributor.subaffiliation Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC); -
scopus.contributor.subaffiliation Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC); -
scopus.contributor.subaffiliation Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC); -
scopus.contributor.subaffiliation Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC); -
scopus.contributor.surname Ciaccio -
scopus.contributor.surname Sartor -
scopus.contributor.surname Miaschi -
scopus.contributor.surname Dell'Orletta -
scopus.date.issued 2025 *
scopus.description.allpeopleoriginal Ciaccio C.; Sartor M.; Miaschi A.; Dell'Orletta F. *
scopus.differences scopus.identifier.isbn *
scopus.differences scopus.relation.conferenceplace *
scopus.document.type cp *
scopus.document.types cp *
scopus.funding.funders 501100021856 - Ministero dell'Università e della Ricerca; 501100021856 - Ministero dell'Università e della Ricerca; *
scopus.funding.ids PE0000013-FAIR; *
scopus.identifier.doi 10.18653/v1/2025.findings-acl.593 *
scopus.identifier.isbn 9798891762565 *
scopus.identifier.pui 650043653 *
scopus.identifier.scopus 2-s2.0-105028561206 *
scopus.journal.sourceid 21101138302 *
scopus.language.iso eng *
scopus.publisher.name Association for Computational Linguistics (ACL) *
scopus.relation.conferencedate 2025 *
scopus.relation.conferencename 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 *
scopus.relation.conferenceplace aut *
scopus.relation.firstpage 11361 *
scopus.relation.lastpage 11372 *
scopus.title Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models *
scopus.titleeng Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models *
Appears in types: 04.01 Contributo in Atti di convegno (contribution in conference proceedings)
Files in this item:
File Size Format
2025.findings-acl.593.pdf

open access

License: Creative Commons
Size 2.68 MB
Format Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/570461
Citations
  • Scopus 2