Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models
Ciaccio C.; Sartor M.; Miaschi A.; Dell'Orletta F.
2025
Abstract
Correctly identifying the characters and substrings of words should be a basic but essential ability of any Language Model that aims to proficiently understand and produce language. Despite this, the majority of Pre-trained Language Models (PLMs) are "character-blind" and struggle in spelling tasks, although they still seem to acquire some character knowledge during pre-training, a phenomenon dubbed the Spelling Miracle. To shed light on this phenomenon, we systematically evaluate a range of PLMs with different parameter sizes using a controlled binary substring identification task. Through a series of experiments, we present the first comprehensive investigation of where, when, and how PLMs develop awareness of characters and substrings, with a particular linguistic focus on morphemic units such as prefixes, suffixes, and roots.

| DC field | Value | Language |
|---|---|---|
| dc.authority.anceserie | PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING | en |
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | en |
| dc.authority.people | Ciaccio C. | en |
| dc.authority.people | Sartor M. | en |
| dc.authority.people | Miaschi A. | en |
| dc.authority.people | Dell'Orletta F. | en |
| dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | * |
| dc.collection.name | 04.01 Contributo in Atti di convegno | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.date.accessioned | 2026/03/03 14:43:31 | - |
| dc.date.available | 2026/03/03 14:43:31 | - |
| dc.date.firstsubmission | 2026/03/02 18:29:56 | * |
| dc.date.issued | 2025 | - |
| dc.date.submission | 2026/03/02 18:29:56 | * |
| dc.description.allpeople | Ciaccio, C.; Sartor, M.; Miaschi, A.; Dell'Orletta, F. | - |
| dc.description.allpeopleoriginal | Ciaccio C.; Sartor M.; Miaschi A.; Dell'Orletta F. | en |
| dc.description.fulltext | open | en |
| dc.description.international | no | en |
| dc.description.numberofauthors | 4 | - |
| dc.identifier.doi | 10.18653/v1/2025.findings-acl.593 | en |
| dc.identifier.scopus | 2-s2.0-105028561206 | en |
| dc.identifier.source | scopus | * |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/570461 | - |
| dc.language.iso | eng | en |
| dc.publisher.name | Association for Computational Linguistics (ACL) | en |
| dc.relation.conferencedate | 2025 | en |
| dc.relation.conferencename | 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 | en |
| dc.relation.firstpage | 11361 | en |
| dc.relation.ispartofbook | Proceedings of the Annual Meeting of the Association for Computational Linguistics | en |
| dc.relation.lastpage | 11372 | en |
| dc.relation.numberofpages | 12 | en |
| dc.subject.keywordseng | Large Language Models (LLMs) | - |
| dc.subject.keywordseng | Interpretability | - |
| dc.subject.singlekeyword | Large Language Models (LLMs) | * |
| dc.subject.singlekeyword | Interpretability | * |
| dc.title | Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models | en |
| dc.type.driver | info:eu-repo/semantics/conferenceObject | - |
| dc.type.full | 04 Contributo in convegno::04.01 Contributo in Atti di convegno | it |
| dc.type.miur | 273 | - |
| iris.mediafilter.data | 2026/03/04 02:52:30 | * |
| iris.orcid.lastModifiedDate | 2026/03/03 14:43:31 | * |
| iris.orcid.lastModifiedMillisecond | 1772545411798 | * |
| iris.scopus.extIssued | 2025 | - |
| iris.scopus.extTitle | Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models | - |
| iris.sitodocente.maxattempts | 1 | - |
| iris.unpaywall.bestoaversion | publishedVersion | * |
| iris.unpaywall.doi | 10.18653/v1/2025.findings-acl.593 | * |
| iris.unpaywall.isoa | true | * |
| iris.unpaywall.landingpage | https://doi.org/10.18653/v1/2025.findings-acl.593 | * |
| iris.unpaywall.license | cc-by | * |
| iris.unpaywall.metadataCallLastModified | 04/03/2026 04:34:00 | - |
| iris.unpaywall.metadataCallLastModifiedMillisecond | 1772595240983 | - |
| iris.unpaywall.oastatus | gold | * |
| iris.unpaywall.pdfurl | https://aclanthology.org/2025.findings-acl.593.pdf | * |
| scopus.authority.anceserie | PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING###0736-587X | * |
| scopus.category | 1203 | * |
| scopus.category | 3310 | * |
| scopus.category | 1706 | * |
| scopus.contributor.affiliation | ItaliaNLP Lab | - |
| scopus.contributor.affiliation | ItaliaNLP Lab | - |
| scopus.contributor.affiliation | ItaliaNLP Lab | - |
| scopus.contributor.affiliation | ItaliaNLP Lab | - |
| scopus.contributor.afid | 60008941 | - |
| scopus.contributor.afid | 60008941 | - |
| scopus.contributor.afid | 60008941 | - |
| scopus.contributor.afid | 60008941 | - |
| scopus.contributor.auid | 59504212000 | - |
| scopus.contributor.auid | 59207233400 | - |
| scopus.contributor.auid | 57211678681 | - |
| scopus.contributor.auid | 57540567000 | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.dptid | 114087935 | - |
| scopus.contributor.dptid | 114087935 | - |
| scopus.contributor.dptid | 114087935 | - |
| scopus.contributor.dptid | 114087935 | - |
| scopus.contributor.name | Cristiano | - |
| scopus.contributor.name | Marta | - |
| scopus.contributor.name | Alessio | - |
| scopus.contributor.name | Felice | - |
| scopus.contributor.subaffiliation | Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC); | - |
| scopus.contributor.subaffiliation | Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC); | - |
| scopus.contributor.subaffiliation | Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC); | - |
| scopus.contributor.subaffiliation | Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC); | - |
| scopus.contributor.surname | Ciaccio | - |
| scopus.contributor.surname | Sartor | - |
| scopus.contributor.surname | Miaschi | - |
| scopus.contributor.surname | Dell'Orletta | - |
| scopus.date.issued | 2025 | * |
| scopus.description.allpeopleoriginal | Ciaccio C.; Sartor M.; Miaschi A.; Dell'Orletta F. | * |
| scopus.differences | scopus.identifier.isbn | * |
| scopus.differences | scopus.relation.conferenceplace | * |
| scopus.document.type | cp | * |
| scopus.document.types | cp | * |
| scopus.funding.funders | 501100021856 - Ministero dell'Università e della Ricerca; 501100021856 - Ministero dell'Università e della Ricerca; | * |
| scopus.funding.ids | PE0000013-FAIR; | * |
| scopus.identifier.doi | 10.18653/v1/2025.findings-acl.593 | * |
| scopus.identifier.isbn | 9798891762565 | * |
| scopus.identifier.pui | 650043653 | * |
| scopus.identifier.scopus | 2-s2.0-105028561206 | * |
| scopus.journal.sourceid | 21101138302 | * |
| scopus.language.iso | eng | * |
| scopus.publisher.name | Association for Computational Linguistics (ACL) | * |
| scopus.relation.conferencedate | 2025 | * |
| scopus.relation.conferencename | 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 | * |
| scopus.relation.conferenceplace | aut | * |
| scopus.relation.firstpage | 11361 | * |
| scopus.relation.lastpage | 11372 | * |
| scopus.title | Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models | * |
| scopus.titleeng | Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models | * |

Appears in types: 04.01 Contributo in Atti di convegno

| File | Size | Format |
|---|---|---|
| 2025.findings-acl.593.pdf (open access; license: Creative Commons) | 2.68 MB | Adobe PDF |


