CNR Institutional Research Information System

Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models

Ciaccio C.;Sartor M.;Miaschi A.;Dell'Orletta F.

2025

Abstract

Correctly identifying characters and substrings of words should be a basic but essential ability of any Language Model that aims to proficiently understand and produce language. Despite this, most Pre-trained Language Models (PLMs) are "character-blind" and struggle with spelling tasks, although they still seem to acquire some character knowledge during pre-training, a phenomenon dubbed the Spelling Miracle. To shed light on this phenomenon, we systematically evaluate a range of PLMs of different parameter sizes using a controlled binary substring identification task. Through a series of experiments, we present the first comprehensive investigation of where, when, and how PLMs develop awareness of characters and substrings, with a particular linguistic focus on morphemic units such as prefixes, suffixes, and roots.

Full record (DC)

DC field	Value	Language
dc.authority.anceserie	PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING	en
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	en
dc.authority.people	Ciaccio C.	en
dc.authority.people	Sartor M.	en
dc.authority.people	Miaschi A.	en
dc.authority.people	Dell'Orletta F.	en
dc.collection.id.s	71c7200a-7c5f-4e83-8d57-d3d2ba88f40d	*
dc.collection.name	04.01 Contributo in Atti di convegno	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.date.accessioned	2026/03/03 14:43:31	-
dc.date.available	2026/03/03 14:43:31	-
dc.date.firstsubmission	2026/03/02 18:29:56	*
dc.date.issued	2025	-
dc.date.submission	2026/03/02 18:29:56	*
dc.description.abstracteng	Correctly identifying characters and substrings of words should be a basic but essential ability of any Language Model that aims to proficiently understand and produce language. Despite so, the majority of Pre-trained Language Models (PLMs) are "character-blind" and struggle in spelling tasks, although they still seem to acquire some character knowledge during pre-training, a phenomenon dubbed Spelling Miracle. To shed light on this phenomenon, we systematically evaluate a range of PLMs with different parameter sizes using a controlled binary substring identification task. Through a series of experiments, we propose the first comprehensive investigation on where, when, and how PLMs develop awareness of characters and substrings, with a particular linguistic focus on morphemic units such as prefixes, suffixes, and roots.	-
dc.description.allpeople	Ciaccio, C.; Sartor, M.; Miaschi, A.; Dell'Orletta, F.	-
dc.description.allpeopleoriginal	Ciaccio C.; Sartor M.; Miaschi A.; Dell'Orletta F.	en
dc.description.fulltext	open	en
dc.description.international	no	en
dc.description.numberofauthors	4	-
dc.identifier.doi	10.18653/v1/2025.findings-acl.593	en
dc.identifier.scopus	2-s2.0-105028561206	en
dc.identifier.source	scopus	*
dc.identifier.uri	https://hdl.handle.net/20.500.14243/570461	-
dc.language.iso	eng	en
dc.publisher.name	Association for Computational Linguistics (ACL)	en
dc.relation.conferencedate	2025	en
dc.relation.conferencename	63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025	en
dc.relation.firstpage	11361	en
dc.relation.ispartofbook	Proceedings of the Annual Meeting of the Association for Computational Linguistics	en
dc.relation.lastpage	11372	en
dc.relation.numberofpages	12	en
dc.subject.keywordseng	Large Language Models (LLMs)	-
dc.subject.keywordseng	Interpretability	-
dc.subject.singlekeyword	Large Language Models (LLMs)	*
dc.subject.singlekeyword	Interpretability	*
dc.title	Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models	en
dc.type.driver	info:eu-repo/semantics/conferenceObject	-
dc.type.full	04 Contributo in convegno::04.01 Contributo in Atti di convegno	it
dc.type.miur	273	-
iris.mediafilter.data	2026/03/04 02:52:30	*
iris.orcid.lastModifiedDate	2026/03/03 14:43:31	*
iris.orcid.lastModifiedMillisecond	1772545411798	*
iris.scopus.extIssued	2025	-
iris.scopus.extTitle	Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models	-
iris.sitodocente.maxattempts	1	-
iris.unpaywall.bestoaversion	publishedVersion	*
iris.unpaywall.doi	10.18653/v1/2025.findings-acl.593	*
iris.unpaywall.isoa	true	*
iris.unpaywall.landingpage	https://doi.org/10.18653/v1/2025.findings-acl.593	*
iris.unpaywall.license	cc-by	*
iris.unpaywall.metadataCallLastModified	04/03/2026 04:34:00	-
iris.unpaywall.metadataCallLastModifiedMillisecond	1772595240983	-
iris.unpaywall.oastatus	gold	*
iris.unpaywall.pdfurl	https://aclanthology.org/2025.findings-acl.593.pdf	*
scopus.authority.anceserie	PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING###0736-587X	*
scopus.category	1203	*
scopus.category	3310	*
scopus.category	1706	*
scopus.contributor.affiliation	ItaliaNLP Lab	-
scopus.contributor.affiliation	ItaliaNLP Lab	-
scopus.contributor.affiliation	ItaliaNLP Lab	-
scopus.contributor.affiliation	ItaliaNLP Lab	-
scopus.contributor.afid	60008941	-
scopus.contributor.afid	60008941	-
scopus.contributor.afid	60008941	-
scopus.contributor.afid	60008941	-
scopus.contributor.auid	59504212000	-
scopus.contributor.auid	59207233400	-
scopus.contributor.auid	57211678681	-
scopus.contributor.auid	57540567000	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.dptid	114087935	-
scopus.contributor.dptid	114087935	-
scopus.contributor.dptid	114087935	-
scopus.contributor.dptid	114087935	-
scopus.contributor.name	Cristiano	-
scopus.contributor.name	Marta	-
scopus.contributor.name	Alessio	-
scopus.contributor.name	Felice	-
scopus.contributor.subaffiliation	Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC);	-
scopus.contributor.subaffiliation	Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC);	-
scopus.contributor.subaffiliation	Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC);	-
scopus.contributor.subaffiliation	Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC);	-
scopus.contributor.surname	Ciaccio	-
scopus.contributor.surname	Sartor	-
scopus.contributor.surname	Miaschi	-
scopus.contributor.surname	Dell'Orletta	-
scopus.date.issued	2025	*
scopus.description.abstracteng	Correctly identifying characters and substrings of words should be a basic but essential ability of any Language Model that aims to proficiently understand and produce language. Despite so, the majority of Pre-trained Language Models (PLMs) are "character-blind" and struggle in spelling tasks, although they still seem to acquire some character knowledge during pre-training, a phenomenon dubbed Spelling Miracle. To shed light on this phenomenon, we systematically evaluate a range of PLMs with different parameter sizes using a controlled binary substring identification task. Through a series of experiments, we propose the first comprehensive investigation on where, when, and how PLMs develop awareness of characters and substrings, with a particular linguistic focus on morphemic units such as prefixes, suffixes, and roots.	*
scopus.description.allpeopleoriginal	Ciaccio C.; Sartor M.; Miaschi A.; Dell'Orletta F.	*
scopus.differences	scopus.identifier.isbn	*
scopus.differences	scopus.relation.conferenceplace	*
scopus.document.type	cp	*
scopus.document.types	cp	*
scopus.funding.funders	501100021856 - Ministero dell'Università e della Ricerca; 501100021856 - Ministero dell'Università e della Ricerca;	*
scopus.funding.ids	PE0000013-FAIR;	*
scopus.identifier.doi	10.18653/v1/2025.findings-acl.593	*
scopus.identifier.isbn	9798891762565	*
scopus.identifier.pui	650043653	*
scopus.identifier.scopus	2-s2.0-105028561206	*
scopus.journal.sourceid	21101138302	*
scopus.language.iso	eng	*
scopus.publisher.name	Association for Computational Linguistics (ACL)	*
scopus.relation.conferencedate	2025	*
scopus.relation.conferencename	63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025	*
scopus.relation.conferenceplace	aut	*
scopus.relation.firstpage	11361	*
scopus.relation.lastpage	11372	*
scopus.title	Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models	*
scopus.titleeng	Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models	*
Appears in types:	04.01 Contributo in Atti di convegno

Files in this item:

File	Size	Format
2025.findings-acl.593.pdf (open access; license: Creative Commons)	2.68 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/570461
