CNR Institutional Research Information System

On the impact of pretraining data ordering in transformer encoder- and decoder-only language models

Dini, Luca; Domenichelli, Lucia; Brunato, Dominique; Dell'Orletta, Felice

2026

Abstract

Pretraining large language models typically relies on randomly ordered corpora, implicitly assuming that data order has limited impact on learning. However, curriculum learning suggests that the sequence of training examples can influence optimization and representation dynamics. In this work, we systematically examine pretraining data ordering as an independent design variable for transformer-based language models, analyzing how curriculum-inspired strategies affect learning trajectories, representations, and transfer performance. We pretrain encoder-only and decoder-only models under controlled conditions, varying only the ordering of training data according to readability-based complexity proxies and their inverted variants, alongside multiple random baselines. Beyond final accuracy, we adopt a multi-dimensional evaluation framework combining intrinsic metrics, linguistic probing across training stages, downstream tasks, and geometric analyses of embedding spaces. Results indicate architecture-dependent tendencies in response to data ordering. Encoder models generally exhibit stronger sensitivity to curriculum strategies, with noticeable differences in optimization behavior, probing dynamics, and representation geometry. Decoder models appear comparatively more stable under forward curricula, with more pronounced effects emerging under inverted orderings. Probing analyses suggest that early improvements reflect differences in data exposure rather than accelerated linguistic acquisition, while later-stage effects selectively mirror properties emphasized by specific curricula. Geometric analyses show that data ordering reshapes global variance structure, often increasing anisotropy, without substantially altering nonlinear intrinsic dimensionality. Overall, data ordering functions as a selective inductive bias during pretraining, influencing learning dynamics and representational emphasis rather than consistently improving performance. These findings clarify how curriculum design interacts with transformer architectures and delineate its practical impact on pretraining outcomes.
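The ordering conditions named in the abstract (forward curriculum, inverted curriculum, random baselines) can be illustrated with a minimal sketch. This record does not specify the readability-based proxies the authors used, so `complexity_proxy` below is a hypothetical stand-in (mean sentence length plus mean word length), and `order_corpus` and its `strategy` names are assumptions made for illustration only.

```python
import random
import re


def complexity_proxy(text: str) -> float:
    """Hypothetical readability proxy (NOT the paper's measure):
    mean sentence length in words plus mean word length in characters."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    mean_sentence_len = len(words) / len(sentences)
    mean_word_len = sum(len(w) for w in words) / len(words)
    return mean_sentence_len + mean_word_len


def order_corpus(docs, strategy="forward", seed=0):
    """Return a new list with the same documents in a different order.

    forward  : easiest first (ascending proxy score)
    inverted : hardest first (descending proxy score)
    random   : seeded shuffle, usable as a reproducible random baseline
    """
    if strategy == "forward":
        return sorted(docs, key=complexity_proxy)
    if strategy == "inverted":
        return sorted(docs, key=complexity_proxy, reverse=True)
    if strategy == "random":
        shuffled = list(docs)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Calling `order_corpus(docs, "random", seed=s)` with several values of `s` would give multiple random baselines over the same documents, so that only the order differs between conditions.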

Full record (DC)

DC field	Value	Language
dc.authority.ancejournal	KNOWLEDGE-BASED SYSTEMS	en
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	en
dc.authority.people	Dini, Luca	en
dc.authority.people	Domenichelli, Lucia	en
dc.authority.people	Brunato, Dominique	en
dc.authority.people	Dell'Orletta, Felice	en
dc.collection.id.s	b3f88f24-048a-4e43-8ab1-6697b90e068e	*
dc.collection.name	01.01 Articolo in rivista	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.date.accessioned	2026/07/03 16:53:09	-
dc.date.available	2026/07/03 16:53:09	-
dc.date.firstsubmission	2026/05/12 15:12:27	*
dc.date.issued	2026	-
dc.date.submission	2026/05/12 15:12:27	*
dc.description.allpeople	Dini, Luca; Domenichelli, Lucia; Brunato, Dominique; Dell'Orletta, Felice	-
dc.description.allpeopleoriginal	Dini, Luca; Domenichelli, Lucia; Brunato, Dominique; Dell'Orletta, Felice	en
dc.description.fulltext	open	en
dc.description.international	no	en
dc.description.numberofauthors	4	-
dc.identifier.doi	10.1016/j.knosys.2026.115850	en
dc.identifier.scopus	2-s2.0-105034728576	en
dc.identifier.source	crossref	*
dc.identifier.uri	https://hdl.handle.net/20.500.14243/580564	-
dc.language.iso	eng	en
dc.relation.volume	342	en
dc.subject.keywords	Curriculum learning	-
dc.subject.keywords	Data ordering	-
dc.subject.keywords	Language model pretraining	-
dc.subject.keywords	Linguistic representations	-
dc.subject.keywords	Representation geometry	-
dc.subject.singlekeyword	Curriculum learning	*
dc.subject.singlekeyword	Data ordering	*
dc.subject.singlekeyword	Language model pretraining	*
dc.subject.singlekeyword	Linguistic representations	*
dc.subject.singlekeyword	Representation geometry	*
dc.title	On the impact of pretraining data ordering in transformer encoder- and decoder-only language models	en
dc.type.driver	info:eu-repo/semantics/article	-
dc.type.full	01 Contributo su Rivista::01.01 Articolo in rivista	it
dc.type.miur	262	-
iris.mediafilter.data	2026/07/04 02:29:08	*
iris.orcid.lastModifiedDate	2026/07/03 16:53:09	*
iris.orcid.lastModifiedMillisecond	1783090389768	*
iris.scopus.extIssued	2026	-
iris.scopus.extTitle	On the impact of pretraining data ordering in transformer encoder- and decoder-only language models	-
iris.sitodocente.maxattempts	1	-
iris.unpaywall.doi	10.1016/j.knosys.2026.115850	*
iris.unpaywall.isoa	false	*
iris.unpaywall.journalisindoaj	false	*
iris.unpaywall.metadataCallLastModified	04/07/2026 04:56:24	-
iris.unpaywall.metadataCallLastModifiedMillisecond	1783133784029	-
iris.unpaywall.oastatus	closed	*
scopus.authority.ancejournal	KNOWLEDGE-BASED SYSTEMS###0950-7051	*
scopus.category	1404	*
scopus.category	1712	*
scopus.category	1802	*
scopus.category	1702	*
scopus.contributor.affiliation	ItaliaNLP Lab	-
scopus.contributor.affiliation	University of Pisa	-
scopus.contributor.affiliation	ItaliaNLP Lab	-
scopus.contributor.affiliation	ItaliaNLP Lab	-
scopus.contributor.afid	60021199	-
scopus.contributor.afid	60028868	-
scopus.contributor.afid	60021199	-
scopus.contributor.afid	60021199	-
scopus.contributor.auid	35185041000	-
scopus.contributor.auid	60169651800	-
scopus.contributor.auid	55237740200	-
scopus.contributor.auid	57540567000	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.dptid	121833164	-
scopus.contributor.dptid		-
scopus.contributor.dptid	121833164	-
scopus.contributor.dptid	121833164	-
scopus.contributor.name	Luca	-
scopus.contributor.name	Lucia	-
scopus.contributor.name	Dominique	-
scopus.contributor.name	Felice	-
scopus.contributor.subaffiliation	Institute of Computational Linguistics “Antonio Zampolli” (CNR-ILC);	-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation	Institute of Computational Linguistics “Antonio Zampolli” (CNR-ILC);	-
scopus.contributor.subaffiliation	Institute of Computational Linguistics “Antonio Zampolli” (CNR-ILC);	-
scopus.contributor.surname	Dini	-
scopus.contributor.surname	Domenichelli	-
scopus.contributor.surname	Brunato	-
scopus.contributor.surname	Dell'Orletta	-
scopus.date.issued	2026	*
scopus.description.allpeopleoriginal	Dini L.; Domenichelli L.; Brunato D.; Dell'Orletta F.	*
scopus.differences	scopus.subject.keywords	*
scopus.differences	scopus.description.allpeopleoriginal	*
scopus.document.type	ar	*
scopus.document.types	ar	*
scopus.identifier.doi	10.1016/j.knosys.2026.115850	*
scopus.identifier.pui	2044663058	*
scopus.identifier.scopus	2-s2.0-105034728576	*
scopus.journal.sourceid	24772	*
scopus.language.iso	eng	*
scopus.publisher.name	Elsevier B.V.	*
scopus.relation.article	115850	*
scopus.relation.volume	342	*
scopus.subject.keywords	Curriculum learning; Data ordering; Language model pretraining; Linguistic representations; Representation geometry;	*
scopus.title	On the impact of pretraining data ordering in transformer encoder- and decoder-only language models	*
scopus.titleeng	On the impact of pretraining data ordering in transformer encoder- and decoder-only language models	*
Appears in types:	01.01 Articolo in rivista (journal article)

Files in this record:

File	Size	Format
1-s2.0-S0950705126005769-main.pdf (open access; type: publisher's version (PDF); license: Creative Commons)	17.91 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/580564
