On the impact of pretraining data ordering in transformer encoder- and decoder-only language models

Dini, Luca; Domenichelli, Lucia; Brunato, Dominique; Dell'Orletta, Felice
2026

Abstract

Pretraining large language models typically relies on randomly ordered corpora, implicitly assuming that data order has limited impact on learning. However, curriculum learning suggests that the sequence of training examples can influence optimization and representation dynamics. In this work, we systematically examine pretraining data ordering as an independent design variable for transformer-based language models, analyzing how curriculum-inspired strategies affect learning trajectories, representations, and transfer performance. We pretrain encoder-only and decoder-only models under controlled conditions, varying only the ordering of training data according to readability-based complexity proxies and their inverted variants, alongside multiple random baselines. Beyond final accuracy, we adopt a multi-dimensional evaluation framework combining intrinsic metrics, linguistic probing across training stages, downstream tasks, and geometric analyses of embedding spaces. Results indicate architecture-dependent tendencies in response to data ordering. Encoder models generally exhibit stronger sensitivity to curriculum strategies, with noticeable differences in optimization behavior, probing dynamics, and representation geometry. Decoder models appear comparatively more stable under forward curricula, with more pronounced effects emerging under inverted orderings. Probing analyses suggest that early improvements reflect differences in data exposure rather than accelerated linguistic acquisition, while later-stage effects selectively mirror properties emphasized by specific curricula. Geometric analyses show that data ordering reshapes global variance structure, often increasing anisotropy, without substantially altering nonlinear intrinsic dimensionality. Overall, data ordering functions as a selective inductive bias during pretraining, influencing learning dynamics and representational emphasis rather than consistently improving performance. These findings clarify how curriculum design interacts with transformer architectures and delineate its practical impact on pretraining outcomes.
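
As an illustration of the three ordering conditions named in the abstract (forward curriculum, inverted curriculum, random baseline), the following is a minimal sketch and not the authors' pipeline. The record does not state which readability-based proxies were used, so `complexity_proxy` is a hypothetical stand-in (mean sentence length plus mean word length), and the function names, strategy labels, and seed handling are assumptions made for the example.

```python
import random
import re


def complexity_proxy(text: str) -> float:
    """Hypothetical readability proxy (an assumption, not the paper's metric):
    mean sentence length in words plus mean word length in characters."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    mean_sentence_len = len(words) / len(sentences)
    mean_word_len = sum(len(w) for w in words) / len(words)
    return mean_sentence_len + mean_word_len


def order_corpus(docs, strategy, seed=0):
    """Return a new list of documents ordered by the given strategy.

    'forward'  : simplest documents first (low proxy score to high)
    'inverted' : most complex documents first (high proxy score to low)
    'random'   : seeded shuffle, one seed per random baseline
    """
    if strategy == "forward":
        return sorted(docs, key=complexity_proxy)
    if strategy == "inverted":
        return sorted(docs, key=complexity_proxy, reverse=True)
    if strategy == "random":
        shuffled = list(docs)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Only the order of the documents changes between conditions; the set of documents is identical, which matches the abstract's description of varying data ordering alone. Using several seeds with the `random` strategy gives multiple random baselines.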
DC Field Value Language
dc.authority.ancejournal KNOWLEDGE-BASED SYSTEMS en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Dini, Luca en
dc.authority.people Domenichelli, Lucia en
dc.authority.people Brunato, Dominique en
dc.authority.people Dell'Orletta, Felice en
dc.collection.id.s b3f88f24-048a-4e43-8ab1-6697b90e068e *
dc.collection.name 01.01 Articolo in rivista *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.firstsubmission 2026/05/12 15:12:27 *
dc.date.issued 2026 -
dc.date.submission 2026/05/12 15:12:27 *
dc.description.allpeople Dini, Luca; Domenichelli, Lucia; Brunato, Dominique; Dell'Orletta, Felice -
dc.description.allpeopleoriginal Dini, Luca; Domenichelli, Lucia; Brunato, Dominique; Dell'Orletta, Felice en
dc.description.fulltext none en
dc.description.international no en
dc.description.numberofauthors 4 -
dc.identifier.doi 10.1016/j.knosys.2026.115850 en
dc.identifier.scopus 2-s2.0-105034728576 en
dc.identifier.source crossref *
dc.identifier.uri https://hdl.handle.net/20.500.14243/580564 -
dc.language.iso eng en
dc.relation.volume 342 en
dc.subject.keywords Curriculum learning -
dc.subject.keywords Data ordering -
dc.subject.keywords Language model pretraining -
dc.subject.keywords Linguistic representations -
dc.subject.keywords Representation geometry -
dc.subject.singlekeyword Curriculum learning *
dc.subject.singlekeyword Data ordering *
dc.subject.singlekeyword Language model pretraining *
dc.subject.singlekeyword Linguistic representations *
dc.subject.singlekeyword Representation geometry *
dc.title On the impact of pretraining data ordering in transformer encoder- and decoder-only language models en
dc.type.driver info:eu-repo/semantics/article -
dc.type.full 01 Contributo su Rivista::01.01 Articolo in rivista it
dc.type.miur 262 -
iris.orcid.lastModifiedDate 2026/05/12 15:12:27 *
iris.orcid.lastModifiedMillisecond 1778591547280 *
iris.scopus.extIssued 2026 -
iris.scopus.extTitle On the impact of pretraining data ordering in transformer encoder- and decoder-only language models -
iris.sitodocente.maxattempts 1 -
iris.unpaywall.doi 10.1016/j.knosys.2026.115850 *
iris.unpaywall.isoa false *
iris.unpaywall.journalisindoaj false *
iris.unpaywall.metadataCallLastModified 13/05/2026 04:18:47 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1778638727679 -
iris.unpaywall.oastatus closed *
scopus.authority.ancejournal KNOWLEDGE-BASED SYSTEMS###0950-7051 *
scopus.category 1404 *
scopus.category 1712 *
scopus.category 1802 *
scopus.category 1702 *
scopus.contributor.affiliation ItaliaNLP Lab -
scopus.contributor.affiliation University of Pisa -
scopus.contributor.affiliation ItaliaNLP Lab -
scopus.contributor.affiliation ItaliaNLP Lab -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 60028868 -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 60021199 -
scopus.contributor.auid 35185041000 -
scopus.contributor.auid 60169651800 -
scopus.contributor.auid 55237740200 -
scopus.contributor.auid 57540567000 -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.dptid 121833164 -
scopus.contributor.dptid -
scopus.contributor.dptid 121833164 -
scopus.contributor.dptid 121833164 -
scopus.contributor.name Luca -
scopus.contributor.name Lucia -
scopus.contributor.name Dominique -
scopus.contributor.name Felice -
scopus.contributor.subaffiliation Institute of Computational Linguistics “Antonio Zampolli” (CNR-ILC); -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation Institute of Computational Linguistics “Antonio Zampolli” (CNR-ILC); -
scopus.contributor.subaffiliation Institute of Computational Linguistics “Antonio Zampolli” (CNR-ILC); -
scopus.contributor.surname Dini -
scopus.contributor.surname Domenichelli -
scopus.contributor.surname Brunato -
scopus.contributor.surname Dell'Orletta -
scopus.date.issued 2026 *
scopus.description.allpeopleoriginal Dini L.; Domenichelli L.; Brunato D.; Dell'Orletta F. *
scopus.differences scopus.subject.keywords *
scopus.differences scopus.description.allpeopleoriginal *
scopus.document.type ar *
scopus.document.types ar *
scopus.identifier.doi 10.1016/j.knosys.2026.115850 *
scopus.identifier.pui 2044663058 *
scopus.identifier.scopus 2-s2.0-105034728576 *
scopus.journal.sourceid 24772 *
scopus.language.iso eng *
scopus.publisher.name Elsevier B.V. *
scopus.relation.article 115850 *
scopus.relation.volume 342 *
scopus.subject.keywords Curriculum learning; Data ordering; Language model pretraining; Linguistic representations; Representation geometry; *
scopus.title On the impact of pretraining data ordering in transformer encoder- and decoder-only language models *
scopus.titleeng On the impact of pretraining data ordering in transformer encoder- and decoder-only language models *
Files in this item:
There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/580564
Warning: the data displayed have not been validated by the institution.
