
Outlier dimensions that disrupt transformers are driven by frequency

Puccetti G.; Rogers A.; Drozd A.; Dell'Orletta F.
2022

Abstract

While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequency of encoded tokens in pre-training data, and it also contributes to the “vertical” self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that would better take into account the skewed token distributions.
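
As a rough illustration of the probe described above, the sketch below zeroes the LayerNorm scaling and bias at a single hidden dimension across all 12 layers of bert-base-uncased. Two LayerNorms per layer times two parameters each gives 12 x 2 x 2 = 48 scalars, matching the "48 out of 110M parameters" count in the abstract. This is a minimal sketch assuming the HuggingFace transformers library; the outlier index used is a placeholder, not a value taken from this record or from the paper.

# Minimal sketch: disable one hidden dimension's LayerNorm parameters in BERT-base.
# Assumes the HuggingFace `transformers` library; the index 308 is a hypothetical
# placeholder, the actual outlier dimensions are identified empirically in the paper.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
outlier_dim = 308  # placeholder index, for illustration only

with torch.no_grad():
    for layer in model.encoder.layer:  # 12 Transformer layers in BERT-base
        # Each layer has two LayerNorms (after self-attention and after the FFN);
        # zeroing their weight and bias at one dimension disables 12*2*2 = 48 scalars.
        for ln in (layer.attention.output.LayerNorm, layer.output.LayerNorm):
            ln.weight[outlier_dim] = 0.0
            ln.bias[outlier_dim] = 0.0

# The modified model can then be evaluated on MNLI (e.g. with a fine-tuned
# classification head) to observe the performance drop described in the abstract.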
DC Field  Value  Language
dc.authority.orgunit Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Puccetti G. en
dc.authority.people Rogers A. en
dc.authority.people Drozd A. en
dc.authority.people Dell'Orletta F. en
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contributo in Atti di convegno *
dc.contributor.appartenenza Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.appartenenza.mi 973 *
dc.contributor.area Non assegnato *
dc.contributor.area Non assegnato *
dc.date.accessioned 2024/12/23 16:45:45 -
dc.date.available 2024/12/23 16:45:45 -
dc.date.firstsubmission 2024/12/23 14:53:03 *
dc.date.issued 2022 -
dc.date.submission 2024/12/23 14:53:03 *
dc.description.abstracteng While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequency of encoded tokens in pre-training data, and it also contributes to the “vertical” self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that would better take into account the skewed token distributions. -
dc.description.allpeople Puccetti, G.; Rogers, A.; Drozd, A.; Dell'Orletta, F. -
dc.description.allpeopleoriginal Puccetti G.; Rogers A.; Drozd A.; Dell'Orletta F. en
dc.description.fulltext open en
dc.description.international si en
dc.description.numberofauthors 4 -
dc.identifier.doi 10.18653/v1/2022.findings-emnlp.93 en
dc.identifier.isbn 978-1-959429-43-2 en
dc.identifier.scopus 2-s2.0-85144872662 en
dc.identifier.source scopus *
dc.identifier.uri https://hdl.handle.net/20.500.14243/521513 -
dc.identifier.url https://aclanthology.org/2022.findings-emnlp.93/ en
dc.language.iso eng en
dc.publisher.name Association for Computational Linguistics (ACL) en
dc.relation.alleditors Goldberg Y., Kozareva Z., Zhang Y. en
dc.relation.conferencedate 2022 en
dc.relation.conferencename EMNLP 2022 - Findings of the Association for Computational Linguistics en
dc.relation.firstpage 1286 en
dc.relation.ispartofbook Findings of the Association for Computational Linguistics: EMNLP 2022 en
dc.relation.lastpage 1304 en
dc.relation.medium ELETTRONICO en
dc.relation.numberofpages 19 en
dc.subject.keywordseng Large Language Models -
dc.subject.keywordseng Mechanistic interpretability -
dc.subject.keywordseng Natural Language Processing -
dc.subject.singlekeyword Large Language Models *
dc.subject.singlekeyword Mechanistic interpretability *
dc.subject.singlekeyword Natural Language Processing *
dc.title Outlier dimensions that disrupt transformers are driven by frequency en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Contributo in convegno::04.01 Contributo in Atti di convegno it
dc.type.miur 273 -
iris.mediafilter.data 2025/04/04 04:10:54 *
iris.orcid.lastModifiedDate 2025/03/04 17:02:05 *
iris.orcid.lastModifiedMillisecond 1741104125843 *
iris.scopus.extIssued 2022 -
iris.scopus.extTitle Outlier Dimensions that Disrupt Transformers are Driven by Frequency -
iris.sitodocente.maxattempts 1 -
iris.unpaywall.bestoahost publisher *
iris.unpaywall.bestoaversion publishedVersion *
iris.unpaywall.doi 10.18653/v1/2022.findings-emnlp.93 *
iris.unpaywall.hosttype publisher *
iris.unpaywall.isoa true *
iris.unpaywall.landingpage https://doi.org/10.18653/v1/2022.findings-emnlp.93 *
iris.unpaywall.license cc-by *
iris.unpaywall.metadataCallLastModified 24/06/2025 06:49:42 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1750740582759 -
iris.unpaywall.oastatus hybrid *
iris.unpaywall.pdfurl https://aclanthology.org/2022.findings-emnlp.93.pdf *
scopus.category 1710 *
scopus.category 1706 *
scopus.category 1703 *
scopus.contributor.affiliation RIKEN Center for Computational Science -
scopus.contributor.affiliation RIKEN Center for Computational Science -
scopus.contributor.affiliation RIKEN Center for Computational Science -
scopus.contributor.affiliation NLPLab -
scopus.contributor.afid 60277300 -
scopus.contributor.afid 60277300 -
scopus.contributor.afid 60277300 -
scopus.contributor.afid 60008941 -
scopus.contributor.auid 57220748419 -
scopus.contributor.auid 57198517078 -
scopus.contributor.auid 57191439356 -
scopus.contributor.auid 57540567000 -
scopus.contributor.country Japan -
scopus.contributor.country Japan -
scopus.contributor.country Japan -
scopus.contributor.country Italy -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid 114087935 -
scopus.contributor.name Giovanni -
scopus.contributor.name Anna -
scopus.contributor.name Aleksandr -
scopus.contributor.name Felice -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation Istituto di Linguistica Computazionale “Antonio Zampolli”; -
scopus.contributor.surname Puccetti -
scopus.contributor.surname Rogers -
scopus.contributor.surname Drozd -
scopus.contributor.surname Dell'Orletta -
scopus.date.issued 2022 *
scopus.description.abstracteng While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequency of encoded tokens in pre-training data, and it also contributes to the “vertical” self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that would better take into account the skewed token distributions. *
scopus.description.allpeopleoriginal Puccetti G.; Rogers A.; Drozd A.; Dell'Orletta F. *
scopus.differences scopus.relation.conferencename *
scopus.differences scopus.identifier.isbn *
scopus.differences scopus.identifier.doi *
scopus.differences scopus.relation.conferenceplace *
scopus.document.type cp *
scopus.document.types cp *
scopus.funding.funders 501100002241 - Japan Science and Technology Agency; 501100002241 - Japan Science and Technology Agency; 501100003382 - Core Research for Evolutional Science and Technology; 501100003382 - Core Research for Evolutional Science and Technology; 501100006264 - RIKEN; 501100006264 - RIKEN; *
scopus.funding.ids JP22H03600; JPMJCR19F5; hp210265; *
scopus.identifier.doi 10.18653/v1/2022.findings-emnlp.528 *
scopus.identifier.isbn 9781959429432 *
scopus.identifier.pui 640545572 *
scopus.identifier.scopus 2-s2.0-85144872662 *
scopus.journal.sourceid 21101140399 *
scopus.language.iso eng *
scopus.publisher.name Association for Computational Linguistics (ACL) *
scopus.relation.conferencedate 2022 *
scopus.relation.conferencename 2022 Findings of the Association for Computational Linguistics: EMNLP 2022 *
scopus.relation.conferenceplace are *
scopus.relation.firstpage 1286 *
scopus.relation.lastpage 1304 *
scopus.title Outlier Dimensions that Disrupt Transformers are Driven by Frequency *
scopus.titleeng Outlier Dimensions that Disrupt Transformers are Driven by Frequency *
Appears in collections: 04.01 Contributo in Atti di convegno
Files in this item:

2022.findings-emnlp.93.pdf (open access)
Description: Outlier Dimensions that Disrupt Transformers are Driven by Frequency
Type: Published version (PDF)
License: Creative Commons
Size: 2.01 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/20.500.14243/521513
Citations
  • PMC: ND
  • Scopus: 26
  • ISI (Web of Science): ND