
Outlier dimensions that disrupt transformers are driven by frequency

Puccetti G.; Rogers A.; Drozd A.; Dell'Orletta F.
2022

Abstract

While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequency of encoded tokens in pre-training data, and it also contributes to the “vertical” self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that would better take into account the skewed token distributions.
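
As a rough illustration of the probe described above, the sketch below zeroes the LayerNorm scaling and bias at a single hidden dimension across all 12 layers of bert-base-uncased. Two LayerNorms per layer times two parameters each gives 12 x 2 x 2 = 48 scalars, matching the "48 out of 110M parameters" count in the abstract. This is a minimal sketch assuming the HuggingFace transformers library; the outlier index used is a placeholder, not a value taken from this record or from the paper.

# Minimal sketch: disable one hidden dimension's LayerNorm parameters in BERT-base.
# Assumes the HuggingFace `transformers` library; the index 308 is a hypothetical
# placeholder, the actual outlier dimensions are identified empirically in the paper.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
outlier_dim = 308  # placeholder index, for illustration only

with torch.no_grad():
    for layer in model.encoder.layer:  # 12 Transformer layers in BERT-base
        # Each layer has two LayerNorms (after self-attention and after the FFN);
        # zeroing their weight and bias at one dimension disables 12*2*2 = 48 scalars.
        for ln in (layer.attention.output.LayerNorm, layer.output.LayerNorm):
            ln.weight[outlier_dim] = 0.0
            ln.bias[outlier_dim] = 0.0

# The modified model can then be evaluated on MNLI (e.g. with a fine-tuned
# classification head) to observe the performance drop described in the abstract.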
DC Field  Value  Language
dc.authority.orgunit Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Puccetti G. en
dc.authority.people Rogers A. en
dc.authority.people Drozd A. en
dc.authority.people Dell'Orletta F. en
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contributo in Atti di convegno *
dc.contributor.appartenenza Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.appartenenza.mi 973 *
dc.contributor.area Non assegnato *
dc.contributor.area Non assegnato *
dc.date.accessioned 2024/12/23 16:45:45 -
dc.date.available 2024/12/23 16:45:45 -
dc.date.firstsubmission 2024/12/23 14:53:03 *
dc.date.issued 2022 -
dc.date.submission 2024/12/23 14:53:03 *
dc.description.abstracteng While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequency of encoded tokens in pre-training data, and it also contributes to the “vertical” self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that would better take into account the skewed token distributions. -
dc.description.allpeople Puccetti, G.; Rogers, A.; Drozd, A.; Dell'Orletta, F. -
dc.description.allpeopleoriginal Puccetti G.; Rogers A.; Drozd A.; Dell'Orletta F. en
dc.description.fulltext open en
dc.description.international si en
dc.description.numberofauthors 4 -
dc.identifier.doi 10.18653/v1/2022.findings-emnlp.93 en
dc.identifier.isbn 978-1-959429-43-2 en
dc.identifier.scopus 2-s2.0-85144872662 en
dc.identifier.source scopus *
dc.identifier.uri https://hdl.handle.net/20.500.14243/521513 -
dc.identifier.url https://aclanthology.org/2022.findings-emnlp.93/ en
dc.language.iso eng en
dc.publisher.name Association for Computational Linguistics (ACL) en
dc.relation.alleditors Goldberg Y., Kozareva Z., Zhang Y. en
dc.relation.conferencedate 2022 en
dc.relation.conferencename EMNLP 2022 - Findings of the Association for Computational Linguistics en
dc.relation.firstpage 1286 en
dc.relation.ispartofbook Findings of the Association for Computational Linguistics: EMNLP 2022 en
dc.relation.lastpage 1304 en
dc.relation.medium ELETTRONICO en
dc.relation.numberofpages 19 en
dc.subject.keywordseng Large Language Models -
dc.subject.keywordseng Mechanistic interpretability -
dc.subject.keywordseng Natural Language Processing -
dc.subject.singlekeyword Large Language Models *
dc.subject.singlekeyword Mechanistic interpretability *
dc.subject.singlekeyword Natural Language Processing *
dc.title Outlier dimensions that disrupt transformers are driven by frequency en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Contributo in convegno::04.01 Contributo in Atti di convegno it
dc.type.miur 273 -
iris.mediafilter.data 2025/04/04 04:10:54 *
iris.orcid.lastModifiedDate 2025/03/04 17:02:05 *
iris.orcid.lastModifiedMillisecond 1741104125843 *
iris.scopus.extIssued 2022 -
iris.scopus.extTitle Outlier Dimensions that Disrupt Transformers are Driven by Frequency -
iris.sitodocente.maxattempts 1 -
iris.unpaywall.bestoahost publisher *
iris.unpaywall.bestoaversion publishedVersion *
iris.unpaywall.doi 10.18653/v1/2022.findings-emnlp.93 *
iris.unpaywall.hosttype publisher *
iris.unpaywall.isoa true *
iris.unpaywall.landingpage https://doi.org/10.18653/v1/2022.findings-emnlp.93 *
iris.unpaywall.license cc-by *
iris.unpaywall.metadataCallLastModified 24/06/2025 06:49:42 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1750740582759 -
iris.unpaywall.oastatus hybrid *
iris.unpaywall.pdfurl https://aclanthology.org/2022.findings-emnlp.93.pdf *
scopus.category 1710 *
scopus.category 1706 *
scopus.category 1703 *
scopus.contributor.affiliation RIKEN Center for Computational Science -
scopus.contributor.affiliation RIKEN Center for Computational Science -
scopus.contributor.affiliation RIKEN Center for Computational Science -
scopus.contributor.affiliation NLPLab -
scopus.contributor.afid 60277300 -
scopus.contributor.afid 60277300 -
scopus.contributor.afid 60277300 -
scopus.contributor.afid 60008941 -
scopus.contributor.auid 57220748419 -
scopus.contributor.auid 57198517078 -
scopus.contributor.auid 57191439356 -
scopus.contributor.auid 57540567000 -
scopus.contributor.country Japan -
scopus.contributor.country Japan -
scopus.contributor.country Japan -
scopus.contributor.country Italy -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid 114087935 -
scopus.contributor.name Giovanni -
scopus.contributor.name Anna -
scopus.contributor.name Aleksandr -
scopus.contributor.name Felice -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation Istituto di Linguistica Computazionale “Antonio Zampolli”; -
scopus.contributor.surname Puccetti -
scopus.contributor.surname Rogers -
scopus.contributor.surname Drozd -
scopus.contributor.surname Dell'Orletta -
scopus.date.issued 2022 *
scopus.description.abstracteng While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequency of encoded tokens in pre-training data, and it also contributes to the “vertical” self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that would better take into account the skewed token distributions. *
scopus.description.allpeopleoriginal Puccetti G.; Rogers A.; Drozd A.; Dell'Orletta F. *
scopus.differences scopus.relation.conferencename *
scopus.differences scopus.identifier.isbn *
scopus.differences scopus.identifier.doi *
scopus.differences scopus.relation.conferenceplace *
scopus.document.type cp *
scopus.document.types cp *
scopus.funding.funders 501100002241 - Japan Science and Technology Agency; 501100002241 - Japan Science and Technology Agency; 501100003382 - Core Research for Evolutional Science and Technology; 501100003382 - Core Research for Evolutional Science and Technology; 501100006264 - RIKEN; 501100006264 - RIKEN; *
scopus.funding.ids JP22H03600; JPMJCR19F5; hp210265; *
scopus.identifier.doi 10.18653/v1/2022.findings-emnlp.528 *
scopus.identifier.isbn 9781959429432 *
scopus.identifier.pui 640545572 *
scopus.identifier.scopus 2-s2.0-85144872662 *
scopus.journal.sourceid 21101140399 *
scopus.language.iso eng *
scopus.publisher.name Association for Computational Linguistics (ACL) *
scopus.relation.conferencedate 2022 *
scopus.relation.conferencename 2022 Findings of the Association for Computational Linguistics: EMNLP 2022 *
scopus.relation.conferenceplace are *
scopus.relation.firstpage 1286 *
scopus.relation.lastpage 1304 *
scopus.title Outlier Dimensions that Disrupt Transformers are Driven by Frequency *
scopus.titleeng Outlier Dimensions that Disrupt Transformers are Driven by Frequency *
Appears in collections: 04.01 Contributo in Atti di convegno
Files in this item:

2022.findings-emnlp.93.pdf (open access)
Description: Outlier Dimensions that Disrupt Transformers are Driven by Frequency
Type: Published version (PDF)
License: Creative Commons
Size: 2.01 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/20.500.14243/521513
Citations
  • PMC: ND
  • Scopus: 26
  • ISI (Web of Science): ND