
Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors

Pedrotti A.; Papucci M.; Ciaccio C.; Miaschi A.; Puccetti G.; Dell'Orletta F.; Esuli A.
2025

Abstract

Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we evaluate the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. We develop a pipeline that fine-tunes language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT), obtaining generations more challenging to detect by current models. Additionally, we analyze the linguistic shifts induced by the alignment and how detectors rely on “linguistic shortcuts” to detect texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detecting performances. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts. We release code, models, and data to support future research on more robust MGT detection benchmarks.
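The paper's full pipeline is not reproduced in this record, but the DPO objective it builds on can be sketched. A minimal sketch, assuming human-written texts serve as the preferred ("chosen") completions and the model's own generations as "rejected"; the function, argument names, and log-probability values below are illustrative, not taken from the paper's code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is the summed token log-probability of a full
    response under the trained policy or the frozen reference model.
    Loss = -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written stably as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# One preference pair: a human-written text as "chosen", a machine
# generation as "rejected"; the log-probs here are made-up numbers.
loss = dpo_loss(policy_chosen_logp=-42.0, policy_rejected_logp=-40.0,
                ref_chosen_logp=-45.0, ref_rejected_logp=-39.5)
```

Minimizing this loss pushes the policy to assign relatively higher likelihood to human-written style than its reference model does, which is the style shift the abstract describes.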
DC Field Value Language
dc.authority.orgunit Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Pedrotti A. en
dc.authority.people Papucci M. en
dc.authority.people Ciaccio C. en
dc.authority.people Miaschi A. en
dc.authority.people Puccetti G. en
dc.authority.people Dell'Orletta F. en
dc.authority.people Esuli A. en
dc.authority.project corda__h2020::e7f5e7755409fc74eea9d168ab795634 en
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contributo in Atti di convegno *
dc.contributor.appartenenza Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.appartenenza.mi 973 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.accessioned 2025/09/29 15:27:29 -
dc.date.available 2025/09/29 15:27:29 -
dc.date.firstsubmission 2025/09/29 15:26:49 *
dc.date.issued 2025 -
dc.date.submission 2025/09/29 15:26:49 *
dc.description.abstracteng Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we evaluate the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. We develop a pipeline that fine-tunes language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT), obtaining generations more challenging to detect by current models. Additionally, we analyze the linguistic shifts induced by the alignment and how detectors rely on “linguistic shortcuts” to detect texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detecting performances. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts. We release code, models, and data to support future research on more robust MGT detection benchmarks. -
dc.description.allpeople Pedrotti, A.; Papucci, M.; Ciaccio, C.; Miaschi, A.; Puccetti, G.; Dell'Orletta, F.; Esuli, A. -
dc.description.allpeopleoriginal Pedrotti A.; Papucci M.; Ciaccio C.; Miaschi A.; Puccetti G.; Dell'Orletta F.; Esuli A. en
dc.description.fulltext open en
dc.description.numberofauthors 7 -
dc.identifier.doi 10.18653/v1/2025.findings-acl.156 en
dc.identifier.isbn 979-8-89176-256-5 en
dc.identifier.scopus 2-s2.0-105028618911 -
dc.identifier.source crossref *
dc.identifier.uri https://hdl.handle.net/20.500.14243/554367 -
dc.identifier.url https://aclanthology.org/2025.findings-acl.156/ en
dc.language.iso eng en
dc.publisher.name Association for Computational Linguistics en
dc.relation.allauthors Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar (eds.) en
dc.relation.conferencedate 27/07-01/08/2025 en
dc.relation.conferencename ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Findings en
dc.relation.conferenceplace Vienna, Austria en
dc.relation.firstpage 3010 en
dc.relation.ispartofbook Findings of the Association for Computational Linguistics: ACL 2025 en
dc.relation.lastpage 3031 en
dc.relation.medium ELETTRONICO en
dc.relation.numberofpages 22 en
dc.relation.projectAcronym SoBigData en
dc.relation.projectAwardNumber 654024 en
dc.relation.projectAwardTitle SoBigData Research Infrastructure en
dc.relation.projectFunderName European Commission en
dc.relation.projectFundingStream Horizon 2020 Framework Programme en
dc.subject.keywordseng machine-generated text detection, synthetic content detection -
dc.subject.singlekeyword machine-generated text detection *
dc.subject.singlekeyword synthetic content detection *
dc.title Stress-testing machine generated text detection: shifting language models writing style to fool detectors en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Contributo in convegno::04.01 Contributo in Atti di convegno it
dc.type.miur 273 -
iris.mediafilter.data 2025/09/30 03:37:07 *
iris.orcid.lastModifiedDate 2026/04/20 15:05:00 *
iris.orcid.lastModifiedMillisecond 1776690300512 *
iris.scopus.extIssued 2025 -
iris.scopus.extTitle Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors -
iris.scopus.ideLinkStatusDate 2026/04/20 15:05:00 *
iris.scopus.ideLinkStatusMillisecond 1776690300555 *
iris.sitodocente.maxattempts 1 -
iris.unpaywall.bestoaversion publishedVersion *
iris.unpaywall.doi 10.18653/v1/2025.findings-acl.156 *
iris.unpaywall.isoa true *
iris.unpaywall.journalisindoaj false *
iris.unpaywall.landingpage https://doi.org/10.18653/v1/2025.findings-acl.156 *
iris.unpaywall.license cc-by *
iris.unpaywall.metadataCallLastModified 28/04/2026 05:03:35 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1777345415827 -
iris.unpaywall.oastatus gold *
iris.unpaywall.pdfurl https://aclanthology.org/2025.findings-acl.156.pdf *
scopus.authority.anceserie PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING###0736-587X *
scopus.category 1203 *
scopus.category 3310 *
scopus.category 1706 *
scopus.contributor.affiliation Istituto di Scienza e Tecnologie dell'Informazione “A. Faedo” (CNR-ISTI) -
scopus.contributor.affiliation Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC) -
scopus.contributor.affiliation Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC) -
scopus.contributor.affiliation Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC) -
scopus.contributor.affiliation Istituto di Scienza e Tecnologie dell'Informazione “A. Faedo” (CNR-ISTI) -
scopus.contributor.affiliation Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC) -
scopus.contributor.affiliation Istituto di Scienza e Tecnologie dell'Informazione “A. Faedo” (CNR-ISTI) -
scopus.contributor.afid 60085207 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60085207 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60085207 -
scopus.contributor.auid 57223141523 -
scopus.contributor.auid 57991631200 -
scopus.contributor.auid 59504212000 -
scopus.contributor.auid 57211678681 -
scopus.contributor.auid 57220748419 -
scopus.contributor.auid 57540567000 -
scopus.contributor.auid 15044356100 -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.dptid -
scopus.contributor.dptid 114087935 -
scopus.contributor.dptid 114087935 -
scopus.contributor.dptid 114087935 -
scopus.contributor.dptid -
scopus.contributor.dptid 114087935 -
scopus.contributor.dptid -
scopus.contributor.name Andrea -
scopus.contributor.name Michele -
scopus.contributor.name Cristiano -
scopus.contributor.name Alessio -
scopus.contributor.name Giovanni -
scopus.contributor.name Felice -
scopus.contributor.name Andrea -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation ItaliaNLP Lab; -
scopus.contributor.subaffiliation ItaliaNLP Lab; -
scopus.contributor.subaffiliation ItaliaNLP Lab; -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation ItaliaNLP Lab; -
scopus.contributor.subaffiliation -
scopus.contributor.surname Pedrotti -
scopus.contributor.surname Papucci -
scopus.contributor.surname Ciaccio -
scopus.contributor.surname Miaschi -
scopus.contributor.surname Puccetti -
scopus.contributor.surname Dell'Orletta -
scopus.contributor.surname Esuli -
scopus.date.issued 2025 *
scopus.description.abstracteng Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic clues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and which features are used by detectors to detect MGT texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts. We release code, models, and data to support future research on more robust MGT detection benchmarks. *
scopus.description.allpeopleoriginal Pedrotti A.; Papucci M.; Ciaccio C.; Miaschi A.; Puccetti G.; Dell'Orletta F.; Esuli A. *
scopus.differences scopus.authority.anceserie *
scopus.differences scopus.publisher.name *
scopus.differences scopus.relation.conferencedate *
scopus.differences scopus.description.abstracteng *
scopus.differences scopus.relation.conferencename *
scopus.differences scopus.identifier.isbn *
scopus.differences scopus.relation.conferenceplace *
scopus.document.type cp *
scopus.document.types cp *
scopus.funding.funders 501100021856 - Ministero dell'Università e della Ricerca; 501100021856 - Ministero dell'Università e della Ricerca; 501100000780 - European Commission; 501100000780 - European Commission; 100031478 - NextGenerationEU; 100031478 - NextGenerationEU; *
scopus.funding.ids CUP B53C22001770006; XAI-CARE-PNRR-MAD-2022-12376692; CUP B53D23013050006; CUP B53C22001760006; PE0000013-FAIR; *
scopus.identifier.doi 10.18653/v1/2025.findings-acl.156 *
scopus.identifier.isbn 9798891762565 *
scopus.identifier.pui 650042695 *
scopus.identifier.scopus 2-s2.0-105028618911 *
scopus.journal.sourceid 21101138302 *
scopus.language.iso eng *
scopus.publisher.name Association for Computational Linguistics (ACL) *
scopus.relation.conferencedate 2025 *
scopus.relation.conferencename 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 *
scopus.relation.conferenceplace aut *
scopus.relation.firstpage 3010 *
scopus.relation.lastpage 3031 *
scopus.title Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors *
scopus.titleeng Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors *
Appears in collections: 04.01 Contributo in Atti di convegno
Files in this item:
File: Pedrotti et al_ACL Findings-2025.pdf (open access)
Description: Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors
Type: Published version (PDF)
License: Other license type
Size: 798.04 kB
Format: Adobe PDF (View/Open)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/554367
Citations
  • PMC: n/a
  • Scopus: 2
  • Web of Science: n/a