
Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors

Pedrotti A.; Papucci M.; Ciaccio C.; Miaschi A.; Puccetti G.; Dell'Orletta F.; Esuli A.
2025

Abstract

Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we evaluate the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. We develop a pipeline that fine-tunes language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT), obtaining generations more challenging to detect by current models. Additionally, we analyze the linguistic shifts induced by the alignment and how detectors rely on “linguistic shortcuts” to detect texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detecting performances. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts. We release code, models, and data to support future research on more robust MGT detection benchmarks.
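The paper's full pipeline is not reproduced in this record, but the DPO objective it builds on can be sketched. A minimal sketch, assuming human-written texts serve as the preferred ("chosen") completions and the model's own generations as "rejected"; the function, argument names, and log-probability values below are illustrative, not taken from the paper's code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is the summed token log-probability of a full
    response under the trained policy or the frozen reference model.
    Loss = -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written stably as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# One preference pair: a human-written text as "chosen", a machine
# generation as "rejected"; the log-probs here are made-up numbers.
loss = dpo_loss(policy_chosen_logp=-42.0, policy_rejected_logp=-40.0,
                ref_chosen_logp=-45.0, ref_rejected_logp=-39.5)
```

Minimizing this loss pushes the policy to assign relatively higher likelihood to human-written style than its reference model does, which is the style shift the abstract describes.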
DC Field Value Language
dc.authority.orgunit Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Pedrotti A. en
dc.authority.people Papucci M. en
dc.authority.people Ciaccio C. en
dc.authority.people Miaschi A. en
dc.authority.people Puccetti G. en
dc.authority.people Dell'Orletta F. en
dc.authority.people Esuli A. en
dc.authority.project corda__h2020::e7f5e7755409fc74eea9d168ab795634 en
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contributo in Atti di convegno *
dc.contributor.appartenenza Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.appartenenza.mi 973 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.accessioned 2025/09/29 15:27:29 -
dc.date.available 2025/09/29 15:27:29 -
dc.date.firstsubmission 2025/09/29 15:26:49 *
dc.date.issued 2025 -
dc.date.submission 2025/09/29 15:26:49 *
dc.description.abstracteng Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we evaluate the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. We develop a pipeline that fine-tunes language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT), obtaining generations more challenging to detect by current models. Additionally, we analyze the linguistic shifts induced by the alignment and how detectors rely on “linguistic shortcuts” to detect texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detecting performances. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts. We release code, models, and data to support future research on more robust MGT detection benchmarks. -
dc.description.allpeople Pedrotti, A.; Papucci, M.; Ciaccio, C.; Miaschi, A.; Puccetti, G.; Dell'Orletta, F.; Esuli, A. -
dc.description.allpeopleoriginal Pedrotti A.; Papucci M.; Ciaccio C.; Miaschi A.; Puccetti G.; Dell'Orletta F.; Esuli A. en
dc.description.fulltext open en
dc.description.numberofauthors 7 -
dc.identifier.doi 10.18653/v1/2025.findings-acl.156 en
dc.identifier.isbn 979-8-89176-256-5 en
dc.identifier.scopus 2-s2.0-105028618911 -
dc.identifier.source crossref *
dc.identifier.uri https://hdl.handle.net/20.500.14243/554367 -
dc.identifier.url https://aclanthology.org/2025.findings-acl.156/ en
dc.language.iso eng en
dc.publisher.name Association for Computational Linguistics en
dc.relation.allauthors Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar (eds.) en
dc.relation.conferencedate 27/07-01/08/2025 en
dc.relation.conferencename ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics. Findings en
dc.relation.conferenceplace Vienna, Austria en
dc.relation.firstpage 3010 en
dc.relation.ispartofbook Findings of the Association for Computational Linguistics: ACL 2025 en
dc.relation.lastpage 3031 en
dc.relation.medium ELETTRONICO en
dc.relation.numberofpages 22 en
dc.relation.projectAcronym SoBigData en
dc.relation.projectAwardNumber 654024 en
dc.relation.projectAwardTitle SoBigData Research Infrastructure en
dc.relation.projectFunderName European Commission en
dc.relation.projectFundingStream Horizon 2020 Framework Programme en
dc.subject.keywordseng machine-generated text detection, synthetic content detection -
dc.subject.singlekeyword machine-generated text detection *
dc.subject.singlekeyword synthetic content detection *
dc.title Stress-testing machine generated text detection: shifting language models writing style to fool detectors en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Contributo in convegno::04.01 Contributo in Atti di convegno it
dc.type.miur 273 -
iris.mediafilter.data 2025/09/30 03:37:07 *
iris.orcid.lastModifiedDate 2026/04/20 15:05:00 *
iris.orcid.lastModifiedMillisecond 1776690300512 *
iris.scopus.extIssued 2025 -
iris.scopus.extTitle Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors -
iris.scopus.ideLinkStatusDate 2026/04/20 15:05:00 *
iris.scopus.ideLinkStatusMillisecond 1776690300555 *
iris.sitodocente.maxattempts 1 -
iris.unpaywall.bestoaversion publishedVersion *
iris.unpaywall.doi 10.18653/v1/2025.findings-acl.156 *
iris.unpaywall.isoa true *
iris.unpaywall.journalisindoaj false *
iris.unpaywall.landingpage https://doi.org/10.18653/v1/2025.findings-acl.156 *
iris.unpaywall.license cc-by *
iris.unpaywall.metadataCallLastModified 28/04/2026 05:03:35 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1777345415827 -
iris.unpaywall.oastatus gold *
iris.unpaywall.pdfurl https://aclanthology.org/2025.findings-acl.156.pdf *
scopus.authority.anceserie PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING###0736-587X *
scopus.category 1203 *
scopus.category 3310 *
scopus.category 1706 *
scopus.contributor.affiliation Istituto di Scienza e Tecnologie dell'Informazione “A. Faedo” (CNR-ISTI) -
scopus.contributor.affiliation Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC) -
scopus.contributor.affiliation Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC) -
scopus.contributor.affiliation Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC) -
scopus.contributor.affiliation Istituto di Scienza e Tecnologie dell'Informazione “A. Faedo” (CNR-ISTI) -
scopus.contributor.affiliation Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC) -
scopus.contributor.affiliation Istituto di Scienza e Tecnologie dell'Informazione “A. Faedo” (CNR-ISTI) -
scopus.contributor.afid 60085207 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60085207 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60085207 -
scopus.contributor.auid 57223141523 -
scopus.contributor.auid 57991631200 -
scopus.contributor.auid 59504212000 -
scopus.contributor.auid 57211678681 -
scopus.contributor.auid 57220748419 -
scopus.contributor.auid 57540567000 -
scopus.contributor.auid 15044356100 -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.dptid -
scopus.contributor.dptid 114087935 -
scopus.contributor.dptid 114087935 -
scopus.contributor.dptid 114087935 -
scopus.contributor.dptid -
scopus.contributor.dptid 114087935 -
scopus.contributor.dptid -
scopus.contributor.name Andrea -
scopus.contributor.name Michele -
scopus.contributor.name Cristiano -
scopus.contributor.name Alessio -
scopus.contributor.name Giovanni -
scopus.contributor.name Felice -
scopus.contributor.name Andrea -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation ItaliaNLP Lab; -
scopus.contributor.subaffiliation ItaliaNLP Lab; -
scopus.contributor.subaffiliation ItaliaNLP Lab; -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation ItaliaNLP Lab; -
scopus.contributor.subaffiliation -
scopus.contributor.surname Pedrotti -
scopus.contributor.surname Papucci -
scopus.contributor.surname Ciaccio -
scopus.contributor.surname Miaschi -
scopus.contributor.surname Puccetti -
scopus.contributor.surname Dell'Orletta -
scopus.contributor.surname Esuli -
scopus.date.issued 2025 *
scopus.description.abstracteng Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic clues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and which features are used by detectors to detect MGT texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts. We release code, models, and data to support future research on more robust MGT detection benchmarks. *
scopus.description.allpeopleoriginal Pedrotti A.; Papucci M.; Ciaccio C.; Miaschi A.; Puccetti G.; Dell'Orletta F.; Esuli A. *
scopus.differences scopus.authority.anceserie *
scopus.differences scopus.publisher.name *
scopus.differences scopus.relation.conferencedate *
scopus.differences scopus.description.abstracteng *
scopus.differences scopus.relation.conferencename *
scopus.differences scopus.identifier.isbn *
scopus.differences scopus.relation.conferenceplace *
scopus.document.type cp *
scopus.document.types cp *
scopus.funding.funders 501100021856 - Ministero dell'Università e della Ricerca; 501100021856 - Ministero dell'Università e della Ricerca; 501100000780 - European Commission; 501100000780 - European Commission; 100031478 - NextGenerationEU; 100031478 - NextGenerationEU; *
scopus.funding.ids CUP B53C22001770006; XAI-CARE-PNRR-MAD-2022-12376692; CUP B53D23013050006; CUP B53C22001760006; PE0000013-FAIR; *
scopus.identifier.doi 10.18653/v1/2025.findings-acl.156 *
scopus.identifier.isbn 9798891762565 *
scopus.identifier.pui 650042695 *
scopus.identifier.scopus 2-s2.0-105028618911 *
scopus.journal.sourceid 21101138302 *
scopus.language.iso eng *
scopus.publisher.name Association for Computational Linguistics (ACL) *
scopus.relation.conferencedate 2025 *
scopus.relation.conferencename 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 *
scopus.relation.conferenceplace aut *
scopus.relation.firstpage 3010 *
scopus.relation.lastpage 3031 *
scopus.title Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors *
scopus.titleeng Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors *
Appears in collections: 04.01 Contributo in Atti di convegno
Files in this item:
File: Pedrotti et al_ACL Findings-2025.pdf (open access)
Description: Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors
Type: Published version (PDF)
License: Other license type
Size: 798.04 kB
Format: Adobe PDF (View/Open)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/554367
Citations
  • PMC: n/a
  • Scopus: 2
  • Web of Science: n/a