
AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian

Puccetti G.; Rogers A.; Alzetta C.; Dell'Orletta F.; Esuli A.
2024

Abstract

Large Language Models (LLMs) are increasingly used as 'content farm' models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real 'content farm'. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem.
DC field Value Language
dc.authority.anceserie PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.orgunit Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI en
dc.authority.people Puccetti G. en
dc.authority.people Rogers A. en
dc.authority.people Alzetta C. en
dc.authority.people Dell'Orletta F. en
dc.authority.people Esuli A. en
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contributo in Atti di convegno *
dc.contributor.appartenenza Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.appartenenza.mi 973 *
dc.contributor.area Not assigned *
dc.contributor.area Not assigned *
dc.contributor.area Not assigned *
dc.contributor.area Not assigned *
dc.date.accessioned 2024/12/19 16:13:12 -
dc.date.available 2024/12/19 16:13:12 -
dc.date.firstsubmission 2024/12/18 16:50:48 *
dc.date.issued 2024 -
dc.date.submission 2024/12/18 16:50:48 *
dc.description.abstracteng Large Language Models (LLMs) are increasingly used as 'content farm' models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real 'content farm'. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem. -
dc.description.allpeople Puccetti, G.; Rogers, A.; Alzetta, C.; Dell'Orletta, F.; Esuli, A. -
dc.description.allpeopleoriginal Puccetti G.; Rogers A.; Alzetta C.; Dell'Orletta F.; Esuli A. en
dc.description.fulltext open en
dc.description.numberofauthors 5 -
dc.identifier.doi 10.18653/v1/2024.acl-long.817 en
dc.identifier.isi WOS:001391776306025 -
dc.identifier.scopus 2-s2.0-85204461442 en
dc.identifier.source scopus *
dc.identifier.uri https://hdl.handle.net/20.500.14243/519993 -
dc.identifier.url https://aclanthology.org/2024.acl-long.817/ en
dc.language.iso eng en
dc.publisher.name Association for Computational Linguistics (ACL) en
dc.relation.conferencedate 2024 en
dc.relation.conferencename ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics en
dc.relation.conferenceplace tha en
dc.relation.firstpage 15312 en
dc.relation.ispartofbook Proceedings of the Annual Meeting of the Association for Computational Linguistics en
dc.relation.lastpage 15338 en
dc.relation.numberofpages 27 en
dc.relation.volume 1 en
dc.subject.keywordseng Large Language Models (LLMs) -
dc.subject.keywordseng Detecting synthetic texts -
dc.subject.singlekeyword Large Language Models (LLMs) *
dc.subject.singlekeyword Detecting synthetic texts *
dc.title AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Contributo in convegno::04.01 Contributo in Atti di convegno it
dc.type.miur 273 -
iris.isi.extIssued 2024 -
iris.isi.extTitle AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian -
iris.mediafilter.data 2025/03/19 03:35:27 *
iris.orcid.lastModifiedDate 2025/03/16 11:48:07 *
iris.orcid.lastModifiedMillisecond 1742122087860 *
iris.scopus.extIssued 2024 -
iris.scopus.extTitle AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian -
iris.sitodocente.maxattempts 1 -
iris.unpaywall.bestoaversion publishedVersion *
iris.unpaywall.doi 10.18653/v1/2024.acl-long.817 *
iris.unpaywall.isoa true *
iris.unpaywall.journalisindoaj false *
iris.unpaywall.landingpage https://doi.org/10.18653/v1/2024.acl-long.817 *
iris.unpaywall.license cc-by *
iris.unpaywall.metadataCallLastModified 29/04/2026 05:53:50 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1777434830651 -
iris.unpaywall.oastatus gold *
isi.authority.sdg Goal 3: Good health and well-being###12083 *
isi.category EV *
isi.category EX *
isi.category EP *
isi.contributor.affiliation Consiglio Nazionale delle Ricerche (CNR) -
isi.contributor.affiliation IT University Copenhagen -
isi.contributor.affiliation Consiglio Nazionale delle Ricerche (CNR) -
isi.contributor.affiliation Consiglio Nazionale delle Ricerche (CNR) -
isi.contributor.affiliation Consiglio Nazionale delle Ricerche (CNR) -
isi.contributor.country Italy -
isi.contributor.country Denmark -
isi.contributor.country Italy -
isi.contributor.country Italy -
isi.contributor.country Italy -
isi.contributor.name Giovanni -
isi.contributor.name Anna -
isi.contributor.name Chiara -
isi.contributor.name Felice -
isi.contributor.name Andrea -
isi.contributor.researcherId MIO-0767-2025 -
isi.contributor.researcherId KGX-6755-2024 -
isi.contributor.researcherId KVX-9760-2024 -
isi.contributor.researcherId AAX-1864-2020 -
isi.contributor.researcherId B-6343-2015 -
isi.contributor.subaffiliation -
isi.contributor.subaffiliation -
isi.contributor.subaffiliation Ist Linguist Computaz Antonio Zampolli -
isi.contributor.subaffiliation Ist Linguist Computaz Antonio Zampolli -
isi.contributor.subaffiliation -
isi.contributor.surname Puccetti -
isi.contributor.surname Rogers -
isi.contributor.surname Alzetta -
isi.contributor.surname Dell'Orletta -
isi.contributor.surname Esuli -
isi.date.issued 2024 *
isi.description.abstracteng Large Language Models (LLMs) are increasingly used as 'content farm' models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real 'content farm'. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem. *
isi.description.allpeopleoriginal Puccetti, G; Rogers, A; Alzetta, C; Dell'Orletta, F; Esuli, A; *
isi.document.sourcetype WOS.ISTP *
isi.document.type Proceedings Paper *
isi.document.types Proceedings Paper *
isi.identifier.isi WOS:001391776306025 *
isi.journal.journaltitle PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS *
isi.language.original English *
isi.publisher.place 209 N EIGHTH STREET, STROUDSBURG, PA 18360 USA *
isi.relation.firstpage 15312 *
isi.relation.lastpage 15338 *
isi.title AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian *
scopus.authority.anceserie PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING###0736-587X *
scopus.category 1203 *
scopus.category 3310 *
scopus.category 1706 *
scopus.contributor.affiliation Istituto di Scienza e Tecnologia dell'Informazione “A. Faedo” -
scopus.contributor.affiliation IT University of Copenhagen -
scopus.contributor.affiliation Istituto di Linguistica Computazionale “Antonio Zampolli” -
scopus.contributor.affiliation Istituto di Linguistica Computazionale “Antonio Zampolli” -
scopus.contributor.affiliation Istituto di Scienza e Tecnologia dell'Informazione “A. Faedo” -
scopus.contributor.afid 131428502 -
scopus.contributor.afid 60018567 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 60008941 -
scopus.contributor.afid 131428502 -
scopus.contributor.auid 57220748419 -
scopus.contributor.auid 57198517078 -
scopus.contributor.auid 57192938832 -
scopus.contributor.auid 57540567000 -
scopus.contributor.auid 15044356100 -
scopus.contributor.country Italy -
scopus.contributor.country Denmark -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.dptid -
scopus.contributor.dptid -
scopus.contributor.dptid 114087935 -
scopus.contributor.dptid 114087935 -
scopus.contributor.dptid -
scopus.contributor.name Giovanni -
scopus.contributor.name Anna -
scopus.contributor.name Chiara -
scopus.contributor.name Felice -
scopus.contributor.name Andrea -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation -
scopus.contributor.subaffiliation ItaliaNLP Lab; -
scopus.contributor.subaffiliation ItaliaNLP Lab; -
scopus.contributor.subaffiliation -
scopus.contributor.surname Puccetti -
scopus.contributor.surname Rogers -
scopus.contributor.surname Alzetta -
scopus.contributor.surname Dell'Orletta -
scopus.contributor.surname Esuli -
scopus.date.issued 2024 *
scopus.description.abstracteng Large Language Models (LLMs) are increasingly used as 'content farm' models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real 'content farm'. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem. *
scopus.description.allpeopleoriginal Puccetti G.; Rogers A.; Alzetta C.; Dell'Orletta F.; Esuli A. *
scopus.differences scopus.relation.conferencename *
scopus.differences scopus.identifier.isbn *
scopus.document.type cp *
scopus.document.types cp *
scopus.identifier.doi 10.18653/v1/2024.acl-long.817 *
scopus.identifier.isbn 9798891760943 *
scopus.identifier.pui 645308969 *
scopus.identifier.scopus 2-s2.0-85204461442 *
scopus.journal.sourceid 21101138302 *
scopus.language.iso eng *
scopus.publisher.name Association for Computational Linguistics (ACL) *
scopus.relation.conferencedate 2024 *
scopus.relation.conferencename 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 *
scopus.relation.conferenceplace tha *
scopus.relation.firstpage 15312 *
scopus.relation.lastpage 15338 *
scopus.relation.volume 1 *
scopus.title AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian *
scopus.titleeng AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian *
Appears in types: 04.01 Contributo in Atti di convegno
Files in this item:
File Size Format
2024.acl-long.817.pdf

open access

Description: AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian
Type: Published version (PDF)
License: Creative Commons
Size: 1.55 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/20.500.14243/519993
Citations
  • Scopus: 6
  • Web of Science: 1