AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian
Puccetti G.; Rogers A.; Alzetta C.; Dell'Orletta F.; Esuli A.
2024
Abstract
Large Language Models (LLMs) are increasingly used as 'content farm' models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real 'content farm'. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem.

| DC Field | Value | Language |
|---|---|---|
| dc.authority.anceserie | PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING | en |
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | en |
| dc.authority.orgunit | Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI | en |
| dc.authority.people | Puccetti G. | en |
| dc.authority.people | Rogers A. | en |
| dc.authority.people | Alzetta C. | en |
| dc.authority.people | Dell'Orletta F. | en |
| dc.authority.people | Esuli A. | en |
| dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | * |
| dc.collection.name | 04.01 Contributo in Atti di convegno | * |
| dc.contributor.appartenenza | Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.contributor.appartenenza.mi | 973 | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.date.accessioned | 2024/12/19 16:13:12 | - |
| dc.date.available | 2024/12/19 16:13:12 | - |
| dc.date.firstsubmission | 2024/12/18 16:50:48 | * |
| dc.date.issued | 2024 | - |
| dc.date.submission | 2024/12/18 16:50:48 | * |
| dc.description.abstracteng | Large Language Models (LLMs) are increasingly used as 'content farm' models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real 'content farm'. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem. | - |
| dc.description.allpeople | Puccetti, G.; Rogers, A.; Alzetta, C.; Dell'Orletta, F.; Esuli, A. | - |
| dc.description.allpeopleoriginal | Puccetti G.; Rogers A.; Alzetta C.; Dell'Orletta F.; Esuli A. | en |
| dc.description.fulltext | open | en |
| dc.description.numberofauthors | 5 | - |
| dc.identifier.doi | 10.18653/v1/2024.acl-long.817 | en |
| dc.identifier.isi | WOS:001391776306025 | - |
| dc.identifier.scopus | 2-s2.0-85204461442 | en |
| dc.identifier.source | scopus | * |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/519993 | - |
| dc.identifier.url | https://aclanthology.org/2024.acl-long.817/ | en |
| dc.language.iso | eng | en |
| dc.publisher.name | Association for Computational Linguistics (ACL) | en |
| dc.relation.conferencedate | 2024 | en |
| dc.relation.conferencename | ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics | en |
| dc.relation.conferenceplace | tha | en |
| dc.relation.firstpage | 15312 | en |
| dc.relation.ispartofbook | Proceedings of the Annual Meeting of the Association for Computational Linguistics | en |
| dc.relation.lastpage | 15338 | en |
| dc.relation.numberofpages | 27 | en |
| dc.relation.volume | 1 | en |
| dc.subject.keywordseng | Large Language Models (LLMs) | - |
| dc.subject.keywordseng | Detecting synthetic texts | - |
| dc.subject.singlekeyword | Large Language Models (LLMs) | * |
| dc.subject.singlekeyword | Detecting synthetic texts | * |
| dc.title | AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian | en |
| dc.type.driver | info:eu-repo/semantics/conferenceObject | - |
| dc.type.full | 04 Contributo in convegno::04.01 Contributo in Atti di convegno | it |
| dc.type.miur | 273 | - |
| iris.isi.extIssued | 2024 | - |
| iris.isi.extTitle | AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian | - |
| iris.mediafilter.data | 2025/03/19 03:35:27 | * |
| iris.orcid.lastModifiedDate | 2025/03/16 11:48:07 | * |
| iris.orcid.lastModifiedMillisecond | 1742122087860 | * |
| iris.scopus.extIssued | 2024 | - |
| iris.scopus.extTitle | AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian | - |
| iris.sitodocente.maxattempts | 1 | - |
| iris.unpaywall.bestoaversion | publishedVersion | * |
| iris.unpaywall.doi | 10.18653/v1/2024.acl-long.817 | * |
| iris.unpaywall.isoa | true | * |
| iris.unpaywall.journalisindoaj | false | * |
| iris.unpaywall.landingpage | https://doi.org/10.18653/v1/2024.acl-long.817 | * |
| iris.unpaywall.license | cc-by | * |
| iris.unpaywall.metadataCallLastModified | 29/04/2026 05:53:50 | - |
| iris.unpaywall.metadataCallLastModifiedMillisecond | 1777434830651 | - |
| iris.unpaywall.oastatus | gold | * |
| isi.authority.sdg | Goal 3: Good health and well-being###12083 | * |
| isi.category | EV | * |
| isi.category | EX | * |
| isi.category | EP | * |
| isi.contributor.affiliation | Consiglio Nazionale delle Ricerche (CNR) | - |
| isi.contributor.affiliation | IT University Copenhagen | - |
| isi.contributor.affiliation | Consiglio Nazionale delle Ricerche (CNR) | - |
| isi.contributor.affiliation | Consiglio Nazionale delle Ricerche (CNR) | - |
| isi.contributor.affiliation | Consiglio Nazionale delle Ricerche (CNR) | - |
| isi.contributor.country | Italy | - |
| isi.contributor.country | Denmark | - |
| isi.contributor.country | Italy | - |
| isi.contributor.country | Italy | - |
| isi.contributor.country | Italy | - |
| isi.contributor.name | Giovanni | - |
| isi.contributor.name | Anna | - |
| isi.contributor.name | Chiara | - |
| isi.contributor.name | Felice | - |
| isi.contributor.name | Andrea | - |
| isi.contributor.researcherId | MIO-0767-2025 | - |
| isi.contributor.researcherId | KGX-6755-2024 | - |
| isi.contributor.researcherId | KVX-9760-2024 | - |
| isi.contributor.researcherId | AAX-1864-2020 | - |
| isi.contributor.researcherId | B-6343-2015 | - |
| isi.contributor.subaffiliation |  | - |
| isi.contributor.subaffiliation |  | - |
| isi.contributor.subaffiliation | Ist Linguist Computaz Antonio Zampolli | - |
| isi.contributor.subaffiliation | Ist Linguist Computaz Antonio Zampolli | - |
| isi.contributor.subaffiliation |  | - |
| isi.contributor.surname | Puccetti | - |
| isi.contributor.surname | Rogers | - |
| isi.contributor.surname | Alzetta | - |
| isi.contributor.surname | Dell'Orletta | - |
| isi.contributor.surname | Esuli | - |
| isi.date.issued | 2024 | * |
| isi.description.abstracteng | Large Language Models (LLMs) are increasingly used as 'content farm' models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real 'content farm'. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem. | * |
| isi.description.allpeopleoriginal | Puccetti, G; Rogers, A; Alzetta, C; Dell'Orletta, F; Esuli, A; | * |
| isi.document.sourcetype | WOS.ISTP | * |
| isi.document.type | Proceedings Paper | * |
| isi.document.types | Proceedings Paper | * |
| isi.identifier.isi | WOS:001391776306025 | * |
| isi.journal.journaltitle | PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS | * |
| isi.language.original | English | * |
| isi.publisher.place | 209 N EIGHTH STREET, STROUDSBURG, PA 18360 USA | * |
| isi.relation.firstpage | 15312 | * |
| isi.relation.lastpage | 15338 | * |
| isi.title | AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian | * |
| scopus.authority.anceserie | PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING###0736-587X | * |
| scopus.category | 1203 | * |
| scopus.category | 3310 | * |
| scopus.category | 1706 | * |
| scopus.contributor.affiliation | Istituto di Scienza e Tecnologia dell'Informazione “A. Faedo” | - |
| scopus.contributor.affiliation | IT University of Copenhagen | - |
| scopus.contributor.affiliation | Istituto di Linguistica Computazionale “Antonio Zampolli” | - |
| scopus.contributor.affiliation | Istituto di Linguistica Computazionale “Antonio Zampolli” | - |
| scopus.contributor.affiliation | Istituto di Scienza e Tecnologia dell'Informazione “A. Faedo” | - |
| scopus.contributor.afid | 131428502 | - |
| scopus.contributor.afid | 60018567 | - |
| scopus.contributor.afid | 60008941 | - |
| scopus.contributor.afid | 60008941 | - |
| scopus.contributor.afid | 131428502 | - |
| scopus.contributor.auid | 57220748419 | - |
| scopus.contributor.auid | 57198517078 | - |
| scopus.contributor.auid | 57192938832 | - |
| scopus.contributor.auid | 57540567000 | - |
| scopus.contributor.auid | 15044356100 | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Denmark | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.dptid |  | - |
| scopus.contributor.dptid |  | - |
| scopus.contributor.dptid | 114087935 | - |
| scopus.contributor.dptid | 114087935 | - |
| scopus.contributor.dptid |  | - |
| scopus.contributor.name | Giovanni | - |
| scopus.contributor.name | Anna | - |
| scopus.contributor.name | Chiara | - |
| scopus.contributor.name | Felice | - |
| scopus.contributor.name | Andrea | - |
| scopus.contributor.subaffiliation |  | - |
| scopus.contributor.subaffiliation |  | - |
| scopus.contributor.subaffiliation | ItaliaNLP Lab; | - |
| scopus.contributor.subaffiliation | ItaliaNLP Lab; | - |
| scopus.contributor.subaffiliation |  | - |
| scopus.contributor.surname | Puccetti | - |
| scopus.contributor.surname | Rogers | - |
| scopus.contributor.surname | Alzetta | - |
| scopus.contributor.surname | Dell'Orletta | - |
| scopus.contributor.surname | Esuli | - |
| scopus.date.issued | 2024 | * |
| scopus.description.abstracteng | Large Language Models (LLMs) are increasingly used as 'content farm' models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real 'content farm'. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem. | * |
| scopus.description.allpeopleoriginal | Puccetti G.; Rogers A.; Alzetta C.; Dell'Orletta F.; Esuli A. | * |
| scopus.differences | scopus.relation.conferencename | * |
| scopus.differences | scopus.identifier.isbn | * |
| scopus.document.type | cp | * |
| scopus.document.types | cp | * |
| scopus.identifier.doi | 10.18653/v1/2024.acl-long.817 | * |
| scopus.identifier.isbn | 9798891760943 | * |
| scopus.identifier.pui | 645308969 | * |
| scopus.identifier.scopus | 2-s2.0-85204461442 | * |
| scopus.journal.sourceid | 21101138302 | * |
| scopus.language.iso | eng | * |
| scopus.publisher.name | Association for Computational Linguistics (ACL) | * |
| scopus.relation.conferencedate | 2024 | * |
| scopus.relation.conferencename | 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 | * |
| scopus.relation.conferenceplace | tha | * |
| scopus.relation.firstpage | 15312 | * |
| scopus.relation.lastpage | 15338 | * |
| scopus.relation.volume | 1 | * |
| scopus.title | AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian | * |
| scopus.titleeng | AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian | * |
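The abstract above names average token log-likelihood as the simplest of the three detection methods studied. As a minimal sketch of that idea, the snippet below scores texts with a toy add-one-smoothed unigram model standing in for a real LLM; the model, tokenization, and threshold value are illustrative assumptions, not the paper's actual setup.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Fit an add-one-smoothed unigram model on a reference corpus."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one smoothing slot for unseen tokens
    return lambda tok: (counts.get(tok, 0) + 1) / (total + vocab)


def avg_log_likelihood(tokens, prob):
    """Average per-token log-probability of a text under the model."""
    return sum(math.log(prob(t)) for t in tokens) / len(tokens)


def flag_synthetic(tokens, prob, threshold):
    """Flag texts whose average likelihood is suspiciously high:
    LLM output tends to concentrate on high-probability tokens."""
    return avg_log_likelihood(tokens, prob) > threshold
```

In the paper's setting the scorer is an LLM (or proxy CFM) rather than a unigram model, which is exactly why the abstract calls the method impractical in the wild: it requires access to token likelihood information, and the decision threshold must be calibrated on held-out human-written news.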
| Appears in collections: | 04.01 Contributo in Atti di convegno | |
| File | Size | Format |
|---|---|---|
| 2024.acl-long.817.pdf (open access; published version; Creative Commons CC-BY license) | 1.55 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.