Same or different? Diff-vectors for authorship analysis

Moreo Fernandez A. D.; Sebastiani F.
2023

Abstract

In this paper we investigate the effects on authorship identification tasks (including authorship verification, closed-set authorship attribution, and closed-set and open-set same-author verification) of a fundamental shift in how to conceive the vectorial representations of documents that are given as input to a supervised learner. In "classic" authorship analysis a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document. We instead investigate the situation in which a feature vector represents an unordered pair of documents, the value of a feature represents the absolute difference in the relative frequencies (or increasing functions thereof) of the feature in the two documents, and the class label indicates whether the two documents are from the same author or not. This latter (learner-independent) type of representation has occasionally been used before, but has never been studied systematically. We argue that it is advantageous, and that in some cases (e.g., authorship verification) it provides a much larger quantity of information to the training process than the standard representation. The experiments that we carry out on several publicly available datasets (one of which we here make available for the first time) show that feature vectors representing pairs of documents (which we here call Diff-Vectors, or DVs) bring about systematic improvements in the effectiveness of authorship identification tasks, especially when training data are scarce (as is often the case in real-life authorship identification scenarios). Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared towards solving the first, we also provide two novel methods for solving the second and third that use a solver for the first as a building block.
The code to reproduce our experiments is open-source and available online.
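
As an illustration of the pairwise representation described in the abstract, the following is a minimal sketch and not the authors' released code. It assumes whitespace tokenization and a small fixed vocabulary of function words as features; the names `relative_frequencies`, `diff_vector` and `build_pairs`, as well as the toy data, are hypothetical. It also shows why the pairwise view yields more training examples: n labelled documents produce n(n-1)/2 labelled pairs.

```python
from collections import Counter
from itertools import combinations


def relative_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary item among the tokens of `text`."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenization
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty documents
    return [counts[w] / total for w in vocabulary]


def diff_vector(u, v):
    """Absolute element-wise difference; symmetric, so the pair is unordered."""
    return [abs(a - b) for a, b in zip(u, v)]


def build_pairs(docs, authors, vocabulary):
    """Turn n labelled documents into n*(n-1)/2 labelled pairs.

    The label is 1 if both documents share an author, 0 otherwise.
    """
    vectors = [relative_frequencies(d, vocabulary) for d in docs]
    X, y = [], []
    for i, j in combinations(range(len(docs)), 2):
        X.append(diff_vector(vectors[i], vectors[j]))
        y.append(1 if authors[i] == authors[j] else 0)
    return X, y


# Toy data, purely illustrative (hypothetical documents and authors).
vocabulary = ["the", "of", "and", "but"]
docs = [
    "the cat and the dog",
    "the end of the road",
    "but of course and yet",
    "the the the but",
]
authors = ["A", "A", "B", "B"]

X, y = build_pairs(docs, authors, vocabulary)
print(len(X), "pairs built from", len(docs), "documents")
```

The resulting `X` and `y` can be passed to any binary classifier, which is what makes the representation learner-independent: the classifier learns "same author" versus "different author" rather than the identity of any particular author.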
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Authorship analysis
Files in this record:

prod_485905-doc_201441.pdf (open access, 2.21 MB, Adobe PDF)
Description: This is the Author Accepted Manuscript (postprint) version of the following paper: Corbara S., Moreo A., Sebastiani F., "Same or Different? Diff-Vectors for Authorship Analysis", 2023, peer-reviewed and accepted for publication in "ACM Transactions on Knowledge Discovery from Data", vol. 18, n. 1. DOI: 10.1145/3609226
Type: Post-print document
License: No license declared (not assignable to records later than 2023)

3609226.pdf (open access, 3.46 MB, Adobe PDF)
Description: Same or Different? Diff-Vectors for Authorship Analysis
Type: Publisher's version (PDF)
License: Other type of license

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/461145