Deepfake detection: challenges and solutions
Coccomini DA
2023
Abstract
Deepfakes can have a serious impact on the spread of fake news and on people's lives in general, and they are becoming more dangerous every day. Moderation of online content and databases is vital to mitigate this phenomenon, but the development of systems that distinguish fake from genuine content comes with its own challenges: (a) a lack of generalization capability, since most deepfake detection models are trained on a specific type of deepfake and struggle to detect deepfakes generated with different techniques; (b) peculiarities that arise when deepfake detectors are applied in the real world, for example handling videos with multiple people in the same scene, or coping with faces that move towards or away from the camera. In our analysis work, we first focused on the generalization problem, trying to understand whether a particular deep learning architecture was better at abstracting the concept of a deepfake, to the point of detecting images or videos manipulated with previously unseen techniques. In [2] and [5] we compared Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) of various kinds in a cross-forgery setting, revealing the superiority of ViTs, which are less tied to the specific anomalies they see during training. Having noted a scarcity of ViT-based methods, and even more so of hybrid architectures, we then developed our first deepfake detector. In [1] we created a new architecture that combines an EfficientNet-B0 with a Cross ViT, which we named the Convolutional Cross Vision Transformer. Thanks to its local-global attention mechanism and its exploitation of the features extracted by the CNN, the model detects deepfake videos effectively, achieving state-of-the-art results on the DFDC [6] and FaceForensics++ [9] datasets while keeping the number of parameters low. The model was also used to participate in the competition presented in [7]. In [4] we designed a new type of Convolutional TimeSformer that takes into account both the spatial position of faces in the frame and their temporal position in the video. It can also manage multiple identities and is robust to changes in face size, thanks to a novel attention mechanism and positional embedding. Our method surpassed the state of the art in in-dataset tests on [8] and performed robustly in real-world situations. Future work will mainly focus on making deepfake detectors more robust to other real-world problems. We also want detectors that can combine additional information, such as text, context, and the reputation of the account disseminating a video, to assess its veracity. Finally, as we began doing in [3], we will work on the more general problem of synthetic content detection.
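To make the hybrid idea in [1] concrete, the following is a minimal sketch of a CNN-plus-cross-attention detector: convolutional feature maps are tokenized at two scales and the two branches exchange information via cross-attention, in the spirit of CrossViT. The module names, hyper-parameters, and the simple conv stem (a stand-in for EfficientNet-B0) are illustrative assumptions, not the published architecture.

```python
# Sketch of a hybrid convolutional cross vision transformer (assumed design).
import torch
import torch.nn as nn

class ConvTokenizer(nn.Module):
    """Stand-in for an EfficientNet-B0 stage: downsamples the image and
    flattens the resulting feature map into a token sequence."""
    def __init__(self, dim, stride):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=stride, stride=stride)

    def forward(self, x):                      # x: (B, 3, H, W)
        f = self.conv(x)                       # (B, dim, H/s, W/s)
        return f.flatten(2).transpose(1, 2)    # (B, N, dim)

class CrossBranchBlock(nn.Module):
    """One exchange step: each branch attends to the other branch's
    tokens, giving a local <-> global information flow."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn_sl = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_ls = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, small, large):
        s, _ = self.attn_sl(small, large, large)   # fine tokens query coarse
        l, _ = self.attn_ls(large, small, small)   # coarse tokens query fine
        return small + s, large + l

class HybridDeepfakeDetector(nn.Module):
    def __init__(self, dim=128, depth=2):
        super().__init__()
        self.tok_small = ConvTokenizer(dim, stride=16)   # fine patches
        self.tok_large = ConvTokenizer(dim, stride=32)   # coarse patches
        self.blocks = nn.ModuleList(CrossBranchBlock(dim) for _ in range(depth))
        self.head = nn.Linear(2 * dim, 1)                # real-vs-fake logit

    def forward(self, x):
        s, l = self.tok_small(x), self.tok_large(x)
        for blk in self.blocks:
            s, l = blk(s, l)
        pooled = torch.cat([s.mean(1), l.mean(1)], dim=-1)
        return self.head(pooled)

logit = HybridDeepfakeDetector()(torch.randn(2, 3, 224, 224))
print(logit.shape)  # torch.Size([2, 1])
```

The design point is that the CNN supplies texture-level features cheaply, while cross-attention between the two token scales provides the local-global reasoning; this is what keeps the parameter count low relative to a pure ViT of comparable capacity.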
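Similarly, the video model in [4] can be illustrated with a divided space-time attention block plus additive spatio-temporal positional embeddings, in the spirit of a Convolutional TimeSformer. This is a sketch under stated assumptions: identity handling is reduced to a plain token axis, the per-face tokens are assumed to come from a CNN, and a face-size embedding is only indicated in a comment; none of these names are the published ones.

```python
# Sketch of divided space-time attention over per-face video tokens (assumed design).
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):             # x: (B, T, N, D)  frames x tokens
        B, T, N, D = x.shape
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)   # attend over time
        t = t + self.time_attn(t, t, t)[0]
        s = t.reshape(B, N, T, D).permute(0, 2, 1, 3).reshape(B * T, N, D)
        s = s + self.space_attn(s, s, s)[0]              # attend over space
        return s.reshape(B, T, N, D)

class FaceVideoDetector(nn.Module):
    def __init__(self, dim=128, T=8, N=16, depth=2):
        super().__init__()
        # Learned embeddings for the temporal index and the spatial token
        # index; a face-size embedding could be added the same way to make
        # the model aware of faces growing/shrinking as they approach or
        # leave the camera (an assumption in this sketch).
        self.pos_t = nn.Parameter(torch.zeros(1, T, 1, dim))
        self.pos_s = nn.Parameter(torch.zeros(1, 1, N, dim))
        self.blocks = nn.ModuleList(DividedSpaceTimeBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, 1)

    def forward(self, tokens):        # tokens: (B, T, N, D) from a CNN backbone
        x = tokens + self.pos_t + self.pos_s
        for blk in self.blocks:
            x = blk(x)
        return self.head(x.mean(dim=(1, 2)))  # one real/fake logit per video

out = FaceVideoDetector()(torch.randn(2, 8, 16, 128))
print(out.shape)  # torch.Size([2, 1])
```

Factorizing attention over time and space separately keeps the cost linear in the number of frames times tokens, which is what makes per-face, multi-identity video processing tractable.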
File | Description | Type | Size | Format
---|---|---|---|---
prod_489905-doc_204062.pdf (open access) | Deepfake detection: challenges and solutions | Published Version (PDF) | 720.8 kB | Adobe PDF