
Combining EfficientNet and vision transformers for video deepfake detection

Coccomini D.A., Messina N., Gennaro C., Falchi F.
2022

Abstract

Deepfakes are the result of digital manipulation to forge realistic yet fake imagery. With the astonishing advances in deep generative models, fake images and videos are nowadays obtained using variational autoencoders (VAEs) or Generative Adversarial Networks (GANs). These technologies are becoming more accessible and more accurate, resulting in fake videos that are very difficult to detect. Traditionally, Convolutional Neural Networks (CNNs) have been used to perform video deepfake detection, with the best results obtained by methods based on EfficientNet B7. In this study, we focus on video deepfake detection on faces, given that generative methods have become extremely accurate at synthesizing realistic human faces. Specifically, we combine various types of Vision Transformers with a convolutional EfficientNet B0 used as a feature extractor, obtaining results comparable to very recent methods that also use Vision Transformers. Unlike state-of-the-art approaches, we use neither distillation nor ensembling. Furthermore, we present a straightforward inference procedure based on a simple voting scheme for handling multiple faces in the same video shot. The best model achieved an AUC of 0.951 and an F1 score of 88.0%, very close to the state of the art on the DeepFake Detection Challenge (DFDC) dataset. The code for reproducing our results is publicly available here: https://tinyurl.com/cnn-vit-dfd.
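The abstract outlines the core design: an EfficientNet B0 backbone supplies convolutional features that a Vision Transformer-style encoder then classifies as real or fake. As a rough illustration of how such a hybrid can be wired up, the PyTorch sketch below flattens the backbone's feature map into a sequence of tokens, prepends a learnable classification token, and feeds the sequence to a small Transformer encoder. This is only a sketch under assumed hyper-parameters; the embedding size, depth, number of heads, and the use of untrained weights are illustrative choices, not the configuration used in the paper.

```python
# Hedged sketch of a CNN + Vision Transformer deepfake classifier.
# EfficientNet-B0 is used purely as a feature extractor; the 7x7 spatial grid
# of its last feature map becomes a token sequence for a Transformer encoder.

import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

class EfficientViTSketch(nn.Module):
    def __init__(self, embed_dim=512, depth=4, heads=8):
        super().__init__()
        # Backbone; in practice one would load pretrained weights here.
        backbone = efficientnet_b0(weights=None)
        self.features = backbone.features            # (B, 1280, 7, 7) for 224x224 input
        self.proj = nn.Linear(1280, embed_dim)        # project channels to token dim
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + 7 * 7, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, 1)           # single real/fake logit

    def forward(self, x):                             # x: (B, 3, 224, 224) face crops
        f = self.features(x)                          # (B, 1280, 7, 7)
        tokens = f.flatten(2).transpose(1, 2)         # (B, 49, 1280)
        tokens = self.proj(tokens)                    # (B, 49, embed_dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        z = torch.cat([cls, tokens], dim=1) + self.pos_embed
        z = self.encoder(z)
        return self.head(z[:, 0])                     # logit from the CLS token
```

The abstract also mentions a simple voting scheme for shots containing multiple faces, but does not spell out the rule. The sketch below assumes each detected face crop is scored independently and the video-level label is a majority vote over per-face decisions, with the mean probability as a tie-breaker; this is an illustrative reading, not necessarily the authors' exact procedure.

```python
# Hedged sketch of per-video voting inference over all face crops in a shot.
import torch

@torch.no_grad()
def predict_video(model, face_crops, threshold=0.5):
    """face_crops: tensor (N, 3, 224, 224) holding every detected face in the shot."""
    model.eval()
    probs = torch.sigmoid(model(face_crops)).squeeze(1)   # per-face fake probability
    votes = (probs > threshold).float()                   # 1 = fake, 0 = real
    if votes.mean() == 0.5:                                # tie: fall back to mean probability
        return float(probs.mean() > threshold)
    return float(votes.mean() > 0.5)                       # majority vote
```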
Institute: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
ISBN: 978-3-031-06433-3
Keywords: Deepfake detection; Deep learning; Vision transformers
Files in this record:
  • prod_468613-doc_192834.pdf — Description: Combining EfficientNet and vision transformers for video deepfake detection; Type: Published version (PDF); Size: 623.11 kB; Format: Adobe PDF; access restricted to authorized users
  • prod_468613-doc_192851.pdf — Description: Postprint - Combining EfficientNet and vision transformers for video deepfake detection; Type: Published version (PDF); Size: 1.91 MB; Format: Adobe PDF; open access

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/414759