Human Action Recognition with Transformers

Pier Luigi Mazzeo; Paolo Spagnolo; Cosimo Distante
2022

Abstract

A reliable tool for predicting the actions performed in a video is valuable for intelligent security systems, for many robotics applications, and for limiting human interaction with the system. In this work we present an architecture trained to predict the action present in digital video sequences. The proposed architecture consists of two main blocks: (i) a 3D backbone that extracts features from each frame of the video sequence and (ii) a temporal pooling module. Here, we use Bidirectional Encoder Representations from Transformers (BERT) for temporal pooling in place of Temporal Global Average Pooling (TGAP). The output of the architecture is the prediction of the action taking place in the video sequence. We use two different backbones, ip-CSN and ir-CSN, to evaluate the performance of the full architecture on two publicly available datasets, HMDB-51 and UCF-101, and compare it against the most important state-of-the-art architectures for this task. Our results outperform the state of the art in terms of Top-1 and Top-3 accuracy.
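The abstract contrasts Temporal Global Average Pooling (TGAP) with BERT-style temporal pooling over the per-frame features produced by the 3D backbone. Purely as an illustration (not the authors' code), the following NumPy sketch shows the difference between the two pooling strategies on a stand-in feature sequence: TGAP is a plain mean over time, while the attention-based pooler prepends a learnable classification token, applies one self-attention step, and reads that token's output. All weights here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 8, 16  # temporal steps from the 3D backbone, feature dimension
frame_feats = rng.standard_normal((T, D))  # stand-in for backbone output


def tgap(x):
    """Temporal Global Average Pooling: a plain mean over the time axis."""
    return x.mean(axis=0)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class AttentionPool:
    """BERT-style pooling (illustrative): a learnable classification token
    attends to the frame features via one single-head self-attention step."""

    def __init__(self, dim, seed=0):
        r = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(dim)
        self.cls = r.standard_normal(dim) * s        # [CLS]-like token
        self.Wq = r.standard_normal((dim, dim)) * s  # query projection
        self.Wk = r.standard_normal((dim, dim)) * s  # key projection
        self.Wv = r.standard_normal((dim, dim)) * s  # value projection

    def __call__(self, x):
        seq = np.vstack([self.cls, x])               # (T+1, D)
        q, k, v = seq @ self.Wq, seq @ self.Wk, seq @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(seq.shape[1]), axis=-1)
        out = attn @ v
        return out[0]                                # pooled [CLS] output


pooled_avg = tgap(frame_feats)        # (D,) vector, uniform temporal weights
pooled_attn = AttentionPool(D)(frame_feats)  # (D,) vector, learned weighting
print(pooled_avg.shape, pooled_attn.shape)
```

The design point this sketch makes is the one motivating the paper: TGAP weights every frame equally, whereas the attention pooler can learn to weight temporally informative frames more heavily before classification.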
Istituto di Scienze Applicate e Sistemi Intelligenti "Eduardo Caianiello" - ISASI - Sede Secondaria Lecce
Keywords: Human Action Recognition, BERT, 3D-CNN
File: 978-3-031-06433-3_20.pdf (Adobe PDF, 2.52 MB). License: non-public (private/restricted access).

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/539921
Citations
  • PMC: not available
  • Scopus: not available
  • Web of Science: 1