Human Action Recognition with Transformers
Pier Luigi Mazzeo; Paolo Spagnolo; Cosimo Distante
2022
Abstract
Having a reliable tool to predict the actions performed in a video can be very useful for intelligent security systems, for many robotics applications, and for limiting human interaction with the system. In this work we present an architecture trained to predict the action present in digital video sequences. The proposed architecture consists of two main blocks: (i) a 3D backbone that extracts features from each frame of the video sequence and (ii) a temporal pooling module. Here, we use Bidirectional Encoder Representations from Transformers (BERT) for temporal pooling instead of Temporal Global Average Pooling (TGAP). The output of the proposed architecture is the prediction of the action taking place in the video sequence. We use two different backbones, ip-CSN and ir-CSN, to evaluate the performance of the entire architecture on two publicly available datasets: HMDB-51 and UCF-101. A comparison has been made with the most important state-of-the-art architectures for this task. The obtained results outperform the state of the art in terms of Top-1 and Top-3 accuracy.
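The core difference the abstract describes, replacing Temporal Global Average Pooling with BERT-style attention pooling over the backbone's temporal features, can be sketched in a toy example. This is not the authors' implementation: the feature dimension, sequence length, and single-head attention with a classification token are illustrative assumptions standing in for the full BERT block.

```python
import math
import random

DIM = 8  # feature dimension (assumed, for illustration only)
T = 4    # number of temporal positions (assumed)

random.seed(0)

def rand_vec(d):
    return [random.gauss(0.0, 1.0) for _ in range(d)]

# Stand-ins for per-frame features produced by a 3D backbone
# such as ip-CSN or ir-CSN.
features = [rand_vec(DIM) for _ in range(T)]

def tgap(feats):
    """Temporal Global Average Pooling: plain mean over the time axis."""
    t = len(feats)
    return [sum(f[i] for f in feats) / t for i in range(len(feats[0]))]

def attention_pool(feats, cls_token):
    """Single-head attention pooling: a classification token attends over
    the temporal features and returns their softmax-weighted sum, the
    core mechanism behind BERT-style temporal pooling."""
    scale = math.sqrt(len(cls_token))
    scores = [sum(c * x for c, x in zip(cls_token, f)) / scale for f in feats]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * f[i] for w, f in zip(weights, feats))
            for i in range(len(feats[0]))]

cls = rand_vec(DIM)  # the [CLS] token would be learned; random here
pooled_tgap = tgap(features)
pooled_attn = attention_pool(features, cls)
print(len(pooled_tgap), len(pooled_attn))
```

TGAP weighs every temporal position equally, whereas the attention pooling lets the model weigh frames by relevance, which is why the paper swaps one for the other; both reduce the T-by-DIM feature sequence to a single DIM-dimensional vector for classification.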