Text-to-motion retrieval: towards joint understanding of human motion data and natural language

Messina N.; Falchi F.
2023

Abstract

Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available here: https://github.com/mesnico/text-to-motion-retrieval.
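To make the retrieval setting in the abstract concrete, the following is a minimal sketch of how a text query could be matched against motion embeddings and scored. It assumes that text and motion have already been encoded into a shared vector space and that a rank-based measure such as Recall@k is used; the abstract does not name the metrics, so this choice, the function names (`cosine`, `rank_motions`, `recall_at_k`), and the toy vectors are all illustrative assumptions rather than the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_motions(text_emb, motion_embs):
    """Return motion indices sorted from most to least similar to the text."""
    scores = [cosine(text_emb, m) for m in motion_embs]
    return sorted(range(len(motion_embs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(text_embs, motion_embs, k):
    """Fraction of queries whose paired motion (same index) is in the top k."""
    hits = sum(
        1
        for i, t in enumerate(text_embs)
        if i in rank_motions(t, motion_embs)[:k]
    )
    return hits / len(text_embs)


# Toy 2-D embeddings (made up for illustration): query i is paired with motion i.
texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
motions = [[0.9, 0.1], [0.8, 0.6], [0.1, 1.0]]

for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(texts, motions, k):.2f}")
```

In a real system the embeddings would come from the trained text and motion encoders, and the ranking would typically be computed with batched matrix operations rather than Python loops.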
2023
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
English
Chen H.H., Duh E., Huang H.H., Kato Makoto P., Mothe J., Poblete B.
SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
Contribution
SIGIR '23: The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
2420
2425
6
978-1-4503-9408-6
https://doi.org/10.1145/3539618.3592069
ACM - Association for Computing Machinery
New York
United States of America
Yes, but type not specified
23-27/07/2023
Taipei, Taiwan
CLIP
BERT
ViViT
Human motion data
Skeleton sequences
Deep language models
Motion retrieval
Cross-modal retrieval
Electronic
4
open
Messina N.; Sedmidubský J.; Falchi F.; Rebok T.
273
info:eu-repo/semantics/conferenceObject
04 Conference contribution::04.01 Contribution in conference proceedings
   A European Excellence Centre for Media, Society and Democracy
   AI4Media
   H2020
   951911
Files in this item:
prod_486043-doc_201551.pdf

open access

Description: Text-to-motion retrieval: towards joint understanding of human motion data and natural language
Type: Publisher's version (PDF)
License: Creative Commons
Size: 962.49 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/463453
Citations
  • PubMed Central: n/a
  • Scopus: 18
  • Web of Science (ISI): 16