CNR Institutional Research Information System

Text-to-motion retrieval: towards joint understanding of human motion data and natural language

Messina N; Sedmidubský J; Falchi F; Rebok T

2023

Abstract

Due to recent advances in pose-estimation methods, human motion can be extracted from ordinary video in the form of 3D skeleton sequences. Despite the many application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available at https://github.com/mesnico/text-to-motion-retrieval.
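
As a rough illustration of the retrieval step the abstract describes, the following minimal sketch ranks candidate motions by cosine similarity to a text query in a shared embedding space. It is not the authors' implementation: the embeddings are hand-written toy vectors standing in for encoder outputs, the names `rank_motions` and `recall_at_k` are hypothetical, and recall at k is a common retrieval measure assumed here for illustration, not one named in the abstract.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def rank_motions(text_emb, motion_embs):
    """Return motion indices sorted from most to least similar to the text."""
    t = l2_normalize(text_emb[None, :])       # shape (1, d)
    m = l2_normalize(motion_embs)             # shape (n, d)
    sims = (m @ t.T).ravel()                  # one cosine similarity per motion
    order = np.argsort(-sims, kind="stable")  # descending similarity
    return order, sims

def recall_at_k(ranked, relevant_idx, k):
    """1.0 if the relevant motion is among the top-k results, else 0.0 (assumed metric)."""
    return float(np.any(ranked[:k] == relevant_idx))

# Toy embeddings (assumption): in practice these would come from trained
# text and motion encoders projecting into the same space.
motion_embs = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [0.7, 0.7]])
text_emb = np.array([0.9, 0.1])

order, sims = rank_motions(text_emb, motion_embs)
print(order.tolist())
print(recall_at_k(order, relevant_idx=2, k=2))
```

Normalizing both sides first keeps the ranking independent of embedding magnitude, which is the usual choice when the two encoders are trained with a similarity-based objective.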

Year: 2023
Organizational units: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Language(s): English
External supervisors and coordinators: Chen H.H., Duh E., Huang H.H., Kato Makoto P., Mothe J., Poblete B.
Volume title: SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
Presentation type: Contribution
Conference title: SIGIR '23: The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
First page: 2420
Last page: 2425
Number of pages: 6
ISBN: 978-1-4503-9408-6
DOI: https://dx.doi.org/10.1145/3539618.3592069
URL: https://doi.org/10.1145/3539618.3592069
Publisher: ACM - Association for Computing Machinery
Publisher city: New York
Publisher country: United States of America
Peer-reviewed: Yes, type not specified
Conference dates: 23-27/07/2023
Conference location: Taipei, Taiwan
Keywords: CLIP; BERT; ViViT; Human motion data; Skeleton sequences; Deep language models; Motion retrieval; Cross-modal retrieval
Scopus ID: 2-s2.0-85168662009
Web of Science ID: WOS:001118084002091
Format: Electronic
International co-authors: Yes
Number of authors: 4
Full text: open
All authors: Messina N.; Sedmidubský J.; Falchi F.; Rebok T.
MIUR login type: 273
Type: info:eu-repo/semantics/conferenceObject
Type: 04 Conference contribution::04.01 Contribution in conference proceedings
Project title: A European Excellence Centre for Media, Society and Democracy
Project acronym: AI4Media
Project funding: H2020
Project contract no.: 951911
Appears in types: 04.01 Contribution in conference proceedings

Files in this item:

File	Size	Format
prod_486043-doc_201551.pdf (open access; Description: Text-to-motion retrieval: towards joint understanding of human motion data and natural language; Type: Publisher's version (PDF); License: Creative Commons)	962.49 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/463453
