
Joint-dataset learning and cross-consistent regularization for text-to-motion retrieval

Messina N.; Falchi F.
2025

Abstract

Pose-estimation methods enable extracting human motion from ordinary videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data remains a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search a database for the motions most relevant to a given natural-language description (text-to-motion) and vice versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we investigate joint-dataset learning, where we train on multiple text-motion datasets simultaneously, together with the introduction of a Cross-Consistent Contrastive Loss (CCCL), which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process skeleton data sequences. We demonstrate the benefits of the proposed approaches on the widely used KIT Motion-Language and HumanML3D datasets, and also report results on the recent Motion-X dataset. We perform detailed experiments on joint-dataset and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods. The code for reproducing our results is available here: .
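The abstract describes the Cross-Consistent Contrastive Loss only at a high level. As a purely illustrative aid, and not the formulation used in the paper, the following Python (PyTorch) sketch shows one plausible reading: a symmetric cross-modal InfoNCE term over paired text and motion embeddings, plus uni-modal consistency terms that align the text-text and motion-motion similarity structures. The function name, the temperature and uni_weight hyperparameters, and the KL-based consistency term are assumptions made for this sketch.

import torch
import torch.nn.functional as F


def cross_consistent_contrastive_loss(text_emb, motion_emb,
                                       temperature=0.07, uni_weight=0.5):
    """Illustrative sketch only: NOT the CCCL definition from the paper.

    text_emb, motion_emb: (batch, dim) embeddings of paired text
    descriptions and motion sequences.
    """
    # Normalize so that dot products become cosine similarities.
    t = F.normalize(text_emb, dim=-1)
    m = F.normalize(motion_emb, dim=-1)

    # Cross-modal InfoNCE: matching (text, motion) pairs lie on the
    # diagonal of the batch similarity matrix.
    logits = t @ m.T / temperature
    targets = torch.arange(t.size(0), device=t.device)
    cross_loss = 0.5 * (F.cross_entropy(logits, targets)
                        + F.cross_entropy(logits.T, targets))

    # Uni-modal consistency (assumed here as a symmetric KL term):
    # push the text-text and motion-motion similarity structures to
    # agree, regularizing the shared embedding space.
    p_tt = F.softmax(t @ t.T / temperature, dim=-1)
    p_mm = F.softmax(m @ m.T / temperature, dim=-1)
    uni_loss = 0.5 * (
        F.kl_div(p_tt.log(), p_mm, reduction="batchmean")
        + F.kl_div(p_mm.log(), p_tt, reduction="batchmean")
    )

    return cross_loss + uni_weight * uni_loss

In a joint-dataset setting, mini-batches mixing samples from several text-motion datasets (e.g., KIT Motion-Language and HumanML3D) could be passed through the same objective unchanged; this detail is likewise an assumption of the sketch.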
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
3D Human motion, Cross-modal retrieval, Multi-modal understanding, Text-motion retrieval
Files in this record:
File: Messina-Falchi et al_ACM 2025.pdf (open access)
Description: Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval
Type: Post-print document
License: Other license type
Size: 1.11 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/554744