A hybrid model for gesture recognition and speech synchronization
Umberto Maniscalco; Antonio Messina; Pietro Storniolo
2023
Abstract
Gestures should be considered an integral part of language. Non-verbal communication integrates, enriches, and sometimes wholly replaces speech. Therefore, in the anthropomorphization of human-robot or human-machine interaction, the integration of the auditory and visual channels cannot be ignored. This article presents a model for gesture recognition and its synchronization with speech. We imagine a scenario in which a human interacts in natural language with a robotic agent that knows the organization of the surrounding space and the disposition of the objects in it, pointing at the things he or she intends to refer to. The model recognizes the stroke-hold phase typical of the deictic gesture and identifies the word or words in the user's sentence that correspond to the gesture. The purpose of the model is to replace demonstrative adjectives and pronouns, or other indexical expressions, with spatial information helpful in recognizing the intent of the sentence. To test the system, we built a development and simulation framework based on a web interface. The first results are very encouraging: the model works well in real time, with reasonable success rates on the assigned task.
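The core idea of the abstract — temporally aligning the stroke-hold of a pointing gesture with a word in the utterance, then substituting the deictic word with spatial information about the pointed-at object — can be sketched as follows. This is an illustrative toy, not the authors' implementation: the word timings, the deictic word list, and the object description are all hypothetical.

```python
# Illustrative sketch (not the paper's implementation) of gesture-speech
# synchronization: the deictic word closest in time to the gesture's
# stroke-hold is replaced by a spatial description of the pointed object.

DEICTIC_WORDS = {"this", "that", "these", "those", "here", "there"}

def align_gesture(words, word_times, stroke_hold_time):
    """Return the index of the word whose onset time is closest to the
    stroke-hold timestamp of the pointing gesture."""
    return min(range(len(words)),
               key=lambda i: abs(word_times[i] - stroke_hold_time))

def ground_sentence(words, word_times, stroke_hold_time, object_description):
    """If the word aligned with the gesture is deictic, replace it with a
    spatial description of the object the gesture points at."""
    i = align_gesture(words, word_times, stroke_hold_time)
    if words[i].lower() in DEICTIC_WORDS:
        words = words[:i] + [object_description] + words[i + 1:]
    return " ".join(words)

# Hypothetical usage: the user says "take that" while pointing at a box.
grounded = ground_sentence(
    ["take", "that"],
    [0.10, 0.45],              # word onset times (seconds), assumed known
    0.50,                      # stroke-hold timestamp of the gesture
    "the box at (1.2, 0.4)",   # spatial info supplied by the robot's scene model
)
print(grounded)  # take the box at (1.2, 0.4)
```

In the paper's scenario the spatial description would come from the robot's knowledge of the space around it; here it is passed in directly to keep the sketch self-contained.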