Designing and evaluating a Dual-Stream Transformer-based architecture for Visual Question Answering

Minutolo, Aniello (co-first author); Esposito, Massimo (last author)
2024

Abstract

In Visual Question Answering (VQA), accurate answers often hinge on the effective fusion of textual and visual elements. Complex architectures that perform this fusion are effective, but they typically carry a large number of parameters that demand significant processing power and lengthy training times. Traditional dual-stream approaches, for their part, prioritize accuracy above all else, neglecting GPU memory requirements and training time. This paper presents a novel dual-stream architecture for VQA whose parameters have been tested and evaluated not only for performance, but also for GPU memory consumption and training time. The results show that it is possible to achieve competitive performance while significantly reducing the computational burden typically associated with complex VQA models.
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
Visual Question Answering, VQA, Transformer models, Natural Language Processing, Dual-stream architecture, Multimodal question answering, Attention mechanisms
Files in this record:
Designing_and_Evaluating_a_Dual-Stream_Transformer-Based_Architecture_for_Visual_Question_Answering.pdf (open access; license: Creative Commons; size: 2.32 MB; format: Adobe PDF)


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/522102