Breaking the 2D dependency: what limits 3D-only open-vocabulary scene understanding

Carrara F.; Falchi F.; Tonellotto N.
2025

Abstract

Open-vocabulary 3D scene understanding, i.e., recognizing and classifying objects in 3D scenes without being limited to a predefined set of classes, is a foundational task for robotics and extended reality applications. Current leading methods often rely on 2D foundation models to extract semantics, which are then projected into 3D. This paper investigates the viability of a purely 3D-native pipeline, thereby eliminating dependencies on 2D models and reprojections. We systematically explore various architectural combinations of established 3D components. However, our extensive experiments on benchmark datasets reveal that this direct 3D-native approach falls significantly short of expectations. Rather than representing a simple failure, these outcomes provide critical insights into the current deficiencies of existing 3D models when cascaded for complex open-vocabulary tasks. We highlight the lessons learned, identify the pipeline's limitations (e.g., the segmenter-encoder domain gap and robustness to imperfect segmentations), and posit future research directions. We argue that a fundamental rethinking of model design and interplay is necessary to realize the potential of truly 3D-native open-vocabulary understanding.
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
ISBN: 979-8-3315-5500-9
Keywords: Open-vocabulary 3D scene understanding; 3D scene segmentation; multimodal point cloud encoder; 3D-only pipeline
Files in this record:

2025_CBMI___3D_Only_OVSU.pdf
Access: open access
Description: Breaking the 2D Dependency: What Limits 3D-Only Open-Vocabulary Scene Understanding
Type: Post-print
License: Other license type
Size: 1.05 MB (Adobe PDF)

Breaking_the_2D_Dependency_What_Limits_3D-Only_Open-Vocabulary_Scene_Understanding.pdf
Access: authorized users only
Description: Breaking the 2D Dependency: What Limits 3D-Only Open-Vocabulary Scene Understanding
Type: Publisher's version (PDF)
License: NON-PUBLIC - private/restricted access
Size: 1.11 MB (Adobe PDF)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/562945