CNR Institutional Research Information System

Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks

D Brunato; G Venturi

2022

Abstract

This paper investigates linguistic complexity across natural languages from a corpus-based perspective and relies on the assumptions of linguistic profiling as a methodological framework. We focus in particular on the domain of syntactic complexity and analyze the distribution of a set of features taken as proxies of complexity phenomena at the sentence level, which were extracted from 63 treebanks annotated according to the Universal Dependencies formalism. This dataset guarantees that the features considered are modeling the same linguistic phenomena in different treebanks, allowing reliable comparison among languages. We show that our approach is able to identify tendencies of structural proximity between languages not necessarily in line with typologically-supported classification, thus shedding light on new corpus-based findings.

Full record (DC)

DC field	Value	Language
dc.authority.ancejournal	LINGUISTICS VANGUARD	en
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	en
dc.authority.people	D Brunato	en
dc.authority.people	G Venturi	en
dc.collection.id.s	b3f88f24-048a-4e43-8ab1-6697b90e068e	*
dc.collection.name	01.01 Articolo in rivista	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.date.accessioned	2024/02/20 15:01:30	-
dc.date.available	2024/02/20 15:01:30	-
dc.date.firstsubmission	2025/01/24 12:43:53	*
dc.date.issued	2022	-
dc.date.submission	2025/01/29 10:10:41	*
dc.description.affiliations	Istituto di Linguistica Computazionale "A. Zampolli"	-
dc.description.allpeople	Brunato, D; Venturi, G	-
dc.description.allpeopleoriginal	D. Brunato; G. Venturi	en
dc.description.fulltext	open	en
dc.description.numberofauthors	2	-
dc.identifier.doi	10.1515/lingvan-2021-0017	en
dc.identifier.isi	WOS:000870822600001	-
dc.identifier.scopus	2-s2.0-85141200922	-
dc.identifier.uri	https://hdl.handle.net/20.500.14243/420475	-
dc.identifier.url	https://www.degruyter.com/document/doi/10.1515/lingvan-2021-0017/html	en
dc.language.iso	eng	en
dc.miur.last.status.update	2024-07-08T15:59:26Z	*
dc.relation.firstpage	59	en
dc.relation.lastpage	72	en
dc.relation.medium	ELETTRONICO	en
dc.relation.numberofpages	13	en
dc.subject.keywordseng	Linguistic Complexity	-
dc.subject.keywordseng	Linguistic Profiling	-
dc.subject.keywordseng	Universal Dependencies	-
dc.subject.singlekeyword	Linguistic Complexity	*
dc.subject.singlekeyword	Linguistic Profiling	*
dc.subject.singlekeyword	Universal Dependencies	*
dc.title	Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks	en
dc.type.circulation	Internazionale	en
dc.type.driver	info:eu-repo/semantics/article	-
dc.type.full	01 Contributo su Rivista::01.01 Articolo in rivista	it
dc.type.impactfactor	si	en
dc.type.miur	262	-
dc.type.referee	Esperti anonimi	en
dc.ugov.descaux1	472409	-
iris.isi.extIssued	2023	-
iris.isi.extTitle	Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks	-
iris.mediafilter.data	2025/04/08 04:25:31	*
iris.orcid.lastModifiedDate	2025/07/20 01:50:16	*
iris.orcid.lastModifiedMillisecond	1752969016963	*
iris.scopus.extIssued	2023	-
iris.scopus.extTitle	Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks	-
iris.sitodocente.maxattempts	1	-
iris.unpaywall.doi	10.1515/lingvan-2021-0017	*
iris.unpaywall.isoa	false	*
iris.unpaywall.journalisindoaj	false	*
iris.unpaywall.metadataCallLastModified	22/07/2025 04:25:51	-
iris.unpaywall.metadataCallLastModifiedMillisecond	1753151151277	-
iris.unpaywall.oastatus	closed	*
isi.authority.ancejournal	LINGUISTICS VANGUARD###2199-174X	*
isi.category	OT	*
isi.category	OY	*
isi.contributor.affiliation	Inst Computat Linguist A Zampolli ILC CNR	-
isi.contributor.affiliation	Inst Computat Linguist A Zampolli ILC CNR	-
isi.contributor.country	Italy	-
isi.contributor.country	Italy	-
isi.contributor.name	Dominique	-
isi.contributor.name	Giulia	-
isi.contributor.researcherId	MCK-5206-2025	-
isi.contributor.researcherId	AAY-3932-2020	-
isi.contributor.subaffiliation	ItaliaNLP Lab	-
isi.contributor.subaffiliation	ItaliaNLP Lab	-
isi.contributor.surname	Brunato	-
isi.contributor.surname	Venturi	-
isi.date.issued	2023	*
isi.description.allpeopleoriginal	Brunato, D; Venturi, G;	*
isi.document.sourcetype	WOS.SSCI	*
isi.document.type	Article	*
isi.document.types	Article	*
isi.identifier.doi	10.1515/lingvan-2021-0017	*
isi.identifier.isi	WOS:000870822600001	*
isi.journal.journaltitle	LINGUISTICS VANGUARD	*
isi.journal.journaltitleabbrev	LINGUIST VANGUARD	*
isi.language.original	English	*
isi.publisher.place	GENTHINER STRASSE 13, D-10785 BERLIN, GERMANY	*
isi.relation.firstpage	59	*
isi.relation.lastpage	72	*
isi.relation.volume	9	*
isi.title	Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks	*
scopus.authority.ancejournal	LINGUISTICS VANGUARD###2199-174X	*
scopus.category	1203	*
scopus.category	3310	*
scopus.contributor.affiliation	ItaliaNLP Lab	-
scopus.contributor.affiliation	ItaliaNLP Lab	-
scopus.contributor.afid	60021199	-
scopus.contributor.afid	60021199	-
scopus.contributor.auid	55237740200	-
scopus.contributor.auid	27568199800	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.dptid	121833164	-
scopus.contributor.dptid	121833164	-
scopus.contributor.name	Dominique	-
scopus.contributor.name	Giulia	-
scopus.contributor.subaffiliation	Institute for Computational Linguistics A. Zampolli (ILC-CNR);	-
scopus.contributor.subaffiliation	Institute for Computational Linguistics A. Zampolli (ILC-CNR);	-
scopus.contributor.surname	Brunato	-
scopus.contributor.surname	Venturi	-
scopus.date.issued	2023	*
scopus.description.allpeopleoriginal	Brunato D.; Venturi G.	*
scopus.differences	scopus.subject.keywords	*
scopus.differences	scopus.date.issued	*
scopus.differences	scopus.description.allpeopleoriginal	*
scopus.differences	scopus.description.abstracteng	*
scopus.differences	scopus.relation.issue	*
scopus.differences	scopus.relation.volume	*
scopus.document.type	ar	*
scopus.document.types	ar	*
scopus.identifier.doi	10.1515/lingvan-2021-0017	*
scopus.identifier.eissn	2199-174X	*
scopus.identifier.pui	2020999860	*
scopus.identifier.scopus	2-s2.0-85141200922	*
scopus.journal.sourceid	21100860908	*
scopus.language.iso	eng	*
scopus.publisher.name	Walter de Gruyter GmbH	*
scopus.relation.firstpage	59	*
scopus.relation.issue	1 s	*
scopus.relation.lastpage	72	*
scopus.relation.volume	9	*
scopus.subject.keywords	linguistic complexity; linguistic profiling; syntactic domain; universal dependencies;	*
scopus.title	Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks	*
scopus.titleeng	Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks	*
Appears in types:	01.01 Articolo in rivista (journal article)

Files in this item:

File	Size	Format
prod_472409-doc_192275.pdf (open access). Description: Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks. Type: post-print document. License: Creative Commons	2.74 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/420475
