Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks

D. Brunato; G. Venturi
2022

Abstract

This paper investigates linguistic complexity across natural languages from a corpus-based perspective, using linguistic profiling as its methodological framework. We focus in particular on the domain of syntactic complexity and analyze the distribution of a set of features taken as proxies of complexity phenomena at the sentence level, which were extracted from 63 treebanks annotated according to the Universal Dependencies formalism. This dataset guarantees that the features considered model the same linguistic phenomena in different treebanks, allowing reliable comparison among languages. We show that our approach identifies tendencies of structural proximity between languages that are not necessarily in line with typologically supported classifications, thus shedding light on new corpus-based findings.
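As an illustration of what sentence-level complexity proxies extracted from a Universal Dependencies treebank can look like, the following minimal sketch computes three commonly used measures (token count, maximum tree depth, mean dependency link length) from a toy sentence in CoNLL-U format. The choice of features, the sample sentence, and the function names are assumptions made for this example; they are not taken from the paper and may not match the authors' feature set or tooling.

```python
from statistics import mean

# Toy CoNLL-U sentence ("The dog chased the cat."); hypothetical example data.
# Column 1 is the token ID and column 7 is the ID of the syntactic head (0 = root).
SAMPLE = """\
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tchased\tchase\tVERB\t_\t_\t0\troot\t_\t_
4\tthe\tthe\tDET\t_\t_\t5\tdet\t_\t_
5\tcat\tcat\tNOUN\t_\t_\t3\tobj\t_\t_
6\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""


def parse_heads(conllu: str) -> dict[int, int]:
    """Map each token ID to its head ID for one CoNLL-U sentence."""
    heads = {}
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        heads[int(cols[0])] = int(cols[6])
    return heads


def depth(token: int, heads: dict[int, int]) -> int:
    """Number of arcs between a token and the root (assumes a well-formed tree)."""
    steps = 0
    while heads[token] != 0:
        token = heads[token]
        steps += 1
    return steps


def sentence_features(conllu: str) -> dict[str, float]:
    """Compute illustrative sentence-level complexity proxies."""
    heads = parse_heads(conllu)
    # Linear distance between each dependent and its head, root arc excluded.
    links = [abs(tok - head) for tok, head in heads.items() if head != 0]
    return {
        "n_tokens": len(heads),
        "max_depth": max(depth(tok, heads) for tok in heads),
        "mean_link_length": mean(links),
    }


features = sentence_features(SAMPLE)
print(features)
```

Because every UD treebank uses the same column layout and head encoding, a function of this kind can be applied unchanged to sentences from any of the treebanks, which is what makes per-language feature distributions comparable.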
DC Field Value Language
dc.authority.ancejournal LINGUISTICS VANGUARD en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people D Brunato en
dc.authority.people G Venturi en
dc.collection.id.s b3f88f24-048a-4e43-8ab1-6697b90e068e *
dc.collection.name 01.01 Journal article *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.area Not assigned *
dc.contributor.area Not assigned *
dc.date.accessioned 2024/02/20 15:01:30 -
dc.date.available 2024/02/20 15:01:30 -
dc.date.firstsubmission 2025/01/24 12:43:53 *
dc.date.issued 2022 -
dc.date.submission 2025/01/29 10:10:41 *
dc.description.affiliations Istituto di Linguistica Computazionale "A. Zampolli" -
dc.description.allpeople Brunato, D; Venturi, G -
dc.description.allpeopleoriginal D. Brunato; G. Venturi en
dc.description.fulltext open en
dc.description.numberofauthors 2 -
dc.identifier.doi 10.1515/lingvan-2021-0017 en
dc.identifier.isi WOS:000870822600001 -
dc.identifier.scopus 2-s2.0-85141200922 -
dc.identifier.uri https://hdl.handle.net/20.500.14243/420475 -
dc.identifier.url https://www.degruyter.com/document/doi/10.1515/lingvan-2021-0017/html en
dc.language.iso eng en
dc.miur.last.status.update 2024-07-08T15:59:26Z *
dc.relation.firstpage 59 en
dc.relation.lastpage 72 en
dc.relation.medium ELECTRONIC en
dc.relation.numberofpages 13 en
dc.subject.keywordseng Linguistic Complexity -
dc.subject.keywordseng Linguistic Profiling -
dc.subject.keywordseng Universal Dependencies -
dc.subject.singlekeyword Linguistic Complexity *
dc.subject.singlekeyword Linguistic Profiling *
dc.subject.singlekeyword Universal Dependencies *
dc.title Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks en
dc.type.circulation International en
dc.type.driver info:eu-repo/semantics/article -
dc.type.full 01 Contributo su Rivista::01.01 Articolo in rivista it
dc.type.impactfactor yes en
dc.type.miur 262 -
dc.type.referee Anonymous experts en
dc.ugov.descaux1 472409 -
iris.isi.extIssued 2023 -
iris.isi.extTitle Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks -
iris.mediafilter.data 2025/04/08 04:25:31 *
iris.orcid.lastModifiedDate 2025/07/20 01:50:16 *
iris.orcid.lastModifiedMillisecond 1752969016963 *
iris.scopus.extIssued 2023 -
iris.scopus.extTitle Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks -
iris.sitodocente.maxattempts 1 -
iris.unpaywall.doi 10.1515/lingvan-2021-0017 *
iris.unpaywall.isoa false *
iris.unpaywall.journalisindoaj false *
iris.unpaywall.metadataCallLastModified 22/07/2025 04:25:51 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1753151151277 -
iris.unpaywall.oastatus closed *
isi.authority.ancejournal LINGUISTICS VANGUARD###2199-174X *
isi.category OT *
isi.category OY *
isi.contributor.affiliation Inst Computat Linguist A Zampolli ILC CNR -
isi.contributor.affiliation Inst Computat Linguist A Zampolli ILC CNR -
isi.contributor.country Italy -
isi.contributor.country Italy -
isi.contributor.name Dominique -
isi.contributor.name Giulia -
isi.contributor.researcherId MCK-5206-2025 -
isi.contributor.researcherId AAY-3932-2020 -
isi.contributor.subaffiliation ItaliaNLP Lab -
isi.contributor.subaffiliation ItaliaNLP Lab -
isi.contributor.surname Brunato -
isi.contributor.surname Venturi -
isi.date.issued 2023 *
isi.description.allpeopleoriginal Brunato, D; Venturi, G; *
isi.document.sourcetype WOS.SSCI *
isi.document.type Article *
isi.document.types Article *
isi.identifier.doi 10.1515/lingvan-2021-0017 *
isi.identifier.isi WOS:000870822600001 *
isi.journal.journaltitle LINGUISTICS VANGUARD *
isi.journal.journaltitleabbrev LINGUIST VANGUARD *
isi.language.original English *
isi.publisher.place GENTHINER STRASSE 13, D-10785 BERLIN, GERMANY *
isi.relation.firstpage 59 *
isi.relation.lastpage 72 *
isi.relation.volume 9 *
isi.title Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks *
scopus.authority.ancejournal LINGUISTICS VANGUARD###2199-174X *
scopus.category 1203 *
scopus.category 3310 *
scopus.contributor.affiliation ItaliaNLP Lab -
scopus.contributor.affiliation ItaliaNLP Lab -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 60021199 -
scopus.contributor.auid 55237740200 -
scopus.contributor.auid 27568199800 -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.dptid 121833164 -
scopus.contributor.dptid 121833164 -
scopus.contributor.name Dominique -
scopus.contributor.name Giulia -
scopus.contributor.subaffiliation Institute for Computational Linguistics A. Zampolli (ILC-CNR); -
scopus.contributor.subaffiliation Institute for Computational Linguistics A. Zampolli (ILC-CNR); -
scopus.contributor.surname Brunato -
scopus.contributor.surname Venturi -
scopus.date.issued 2023 *
scopus.description.allpeopleoriginal Brunato D.; Venturi G. *
scopus.differences scopus.subject.keywords *
scopus.differences scopus.date.issued *
scopus.differences scopus.description.allpeopleoriginal *
scopus.differences scopus.description.abstracteng *
scopus.differences scopus.relation.issue *
scopus.differences scopus.relation.volume *
scopus.document.type ar *
scopus.document.types ar *
scopus.identifier.doi 10.1515/lingvan-2021-0017 *
scopus.identifier.eissn 2199-174X *
scopus.identifier.pui 2020999860 *
scopus.identifier.scopus 2-s2.0-85141200922 *
scopus.journal.sourceid 21100860908 *
scopus.language.iso eng *
scopus.publisher.name Walter de Gruyter GmbH *
scopus.relation.firstpage 59 *
scopus.relation.issue 1 s *
scopus.relation.lastpage 72 *
scopus.relation.volume 9 *
scopus.subject.keywords linguistic complexity; linguistic profiling; syntactic domain; universal dependencies; *
scopus.title Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks *
scopus.titleeng Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks *
Appears in collections: 01.01 Journal article
Files in this item:
File Size Format
prod_472409-doc_192275.pdf

open access

Description: Why is this language complex? Cherry-pick the optimal set of features in multilingual treebanks
Type: Post-print document
License: Creative Commons
Size: 2.74 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/420475
Citations
  • PMC: not available
  • Scopus: 2
  • Web of Science: 1