We illustrate an approach for multilingual treebanks explorations by introducing a novel adaptation to small treebanks of a methodology for identifying cross-lingual quantitative trends in the distribution of dependency relations. By relying on the principles of cross-validation, we reduce the amount of data required to execute the method, paving the way to expanding its use to low-resources languages. We validated the approach on 8 small treebanks, each containing less than 100,000 tokens and representing typologically different languages. We also show preliminary but promising evidence on the use of the proposed methodology for treebank expansion.
Atypical or underrepresented? A pilot study on small treebanks
Alzetta C
2021
Abstract
We illustrate an approach for multilingual treebanks explorations by introducing a novel adaptation to small treebanks of a methodology for identifying cross-lingual quantitative trends in the distribution of dependency relations. By relying on the principles of cross-validation, we reduce the amount of data required to execute the method, paving the way to expanding its use to low-resources languages. We validated the approach on 8 small treebanks, each containing less than 100,000 tokens and representing typologically different languages. We also show preliminary but promising evidence on the use of the proposed methodology for treebank expansion.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.