Tuning SyntaxNet for POS Tagging Italian Sentences
M Pota; M Esposito; R Guarasci
2017
Abstract
SyntaxNet is the NLP framework released by Google in 2016, claimed by its authors to be the most accurate dependency parser for 40 languages besides English. It relies on a transition-based model that implements POS tagger and dependency parser modules. SyntaxNet is distributed with its source code, so it can be trained and configured differently from the pre-trained models already provided. In this work, we present a case study investigating how to refine the Google SyntaxNet framework for the Italian language. In particular, we describe a procedure for tuning the native SyntaxNet model to address some shortcomings that emerged during preliminary tests. We mainly customize the original model for the Italian POS tagging task by training on a particularly interesting dataset and by testing a number of network configurations different from the original one released by Google. In detail, different sets of features are included through a forward selection approach, starting from the simplest possible configuration. We compare our results with the current SyntaxNet state of the art, showing how network performance is influenced by different feature types. Finally, some tests are performed with further changes to the network settings, to find ways of avoiding the shortcomings of the original implementation in view of a potential deployment in real-time applications.
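The forward selection approach mentioned in the abstract can be illustrated with a minimal Python sketch. It is a generic greedy procedure and not the authors' actual code: the feature names, the `toy_score` function, and its numbers are hypothetical placeholders. In a real experiment, the scoring function would train a tagger with the given feature set and return its accuracy on held-out data.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def forward_selection(
    candidates: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
    base: Iterable[str] = (),
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedy forward selection: start from `base` and repeatedly add the
    single candidate that improves the score most, until no candidate
    improves it by more than `min_gain`."""
    selected = list(base)
    remaining = [c for c in candidates if c not in selected]
    best_score = evaluate(frozenset(selected))
    while remaining:
        # Score every configuration obtained by adding one more feature.
        scored = [(evaluate(frozenset(selected + [c])), c) for c in remaining]
        top_score, top_feature = max(scored)
        if top_score - best_score <= min_gain:
            break  # no remaining feature helps enough
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical stand-in for "train and evaluate a tagger": each feature
# contributes a fixed number of correctly tagged tokens out of 1000.
# These values are invented for illustration only.
TOY_GAINS = {"word": 800, "suffix": 100, "prefix": 30, "capitalization": 0}


def toy_score(features: FrozenSet[str]) -> float:
    return float(sum(TOY_GAINS[f] for f in features))


if __name__ == "__main__":
    chosen, score = forward_selection(TOY_GAINS, toy_score, base=["word"])
    print(chosen, score)
```

Starting from a single base feature mirrors the idea of beginning with the simplest possible configuration. The stopping rule (`min_gain`) is an assumption of this sketch, since the abstract does not state the criterion used.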