Leveraging encoder-only large language models for mobile app review feature extraction

Motger Q.; Miaschi A.; Dell'Orletta F.; Franch X.; Marco J.
2025

Abstract

Mobile app review analysis presents unique challenges due to the low quality, subjective bias, and noisy content of user-generated documents. Extracting features from these reviews is essential for tasks such as feature prioritization and sentiment analysis, but it remains a challenging task. Meanwhile, encoder-only models based on the Transformer architecture have shown promising results for classification and information extraction tasks for multiple software engineering processes. This study explores the hypothesis that encoder-only large language models can enhance feature extraction from mobile app reviews. By leveraging crowdsourced annotations from an industrial context, we redefine feature extraction as a supervised token classification task. Our approach includes extending the pre-training of these models with a large corpus of user reviews to improve contextual understanding and employing instance selection techniques to optimize model fine-tuning. Empirical evaluations demonstrate that these methods improve the precision and recall of extracted features and enhance performance efficiency. Key contributions include a novel approach to feature extraction, annotated datasets, extended pre-trained models, and an instance selection mechanism for cost-effective fine-tuning. This research provides practical methods and empirical evidence in applying large language models to natural language processing tasks within mobile app reviews, offering improved performance in feature extraction.
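
The framing of feature extraction as token classification can be illustrated with a small sketch. It is not the authors' implementation: the tag names `B-feature`, `I-feature` and `O`, the helper `extract_features`, and the example review are all assumptions made for illustration. It only shows how per-token labels, such as those an encoder-only model might predict, could be grouped back into feature phrases.

```python
from typing import List


def extract_features(tokens: List[str], tags: List[str]) -> List[str]:
    """Group tokens tagged B-feature / I-feature into feature phrases.

    The BIO tag names are hypothetical; the source does not specify a label scheme.
    """
    features: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B-feature":
            # A new feature starts; close any feature still open.
            if current:
                features.append(" ".join(current))
            current = [token]
        elif tag == "I-feature" and current:
            # Continue the feature that is currently open.
            current.append(token)
        else:
            # "O" (or a stray I- tag with no open feature) ends the span.
            if current:
                features.append(" ".join(current))
            current = []
    if current:
        features.append(" ".join(current))
    return features


# Made-up review with hand-written labels (not model output).
tokens = ["I", "love", "the", "video", "call", "option",
          "but", "sharing", "photos", "is", "slow"]
tags = ["O", "O", "O", "B-feature", "I-feature", "O",
        "O", "B-feature", "I-feature", "O", "O"]

print(extract_features(tokens, tags))  # ['video call', 'sharing photos']
```

In a supervised setting, the tags would come from a fine-tuned classifier rather than being written by hand, and precision and recall would be computed over the extracted phrases.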
DC Field Value Language
dc.authority.ancejournal EMPIRICAL SOFTWARE ENGINEERING en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Motger Q. en
dc.authority.people Miaschi A. en
dc.authority.people Dell'Orletta F. en
dc.authority.people Franch X. en
dc.authority.people Marco J. en
dc.collection.id.s b3f88f24-048a-4e43-8ab1-6697b90e068e *
dc.collection.name 01.01 Articolo in rivista *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.accessioned 2026/03/03 15:02:32 -
dc.date.available 2026/03/03 15:02:32 -
dc.date.firstsubmission 2026/03/02 19:07:53 *
dc.date.issued 2025 -
dc.date.submission 2026/03/02 19:07:53 *
dc.description.allpeople Motger, Q.; Miaschi, A.; Dell'Orletta, F.; Franch, X.; Marco, J. -
dc.description.allpeopleoriginal Motger Q.; Miaschi A.; Dell'Orletta F.; Franch X.; Marco J. en
dc.description.fulltext restricted en
dc.description.numberofauthors 5 -
dc.identifier.doi 10.1007/s10664-025-10660-y en
dc.identifier.isi WOS:001471816400001 -
dc.identifier.scopus 2-s2.0-105003228374 en
dc.identifier.source scopus *
dc.identifier.uri https://hdl.handle.net/20.500.14243/570522 -
dc.language.iso eng en
dc.relation.issue 3 en
dc.relation.volume 30 en
dc.subject.keywords Extended pre-training -
dc.subject.keywords Feature extraction -
dc.subject.keywords Instance selection -
dc.subject.keywords Large language models -
dc.subject.keywords Mobile app reviews -
dc.subject.keywords Named-entity recognition -
dc.subject.singlekeyword Extended pre-training *
dc.subject.singlekeyword Feature extraction *
dc.subject.singlekeyword Instance selection *
dc.subject.singlekeyword Large language models *
dc.subject.singlekeyword Mobile app reviews *
dc.subject.singlekeyword Named-entity recognition *
dc.title Leveraging encoder-only large language models for mobile app review feature extraction en
dc.type.driver info:eu-repo/semantics/article -
dc.type.full 01 Contributo su Rivista::01.01 Articolo in rivista it
dc.type.miur 262 -
iris.isi.extIssued 2025 -
iris.isi.extTitle Leveraging encoder-only large language models for mobile app review feature extraction -
iris.mediafilter.data 2026/03/04 02:52:31 *
iris.orcid.lastModifiedDate 2026/03/04 01:09:50 *
iris.orcid.lastModifiedMillisecond 1772582990394 *
iris.scopus.extIssued 2025 -
iris.scopus.extTitle Leveraging encoder-only large language models for mobile app review feature extraction -
iris.sitodocente.maxattempts 1 -
iris.unpaywall.bestoahost repository *
iris.unpaywall.bestoaversion submittedVersion *
iris.unpaywall.doi 10.1007/s10664-025-10660-y *
iris.unpaywall.hosttype repository *
iris.unpaywall.isoa true *
iris.unpaywall.journalisindoaj false *
iris.unpaywall.landingpage https://hdl.handle.net/2117/432930 *
iris.unpaywall.license cc-by-nc-nd *
iris.unpaywall.metadataCallLastModified 04/03/2026 04:34:28 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1772595268422 -
iris.unpaywall.oastatus green *
isi.authority.ancejournal EMPIRICAL SOFTWARE ENGINEERING###1382-3256 *
isi.category EW *
isi.contributor.affiliation Universitat Politecnica de Catalunya -
isi.contributor.affiliation Inst Computat Linguist A Zampolli ILC CNR -
isi.contributor.affiliation Inst Computat Linguist A Zampolli ILC CNR -
isi.contributor.affiliation Universitat Politecnica de Catalunya -
isi.contributor.affiliation Universitat Politecnica de Catalunya -
isi.contributor.country Spain -
isi.contributor.country Italy -
isi.contributor.country Italy -
isi.contributor.country Spain -
isi.contributor.country Spain -
isi.contributor.name Quim -
isi.contributor.name Alessio -
isi.contributor.name Felice -
isi.contributor.name Xavier -
isi.contributor.name Jordi -
isi.contributor.researcherId CCM-5349-2022 -
isi.contributor.researcherId GCD-5321-2022 -
isi.contributor.researcherId NVY-1615-2025 -
isi.contributor.researcherId KAM-2369-2024 -
isi.contributor.researcherId C-7258-2015 -
isi.contributor.subaffiliation Dept Serv & Informat Syst Engn -
isi.contributor.subaffiliation ItaliaNLP Lab -
isi.contributor.subaffiliation ItaliaNLP Lab -
isi.contributor.subaffiliation Dept Serv & Informat Syst Engn -
isi.contributor.subaffiliation Dept Comp Sci -
isi.contributor.surname Motger -
isi.contributor.surname Miaschi -
isi.contributor.surname Dell'Orletta -
isi.contributor.surname Franch -
isi.contributor.surname Marco -
isi.date.issued 2025 *
isi.description.allpeopleoriginal Motger, Q; Miaschi, A; Dell'Orletta, F; Franch, X; Marco, J; *
isi.document.sourcetype WOS.SCI *
isi.document.type Article *
isi.document.types Article *
isi.identifier.doi 10.1007/s10664-025-10660-y *
isi.identifier.eissn 1573-7616 *
isi.identifier.isi WOS:001471816400001 *
isi.journal.journaltitle EMPIRICAL SOFTWARE ENGINEERING *
isi.journal.journaltitleabbrev EMPIR SOFTW ENG *
isi.language.original English *
isi.publisher.place VAN GODEWIJCKSTRAAT 30, 3311 GZ DORDRECHT, NETHERLANDS *
isi.relation.issue 3 *
isi.relation.volume 30 *
isi.title Leveraging encoder-only large language models for mobile app review feature extraction *
scopus.authority.ancejournal EMPIRICAL SOFTWARE ENGINEERING###1382-3256 *
scopus.category 1712 *
scopus.contributor.affiliation Universitat Politècnica de Catalunya -
scopus.contributor.affiliation ItaliaNLP Lab -
scopus.contributor.affiliation ItaliaNLP Lab -
scopus.contributor.affiliation Universitat Politècnica de Catalunya -
scopus.contributor.affiliation Universitat Politècnica de Catalunya -
scopus.contributor.afid 60007592 -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 60007592 -
scopus.contributor.afid 60007592 -
scopus.contributor.auid 57209540522 -
scopus.contributor.auid 57211678681 -
scopus.contributor.auid 57540567000 -
scopus.contributor.auid 6603081752 -
scopus.contributor.auid 8332219900 -
scopus.contributor.country Spain -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Spain -
scopus.contributor.country Spain -
scopus.contributor.dptid 109636042 -
scopus.contributor.dptid 121833164 -
scopus.contributor.dptid 121833164 -
scopus.contributor.dptid 109636042 -
scopus.contributor.dptid 112881698 -
scopus.contributor.name Quim -
scopus.contributor.name Alessio -
scopus.contributor.name Felice -
scopus.contributor.name Xavier -
scopus.contributor.name Jordi -
scopus.contributor.subaffiliation Department of Service and Information System Engineering; -
scopus.contributor.subaffiliation Institute for Computational Linguistics “A. Zampolli” (ILC-CNR); -
scopus.contributor.subaffiliation Institute for Computational Linguistics “A. Zampolli” (ILC-CNR); -
scopus.contributor.subaffiliation Department of Service and Information System Engineering; -
scopus.contributor.subaffiliation Department of Computer Science; -
scopus.contributor.surname Motger -
scopus.contributor.surname Miaschi -
scopus.contributor.surname Dell’Orletta -
scopus.contributor.surname Franch -
scopus.contributor.surname Marco -
scopus.date.issued 2025 *
scopus.description.allpeopleoriginal Motger Q.; Miaschi A.; Dell'Orletta F.; Franch X.; Marco J. *
scopus.differences scopus.subject.keywords *
scopus.document.type ar *
scopus.document.types ar *
scopus.funding.funders 501100004895 - European Social Fund Plus; 501100004837 - Ministerio de Ciencia e Innovación; 501100004837 - Ministerio de Ciencia e Innovación; *
scopus.funding.ids PID2020-117191RB-I00 / AEI/10.13039/501100011033; *
scopus.identifier.doi 10.1007/s10664-025-10660-y *
scopus.identifier.eissn 1573-7616 *
scopus.identifier.pui 2034300319 *
scopus.identifier.scopus 2-s2.0-105003228374 *
scopus.journal.sourceid 18650 *
scopus.language.iso eng *
scopus.publisher.name Springer *
scopus.relation.article 104 *
scopus.relation.issue 3 *
scopus.relation.volume 30 *
scopus.subject.keywords Extended pre-training; Feature extraction; Instance selection; Large language models; Mobile app reviews; Named-entity recognition; *
scopus.title Leveraging encoder-only large language models for mobile app review feature extraction *
scopus.titleeng Leveraging encoder-only large language models for mobile app review feature extraction *
Appears in types: 01.01 Articolo in rivista
Files in this item:
File Size Format
s10664-025-10660-y.pdf 2.05 MB Adobe PDF
Access: authorized users only
License: NOT PUBLIC - Private/restricted access

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/570522