Use of natural language processing to classify mobile health apps: performance evaluation

Paglialonga, A; Schiavo, M; Caiani, EG

Background: Conventional keyword-based search (KBS) for mobile health apps on the app stores has limitations: the user is given long lists of apps with little insight into each app's relevance to the search.

Purpose: The aim of this study was to develop a novel method, based on natural language processing (NLP), able to identify apps in a given topical area more accurately than KBS.

Methods: We developed an automated NLP method based on MetaMap and the Unified Medical Language System (UMLS) to extract medical concepts from the app descriptions reported on the store webpages. We built a classifier that identifies, from the medical concepts retrieved, the most relevant topical area(s) of each app. We extracted a random sample of 800 apps from the Medical and Health & Fitness categories of the US iTunes app store and built a training set and a test set of 400 apps each. We classified apps into topical areas (e.g., cardiology, emergency medicine, neurology, oncology, surgery, fitness & wellness), assigning one or more areas whenever relevant, or none when the app had no medical content. Classification was performed: (i) manually (gold standard); (ii) by using NLP; and (iii) by using KBS, which was implemented with comprehensive lists of keywords for a conservative estimate. We evaluated the performance of the NLP and KBS methods by computing multi-label classification metrics: accuracy (i.e., the average of accuracy values across topical areas), exact match (i.e., the proportion of predicted sets matching the true sets exactly), recall (i.e., sensitivity), and Hamming loss (i.e., the fraction of topical areas not correctly predicted).

Results: By optimizing the proposed NLP method on the training set we obtained 94% accuracy, 49% exact match, 88% recall, and 6% Hamming loss. On the test set, the performance of NLP was similar to that on the training set, whereas the performance of KBS was lower (accuracy: 92% NLP, 39% KBS; exact match: 36% NLP, 28% KBS; recall: 63% NLP, 50% KBS; Hamming loss: 8% NLP, 9% KBS). KBS performance in real settings is likely to be lower still, as app searches typically use few keywords. NLP performance was higher in some topical areas (e.g., endocrinology, oncology, gastroenterology) and lower in others (e.g., surgery, fitness & wellness), where the vocabularies are less specific.

Conclusions: The proposed NLP method performed better than KBS in classifying health apps into topical areas, because NLP extracts medical concepts from their context and uses optimized classification rules, whereas KBS simply retrieves keywords from a text. The NLP method can be further improved by including additional vocabularies and more complex classification rules. The proposed method is context-aware and able to classify apps into multiple categories whenever relevant; as such, it may be the basis for novel filtering tools that support patients and healthcare professionals in the informed adoption of apps.
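The four multi-label metrics defined in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `multilabel_metrics`, the three-label toy data, and the choice to pool recall over all app/area pairs (micro-averaging) are assumptions made here for illustration, since the abstract does not state how recall was averaged.

```python
from typing import Dict, List


def multilabel_metrics(y_true: List[List[int]], y_pred: List[List[int]]) -> Dict[str, float]:
    """Compute multi-label metrics from binary indicator rows.

    Each row is one app; each column is one topical area (1 = assigned).
    Hypothetical helper for illustration only.
    """
    n_apps = len(y_true)
    n_labels = len(y_true[0])

    # Accuracy: per-area accuracy, averaged across topical areas.
    per_label_acc = [
        sum(1 for i in range(n_apps) if y_true[i][j] == y_pred[i][j]) / n_apps
        for j in range(n_labels)
    ]
    accuracy = sum(per_label_acc) / n_labels

    # Exact match: predicted label set equals the true label set.
    exact_match = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n_apps

    # Recall (sensitivity), pooled over all app/area pairs (assumed micro-average).
    tp = sum(1 for t, p in zip(y_true, y_pred) for a, b in zip(t, p) if a == 1 and b == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) for a, b in zip(t, p) if a == 1 and b == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Hamming loss: fraction of app/area pairs predicted incorrectly.
    wrong = sum(1 for t, p in zip(y_true, y_pred) for a, b in zip(t, p) if a != b)
    hamming_loss = wrong / (n_apps * n_labels)

    return {
        "accuracy": accuracy,
        "exact_match": exact_match,
        "recall": recall,
        "hamming_loss": hamming_loss,
    }


# Toy data (invented): columns are cardiology, oncology, fitness & wellness.
y_true = [[1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
metrics = multilabel_metrics(y_true, y_pred)
```

With these definitions, the area-averaged accuracy and the Hamming loss are complements (they sum to 1), which matches the NLP figures reported above (94% and 6% on the training set, 92% and 8% on the test set). Exact match is the strictest of the four, since a single wrong area makes the whole prediction for that app count as a miss.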

CNR Institutional Research Information System