Days of Future Past: Towards Robust Detection of Malware Variants via LLM-Based Embedding Generation

Benedetti G.;Caviglione L.;Guarascio M.;Liguori A.;Manco G.;Rullo A.
2025

Abstract

The evolution of a parent malware into a family of slightly different variants can hinder signature-based detection, while the scarcity of training examples can reduce the effectiveness of machine learning methods in the early stages of an infection. To address these challenges, we define a framework that improves the ability of detectors to generalize to 'evolving' malware samples. Specifically, we leverage a Large Language Model (LLM) to map malware instructions into a latent space. The resulting embeddings are then used to train a Variational Autoencoder (VAE) that generates realistic variants. Experimental results obtained by training a detector on both real and synthetic embeddings demonstrate the effectiveness of our approach, particularly when facing three real malware families. Our LLM-based feature extraction approach should therefore be considered a promising mechanism for robust malware detection in dynamic threat environments.
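The generation step described in the abstract can be illustrated with a minimal sketch. It assumes a VAE with a diagonal Gaussian latent distribution, which is the common formulation; this record does not describe the paper's actual architecture, loss, or hyperparameters. All names (`reparameterize`, `kl_to_standard_normal`, `synthesize_variants`) and all numeric values are hypothetical, and the decoder is a placeholder rather than a trained network.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)): the usual VAE regularization term."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def synthesize_variants(mu, log_var, decode, n, seed=0):
    """Sample n latent codes around one encoded sample and decode each of them."""
    rng = random.Random(seed)
    return [decode(reparameterize(mu, log_var, rng)) for _ in range(n)]


# Hypothetical encoder output for one real malware embedding (assumed values).
mu = [0.2, -0.1, 0.5]
log_var = [-2.0, -2.0, -2.0]


def placeholder_decoder(z):
    """Stand-in for a trained decoder; it returns the latent code unchanged."""
    return list(z)


synthetic = synthesize_variants(mu, log_var, placeholder_decoder, n=4)

# Augmented training set: real and synthetic embeddings share the malware label 1.
real = [[0.21, -0.08, 0.47]]
training_set = [(x, 1) for x in real] + [(x, 1) for x in synthetic]
```

In this sketch, a detector would be trained on `training_set` rather than on `real` alone, which mirrors the augmentation idea in the abstract without claiming to reproduce the authors' setup.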
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
large language models
malware detection
malware variants
synthetic malware generation
Files in this record:
File: Days_of_Future_Past_Towards_Robust_Detection_of_Malware_Variants_via_LLM-Based_Embedding_Generation.pdf
Access: open access
License: Public domain
Size: 443.54 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/582379