CNR Institutional Research Information System

Days of Future Past: Towards Robust Detection of Malware Variants via LLM-Based Embedding Generation

Benedetti G.;Caviglione L.;Choras M.;Guarascio M.;Liguori A.;Manco G.;Rullo A.

2025

Abstract

The evolution of a parent malware into a family of slightly different mutations may hinder signature-based detection mechanisms, while the limited number of training examples available in the early stages of an infection may reduce the effectiveness of machine learning methods. To address these challenges, we define a framework that improves how well detection generalizes to 'evolving' malware samples. Specifically, we leverage a Large Language Model (LLM) to map malware instructions into a latent space. The resulting embeddings are then used to train a Variational Autoencoder (VAE) that generates realistic variants. Experimental results obtained by training a detector on both real and synthetic embeddings demonstrate the effectiveness of our approach, in particular when tested against three real malware families. Our LLM-based feature extraction approach should therefore be considered a promising mechanism for robust malware detection in dynamic threat environments.
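
The generation step mentioned in the abstract can be illustrated with a minimal sketch of how a Variational Autoencoder draws latent samples. This is not the authors' implementation: it assumes a diagonal Gaussian posterior, the function names are hypothetical, and the values of `mu` and `log_var` are made-up placeholders standing in for the encoder output of one malware embedding. A trained decoder (not shown) would map each sampled latent code back to a synthetic embedding.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), independently per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), the VAE regularizer."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def sample_variants(mu, log_var, n, seed=0):
    """Sample n latent codes around one encoded embedding (seeded for repeatability)."""
    rng = random.Random(seed)
    return [reparameterize(mu, log_var, rng) for _ in range(n)]


# Hypothetical encoder output for a single embedding (illustrative numbers only).
mu = [0.2, -1.0, 0.5]
log_var = [-2.0, -2.0, -2.0]

variants = sample_variants(mu, log_var, n=5)
print(len(variants), "latent codes of dimension", len(variants[0]))
print("KL term:", kl_to_standard_normal(mu, log_var))
```

Under these assumptions, a detector could then be trained on the union of real embeddings and decoded synthetic ones, which is the augmentation idea the abstract describes.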

Year: 2025

Organizational units: Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR

Keywords: large language models; malware detection; malware variants; synthetic malware generation

Appears in types: 04.01 Contribution in conference proceedings

Files in this record:

File	Access	License	Size	Format
Days_of_Future_Past_Towards_Robust_Detection_of_Malware_Variants_via_LLM-Based_Embedding_Generation.pdf	Open access	Public domain	443.54 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/582379
