A Dataset for the Fine-tuning of LLM for the NER Task in the Cyber Security Domain

Stefano Silvestri; Giuseppe Felice Russo; Giuseppe Tricomi; Mario Ciampi
2025

Abstract

The increasing complexity of cyber threats necessitates robust cyber security measures. Effective threat detection and mitigation depend on Cyber Threat Intelligence (CTI), which comprises both structured and unstructured data critical for proactive defense strategies. While databases such as the NVD and ExploitDB offer structured security information, a significant amount of vital intelligence first appears in unstructured sources such as blogs, mailing lists, and news sites. Extracting meaningful information from these sources is particularly challenging in cyber security and requires specialized Named Entity Recognition (NER) tools to identify domain-specific entities. This paper presents an NER dataset obtained by merging two cyber security domain datasets, CyNER and APTNER, creating a unified resource that improves NER model training. Experimental results with advanced NER models show significant performance gains, underscoring the value of the proposed dataset in advancing cyber security practices and highlighting the need for such resources.
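
The abstract does not describe how the two corpora were combined, so the following is only a minimal sketch of one common way to merge two token-level NER datasets: map each corpus's entity types onto a shared label set and concatenate the sentences. It assumes both corpora are already loaded as lists of (token, BIO tag) pairs. The label names in `MAP_A` and `MAP_B`, and the function names, are hypothetical placeholders and not the actual CyNER or APTNER tag sets or the authors' procedure.

```python
from typing import Dict, List, Tuple

Sentence = List[Tuple[str, str]]  # one sentence as (token, BIO tag) pairs

# Hypothetical label maps: illustrative placeholders only,
# not the real tag sets of either corpus.
MAP_A: Dict[str, str] = {"Malware": "MALWARE", "Organization": "ORG"}
MAP_B: Dict[str, str] = {"MAL": "MALWARE", "SECTEAM": "ORG"}


def remap_tag(tag: str, label_map: Dict[str, str]) -> str:
    """Map one BIO tag to the unified scheme; unmapped entity types become 'O'."""
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")  # e.g. "B-MAL" -> ("B", "-", "MAL")
    unified = label_map.get(label)
    return f"{prefix}-{unified}" if unified else "O"


def remap_corpus(corpus: List[Sentence], label_map: Dict[str, str]) -> List[Sentence]:
    """Apply the label map to every token of every sentence."""
    return [[(tok, remap_tag(tag, label_map)) for tok, tag in sent] for sent in corpus]


def merge_corpora(corpus_a: List[Sentence], corpus_b: List[Sentence]) -> List[Sentence]:
    """Concatenate two corpora after mapping both to the shared label set."""
    return remap_corpus(corpus_a, MAP_A) + remap_corpus(corpus_b, MAP_B)
```

In this sketch, entity types without a counterpart in the shared label set are relabeled as `O`. Whether to drop, keep, or rename such types is a design choice that depends on the corpora being merged.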
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR - Sede Secondaria Napoli
Cyber Threat Intelligence
Named Entity Recognition
Cybersecurity
Large Language Model
NLP
Files in this record:
File: paper_170.pdf
Access: open access
Type: Publisher's version (PDF)
License: Creative Commons
Size: 180.32 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/539196