A Benchmark for Automated Pentesting Using LLM Agents

Large Language Models (LLMs), such as ChatGPT, are gaining popularity across many fields. Researchers have started to explore the use of AI-based Generative Agents, powered by LLMs, for cybersecurity applications, particularly offensive tasks such as penetration testing, and have shown that these agents can perform penetration testing tasks autonomously, often without user input or prior knowledge of the vulnerabilities involved. Although this is significant progress, the lack of standardized benchmarks for comparing the performance of these agents still holds the field back. To provide a standardized method for evaluating Generative Agents on penetration testing tasks, some researchers have recently developed an open-source benchmark that feeds tasks modeled on Capture the Flag (CTF) challenges to Generative Agents and assesses their performance. Building on that work, this thesis seeks to develop a set of simple cryptographic tasks that can serve as a starting point for a standardized evaluation of Generative Agents, and to begin to understand the limitations of these agents in cryptographic penetration testing. To this end, two different Generative Agents were evaluated on tasks comprising basic cryptographic penetration testing problems, similar to those encountered in beginner-level CTF competitions. The results indicate that, in this domain, the agents still struggle to complete cryptographic penetration testing tasks, showing that even simple tasks require Generative Agents not only to possess prior knowledge of cryptographic algorithms but also to be able to apply this knowledge logically and effectively.
