Exploring representation-level augmentation and RAG-based vulnerability augmentation with LLMs for vulnerability detection
Abstract
Using deep learning (DL) to detect software vulnerabilities has become commonplace. However, data shortage remains a significant challenge because real-world vulnerabilities are scarce. A few papers have attempted to address this data scarcity through oversampling, by creating specific types of vulnerabilities, or by generating code with single-statement vulnerabilities. In this thesis, we aim to develop a general-purpose augmentation methodology that covers diverse vulnerability types, including multi-statement vulnerabilities, while outperforming previous methods. Specifically, we first explore traditional mixup-inspired augmentation methods that operate at the representation level and show that, while useful, they cannot beat random oversampling. One possible reason is that mixing samples heavily degrades the integrity of the code. Hence, we introduce VulScribeR, a RAG-based vulnerability augmentation pipeline that leverages LLMs and, unlike mixup-based methods, preserves code integrity. We show that VulScribeR outperforms the state-of-the-art (SOTA), oversampling, and representation-level augmentation methods.
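The representation-level augmentation explored in the abstract is inspired by mixup, which interpolates between pairs of samples. The following is a minimal sketch of that idea, assuming dense feature vectors and scalar labels; the function name, parameters, and Beta-distributed mixing coefficient are illustrative assumptions, not the thesis's exact implementation.

```python
import numpy as np

def mixup(x1, x2, y1, y2, alpha=0.2, rng=None):
    """Mixup-style augmentation at the representation level.

    Draws a mixing coefficient lam ~ Beta(alpha, alpha) and returns a
    convex combination of the two feature vectors and of their labels.
    Note: applied to code representations, such mixing can blur the
    structure of the underlying program, which is the integrity issue
    the abstract points to.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2  # interpolated representation
    y = lam * y1 + (1 - lam) * y2  # interpolated (soft) label
    return x, y, lam
```

A synthetic "vulnerable" sample can thus be produced by mixing a vulnerable sample's embedding with a clean one's, yielding a soft label between the two classes.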