In cybersecurity, system provenance graphs are a key primitive for intrusion detection and program identification. The recent shift toward data-hungry graph learning models in security-critical applications has exposed significant limitations in existing provenance datasets. In particular, imbalanced representation of programs induces bias and degrades the performance of downstream models. Moreover, these models rely on rich numeric and textual node attributes to accurately encode program behaviors, which limits the ability of existing data augmentation techniques to address data imbalance in provenance graphs.
We present PROVCREATOR, a novel graph synthesis framework designed for feature-rich system provenance graphs. PROVCREATOR learns the joint distribution of node attributes and graph structures conditioned on program class labels, enabling targeted generation of realistic system provenance graphs to supplement underrepresented programs. Our evaluation shows that PROVCREATOR produces provenance graphs with higher structural fidelity, attribute fidelity, and downstream utility than those produced by previous graph synthesis methods.
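To make "supplementing underrepresented programs" concrete, the sketch below computes how many synthetic graphs each program class would need in order to match the largest class. It is only an illustration of the balancing goal, not PROVCREATOR's method: the function name `synthesis_budget`, the `target` parameter, and the example program labels are assumptions introduced here, and the class-conditional generator itself is not shown.

```python
from collections import Counter


def synthesis_budget(labels, target=None):
    """Return, per class, the number of synthetic graphs needed to reach `target`.

    `labels` holds one program class label per provenance graph.
    If `target` is omitted, the size of the largest class is used
    (an assumed balancing policy, chosen for illustration).
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())
    # Classes already at or above the target need no synthetic graphs.
    return {cls: max(0, target - n) for cls, n in counts.items()}


# Hypothetical dataset: one label per collected provenance graph.
labels = ["bash"] * 5 + ["curl"] * 2 + ["ssh"] * 1
budget = synthesis_budget(labels)
print(budget)  # {'bash': 0, 'curl': 3, 'ssh': 4}
```

A class-conditional generator could then be asked for `budget[cls]` graphs per class, so that the synthetic data goes where the dataset is thinnest.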