Quantifying and Optimising the Performance of Cyber Deceptions

Published: 2024, Last Modified: 18 Dec 2024, CC BY-SA 4.0
Abstract: Cybersecurity is essential in today's digital era due to the extensive digitalisation of critical aspects of daily life, including communication, business, healthcare, and education. This growing reliance on digital platforms has increased the potential impact of cyberattacks, encompassing financial losses, theft of sensitive data, disruption of services, and threats to physical safety. The 2023 Cost of a Data Breach Report highlighted a significant rise in the global average cost of data breaches, emphasising the necessity for robust cybersecurity measures across sectors to mitigate these risks (IBM, 2023). Cyber deception is a field within cybersecurity that employs deceptive tactics to detect, confuse, and deter attackers. Honeyfiles, a subset of cyber deception, are decoy documents placed on file servers to act as bait for hackers. When a hacker interacts with a honeyfile, an alarm is triggered, alerting the system administrator, who can investigate the breach. Honeyfiles offer several advantages: early detection of intruders, deterrence, slowing attackers' operations, insight into their tactics and objectives, effectiveness against insider threats, and, when deployed effectively, lower false-positive and false-negative rates than other network intrusion detection systems. Honeyfiles have become a crucial part of the cybersecurity toolkit, but creating them is labour-intensive. Generative artificial intelligence (genAI) can automate and optimise honeyfile creation; however, quantitative metrics for evaluating honeyfiles remain limited. This thesis develops quantitative measures of honeyfile text quality to improve both the generation process and the resulting honeyfiles. To ensure honeyfiles effectively entice intruders, we introduce Topic Semantic Matching (TSM), a novel metric utilising Natural Language Processing (NLP).
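The thesis defines TSM precisely; as a loose illustration only of the idea of combining topic modelling with semantic similarity, one might compare a honeyfile's topic distribution with those of its neighbouring files. The keyword-overlap "topic model" and all function names below are illustrative assumptions, not the thesis implementation, which a real system would replace with LDA or embedding-based topics.

```python
import numpy as np

def topic_distribution(text: str, topic_keywords: dict[str, set[str]]) -> np.ndarray:
    """Crude stand-in for a topic model: the fraction of each topic's
    keywords that appear in the text (a real system would use LDA or
    an embedding-based topic model)."""
    words = set(text.lower().split())
    counts = np.array([len(words & kws) for kws in topic_keywords.values()], dtype=float)
    total = counts.sum()
    return counts / total if total > 0 else counts

def topic_semantic_match(honeyfile: str, neighbours: list[str],
                         topic_keywords: dict[str, set[str]]) -> float:
    """Illustrative TSM-style score: mean cosine similarity between the
    honeyfile's topic distribution and those of its neighbouring files."""
    h = topic_distribution(honeyfile, topic_keywords)
    sims = []
    for doc in neighbours:
        d = topic_distribution(doc, topic_keywords)
        denom = np.linalg.norm(h) * np.linalg.norm(d)
        sims.append(float(np.dot(h, d) / denom) if denom > 0 else 0.0)
    return float(np.mean(sims))
```

Under this sketch, a networking-themed honeyfile placed among networking documents scores higher than the same honeyfile placed among finance documents, which is the property a deployment-time enticement metric needs.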
Through topic modelling and semantic similarity, TSM compares honeyfile content with nearby files. To validate TSM's effectiveness, we create a honeyfile corpus spanning various technical subjects and containing honeyfiles generated with a range of techniques. We recruited participants on a crowdsourcing platform to evaluate honeyfile enticement in simulated breach scenarios, and we demonstrate that TSM aligns with human judgements of enticement. This analysis, enriched by participant feedback, enhances our understanding of how to evaluate textual enticement in honeyfiles. In addition to enticing intruders, honeyfiles must appear realistic to avoid detection as decoys. Although cohesion and coherence are key indicators of textual realism in the honeyfile literature, their alignment with human perceptions of realism has been inconsistently validated across honeyfile generation techniques. Through a crowdsourced study, we found that these metrics do not consistently reflect perceived realism. Analysing participant feedback, we identified textual consistency as a crucial aspect of realism, even though statistical analyses did not firmly correlate cohesion and coherence with perceptions of realism. This feedback, made public for community analysis, underscores the complexity of achieving and evaluating realism in honeyfiles. After generating a honeyfile, it is crucial to ensure that the text does not contain any sensitive information. While fine-tuning Large Language Models (LLMs) can enhance honeyfile realism, the process risks leaking sensitive content into the honeyfiles. We focus on detecting complex sensitive information within sentences using Transformer models.
By evaluating these models against traditional methods on a dataset of legal documents from the Monsanto trial, we found that Transformer models are significantly more effective at identifying intricate sensitive information. To prevent intruders from suspecting a honeyfile's authenticity, its filename must blend seamlessly with the filesystem in which it is deployed, a property termed camouflage. We quantify camouflage with two metrics based on cosine distances in a semantic vector space: a simple metric comparing the honeyfile name's embedding with the average embedding of the local files, and a cluster-based metric measuring the distance to the nearest cluster mean in a mixture model. Although the simple metric is computationally less intensive, the cluster-based approach more accurately reflects directories with diverse content. Both metrics, tested on a GitHub software repository dataset representing a broad range of directory sizes, showed effective camouflage, with minimal differences between the metrics across directory sizes.
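The two camouflage metrics can be sketched in a few lines. The function names and toy two-dimensional embeddings below are illustrative assumptions, not the thesis implementation, which would use real filename embeddings and a fitted mixture model; lower distances indicate better camouflage.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def simple_camouflage(name_emb: np.ndarray, local_embs: np.ndarray) -> float:
    """Simple metric: distance from the honeyfile name's embedding
    to the mean embedding of the files already in the directory."""
    return cosine_distance(name_emb, local_embs.mean(axis=0))

def cluster_camouflage(name_emb: np.ndarray, cluster_means: list[np.ndarray]) -> float:
    """Cluster-based metric: distance to the nearest cluster mean
    (clusters fit beforehand, e.g. with a Gaussian mixture model)."""
    return min(cosine_distance(name_emb, m) for m in cluster_means)

# Toy 2-D embeddings: a directory with two distinct content clusters.
local = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
honeyfile = np.array([0.95, 0.05])  # filename resembling the first cluster

print(simple_camouflage(honeyfile, local))
print(cluster_camouflage(honeyfile, [local[:2].mean(axis=0),
                                     local[2:].mean(axis=0)]))
```

In this toy directory the simple metric penalises the honeyfile for sitting away from the global mean, while the cluster-based metric recognises that the filename fits one of the two content clusters, which matches the abstract's point that the cluster-based approach better reflects directories with diverse content.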