Abstract: This paper proposes a deep learning based approach to malicious Web request detection, where a Web request is a message sent from a client to a server. From the perspective of natural language processing, the characters in a request message can be treated as the words of a sentence. Consequently, character embeddings can be learned from a corpus of Web requests in an unsupervised manner via language models. Within a Web request, the URL path and the request query string convey the most information, so they are extracted as the corpus for learning character-level embeddings. A specially designed convolutional neural network (CNN) is then applied to the learned character-level embeddings to distinguish malicious requests from legitimate ones. Our approach is entirely data-driven and free from hand-crafted features, which require domain expertise and incur high computational cost. Extensive experiments on the HTTP DATASET CSIC 2010 dataset show a significant performance improvement over existing state-of-the-art methods, especially in extremely imbalanced cases. The devised CNN model, which exploits the embedding representation of Web requests, achieves a false positive rate (FPR) of 0.02%, a true positive rate (TPR) of 98.72%, and an accuracy (ACC) of 99.46%. Code is publicly available at https://github.com/acew14/CL-Embedding-CNN.
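To make the corpus construction step concrete, the following is a minimal sketch of how the URL path and query string of a request might be turned into a character sequence and then into integer ids suitable for an embedding layer. It is an illustration only and is not taken from the released code: the helper names (`request_to_chars`, `build_vocab`, `encode`), the example URLs, the padding id 0, and the sequence length of 64 are all assumptions.

```python
from urllib.parse import urlsplit


def request_to_chars(url):
    """Hypothetical helper: keep only the URL path and query string, as characters."""
    parts = urlsplit(url)
    text = parts.path
    if parts.query:
        text += "?" + parts.query
    return list(text)


def build_vocab(corpus):
    """Map each distinct character to an id starting at 1 (0 is reserved, by assumption, for padding/unknown)."""
    chars = sorted({ch for seq in corpus for ch in seq})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(seq, vocab, max_len):
    """Convert characters to ids, truncating or right-padding with 0 to max_len."""
    ids = [vocab.get(ch, 0) for ch in seq[:max_len]]
    return ids + [0] * (max_len - len(ids))


# Made-up example requests: one benign, one carrying a percent-encoded SQL injection payload.
requests = [
    "http://localhost:8080/shop/index.jsp?id=2",
    "http://localhost:8080/shop/index.jsp?id=2%27+OR+%271%27%3D%271",
]
corpus = [request_to_chars(u) for u in requests]
vocab = build_vocab(corpus)
encoded = [encode(seq, vocab, max_len=64) for seq in corpus]
```

Each row of `encoded` is a fixed-length id sequence; in a pipeline of the kind the abstract describes, such sequences would index into the learned character embeddings before being passed to the CNN.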