Abstract: The modern web is a platform on which a single page combines many different resources. These resources span many URLs and often include malicious or sensitive content and advertisements (ads), and much of the content is dynamically generated. Determining which website is responsible for a given HTTP URL is therefore a daunting challenge. Although many tracing methods exist, they are typically designed for specific kinds of websites. There is currently no tool that reconstructs a comprehensive view separating landing URLs, which users request explicitly, from the noisy URLs that browsers fire automatically. In this paper, we propose Traceback, a tracing framework that provides such a comprehensive view. We build per-user and multi-user chains from passively collected traffic, and then extract novel statistical features from graph structures, HTTP states, and semantics. In cross-validation experiments with a Random Forest classifier in a real local area network environment, our method identifies landing URLs accurately, with recall of up to 95% and precision of over 94%.