Abstract: Traces of double compression can serve as crucial evidence of image manipulation in forensic investigations. With the ever-increasing popularity of the WebP format, a new type of double compression, WebP-JPEG transcoding, has emerged. However, distinguishing it from two common compression cases, single JPEG (SJPEG) and double JPEG (DJPEG), has not yet been studied. In this paper, we propose a specialized method for this new task. First, a detailed analysis is conducted to reveal the differences in compression artifacts between WebP-JPEG and SJPEG/DJPEG, which manifest in the distributions of $4\times 4$ / $8\times 8$ DCT coefficients and in the high-frequency portions of the image spectrum. Then, multi-modality DCT histograms (MMDH) and high-pass-filtered image residuals (HPFIR) are proposed as front-end dual-domain forensic features to expose these differences. An indispensable part of these features is extracted through a novel frequency-isolation module (FIM), which offers additional information based on the derived relationship between $4\times 4$ and $8\times 8$ DCT coefficients. Finally, a CNN-ViT (Convolutional Neural Network-Vision Transformer) dual-stream network is designed to learn back-end deep features for reliable detection, where a CNN stream processes the statistical features in MMDH while a ViT stream learns the spatial correlations in HPFIR. Extensive experimental results demonstrate that the proposed method significantly outperforms state-of-the-art double compression detection methods in distinguishing WebP-JPEG from SJPEG/DJPEG and is more effective in tampering localization. Specifically, the proposed method achieves an average detection accuracy of 0.942 for small images of size $128\times 128$.
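To make the two feature domains named above more concrete, the following is a minimal sketch of generic block-DCT coefficient histograms (for $4\times 4$ and $8\times 8$ blocks) and a simple high-pass residual. It is only an illustration under stated assumptions, not the MMDH, HPFIR, or FIM constructions of this paper: the function names, the bin range $[-5, 5]$, the orthonormal DCT-II scaling, and the four-neighbour averaging filter are all hypothetical choices made here.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def block_dct(block, basis):
    """2-D DCT of a square block: C * X * C^T."""
    basis_t = [list(r) for r in zip(*basis)]
    return matmul(matmul(basis, block), basis_t)

def dct_histograms(image, n, bin_range=(-5, 5)):
    """Histogram of rounded DCT coefficients per frequency position (u, v).

    The image is tiled into non-overlapping n x n blocks; coefficients
    outside bin_range are ignored (an assumption of this sketch).
    """
    basis = dct_matrix(n)
    lo, hi = bin_range
    hist = {(u, v): [0] * (hi - lo + 1) for u in range(n) for v in range(n)}
    height, width = len(image), len(image[0])
    for y in range(0, height - n + 1, n):
        for x in range(0, width - n + 1, n):
            block = [row[x:x + n] for row in image[y:y + n]]
            coeffs = block_dct(block, basis)
            for u in range(n):
                for v in range(n):
                    c = round(coeffs[u][v])
                    if lo <= c <= hi:
                        hist[(u, v)][c - lo] += 1
    return hist

def high_pass_residual(image):
    """Pixel minus the mean of its four neighbours; border pixels stay 0."""
    height, width = len(image), len(image[0])
    res = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            neighbours = (image[y - 1][x] + image[y + 1][x] +
                          image[y][x - 1] + image[y][x + 1])
            res[y][x] = image[y][x] - neighbours / 4.0
    return res
```

Calling `dct_histograms(image, 4)` and `dct_histograms(image, 8)` on the same grayscale image (a list of rows of floats) gives two sets of per-frequency histograms that could be compared across compression histories, while `high_pass_residual(image)` suppresses image content and keeps mostly high-frequency detail. In practice the coefficients would usually be taken from the decoded luminance channel, and the actual features would follow the definitions given in the body of the paper.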