Abstract: Convolutional neural networks (CNNs) have been widely used in remote sensing (RS) scene classification and have achieved remarkable results. In RS scene classification, local key objects are particularly crucial to the classification result. However, most existing CNN-based methods directly use the deep-level global features of the CNN, either ignoring the object-level information contained in shallow features or introducing redundant and erroneous information when shallow features are used. To fully exploit the important information in shallow features, we propose an end-to-end contextual spatial-channel attention network (CSCANet) that learns multilayer feature representations and further improves classification performance by employing shallow object-level semantic information. First, a pretrained ResNet34 extracts features at different levels. Second, a contextual spatial-channel attention module (CSCAM) is constructed to generate contextual spatial-channel attention features by exploiting these multilevel features. Finally, the triplet loss is combined with the center loss to guide model training. Experiments on three public RS scene classification datasets [UC-Merced (UCM), the aerial image dataset (AID), and NWPU-RESISC45 (NWPU)] demonstrate that the proposed method achieves highly competitive results.
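The training objective described above combines a triplet loss with a center loss. As a rough illustration of how such a combination can be wired up, the sketch below uses PyTorch; it is not the paper's implementation, and the names (`CenterLoss`, `CombinedLoss`), the 512-d feature size (the output dimension of ResNet34's final pooling), and the weighting `center_weight` are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss: pulls each embedding toward a learnable per-class center."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        # Squared distance between each feature and its ground-truth class center.
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()

class CombinedLoss(nn.Module):
    """Weighted sum of triplet loss and center loss (weight is an assumption)."""
    def __init__(self, num_classes, feat_dim, margin=0.3, center_weight=0.01):
        super().__init__()
        self.triplet = nn.TripletMarginLoss(margin=margin)
        self.center = CenterLoss(num_classes, feat_dim)
        self.center_weight = center_weight

    def forward(self, anchor, positive, negative, labels):
        # labels: class indices of the anchor samples.
        return (self.triplet(anchor, positive, negative)
                + self.center_weight * self.center(anchor, labels))

# Usage with dummy embeddings: batch of 8, 512-d features, 45 classes
# (the class count of NWPU-RESISC45).
criterion = CombinedLoss(num_classes=45, feat_dim=512)
a, p, n = (torch.randn(8, 512) for _ in range(3))
labels = torch.randint(0, 45, (8,))
loss = criterion(a, p, n, labels)
loss.backward()
```

The triplet term shapes the embedding space by relative comparisons, while the center term tightens each class cluster; a small `center_weight` keeps the center term from collapsing the embeddings too aggressively.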