Abstract: Cancer classification based on gene expression profiles and the identification of significant oncogenes have become active research topics in bioinformatics. However, the complexity, high dimensionality, and limited sample size of gene expression data make comprehensive analysis challenging. Deep learning networks have achieved great success in addressing such issues, but neural network models are mostly regarded as “black box” approaches, and their limited interpretability remains a bottleneck. We propose gaBERT, an interpretable pre-trained deep learning framework for identifying genetic markers. Following BERT's pre-training and fine-tuning methodology, gaBERT first gains a general understanding of gene interaction patterns by pre-training on a large amount of unlabeled gene expression data; it then transfers this knowledge to new cancer expression data through supervised fine-tuning, ultimately producing a list of genes that contribute significantly to specific disease phenotypes. Experiments demonstrate gaBERT's good performance in cancer prediction, tumor-related gene identification, and model interpretability.