Reusability report: Learning the transcriptional grammar in single-cell RNA-sequencing data using transformers
Abstract: The rise of single-cell genomics presents an attractive opportunity for data-hungry machine learning algorithms. The scBERT method, inspired by the success of BERT (‘bidirectional encoder representations from transformers’) in natural language processing, was recently introduced by Yang et al. as a data-driven tool to annotate cell types in single-cell genomics data. Analogous to contextual embedding in BERT, scBERT leverages pretraining and self-attention mechanisms to learn the ‘transcriptional grammar’ of cells. Here we investigate the reusability of scBERT beyond the original datasets, assessing the generalizability of natural language techniques in single-cell genomics. We find that the degree of imbalance in the cell-type distribution substantially influences the performance of scBERT. Anticipating increased use of transformers, we highlight the necessity of considering data distribution carefully, and we introduce a subsampling technique to mitigate the influence of an imbalanced distribution. Our analysis serves as a stepping stone towards understanding and optimizing the use of transformers in single-cell genomics.
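The abstract refers to a subsampling technique for mitigating cell-type imbalance. As a rough illustration of what such a step can look like, the following is a minimal sketch of per-class subsampling that caps the number of cells per annotated type before fine-tuning; the function name, the cap parameter and the example labels are hypothetical and are not taken from the scBERT codebase or the report itself.

```python
# Minimal sketch of class-balanced subsampling (illustrative, not the
# authors' implementation). Assumes cell-type labels are available as a
# 1-D array; `max_per_type` is an assumed tuning parameter.
import numpy as np

def subsample_balanced(labels: np.ndarray, max_per_type: int, seed: int = 0) -> np.ndarray:
    """Return indices keeping at most `max_per_type` cells per cell type."""
    rng = np.random.default_rng(seed)
    keep = []
    for cell_type in np.unique(labels):
        idx = np.flatnonzero(labels == cell_type)
        if idx.size > max_per_type:
            # Randomly downsample over-represented cell types.
            idx = rng.choice(idx, size=max_per_type, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))

# Example: cap every cell type at 500 cells before fine-tuning.
labels = np.array(["T cell"] * 5000 + ["B cell"] * 300 + ["NK cell"] * 120)
idx = subsample_balanced(labels, max_per_type=500)
# The expression matrix and labels would then be indexed with `idx`.
```

Capping majority classes in this way leaves rare cell types untouched while reducing the dominance of abundant ones, which is one common way to probe how class imbalance affects a classifier's reported performance.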