Unsupervised reference-free inference reveals unrecognized regulated transcriptomic complexity in human single cells

Abstract: Myriad mechanisms diversify the sequence content of eukaryotic transcripts at the DNA and RNA levels,
with profound functional consequences. Examples include diversity generated by RNA splicing and V(D)J
recombination. Today, these and other events are detected with fragmented bioinformatic tools that require
predefining a form of transcript diversification; moreover, they rely on alignment to a necessarily incomplete
reference genome, filtering out unaligned sequences, which can be among the most interesting. Each of these
steps introduces blind spots for discovery. Here, we develop NOMAD+, a new analytic method that performs
unified, reference-free statistical inference directly on raw sequencing reads, extending the core NOMAD
algorithm to include a micro-assembly and interpretation framework. NOMAD+ discovers broad and new
examples of transcript diversification in single cells without genome alignment or cell type
metadata; these discoveries are impossible with current algorithms. In 10,326 primary human single cells from 19 tissues
profiled with SmartSeq2, NOMAD+ discovers a set of splicing and histone regulators with highly conserved
intronic regions that are themselves targets of complex splicing regulation, as well as unreported transcript diversity in
the heat shock protein HSP90AA1. NOMAD+ simultaneously discovers diversification in centromeric RNA
expression, V(D)J recombination, RNA editing, and repeat expansions that are missed by, or impossible to measure
with, existing bioinformatic methods. NOMAD+ is a unified, highly efficient algorithm enabling unbiased
discovery of an unprecedented breadth of RNA regulation and diversification in single cells through a new
paradigm to analyze the transcriptome.