Keywords: Triplet Networks, Archetypal Analysis, High Content Imaging, Representation Learning, Drug Discovery, Self-Supervision
TL;DR: A representation learning approach based on archetypal analysis and self-supervision is introduced to alleviate the biased and highly expensive data curation process for supervised endpoint classification in the context of biopharma drug discovery.
Abstract: Biopharma drug discovery requires a set of approaches to find, produce, and test the safety of drugs for clinical application. A crucial part involves image-based screening of cell culture models in which single cells are stained with appropriate markers to visually distinguish between diseased and healthy states. In practice, such image-based screening experiments are frequently performed using highly scalable and automated multichannel microscopy instruments. This automation enables parallel screening against large panels of marketed drugs with known function. However, the large data volume produced by such instruments hinders systematic inspection by human experts, which consequently leads to an extensive and biased data curation process for supervised phenotypic endpoint classification. To overcome this limitation, we propose a novel approach for learning an embedding of phenotypic endpoints without any supervision. We employ the concept of archetypal analysis, in which pseudo-labels are extracted based on biologically reasonable endpoints. Subsequently, we use a self-supervised triplet network to learn a phenotypic embedding, which is used for visual inspection and top-down assay quality control. Extensive experiments on two industry-relevant assays demonstrate that our method outperforms state-of-the-art unsupervised and supervised approaches.
Registration: I acknowledge that publication of this at MIDL and in the proceedings requires at least one of the authors to register and present the work during the conference.
Authorship: I confirm that I am the author of this work and that it has not been submitted to another publication before.
Paper Type: both
Primary Subject Area: Unsupervised Learning and Representation Learning
Secondary Subject Area: Application: Other
Confidentiality And Author Instructions: I read the call for papers and author instructions. I acknowledge that exceeding the page limit and/or altering the latex template can result in desk rejection.
Code And Data: Two real-world HCS datasets were used in this study. The NTR1 dataset (Peddibhotla et al., 2013) is used for a qualitative assessment of the self-supervised embedding. The BBBC013 dataset (Ljosa et al., 2012) is used to quantitatively evaluate the embedding features with regard to assay quality metrics. BBBC013 is available from the Broad Bioimage Benchmark Collection (https://bbbc.broadinstitute.org/BBBC013).