Reproducing results obtained from RNA data across labs is a major hurdle in cancer research. Differences in library preparation methods and gene expression quantification platforms prevent the application of trained models to new data across labs. SpinAdapt is a novel unsupervised domain adaptation algorithm that enables the transfer of existing molecular models across labs and technological platforms, without requiring re-training or calibration of existing models for future prospective data. Furthermore, SpinAdapt calculates data corrections from summary statistics (independent latent space representations) rather than requiring full data access. This allows molecular models to be transferred across sequencing platforms and between labs without loss of data ownership or compromise of data privacy.
To evaluate SpinAdapt, we performed two sets of experiments:
A) We transferred molecular tumor subtype classifiers across four pairs of publicly available cancer datasets (bladder, breast, colorectal, pancreatic), covering 4,076 samples across 18 different tumor subtypes and three technological platforms (RNASeq, Affymetrix U133Plus2, and HE1ST). For each pair of datasets, we trained a subtype classifier on one dataset (target) according to well-accepted subtyping annotations (Zea Tan et al. 2019; Prat et al. 2012; Guinney et al. 2015; Bailey et al. 2016), and then evaluated the classifier accuracy on the other dataset (source).
For each tumor subtype, we quantified the classification performance using the mean AUC score across random subsets of the source dataset, where each subset was corrected using SpinAdapt. We aggregated performance across all subtypes and report the average mean AUC score for each cancer type (bladder 0.95, breast 0.98, colorectal 0.98, pancreatic 0.96), demonstrating high accuracy on all diagnostic tasks.
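The aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the subset fraction, the number of subsets, and the one-vs-rest labeling are all assumptions, and `score_subset` is a hypothetical stand-in for "adapt the subset, then run the trained classifier on it".

```python
import random
from statistics import mean


def auc(scores, labels):
    """ROC AUC via the pairwise (Mann-Whitney) identity; tied scores count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc_over_subsets(score_subset, labels, n_subsets=20, frac=0.5, seed=0):
    """Mean AUC across random subsets of the samples.

    score_subset: hypothetical callable mapping a list of sample indices to
    classifier scores for those samples (adaptation would happen inside it).
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    size = max(2, int(frac * len(indices)))
    aucs = []
    for _ in range(100 * n_subsets):  # cap attempts so rare classes cannot loop forever
        if len(aucs) == n_subsets:
            break
        subset = rng.sample(indices, size)
        subset_labels = [labels[i] for i in subset]
        if len(set(subset_labels)) < 2:
            continue  # AUC is undefined when a subset holds only one class
        aucs.append(auc(score_subset(subset), subset_labels))
    if not aucs:
        raise ValueError("no subset contained both classes")
    return mean(aucs)


def average_mean_auc(score_subset_by_subtype, subtype_of_sample, **kwargs):
    """Average the per-subtype mean AUCs, treating each subtype one-vs-rest."""
    per_subtype = {}
    for subtype, score_subset in score_subset_by_subtype.items():
        labels = [int(s == subtype) for s in subtype_of_sample]
        per_subtype[subtype] = mean_auc_over_subsets(score_subset, labels, **kwargs)
    return mean(list(per_subtype.values())), per_subtype


# Toy data (made up for illustration): six samples, two subtypes, fixed scores.
toy_subtypes = ["A", "A", "A", "B", "B", "B"]
toy_scores = {
    "A": [0.9, 0.8, 0.7, 0.3, 0.2, 0.1],
    "B": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],
}
toy_scorers = {
    name: (lambda idx, v=values: [v[i] for i in idx])
    for name, values in toy_scores.items()
}
overall, per_subtype = average_mean_auc(
    toy_scorers, toy_subtypes, n_subsets=10, frac=0.5, seed=1
)
```

Averaging within each subtype first and across subtypes second gives every subtype equal weight regardless of how many samples it has, which is one plausible reading of "average mean AUC".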
B) To demonstrate the transferability of prognostic models, we trained five Cox survival models on five target cancer datasets respectively (breast, lung, colorectal, liver, pancreatic) ranging from 186 to 2,919 RNASeq samples. We used SpinAdapt to adapt five source cancer datasets to the target datasets, ranging from 226 to 1,038 samples across different platforms (RNASeq, Affymetrix U133Plus2 and HG-U133A, Illumina HumanHT-12v4). For every cancer type, we trained a Cox model on the target dataset, and measured its performance by predicting survival risk on the corresponding adapted source dataset.
We show high survival prediction accuracy for all datasets (log-rank p-value, c-index): lung [1e-6, 0.661], breast [5e-5, 0.626], liver [1e-4, 0.708], pancreatic [2e-4, 0.629], colorectal [9e-4, 0.661].
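For readers unfamiliar with the c-index, the sketch below computes Harrell's concordance index, a common definition. The abstract does not state which variant was used, so treat this as an assumption; the function name and the toy data are illustrative only.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs in which the sample that fails
    earlier was assigned the higher predicted risk.

    times:  observed follow-up times
    events: 1 if the event was observed, 0 if the sample was censored
    risks:  predicted risk scores (higher means shorter expected survival)
    A pair is comparable when the sample with the shorter time had an observed
    event. Pairs with tied times are ignored here for simplicity.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored sample cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example (made-up values): four patients, one censored.
c = concordance_index(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    risks=[0.9, 0.3, 0.5, 0.1],
)
```

A value of 0.5 corresponds to random risk ordering and 1.0 to perfect ordering, so the values reported above all sit above chance.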
SpinAdapt transferred diagnostic and prognostic models over 14 cancer datasets covering 7,146 samples across six different cancer types and various platforms (RNASeq, microarray), while maintaining model accuracy and statistical significance.