Single-cell RNA sequencing (scRNA-seq) technology allows researchers to profile the transcriptomes of thousands of cells simultaneously. The high-throughput nature of this approach leads to a more complex data structure: a mix of designed and random barcodes identifies the different cells and molecules, and these barcodes must be processed efficiently to arrive at a count matrix for further analysis. For downstream analysis, a large number of methods are already available for scRNA-seq data, addressing problems that range from normalization to clustering and trajectory analysis. Efforts to comprehensively benchmark these methods are still in their infancy and are currently hampered by the lack of gold-standard data sets.
To address these issues, we developed new software tools and benchmarking data sets. For data preprocessing, the scPipe R/Bioconductor package was created to take raw sequence reads from FASTQ files generated by different protocols (including CEL-seq, MARS-seq, Chromium 10x and Drop-seq) and arrive at a gene count matrix for downstream analysis. scPipe performs demultiplexing, unique molecular identifier (UMI) deduplication, alignment and gene counting. It also aids quality control by plotting a number of key quality metrics and by applying robust outlier detection to remove poor-quality cells.
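The idea behind UMI deduplication and gene counting can be illustrated with a minimal sketch. It is not scPipe's implementation (scPipe is an R/Bioconductor package). It assumes reads have already been aligned and assigned to a cell barcode and a gene, and the function name `count_umis` and the toy barcodes are hypothetical. Reads that share a cell barcode, gene and UMI are treated as PCR duplicates of one molecule and counted once.

```python
from collections import defaultdict


def count_umis(reads):
    """Collapse reads into per-cell, per-gene molecule counts.

    `reads` is an iterable of (cell_barcode, umi, gene) tuples, one per
    aligned read. This input layout is an assumption for illustration.
    """
    seen = defaultdict(set)
    for cell, umi, gene in reads:
        # A set keeps each UMI once, so duplicate reads are collapsed.
        seen[(cell, gene)].add(umi)
    # The count for each (cell, gene) pair is the number of distinct UMIs.
    return {key: len(umis) for key, umis in seen.items()}


# Toy input: the first two reads are duplicates of the same molecule.
reads = [
    ("AAAC", "TTGCA", "GeneA"),
    ("AAAC", "TTGCA", "GeneA"),
    ("AAAC", "GGATC", "GeneA"),
    ("AAAC", "TTGCA", "GeneB"),
    ("CCGT", "TTGCA", "GeneA"),
]
counts = count_umis(reads)
```

This sketch matches UMIs exactly. Tolerating sequencing errors in the UMI, for example by merging UMIs within a small edit distance, would be a further refinement that is not shown here.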
To improve our ability to benchmark analysis methods, we designed and generated a number of scRNA-seq control data sets. Using a mixture design, cells and RNA from three lung adenocarcinoma cell lines were combined in different ratios to create known populations and pseudo-trajectories. Data were generated using the CEL-seq and Chromium 10x protocols. A benchmarking software platform will be developed to facilitate method comparisons using common diagnostics and to make it easier to select optimal analysis methods for different tasks.
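As a rough illustration of how a mixture design yields known populations and pseudo-trajectories, the sketch below enumerates every way of splitting a fixed number of equal parts among three cell lines. The function name `mixture_grid`, the choice of four parts, and the ordering of the lines as A, B and C are assumptions made for the example. They are not the ratios used in the actual experiments.

```python
from itertools import product


def mixture_grid(n_lines=3, parts=4):
    """List all mixtures that split `parts` equal units among `n_lines` lines.

    Each mixture is a tuple of proportions that sums to 1.
    """
    grid = []
    for combo in product(range(parts + 1), repeat=n_lines):
        if sum(combo) == parts:
            grid.append(tuple(c / parts for c in combo))
    return grid


design = mixture_grid()

# A pseudo-trajectory: mixtures without the third line, ordered by the
# increasing share of the second line (pure line A to pure line B).
trajectory = sorted((m for m in design if m[2] == 0), key=lambda m: m[1])
```

Because the composition of every mixture is known in advance, the output of a clustering or trajectory method can be compared against this ground truth.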
The software and data will be made freely available to help researchers process their raw data and to guide the selection and development of better scRNA-seq analysis pipelines.