Integration of several datasets in single-cell RNA-Seq data analysis

Over the last two decades, single-cell RNA-Seq has significantly improved our knowledge of biological systems. The power of single-cell data, and the amount of information that a proper analysis can extract from it, has led to the development of almost 100 different single-cell sequencing methods. Consequently, a large number of papers have been published and many datasets have been publicly released (if you search for single-cell RNA-seq on www.ncbi.nlm.nih.gov, you will get almost 400,000 results!).

The great diversity of both experimental techniques and available datasets has created the need for computational methods that integrate data coming from different methods, labs, or simply different runs, so that results can be compared without being biased by batch effects.

In this tutorial, we provide some examples of how to integrate two different datasets when analyzing single-cell RNA-Seq data, i.e. how to adjust for the batch effect. To do so, we use four different approaches in two programming languages, R and Python. In R, we use ComBat-Seq [1] and the integration pipeline from Seurat [2]; in Python, we use scanorama [3] and BBKNN [4].

Note that ComBat-Seq, the integration pipeline from Seurat, scanorama, and BBKNN are only some examples among several tools that have been developed to achieve the same purpose. For a more general overview of the existing tools, we recommend the review [5].

In general, it is good practice to apply several tools, methods, or algorithms that serve the same purpose and to compare their results.

Please keep in mind that this is not a bioinformatics lecture, but rather a way to get familiar with a very common issue in the analysis of single-cell RNA-Seq datasets. The purpose is to give a simple overview of how to recognize and handle the batch effect with widely used tools that are also accessible to people without a strong bioinformatics background.

Finally, if you are looking for a comprehensive guide to single-cell RNA-Seq data analysis in R with Seurat, from the preprocessing workflow up to clustering and cell type identification, refer to the Seurat Vignettes. If you are instead looking for a comprehensive guide to single-cell RNA-Seq data analysis in Python with scanpy, refer to the scanpy tutorials.

Batch effect in brief

Batch effects are differences between samples that arise from technical or experimental causes rather than from biology. If not adjusted, the batch effect can lead to misleading conclusions when comparing samples coming from different experimental protocols, platforms, or simply from different experimental runs. After batch effect correction, we can be more confident that the differences or similarities observed in the downstream analysis reflect the biological nature of the samples rather than experimental artifacts.
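To make the definition concrete, here is a minimal simulation sketch in Python with numpy. It assumes a toy model that is not part of this tutorial's analysis: both batches share exactly the same expected expression per gene, and the second batch is distorted by a hypothetical gene-specific multiplicative factor. The names `batch_factor` and `n_genes_changed` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 200, 50

# Identical biology in both batches: one expected expression level per gene.
true_mean = rng.uniform(1.0, 10.0, size=n_genes)

# Assumed technical distortion: batch B detects each gene with a different
# efficiency (a hypothetical multiplicative factor between 0.5x and 2x).
batch_factor = rng.uniform(0.5, 2.0, size=n_genes)

batch_a = rng.poisson(true_mean, size=(n_cells, n_genes))
batch_b = rng.poisson(true_mean * batch_factor, size=(n_cells, n_genes))


def n_genes_changed(x, y, threshold=0.5):
    """Count genes whose absolute log2 fold change of mean counts exceeds threshold."""
    log_fc = np.log2((y.mean(axis=0) + 1.0) / (x.mean(axis=0) + 1.0))
    return int(np.sum(np.abs(log_fc) > threshold))


# Control: two halves of the same batch should show (almost) no changed genes.
n_within = n_genes_changed(batch_a[:100], batch_a[100:])

# Naive comparison across batches: purely technical differences look like signal.
n_between = n_genes_changed(batch_a, batch_b)

print(f"genes 'changed' within batch A:      {n_within}")
print(f"genes 'changed' between batch A and B: {n_between}")
```

In this toy setting, every gene flagged between the two batches is a false positive by construction, since the underlying biology is the same.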

How to detect batch effects in Single-Cell RNA-Seq data: visual inspection

The simplest way to check for batch effects in your data is a linear dimensionality reduction, e.g. Principal Component Analysis (PCA). If the leading principal components, which capture most of the variation in your data, separate the cells by batch, there is a batch effect.
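The PCA check can be sketched numerically as follows. This is an illustrative example on simulated data, not the analysis of the tutorial: it assumes a toy matrix of already normalized expression values in which the second batch carries a hypothetical per-gene offset (`shift`), computes PCA through an SVD with numpy, and measures how much of the first principal component is explained by the batch label.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_batch, n_genes = 150, 40

# Toy normalized expression: both batches share the same distribution,
# but batch 1 carries an assumed gene-specific technical offset.
shift = rng.normal(0.0, 1.0, size=n_genes)
x_a = rng.normal(0.0, 1.0, size=(n_per_batch, n_genes))
x_b = rng.normal(0.0, 1.0, size=(n_per_batch, n_genes)) + shift
x = np.vstack([x_a, x_b])
batch = np.repeat([0, 1], n_per_batch)

# PCA via singular value decomposition of the gene-centered matrix.
x_centered = x - x.mean(axis=0)
u, s, _ = np.linalg.svd(x_centered, full_matrices=False)
scores = u * s                        # cell coordinates on each component
explained = s**2 / np.sum(s**2)       # fraction of variance per component

# Squared correlation between PC1 scores and the batch label:
# the share of PC1 variance that the batch alone accounts for.
r2_pc1_batch = np.corrcoef(scores[:, 0], batch)[0, 1] ** 2

print(f"variance explained by PC1: {explained[0]:.2f}")
print(f"R^2 between PC1 and batch: {r2_pc1_batch:.2f}")
```

A value of `r2_pc1_batch` close to 1 means that the main axis of variation is essentially the batch label, which is the situation described above. In a real analysis, the same inspection is usually done by coloring a PCA plot by batch.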

Another, visually simpler way is to apply a non-linear dimensionality reduction, such as t-SNE or UMAP, and to visualize the data in the t-SNE/UMAP space: if cells coming from the same batch tend to cluster together and to be separated from the cells coming from other batches, it is very likely that the experiment is affected by a batch effect. After correcting the batch effect, cells from different batches should instead mix in the t-SNE/UMAP space.

To illustrate this, two pictures from [8] are reported below:

[Figure: t-SNE plot of cells colored by batch, before batch correction]

In this t-SNE, cells are colored by batch: blue and red dots represent two different datasets. It seems there is a batch effect!

[Figure: t-SNE plot of the same cells colored by batch, after batch correction]

Here, the same cells are represented in the t-SNE space after batch correction: red and blue dots no longer cluster separately.
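The visual impression of "separated" versus "mixed" batches can also be turned into a number. The sketch below is only an illustration on simulated data: for each cell it computes the fraction of its k nearest neighbors that come from the same batch. The function name `same_batch_neighbor_fraction` is hypothetical, and the per-batch mean-centering used as a "correction" is a deliberately naive stand-in, not what ComBat-Seq, Seurat, scanorama, or BBKNN actually do.

```python
import numpy as np


def same_batch_neighbor_fraction(x, batch, k=15):
    """Average fraction of each cell's k nearest neighbors that share its batch."""
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                        # a cell is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :k]
    return float(np.mean(batch[neighbors] == batch[:, None]))


rng = np.random.default_rng(2)
n_per_batch, n_genes = 100, 20

# Toy data: same distribution in both batches plus an assumed offset in batch 1.
shift = 3.0 * rng.choice([-1.0, 1.0], size=n_genes)
x = rng.normal(0.0, 1.0, size=(2 * n_per_batch, n_genes))
batch = np.repeat([0, 1], n_per_batch)
x[batch == 1] += shift

# Naive stand-in for a correction: center each batch on its own gene means.
corrected = x.copy()
for b in (0, 1):
    corrected[batch == b] -= corrected[batch == b].mean(axis=0)

before = same_batch_neighbor_fraction(x, batch)
after = same_batch_neighbor_fraction(corrected, batch)

print(f"same-batch neighbor fraction before correction: {before:.2f}")
print(f"same-batch neighbor fraction after correction:  {after:.2f}")
```

With two batches of equal size, a fraction close to 1 indicates that the batches are separated, while a fraction close to 0.5 indicates that they are well mixed. Note that centering each batch would also erase real biological differences between batches, which is one reason dedicated integration methods exist.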

Datasets description

For this work, two datasets were downloaded from the KPMP Kidney Tissue Atlas [7]. More specifically, the healthy reference samples with IDs Samples1157-E02 and PRE062-1 were used for the analysis; both were generated on the 10x Genomics platform.

Click here for the analysis in R and here for the analysis in Python.

References

[1] Yuqing Zhang, Giovanni Parmigiani, W Evan Johnson, ComBat-seq: batch effect adjustment for RNA-seq count data, NAR Genomics and Bioinformatics, Volume 2, Issue 3, 1 September 2020, lqaa078, https://doi.org/10.1093/nargab/lqaa078

[2] Stuart, T. et al. Comprehensive Integration of Single-Cell Data. Cell 177, 1888–1902 (2019). https://doi.org/10.1016/j.cell.2019.05.031

[3] Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat Biotechnol 37, 685–691 (2019). https://doi.org/10.1038/s41587-019-0113-3

[4] Polański K, Young MD, Miao Z, Meyer KB, Teichmann SA, Park JE. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics. 2020 Feb 1;36(3):964-965. doi: 10.1093/bioinformatics/btz625. PMID: 31400197; PMCID: PMC9883685.

[5] Tran, H.T.N., Ang, K.S., Chevrier, M. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 21, 12 (2020). https://doi.org/10.1186/s13059-019-1850-9

[6] https://blog.bioturing.com/2022/03/24/batch-effect-in-single-cell-rna-seq-frequently-asked-questions-and-answers/

[7] https://qa-atlas.kpmp.org/

[8] https://satijalab.org/seurat/articles/sctransform_v2_vignette.html

Feedback, suggestions and questions

If you have any questions or suggestions, or if you want to share your feedback, please write to sergio.sarnataro@ens-lyon.fr