Update: This paper has now been published at Nature Biotech
“Panoramic stitching of heterogeneous single-cell transcriptomic data” by Brian L Hie, Bryan Bryson, Bonnie Berger. https://doi.org/10.1101/371179
I selected this article for review for three reasons:
- I work with single-cell RNA-Sequencing (scRNA-seq) data quite a bit.
- Methods that can integrate two or more scRNA-seq data sets (across experiments, conditions, treatments, etc) are in high demand and are actively being worked on by many groups.
- I was intrigued about the idea of drawing “inspiration from computer vision algorithms for panorama stitching that identify images with … overlapping content, which are then merged into a larger panorama”. It sounded really interesting and I wanted to learn more about it.
Recent efforts to integrate scRNA-Seq data have included the use of Canonical Correlation Analysis (CCA) in Seurat and Mutual Nearest Neighbors (MNN) in scran as an R/Bioconductor package. In this manuscript, Hie et al. propose to extend the MNN approach from integrating two datasets to multiple datasets, which is called Scanorama.
Four reviewers provided comments below on Scanorama. While the reviewers thought many of the benchmark comparisons and analyses were useful, especially the runtime performance of Scanorama over other methods, a consensus emerged that more work was needed from the authors to understand the assumptions and intuition of the method and to accurately assess the performance and significance. For example, one reviewer stated “However, Scanorama’s utility beyond a visualization technique to improve cell-type, cell-state, or transcriptional subpopulation identification remains to be convincingly shown.” A second reviewer described the “manuscript as it stands appears somewhat as a black box. … The authors should discuss the shortcomings of the approach. What are the major assumptions of the approach and are there cases where scanorama would not work so well?” A third reviewer was convinced that the method works, but brought up concerns about the novelty of method over what was already published. The reviewer stated “I think my only concerns about this paper is not whether the method works, because it clearly does, but rather (1) how much of the methodological innovation is new vs. just being a novel application of existing tools, and (2) how well it actually works, as compared to prior work and other batch correction methods for scRNA-seq or gene expression in general.” Addressing these general concerns (and more specific concerns described below) would significantly enhance the manuscript.
I want to thank the authors for sharing their work on bioRxiv before it has undergone peer review. I also want to take an opportunity to thank the reviewers who have donated their time to evaluate the novelty, strengths, and limitations of this work. Three reviewers have chosen to be named. One reviewer was a postdoctoral research fellow, one was research scientist, and and two were a faculty members. The four reviews are available in full below.
Reviewer 1 (Jean Fan)
There is a pressing need for scalable computational techniques that can perform joint clustering across multiple, large-scale single-cell datasets in an efficient manner while taking into consideration potential heterogeneity in cell-type composition across datasets and is robust to potential batch or platform specific differences. Here, Hie et al propose a creative solution called Scanorama that takes inspiration from panoramic photo stitching. Scanorama identifies mutual nearest neighbors across all pairs of datasets to create beautiful joint embeddings across highly disparate datasets while offering substantially improved runtime performance over previously published approaches. However, Scanorama’s utility beyond a visualization technique to improve cell-type, cell-state, or transcriptional subpopulation identification remains to be convincingly shown.
Major comments:
One hope for such joint clustering approaches is to be able to leverage information across many datasets to improve identification of rare cell types. Consider this extreme example: if one sample has only one cell of a particular cell-type, this cell would likely be erroneously clustered into another group or perhaps just disregarded as you can’t have a cell-type that is only 1 cell. However, if I have 100 samples where this cell-type is present as only 1 cell, I should be able to combine across samples to achieve a cluster with 100 cells, 1 cell from each sample. Can the authors simulate an example to assess Scanorama’s ability to resolve such rare subpopulations?
With respect to Supplementary Figure 7: the resulting Scanorama embedding seems barely able to distinguish CD4+ and CD8+ T cells. Are potentially more subtle cell-types like naive, memory, effector, and regulatory T cell states being lost due to the forced merging? After correction/merging, can standard clustering approaches be applied to re-identify CD4+ and CD8+ T cells?
Similarly, with respect to Figure 3A: there is presumably a lot of additional cell subtype heterogeneity within neurons (ex. inhibitory and excitatory neurons) that are not apparent in the Scanorama joint embedding. Is Scanorama biased to preserving large-scale global differences? Are local structures being lost? Does this depend on the degree of differences among the datasets to be aligned ie. aligning datasets with cell-types that are more different will result in more subtle cell-subtype differences being lost? Could this be due to restricting to the top 10,000 most highly variable genes? In particular with respect to the PBMCs in Figure 3A, it’s not clear whether differences between B cells, monocytes, NKs, etc, are still apparent like they are in Supplementary Figure 7.
With respect to Supplementary Figure 3’s statement “When given a collection of six diverse data sets with no overlapping cell types (Jurkat cells, mouse neurons, HSCs, unstimulated macrophages, pancreatic islets, and PBMCs), Scanorama does not merge disparate cell types together. The only matches found by Scanorama were between a small portion of Jurkat cells and PBMCs, which may have some biological basis since Jurkat cells are an immortalized cell line of human T cells.” Are the PBMCs that cluster with Jurkat cells actually T cells? Please show a few T cell markers in Supplementary Figure 3 to confirm. Or alternatively, perform differential expression analysis between the PBMCs that cluster with the Jurkat cells versus the other PBMCs and show that the differentially upregulated genes in the Jurkat-clustered PBMCs are enriched for T cell markers.
Following up on comment 4, are these Jurkat-like PBMCs and Jurkat cells also clustered together in Figure 3A?
The runtime comparison in Figure 3F is very impressive! However, the runtime for Scanorama looks essentially like a constant line at 0 and is not particularly informative with respect to the true scalability of Scanorama. Plotting on a log scale may be more informative. It is not unreasonable to assume that the number of cells will approach the multiples of millions in the next few years, so readers will be interested in getting a sense of the projected runtimes for datasets on such scales.
One potential concern, especially when integrating across multiple platforms or highly disparate cell types, is that lower quality cells with high drop-outs end up being mutual nearest neighbors with biologically transcriptionally quieter cells. Please include supplementary figures showing the joint Scanorama embeddings colored by library size or library complexity to double check this. Please include a discussion of this limitation if it is indeed a limitation.
For the statement “We found that the alignment of macrophage populations between these data sets is directed by genes that are enriched for processes relating to bacterial infection (Fig. 3g, Supplementary Table 2)”, what are the genes? Supplementary Table 2 are the enriched GO terms, but readers will want to know the genes as well.
Minor comments:
How sensitive are the results on the SVD dimensionality reduction step? If I reduce to 10 dimensions vs. 100 vs. 1000, will I get different results?
Similarly, how sensitive are the results to the number of approximate neighbors k considered?
Given that approximate nearest neighbors and other approximations are being used, how stable are the results from run to run?
How much memory is used just loading a 100,000 cell dataset? Supplementary Figure 9 would be more informative if we take into consideration the memory usage from holding the dataset in memory so users can get a sense for how much additional memory is being used for compute.
With respect to Supplementary Figure 10: the medians are different for the silhouette coefficients but is this difference significant? The distributions look rather wide.
The authors should consider submitting the software to something like PyPi so that users will be able to install via pip to make Scanorama more accessible to the scientific community.
Users will greatly benefit from a full walk-through tutorial / vignette on how to use Scanorama starting from, for example, a 10X sparse matrix. The current README and documentation are insufficient.
Reviewer 2 (Kim-Anh Le Cao)
Panoramic stitching of heterogeneous single-cell transcriptomic data Named review: Dr Kim-Anh Le Cao, Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne
Overall comments Hie et al propose a computational method, Scanorama, to combine single cell RNA-seq data sets. Scanorama is based on panorama stitching and generalises mutual nearest neighbours. The approach addresses some of the limitations current integrative methods face, such as the requirement of common cell types in all studies, or their assumptions of the same correlation structure across data sets. When combining 26 scRNA-seq datasets, the authors demonstrate that Scanorama is robust against data cell types and sample size.
Scanorama, I believe, will make an important contribution in the single cell data analysis field. I found some of the benchmark comparisons quite insightful, for example the effect of the sensitivity of the order in which the datasets are combined with scran MNN and Seurat CCA (suppl Fig 1) which could be highlighted even more. The combined computational solutions to deal with those large datasets is also an important contribution (randomised SVD and nearest neighbour search). The manuscript as it stands appears somewhat as a black box. My main comments pertain to the justification of several hyper-parameters set in the approach and their impact in the analyses. Generally, computational tools developed for scRNA-seq data include many hyper-parameters that are set by default by the developers. Those parameters are rarely questioned by the users who may not use those methods optimally as they as not well informed about those. I believe this should be addressed in this current manuscript.
Major comments
The authors should discuss the shortcomings of the approach. What are the major assumptions of the approach and are there cases where scanorama would not work so well? One limitation I see for example is that the analysis is based on the common set of genes across all datasets. This number (5,216 here) could be quite reduced depending on the number of datasets to combine. Are we missing out crucial genes that were dropped? Was this tested by analysing each dataset individually for specific cell types comparisons (see also comment below)?
I struggled to understand the rationale of simulating data based on a 2D Gaussian model and whether such model is reflective of real scRNA-seq data (visually in Supp Fig 2 the patterns seem quite unusual).
I would have liked to see more justification of the choice of parameters (e.g. \delta oversampling parameter = 2; choice of \kappa = 100), the effect and choice of reference in the nearest neighbour search, the underlying justification of the l_2 norm normalisation. Users should be made aware of those parameters and metrics, which presumably have a strong effect in the results we can obtain from scanorama. Additionally, choosing the default parameters in MNN and CCA may not act in favour of those methods. Were other parameters tested for these approaches?
FIgure 3 d and e displays PCA and t-SNE plots to show different panoramas. Single cell data analysts still struggle to understand the different between PCA and t-SNE and I believe guidance in the interpretation should be provided here ? as well as justification of the different types of methods used here. Does this figure suggest that each panorama can be independently extracted for visualisation? Perhaps a clearer diagram of the different steps in the methods would help to fully understand the process in scanorama.
Additional graphical outputs I would suggest in Fig Supp 6 e-g (boxplots of some top marker genes) is to also display their scaled expression in all five individual data sets. We naturally expect smaller p-values when we combine data sets (as the number of cells increases), but a critical question is whether some of the important biological variation removed during the scanorama process. Where key genes identified in a single data set missing during the integration process?
Reviewer 3: (Anonymous Reviewer)
Hie et al. present a computationally efficient and nonlinear method called Scanorama for integration of multiple scRNA-seq data sets. Scanorama stitches different data sets together based on matching of mutual nearest neighbours among them. The methods is similar to a previously proposed method called MNN on several levels including L2 normalisation for scale-invariant comparison, matching of mutual nearest neighbors strategy and smoothing of the batch vectors using a Gaussian kernel.
Unlike MNN which starts with one of the data sets as a reference and expands the reference by successive integration of later datasets, Scanorama first identifies the overlap (by finding mutual nearest neighbors) among all pairs of data sets. It then builds up a panorama by integration of new data sets in decreasing order of the overlap percentages. The mutual nearest neighbors matching of Scanorama is performed in a low dimensional embedding obtained by 100 dimensional SVDs of the combined data sets, which results in improved integration performance and more efficient computations. Scanorama also benefits from efficient algorithms of nearest neighbours search and parallel computing across several cores.
Scanorama’s performance on integration of 26 scRNA-seq data sets representing 9 different technologies and containing a total of 105,476 cells. is quite promising for large scale data sets integration such as in the Human Cell Atlas (HCA) project, where integrating several data sets collected at several laboratories using different technologies is a necessity.
My questions about the method:
- What is a safe number of SVD dimensions to keep? I assume with too few kept dimensions the batch vectors become correlated with biological signals which could impair the performance of matching overlaps.
- What is the interpretation of the corrected expression values? Can they be used in normal procedures of downstream analysis (e.g. differentially expressed genes analysis) similar to normalized read counts?
- How will Scanorama treat a data set which does not have any overlap with any other data sets (any two random data sets will have at least one mutually closest point)? To my understanding the initial identification of MNNs between all pairs of batches addresses the problem partially for lack of overlap with respect to the previous batches in an assumed sequence of batches. However, it is not discussed what will happen in case of no existing overlap at all.
Reviewer 4 (W. Evan Johnson)
This is a very interesting article on a very important but unsolved problem. Single cell RNA-seq (scRNA-seq) data are beginning to be very abundant as researchers are adopting this new technology with great enthusiasm. However, unfortunately, computational and statistical tools for analyzing these data are not keeping pace. There is a great need for more high-quality methods and user-friendly tools for analyzing scRNA-seq data. This article, by Hie and colleagues, addresses one of the more important problems—how to combine data from multiple scRNA-seq sources or studies.
One of the most difficult problems introduced by single cell profiling in heterogeneous tissues is that the cell types for each cell/sample are unknown. This introduces difficulty for batch adjustment, as it may be difficult to link cells of the same type across studies. A batch correction method that fails to make this connection (in unbalanced designs) could lead to biased adjustments or the over-correction of batch, as shown in the authors’ Figure 2(c/d) in the manuscript. Thus, the greatest strength of the authors’ approach is the “nearest neighbors matching” step that first links cells in one dataset to their nearest neighbors in another dataset.
Overall the presented method seems to work extremely well. The authors present multiple use cases on several real data examples. The fact that they can combine 26 datasets across 9 platforms (>100K cells) is impressive and a clear demonstration of the value of their method. Overall, I think this will represent a significant contribution with real impact on how scRNA-seq data are combined in practice.
I think my only concerns about this paper is not whether the method works, because it clearly does, but rather (1) how much of the methodological innovation is new vs. just being a novel application of existing tools, and (2) how well it actually works, as compared to prior work and other batch correction methods for scRNA-seq or gene expression in general. I detail my concerns below:
Figure 1 provides a useful heuristic illustration of the general method and how it works. This is a very helpful description. However, it would be helpful to give some more details of the methods in the actual paper. I found a satisfying amount of detail on the Supplement, but since this is a batch effect methods paper, I would have liked to have seen it in the main manuscript—as this is an important part of the innovation of the paper.
In addition, it is unclear which parts of the method are innovative and new. Is there some new innovation in how they are applying the nearest neighbor algorithm or are they just applying an existing approach? Also, in regard to their panorama merging approach, in the main manuscript they claim “We build upon the nonlinear batch-correction strategy of Haghverdi et al.”, making it sound like they are extending the method. However, in the Supplement they state “Our merging procedure uses the technique of Haghverdi et al.,” thus suggesting that they are only applying an existing method. Which is it? Is the only methodological innovation in paper the combination of neighbor joining and the Haghverdi adjustment? With that in mind, would some other approach for batch correction (say my group’s ComBat method) work well/better in combination with the neighbor finding?
Also, the authors conducted the neighbor joining on projected, reduced dimension data (singular value decomposition). While this seems like a reasonable approach, it would be helpful for the authors to benchmark how this works compared to neighbor joining on the original data (or using other reduced data approaches). That way users could evaluate whether the reduced approach is sufficient, or whether they should be using the original data (likely at a high cost computationally).
Overall, I think the greatest weakness with this manuscript is the lack of adequate comparison with existing methods. They only compare (visually in Figure 2) with the scran MNN and Seurat CCA methods. In their literature review, they point to the methods of Haghverdi and Butler, but do not compare with these methods. In addition, there are many other methods not specifically for scRNA-seq that might perform well in this case. Above I mentioned comparing with my group’s ComBat, but that would only be appropriate for datasets where the cell types are known (or inferred using clustering or their neighbor finding approach, etc) or with experiments with balanced numbers of cells of each type (I’m interested how it breaks down or fails for unbalanced designs). Also, there are other batch effect methods that may handle this problem well, including SVAseq or RUVseq. I think RUV has a variant for cases where there are unknown covariates, which in theory should work well for scRNA-seq for experiments where the cell type is unknown. Overall, it is clear that the authors method works on the example data, but the lack of adequate comparisons with existing methods was disappointing, leaving me to wonder how well it works (especially relative to existing methods).