Practical training

Spatial Transcriptomics Data Analysis

Author: Sergio Sarnataro

Affiliation: Spatial-Cell-ID @ ENS Lyon

In this session, we will analyze four olfactory bulb (OB) samples from adult mice, using data generated through 10x Genomics Visium technology. These samples are part of a study by Chaker et al. (2023).

The key biological question addressed by the authors was how pregnancy impacts neural stem cell (NSC) proliferation and how the neurons generated during pregnancy contribute to olfactory functions that are essential for maternal care. By examining the olfactory bulb, the researchers sought to uncover how different subsets of neurons generated during pregnancy are functionally integrated into the brain circuitry and influence behaviors such as the recognition of pups.

We will walk through the bioinformatics analysis of these olfactory bulb samples, following a series of main steps:

  1. Alignment of FASTQ files using Space Ranger: we will begin by aligning the raw sequencing data to the mouse reference genome, which will allow us to map spatial barcodes and unique molecular identifiers (UMIs) to their corresponding locations in the olfactory bulb tissue.
  2. Preprocessing and filtering: after alignment, we will filter the data to remove low-quality spots and background noise, preparing the gene expression matrix for further analysis.
  3. Downstream analysis using Scanpy/Squidpy: using the Scanpy and Squidpy frameworks, we will perform several key analyses, including data normalization, clustering, and spatial visualization of gene expression. We will explore how gene expression varies across different regions of the olfactory bulb and identify specific cell types or neuronal subpopulations.

The main goal of this practical training is to gain insights into how spatial transcriptomics can be used and to have a better understanding of how to analyze spatial transcriptomics data, from raw sequencing reads to biological interpretation, using powerful computational tools.

Data Alignment with Space Ranger

Space Ranger is the official tool provided by 10x Genomics for aligning Visium 10x spatial transcriptomics data generated by the sequencing. It performs key tasks such as aligning the raw FASTQ files, identifying spatial barcodes, mapping reads to a reference transcriptome, and generating outputs that can be used for downstream analysis.

In this tutorial, instead of performing the alignment ourselves — which is computationally expensive and time-consuming — we will simulate the alignment by reviewing the necessary inputs and explaining how the process works. However, we will directly use the output files generated from a previous alignment, so our downstream analysis can begin from those.

Here is an example command to run Space Ranger on a bash terminal:

spaceranger count --id="periweaning_alignment" \
            --sample=peri_weaning \
            --transcriptome=/home/jovyan/ifbdata/spatial_cell_id/Sergio/References/refdata-gex-mm10-2020-A \
            --fastqs=/home/jovyan/ifbdata/spatial_cell_id/Sergio/Zayna_data/fastqs/peri_weaning \
            --colorizedimage=/home/jovyan/ifbdata/spatial_cell_id/Sergio/Zayna_data/tiffs/D2OBmother30.tif \
            --loupe-alignment=/home/jovyan/ifbdata/spatial_cell_id/Sergio/Zayna_data/json_files_from_manual_alignment/V10S29-127-D1_PeriWeaning.json \
            --slide=V10S29-127 \
            --area=D1 \
            --localcores=16 \
            --localmem=128 \
            --create-bam=false

Let’s break down the components of this command and explain what each input parameter represents:

  • --id: This specifies a unique identifier for the output folder that Space Ranger will create for this particular analysis. In this case, the output folder will be named periweaning_alignment. All the result files will be stored in this folder.

  • --sample: The name of the sample being analyzed. Here, the sample is referred to as peri_weaning.

  • --transcriptome: The path to the reference transcriptome used for alignment. In this example, it is a reference for the mouse genome (mm10), and it includes information about all possible gene transcripts to which the RNA reads will be aligned. Space Ranger uses this reference to map RNA sequences to specific genes.

  • --fastqs: The directory containing the FASTQ files, which are the raw sequencing reads from the Visium experiment. These FASTQ files contain both the RNA sequences (in Read_2) and the spatial barcodes (in Read_1) that Space Ranger will process to assign RNA sequences to their spatial coordinates in the tissue.

  • --colorizedimage: The image of the tissue section, which is typically generated during the Visium experiment. This file provides the spatial context by mapping the barcodes to physical locations on the tissue. Space Ranger will align the sequencing data to this image to assign gene expression to specific spots on the tissue.

  • --loupe-alignment: This is a JSON file generated by manual alignment in the Loupe Browser software. It contains information about the alignment of the tissue image to the array and ensures that the data corresponds accurately to the tissue section.

  • --slide: The slide identifier for the Visium experiment. Each Visium slide has a unique ID, which is important for matching the sample data to the correct slide.

  • --area: The capture area of the slide being analyzed. Visium slides are divided into distinct capture areas, labeled by letters and numbers (e.g., A1, B1, C1, D1). This tells Space Ranger which part of the slide the sample corresponds to.

  • --localcores: Specifies the number of CPU cores to use for parallel processing. In this case, Space Ranger will use 16 cores to speed up the analysis (if available!).

  • --localmem: Defines the amount of memory (RAM) available for the analysis. In this case, Space Ranger will allocate 128 GB of RAM (if available!).

  • --create-bam: Setting this option to false disables the creation of the BAM file, a format that stores aligned sequence reads. Since we are focusing on downstream analysis, the BAM file is not necessary for our purposes, and skipping it reduces computational time and storage space.
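Before moving on, it can be handy to sanity-check that an alignment run actually produced the files we will need downstream. A minimal sketch (the file names follow recent Space Ranger releases; the exact layout of the outs/ folder can vary between versions):

```python
from pathlib import Path

# Key Space Ranger outputs we rely on downstream (names as in recent
# Space Ranger releases; the exact layout may differ between versions).
EXPECTED_OUTS = [
    "filtered_feature_bc_matrix.h5",   # spot x gene count matrix
    "spatial/tissue_positions.csv",    # spot barcodes -> pixel coordinates
    "spatial/tissue_hires_image.png",  # downsampled tissue image
    "spatial/scalefactors_json.json",  # image/spot scaling factors
    "web_summary.html",                # interactive QC report
]

def missing_outputs(outs_dir):
    """Return the expected output files that are absent from outs_dir."""
    outs = Path(outs_dir)
    return [f for f in EXPECTED_OUTS if not (outs / f).exists()]
```

For example, `missing_outputs("periweaning_alignment/outs")` should return an empty list for a successful run.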

Overview of the downstream analysis with Squidpy/Scanpy

In this chapter, we will dive into the downstream analysis of the olfactory bulb (OB) samples using the Squidpy and Scanpy frameworks. These Python-based tools are widely used for spatial and single-cell transcriptomic data analysis, and they will allow us to explore the spatial gene expression patterns and cellular diversity within the tissue sections.

What We Are Going to Do

The main goal of this chapter is to take the pre-processed data from the Space Ranger alignment and perform various analyses to extract meaningful biological insights. Here’s an outline of the steps we’ll follow:

  1. Qualitative Quality Control (QC):
    • We will first assess the quality of the data to ensure that the spatial gene expression data is robust. This involves visualizing metrics such as gene count, UMI count, and mitochondrial gene content across the samples.
  2. Concatenating Datasets:
    • Since we are working with multiple OB samples from different conditions (i.e., virgin, perinatal and periweaning), we will concatenate the datasets to create a unified view of all the samples.
  3. Checking for Batch Effects:
    • We will investigate whether there are any batch effects present in the combined data (differences between samples that are technical rather than biological). This is a common issue when working with multiple samples from different experiments.
  4. Removing Batch Effects (BBKNN):
    • To ensure that biological differences are the main drivers of variation in the data, we will remove batch effects using the BBKNN (Batch Balanced K-Nearest Neighbors) method. This will help align the datasets while preserving biological variability.
  5. Normalization:
    • After batch correction, we will normalize the data to ensure that gene expression levels are comparable across spots, allowing us to perform meaningful downstream analyses.
  6. Dimensionality Reduction:
    • We will apply techniques such as PCA (Principal Component Analysis) and UMAP (Uniform Manifold Approximation and Projection) to reduce the dimensionality of the data and visualize its structure in a lower-dimensional space.
  7. Clustering:
    • Using the reduced dimensions, we will group the spatial spots into clusters, which correspond to different cellular populations or regions in the tissue.
  8. Cell Type Assignment:
    • For this analysis, the cell types in the olfactory bulb have already been manually annotated by the authors of the paper. In practice, cell type assignment can be performed using automatic methods based on reference datasets, marker genes, or unsupervised clustering. We can discuss this during the session.
  9. Visualizing Spatial Distribution of Gene Expression:
    • One of the key strengths of spatial transcriptomics is the ability to visualize how specific genes are expressed across different regions of the tissue. We will explore this by plotting the spatial distribution of key marker genes and clusters.
  10. Many other potential analyses can be done and discussed together! This is probably the most important point: spatial transcriptomics data analysis is not a standard procedure, and there is no established, universal pipeline, even though some main steps of the analysis must always be performed. We can ask questions and investigate deeper in some directions, guided by the data and our curiosity, while keeping in mind the initial biological questions (and we may find other biological questions to address during the analysis, highlighted by the data itself).

A Challenge for the Students

After we remove the batch effects from the dataset, the remaining analysis steps will not be explicitly shown in this tutorial. Instead, you will have the opportunity to explore and discover the analysis steps on your own by:

  • Consulting Squidpy and Scanpy references.
  • Reviewing online tutorials and documentation.
  • Collaborating with your peers.

This is designed to simulate a real-world analysis process, where researchers need to adapt tools and methods to their specific datasets. However, I will be available to supervise and assist you at any point during the process, so feel free to ask for help when needed.

Let’s get started with the analysis

Import the needed libraries

First, we need to import the libraries we will use for our analysis.

import scanpy as sc
import squidpy as sq
import numpy as np
import pandas as pd
import sys
import os
from matplotlib import pyplot as plt
import seaborn as sns

Load the data

Then, we read the 4 samples separately, and visualize the tissue image and the expression level of the spots over the tissue.


#virgin 1
adata1_path = '/path/to/virgin1_alignment/outs/'
adata1 = sq.read.visium(adata1_path, library_id=None, load_images=True, source_image_path=None)
sc.pp.calculate_qc_metrics(adata1, inplace=True)
sc.pl.spatial(adata1)
sc.pl.spatial(adata1, color="log1p_n_genes_by_counts")

#perinatal
adata2_path = '/path/to/perinatal_alignment/outs/'
adata2 = sq.read.visium(adata2_path, library_id=None, load_images=True, source_image_path=None)
sc.pp.calculate_qc_metrics(adata2, inplace=True)
sc.pl.spatial(adata2)
sc.pl.spatial(adata2, color="log1p_n_genes_by_counts")

#virgin2
adata3_path = '/path/to/virgin2_alignment/outs'
adata3 = sq.read.visium(adata3_path, library_id=None, load_images=True, source_image_path=None)
sc.pp.calculate_qc_metrics(adata3, inplace=True)
sc.pl.spatial(adata3)
sc.pl.spatial(adata3, color="log1p_n_genes_by_counts")

#periweaning
adata4_path = '/home/jovyan/ifbdata/spatial_cell_id/Sergio/Zayna_data/periweaning_alignment/outs/'
adata4 = sq.read.visium(adata4_path, library_id=None, load_images=True, source_image_path=None)
sc.pp.calculate_qc_metrics(adata4, inplace=True)
sc.pl.spatial(adata4)
sc.pl.spatial(adata4, color="log1p_n_genes_by_counts")

In spatial transcriptomics and single-cell RNA sequencing, certain gene names can be duplicated in the dataset, either due to multiple entries for the same gene or inconsistencies in the reference annotations. This can create problems in downstream analysis because Scanpy and other bioinformatics tools expect unique identifiers for each gene. There is a command to ensure unique gene names:

adata1.var_names_make_unique()
adata2.var_names_make_unique()
adata3.var_names_make_unique()
adata4.var_names_make_unique()
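Under the hood, var_names_make_unique keeps the first occurrence of a duplicated name and appends a numeric suffix to later ones. A rough pure-Python mimic of that behavior (for illustration only; this is not anndata's actual implementation):

```python
def make_unique(names, join="-"):
    """Mimic AnnData.var_names_make_unique: keep the first occurrence of a
    duplicated name as-is and suffix later occurrences with -1, -2, ..."""
    seen = {}
    out = []
    for name in names:
        if name not in seen:
            seen[name] = 0
            out.append(name)
        else:
            seen[name] += 1
            out.append(f"{name}{join}{seen[name]}")
    return out
```

For example, `make_unique(["Ptgds", "Ptgds", "Snap25"])` returns `["Ptgds", "Ptgds-1", "Snap25"]`.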

Merge the samples

We will now concatenate the four samples. First, we will add a metadata column to each of them, so that we can recover the corresponding spots when needed.

adata1.obs['sample'] = 'virgin1'
adata2.obs['sample'] = 'perinatal'
adata3.obs['sample'] = 'virgin2'
adata4.obs['sample'] = 'periweaning'

Then, we merge the samples:

adata = adata1.concatenate(adata2, adata3, adata4)

We have now an AnnData object, that we called adata, containing the four datasets.
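Note that concatenate keeps spots distinguishable by appending a batch index to each barcode ("-0" for the first sample, "-1" for the second, and so on), in addition to the batch column it creates in .obs. A small sketch of this barcode handling (a mimic of the default behavior, not anndata's own code):

```python
def concat_barcodes(samples):
    """Mimic AnnData.concatenate's default barcode handling: each barcode
    gets '-<batch index>' appended so spots stay unique after merging."""
    merged = []
    for batch_idx, barcodes in enumerate(samples):
        merged.extend(f"{bc}-{batch_idx}" for bc in barcodes)
    return merged
```

So a Visium barcode like "AAACAA-1" present in two samples becomes "AAACAA-1-0" and "AAACAA-1-1" in the merged object.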

It is a very good idea to save our object from time to time:

filename = '/path/to/save/concatenated_rawdata.h5ad'
adata.write(filename)

Let’s now perform quality control on the data by checking for mitochondrial and ribosomal gene expression. It is standard practice to remove cells in which mitochondrial genes account for more than 15-20% of total expression, as these cells are likely dying. In fact, apoptotic cells tend to have significantly higher mitochondrial content compared to healthy cells.

Additionally, in some cases, cells with a high proportion of ribosomal gene expression are also removed. However, this does not apply to our dataset.

adata.var['ribo'] = adata.var_names.str.startswith(("Rps","Rpl"))
adata.var['mt'] = adata.var_names.str.startswith("mt-")

sc.pp.calculate_qc_metrics(adata, qc_vars=['ribo', 'mt'], percent_top=None, log1p=False, inplace=True) 
sc.settings.set_figure_params(dpi=100, facecolor='white')
sc.pl.highest_expr_genes(adata, n_top=20)

Let’s visualize these metrics per sample:

sc.settings.set_figure_params(dpi = 100, facecolor='white')
sc.pl.violin(adata, ['pct_counts_mt'], jitter=0.4, groupby = 'sample', rotation= 45)
adata = adata[adata.obs['pct_counts_mt'] < 15, :]
print("Remaining cells %d"%adata.n_obs) 

Moreover, it is always a good idea to check the number of detected genes and the total counts expressed per spot:

sc.settings.set_figure_params(dpi = 200, facecolor='white')
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts'], jitter=0.4, groupby = 'sample', rotation= 45)

We can remove cells showing a very high or very low number of expressed genes compared to the others, as well as cells showing a particularly high total read count. This is because such spots may not capture real cells (only fragments) or may contain doublets.

adata = adata[adata.obs['n_genes_by_counts'] > 600, :]
adata = adata[adata.obs['n_genes_by_counts'] < 8000, :]
adata = adata[adata.obs['total_counts'] < 30000, :]

More generally, we may want to remove cells that express fewer than a certain number of genes (for example, 200), as they may not contribute significantly to the analysis. Similarly, we might exclude genes that are expressed in only a small number of cells (for example, fewer than 3).

sc.pp.filter_cells(adata, min_genes=200) 
sc.pp.filter_genes(adata, min_cells=3)

Let’s visualize again some of the metrics after our custom filtering of cells and genes.

sc.settings.set_figure_params(dpi = 300, facecolor='white')
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)
sc.settings.set_figure_params(dpi = 100, facecolor='white')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')

Data normalization and scaling

A crucial aspect of every omics data analysis is data normalization. We normalize the data to ensure that differences in gene expression are not simply due to technical variability, such as differences in sequencing depth or cell capture efficiency. In other words, normalization makes the data more comparable across cells and samples.

Here, we use total-count normalization followed by log transformation. First, we normalize the total expression in each cell to a target sum of 10,000 using sc.pp.normalize_total. This adjusts for differences in sequencing depth across cells. Then, we apply a log transformation to compress the range of expression values and stabilize variance, making the data more suitable for downstream analysis.

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
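To make explicit what these two calls do, here is a toy numpy reproduction of the same arithmetic (illustrative only; scanpy performs these operations on the AnnData matrix in place):

```python
import numpy as np

# Toy count matrix: 3 spots x 4 genes, with very different sequencing depths.
counts = np.array([[10.,  0., 30., 60.],
                   [ 1.,  1.,  1.,  1.],
                   [ 0., 50., 50.,  0.]])

# What sc.pp.normalize_total(target_sum=1e4) does, in essence:
per_spot_total = counts.sum(axis=1, keepdims=True)
normalized = counts / per_spot_total * 1e4
# each spot now sums to exactly 10,000, so depth differences are gone

# What sc.pp.log1p does: log(1 + x), compressing the dynamic range.
logged = np.log1p(normalized)
```

After these two steps, a gene with 6,000 normalized counts and one with 1,000 differ by less than a factor of two on the log scale, rather than six-fold.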

We then apply two additional preprocessing steps:

  1. Identification of highly variable genes: we identify the genes that show the most variation across cells. By setting min_disp=0.5, we filter for genes with a minimum dispersion threshold, ensuring that only genes with significant variability are retained. This step helps focus the analysis on the most informative genes, reducing noise from low-variability genes.
  2. Scaling with a maximum value: after selecting highly variable genes, we scale the data with max_value=10 as parameter. This caps the scaled values at 10, preventing extreme outliers from dominating the analysis.

See below for the code:

sc.pp.highly_variable_genes(adata, min_disp=0.5) 
sc.pp.scale(adata, max_value=10) 

Dimensionality reduction and check for the batch effect

At this stage, we reduce the complexity of the dataset by performing a dimensionality reduction technique. This step helps to compress the data into a smaller number of dimensions (called principal components), while retaining as much of the variation in the data as possible. Afterward, we generate a visualization that projects the cells based on these components, where we color the points based on their group or sample origin.

sc.tl.pca(adata, svd_solver='arpack')
sc.settings.set_figure_params(dpi = 100, facecolor='white')
sc.pl.pca(adata, color='sample')

Now, we construct a graph that connects cells based on their similarity to their nearest neighbors. This allows us to visualize relationships between cells in a lower-dimensional space, providing insights into potential groupings or clusters. We project this graph using a specific visualization technique that spreads out the data, making it easier to distinguish clusters, and color the visualization by sample origin.

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)

sc.settings.set_figure_params(dpi = 200, facecolor='white')
sc.pl.umap(adata, color='sample', title = "Pre batch correction")

Correct for the batch effect

In this step, we address technical variation introduced during different experimental runs (referred to as batch effects). To ensure these variations don’t obscure the true biological signals, we identify genes that are highly variable across multiple batches.

sc.pp.highly_variable_genes(adata, batch_key = 'sample') 
var_select = adata.var.highly_variable_nbatches > 2
var_genes = var_select.index[var_select]

After identifying the highly variable genes across different batches, we apply batch correction using a method called BBKNN (Batch Balanced KNN). The goal of BBKNN is to adjust for technical variation that can occur when cells are processed in different batches.

Here’s how BBKNN works:

  • K-Nearest Neighbors (KNN) is typically used to find cells that are similar to each other based on their gene expression profiles.
  • BBKNN modifies this process to ensure that the nearest neighbors for each cell are balanced across different batches. This means that when BBKNN constructs a neighborhood graph, it selects an equal number of neighboring cells from each batch, rather than just grouping cells from the same batch. By doing this, BBKNN corrects for batch effects by making sure cells from different batches are well-integrated into the graph. This helps remove batch-specific biases while preserving the biological signal.
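To make the "batch balanced" idea concrete, here is a toy numpy sketch of the neighbor-selection step (the real bbknn package also uses approximate nearest-neighbour search and builds a connectivity graph; this only illustrates the per-batch selection):

```python
import numpy as np

def balanced_neighbors(X, batches, k_per_batch=3):
    """Toy sketch of BBKNN's core idea: for every cell, pick its k nearest
    neighbours WITHIN EACH BATCH separately, instead of k global neighbours."""
    batches = np.asarray(batches)
    neighbors = []
    for i in range(X.shape[0]):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf  # a cell is not its own neighbour
        picked = []
        for b in np.unique(batches):
            idx = np.where(batches == b)[0]
            order = idx[np.argsort(dists[idx])]
            picked.extend(order[:k_per_batch].tolist())
        neighbors.append(sorted(picked))
    return neighbors
```

Because every cell draws the same number of neighbours from each batch, cells from different batches are forced to mix in the resulting graph, which is what removes the batch-specific structure in the UMAP.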

Once the batch effect is corrected, we project the corrected data into a lower-dimensional space again (using UMAP) to visually assess the results. The final plot shows a cleaner separation of cell populations, with the batch effects minimized, allowing for a more accurate comparison across samples.

sc.external.pp.bbknn(adata, batch_key='sample', n_pcs=30)  # running bbknn 1.3.6
sc.tl.umap(adata)

sc.settings.set_figure_params(dpi = 200, facecolor='white')
sc.pl.umap(adata, color="sample", title="BBKNN Corrected umap")

Clustering

sc.tl.leiden(
    adata, key_added="clusters", directed=False, n_iterations=2
)
plt.rcParams["figure.figsize"] = (4, 4)
sc.pl.umap(adata, color=["total_counts", "n_genes_by_counts", "clusters"], wspace=0.4)
sc.tl.leiden(
    adata, key_added="clusters_res0.8", directed=False, n_iterations=2, resolution=0.8)

plt.rcParams["figure.figsize"] = (10, 10)
sc.pl.umap(adata, color=["clusters_res0.8","sample"], wspace=0.1, size = 50)

Find marker genes

adata.uns['log1p']['base'] = None  # workaround for a scanpy KeyError when the log1p base is missing
sc.tl.rank_genes_groups(adata, groupby='sample', method='wilcoxon', key_added='all_vs_all')
sc.settings.set_figure_params(dpi=200)
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False, key='all_vs_all', ncols=2)
sc.tl.rank_genes_groups(adata, groupby='clusters_res0.8', method='t-test', key_added='clusters_res0.8_DGE')
sc.settings.set_figure_params(dpi=200)
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False, key='clusters_res0.8_DGE', ncols=4)

Visualize the clusters on the different tissues

# Subset the data for each sample (.copy() avoids modifying the .uns of a view)
adata_virgin1 = adata[adata.obs['sample'] == 'virgin1', :].copy()
adata_virgin1.uns['spatial'] = adata1.uns['spatial']

adata_perinatal = adata[adata.obs['sample'] == 'perinatal', :].copy()
adata_perinatal.uns['spatial'] = adata2.uns['spatial']

adata_virgin2 = adata[adata.obs['sample'] == 'virgin2', :].copy()
adata_virgin2.uns['spatial'] = adata3.uns['spatial']

adata_periweaning = adata[adata.obs['sample'] == 'periweaning', :].copy()
adata_periweaning.uns['spatial'] = adata4.uns['spatial']
sq.pl.spatial_scatter(adata_virgin1, color = 'clusters_res0.8', title = 'virgin1')
sq.pl.spatial_scatter(adata_virgin2, color = 'clusters_res0.8', title = 'virgin2')
sq.pl.spatial_scatter(adata_perinatal, color = 'clusters_res0.8', title = 'perinatal')
sq.pl.spatial_scatter(adata_periweaning, color = 'clusters_res0.8', title = 'periweaning')
sc.pl.umap(adata, color=["clusters_res0.8","sample"], wspace=0.1, size = 50)
sc.pl.umap(adata, color="clusters_res0.8", wspace=0.1, size = 30, legend_loc='on data')
sc.pl.umap(adata, color="sample", wspace=0.1, size = 30)

Visualize the clusters on subsets of each sample

# plt.rcParams["figure.figsize"] = (20, 20)
sc.pl.umap(adata_virgin1, color=["clusters_res0.8","sample"], wspace=0.1, size = 30, title = 'virgin1')
sc.pl.umap(adata_virgin2, color=["clusters_res0.8","sample"], wspace=0.1, size = 30, title = 'virgin2')
sc.pl.umap(adata_perinatal, color=["clusters_res0.8","sample"], wspace=0.1, size = 30, title = 'perinatal')
sc.pl.umap(adata_periweaning, color=["clusters_res0.8","sample"], wspace=0.1, size = 30, title = 'periweaning')
# Subset again the data for each sample, now that we also have the cell_type metadata
adata_virgin1 = adata[adata.obs['sample'] == 'virgin1', :].copy()
adata_virgin1.uns['spatial'] = adata1.uns['spatial']

adata_perinatal = adata[adata.obs['sample'] == 'perinatal', :].copy()
adata_perinatal.uns['spatial'] = adata2.uns['spatial']

adata_virgin2 = adata[adata.obs['sample'] == 'virgin2', :].copy()
adata_virgin2.uns['spatial'] = adata3.uns['spatial']

adata_periweaning = adata[adata.obs['sample'] == 'periweaning', :].copy()
adata_periweaning.uns['spatial'] = adata4.uns['spatial']
plt.rcParams["figure.figsize"] = (20, 20)
sq.pl.spatial_scatter(adata_virgin1, color = 'cell_type', title = 'virgin1')
sq.pl.spatial_scatter(adata_virgin2, color = 'cell_type', title = 'virgin2')
sq.pl.spatial_scatter(adata_perinatal, color = 'cell_type', title = 'perinatal')
sq.pl.spatial_scatter(adata_periweaning, color = 'cell_type', title = 'periweaning')

What now?

Feel free to explore other aspects of the data that interest you, such as differential expression, pseudotime analysis, or any other relevant techniques that come to mind.

You can, for example, refer to relevant analysis workflows available on the web, check out the Scanpy or Squidpy documentation for guidance on specific functions and approaches, collaborate and discuss with your peers… just as you would in a real-world research setting!

Think of this as a true, exploratory analysis—there are no fixed answers, and you have the freedom to interpret and investigate the data in meaningful ways.