Stereo-seq Mouse Embryonic Development Timecourse
Key features:
- Apply Gaussian smoothing to address data noise and sparsity across datasets
- Construct a gene co-expression network across samples using smoothed datasets
- Group genes into multi-sample spatial modules with hard and soft thresholding on the co-expression network
- Identify stable vs. dynamic gene expression distributions between dataset pairs using second-order correlations
import anndata as ad
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
import seaborn as sns
import smoothie
smoothie.suppress_warnings() # Suppress a few warnings for cleaner output
Load Multiple Datasets¶
Smoothie requires input data in the AnnData format. A guide to constructing the AnnData data structure from both imaging-based and sequencing-based spatial transcriptomics outputs can be found in /guides/convert_to_anndata.md.
# Bash code to generate AnnData file from STOmics bin1 output: https://db.cngb.org/stomics/mosta/download/
# python -m smoothie.convert_to_anndata /path/to/E16.5_E1S1_GEM_bin1.tsv.gz \
# --x-col x \
# --y-col y \
# --gene-col geneID \
# --count-col MIDCounts \
# --format tsv.gz \
# --chunksize 1000000 \
# --output mouse_E16.5_E1S1_raw_dnb.h5ad
# Load your spatial transcriptomics datasets
E13_adata = ad.read_h5ad("/path/to/mouse_E13.5_E1S1_raw_dnb.h5ad")
E14_adata = ad.read_h5ad("/path/to/mouse_E14.5_E1S1_raw_dnb.h5ad")
E15_adata = ad.read_h5ad("/path/to/mouse_E15.5_E1S1_raw_dnb.h5ad")
E16_adata = ad.read_h5ad("/path/to/mouse_E16.5_E1S1_raw_dnb.h5ad")
# Create list of datasets and assign dataset names
adata_list = [E13_adata, E14_adata, E15_adata, E16_adata]
dataset_names = ["E13.5", "E14.5", "E15.5", "E16.5"]
Quality Control¶
The table below shows suggested QC values given the resolution of your AnnData spatial coordinates.
- Note: If you have cell-segmented imaging data, use the "Cell-sized" QC settings. If you have true individual transcripts or other fine-resolution data, use the Fine / Imaging QC settings.
| Resolution | min_counts (filter_cells) | min_counts (filter_genes) | min_cells (filter_genes) |
|---|---|---|---|
| "Cell-sized" (8-10 µm) | 10 - 100 | 100 - 500 | 10 - 100 |
| Binned (20-50 µm) | 50 - 500 | 100 - 500 | 10 - 100 |
| Fine / Imaging (< 1-2 µm) | 1 | 100 - 500 | 10 - 100 |
# QC filtering for each dataset
for i in range(len(adata_list)):
    # Filter spots with low UMI counts
    sc.pp.filter_cells(adata_list[i], min_counts=1)  # Only remove DNA nanoballs with 0 UMIs!
    # Filter genes with low counts/unique spots
    sc.pp.filter_genes(adata_list[i], min_counts=100)
Normalization¶
Normalization depends on the resolution of your AnnData spatial coordinates.
- Note: If you have cell-segmented imaging data, use the "Cell-sized" normalization method. If you have true individual transcripts or other fine-resolution data, use the Fine / Imaging normalization method.
| Resolution | Normalization Method |
|---|---|
| "Cell sized" or Binned Data (8-50 micron resolution) | CPT + Log1p normalization |
| Fine / Imaging (<1-2 micron resolution) | Log1p normalization only |
# Normalization for each dataset
for i in range(len(adata_list)):
    # # CHOOSE: CPT + Log1p normalization (8-50 micron resolution data)
    # sc.pp.normalize_total(adata_list[i], target_sum=1e3)
    # sc.pp.log1p(adata_list[i])

    # CHOOSE: Log1p normalization only (<1-2 micron resolution data)
    sc.pp.log1p(adata_list[i])
Step 1: Run Gaussian Smoothing¶
Here, we run Gaussian smoothing on the spatial datasets to address data sparsity and noise. Key parameters for the function smoothie.run_parallelized_smoothing:
gaussian_sd: float
- The standard deviation for the Gaussian distribution, which determines smoothing degree. Generally across high-resolution S.T. platforms, a gaussian_sd corresponding to 20-30 microns is ideal.
- gaussian_sd is calculated as target_microns * micron_to_unit_conversion
- For Slide-seq, micron_to_unit_conversion = 1.5456 (1 micron equals 1.5456 Slide-seq spatial units)
- For Stereo-seq, micron_to_unit_conversion = 2 (1 micron equals 2 Stereo-seq spatial units)
- A table of micron_to_unit_conversion values for spatial transcriptomics platforms is included in /guides/micron_to_unit_conversion_table.md
grid_based_or_not: boolean
- True = grid-based smoothing: smooth only at imposed hexagonal grid points. Good for subcellular to "cell-sized" resolution data (<1-10 micron).
- False = in-place smoothing: smooth at every spatial location in the dataset. Good for "cell-sized" to binned resolution data (10-50 micron).
min_spots_under_gaussian: int
- Minimum required number of data points within a radius of (3 * gaussian_sd) for smoothing to occur at a given location.
- Typical values are 10-100. Raise the value if low-density regions around the tissue perimeter appear noisy in modules. Lower the value if you have a large spot size (e.g., original Visium 100-micron resolution data or binned 20-50 micron resolution data).
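Conceptually, grid-based Gaussian smoothing replaces the expression at each query location with a distance-weighted average of nearby spots. The following minimal numpy sketch (an illustration of the idea, not Smoothie's actual implementation) shows this for one gene at one query point, including the min_spots_under_gaussian density check:

```python
import numpy as np

def smooth_at_point(query_xy, coords, values, gaussian_sd, min_spots_under_gaussian=100):
    """Gaussian-weighted average of one gene's values around query_xy.

    coords: (n_spots, 2) spatial coordinates; values: (n_spots,) counts for one gene.
    Returns np.nan when fewer than min_spots_under_gaussian spots lie within
    3 * gaussian_sd of the query location (the low-density skip rule).
    """
    dists = np.linalg.norm(coords - query_xy, axis=1)
    in_radius = dists <= 3 * gaussian_sd
    if in_radius.sum() < min_spots_under_gaussian:
        return np.nan  # too few neighbors here; skip this location
    weights = np.exp(-0.5 * (dists[in_radius] / gaussian_sd) ** 2)
    return float(np.sum(weights * values[in_radius]) / np.sum(weights))

# Toy data: 200 spots with a noisy left-to-right expression gradient
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(200, 2))
values = coords[:, 0] + rng.normal(0, 5, size=200)

print(smooth_at_point(np.array([50.0, 50.0]), coords, values, gaussian_sd=40))
```

Smoothie evaluates this kind of weighted average at every grid point (grid-based) or every spot (in-place), in parallel chunks.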
# Determine gaussian_sd ---------------------------------------------------
target_microns = 20 # 20-30 microns is ideal for smoothing
micron_to_unit_conversion = 2 # IMPORTANT! For Stereo-seq, this value is 2.
# Select other parameters --------------------------------------------------
grid_based_or_not = True # Select True or False
min_spots_under_gaussian = 100 # Adjust as needed
# -------------------------------------------------------------------------
sm_adata_list = []
for i in range(len(adata_list)):
    print(f'Smoothing dataset: {dataset_names[i]}')
    sm_adata = smoothie.run_parallelized_smoothing(
        adata_list[i],
        grid_based_or_not=grid_based_or_not,
        gaussian_sd=target_microns * micron_to_unit_conversion,
        min_spots_under_gaussian=min_spots_under_gaussian
    )
    sm_adata_list.append(sm_adata)
Smoothing dataset: E13.5
Checking dataset point density... Fitting grid to tissue coordinates...
Variation in X-direction: 18399, Y-direction: 14799. Number of points in fitted grid: 138467
Gaussian smoothing will run in 36 chunks. Total runtime for grid-based Gaussian smoothing: 696.46 seconds.

Smoothing dataset: E14.5
Checking dataset point density... Fitting grid to tissue coordinates...
Variation in X-direction: 22749, Y-direction: 15649. Number of points in fitted grid: 183814
Gaussian smoothing will run in 36 chunks. Total runtime for grid-based Gaussian smoothing: 859.55 seconds.

Smoothing dataset: E15.5
Checking dataset point density... Fitting grid to tissue coordinates...
Variation in X-direction: 23449, Y-direction: 15099. Number of points in fitted grid: 203215
Gaussian smoothing will run in 36 chunks. Total runtime for grid-based Gaussian smoothing: 825.06 seconds.

Smoothing dataset: E16.5
Checking dataset point density... Fitting grid to tissue coordinates...
Variation in X-direction: 25549, Y-direction: 17467. Number of points in fitted grid: 218394
Gaussian smoothing will run in 36 chunks. Total runtime for grid-based Gaussian smoothing: 875.32 seconds.
Step 2: Measure and Cluster Gene Co-expression Network Across Samples¶
To measure multi-dataset spatial gene correlations, we concatenate the smoothed spatial datasets into one large count matrix. We then compute the pairwise Pearson correlation coefficient between all multi-dataset smoothed gene surface pairs.
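As a rough numpy illustration of this step (Smoothie's own smoothie.concatenate_smoothed_matrices and smoothie.compute_correlation_matrix additionally handle gene-name alignment and memory management), stacking spots from aligned smoothed matrices and correlating the gene columns looks like:

```python
import numpy as np

# Two toy smoothed matrices (spots x genes) with gene columns already aligned
rng = np.random.default_rng(1)
X1 = rng.random((100, 3))
X2 = rng.random((80, 3))
X2[:, 1] = X2[:, 0]  # in dataset 2, gene 1 tracks gene 0 spatially

# Stack spots from all datasets into one matrix, then correlate gene columns
X_concat = np.vstack([X1, X2])           # shape (180, 3)
pearsonR_mat = np.corrcoef(X_concat.T)   # (3, 3) gene-gene PCC matrix
print(np.round(pearsonR_mat, 2))
```

Because spots from all datasets contribute to each pairwise PCC, a high correlation requires genes to co-vary spatially across the whole timecourse, not just in one sample.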
Next, we evaluate optimal parameter choices for network construction and clustering. The function smoothie.select_clustering_params calculates network clustering statistics across different choices for pcc_cutoff and clustering_power.
Choose values for pcc_cutoff and clustering_power that balance the following goals:
- High gene margin scores for high gene module confidence
- Stable number of modules (n_clusters) compared to other pcc_cutoff values
- High gene recall for more gene-rich analyses
Refer to /docs/tutorials for examples on selecting these parameters.
# Create a concatenated smoothed count matrix X_concat, with gene names for each column
sm_adata_X_concat, gene_names_concat = smoothie.concatenate_smoothed_matrices(sm_adata_list)
# Compute Pearson correlation matrix (This step can take a while... depending on sm_adata_X_concat size)
pearsonR_mat_concat, _ = smoothie.compute_correlation_matrix(sm_adata_X_concat)
concatenate_smoothed_matrices running...
compute_correlation_matrix running...
Total runtime for compute_correlation_matrix: 5854.41 seconds.
# Clustering hyperparameter selection
smoothie.select_clustering_params(
gene_names=gene_names_concat, # don't change
pearsonR_mat=pearsonR_mat_concat, # don't change
output_folder=None,
pcc_cutoffs=[0.3, 0.4, 0.5, 0.6, 0.7, 0.8], # lower pcc_cutoffs require more runtime
clustering_powers=[1, 3, 5, 7, 9],
min_genes_for_module=5
)
Running network clustering with parameters:
PCC hard cutoff: 0.3, clustering soft power: 1, 3, 5, 7, 9
PCC hard cutoff: 0.4, clustering soft power: 1, 3, 5, 7, 9
PCC hard cutoff: 0.5, clustering soft power: 1, 3, 5, 7, 9
PCC hard cutoff: 0.6, clustering soft power: 1, 3, 5, 7, 9
PCC hard cutoff: 0.7, clustering soft power: 1, 3, 5, 7, 9
PCC hard cutoff: 0.8, clustering soft power: 1, 3, 5, 7, 9
Optionally, data shuffling and smoothing can be done to help determine the lower bound for pcc_cutoff selection. Here, use the same parameters as before in smoothie.run_parallelized_smoothing.
- Note: Creating the smoothed, shuffled versions of the datasets will double memory usage temporarily.
# p95, p99, p999 = smoothie.compute_shuffled_correlation_percentiles(
# adata_list,
# grid_based_or_not=True, # Choose same value as in smoothie.run_parallelized_smoothing
# gaussian_sd=target_microns * micron_to_unit_conversion, # Choose same value as in smoothie.run_parallelized_smoothing
# min_spots_under_gaussian=100, # Choose same value as in smoothie.run_parallelized_smoothing
# seed=0
# )
Based on the outputs above, we can select pcc_cutoff and clustering_power hyperparameters for smoothie.make_spatial_network.
pcc_cutoff: The Pearson correlation coefficient (PCC) hard threshold for network construction. Only correlations above this value are retained in the gene network. Higher cutoffs yield smaller networks with stronger average correlations.
clustering_power: A soft-thresholding exponent that controls how strongly weaker PCC values are suppressed relative to stronger ones. Higher values favor more modules.
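To make the two thresholds concrete, here is a minimal WGCNA-style sketch of how a hard cutoff and a soft power can combine on a gene-gene PCC matrix (an illustration of the idea, not Smoothie's exact implementation):

```python
import numpy as np

def threshold_network(pearsonR_mat, pcc_cutoff, clustering_power):
    """Hard-threshold then soft-threshold a gene-gene PCC matrix.

    Correlations below pcc_cutoff are dropped (hard threshold); surviving
    edge weights are raised to clustering_power (soft threshold), which
    suppresses weak edges relative to strong ones before clustering.
    """
    adj = np.where(pearsonR_mat >= pcc_cutoff, pearsonR_mat, 0.0)
    np.fill_diagonal(adj, 0.0)  # drop self-edges
    return adj ** clustering_power

pcc = np.array([[1.0, 0.9, 0.45],
                [0.9, 1.0, 0.20],
                [0.45, 0.20, 1.0]])

# With pcc_cutoff=0.4 and clustering_power=4: the 0.2 edge is dropped,
# 0.9 -> ~0.656 and 0.45 -> ~0.041, widening the gap between edge strengths.
print(threshold_network(pcc, pcc_cutoff=0.4, clustering_power=4))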
# Build network and cluster genes into modules
edge_list, node_label_df = smoothie.make_spatial_network(
pearsonR_mat=pearsonR_mat_concat, # don't change
gene_names=gene_names_concat, # don't change
pcc_cutoff=0.4,
clustering_power=4,
output_folder='./out'
)
# Filter node_label_df for only genes within modules of size 2 or more.
modules_df = node_label_df.groupby('module_label').filter(lambda x: len(x) >= 2)
# Examine spatial gene modules
modules_df
| | name | module_label | degree | weighted_degree | rescaled_weighted_degree | clustering_coeff | margin_score |
|---|---|---|---|---|---|---|---|
| 0 | Rps17 | 1 | 1722 | 937.232806 | 61.987632 | 0.322454 | 0.766591 |
| 1 | Rps5 | 1 | 1707 | 934.802515 | 60.649651 | 0.328016 | 0.813370 |
| 2 | Rps26 | 1 | 1695 | 919.585437 | 59.695018 | 0.326045 | 0.776880 |
| 3 | Rpl11 | 1 | 1742 | 945.924169 | 59.477755 | 0.316680 | 0.785971 |
| 4 | Rps24 | 1 | 1577 | 860.135441 | 59.366024 | 0.369837 | 0.826953 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6411 | Hmga2 | 296 | 3 | 1.251878 | 0.000004 | 1.000000 | 0.280330 |
| 6412 | Rbp1 | 297 | 52 | 21.813263 | 0.000690 | 0.960030 | 0.949203 |
| 6413 | Aldh1a2 | 297 | 1 | 0.470751 | 0.000193 | 0.000000 | 1.000000 |
| 6414 | Hmga1 | 298 | 46 | 19.375366 | 0.002022 | 0.957488 | 0.994935 |
| 6415 | Hmga1b | 298 | 2 | 0.938142 | 0.001647 | 1.000000 | 0.999582 |
6416 rows × 7 columns
Finally, we can import the saved edge_list.csv and node_labels.csv files into Cytoscape for interactive network visualization! More info in /guides/cytoscape_instructions.md
Visualize Smoothed Gene Expression and Modules¶
# rotate sm_adata as needed
sm_adata_list[0] = smoothie.rotate_spatial(sm_adata_list[0], angle_degrees=-90, flip_x=True)
sm_adata_list[1] = smoothie.rotate_spatial(sm_adata_list[1], angle_degrees=-90)
sm_adata_list[2] = smoothie.rotate_spatial(sm_adata_list[2], angle_degrees=90)
sm_adata_list[3] = smoothie.rotate_spatial(sm_adata_list[3], angle_degrees=-90)
# Plot a gene of interest across datasets
smoothie.plot_gene_multisample(
sm_adata_list,
dataset_names,
gene_name='Upk3b', # marker gene for mesothelial cell monolayer!
output_folder='./gene_plots',
shared_scaling=False, # Dataset independent scaling or shared 0-to-max scaling for gene plotting
spot_size=50, # Adjust for optimal visualization
dpi=600
)
# Plot gene modules across datasets
smoothie.plot_modules_multisample(
sm_adata_list,
dataset_names,
node_label_df,
output_folder='./module_plots',
shared_scaling=False, # Dataset independent scaling or shared 0-to-max scaling for module plotting
min_genes=15, # only showing large modules to keep GitHub file sizes down. Choose 2 or 3 to see more!
spot_size=50, # Adjust for optimal visualization
dpi=150 # low resolution here for GitHub; choose 300-600 for better quality!
)
(Multi-sample spatial plots for Modules 1-31 are rendered here, one panel set per module.)
Step 3: Compare Gene Patterns Between Datasets (2nd-order Correlation Analysis)¶
In this section, we assess stable vs. dynamic gene patterns between datasets (without having to warp, transform, or align the mismatched spatial coordinates!).
We construct spatial gene embeddings for each dataset, where a gene's embedding consists of its correlations with the other genes in that dataset. These gene embedding features are then aligned across datasets for comparison.
The gene stability metric quantifies how consistently a gene co-varies with the same partners across datasets. Values range from -1 to 1, where higher values indicate more stable/conserved co-expression patterns. A gene stability of exactly -1.0 indicates that the gene was missing in one dataset but had a significant correlation above pcc_cutoff in the other dataset.
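As a minimal sketch of the second-order idea (an illustration consistent with the description above, not Smoothie's exact implementation), a gene's stability between two datasets can be computed as the Pearson correlation between its two correlation vectors over shared partner genes:

```python
import numpy as np

def gene_stability(embed_a, embed_b):
    """Second-order correlation: the PCC between a gene's correlation
    vector (its correlations with shared partner genes) in dataset A
    and the same gene's correlation vector in dataset B."""
    return float(np.corrcoef(embed_a, embed_b)[0, 1])

# Hypothetical correlation vectors over 5 shared partner genes
embed_E13 = np.array([0.8, 0.6, -0.2, 0.1, 0.7])
embed_E14_stable = np.array([0.75, 0.65, -0.1, 0.05, 0.8])  # same partners -> stable
embed_E14_dynamic = np.array([-0.5, 0.1, 0.9, -0.3, -0.6])  # new partners -> dynamic

print(gene_stability(embed_E13, embed_E14_stable))   # high (near 1)
print(gene_stability(embed_E13, embed_E14_dynamic))  # low (negative here)
```

Because the metric compares correlation structure rather than coordinates, no spatial registration between datasets is required.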
This analysis is wrapped within the function smoothie.run_second_order_correlation_analysis. More information is found in the API documentation.
results = smoothie.run_second_order_correlation_analysis(
sm_adata_list=sm_adata_list,
pcc_cutoff=0.4, # Use same cutoff as network
node_label_df=node_label_df, # Module assignments from network
E_max=25, # Max features per module for embeddings.
output_folder='./2nd_order_corr_out',
seed=0
)
print(f"Gene embeddings shape: {results['gene_embeddings_tensor'].shape}")
print(f"Gene stabilities shape: {results['gene_stabilities'].shape}")
Starting second-order correlation analysis pipeline... Number of datasets: 4
[Step 1/5] Computing Pearson correlation matrices for each dataset...
Dataset 1/4 (19274 genes, 138467 spots): 929.68 seconds.
Dataset 2/4 (20602 genes, 183798 spots): 1391.47 seconds.
Dataset 3/4 (20160 genes, 203215 spots): 1467.41 seconds.
Dataset 4/4 (20203 genes, 218394 spots): 1579.93 seconds.
[Step 2/5] Aligning gene correlation matrices across datasets... Aligned tensor shape: (21290, 21290, 4)
[Step 3/5] Creating gene embeddings (PCC cutoff=0.4, E_max=25)... Gene embeddings shape: (7438, 3018, 4)
[Step 4/5] Computing gene stability metrics across dataset pairs... Stability tensor shape: (4, 4, 7438)
[Step 5/5] Saving results to ./2nd_order_corr_out... Second-order correlation analysis complete.
Gene embeddings shape: (7438, 3018, 4)
Gene stabilities shape: (4, 4, 7438)
Examine Gene Stabilities Between Datasets
smoothie.plot_dataset_pair_stabilities(
gene_stabilities=results['gene_stabilities'], # don't change
dataset_names=dataset_names, # don't change
output_folder=None
)
Gene stabilities of -1 are not shown in the pairwise dataset stability violin plots.
Rank Gene Patterns from Stable to Dynamic Across All Datasets
# Look at distribution of minimum stability for each gene across dataset pairs
stability_df = smoothie.plot_gene_stability_distribution(
gene_stabilities=results['gene_stabilities'], # don't change
robust_gene_names=results['robust_gene_names'], # don't change
node_label_df=node_label_df, # don't change
output_folder=None
)
Plot a Gene and its Stability Between Datasets
GOI = 'Alb' # select your gene of interest
# Plot gene spatial expression
smoothie.plot_gene_multisample(
sm_adata_list, # don't change
dataset_names, # don't change
gene_name=GOI,
output_folder='./gene_stability_plots',
shared_scaling=False, # Dataset independent scaling or shared 0-to-max scaling for gene plotting
spot_size=50 # Adjust for optimal visualization
)
# Plot gene stabilities
smoothie.plot_gene_stability(
gene_name=GOI,
gene_stabilities=results['gene_stabilities'], # don't change
robust_gene_names=results['robust_gene_names'], # don't change
dataset_names=dataset_names, # don't change
output_folder='./gene_stability_plots'
)
GOI = 'Mkrn1' # select your gene of interest
# Plot gene spatial expression
smoothie.plot_gene_multisample(
sm_adata_list, # don't change
dataset_names, # don't change
gene_name=GOI,
output_folder='./gene_stability_plots',
shared_scaling=False, # Dataset independent scaling or shared 0-to-max scaling for gene plotting
spot_size=50 # Adjust for optimal visualization
)
# Plot gene stabilities
smoothie.plot_gene_stability(
gene_name=GOI,
gene_stabilities=results['gene_stabilities'], # don't change
robust_gene_names=results['robust_gene_names'], # don't change
dataset_names=dataset_names, # don't change
output_folder='./gene_stability_plots'
)
GOI = 'Gzma' # select your gene of interest
# Plot gene spatial expression
smoothie.plot_gene_multisample(
sm_adata_list, # don't change
dataset_names, # don't change
gene_name=GOI,
output_folder='./gene_stability_plots',
shared_scaling=False, # Dataset independent scaling or shared 0-to-max scaling for gene plotting
spot_size=50 # Adjust for optimal visualization
)
# Plot gene stabilities
smoothie.plot_gene_stability(
gene_name=GOI,
gene_stabilities=results['gene_stabilities'], # don't change
robust_gene_names=results['robust_gene_names'], # don't change
dataset_names=dataset_names, # don't change
output_folder='./gene_stability_plots'
)
GOI = 'Lor' # select your gene of interest
# Plot gene spatial expression
smoothie.plot_gene_multisample(
sm_adata_list, # don't change
dataset_names, # don't change
gene_name=GOI,
output_folder='./gene_stability_plots',
shared_scaling=False, # Dataset independent scaling or shared 0-to-max scaling for gene plotting
spot_size=50 # Adjust for optimal visualization
)
# Plot gene stabilities
smoothie.plot_gene_stability(
gene_name=GOI,
gene_stabilities=results['gene_stabilities'], # don't change
robust_gene_names=results['robust_gene_names'], # don't change
dataset_names=dataset_names, # don't change
output_folder='./gene_stability_plots'
)
Plot a Gene Module's Average Stability Between Datasets
smoothie.plot_module_stability(
module_label = 5, # Pick module label of interest
gene_stabilities=results['gene_stabilities'], # don't change
robust_gene_names=results['robust_gene_names'], # don't change
node_label_df=node_label_df, # don't change
dataset_names=dataset_names, # don't change
output_folder='./module_stability_plots'
)