Single Dataset Pipeline

Key features:

Apply Gaussian smoothing to address data noise and sparsity
Construct a gene co-expression network across samples using smoothed dataset
Group genes into spatial modules with hard and soft thresholding on the co-expression network

In [ ]:

Copied!





import anndata as ad
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt

import smoothie
smoothie.suppress_warnings() # Suppress a few warnings for cleaner output
import anndata as ad
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt

import smoothie
smoothie.suppress_warnings() # Suppress a few warnings for cleaner output

Load AnnData¶

Smoothie requires data input in the AnnData format. A guide for constructing the AnnData data structure for both imaging-based and sequencing-based spatial transcriptomics outputs can be found in /guides/convert_to_anndata.md

In [ ]:

Copied!

adata = ad.read_h5ad("/path/to/your/data.h5ad")
adata = ad.read_h5ad("/path/to/your/data.h5ad")

Quality Control¶

The table below shows suggested QC values given the resolution of your AnnData spatial coordinates.

Note: If you have cell-segmented imaging / fine resolution data, then you should use the "Cell-sized" QC settings. If you have true individual transcripts or fine resolution data, then you should use the Fine / Imaging QC settings.

Resolution	`min_counts (filter_cells)`	`min_counts (filter_genes)`	`min_cells (filter_genes)`
"Cell-sized" (8-10 µm)	10 - 100	100 - 500	10 - 100
Binned (20-50 µm)	50 - 500	100 - 500	10 - 100
Fine / Imaging (< 1-2 µm)	1	100 - 500	10 - 100

In [ ]:

Copied!





# Filter spots with low UMI counts
sc.pp.filter_cells(adata, min_counts=50)

# Filter genes with low counts/unique spots
sc.pp.filter_genes(adata, min_counts=100)
sc.pp.filter_genes(adata, min_cells=10)
# Filter spots with low UMI counts
sc.pp.filter_cells(adata, min_counts=50)

# Filter genes with low counts/unique spots
sc.pp.filter_genes(adata, min_counts=100)
sc.pp.filter_genes(adata, min_cells=10)

Normalization¶

Normalization depends on the resolution of your AnnData spatial coordinates.

Note: If you have cell-segmented imaging / fine resolution data, then you should use the "Cell-sized" normalization method. If you have true individual transcripts or fine resolution data, then you should use the Fine / Imaging normalization method.

Resolution	Normalization Method
"Cell sized" or Binned Data (8-50 micron resolution)	CPT + Log1p normalization
Fine / Imaging (<1-2 micron resolution)	Log1p normalization only

In [ ]:

Copied!





# CHOOSE: CPT + Log1p normalization (8-50 micron resolution data)
sc.pp.normalize_total(adata, target_sum=1e3)
sc.pp.log1p(adata)

# # CHOOSE: Log1p normalization only (<1-2 micron resolution data)
# sc.pp.log1p(adata)
# CHOOSE: CPT + Log1p normalization (8-50 micron resolution data)
sc.pp.normalize_total(adata, target_sum=1e3)
sc.pp.log1p(adata)

# # CHOOSE: Log1p normalization only (<1-2 micron resolution data)
# sc.pp.log1p(adata)

Step 1: Run Gaussian Smoothing¶

Here, we run Gaussian smoothing on the spatial dataset address data sparsity and noise. Key parameters for function smoothie.run_parallelized_smoothing:

gaussian_sd: float

The standard deviation for the Gaussian distribution, which determines smoothing degree. Generally across high-resolution S.T. platforms, a gaussian_sd corresponding to 20-30 microns is ideal.
gaussian_sd is calculated as target_microns * micron_to_unit_conversion
- For Slide-seq, micron_to_unit_conversion = 1.5456 (1 micron equals 1.5456 Slide-seq spatial units)
- For Stereo-seq, micron_to_unit_conversion = 2 (1 micron equals 2 Stereo-seq spatial units)
- A table of micron_to_unit_conversion values for spatial transcriptomics platforms is included in /guides/micron_to_unit_conversion_table.md

grid_based_or_not: boolean

True = grid-based smoothing: smooth only at imposed hexagonal grid points. Good for subcellular to "cell-sized" resolution data (<1-10 micron).
False = in-place smoothing: smooth at every spatial location in the dataset. Good for "cell-sized" to binned resolution data (10-50 micron).

min_spots_under_gaussian

Minimum required number of data points within radius (3 * gaussian_sd) for smoothing to occur at a given location.
Default is 10-100. Raise value if low density regions around the perimeter of tissue appear noisy in modules. Lower value if you have large spot size (e.g. original Visium 100 micron resolution data or binned 20-50 micron resolution).

In [ ]:

Copied!





# Determine gaussian_sd ---------------------------------------------------
target_microns = 20.0 # 20-30 microns is ideal for smoothing
micron_to_unit_conversion =   # IMPORTANT! Select for your platform
# ------------------------------------------------------------------------

sm_adata = smoothie.run_parallelized_smoothing(
    adata,
    grid_based_or_not=True,  # Select True or False,
    gaussian_sd=target_microns * micron_to_unit_conversion,
    min_spots_under_gaussian=25 # Adjust as needed
)
# Determine gaussian_sd ---------------------------------------------------
target_microns = 20.0 # 20-30 microns is ideal for smoothing
micron_to_unit_conversion =   # IMPORTANT! Select for your platform
# ------------------------------------------------------------------------

sm_adata = smoothie.run_parallelized_smoothing(
    adata,
    grid_based_or_not=True,  # Select True or False,
    gaussian_sd=target_microns * micron_to_unit_conversion,
    min_spots_under_gaussian=25 # Adjust as needed
)

Step 2: Measure and Cluster Gene Co-expression Network¶

First, we compute the pairwise Pearson correlation coefficient between all smoothed gene surface pairs.

Next, we evaluate optimal hyperparameter choices for network construction and clustering. The function smoothie.select_clustering_params calculates network clustering statistics across different choices for pcc_cutoff and clustering_power.

Choose values for pcc_cutoff and clustering_power that balance the following goals:

High gene margin scores for high gene module confidence
Stable number of modules (n_clusters) compared to other pcc_cutoff values
High gene recall for more gene-rich analyses

Refer to /docs/tutorials for examples on selecting these parameters.

In [ ]:

Copied!

# Compute pairwise gene correlations
pearsonR_mat, _ = smoothie.compute_correlation_matrix(sm_adata.X)
# Compute pairwise gene correlations
pearsonR_mat, _ = smoothie.compute_correlation_matrix(sm_adata.X)

In [ ]:

Copied!





# Clustering hyperparameter selection
smoothie.select_clustering_params(
    gene_names=sm_adata.var_names, # don't change
    pearsonR_mat=pearsonR_mat, # don't change
    output_folder=None,
    pcc_cutoffs=[0.3, 0.4, 0.5, 0.6, 0.7, 0.8], # lower pcc_cutoffs require more runtime
    clustering_powers=[1, 3, 5, 7, 9],
    min_genes_for_module=5
)
# Clustering hyperparameter selection
smoothie.select_clustering_params(
    gene_names=sm_adata.var_names, # don't change
    pearsonR_mat=pearsonR_mat, # don't change
    output_folder=None,
    pcc_cutoffs=[0.3, 0.4, 0.5, 0.6, 0.7, 0.8], # lower pcc_cutoffs require more runtime
    clustering_powers=[1, 3, 5, 7, 9],
    min_genes_for_module=5
)

Optionally, data shuffling and smoothing can be done to help determine the lower bound for pcc_cutoff selection. Here, use the same parameters as before in smoothie.run_parallelized_smoothing.

In [ ]:

Copied!





# p95, p99, p999 = smoothie.compute_shuffled_correlation_percentiles(
#     adata,
#     grid_based_or_not=True, # Choose same value as in smoothie.run_parallelized_smoothing
#     gaussian_sd=target_microns * micron_to_unit_conversion, # Choose same value as in smoothie.run_parallelized_smoothing
#     min_spots_under_gaussian=25, # Choose same value as in smoothie.run_parallelized_smoothing
#     seed=0
# )
# p95, p99, p999 = smoothie.compute_shuffled_correlation_percentiles(
#     adata,
#     grid_based_or_not=True, # Choose same value as in smoothie.run_parallelized_smoothing
#     gaussian_sd=target_microns * micron_to_unit_conversion, # Choose same value as in smoothie.run_parallelized_smoothing
#     min_spots_under_gaussian=25, # Choose same value as in smoothie.run_parallelized_smoothing
#     seed=0
# )

Based on the outputs above, we can select pcc_cutoff and clustering_power hyperparameters for smoothie.make_spatial_network.

pcc_cutoff: The Pearson correlation coefficient (PCC) hard threshold for network construction. Only correlations above this value are retained in the gene network. Higher cutoff values result in smaller, stronger average correlation networks.

clustering_power: A soft thresholding parameter that controls how much weaker PCC values are suppressed. Higher values favor more modules.

In [ ]:

Copied!





edge_list, node_label_df = smoothie.make_spatial_network(
    pearsonR_mat=pearsonR_mat, # don't change
    gene_names=sm_adata.var_names, # don't change
    pcc_cutoff=0.4,
    clustering_power=4,
    output_folder='./out'
)

# Filter node_label_df for only genes within modules of size 2 or more.
modules_df = node_label_df.groupby('module_label').filter(lambda x: len(x) >= 2)

# Examine modules
modules_df
edge_list, node_label_df = smoothie.make_spatial_network(
    pearsonR_mat=pearsonR_mat, # don't change
    gene_names=sm_adata.var_names, # don't change
    pcc_cutoff=0.4,
    clustering_power=4,
    output_folder='./out'
)

# Filter node_label_df for only genes within modules of size 2 or more.
modules_df = node_label_df.groupby('module_label').filter(lambda x: len(x) >= 2)

# Examine modules
modules_df

Finally, we can import the saved edge_list.csv and node_labels.csv files into Cytoscape for interactive network visualization! More info in /guides/cytoscape_instructions.md

Visualize Smoothed Gene Expression and Modules¶

In [ ]:

Copied!





# Plot individual gene
smoothie.plot_gene(
    sm_adata=sm_adata,
    gene_name="Gene1",  # Replace with your gene
    output_folder="./gene_plots",
    spot_size=25,  # Adjust for optimal visualization
)
# Plot individual gene
smoothie.plot_gene(
    sm_adata=sm_adata,
    gene_name="Gene1",  # Replace with your gene
    output_folder="./gene_plots",
    spot_size=25,  # Adjust for optimal visualization
)

In [ ]:

Copied!





# Plot all gene modules
smoothie.plot_modules(
    sm_adata,
    node_label_df,
    output_folder='./module_plots',
    min_genes=3, # Minimum number of genes in a module to plot the module
    spot_size=25 # Adjust for optimal visualization
)
# Plot all gene modules
smoothie.plot_modules(
    sm_adata,
    node_label_df,
    output_folder='./module_plots',
    min_genes=3, # Minimum number of genes in a module to plot the module
    spot_size=25 # Adjust for optimal visualization
)

Analyze a Specific Gene or Gene Set Correlations¶

Find top correlated genes to a gene of interest.
Build a network focused on a specific set of genes with a more permissive PCC cutoff.

In [ ]:

Copied!





# Find top correlated genes for a gene of interest (GOI)
GOI_correlations = smoothie.get_correlations_to_GOI(
    pearsonR_mat=pearsonR_mat, # don't change
    gene_names=sm_adata.var_names, # don't change
    GOI="Gene1",
    reverse_order=False,  # False for top correlations, True for anti-correlations
    plot_histogram=True
)
    
print(f"\nTop 10 gene correlations with GOI:")
print(GOI_correlations[:10])
# Find top correlated genes for a gene of interest (GOI)
GOI_correlations = smoothie.get_correlations_to_GOI(
    pearsonR_mat=pearsonR_mat, # don't change
    gene_names=sm_adata.var_names, # don't change
    GOI="Gene1",
    reverse_order=False,  # False for top correlations, True for anti-correlations
    plot_histogram=True
)
    
print(f"\nTop 10 gene correlations with GOI:")
print(GOI_correlations[:10])

In [ ]:

Copied!





# Build targeted network for your gene set of interest
my_gene_list = ['Gene1', 'Gene2', 'Gene3']  # Replace with your genes

# Build targeted network
geneset_edge_list, geneset_node_label_df = smoothie.make_geneset_spatial_network(
    pearsonR_mat=pearsonR_mat, # don't change
    gene_names=sm_adata.var_names, # don't change
    node_label_df=node_label_df, # don't change
    gene_list=my_gene_list,
    low_pcc_cutoff=0.3,  # More permissive pcc_cutoff than main network
    output_folder=None,
    intra_geneset_edges_only=True  # Only edges within gene set
)

print(f"Gene set network: {len(geneset_edge_list)} edges")
print(f"Genes included: {geneset_node_label_df['geneset_member'].sum()}")
# Build targeted network for your gene set of interest
my_gene_list = ['Gene1', 'Gene2', 'Gene3']  # Replace with your genes

# Build targeted network
geneset_edge_list, geneset_node_label_df = smoothie.make_geneset_spatial_network(
    pearsonR_mat=pearsonR_mat, # don't change
    gene_names=sm_adata.var_names, # don't change
    node_label_df=node_label_df, # don't change
    gene_list=my_gene_list,
    low_pcc_cutoff=0.3,  # More permissive pcc_cutoff than main network
    output_folder=None,
    intra_geneset_edges_only=True  # Only edges within gene set
)

print(f"Gene set network: {len(geneset_edge_list)} edges")
print(f"Genes included: {geneset_node_label_df['geneset_member'].sum()}")