Single Dataset Pipeline
Key features:
- Apply Gaussian smoothing to address data noise and sparsity
- Construct a gene co-expression network across samples using smoothed dataset
- Group genes into spatial modules with hard and soft thresholding on the co-expression network
import anndata as ad
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
import smoothie
smoothie.suppress_warnings() # Suppress a few warnings for cleaner output
Load AnnData¶
Smoothie requires data input in the AnnData format. A guide for constructing the AnnData data structure for both imaging-based and sequencing-based spatial transcriptomics outputs can be found in /guides/convert_to_anndata.md
adata = ad.read_h5ad("/path/to/your/data.h5ad")
Quality Control¶
The table below shows suggested QC values given the resolution of your AnnData spatial coordinates.
- Note: If you have cell-segmented imaging / fine resolution data, then you should use the "Cell-sized" QC settings. If you have true individual transcripts or fine resolution data, then you should use the Fine / Imaging QC settings.
| Resolution | min_counts (filter_cells) |
min_counts (filter_genes) |
min_cells (filter_genes) |
|---|---|---|---|
| "Cell-sized" (8-10 µm) | 10 - 100 | 100 - 500 | 10 - 100 |
| Binned (20-50 µm) | 50 - 500 | 100 - 500 | 10 - 100 |
| Fine / Imaging (< 1-2 µm) | 1 | 100 - 500 | 10 - 100 |
# Filter spots with low UMI counts
sc.pp.filter_cells(adata, min_counts=50)
# Filter genes with low counts/unique spots
sc.pp.filter_genes(adata, min_counts=100)
sc.pp.filter_genes(adata, min_cells=10)
Normalization¶
Normalization depends on the resolution of your AnnData spatial coordinates.
- Note: If you have cell-segmented imaging / fine resolution data, then you should use the "Cell-sized" normalization method. If you have true individual transcripts or fine resolution data, then you should use the Fine / Imaging normalization method.
| Resolution | Normalization Method |
|---|---|
| "Cell sized" or Binned Data (8-50 micron resolution) | CPT + Log1p normalization |
| Fine / Imaging (<1-2 micron resolution) | Log1p normalization only |
# CHOOSE: CPT + Log1p normalization (8-50 micron resolution data)
sc.pp.normalize_total(adata, target_sum=1e3)
sc.pp.log1p(adata)
# # CHOOSE: Log1p normalization only (<1-2 micron resolution data)
# sc.pp.log1p(adata)
Step 1: Run Gaussian Smoothing¶
Here, we run Gaussian smoothing on the spatial dataset address data sparsity and noise. Key parameters for function smoothie.run_parallelized_smoothing:
gaussian_sd: float
- The standard deviation for the Gaussian distribution, which determines smoothing degree. Generally across high-resolution S.T. platforms, a gaussian_sd corresponding to 20-30 microns is ideal.
- gaussian_sd is calculated as target_microns * micron_to_unit_conversion
- For Slide-seq, micron_to_unit_conversion = 1.5456 (1 micron equals 1.5456 Slide-seq spatial units)
- For Stereo-seq, micron_to_unit_conversion = 2 (1 micron equals 2 Stereo-seq spatial units)
- A table of micron_to_unit_conversion values for spatial transcriptomics platforms is included in
/guides/micron_to_unit_conversion_table.md
grid_based_or_not: boolean
- True = grid-based smoothing: smooth only at imposed hexagonal grid points. Good for subcellular to "cell-sized" resolution data (<1-10 micron).
- False = in-place smoothing: smooth at every spatial location in the dataset. Good for "cell-sized" to binned resolution data (10-50 micron).
min_spots_under_gaussian
- Minimum required number of data points within radius (3 * gaussian_sd) for smoothing to occur at a given location.
- Default is 10-100. Raise value if low density regions around the perimeter of tissue appear noisy in modules. Lower value if you have large spot size (e.g. original Visium 100 micron resolution data or binned 20-50 micron resolution).
# Determine gaussian_sd ---------------------------------------------------
target_microns = 20.0 # 20-30 microns is ideal for smoothing
micron_to_unit_conversion = # IMPORTANT! Select for your platform
# ------------------------------------------------------------------------
sm_adata = smoothie.run_parallelized_smoothing(
adata,
grid_based_or_not=True, # Select True or False,
gaussian_sd=target_microns * micron_to_unit_conversion,
min_spots_under_gaussian=25 # Adjust as needed
)
Step 2: Measure and Cluster Gene Co-expression Network¶
First, we compute the pairwise Pearson correlation coefficient between all smoothed gene surface pairs.
Next, we evaluate optimal hyperparameter choices for network construction and clustering. The function smoothie.select_clustering_params calculates network clustering statistics across different choices for pcc_cutoff and clustering_power.
Choose values for pcc_cutoff and clustering_power that balance the following goals:
- High gene margin scores for high gene module confidence
- Stable number of modules (n_clusters) compared to other pcc_cutoff values
- High gene recall for more gene-rich analyses
Refer to /docs/tutorials for examples on selecting these parameters.
# Compute pairwise gene correlations
pearsonR_mat, _ = smoothie.compute_correlation_matrix(sm_adata.X)
# Clustering hyperparameter selection
smoothie.select_clustering_params(
gene_names=sm_adata.var_names, # don't change
pearsonR_mat=pearsonR_mat, # don't change
output_folder=None,
pcc_cutoffs=[0.3, 0.4, 0.5, 0.6, 0.7, 0.8], # lower pcc_cutoffs require more runtime
clustering_powers=[1, 3, 5, 7, 9],
min_genes_for_module=5
)
Optionally, data shuffling and smoothing can be done to help determine the lower bound for pcc_cutoff selection. Here, use the same parameters as before in smoothie.run_parallelized_smoothing.
# p95, p99, p999 = smoothie.compute_shuffled_correlation_percentiles(
# adata,
# grid_based_or_not=True, # Choose same value as in smoothie.run_parallelized_smoothing
# gaussian_sd=target_microns * micron_to_unit_conversion, # Choose same value as in smoothie.run_parallelized_smoothing
# min_spots_under_gaussian=25, # Choose same value as in smoothie.run_parallelized_smoothing
# seed=0
# )
Based on the outputs above, we can select pcc_cutoff and clustering_power hyperparameters for smoothie.make_spatial_network.
pcc_cutoff: The Pearson correlation coefficient (PCC) hard threshold for network construction. Only correlations above this value are retained in the gene network. Higher cutoff values result in smaller, stronger average correlation networks.
clustering_power: A soft thresholding parameter that controls how much weaker PCC values are suppressed. Higher values favor more modules.
edge_list, node_label_df = smoothie.make_spatial_network(
pearsonR_mat=pearsonR_mat, # don't change
gene_names=sm_adata.var_names, # don't change
pcc_cutoff=0.4,
clustering_power=4,
output_folder='./out'
)
# Filter node_label_df for only genes within modules of size 2 or more.
modules_df = node_label_df.groupby('module_label').filter(lambda x: len(x) >= 2)
# Examine modules
modules_df
Finally, we can import the saved edge_list.csv and node_labels.csv files into Cytoscape for interactive network visualization! More info in /guides/cytoscape_instructions.md
Visualize Smoothed Gene Expression and Modules¶
# Plot individual gene
smoothie.plot_gene(
sm_adata=sm_adata,
gene_name="Gene1", # Replace with your gene
output_folder="./gene_plots",
spot_size=25, # Adjust for optimal visualization
)
# Plot all gene modules
smoothie.plot_modules(
sm_adata,
node_label_df,
output_folder='./module_plots',
min_genes=3, # Minimum number of genes in a module to plot the module
spot_size=25 # Adjust for optimal visualization
)
Analyze a Specific Gene or Gene Set Correlations¶
- Find top correlated genes to a gene of interest.
- Build a network focused on a specific set of genes with a more permissive PCC cutoff.
# Find top correlated genes for a gene of interest (GOI)
GOI_correlations = smoothie.get_correlations_to_GOI(
pearsonR_mat=pearsonR_mat, # don't change
gene_names=sm_adata.var_names, # don't change
GOI="Gene1",
reverse_order=False, # False for top correlations, True for anti-correlations
plot_histogram=True
)
print(f"\nTop 10 gene correlations with GOI:")
print(GOI_correlations[:10])
# Build targeted network for your gene set of interest
my_gene_list = ['Gene1', 'Gene2', 'Gene3'] # Replace with your genes
# Build targeted network
geneset_edge_list, geneset_node_label_df = smoothie.make_geneset_spatial_network(
pearsonR_mat=pearsonR_mat, # don't change
gene_names=sm_adata.var_names, # don't change
node_label_df=node_label_df, # don't change
gene_list=my_gene_list,
low_pcc_cutoff=0.3, # More permissive pcc_cutoff than main network
output_folder=None,
intra_geneset_edges_only=True # Only edges within gene set
)
print(f"Gene set network: {len(geneset_edge_list)} edges")
print(f"Genes included: {geneset_node_label_df['geneset_member'].sum()}")