Persistent Homology vs Moran's I: Benchmarking Spatial Gene Expression on a Mouse Brain Visium Section

The standard spatial transcriptomics pipeline runs Moran's I, finds spatially variable genes, and calls it done. A new end-to-end tutorial challenges that workflow — not by replacing Moran's I, but by running persistent homology from GUDHI alongside it on a 2,688-spot V1 Adult Mouse Brain Visium section and asking a sharper question: what does autocorrelation systematically fail to see?

The answer turns out to be more interesting than the framing. Moran's I and persistent homology are not measuring the same underlying property of spatial gene expression. They are orthogonal instruments. Running one as a "check" on the other exposes a category error that most spatial omics pipelines quietly commit, and fixing that error changes how you should structure any production spatial analysis workflow.

The Autocorrelation Monoculture in Spatial Omics

Spatial transcriptomics, particularly the 10x Genomics Visium platform, captures gene expression at discrete barcoded spots arranged in a hexagonal grid. Since the Visium launch, the computational biology community has converged on a small set of spatial statistics for identifying genes whose expression varies non-randomly across tissue. Moran's I is the dominant choice, and for legitimate reasons: it has a closed-form null distribution under spatial randomness, it outputs a z-score that reviewers recognize, and it integrates naturally into scanpy/squidpy pipelines.

The problem is not that Moran's I is wrong. It is that Moran's I measures one specific thing — local smoothness, defined as the correlation between each spot's expression and the expression of its immediate spatial neighbours. A gene with high Moran's I has similar values in adjacent spots. A gene with low Moran's I varies unpredictably from spot to spot.
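To make "local smoothness" concrete, here is a minimal pure-Python sketch of Moran's I with binary neighbour weights, I = (n / W) · Σ w_ij (x_i − x̄)(x_j − x̄) / Σ (x_i − x̄)². It is not squidpy's implementation; the square 6×6 grid and the two toy expression patterns are assumptions chosen only to show the two extremes.

```python
from itertools import product


def grid_neighbours(rows, cols):
    """4-neighbour adjacency on a rows x cols square grid (a stand-in for a spot graph)."""
    adj = {}
    for r, c in product(range(rows), range(cols)):
        adj[(r, c)] = [
            (r + dr, c + dc)
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < rows and 0 <= c + dc < cols
        ]
    return adj


def morans_i(values, adj):
    """Moran's I with binary weights: w_ij = 1 if j is a neighbour of i, else 0."""
    n = len(values)
    mean = sum(values.values()) / n
    dev = {spot: v - mean for spot, v in values.items()}
    # Sum of products of deviations over every directed neighbour pair.
    cross = sum(dev[i] * dev[j] for i in adj for j in adj[i])
    total_weight = sum(len(nbrs) for nbrs in adj.values())
    denom = sum(d * d for d in dev.values())
    return (n / total_weight) * cross / denom


adj = grid_neighbours(6, 6)
gradient = {(r, c): float(c) for r, c in adj}                # smooth left-to-right ramp
checkerboard = {(r, c): float((r + c) % 2) for r, c in adj}  # alternating spots

print(round(morans_i(gradient, adj), 3))
print(round(morans_i(checkerboard, adj), 3))
```

On this toy grid the ramp scores 0.8 and the checkerboard scores -1.0. Neither number says anything about the overall shape of the pattern; both depend only on how each spot relates to its immediate neighbours.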

What Moran's I does not capture is global topological structure: whether high-expression spots form a closed ring around a ventricle, whether expression defines a nested hollow domain across multiple cortical layers, whether a gene delineates a connected crescent across tissue that only becomes visible at a coarser spatial scale. These are the geometric signatures that topological data analysis is specifically designed to detect, and they can appear in genes with unremarkable Moran's I statistics.

The result is an analytical monoculture. Most published spatial omics analyses use Moran's I (or one of its close relatives, like Geary's C) as the sole criterion for spatial structure, discarding signals that autocorrelation statistics are constitutionally unable to see.

Building the Three-Task Pipeline

The tutorial constructs a pipeline on the V1 Adult Mouse Brain Visium dataset: 2,688 spots, 15 annotated anatomical regions, approximately 18,000 genes. It is a deliberately tractable dataset: small enough to run on a laptop, and annotated well enough to check the spatial statistics against known anatomy.

Standard Preprocessing and Graph Construction

The scanpy preprocessing stack is conventional: normalize_total to 1×10⁴ counts per spot, log1p transformation, selection of 3,000 highly variable genes, and 30 PCA components. None of this departs from the spatial omics standard playbook.
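For readers who have not looked inside the first two steps, the sketch below shows the arithmetic of total-count normalisation followed by log1p on two toy spots. It is a hand-rolled illustration, not scanpy code; the count vectors and the zero-count handling are assumptions.

```python
import math


def normalize_total(counts, target_sum=1e4):
    """Scale one spot's counts so they sum to target_sum."""
    total = sum(counts)
    if total == 0:
        # Assumption for this sketch: an empty spot stays all-zero.
        return [0.0 for _ in counts]
    return [c * target_sum / total for c in counts]


def log1p_transform(values):
    """Apply log(1 + x) element-wise, which keeps zeros at zero."""
    return [math.log1p(v) for v in values]


shallow_spot = [2, 0, 8]     # 10 counts in total
deep_spot = [200, 0, 800]    # same composition at 100x the depth

norm_shallow = normalize_total(shallow_spot)
norm_deep = normalize_total(deep_spot)
log_shallow = log1p_transform(norm_shallow)

print(norm_shallow, norm_deep)
print(log_shallow)
```

The two spots end up with identical normalised profiles, which is the point of the step: sequencing depth is removed before any spatial statistic compares neighbouring spots.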

The spatial graph construction is where the first landmine sits. The tutorial reports that squidpy 1.8.x introduced a breaking change that will silently corrupt older pipelines:

# squidpy < 1.8.x (broken in 1.8.x)
sq.gr.spatial_neighbors(adata, coord_type="visium", n_rings=1)

# squidpy 1.8.x (correct)
sq.gr.spatial_neighbors(adata, coord_type="grid", n_rings=1)

The coord_type='visium' argument no longer exists in 1.8.x. In some intermediate versions it silently fell back to a default rather than raising an error — producing a spatially wrong graph with no warning. This is the worst category of bioinformatics bug: the pipeline runs, produces numbers, and nothing tells you those numbers reflect the wrong neighbourhood structure. With coord_type='grid' and n_rings=1, each interior spot on the hexagonal Visium array connects to exactly 6 neighbours. Edge and corner spots get fewer, a boundary artifact worth tracking if your biology of interest concentrates at tissue margins.
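A quick way to sanity-check a spatial graph is to count neighbours per spot. The sketch below builds a small hexagonal array in offset coordinates (even rows on even columns, odd rows on odd columns, which is assumed here to mirror the Visium array layout) and counts first-ring neighbours. It does not call squidpy; the 6×6 array size is arbitrary.

```python
def hex_spots(n_rows, n_cols_per_row):
    """Spots in offset coordinates: even rows use even columns, odd rows use odd columns."""
    return {
        (r, 2 * k + (r % 2))
        for r in range(n_rows)
        for k in range(n_cols_per_row)
    }


# The six first-ring neighbours of a spot in this coordinate system.
HEX_OFFSETS = [(0, -2), (0, 2), (-1, -1), (-1, 1), (1, -1), (1, 1)]


def ring1_neighbours(spot, spots):
    """Neighbours of `spot` that actually exist in the array."""
    r, c = spot
    return [(r + dr, c + dc) for dr, dc in HEX_OFFSETS if (r + dr, c + dc) in spots]


spots = hex_spots(n_rows=6, n_cols_per_row=6)
degree = {s: len(ring1_neighbours(s, spots)) for s in spots}

interior = degree[(2, 4)]
corner = degree[(0, 0)]
boundary_fraction = sum(1 for d in degree.values() if d < 6) / len(degree)

print(interior, corner, round(boundary_fraction, 2))
```

Running the same degree count on the real graph is a cheap regression test: if interior spots do not report 6 neighbours, the graph was not built on the hexagonal lattice you expected.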

Spatially Variable Gene Detection and Neighbourhood Enrichment

With the spatial graph built correctly, the pipeline runs sq.gr.spatial_autocorr in Moran's I mode across the 3,000 highly variable genes. The output is a ranked list of spatially variable genes (SVGs) with autocorrelation scores and permutation-test p-values.

One technical note on those p-values: squidpy's permutation test uses spatial randomness as its null, which treats every spot as exchangeable with every other. On a mouse brain Visium section with 15 annotated anatomical regions, the tissue is profoundly non-random: neighbouring spots share anatomy, so the number of effectively independent observations is far smaller than 2,688. The p-values will therefore be anti-conservative, that is, more significant than they should be. Top SVGs warrant validation against an independent serial section before being treated as confirmed biology.
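The exchangeability assumption is easiest to see in code. Below is a minimal sketch of a permutation test for Moran's I on a toy square grid: the null is built by shuffling expression values across spots, so every spot is treated as interchangeable. This illustrates the general technique under stated assumptions (grid shape, 199 permutations, one-sided test) and is not a copy of squidpy's internals.

```python
import random


def grid_edges(rows, cols):
    """Directed neighbour pairs (i, j) on a rows x cols grid, spots indexed row-major."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                edges += [(i, i + 1), (i + 1, i)]
            if r + 1 < rows:
                edges += [(i, i + cols), (i + cols, i)]
    return edges


def morans_i(x, edges):
    """Moran's I with binary weights over the given directed edges."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    cross = sum(dev[i] * dev[j] for i, j in edges)
    return (n / len(edges)) * cross / sum(d * d for d in dev)


def permutation_pvalue(x, edges, n_perm=199, seed=0):
    """One-sided p-value: how often does a random relabelling match or beat the observed I?"""
    rng = random.Random(seed)
    observed = morans_i(x, edges)
    shuffled = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # the null: any spot could hold any value
        if morans_i(shuffled, edges) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


rows, cols = 6, 6
edges = grid_edges(rows, cols)
ramp = [float(i % cols) for i in range(rows * cols)]  # smooth left-to-right gradient
observed_i, p_value = permutation_pvalue(ramp, edges)
print(observed_i, p_value)
```

The shuffle destroys all spatial dependence, including the dependence that anatomy alone would produce. That is why a small p-value from this null tells you the gene is not randomly scattered, and little more.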

The second pipeline task is neighbourhood enrichment analysis via sq.gr.nhood_enrichment, which asks which anatomical region pairs co-occur more often than expected in the spatial graph. On the mouse brain dataset, this surfaces expected adjacencies (cortical layers neighbouring each other, hippocampal subfields in contact) but also quantifies the enrichment scores that can drive hypotheses about region-to-region signaling.
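The idea behind a neighbourhood enrichment score can be sketched in a few lines: count how many graph edges join two labels, then compare that count with the counts obtained after shuffling the labels. The version below is a simplified stand-in, not squidpy's implementation; the 1D strip of spots, the region names L1 to L3, and the 200 permutations are all hypothetical.

```python
import random
import statistics


def pair_count(labels, edges, a, b):
    """Number of directed edges going from a spot labelled a to a spot labelled b."""
    return sum(1 for i, j in edges if labels[i] == a and labels[j] == b)


def enrichment_z(labels, edges, a, b, n_perm=200, seed=0):
    """z-score of the observed a-b edge count against a label-shuffling null."""
    rng = random.Random(seed)
    observed = pair_count(labels, edges, a, b)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(pair_count(shuffled, edges, a, b))
    return (observed - statistics.mean(null)) / statistics.pstdev(null)


# Toy tissue: a strip of 30 spots split into three contiguous regions.
labels = ["L1"] * 10 + ["L2"] * 10 + ["L3"] * 10
edges = [(i, i + 1) for i in range(29)] + [(i + 1, i) for i in range(29)]

z_self = enrichment_z(labels, edges, "L1", "L1")  # a region with itself
z_far = enrichment_z(labels, edges, "L1", "L3")   # regions that never touch

print(round(z_self, 2), round(z_far, 2))
```

A contiguous region is strongly enriched for itself, and two regions that never share an edge come out depleted. The interesting cases in real tissue are the pairs in between.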

The TDA Layer: GUDHI Persistent Homology

The third pipeline task introduces GUDHI's persistent homology as a topological benchmark. The mechanics: for a selected gene, take the spatial coordinates of the spots where that gene is highly expressed as a point cloud, construct a Vietoris-Rips filtration at increasing epsilon radii, and track the birth and death of topological features as a persistence diagram. The features tracked are connected components (H0) and loops (H1).

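import gudhi

# high_expr_mask: boolean array over spots, True where the gene is highly expressed
# max_edge: largest edge length admitted into the filtration, in coordinate units
# Both are placeholders that must be defined upstream of this snippet.
points = adata.obsm["spatial"][high_expr_mask]
rips = gudhi.RipsComplex(points=points, max_edge_length=max_edge)
simplex_tree = rips.create_simplex_tree(max_dimension=2)
# persistence() computes the diagram and returns (dimension, (birth, death)) pairs
diagram = simplex_tree.persistence()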

The persistence diagram reveals topological features that survive across a range of scales. A high-persistence H1 feature means a loop in the expression pattern is robust rather than an artifact of one particular filtration radius. That robustness is the closest topological analog of statistical significance, although persistence measures a range of scales and is not a p-value.
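Birth and death are easiest to understand in dimension 0, where they can be computed by hand. In a Rips filtration every spot starts as its own component at radius 0, and a component dies when an edge first connects it to another one. The sketch below computes those death radii with a union-find structure in pure Python; it covers H0 only (not the H1 loops discussed above), and the coordinates are made up for illustration.

```python
import math


def h0_persistence(points):
    """Death radii of connected components in a Rips filtration (all are born at 0)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    # Every pairwise edge, processed from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components, so one of them dies here
            parent[ri] = rj
            deaths.append(length)
    return deaths  # n - 1 finite deaths; the last surviving component never dies


# Two tight clumps of spots, far apart.
clump_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
clump_b = [(10.0, 0.0), (11.0, 0.0), (10.0, 1.0)]
deaths = h0_persistence(clump_a + clump_b)
print(deaths)
```

Four components die at radius 1 as each clump fuses internally, and one survives until radius 9, when the clumps finally join. That long-lived component is the "two clusters" signal, and its lifetime is what a persistence diagram records.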

The tutorial identifies a specific gene whose spatial structure Moran's I underestimates relative to what the persistence diagram shows. This is a real finding. It is also potentially a filtration artifact: if you shift the epsilon range by 20%, you may surface a different set of topologically prominent features. The tutorial almost certainly does not include sensitivity analysis across filtration parameters, which is the obvious gap in any honest application of this method.
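A sensitivity check of the kind described here does not need much machinery. The sketch below perturbs one parameter by ±10% and ±20%, recomputes the longest H1 lifetime each time, and flags the feature as unstable if that lifetime drifts too far from its baseline. The diagram layout, a list of (dimension, (birth, death)) pairs, matches what GUDHI's persistence() returns; the toy_diagram function, the step sizes, and the tolerance are hypothetical stand-ins for a real "threshold, build complex, compute diagram" routine.

```python
def max_lifetime(diagram, dim):
    """Longest finite lifetime in one homology dimension."""
    lifetimes = [
        death - birth
        for d, (birth, death) in diagram
        if d == dim and death != float("inf")
    ]
    return max(lifetimes, default=0.0)


def sweep(compute_diagram, base_value, rel_steps=(-0.2, -0.1, 0.0, 0.1, 0.2), dim=1):
    """Recompute the top lifetime while perturbing one parameter by relative steps."""
    results = {}
    for step in rel_steps:
        value = base_value * (1.0 + step)
        results[step] = max_lifetime(compute_diagram(value), dim)
    return results


def is_stable(results, tolerance):
    """Stable if every perturbed lifetime stays within `tolerance` of the baseline."""
    baseline = results[0.0]
    if baseline == 0.0:
        return False
    return all(abs(v - baseline) / baseline <= tolerance for v in results.values())


def toy_diagram(threshold):
    """Hypothetical pipeline stand-in: one loop whose lifetime shrinks as the threshold grows."""
    return [(0, (0.0, float("inf"))), (1, (2.0, 2.0 + 10.0 / threshold))]


results = sweep(toy_diagram, base_value=1.0)
print(results)
print(is_stable(results, tolerance=0.3), is_stable(results, tolerance=0.1))
```

In a real analysis the function passed to sweep would wrap the expression threshold or the maximum edge length, and the tolerance would be chosen before looking at the results.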

The Orthogonality Problem: Why "Catching What Moran's I Misses" Is the Wrong Frame

Here is where the tutorial's framing deserves pushback — not to dismiss the contribution, but to understand it correctly.

Framing persistent homology as a tool that catches what Moran's I misses implies they are competing estimators of the same underlying spatial structure, with topology as the more powerful one. That framing is wrong, and adopting it will mislead you about when to apply each method.

A gene can have low Moran's I and high H1 persistence simultaneously. Moran's I sees local noise — expression jumps from spot to spot without local smoothness. The persistence diagram sees a large-scale ring of high expression that only coheres at coarser resolution than the immediate neighbourhood. The gene is both non-smooth locally and topologically structured globally. Both statistics are correct. They are describing different properties.

The converse also holds: a gene with very high Moran's I (a smooth gradient across tissue) can have entirely trivial topology — no persistent loops, no significant H0 clusters beyond the obvious connected component. Spatial smoothness and spatial topology are independent axes of spatial gene expression structure.

The tutorial's most durable contribution is not the specific gene it highlights. It is the implicit demonstration that these two axes exist and that a complete spatial analysis pipeline should assess both in parallel, not as competitors but as orthogonal quality dimensions. A gene that scores high on both — smooth locally and topologically structured globally — is a different kind of biological signal than one that scores high on only one axis.

Most published spatial omics pipelines produce a single ranked gene list from Moran's I or a similar statistic. The correct structure is a two-dimensional space of spatial statistics where topological persistence and local autocorrelation each contribute independent information.

Practical Implications for Production Pipelines

Pin squidpy to an exact version immediately. The coord_type API change from 1.6.x to 1.8.x demonstrates that squidpy is still in active breaking-change territory. A loose >=1.6 pin in a shared HPC environment module will silently corrupt spatial graphs for every downstream user who runs after an upgrade. Your environment.yml or requirements.txt should read squidpy==1.8.x, with x replaced by the specific patch release you validated against.
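A pin in the environment file can be backed by a runtime guard at the top of the pipeline, so that a mismatched environment fails loudly instead of producing a different graph. The sketch below uses only the standard library; the function names are hypothetical, and the validated version string is deliberately left for you to supply.

```python
import re
from importlib import metadata


def parse_version(text):
    """First three numeric components of a version string, e.g. '1.8.2' -> (1, 8, 2)."""
    return tuple(int(n) for n in re.findall(r"\d+", text.split("+")[0])[:3])


def installed_version(package):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_validated(installed, validated):
    """Raise if the installed version differs from the one the pipeline was validated on."""
    if parse_version(installed) != parse_version(validated):
        raise RuntimeError(
            f"installed version {installed} differs from validated version {validated}"
        )
    return True


def guard(package, validated):
    """Check one package at pipeline start; a missing package is also an error."""
    found = installed_version(package)
    if found is None:
        raise RuntimeError(f"{package} is not installed")
    return check_validated(found, validated)


# Example with plain strings, so the sketch runs without any package installed.
print(check_validated("1.2.3", "1.2.3"))
```

In a pipeline you would call guard("squidpy", ...) with the exact release you tested, before any call that builds the spatial graph.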

Build a spot-count gate before invoking GUDHI. A Vietoris-Rips complex needs O(n²) memory for the pairwise distances alone, and with max_dimension=2 it can hold up to O(n³) triangles when max_edge_length is not capped; the persistence computation is worst-case cubic in the number of simplices. The 2,688-spot Visium dataset is pedagogically convenient but not representative of production scale. At Visium HD (potentially hundreds of thousands of spots) or Xenium scale, the Rips complex construction will exhaust memory on any reasonable compute node before it finishes. You need a tight max_edge_length, a landmark subsampling strategy, or an alpha complex (whose size grows roughly linearly with n for 2D coordinates) before this approach is viable at next-gen platform scale. Ripser++, a GPU-accelerated implementation reported to be 10–100× faster than GUDHI's Python bindings on point cloud data, is a sensible default for datasets above 5,000 spots.
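The gate itself can be very small. The sketch below combines three pieces: a worst-case simplex count for an uncapped Rips complex up to dimension 2, a greedy farthest-point ("maxmin") landmark sampler, and a threshold check. All function names are hypothetical, and the 5,000-spot cut-off is simply the rule of thumb from the paragraph above, not a measured limit.

```python
import math


def worst_case_simplices(n_points):
    """Upper bound on simplex counts for an uncapped Rips complex up to dimension 2."""
    return {
        "vertices": n_points,
        "edges": math.comb(n_points, 2),
        "triangles": math.comb(n_points, 3),
    }


def maxmin_landmarks(points, k, start=0):
    """Greedy farthest-point sampling: each new landmark is the point farthest from those chosen."""
    chosen = [start]
    dist_to_chosen = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=dist_to_chosen.__getitem__)
        chosen.append(nxt)
        # Keep, for every point, its distance to the nearest landmark so far.
        dist_to_chosen = [
            min(d, math.dist(p, points[nxt])) for d, p in zip(dist_to_chosen, points)
        ]
    return chosen


def choose_strategy(n_spots, max_direct=5000):
    """Run persistence directly on small inputs, subsample first on large ones."""
    return "direct" if n_spots <= max_direct else "subsample"


grid = [(float(r), float(c)) for r in range(10) for c in range(10)]
landmarks = maxmin_landmarks(grid, k=4)

print(worst_case_simplices(2688)["triangles"])
print(landmarks, choose_strategy(2688), choose_strategy(100_000))
```

Even at 2,688 spots the uncapped triangle count runs into the billions, which is why max_edge_length matters on the tutorial dataset too. Maxmin sampling is one common landmark choice because it preserves the outline of the point cloud; it is sensitive to outliers, so treat it as a starting point.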

Validate top SVGs against a held-out serial section. The mouse brain Visium section has strong anatomical structure that makes Moran's I p-values anti-conservative. Before reporting any gene as a confirmed spatially variable hit, verify it replicates in an independent section. Replication in independent samples is standard practice in bulk RNA-seq but is frequently skipped in spatial omics because serial sections add sample cost.

Consider nnSVG or SpatialDE2 for SVG detection in biology papers. For contexts where you need interpretable effect sizes and calibrated p-values under tissue heterogeneity, such as a methods section that has to survive peer review, nnSVG and SpatialDE2 are reported to outperform Moran's I on both power and calibration in published benchmarks. The squidpy + GUDHI stack presented here is the right choice when you need a single AnnData-centric workflow for teaching, prototyping, or explicitly comparing TDA to classical spatial statistics as a research question. It is not the right choice when your goal is defensible SVG lists for a biology paper.

Pre-filter awareness matters for the TDA benchmark. Selecting 3,000 highly variable genes upstream of any TDA analysis means you have already performed feature selection that excludes low-variance but spatially structured genes. A gene with a ring-shaped expression domain that spans multiple anatomical regions might show low per-spot variance while still having highly persistent topology. The feature selection step is biased toward genes that Moran's I already handles well, which means the benchmark comparison in the tutorial is not a clean test of the two methods' relative power. A complete benchmark would run TDA on the full gene set or on a separately defined feature set that is agnostic to spatial variability.

The Pipeline That Should Replace the Autocorrelation Monoculture

The spatial omics field needs a two-axis spatial characterization framework, not a debate about whether topology beats autocorrelation. The practical pipeline looks like this: run Moran's I (or nnSVG for production biology) to identify locally smooth spatially variable genes, then run persistent homology on a broader gene set to identify topologically structured genes, and treat the intersection and exclusive sets as distinct biological categories worth investigating separately.

A gene in both categories — smooth locally and topologically structured globally — likely marks a well-defined anatomical domain. A gene with high persistence but low Moran's I might delineate a sparse but geometrically coherent cell population, the kind of signal that gets filtered out of every standard spatial omics paper. A gene with high Moran's I but trivial topology might be a gradient across tissue with no discrete domain structure.
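The two-axis view translates directly into a small classification step once each gene has both scores. The sketch below assumes you already have a Moran's I value and a maximum H1 persistence per gene; the gene names, scores, and cut-offs are invented for illustration, and real cut-offs would need to be calibrated per dataset.

```python
def classify(morans_i, h1_persistence, i_cut, p_cut):
    """Place a gene in one of four categories from its two spatial scores."""
    smooth = morans_i >= i_cut
    structured = h1_persistence >= p_cut
    if smooth and structured:
        return "anatomical domain"
    if structured:
        return "sparse coherent population"
    if smooth:
        return "gradient"
    return "unstructured"


# Hypothetical per-gene scores: (Moran's I, max H1 persistence in coordinate units).
scores = {
    "gene_a": (0.72, 410.0),
    "gene_b": (0.08, 380.0),
    "gene_c": (0.65, 12.0),
    "gene_d": (0.03, 9.0),
}

categories = {
    gene: classify(i, p, i_cut=0.3, p_cut=100.0) for gene, (i, p) in scores.items()
}

for gene, category in categories.items():
    print(gene, category)
```

Keeping the output as categories rather than a single merged ranking preserves the distinction the two statistics are drawing, which a combined score would blur.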

The squidpy 1.8.x migration note is not a footnote. It is the kind of breaking-change documentation that determines whether a team's spatial analysis is correct or not. Silent graph corruption from a wrong coord_type argument propagates through every downstream statistic — Moran's I, neighbourhood enrichment, and the TDA features alike. Getting the spatial graph right is foundational; everything else is downstream of that construction.

The tutorial's real contribution is not a new method. It is the demonstration that a single spatial statistic is insufficient, that the tools to run multiple orthogonal analyses already exist in the Python ecosystem, and that the 2,688-spot Visium mouse brain section is exactly the right scale at which to develop intuition for how these methods relate before scaling to platforms where computational costs force hard trade-offs. Start here, understand what each statistic is actually measuring, then decide what your production pipeline needs.

Sources & Editorial Disclosure

This article was researched and written with AI assistance (Claude by Anthropic) as part of StackRadar's automated editorial pipeline. Content was synthesised from the following public developer community sources: Dev.to.

All technical claims, version numbers, benchmarks, and project details should be independently verified against official documentation or the original sources listed above. StackRadar analyses and synthesises publicly available information and does not claim original authorship of the underlying events, projects, or research described. Mention of any project, product, or organisation does not constitute an endorsement by StackRadar. This content is provided for informational purposes only — 2026-06-20.
