BGPT: Paper Review: Assessing genome conservation on pangenome graphs with PanSel

Quick Explanation

Quick takeaway: PanSel is a pragmatic, well-documented C++ tool that computes per-window conservation on pangenome graphs. It pairs an ODGI-style Jaccard similarity with a two-component mixture fit to separate conserved from divergent windows. On the Draft Human Pangenome, its results agree with PhyloP/PhastCons, exon density, ChromHMM states, and structural-variant density, and runtime and memory are practical for chromosome-scale graphs (<22 min, <20 GB per chromosome). Code and pipeline are available for reproduction.

Long Explanation

Visual paper analysis — PanSel: per-window conservation on pangenome graphs

Visual first — figures reproduce and summarise reported numeric results from the paper and supporting data; explanations follow each figure. All claims are inline-cited.

Explanation: The manuscript reports that PanSel processes each chromosome of the Draft Human Pangenome (MiniGraph-Cactus build) in under 22 minutes with under 20 GB of RAM on a single Xeon E5 core (one thread per chromosome)

Explanation: Using a 10 kb window and a P-value threshold of 5%, PanSel labelled ~79 Mb as conserved and 162 Mb as divergent across autosomes. The paper reports 4,861 genes overlapping conserved windows and 12,286 overlapping divergent windows, consistent with the claim that divergent windows harbor more genes annotated in variable regions (immune- and antigen-related)

Explanation: The authors compared PanSel bins to structural-variant overlap (SVs), vertebrate PhyloP scores (100-vertebrate), PhastCons (human pangenome MSA), pan-conserved segment tags, exon coverage (GENCODE), and ChromHMM states. They report consistent trends: divergent bins overlap more SVs and heterochromatin; conserved bins show higher PhyloP/PhastCons, exon coverage, and active transcription marks

Methods — succinct technical summary

Input: GFA pangenome graph (one chromosome per run); PanSel detects 'boundary segments' every s nucleotides on a reference path and extracts sub-paths between boundaries for each haplotype path, then computes weighted pairwise Jaccard indices between sub-paths in each bin (ODGI-style Jaccard)
Statistical model: log-transform of Jaccard scores, then mixture fit: Gaussian component for the conserved peak and log-normal for the divergent heavy tail; significance calls via fitted mixture P-values
Validation: comparisons to PhyloP (100-vertebrate), PhastCons on pangenome MSA, pan-conserved segment tags, GENCODE exons, ChromHMM states; example locus: NBPF20 visualised in Bandage to illustrate SV-rich region and transcript changes
Software & availability: C++11, no external dependencies; repository: https://github.com/mzytnicki/pansel and analysis pipeline at https://github.com/mzytnicki/pansel_paper
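
The per-bin similarity described in the Input step above can be sketched as follows. This is a minimal illustration, assuming one common definition of a weighted Jaccard index (shared segments weighted by their length in nucleotides, with repeated traversals counted). The exact weighting used by PanSel or ODGI is not given in this review, and the function names, segment IDs and lengths below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def weighted_jaccard(path_a, path_b, seg_len):
    """Weighted Jaccard between two sub-paths given as lists of segment IDs.

    Assumed definition: sum of min traversal counts over sum of max
    traversal counts, each weighted by segment length.
    """
    count_a, count_b = Counter(path_a), Counter(path_b)
    shared = total = 0
    for seg in count_a.keys() | count_b.keys():
        weight = seg_len[seg]
        shared += min(count_a[seg], count_b[seg]) * weight
        total += max(count_a[seg], count_b[seg]) * weight
    # Two empty sub-paths are treated as identical.
    return shared / total if total else 1.0


def mean_pairwise_jaccard(subpaths, seg_len):
    """Average the index over all haplotype pairs in one bin."""
    pairs = list(combinations(subpaths.values(), 2))
    if not pairs:
        raise ValueError("need at least two haplotype sub-paths")
    return sum(weighted_jaccard(a, b, seg_len) for a, b in pairs) / len(pairs)


# Hypothetical bin: three haplotypes, one 1-nt SNP bubble (s2 vs s3).
seg_len = {"s1": 100, "s2": 1, "s3": 1, "s4": 50}
subpaths = {
    "hapA": ["s1", "s2", "s4"],
    "hapB": ["s1", "s3", "s4"],
    "hapC": ["s1", "s2", "s4"],
}
print(weighted_jaccard(subpaths["hapA"], subpaths["hapB"], seg_len))  # 150/152, about 0.987
print(mean_pairwise_jaccard(subpaths, seg_len))
```

Weighting by segment length means a single-nucleotide bubble barely lowers the score, while a large insertion present in only some haplotypes lowers it substantially, which matches the review's observation that divergent bins are enriched in structural variants.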

Strengths — evidence-based

Graph-native: works directly on GFA graphs avoiding coordinate liftovers or reference-bias inherent to linear references, aligning with the pangenome movement (Minigraph-Cactus/PGGB ecosystems)
Scalability & simplicity: low-dependency C++ implementation; reported chromosome-scale runs in practical time and memory; sliding-window approach avoids needing base-level alignments on graphs (which are currently expensive/ambiguous)
Concordant validation: PanSel scores correlate with independent conservation metrics (PhyloP/PhastCons), exon density, SV density and chromatin states — cross-annotation concordance strengthens biological plausibility of scores

Limitations, blindspots & caveats

Windowed resolution: per-window (e.g. 1 kb) scores trade spatial resolution for graph-compatibility — small conserved elements (short TFBS, short exons, splice sites) may be missed or diluted; authors acknowledge 1 kb is often too small to capture full gene extents and recommend larger windows (10 kb) for gene-level signals
Dependence on graph topology & path set: Jaccard similarity of subpaths is influenced by graph construction (Minigraph‑Cactus vs PGGB) and by the included haplotypes. The authors report Pearson correlations of 0.42–0.84 between PanSel scores on different graph constructions, ranging from moderate to strong, which indicates sensitivity to graph-building choices
Mixture model assumptions: a two-component mixture (Gaussian + log-normal) is pragmatic but may not capture multi-modal or complex distributions in other species/graphs (e.g., populations with strong substructure or long-branch haplotypes); authors fit mixture empirically and show empirical fits (Supplementary Figures), but further benchmarking on simulated graphs with controlled conserved/divergent regions is not presented in depth
Validation scope: validation relies primarily on concordance with existing annotations (PhyloP, PhastCons, GENCODE, ChromHMM) in human and a single bacterial example (Myxococcus xanthus). Broader cross-species tests (plants, fungi, animals with different divergence and SV spectra) and controlled simulations (known conserved/divergent insertions) would increase confidence in generality
Sequence similarity applicability: authors state PanSel provides reliable results only down to ~98% estimated sequence similarity, which limits use in very divergent species comparisons or highly recombinant populations without additional tuning or alternative similarity metrics
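
The mixture-model caveat above is easier to weigh with the model written out. The sketch below assumes a score x > 0 on which conserved windows form a Gaussian peak and divergent windows a log-normal tail, and it classifies a window by the posterior probability of the divergent component. The parameter values are invented for illustration, and the posterior rule is a stand-in: this review does not say how PanSel fits the mixture or derives its P-values.

```python
import math
from statistics import NormalDist


def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal whose logarithm is Normal(mu, sigma)."""
    if x <= 0:
        return 0.0
    return NormalDist(mu, sigma).pdf(math.log(x)) / x


def divergent_posterior(x, weight_div, gauss, ln_mu, ln_sigma):
    """Posterior probability that score x comes from the divergent component."""
    p_cons = (1.0 - weight_div) * gauss.pdf(x)
    p_div = weight_div * lognormal_pdf(x, ln_mu, ln_sigma)
    total = p_cons + p_div
    if total == 0.0:
        return float("nan")  # both densities underflow: no evidence either way
    return p_div / total


# Hypothetical fitted parameters (assumptions, not values from the paper).
conserved_peak = NormalDist(mu=1.0, sigma=0.2)
params = dict(weight_div=0.3, gauss=conserved_peak, ln_mu=math.log(3.0), ln_sigma=0.5)

for score in (1.0, 2.0, 4.0):
    print(score, round(divergent_posterior(score, **params), 4))
```

With only two components, any third mode in the data (for example a subpopulation-specific haplotype cluster) has to be absorbed by one of the two densities, which is the failure case the caveat describes.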

Reproducibility & practical use

Code & pipeline publicly released: repository links are provided in the paper (PanSel and pansel_paper) enabling direct reproduction on the same input graphs; authors declare no funding and no COI
Dependencies: minimal (pure C++11), so portability is high; but users must provide GFA graphs and may require pre-processing with ODGI/minigraph/PGGB depending on graph source

How to improve / next steps (practical, testable)

Benchmark on simulated pangenome graphs with implanted conserved/divergent blocks (varying lengths, allele frequencies, and SV densities) to measure recall/precision across window sizes and noise levels (this would test the mixture-fit assumptions under controlled conditions).
Systematically compare PanSel on graphs built by multiple graph builders (Minigraph-Cactus, PGGB with varied parameters, wfmash/seqwish smoothing) across identical input assemblies to quantify sensitivity to graph topology and path sampling (the authors present some PGGB comparison, but further parameter sweeps would be valuable).
Explore alternative or hierarchical statistical models — e.g., Gaussian mixture with more components or non-parametric density estimation (kernel mixture, empirical null) — to better model multimodal or population-structured Jaccard distributions observed in some graphs.
Integrate base-level mapping where available (e.g., for orthogonally aligned regions) to complement window scores and flag small conserved elements (exons, TFBS) that may be diluted in large bins.
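
The simulation benchmark proposed in the first item above reduces to comparing per-window calls against implanted truth labels. A minimal scoring sketch follows, assuming each window carries exactly one label. The label names and the toy data are hypothetical, and nothing here reflects output produced by PanSel itself.

```python
def precision_recall(truth, calls, label):
    """Precision and recall for one label over aligned per-window lists."""
    if len(truth) != len(calls):
        raise ValueError("truth and calls must cover the same windows")
    tp = sum(1 for t, c in zip(truth, calls) if t == label and c == label)
    fp = sum(1 for t, c in zip(truth, calls) if t != label and c == label)
    fn = sum(1 for t, c in zip(truth, calls) if t == label and c != label)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Toy example: six simulated windows with implanted labels.
truth = ["conserved", "neutral", "divergent", "divergent", "neutral", "conserved"]
calls = ["conserved", "conserved", "divergent", "neutral", "neutral", "conserved"]

for label in ("conserved", "divergent"):
    p, r = precision_recall(truth, calls, label)
    print(label, round(p, 3), round(r, 3))
```

Repeating this over a grid of window sizes and implanted block lengths would show where short conserved elements start to be diluted, which is the resolution concern raised in the limitations.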

Conclusions — balanced assessment

PanSel provides a practical, graph-native approach for assessing intra-species conservation on pangenome graphs. It is well-engineered (C++11, public code), computationally tractable at chromosome scale, and produces biologically plausible results that align with established conservation and functional annotations in human. Major limitations are resolution trade-offs from windowing, sensitivity to graph construction and included paths, and model assumptions in mixture fitting — all of which are acknowledged in the paper. The tool is useful for consortia and groups building pangenomes who need a quick, interpretable per-region conservation metric, but should be complemented by simulation benchmarks, cross-graph validations, and smaller-scale base-resolution analyses when interpreting short regulatory elements or when working with more divergent datasets

Updated: March 13, 2026