BGPT: Paper Review: Multi-domain rule-based phenotyping algorithms enable improved GWAS signal

Quick Explanation

Quick appraisal — multi-domain rule-based phenotyping improves GWAS signal (UKBB, n=405,811)

Concise verdict: the authors provide strong, well-documented evidence that higher-complexity, multi-domain rule-based EHR phenotypes (ADO and some OHDSI algorithms) increase GWAS effective sample size and power and yield more coding and eQTL-colocalized hits, without reducing heritability or PRS performance. Generalizability beyond UK Biobank, mapping loss from ICD→SNOMED, and limited phenotype evaluation (PheValuator coverage) remain important blind spots, so external replication is required.

Long Explanation

Visual paper analysis — Multi-domain rule-based phenotyping algorithms enable improved GWAS signal

Dataset

UK Biobank (OMOP CDM) — 405,811 unrelated samples after QC

(UKBB application 100316)

Phenotypes compared

2+ condition (low), Phecode (medium), OHDSI (variable), ADO (high where available)

Diseases

Alzheimer's, Asthma, COPD, MI, RA, SLE, T2D

Figure note: the authors report that high-complexity algorithms yielded the largest number of cases and the greatest number of unique cases not captured by other algorithms. The visualization is a normalized, qualitative mapping of the results presented in Fig. 2 and the Supplementary Figures of the paper.

Figure note: the GAS power curves in the paper (Fig. 2b) show higher power for high-complexity cohorts at the same effect size. The authors ran GAS with the disease prevalence and the case/control counts of the UKBB cohort for each phenotype.
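
To see why more cases at the same effect size raise power, here is a minimal sketch based on a normal approximation for a 1-df allelic test. It is not the GAS calculator's implementation, and the case/control counts, odds ratio, and allele frequency in the usage lines are made-up illustration values, not figures from the paper.

```python
import math
from statistics import NormalDist


def effective_sample_size(n_cases, n_controls):
    """Effective sample size of a case-control design: 4 / (1/cases + 1/controls)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def approx_power(n_cases, n_controls, odds_ratio, maf, alpha=5e-8):
    """Approximate power of a two-sided 1-df allelic test (normal approximation).

    Assumes a log-additive model, so the non-centrality parameter is roughly
    beta^2 * maf * (1 - maf) * N_eff / 2, with beta = log(odds_ratio).
    """
    nd = NormalDist()
    beta = math.log(odds_ratio)
    n_eff = effective_sample_size(n_cases, n_controls)
    ncp = beta ** 2 * maf * (1.0 - maf) * n_eff / 2.0
    shift = math.sqrt(ncp)
    z_crit = -nd.inv_cdf(alpha / 2.0)  # genome-wide significance threshold
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


# Hypothetical cohorts: same total size, different numbers of cases captured.
low_complexity = approx_power(10_000, 390_000, odds_ratio=1.1, maf=0.3)
high_complexity = approx_power(20_000, 380_000, odds_ratio=1.1, maf=0.3)
print(f"low: {low_complexity:.3f}  high: {high_complexity:.3f}")
```

Because the effective sample size is dominated by the smaller group, a phenotype definition that captures more true cases shifts the whole power curve upward even when the total cohort size is unchanged.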

Key evidence & methods evaluation

Power, hits, and unique hits: high-complexity algorithms produced the largest number of genome-wide significant hits and the most unique hits (paper Fig. 2c; Supplementary Tables).
Heritability & genetic correlation: liability-scale h2 (LDSC) differed little across algorithms (maximum range ≈6% within a disease), and the genetic correlation between algorithms for the same disease was >0.93. This implies that the phenotype definition affects discovery but not broad SNP-heritability estimates.
Functional annotation & colocalization: high-complexity algorithms produced more novel coding hits overlapping exons and more GWAS-eQTL colocalizations (eCAVIAR across GTEx tissues), suggesting improved biological interpretability (paper Fig. 4).
PRS performance: PRS AUROC differences across algorithms were minimal (≤5% within a disease); improved discovery did not translate into systematically better PRS performance in cross-validated tests within UKBB (paper Fig. 5).
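
Liability-scale h2 depends on the assumed population prevalence and on the case fraction of each cohort, so comparing it across algorithms requires the same conversion for every phenotype definition. The sketch below applies the standard observed-to-liability transformation for ascertained case-control data (Lee et al. 2011), the type of correction LDSC applies when given a sample and a population prevalence. The input values are hypothetical, not numbers from the paper.

```python
from statistics import NormalDist


def observed_to_liability_h2(h2_obs, pop_prevalence, sample_prevalence):
    """Convert observed-scale h2 to the liability scale.

    h2_liab = h2_obs * K^2 (1 - K)^2 / (z^2 * P (1 - P)), where K is the
    population prevalence, P is the case fraction in the sample, and z is
    the standard normal density at the liability threshold for K.
    """
    nd = NormalDist()
    k, p = pop_prevalence, sample_prevalence
    threshold = nd.inv_cdf(1.0 - k)
    z = nd.pdf(threshold)
    return h2_obs * (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))


# Hypothetical inputs: observed-scale h2 of 0.02, 10% population prevalence,
# and 5% of the analysed cohort labelled as cases.
h2_liab = observed_to_liability_h2(0.02, pop_prevalence=0.10, sample_prevalence=0.05)
print(f"liability-scale h2: {h2_liab:.3f}")
```

Since each algorithm yields a different case fraction, the sample prevalence term changes per phenotype definition even when the assumed population prevalence is held fixed.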

Critical appraisal — strengths and limitations

Strengths

Large sample (UKBB n=405,811) with consistent genotype QC and relatedness filtering reported.
Multiple phenotype algorithms implemented reproducibly in OMOP CDM with code availability (GitHub repos), enabling replication where UKBB access is granted.
Comprehensive downstream annotation pipeline (ANNOVAR, GTEx eQTLs, eCAVIAR, MAGMA, PGRM) that allows both statistical and biological evaluation of hits.

Limitations & blind spots

ICD→SNOMED mapping for the Phecode and ADO conversions risks information loss; the authors explicitly note that unmapped codes could bias results against ICD-origin algorithms (ADO/Phecode).
PheValuator evaluation was incomplete (Alzheimer's, RA, and SLE could not be evaluated; the T2D model did not converge), leaving PPV/NPV estimates and dilution adjustments partial for several diseases.
All analyses are in the UK Biobank (largely European ancestry). The authors correctly caution about coding heterogeneity and generalizability to other EHR systems and ancestries; replication in other biobanks is needed.
Colocalization interpretation: eQTL colocalization in non-disease tissues (e.g., artery aorta for Alzheimer's) can arise because larger tissue sample sizes yield stronger eQTL signals, not necessarily from disease biology; the authors note this confounder and the need for tissue-relevant interpretation.
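
The mapping-loss concern can be made measurable. A minimal sketch, assuming the source codes of an algorithm are available as a flat list of record-level codes and the mapping as a dictionary; the function name, the code labels, and the concept labels are placeholders for illustration, not real ICD or SNOMED entries and not the authors' pipeline:

```python
from collections import Counter


def mapping_loss(source_codes, code_map):
    """Summarise how much of a code list has no target in a mapping table.

    source_codes: one entry per record (codes may repeat).
    code_map: dict from source code to target concept.
    """
    counts = Counter(source_codes)
    total_records = sum(counts.values())
    unmapped = {code: n for code, n in counts.items() if code not in code_map}
    return {
        "distinct_unmapped": len(unmapped) / len(counts) if counts else 0.0,
        "records_unmapped": sum(unmapped.values()) / total_records if total_records else 0.0,
        "unmapped_codes": sorted(unmapped),
    }


# Placeholder codes: "SRC_C" has no target concept in the hypothetical map.
records = ["SRC_A", "SRC_A", "SRC_B", "SRC_C"]
hypothetical_map = {"SRC_A": "TARGET_1", "SRC_B": "TARGET_2"}
print(mapping_loss(records, hypothetical_map))
```

Reporting both the share of distinct codes and the share of records matters, because a single unmapped but frequently used code can remove far more cases than many rare ones.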

Practical recommendations for researchers

Use multi-domain, clinician-curated phenotyping (OHDSI/ADO) where OMOP-compatible multi-domain data exist to increase GWAS power and functional hits. Always report the mapping steps (ICD→SNOMED) and quantify unmapped code loss.
Validate phenotype PPV/NPV where possible (PheValuator or chart review) before large-scale GWAS; include dilution-adjusted effective sample size in power calculations.
Interpret eQTL colocalization with tissue sample-size awareness (prefer tissue with biological plausibility and sufficient GTEx sample size) and complement with TWAS or single-cell eQTL when available.
Replicate findings in independent biobanks (All of Us, FinnGen, other OMOP-converted EHR resources) to test generalizability across coding practice and ancestry.
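
As an illustration of the dilution adjustment recommended above, here is a minimal sketch under a simple misclassification model. It assumes misclassified individuals are a random sample of the opposite true status, so the case-control allele-frequency difference shrinks by a factor of (PPV + NPV - 1) and the effective sample size by its square. This is a textbook-style approximation, not necessarily the formula used in the paper, and the counts and PPV/NPV values are hypothetical.

```python
def diluted_effective_sample_size(n_cases, n_controls, ppv, npv):
    """Effective sample size after adjusting for phenotype misclassification.

    Assumption: the observed allele-frequency difference is attenuated by
    (PPV + NPV - 1), so the test's non-centrality (and hence N_eff) scales
    with the square of that factor.
    """
    if not (0.0 <= ppv <= 1.0 and 0.0 <= npv <= 1.0):
        raise ValueError("ppv and npv must lie in [0, 1]")
    n_eff = 4.0 / (1.0 / n_cases + 1.0 / n_controls)
    attenuation = max(ppv + npv - 1.0, 0.0)
    return n_eff * attenuation ** 2


# Hypothetical comparison: a larger but noisier cohort versus a smaller, cleaner one.
noisy = diluted_effective_sample_size(20_000, 380_000, ppv=0.70, npv=0.99)
clean = diluted_effective_sample_size(12_000, 388_000, ppv=0.95, npv=0.99)
print(f"noisy: {noisy:.0f}  clean: {clean:.0f}")
```

The comparison shows why raw case counts alone can mislead: a cohort with more labelled cases but a lower PPV may carry less usable information than a smaller, cleaner one.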

What would falsify the main claim?

The claim that high-complexity, multi-domain phenotypes improve GWAS discovery would be falsified if independent biobank analyses (with similar OMOP conversions and careful ICD→SNOMED mapping) showed that simpler algorithms (2+ condition or Phecode) yield equal or greater numbers of true disease-specific functional hits after proper dilution adjustment and tissue-aware colocalization. The case against it would be strongest if the additional hits from complex algorithms systematically mapped to non-disease processes or arose as false positives due to misclassification. The authors provide partial mitigation (PheValuator, replication metrics, shared effect-size correlations >0.95), but cross-biobank replication is necessary.

Notes: This visual review focuses strictly on evidence reported in the paper, caveats the authors themselves raise (ICD→SNOMED mapping, PheValuator coverage, UKBB-specific coding heterogeneity), and recommends replication and tissue-aware functional follow-up. All claims above cite the primary paper.

Updated: March 10, 2026