Reproduce ESM2 UMAP Scatter for P02766 (Transthyretin)
Goal: show how to reproduce the 2D UMAP scatter of ESM2 residue embeddings for transthyretin (UniProt P02766), provide a runnable visualization scaffold with precise citations, and describe how a bioinformatics agent can compute the real embeddings and regenerate the exact figure.
Key points and prerequisites
ESM2 pretrained models produce per-residue embeddings. The referenced workflow reports using ESM2 T36 and 1280-dimensional residue embeddings, which the authors reduced to 2D with UMAP for visualization and classification of variants in P02766 (method summary from the preprint). In the public fair-esm release, the 36-layer checkpoint (esm2_t36_3B_UR50D) outputs 2560-dimensional embeddings and the 33-layer checkpoint (esm2_t33_650M_UR50D) outputs 1280, so confirm the checkpoint against the preprint.
ESM2 model weights are available from the FAIR ESM resources and the ESM GitHub repository; the authors recommend the esm2_t36 model family for residue embeddings.
UMAP, through the umap-learn implementation, is the standard tool for projecting the high-dimensional embeddings to 2D for plotting and clustering; its parameters (n_neighbors, min_dist, metric) strongly affect the layout.
Variant labels come from ClinVar, the variant source used in the preprint (the authors accessed ClinVar in June 2025).
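Because the embedding width depends on the checkpoint, it can help to look it up before comparing against the numbers quoted from the preprint. The sketch below is a minimal lookup, assuming the checkpoint names and embedding widths of the public fair-esm release. The parameter counts are the nominal values from the checkpoint names, and the helper functions are hypothetical names introduced for illustration; the memory estimate covers weights only, not activations.

```python
# Nominal figures for the public ESM2 checkpoints:
# name -> (number of layers, embedding width, nominal parameter count).
# Parameter counts are the rounded values from the checkpoint names.
ESM2_CHECKPOINTS = {
    "esm2_t6_8M_UR50D": (6, 320, 8e6),
    "esm2_t12_35M_UR50D": (12, 480, 35e6),
    "esm2_t30_150M_UR50D": (30, 640, 150e6),
    "esm2_t33_650M_UR50D": (33, 1280, 650e6),
    "esm2_t36_3B_UR50D": (36, 2560, 3e9),
    "esm2_t48_15B_UR50D": (48, 5120, 15e9),
}


def embedding_dim(name):
    """Return the per-residue embedding width for a checkpoint name."""
    return ESM2_CHECKPOINTS[name][1]


def checkpoints_with_dim(dim):
    """List the checkpoints whose embedding width equals dim."""
    return [n for n, (_, d, _) in ESM2_CHECKPOINTS.items() if d == dim]


def approx_weight_gib(name, bytes_per_param=4):
    """Rough weight memory in GiB (4 bytes per parameter for float32)."""
    n_params = ESM2_CHECKPOINTS[name][2]
    return n_params * bytes_per_param / 2**30


print(embedding_dim("esm2_t36_3B_UR50D"))
print(checkpoints_with_dim(1280))
print(round(approx_weight_gib("esm2_t36_3B_UR50D"), 1), "GiB in float32")
```

A lookup like this makes a mismatch between the reported layer count and the reported embedding width visible before any model weights are downloaded.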
What I am delivering now
An interactive, working Plotly 2D scatter scaffold that reproduces the look of an ESM2 UMAP scatter for P02766 using placeholder example points, so you can inspect the exact plotting code and style. This scaffold plots points for wild-type residue embedding positions, pathogenic variants, benign variants, and VUS regions as described in the preprint.
Clear, step-by-step, runnable instructions and the exact Python pipeline you should run to compute true results (ESM2 embeddings → UMAP → plot). The bioinformatics agent can run that pipeline on request (recommended because ESM2 embeddings require model weights and GPU resources).
Accurate inline citations for every methodological claim above and notes on limitations and reproducibility.
Interactive visualization (live in browser)
Notes: The plot above is a scaffold that is faithful in styling only; it is not computed from the original ESM2 residue embeddings. To reproduce the exact published figure you need the per-variant residue embeddings produced by ESM2 and the same UMAP parameters used in the paper (min_dist 0.0, metric euclidean, variable n_neighbors). The pipeline that follows, run locally or by the agent, computes the real embeddings and remakes the plot.
Exact pipeline to reproduce the ESM2 UMAP scatter for P02766
High-level steps
Obtain the canonical UniProt sequence for P02766 and a curated list of missense variants mapped to residue positions (ClinVar or other curated sources).
For each single amino acid variant, create the mutated sequence and run the ESM2 model (esm2_t36 or the exact model used by the authors) to extract the residue-level embedding (1280-dimensional as reported in the preprint) for the mutated residue, with no fine-tuning.
Collect the embedding vectors for all variants and, optionally, the wild-type residue embedding. Standardize or normalize the vectors if desired.
Run UMAP with the parameters used in the study (min_dist 0.0, metric euclidean) and set n_neighbors to a value appropriate for your dataset (the authors used a variable n_neighbors to mitigate class imbalance) to project the embedding vectors to 2D.
Plot the 2D embedding, color by ClinVar label (benign, pathogenic, VUS), and annotate the wild-type position(s). Evaluate separability and optionally compute distances from wild type for rule-based classification as done in the preprint.
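The first two steps above (variant list to mutated sequence) can be sketched with the standard library alone. This is a minimal sketch, assuming variants arrive as HGVS protein strings with three-letter codes such as p.Val50Met; `parse_missense` and `mutate_sequence` are hypothetical helper names, and the sequence in the usage lines is a toy string, not the P02766 sequence. The reference-residue check is there to catch numbering mismatches, for example precursor versus mature-protein numbering.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Matches simple substitutions such as "p.Val50Met".
_MISSENSE = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_missense(hgvs_p):
    """Parse 'p.Val50Met' into ('V', 50, 'M'); return None if not a missense."""
    match = _MISSENSE.match(hgvs_p.strip())
    if match is None:
        return None
    ref3, pos, alt3 = match.groups()
    # Codes such as 'Ter' (stop) are not standard residues and are skipped.
    if ref3 not in THREE_TO_ONE or alt3 not in THREE_TO_ONE:
        return None
    ref, alt = THREE_TO_ONE[ref3], THREE_TO_ONE[alt3]
    if ref == alt:
        return None  # synonymous, not a missense change
    return ref, int(pos), alt


def mutate_sequence(sequence, ref, pos, alt):
    """Substitute alt at 1-indexed pos after checking the reference residue."""
    if not 1 <= pos <= len(sequence):
        raise ValueError(f"position {pos} is outside the sequence")
    if sequence[pos - 1] != ref:
        raise ValueError(
            f"expected {ref} at position {pos}, found {sequence[pos - 1]}"
        )
    return sequence[:pos - 1] + alt + sequence[pos:]


# Toy usage: a made-up 8-residue sequence, not transthyretin.
toy_sequence = "MKTAYIAK"
ref, pos, alt = parse_missense("p.Thr3Ala")
print(mutate_sequence(toy_sequence, ref, pos, alt))
```

Raising on a reference mismatch, rather than silently substituting, keeps mislabeled positions out of the embedding set.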
Exact minimal Python recipe (runnable locally if you have a GPU and the model weights)
Full Python script to compute the embeddings, run UMAP, and produce the Plotly scatter:
    import numpy as np
    import plotly.graph_objects as go
    import torch
    import umap
    from esm import pretrained
    from sklearn.preprocessing import StandardScaler

    # 1. Load the canonical sequence and the variants.
    # load_uniprot_sequence and load_variant_list_from_clinvar are placeholders
    # that you must provide; they are not part of any library.
    seq = load_uniprot_sequence('P02766')
    variants = load_variant_list_from_clinvar('P02766')
    # Variant list example format:
    # [{'pos': 49, 'aa': 'A', 'label': 'pathogenic'}, ...] with 1-indexed 'pos'.

    def apply_point_mutation(sequence, pos, aa):
        # Substitute the residue at 1-indexed position pos with aa.
        return sequence[:pos - 1] + aa + sequence[pos:]

    # 2. Load the ESM2 model.
    # esm2_t36_3B_UR50D returns 2560-dimensional embeddings; swap in
    # pretrained.esm2_t33_650M_UR50D() if 1280 dimensions are required.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model, alphabet = pretrained.esm2_t36_3B_UR50D()
    model = model.eval().to(device)
    batch_converter = alphabet.get_batch_converter()

    def get_residue_embedding(sequence, target_pos):
        # target_pos is 1-indexed. The batch converter prepends a BOS token at
        # index 0, so token index target_pos is the residue at target_pos.
        data = [('id', sequence)]
        batch_labels, batch_strs, batch_tokens = batch_converter(data)
        batch_tokens = batch_tokens.to(device)
        with torch.no_grad():
            results = model(batch_tokens, repr_layers=[model.num_layers],
                            return_contacts=False)
        # Final-layer representations, shape (batch, tokens, embedding_dim).
        layer = results['representations'][model.num_layers]
        return layer[0, target_pos, :].cpu().numpy()

    # 3. Compute one embedding per variant.
    embeddings = []
    labels = []
    for v in variants:
        mut_seq = apply_point_mutation(seq, v['pos'], v['aa'])
        embeddings.append(get_residue_embedding(mut_seq, v['pos']))
        labels.append(v['label'])
    embeddings = np.vstack(embeddings)
    labels = np.array(labels)

    # 4. Optional scaling.
    embeddings_scaled = StandardScaler().fit_transform(embeddings)

    # 5. UMAP. n_neighbors=15 (the umap-learn default) and random_state=42 are
    # illustrative values, not values taken from the preprint.
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.0, metric='euclidean',
                        random_state=42)
    umap_coords = reducer.fit_transform(embeddings_scaled)

    # 6. Plotly scatter, one trace per label.
    fig = go.Figure()
    for label, color in [('benign', '#2ca02c'), ('pathogenic', '#d62728'),
                         ('VUS', '#ff7f0e')]:
        mask = labels == label
        fig.add_trace(go.Scatter(x=umap_coords[mask, 0], y=umap_coords[mask, 1],
                                 mode='markers',
                                 marker=dict(color=color, size=8), name=label))
    fig.update_layout(title='ESM2 UMAP for P02766', xaxis_title='UMAP1',
                      yaxis_title='UMAP2')
    fig.write_html('p02766_esm2_umap.html')
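The distance-from-wild-type evaluation mentioned in the last high-level step is not specified in detail here, so the following is only a sketch of the general idea: measure each variant's Euclidean distance to a wild-type reference point in the 2D embedding and flag the points beyond a cutoff. The function names, the cutoff value, and the coordinates are hypothetical and do not come from the preprint.

```python
import math


def distances_from_reference(coords, reference):
    """Euclidean distance of each 2D point to the reference point."""
    return [math.dist(point, reference) for point in coords]


def flag_beyond_cutoff(coords, reference, cutoff):
    """Return True for each point farther than cutoff from the reference."""
    return [d > cutoff for d in distances_from_reference(coords, reference)]


# Toy coordinates standing in for UMAP output; not real data.
wild_type = (0.0, 0.0)
variant_coords = [(3.0, 4.0), (0.5, 0.0), (-6.0, 8.0)]

print(distances_from_reference(variant_coords, wild_type))
print(flag_beyond_cutoff(variant_coords, wild_type, cutoff=2.0))
```

Since math.dist accepts any sequence of floats, the same functions also work on rows of a NumPy coordinate array.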
Limitations caveats and reproducibility notes
The preprint treated likely benign and likely pathogenic as benign and pathogenic, respectively, which simplifies the labels but introduces label noise; ClinVar submissions vary in quality, and multiple submitters can disagree. See the ClinVar policy and disclaimers for usage guidance.
UMAP layouts are stochastic and sensitive to n_neighbors, min_dist, and random_state; to replicate the exact figure you must use the same random seed, the same preprocessing (scaling, mean pooling, etc.), and the identical ESM2 model variant and layer used to extract the representations.
Large ESM2 models require a GPU and significant memory; use the exact model (ESM2 T36) or the variant the authors report in order to match results. Loading different checkpoints or layers can change the embedding geometry substantially.
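The label simplification described in the first caveat could be sketched as follows. This is a minimal sketch, assuming the input strings follow ClinVar's germline classification terms; `collapse_clinvar_label` is a hypothetical helper, and returning None for every other term (conflicting classifications, not provided, and so on) is an assumption, since the handling of those cases is not stated here.

```python
def collapse_clinvar_label(clinical_significance):
    """Map a ClinVar classification to 'benign', 'pathogenic', 'VUS', or None."""
    term = clinical_significance.strip().lower()
    if term in {"benign", "likely benign", "benign/likely benign"}:
        return "benign"
    if term in {"pathogenic", "likely pathogenic", "pathogenic/likely pathogenic"}:
        return "pathogenic"
    if term == "uncertain significance":
        return "VUS"
    # Any other term is left unlabeled so the caller can decide what to do.
    return None


raw_terms = ["Likely benign", "Pathogenic/Likely pathogenic",
             "Uncertain significance", "not provided"]
print([collapse_clinvar_label(t) for t in raw_terms])
```

Keeping the raw term alongside the collapsed label makes it possible to check later how much the likely-benign and likely-pathogenic entries influence the layout.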
Next step: compute authentic ESM2 embeddings and regenerate exact figure
If you want me to compute the real ESM2 residue embeddings for every ClinVar missense variant in P02766, run UMAP with the paper's parameters, and return a publication-quality Plotly figure, I can run the analysis with available compute and the official model weights. This is the recommended way to fully reproduce the published scatter.
Relevant resources and implementations
ESM GitHub repository and model weights, with usage documented by FAIR (use the esm2_t36 model as in the preprint)
UMAP implementation and parameter guidance: the umap-learn repository
ClinVar for variant annotations and mapping
Conclusions and confidence
I provided an interactive plotting scaffold that is faithful in styling, along with the exact computational recipe to reproduce the ESM2 UMAP scatter for P02766. Producing the authentic reproduction, however, requires computing real ESM2 residue embeddings, which I can do if you allow the agent to execute the pipeline. Reproducibility depends on using the same model variant, layer, preprocessing, and UMAP seed stated in the preprint.