



     Quick Analysis Plan



    Concise analysis plan

    Quality-control the demultiplexed paired-end Illumina reads from the RdCas12n HEK293T amplicon sequencing, align them to the reference amplicons, call and aggregate indel frequencies per locus and guide, run CRISPResso2-style QC, compute positional indel spectra and frameshift rates, and generate publication-ready plots within a reproducible Snakemake pipeline. Key references: RdCas12n engineering and amplicon methods.




     Long Analysis Plan



    Full bioinformatics and code plan to analyze RdCas12n HEK293T indel amplicon sequencing

    Overview

    You likely have 200 bp amplicons carrying dual 8-nt barcodes, generated per locus, sequenced as Illumina PE150, and demultiplexed (per the RdCas12n study). The goal is to compute, per locus and per biological replicate: total reads, aligned reads, percent of reads with any indel, positional indel spectra relative to the cut site, frameshift fraction, and substitution rates, followed by a statistical comparison between sgRNA WT and engineered sgRNA T19. The outputs are figures, a reproducible Snakemake pipeline, and a Jupyter notebook for traceability. All suggested steps are grounded in the RdCas12n methods and amplicon analysis best practices.

    Required inputs

    • Demultiplexed paired-end FASTQ files per sample (R1, R2), or raw FASTQ plus a barcode sample sheet
    • Reference amplicon FASTA for each locus (exact PCR amplicon sequence including primer-derived bases)
    • Cut site coordinate within each amplicon (state whether it is 0-based or 1-based, and indicate the strand)
    • Sample metadata table with sample id, biological replicate, condition (sgRNA WT or T19), locus id, barcode sequences
    • Optional: raw read counts from sequencing facility and negative control sample(s)
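
    The metadata table is the easiest input to get wrong, so it can help to validate it before any reads are processed. The following is a minimal sketch, not a required format: the column names (`sample_id`, `replicate`, `condition`, `locus_id`, `barcode_fwd`, `barcode_rev`) and the condition labels `WT` and `T19` are assumptions chosen for illustration and should be adapted to the actual sample sheet.

```python
import csv
import io

# Hypothetical column names; adapt them to the lab's real sample sheet.
REQUIRED_COLUMNS = ["sample_id", "replicate", "condition", "locus_id",
                    "barcode_fwd", "barcode_rev"]
ALLOWED_CONDITIONS = {"WT", "T19"}  # assumed labels for the two sgRNA designs
BARCODE_LENGTH = 8                  # dual 8-nt barcodes, as stated in the plan


def validate_sample_sheet(handle):
    """Return (rows, errors) for a CSV sample sheet read from a file-like object."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [], ["missing columns: " + ", ".join(missing)]

    rows, errors = [], []
    seen_ids, seen_barcodes = set(), set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample = row["sample_id"].strip()
        if not sample or sample in seen_ids:
            errors.append(f"line {line_no}: empty or duplicate sample_id {sample!r}")
        seen_ids.add(sample)

        if row["condition"] not in ALLOWED_CONDITIONS:
            errors.append(f"line {line_no}: unknown condition {row['condition']!r}")

        pair = (row["barcode_fwd"].upper(), row["barcode_rev"].upper())
        for barcode in pair:
            if len(barcode) != BARCODE_LENGTH or set(barcode) - set("ACGT"):
                errors.append(f"line {line_no}: invalid barcode {barcode!r}")
        if pair in seen_barcodes:
            errors.append(f"line {line_no}: barcode pair reused")
        seen_barcodes.add(pair)
        rows.append(row)
    return rows, errors


# Small in-memory example with two deliberate problems in rows 2 and 3.
example = io.StringIO(
    "sample_id,replicate,condition,locus_id,barcode_fwd,barcode_rev\n"
    "s1,1,WT,HEXA-4,ACGTACGT,TTGCAAGC\n"
    "s2,1,T19,HEXA-4,ACGTACGT,TTGCAAGC\n"
    "s3,1,KO,HEXA-4,ACGTACG,TTGCAAGC\n"
)
rows, errors = validate_sample_sheet(example)
for message in errors:
    print(message)
```

    Collecting every problem into a list, instead of stopping at the first one, lets the whole sheet be corrected in one pass.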

    Recommended software environment

    Create a reproducible conda environment or container. Minimum tools and Python libraries:

    • Conda, Snakemake (for workflow reproducibility)
    • fastp for adapter trimming and QC
    • BBMap, BWA MEM, or Bowtie2 for short amplicon alignment
    • CRISPResso2 or CRISPRessoBatch for indel quantification (authors used CRISPResso2)
    • Python libraries: pandas, numpy, scipy, matplotlib, seaborn, plotly, biopython, pysam, scikit-learn
    • Optional for deeper analyses: CRISPRessoPooled, ampliconDIVider, and custom pysam-based callers

    Stepwise analysis plan (detailed)

    1. Organize data and metadata
      Create a sample sheet mapping barcodes to sample ids and loci. Confirm that the amplicon sequences match the primer design used in the experiments; obtain cut site offsets from the lab or deduce them from the position of the target spacer in the amplicon. Document all software versions.
    2. Quality control and trimming
      Run fastp on each paired sample for adapter trimming, low-quality base trimming, per-base quality plots, and per-sample read counts. Flag samples with low coverage (e.g. below 1,000 reads per sample for amplicon experiments). The threshold depends on the desired sensitivity: at 1,000 reads a single edited read is 0.1% of the sample, so 0.1% is roughly the lowest frequency that can be observed at all, and sequencing or PCR errors raise the practical limit. Record per-sample QC metrics from the fastp reports.
    3. Demultiplexing verification
      If starting from pooled reads with inline dual 8-nt barcodes, re-demultiplex by exact barcode matching (allow zero mismatches initially) and report barcode collision rates and the unassigned read fraction. The RdCas12n study used dual 8-nt barcodes; apply the same strictness to avoid index-hopping artefacts.
    4. Align reads to amplicon references
      Use a local aligner tuned for short, high-identity reads (BBMap, or BWA MEM with -k 15 -T 19) to align paired reads to the expected amplicon sequence rather than the whole genome. Capture both perfect matches and reads with indels. Produce sorted SAM/BAM files and alignment statistics (mapped fraction, soft-clip rates).
    5. Call indels and substitutions
      Option A (recommended, reproducible): Run CRISPResso2 in amplicon mode per sample using the amplicon FASTA, specifying the expected cleavage site and desired read quality filters; extract indel frequency, position-resolved indel spectra, and frameshift fraction for coding loci. Option B (custom): Use pysam to parse CIGAR strings and compute per-position insertion and deletion counts relative to reference; compute substitution mismatches from MD tags. Save per-sample tables.
    6. Aggregate and normalize
      Aggregate counts across biological replicates and conditions. For each locus and sample compute: total reads, reads passing QC, percent of reads with an indel, percent of reads with substitutions only, top indel classes and their sequences, frameshift proportion (for coding targets), and 95% confidence intervals from the binomial proportion, or from a bootstrap where N is small.
    7. Comparisons and statistics
      Compare sgRNA WT vs sgRNA T19 using appropriate tests: paired t test or Wilcoxon signed-rank for matched replicates; logistic regression for binary indel presence controlling for read depth and locus; multiple testing correction across loci (Benjamini-Hochberg). Report effect sizes (difference in median indel fraction) and p values.
    8. Positional visualization
      Produce per-locus waterfall plots and positional heatmaps showing indel frequency at each base relative to cut site and stacked barplots of insertion vs deletion composition. Compute cumulative indel spectra and highlight dominant deletion junctions and microhomology. Use Plotly interactive plots for web reports, and static matplotlib/seaborn for publication figures.
    9. Quality controls specific to amplicon editing
      Check for: strand bias in indels, read start/end trimming artefacts, primer-dimers, likely PCR chimera events (look for improbable long deletions), and cross-sample index hopping. Include negative control samples to estimate background indel rates; apply a conservative threshold (e.g., require >2x negative control indel rate and absolute >0.5% for claiming low-frequency editing) and document thresholds.
    10. Annotation of functional impact
      For coding targets, annotate predicted frameshift vs in-frame indels, predict premature stop codons, and apply SIFT Indel or another heuristic to estimate likely loss of function when desired.
    11. Reporting and reproducibility
      Produce an automated HTML report (Jupyter + nbconvert or MultiQC style) summarizing per-sample QC, alignment stats, indel tables, interactive plots, and Snakemake logs. Include exact command lines, versions, and the conda env YAML or container image digest.
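
    Option B in step 5 (a custom caller) can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' method: it parses raw CIGAR strings with a regular expression so that it runs without pysam (with pysam, `AlignedSegment.reference_start` and `AlignedSegment.cigartuples` carry the same information), it uses 0-based reference coordinates, and the 10 bp window around the cut site and the example reads are hypothetical. The frameshift call is the simple "net length change not divisible by 3" heuristic and is only meaningful for coding loci.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(ref_start, cigar):
    """Yield (kind, ref_pos, length) for each insertion or deletion.

    ref_start is the 0-based leftmost reference position of the alignment.
    For a deletion, ref_pos is the first deleted reference base; for an
    insertion, it is the reference base immediately after the inserted bases.
    """
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            yield ("ins", pos, length)
        elif op == "D":
            yield ("del", pos, length)
            pos += length
        elif op in "M=XN":
            pos += length
        # S, H and P do not consume reference bases


def classify_read(ref_start, cigar, cut_site, window=10):
    """Classify a read as 'unmodified', 'in-frame' or 'frameshift'.

    Only indels overlapping [cut_site - window, cut_site + window] count;
    the window size is an assumption and should be tuned per nuclease.
    """
    net_change = 0
    edited = False
    for kind, pos, length in indels_from_cigar(ref_start, cigar):
        last = pos + length - 1 if kind == "del" else pos
        if last < cut_site - window or pos > cut_site + window:
            continue  # indel lies entirely outside the quantification window
        edited = True
        net_change += length if kind == "ins" else -length
    if not edited:
        return "unmodified"
    return "frameshift" if net_change % 3 else "in-frame"


# Hypothetical alignments as (reference_start, CIGAR) against one amplicon.
reads = [
    (0, "150M"),         # no indel
    (0, "70M2D78M"),     # 2 bp deletion at the cut site
    (0, "72M3I75M"),     # 3 bp insertion at the cut site
    (0, "5M1D144M"),     # 1 bp deletion far from the cut site
    (0, "5S65M6D80M"),   # soft-clipped read with a 6 bp deletion
]
counts = Counter(classify_read(start, cigar, cut_site=72) for start, cigar in reads)
print(counts)
```

    Restricting the count to a window around the cut site is what keeps sequencing errors elsewhere in the amplicon from being reported as editing; the same idea underlies the quantification window in CRISPResso2.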

    Expected/benchmark numbers and interpretation

    Based on the RdCas12n data, engineered sgRNA T19 substantially improved editing across many loci, with indel frequencies of up to ~40% at HEXA-4 in HEK293T; unengineered RdCas12n often gave a modest ~10% at several loci.

    Datasets to include or cross-check

    • RdCas12n sequencing project PRJNA1261697 to retrieve raw/processed reads and confirm mapping and sample metadata
    • CRISPResso2 documentation and examples for amplicon analysis (use same settings authors used where possible)
    • Optional external datasets for method benchmarking: Cas12f ortholog editing datasets and kinetics for context; see the Cas12f-MG119-28 study (MG119-28 benchmarking in K562 and in vitro)
    • BioStudies and GEO datasets from HEK293T for possible background editing/noise comparison (e.g., E-GEOD-63812), to check for generic biases in HEK293T sequencing libraries.

    Visualizations to produce (unique div ids provided)

    1. Per-locus indel frequency boxplot across replicates (div id: plot_indel_boxplot_1)
    2. Per-locus positional heatmap of deletion frequency relative to cut site (div id: plot_positional_heatmap_1)
    3. Stacked bar of top 10 indel alleles per locus (div id: plot_top_indels_1)
    4. Frameshift versus in-frame stacked bar per locus (div id: plot_frameshift_1)
    5. Interactive table of allele sequences and counts using DataTables (div id: table_alleles_1)

    Reproducible implementation (Snakemake outline)

    High-level Snakemake DAG tasks

    • rule all: produce final report HTML and figures
    • rule qc: run fastp
    • rule demux: verify barcodes and split reads if needed
    • rule align: align reads to amplicon references
    • rule call_indels: run CRISPResso2 per sample
    • rule aggregate: collect CRISPResso2 outputs into master table
    • rule stats: perform statistical tests and generate summary
    • rule plots: create Plotly/Matplotlib figures
    • rule report: render Jupyter notebook to HTML
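
    To make the execution order of these rules concrete, the dependency graph can be written down and sorted. This is only a sketch in plain Python using the standard-library `graphlib` module (Python 3.9+), not Snakemake syntax, and the wiring between rules is an assumption: it follows the order of the stepwise plan (QC, then barcode verification, then alignment and CRISPResso2) and treats `align` and `call_indels` as independent branches because CRISPResso2 performs its own alignment.

```python
from graphlib import TopologicalSorter

# Each rule maps to the set of rules whose outputs it consumes (assumed wiring).
RULE_DEPENDENCIES = {
    "qc": set(),
    "demux": {"qc"},
    "align": {"demux"},
    "call_indels": {"demux"},
    "aggregate": {"call_indels"},
    "stats": {"aggregate"},
    "plots": {"aggregate", "stats"},
    "report": {"qc", "align", "plots", "stats"},
    "all": {"report"},
}

# static_order() yields every rule after all of its dependencies.
execution_order = list(TopologicalSorter(RULE_DEPENDENCIES).static_order())
print(" -> ".join(execution_order))
```

    In a real Snakefile the same graph is implied by each rule's input and output files, so it never has to be declared explicitly; writing it out once is mainly useful for checking that no rule depends on something produced later.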

    Common pitfalls and how to avoid them

    • Index hopping and barcode misassignment: use strict barcode matching and include negative controls.
    • PCR chimera and polymerase slippage producing artifactual deletions: minimize PCR cycles, include technical replicates, and inspect long deletion sequences for microhomology patterns.
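    • Mis-specified amplicon reference causing false positive indels: ensure the reference is the exact PCR amplicon sequence and that primer-derived bases are handled the same way in the reference and in the reads (retained in both or trimmed from both).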
    • Low read depth inflating variance: set minimum read depth thresholds and report confidence intervals.
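
    The depth and negative-control safeguards above can be expressed as two small functions. This is a minimal sketch under stated assumptions: the Wilson score interval is one common choice for a binomial proportion (the plan only asks for a binomial or bootstrap interval), and the defaults mirror the thresholds quoted earlier (minimum 1,000 reads, more than 2x the negative-control rate, and more than 0.5% absolute). The function and parameter names are illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def passes_editing_threshold(edited, total, ctrl_edited, ctrl_total,
                             min_depth=1000, fold=2.0, min_fraction=0.005):
    """Apply the conservative low-frequency editing rule from the plan."""
    if total < min_depth or ctrl_total <= 0:
        return False  # too shallow, or no usable negative control
    rate = edited / total
    background = ctrl_edited / ctrl_total
    return rate > fold * background and rate > min_fraction


# Example: 10 edited reads out of 1,000 (1.0% observed indel frequency).
low, high = wilson_interval(10, 1000)
print(f"95% CI: {low:.4f} to {high:.4f}")
print(passes_editing_threshold(60, 5000, 5, 5000))   # 1.2% vs 0.1% background
print(passes_editing_threshold(20, 5000, 5, 5000))   # 0.4% is below the 0.5% floor
```

    The Wilson interval is used here because it stays within 0 to 1 and behaves sensibly when the edited-read count is zero or very small, which is the regime where low-depth amplicon samples cause trouble.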

    Deliverables

    • Snakemake workflow repository with conda environment YAML or Dockerfile
    • Jupyter notebook with step-by-step analysis and figures
    • Interactive HTML report with Plotly visualizations and DataTables allele table
    • Master CSV with per sample per locus alleles and counts and aggregated statistics
    • Short methods text block suitable for Methods section, documenting software versions and parameters

    Caveats and limitations

    RdCas12n editing efficiencies vary across loci and PAM contexts. The original study notes modest editing at several loci compared to SpCas9/AsCas12a and warns about limited PAM scope (A-rich PAMs) and unresolved structural flexibility zones, which may limit how far the results generalize to other loci or cell types.

    Optional advanced analyses

    • Off-target analysis: align reads to genome to detect potential off-target amplicons (requires whole-genome sequencing or targeted off-target panels)
    • Microhomology analysis: identify MH-mediated deletions indicative of alternative end joining pathways
    • Machine learning allele clustering: cluster indel allele patterns to detect systematic PCR or library prep artefacts
    • Integration with structural data: map positional cleavage patterns onto RdCas12n structural insights to hypothesize mechanism for locus variability using PDB 9J09 and 9UDI
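
    For the microhomology analysis listed above, a deletion can be scored by how many bases at its junction are shared between the deleted segment and the retained flank. The sketch below is one simple way to do this, assuming 0-based coordinates on the amplicon reference; the function name and the example sequence are hypothetical. It counts how far the deletion could be shifted right and left without changing the resulting allele, which equals the microhomology length at the junction.

```python
def microhomology_length(ref, del_start, del_len):
    """Return the microhomology length (in bases) flanking a deletion.

    ref       : reference amplicon sequence
    del_start : 0-based index of the first deleted base
    del_len   : number of deleted bases
    """
    del_end = del_start + del_len  # index of the first retained base after the deletion
    if del_len <= 0 or del_start < 0 or del_end > len(ref):
        raise ValueError("deletion lies outside the reference")

    # Bases at the start of the deletion that match the sequence right after it.
    right = 0
    while (right < del_len and del_end + right < len(ref)
           and ref[del_start + right] == ref[del_end + right]):
        right += 1

    # Bases just before the deletion that match the end of the deleted segment.
    left = 0
    while (left < del_len and del_start - left - 1 >= 0
           and ref[del_start - left - 1] == ref[del_end - left - 1]):
        left += 1

    return min(left + right, del_len)


# Hypothetical amplicon fragment with a repeated CGTA motif.
ref = "GGA" + "CGTA" + "TTTT" + "CGTA" + "CC"
print(microhomology_length(ref, 3, 8))         # deletion of CGTATTTT
print(microhomology_length(ref, 7, 8))         # same allele, right-shifted call
print(microhomology_length("ACGTTGCA", 2, 2))  # deletion with no shared junction bases
```

    Because the score is the same however the aligner happened to place the deletion, alleles reported at slightly different coordinates by different tools can still be compared. Short scores of 1 to 2 bases occur frequently by chance, so any cutoff for calling a deletion microhomology-mediated should be chosen with that in mind.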


    The plan above is ready to implement. If you want, I can generate the Snakemake workflow, conda environment YAML, and an executable Jupyter notebook that implements the full pipeline and visualizations using your uploaded fastq files and sample sheet.




    Updated: December 25, 2025






     Analysis Wizard



    Preparing reproducible Snakemake workflow and Jupyter notebook to QC fastq, align to amplicon references, call indels, aggregate per locus allele counts, and produce interactive Plotly figures using RdCas12n PRJNA1261697 and provided sample sheet.



     Hypothesis Graveyard



    RdCas12n inefficiency is solely due to poor delivery: rejected because engineered sgRNA T19 improves editing substantially while delivery remained constant in experiments.


    All A-rich PAMs are equivalently permissive: rejected because the study shows 5'-AAC-3' preferred but other 5'-AAH-3' PAMs have context-dependent cleavage, so PAM tolerance is nuanced.
