Easy-Prime: Machine learning based pegRNA design

Contents:

Easy-Prime Installation steps

Summary

Installing Easy-Prime via conda is straightforward, but older conda versions can cause errors. Please make sure that conda is installed and that its version is >= 4.9.

Note

Easy-Prime is only available on Linux or Mac. For installation via conda, make sure conda version >= 4.9.
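The version requirement above can be checked before starting. The following is a minimal sketch, not part of Easy-Prime: the helper names parse_conda_version and conda_is_new_enough are hypothetical, and the sketch assumes that conda --version prints text such as "conda 4.9.2".

```python
import re
import subprocess

MIN_CONDA = (4, 9)  # minimum version stated in the installation notes


def parse_conda_version(text):
    """Extract a numeric version tuple from text such as 'conda 4.9.2'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups() if part is not None)


def conda_is_new_enough(version_text, minimum=MIN_CONDA):
    """Compare only major.minor against the required minimum."""
    return parse_conda_version(version_text)[:2] >= minimum


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["conda", "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        print("conda was not found or failed to run")
    else:
        # Some older conda releases may print the version to stderr.
        reported = (result.stdout or result.stderr).strip()
        print(reported, "OK" if conda_is_new_enough(reported) else "too old")
```

If the check reports "too old", updating conda first is likely to be faster than installing the dependencies one by one.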

Steps

The installation may take 20 min.

Stage 1. Type the installation command
conda create -n easy_prime -c cheng_lab easy_prime

In -n ENV_NAME, ENV_NAME can be any string without spaces. -c cheng_lab easy_prime installs the compiled conda package (namely easy_prime) from the cheng_lab channel.

_images/step1.png
Stage 2. Type y to start installation

Once you have typed in the conda create command, conda starts gathering information and may, for example, inform you about a new conda version. It then shows a “Package Plan” listing the new packages to be downloaded and installed.

_images/step2.1.png _images/step2.2.png _images/step2.3.png

Now, type y and press Enter.

Stage 3. Waiting for installation, may take 20 min
_images/step3.png
Stage 4. Installation is completed
_images/step4.png

The terminal says, “To activate, use conda activate easy_prime”.

Whether to use conda activate or source activate depends on the operating system. On Mac and Linux, please use source activate easy_prime.

Stage 5. Print Easy_prime help message
_images/step5.png

Type: easy_prime -h

FAQ

Can Easy-Prime be installed in Windows?

No. It is currently impossible because the ViennaRNA package is not available on Windows. We might develop a Docker version of Easy-Prime in the future so that users on any OS can use it.

Can Easy-Prime be installed with a lower conda version?

Yes. It is possible but can be time-consuming. You can install the following dependencies via conda (some may still need a higher conda version) and then install Easy-Prime via pip install easy-prime.

- python
- bedtools
- matplotlib
- pandas
- xgboost
- scikit-learn
- viennarna
- joblib
- pyyaml
- scikit-bio
- biopython
- mechanize
- dna_features_viewer
- dash
- dash-bio
- dash-core-components
- jupyter_dashboards
- plotly

Easy-Prime Web server tutorial

Welcome to Easy-Prime

Easy-Prime is a machine learning based tool for prime editing gRNA (pegRNA) design. Please input your desired edits in VCF format or FASTA format and click start. Additionally, you can adjust the pegRNA/ngRNA searching parameters. Outputs include a bed-like table and a genome-browser visualization.

This web server is based on Dash. The URL is: http://easy-prime.cc/

Note

Currently, this web portal only supports hg19.

Note

The Easy-Prime server has previously gone down due to AWS issues. If this happens again, just let us know and we will fix it.

Get Started

Go to the Easy-Prime web portal. The webpage looks like this:

_images/easy_prime_web_portal_init.png

Here, you can find areas to input target mutations, to choose different searching parameters, and output visualizations, including a bed-like table and a genome-browser visualization.

To get started, you can click Examples to automatically load input examples for the 4 accepted formats.

If you experience an error (very likely due to an incorrect input format), you can click the check running status button to see error messages. Note that it may not capture all kinds of errors.

Note

If you experience an error and nothing seems to work, please refresh the browser and start over. If the issue persists, please email us.

Input formats

The program accepts 4 input formats. The first two are VCF-like formats. They provide 5 pieces of information: chr, pos, ID, ref, alt, specified in the first 5 columns of a VCF file.

_images/accept_formats.png

The last two are FASTA-like formats. Users input DNA sequences, and the program automatically determines the target mutation and optimizes the pegRNA/ngRNA design.

VCF format
## comment line, will be ignored
chr9    110184636       FIG5G_HEK293T_HEK3_6XHIS        G       GCACCATCATCACCATCAT
chr1    185056772       FIG5E_U2OS_RNF2_1CG     G       C
chr1    173878832       rs5878  T       C
chr11   22647331        FIG3C_FANCF_7AC_PE3B    T       G
chr19   10244324        EDFIG5B_DNMT1_dPAM      G       T

The VCF tab is used for a single target mutation and the VCF batch tab is used for any number of target mutations (preferably fewer than 10). The server prohibits output file sizes > 50M. If you want to design pegRNAs for a large number of mutations, please download the command line program.

Note that this is a tab-separated (tsv) format; do not separate columns with spaces or commas. You can first create the input in Excel and then copy and paste it into the text box.
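Because the input must be tab-separated, it can help to check a file locally before pasting it. The sketch below is an illustration only, not the parser Easy-Prime uses; the function name parse_vcf_like and the ACGTN allele check are assumptions.

```python
def parse_vcf_like(text):
    """Return (chrom, pos, name, ref, alt) tuples from tab-separated text.

    Lines starting with '#' and blank lines are skipped. Only the first
    5 columns are read, matching the description of the VCF-like format.
    """
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(
                f"line {line_number}: expected at least 5 tab-separated "
                f"columns, found {len(fields)} (spaces or commas used?)"
            )
        chrom, pos, name, ref, alt = (field.strip() for field in fields[:5])
        if not pos.isdigit():
            raise ValueError(f"line {line_number}: pos {pos!r} is not an integer")
        for allele in (ref, alt):
            # Assumption: alleles are plain DNA strings.
            if not allele or set(allele.upper()) - set("ACGTN"):
                raise ValueError(f"line {line_number}: unexpected allele {allele!r}")
        records.append((chrom, int(pos), name, ref, alt))
    return records


example = "## comment line, will be ignored\nchr1\t173878832\trs5878\tT\tC\n"
print(parse_vcf_like(example))
```

A line that uses spaces instead of tabs is reported with its line number, which is usually enough to locate a copy-and-paste problem.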

FASTA format
>rs2251964_ref
GTTACCAAAGCAAATGACATCTTGTGAAAGGGGAGGTCTGAAAAAAAAAAACAAGTGGGTGGGTTTTTTCAAAGTAGGCCACCGGGCCTGAGATGACCAGAATTCAAATTAGGATGACAGTGTAGTAGGGGAAGCAACCAGAATCGGACCT
>rs2251964_alt
GTTACCAAAGCAAATGACATCTTGTGAAAGGGGAGGTCTGAAAAAAAAAAACAAGTGGGTGGGTTTTTTCAAAGTAGGCCACCGGGCCTGAGATAACCAGAATTCAAATTAGGATGACAGTGTAGTAGGGGAAGCAACCAGAATCGGACCT

We use the keywords _ref and _alt to recognize the reference and mutated sequences. In this example, the variant name is rs2251964, but it can be any string without spaces.

We suggest an input sequence length of at least 100 bp.
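To illustrate how a _ref/_alt pair encodes a mutation, the sketch below pairs records by name and trims the shared prefix and suffix to expose the difference. This is an assumption about one reasonable approach, not a description of Easy-Prime's internal logic; the function names are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def pair_ref_alt(records):
    """Return {variant_name: (ref_seq, alt_seq)} using the _ref/_alt keywords."""
    pairs = {}
    for name, sequence in records.items():
        if name.endswith("_ref"):
            variant = name[: -len("_ref")]
            alt = records.get(variant + "_alt")
            if alt is None:
                raise ValueError(f"{variant}: no matching _alt record")
            pairs[variant] = (sequence, alt)
    return pairs


def describe_difference(ref, alt):
    """Trim the common prefix and suffix; return (offset, ref_part, alt_part)."""
    shortest = min(len(ref), len(alt))
    start = 0
    while start < shortest and ref[start] == alt[start]:
        start += 1
    end = 0
    while end < shortest - start and ref[-1 - end] == alt[-1 - end]:
        end += 1
    return start, ref[start:len(ref) - end], alt[start:len(alt) - end]


toy = ">demo_ref\nACGTGACC\n>demo_alt\nACGTAACC\n"  # toy sequences, far shorter than 100 bp
for variant, (ref, alt) in pair_ref_alt(read_fasta(toy)).items():
    print(variant, describe_difference(ref, alt))
```

The offset is 0-based and counted from the start of the input sequence. In repetitive regions, an insertion or deletion can be placed at more than one equivalent position, and this sketch reports the leftmost one.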

PrimeDesign format
>test_SNV
GCCTGTGACTAACTGCGCCAAAACGGCCTGTGACTAACTGCGCCAGCCTGTGACTAACTGCGCCAAAACGAAACG(T/A)GCCTGGCCTGTGACTAACTGCGCCAAAACGTGACTAACTGCGCCAAAACGCTTCCAATCCCCTTATCCAATTTA
>test_insertion
GCCTGTGCCTGTGACTAACTGCGCCAAAACGGAGCCTGTGACTAACTGCGCCAAAACGCTAACTGCGCCAAAACGT(+CTT)CTTCCGCCTGGCCTGTGACTAACTGCGCCAAAACGTGACTAACTGCGCCAAAACGAATCCCCTTATCCAATTTA
>test_deletion
GCCTGTGACTAGCCTGTGACTAACTGCGCCAAAACGACTGCGCGCCTGTGACTAACTGCGCCAAAACGCAAAAC(-GTCT)TCCAATCGCCTGTGACTAACTGCGCCAAAACGCCCTTATCCGCCTGTGACTAACTGCGCCAAAACGAATTTA

Please see https://github.com/pinellolab/PrimeDesign#primedesign-input-sequence-format for more information.

We treat PrimeDesign format as a FASTA format; the FASTA header is used as the variant name.

Please note that the combinatorial edits format is not supported, e.g., GC(G/T)CCA(+ATCG)AAA
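The PrimeDesign notation can be expanded into a reference and an edited sequence. The sketch below is an illustration under the assumption that each input contains exactly one parenthesized edit; the function name primedesign_to_ref_alt is hypothetical and is not part of Easy-Prime or PrimeDesign.

```python
import re

EDIT_PATTERN = re.compile(r"\(([^()]*)\)")


def primedesign_to_ref_alt(sequence):
    """Expand one PrimeDesign-style edit into (ref_sequence, alt_sequence).

    Handles (REF/ALT) substitutions, (+INS) insertions and (-DEL) deletions.
    More than one edit (combinatorial format) raises ValueError.
    """
    matches = list(EDIT_PATTERN.finditer(sequence))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one edit, found {len(matches)}; "
            "combinatorial edits are not supported"
        )
    match = matches[0]
    left, right = sequence[:match.start()], sequence[match.end():]
    edit = match.group(1)
    if edit.startswith("+"):
        ref_part, alt_part = "", edit[1:]
    elif edit.startswith("-"):
        ref_part, alt_part = edit[1:], ""
    elif "/" in edit:
        ref_part, alt_part = edit.split("/", 1)
    else:
        raise ValueError(f"unrecognized edit {edit!r}")
    return left + ref_part + right, left + alt_part + right


print(primedesign_to_ref_alt("AACG(T/A)GCC"))
print(primedesign_to_ref_alt("ACGT(+CTT)CTT"))
print(primedesign_to_ref_alt("AAAC(-GTCT)TCC"))
```

Rejecting inputs with more than one edit mirrors the restriction on combinatorial edits stated above.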

Searching Parameters

Here users can change RTT length, PBS length, and nick-gRNA distance. We suggest users just use the default settings.

Output pegRNA/ngRNA design tables

Once Easy-Prime is finished, the default sgRNA, PBS, RTT, and ngRNA selection is the one with the highest predicted editing efficiency.

Users can click on each tab (e.g., the PBS table tab) to choose other sequences. Selecting an sgRNA triggers updates of the PBS, RTT, and ngRNA tables, since these 3 components are unique to each sgRNA. Each selection updates the genome browser visualization at the bottom.

To download all results for the current Easy-Prime prediction, click the Download all prediction button. This downloads all predictions in a bed-like format as a zip file. Because Easy-Prime exhaustively searches all combinations, this can be a big file.

To download your current selection, click “Download current selection”. This is a bed-like format containing the 4 components of a pegRNA/ngRNA, which are sgRNA, PBS, RTT, and ngRNA.

_images/easy_prime_output_vis.png

Output pegRNA/ngRNA genome browser visualization

Genome browser view is powered by Protein Paint (https://pecan.stjude.cloud/proteinpaint). You can zoom in to actually see the DNA bases.

However, the tracks only support hg19. The second visualization will therefore work better if your input is in FASTA format (e.g., if you have an hg38 variant, you can first extract the +/- 100 bp sequence and input it here).
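For a variant on another assembly, the +/- 100 bp extraction could be sketched as follows. This assumes the chromosome sequence is already loaded into a Python string (loading the genome is not shown) and that pos is 1-based as in VCF; the function name ref_alt_fasta is hypothetical.

```python
def ref_alt_fasta(chrom_seq, pos, ref, alt, name, flank=100):
    """Build a _ref/_alt FASTA pair around a variant.

    chrom_seq: full chromosome sequence as a string.
    pos: 1-based position of the first reference base (VCF convention).
    """
    start = pos - 1
    end = start + len(ref)
    observed = chrom_seq[start:end]
    if observed.upper() != ref.upper():
        raise ValueError(
            f"reference allele {ref!r} does not match the genome ({observed!r}); "
            "check the assembly and the coordinate"
        )
    left = chrom_seq[max(0, start - flank):start]
    right = chrom_seq[end:end + flank]
    ref_seq = (left + ref + right).upper()
    alt_seq = (left + alt + right).upper()
    return f">{name}_ref\n{ref_seq}\n>{name}_alt\n{alt_seq}\n"


# Toy chromosome for illustration only.
toy_chromosome = "A" * 10 + "G" + "C" * 10
print(ref_alt_fasta(toy_chromosome, pos=11, ref="G", alt="T", name="demo", flank=3))
```

The reference-allele check guards against mixing coordinates from one assembly with a sequence from another.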

Ask questions here

https://github.com/YichaoOU/easy_prime

Summary

PE design involves carefully choosing a standard sgRNA, an RT template that contains the desired edits, a PBS that primes the RT reaction, and an ngRNA that nicks the non-edited strand. Usually thousands of combinations are available for a single desired edit, so selecting the candidate most likely to be highly efficient from this huge number of combinations is overwhelming.

Easy-Prime applies a machine learning model (i.e., XGBoost) that learned important PE design features from public PE amplicon sequencing data to help researchers select the best candidate.

Installation

conda create -n genome_editing -c cheng_lab easy_prime

source activate genome_editing

easy_prime -h

For detailed installation with screenshots, see: Installation

Input

  1. vcf input example

VCF headers will be ignored. Only the first 5 columns from the vcf file will be used; they are: chr, pos, name/id, ref, alt.

## comment line, will be ignored
chr9    110184636       FIG5G_HEK293T_HEK3_6XHIS        G       GCACCATCATCACCATCAT
chr1    185056772       FIG5E_U2OS_RNF2_1CG     G       C
chr1    173878832       rs5878  T       C
chr11   22647331        FIG3C_FANCF_7AC_PE3B    T       G
chr19   10244324        EDFIG5B_DNMT1_dPAM      G       T
  2. fasta input example

To specify reference and alternative allele, you need two fasta sequences; _ref is a keyword that will be recognized as the reference allele and _alt is a keyword for target mutations.

>test_ref
AAAAAAAAAAAAAAAAAAAAAAAAAGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>test_alt
AAAAAAAAAAAAAAAAAAAAAAAAAGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Config file

Default values are shown in the following YAML file.

genome_fasta: /path/to/genome.fa
scaffold: GTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGC
debug: 0
n_jobs: 4
min_PBS_length: 8
max_PBS_length: 17
min_RTT_length: 10
max_RTT_length: 25
min_distance_RTT5: 3
max_ngRNA_distance: 100
max_target_to_sgRNA: 10
sgRNA_length: 20
offset: -3
PAM: NGG
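Before a long run, the length ranges in an edited config can be sanity-checked. The sketch below is a simplification that reads only flat "key: value" lines with the standard library; a real YAML parser such as PyYAML would be the more robust choice. The function names and the checks themselves are assumptions, not Easy-Prime behavior.

```python
EXAMPLE_CONFIG = """\
genome_fasta: /path/to/genome.fa
min_PBS_length: 8
max_PBS_length: 17
min_RTT_length: 10
max_RTT_length: 25
offset: -3
PAM: NGG
"""


def read_flat_config(text):
    """Parse flat 'key: value' lines; integer-looking values become int."""
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        value = value.strip()
        config[key.strip()] = int(value) if value.lstrip("-").isdigit() else value
    return config


def check_config(config):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for low, high in (
        ("min_PBS_length", "max_PBS_length"),
        ("min_RTT_length", "max_RTT_length"),
    ):
        if config[low] > config[high]:
            problems.append(f"{low} ({config[low]}) exceeds {high} ({config[high]})")
    return problems


config = read_flat_config(EXAMPLE_CONFIG)
print(check_config(config) or "no problems found")
```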

Output

The output folder contains:

  • topX_pegRNAs.csv

  • rawX_pegRNAs.csv.gz

  • X_p_pegRNAs.csv.gz

  • summary.csv

The top candidates are provided in topX_pegRNAs.csv. This is a rawX format file.

rawX format

X means the input to machine learning models, and rawX is the file before machine learning featurization. rawX contains 11 + 1 columns. The first 5 columns come from the input vcf file: sample_ID, chr, pos, ref, alt. The sample_ID ends with _candidate_xxx, which indicates the N-th combination. The next 6 columns are genomic coordinates: type, seq, chr, start, end, strand, where type can be sgRNA, PBS, RTT, or ngRNA. Because one PE design requires all 4 components, each unique sample_ID has 4 rows, one specifying the sequence of each component. The optional 12th column is the predicted efficiency; in other words, the Y for machine learning.

Both topX_pegRNAs.csv and rawX_pegRNAs.csv.gz use this format.
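Since each design spans several rows, a common first step is to regroup rows by sample_ID. The sketch below reads columns by position, following the column order described above; whether the real files include a header row or use exactly this layout is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict

# Column positions, following the rawX description:
# sample_ID, chr, pos, ref, alt, type, seq, chr, start, end, strand, [efficiency]
SAMPLE_ID, COMPONENT_TYPE, SEQUENCE, EFFICIENCY = 0, 5, 6, 11


def group_candidates(rows):
    """Collect the component sequences of each candidate into one dict."""
    designs = defaultdict(dict)
    for row in rows:
        design = designs[row[SAMPLE_ID]]
        design[row[COMPONENT_TYPE]] = row[SEQUENCE]
        # The 12th column is optional.
        if len(row) > EFFICIENCY and row[EFFICIENCY] != "":
            design["predicted_efficiency"] = float(row[EFFICIENCY])
    return dict(designs)


def best_design(designs):
    """Return the sample_ID with the highest predicted efficiency."""
    scored = {
        name: design
        for name, design in designs.items()
        if "predicted_efficiency" in design
    }
    if not scored:
        raise ValueError("no design has a predicted efficiency")
    return max(scored, key=lambda name: scored[name]["predicted_efficiency"])
```

Rows could come from csv.reader over the decompressed file; if the file has a header row, skip it before calling group_candidates.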

X format

X format is the numeric representation of rawX. X_p format appends the predicted efficiency to the last column of X.

Main results

The main results, i.e., the top candidates, are provided in topX_pegRNAs.csv.

PE design visualization

Users can visualize the predicted combinations using:

easy_prime_vis -f topX_pegRNAs.csv -s /path/to/genome_fasta.fa

This will output PDF files to a result directory.

Usage

git clone https://github.com/YichaoOU/easy_prime

cd easy_prime/test

easy_prime -h

easy_prime --version

## Please update the genome_fasta in config.yaml

easy_prime -c config.yaml -f test.vcf

## Will output results to a folder

DASH application

Easy-Prime also provides a dash application.

Please have dash installed before running the dash application.

git clone https://github.com/YichaoOU/easy_prime

cd easy_prime/dash_app

python main.py