Easy-Prime: Machine learning based pegRNA design¶
Contents:¶
Easy-Prime Installation steps¶
Summary¶
Easy-Prime is installed via conda. Older conda versions can cause installation errors, so please make sure that conda is installed and that its version is >= 4.9.
Note
Easy-Prime is only available on Linux or Mac. For installation via conda, make sure conda version >= 4.9.
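The version requirement above can be checked before starting. The following is a minimal sketch, not part of Easy-Prime: the helper names are hypothetical, and it assumes conda --version prints text of the form "conda 4.10.3".

```python
import re
import subprocess

MIN_CONDA = (4, 9)  # minimum conda version stated in this guide


def parse_conda_version(text):
    """Extract (major, minor) from text such as 'conda 4.10.3'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def conda_is_recent_enough(text, minimum=MIN_CONDA):
    """Compare as integer tuples so that 4.10 ranks above 4.9."""
    return parse_conda_version(text) >= minimum


def installed_conda_version():
    """Return the output of `conda --version`, or None if conda is not on PATH."""
    try:
        result = subprocess.run(
            ["conda", "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return (result.stdout or result.stderr).strip()
```

Calling installed_conda_version() and passing the result to conda_is_recent_enough() tells you whether an update is needed. Versions are compared as integer tuples because a plain string comparison would wrongly rank 4.10 below 4.9.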
Steps¶
The installation may take 20 min.
Stage 1. Type the installation command¶
conda create -n easy_prime -c cheng_lab easy_prime
In this command, -n ENV_NAME sets the name of the conda environment; ENV_NAME can be any string without spaces. -c cheng_lab easy_prime installs the compiled conda package (easy_prime) from the cheng_lab channel.

Stage 2. Type y to start installation¶
Once you have entered the conda create command, conda starts to gather information and may, for example, inform you about a new conda version. It then shows a “Package Plan” listing the new packages to be downloaded and installed.



Now type y and press Enter.
Stage 3. Waiting for installation, may take 20 min¶

Stage 4. Installation is completed¶

The terminal says, “To activate, use conda activate easy_prime”.
Whether to use conda activate or source activate depends on the operating system. On Mac and Linux, please use source activate easy_prime.
Stage 5. Print Easy_prime help message¶

Type easy_prime -h
FAQ¶
Can Easy-Prime be installed in Windows?¶
No. This is currently impossible because the ViennaRNA package is not available on Windows. We might develop a Docker version of Easy-Prime in the future so that users on any OS can use it.
Can Easy-Prime be installed via lower conda version?¶
Yes, but it can be time-consuming. Install the following dependencies via conda (some may still require a newer conda version), then install Easy-Prime with pip install easy-prime:
- python
- bedtools
- matplotlib
- pandas
- xgboost
- scikit-learn
- viennarna
- joblib
- pyyaml
- scikit-bio
- biopython
- mechanize
- dna_features_viewer
- dash
- dash-bio
- dash-core-components
- jupyter_dashboards
- plotly
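After a manual install, a quick check can show which dependencies are still missing. This is a minimal sketch under stated assumptions: the import names below are my mapping from the package names above (several differ, e.g. scikit-learn imports as sklearn and pyyaml as yaml), the list is partial, and checking for an RNAfold executable is only an assumed proxy for a working ViennaRNA install.

```python
import importlib.util
import shutil

# Assumed import names for some of the packages listed above (partial list).
PYTHON_MODULES = [
    "matplotlib", "pandas", "xgboost", "sklearn", "RNA", "joblib", "yaml",
    "skbio", "Bio", "mechanize", "dna_features_viewer", "dash", "dash_bio",
    "plotly",
]
# Command line tools; RNAfold ships with ViennaRNA.
EXECUTABLES = ["bedtools", "RNAfold"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_executables(names):
    """Return the executables that are not on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print("missing Python modules:", missing_modules(PYTHON_MODULES))
    print("missing executables:", missing_executables(EXECUTABLES))
```

An empty list for both lines suggests the environment is ready for pip install easy-prime.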
Easy-Prime Web server tutorial¶
Welcome to Easy-Prime¶
Easy-Prime is a machine learning based tool for prime editing gRNA (pegRNA) design. Please input your desired edits in VCF or FASTA format and click start. You can also adjust the pegRNA/ngRNA search parameters. Outputs include a bed-like table and a genome-browser visualization.
This web server is based on Dash. URL is: http://easy-prime.cc/
Note
Currently, this web portal only supports hg19.
Note
The Easy-Prime server has gone down before because of AWS issues. If this happens, please let us know and we will fix it.
Get Started¶
Go to the Easy-Prime web portal. The webpage looks like this:

Here, you can find areas to input target mutations, to choose different searching parameters, and output visualizations, including a bed-like table and a genome-browser visualization.
To start, click Examples to automatically load example inputs for the 4 accepted formats.
If you encounter an error (most likely due to an incorrect input format), click the check running status button to see error messages. Note that it may not capture every kind of error.
Note
If you encounter an error and nothing seems to work, please refresh the browser and start over. If the issue persists, please email us.
Input formats¶
The program accepts 4 input formats. The first two are VCF-like formats. They need 5 fields, given in the first 5 columns of a VCF file: chr, pos, ID, ref, alt.

The last two are FASTA-like formats. Users input DNA sequences, and the program automatically determines the target mutation and optimizes the pegRNA/ngRNA design.
VCF format¶
## comment line, will be ignored
chr9 110184636 FIG5G_HEK293T_HEK3_6XHIS G GCACCATCATCACCATCAT
chr1 185056772 FIG5E_U2OS_RNF2_1CG G C
chr1 173878832 rs5878 T C
chr11 22647331 FIG3C_FANCF_7AC_PE3B T G
chr19 10244324 EDFIG5B_DNMT1_dPAM G T
The VCF tab is used for a single target mutation, and the VCF batch tab is used for any number of target mutations (preferably fewer than 10). The server does not allow output files larger than 50M. If you want to design pegRNAs for a large number of mutations, please download the command line program.
Note that this is a tab-separated (TSV) format; do not separate columns with spaces or commas. You can first create the input in Excel and then copy and paste it into the text box.
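Because stray spaces or commas are a likely source of input errors, it can help to check the text before pasting it. The following is a minimal sketch: parse_vcf_like is a hypothetical helper, and its checks (at least 5 tab-separated columns, an integer position, lines starting with # skipped) are my assumptions about what a well-formed input looks like, not the server's actual validation.

```python
def parse_vcf_like(text):
    """Return a list of (chrom, pos, name, ref, alt) from tab-separated text."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and comment/header lines are skipped
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(
                f"line {line_no}: expected 5 tab-separated columns, "
                f"found {len(fields)} (are the columns separated by spaces?)"
            )
        chrom, pos, name, ref, alt = (field.strip() for field in fields[:5])
        if not pos.isdigit():
            raise ValueError(f"line {line_no}: position {pos!r} is not an integer")
        records.append((chrom, int(pos), name, ref, alt))
    return records


example = "## comment line, will be ignored\nchr1\t173878832\trs5878\tT\tC\n"
print(parse_vcf_like(example))
```

A space-separated line fails with a message naming the offending line, which is easier to act on than a failed job on the server.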
FASTA format¶
>rs2251964_ref
GTTACCAAAGCAAATGACATCTTGTGAAAGGGGAGGTCTGAAAAAAAAAAACAAGTGGGTGGGTTTTTTCAAAGTAGGCCACCGGGCCTGAGATGACCAGAATTCAAATTAGGATGACAGTGTAGTAGGGGAAGCAACCAGAATCGGACCT
>rs2251964_alt
GTTACCAAAGCAAATGACATCTTGTGAAAGGGGAGGTCTGAAAAAAAAAAACAAGTGGGTGGGTTTTTTCAAAGTAGGCCACCGGGCCTGAGATAACCAGAATTCAAATTAGGATGACAGTGTAGTAGGGGAAGCAACCAGAATCGGACCT
We use the keywords _ref and _alt to recognize the reference and mutated sequences. In this example, the variant name is rs2251964, but it can be any string without spaces.
We suggest an input sequence length of at least 100 bp.
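The _ref/_alt naming convention can be checked with a few lines of code. This is a minimal sketch of how such headers could be paired, assuming each variant has exactly one _ref and one _alt record; pair_ref_alt is a hypothetical helper, not Easy-Prime's own parser.

```python
def pair_ref_alt(fasta_text):
    """Group FASTA records into {variant_name: {"ref": seq, "alt": seq}}."""
    pairs = {}
    name = allele = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            if header.endswith("_ref"):
                name, allele = header[:-4], "ref"
            elif header.endswith("_alt"):
                name, allele = header[:-4], "alt"
            else:
                raise ValueError(f"header {header!r} must end with _ref or _alt")
            pairs.setdefault(name, {})[allele] = ""
        else:
            if name is None:
                raise ValueError("sequence found before any FASTA header")
            pairs[name][allele] += line.upper()
    for variant, alleles in pairs.items():
        if set(alleles) != {"ref", "alt"}:
            raise ValueError(f"variant {variant!r} needs both a _ref and an _alt record")
    return pairs


records = pair_ref_alt(">demo_ref\nACGTACGT\n>demo_alt\nACGAACGT\n")
print(records["demo"]["ref"], records["demo"]["alt"])
```

The short sequences here are for illustration only; for real inputs, the 100 bp suggestion above can be checked with len() on each paired sequence.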
PrimeDesign format¶
>test_SNV
GCCTGTGACTAACTGCGCCAAAACGGCCTGTGACTAACTGCGCCAGCCTGTGACTAACTGCGCCAAAACGAAACG(T/A)GCCTGGCCTGTGACTAACTGCGCCAAAACGTGACTAACTGCGCCAAAACGCTTCCAATCCCCTTATCCAATTTA
>test_insertion
GCCTGTGCCTGTGACTAACTGCGCCAAAACGGAGCCTGTGACTAACTGCGCCAAAACGCTAACTGCGCCAAAACGT(+CTT)CTTCCGCCTGGCCTGTGACTAACTGCGCCAAAACGTGACTAACTGCGCCAAAACGAATCCCCTTATCCAATTTA
>test_deletion
GCCTGTGACTAGCCTGTGACTAACTGCGCCAAAACGACTGCGCGCCTGTGACTAACTGCGCCAAAACGCAAAAC(-GTCT)TCCAATCGCCTGTGACTAACTGCGCCAAAACGCCCTTATCCGCCTGTGACTAACTGCGCCAAAACGAATTTA
Please see https://github.com/pinellolab/PrimeDesign#primedesign-input-sequence-format for more information.
We treat PrimeDesign format as a FASTA format; the FASTA header is used as the variant name.
Please note that the Combinatorial edits format is not supported, e.g., GC(G/T)CCA(+ATCG)AAA
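To make the notation concrete, the three edit types in the examples above can be expanded into a reference and an edited sequence. This is a minimal sketch based on my reading of the notation, (ref/alt) for a substitution, (+ins) for an insertion and (-del) for a deletion; expand_primedesign is a hypothetical helper, and it rejects any sequence with more than one edit, mirroring the restriction on combinatorial edits.

```python
import re

EDIT = re.compile(r"\(([^()]*)\)")  # one parenthesised edit


def expand_primedesign(seq):
    """Return (reference_sequence, edited_sequence) for a single-edit input."""
    matches = list(EDIT.finditer(seq))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one edit, found {len(matches)}; "
            "combinatorial edits are not supported"
        )
    match = matches[0]
    left, right = seq[: match.start()], seq[match.end():]
    body = match.group(1)
    if body.startswith("+"):      # insertion: bases present only in the edit
        ref_mid, alt_mid = "", body[1:]
    elif body.startswith("-"):    # deletion: bases present only in the reference
        ref_mid, alt_mid = body[1:], ""
    elif "/" in body:             # substitution: reference/edit
        ref_mid, alt_mid = body.split("/", 1)
    else:
        raise ValueError(f"unrecognised edit {body!r}")
    return left + ref_mid + right, left + alt_mid + right


print(expand_primedesign("AACG(T/A)GCC"))
print(expand_primedesign("ACGT(+CTT)CTTCC"))
print(expand_primedesign("AAAC(-GTCT)TCCA"))
```

Passing the combinatorial example GC(G/T)CCA(+ATCG)AAA raises a ValueError, since it contains two edits.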
Searching Parameters¶
Here users can change RTT length, PBS length, and nick-gRNA distance. We suggest users just use the default settings.
Output pegRNA/ngRNA design tables¶
Once Easy-Prime has finished, the default sgRNA, PBS, RTT, and ngRNA selection is the combination with the highest predicted editing efficiency.
Users can click on each tab (e.g., the PBS table tab) to choose other sequences. Selecting an sgRNA updates the PBS, RTT, and ngRNA tables, since these 3 components are unique to each sgRNA. Each selection also updates the genome browser visualization at the bottom.
To download all results for the current Easy-Prime prediction, click the Download all prediction button. This downloads all predictions in a bed-like format as a zip file. Because Easy-Prime exhaustively searches all combinations, this is a big file.
To download your current selection, click “Download current selection”. This is a bed-like file containing the 4 components of a pegRNA/ngRNA design: sgRNA, PBS, RTT, and ngRNA.

Output pegRNA/ngRNA genome browser visualization¶
Genome browser view is powered by Protein Paint (https://pecan.stjude.cloud/proteinpaint). You can zoom in to actually see the DNA bases.
However, the tracks only support hg19. If your input is in FASTA format, the second visualization will serve you better (e.g., if you have an hg38 variant, you can first extract the +/- 100 bp sequence around it and input it here).
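The extraction step for a non-hg19 variant can be sketched as follows. This is a minimal example under stated assumptions: ref_alt_fasta is a hypothetical helper, the chromosome sequence is assumed to be already loaded as a string by whatever means you prefer, and pos is assumed to be 1-based as in a VCF file.

```python
def ref_alt_fasta(chrom_seq, pos, ref, alt, name, flank=100):
    """Build _ref/_alt FASTA text with `flank` bases on each side of the variant."""
    start = pos - 1  # convert the 1-based position to a 0-based index
    observed = chrom_seq[start:start + len(ref)]
    if observed.upper() != ref.upper():
        raise ValueError(f"reference allele {ref!r} does not match genome ({observed!r})")
    left = chrom_seq[max(0, start - flank):start]
    right = chrom_seq[start + len(ref):start + len(ref) + flank]
    ref_seq = (left + ref + right).upper()
    alt_seq = (left + alt + right).upper()
    return f">{name}_ref\n{ref_seq}\n>{name}_alt\n{alt_seq}\n"


# Toy "chromosome": the variant G>C sits at 1-based position 151.
toy_chrom = "A" * 150 + "G" + "T" * 150
print(ref_alt_fasta(toy_chrom, 151, "G", "C", "toy_variant"))
```

Checking the reference allele against the extracted sequence catches off-by-one and wrong-assembly mistakes before they reach the pegRNA design step.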
Ask questions here¶
Summary¶
PE design involves carefully choosing a standard sgRNA, an RT template that contains the desired edits, a PBS that primes the RT reaction, and an ngRNA that nicks the non-edited strand. Usually thousands of combinations are available for a single desired edit, so selecting the candidate most likely to be highly efficient from among them is overwhelming.
Easy-Prime applies a machine learning model (i.e., XGBoost) that learned important PE design features from public PE amplicon sequencing data to help researchers select the best candidate.
Installation¶
conda create -n genome_editing -c cheng_lab easy_prime
source activate genome_editing
easy_prime -h
For detailed installation with screenshots, see: Installation
Input¶
vcf input example
VCF headers will be ignored. Only the first 5 columns from the vcf file will be used; they are: chr, pos, name/id, ref, alt.
## comment line, will be ignored
chr9 110184636 FIG5G_HEK293T_HEK3_6XHIS G GCACCATCATCACCATCAT
chr1 185056772 FIG5E_U2OS_RNF2_1CG G C
chr1 173878832 rs5878 T C
chr11 22647331 FIG3C_FANCF_7AC_PE3B T G
chr19 10244324 EDFIG5B_DNMT1_dPAM G T
fasta input example
To specify reference and alternative allele, you need two fasta sequences; _ref is a keyword that will be recognized as the reference allele and _alt is a keyword for target mutations.
>test_ref
AAAAAAAAAAAAAAAAAAAAAAAAAGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>test_alt
AAAAAAAAAAAAAAAAAAAAAAAAAGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Config file¶
Default values are shown in the following YAML file.
genome_fasta: /path/to/genome.fa
scaffold: GTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGC
debug: 0
n_jobs: 4
min_PBS_length: 8
max_PBS_length: 17
min_RTT_length: 10
max_RTT_length: 25
min_distance_RTT5: 3
max_ngRNA_distance: 100
max_target_to_sgRNA: 10
sgRNA_length: 20
offset: -3
PAM: NGG
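The length bounds above give a feel for how quickly the search space grows. The sketch below counts the PBS/RTT length pairs implied by the defaults, assuming both bounds are inclusive and every integer length in between is tried; that is my assumption for illustration, not a description of Easy-Prime's internal enumeration.

```python
from itertools import product

# Default length bounds copied from the config file above.
config = {
    "min_PBS_length": 8,
    "max_PBS_length": 17,
    "min_RTT_length": 10,
    "max_RTT_length": 25,
}


def length_pairs(cfg):
    """Enumerate every (PBS length, RTT length) pair within the inclusive bounds."""
    pbs_lengths = range(cfg["min_PBS_length"], cfg["max_PBS_length"] + 1)
    rtt_lengths = range(cfg["min_RTT_length"], cfg["max_RTT_length"] + 1)
    return list(product(pbs_lengths, rtt_lengths))


pairs = length_pairs(config)
print(len(pairs))  # 10 PBS lengths x 16 RTT lengths = 160
```

Under these assumptions each candidate sgRNA already has 160 length pairs before any ngRNA is considered, so narrowing the bounds is one way to shrink the search.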
Output¶
The output folder contains:
topX_pegRNAs.csv
rawX_pegRNAs.csv.gz
X_p_pegRNAs.csv.gz
summary.csv
The top candidates are provided in topX_pegRNAs.csv. This is a rawX format file.
rawX format¶
X means the input to machine learning models, and rawX means the file before machine learning featurization. Specifically, rawX contains 11 + 1 columns. The first 5 columns come from the input vcf file: sample_ID, chr, pos, ref, alt, where sample_ID ends with _candidate_xxx to indicate the N-th combination. The next 6 columns describe one component and its genomic coordinates: type, seq, chr, start, end, strand, where type is sgRNA, PBS, RTT, or ngRNA. Because one PE design needs all 4 components, each unique sample_ID has 4 rows, one per component sequence. The optional 12th column is the predicted efficiency; in other words, the Y for machine learning.
Both topX_pegRNAs.csv and rawX_pegRNAs.csv.gz use this format.
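The four-rows-per-design layout can be regrouped with the standard library. This is a minimal sketch that relies only on the column order described above; group_designs and complete_designs are hypothetical helpers, and whether the file starts with a header row is an assumption exposed as the has_header parameter, so check your own output first.

```python
import csv
import io
from collections import defaultdict

COMPONENTS = {"sgRNA", "PBS", "RTT", "ngRNA"}


def group_designs(csv_text, has_header=True):
    """Map sample_ID -> {component type -> details} using column positions only."""
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # skip the header row (assumed present)
    designs = defaultdict(dict)
    for row in reader:
        if len(row) < 11:
            raise ValueError(f"expected at least 11 columns, found {len(row)}")
        sample_id, component = row[0], row[5]
        efficiency = float(row[11]) if len(row) > 11 and row[11] else None
        designs[sample_id][component] = {
            "seq": row[6],
            "chr": row[7],
            "start": int(row[8]),
            "end": int(row[9]),
            "strand": row[10],
            "efficiency": efficiency,  # optional 12th column
        }
    return dict(designs)


def complete_designs(designs):
    """Keep only designs that list all four components."""
    return {k: v for k, v in designs.items() if set(v) == COMPONENTS}
```

For the gzip-compressed rawX file, the text can be read with gzip.open(path, "rt") before passing it to group_designs.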
X format¶
X format is the numeric representation of rawX. X_p format appends the predicted efficiency to the last column of X.
Main results¶
The main results, i.e., the top candidates, are provided in topX_pegRNAs.csv.
PE design visualization¶
Users can visualize the predicted combinations using:
easy_prime_vis -f topX_pegRNAs.csv -s /path/to/genome_fasta.fa
This will output PDF files to a result directory.
Usage¶
git clone https://github.com/YichaoOU/easy_prime
cd easy_prime/test
easy_prime -h
easy_prime --version
## Please update the genome_fasta in config.yaml
easy_prime -c config.yaml -f test.vcf
## Will output results to a folder
DASH application¶
Easy-Prime also provides a dash application.
Please have dash installed before running the dash application.
git clone https://github.com/YichaoOU/easy_prime
cd easy_prime/dash_app
python main.py