MEGASAT

From Bioinformatics Software
This page is currently under development. Please check back for release information about MEGASAT and updated documentation.

Overview

The current version of MEGASAT is 1.0. The Perl scripts in MEGASAT should work with any relatively recent version of Perl and have been tested with versions 5.18.2 and 5.16.3.

The GUI in MEGASAT should work with any relatively recent version of Java and has been tested with versions 1.6.0_65 and 1.8.0_40.

The R script in MEGASAT should work with any relatively recent version of R and has been tested with version 3.2.0.

Installing MEGASAT

Windows

The “MEGASAT_1.0 for Windows” folder contains two Perl executable files, one executable Jar file and one R script. MEGASAT_GUI is the graphical user interface for running the two Perl executables (“MEGASAT_Genotype” and “MEGASAT_Update”) and the R script "Mplot.R". "Mplot.R" generates bar plots that display the length distribution for each individual and each locus.

To run "MEGASAT_GUI.jar", Java needs to be installed on your computer. Java can be downloaded from https://java.com/en/download/.

If Perl is already installed on your computer, you can click the Start button and open your Perl interpreter to run the Perl scripts. If Perl is not installed, the two main distributions for Windows are ActivePerl (http://www.activestate.com/activePerl) and Strawberry Perl (http://strawberryPerl.com/). We used the latter in the development and testing of MEGASAT and recommend its use. The scripts “MEGASAT_Genotype.pl” and "MEGASAT_Update.pl" use no complicated library functions.

To run "Mplot.R", an R interpreter needs to be installed on your computer. Because "Mplot.R" takes command-line arguments, which RStudio cannot access, you can either call it from the command line or use "MEGASAT_GUI" to invoke it. R for Windows can be downloaded from http://cran.r-project.org/bin/windows/base/.

Macintosh

On a Macintosh system, Perl should already be installed; type “perl -v” at the command line to confirm this. The Perl scripts can then be invoked from the terminal. If you prefer not to run scripts in the terminal, a simple GUI is also provided to invoke the two Perl scripts and the R script.

To run "MEGASAT_GUI.jar", Java needs to be installed on your computer. Java can be downloaded from https://java.com/en/download/.

To run "Mplot.R", an R interpreter also needs to be installed on your computer. R for Mac can be downloaded from http://cran.r-project.org/bin/macosx/.

Linux

Perl should already be installed on a Linux system, so the Perl scripts can be invoked directly from the terminal.

An R interpreter also needs to be installed to run "Mplot.R". R can be downloaded from http://cran.r-project.org

Input file formats

MEGASAT_Genotype.pl

“MEGASAT_Genotype.pl” requires an input file with information about PCR primers, and a set of .fastq or .fasta files representing reads from each sampled locus.

Primer file

The primer file must be in a tab-separated format, with the following columns:
- Column1: locus name
- Column2: 5' microsatellite primer
- Column3: reverse-complement of 3' microsatellite primer
- Column4: 3’ flank
- Column5: 5’ flank
- Column6: the repeat_unit_sequence
- Column7: the ratios group (you don't need to write this column if you want to use all the default ratios of MEGASAT)

The text file must begin with a header line that gives the column names. If a locus has no 3’ flank and 5' flank, write the character “X” in both the 3’ flank and 5' flank columns. Column 7 holds six ratios separated by commas, which are used as follows. Let A1 be the largest allele peak and A2 the second largest allele peak, and let R1, R2, R3, R4, R5 and R6 denote the six ratios in order.
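As a rough illustration of the layout described above, the following Python sketch reads a primer file into one record per locus. It is not part of MEGASAT: the function name `read_primer_file` and the field names are hypothetical labels for the seven columns, and the header line is simply skipped rather than interpreted.

```python
import csv

# Hypothetical field names for the seven columns described above;
# the actual header text in a MEGASAT primer file may differ.
FIELDS = ["locus", "primer_5", "primer_3_rc", "flank_3", "flank_5",
          "repeat_unit", "ratios"]


def read_primer_file(lines):
    """Return one dict per locus from tab-separated primer file lines."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the required header line
    loci = []
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # ignore blank lines
        # Column 7 is optional, so pad short rows with None.
        row = row + [None] * (len(FIELDS) - len(row))
        record = dict(zip(FIELDS, row))
        # "X" marks a locus without flanking sequence.
        for key in ("flank_3", "flank_5"):
            if record[key] == "X":
                record[key] = None
        loci.append(record)
    return loci
```

An open file handle can be passed directly, for example `read_primer_file(open("primers.txt", newline=""))`.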

When A1 is smaller (shorter) than A2, first check whether depth(A2)/depth(A1) >= R1. If not, the locus is scored as homozygous, that is, A1 A1 are the real alleles. If so, check whether there is a stutter peak A3 or A4 that is larger than A2 with depth(A3 or A4)/depth(A2) >= R2. If there is, check whether A3 or A4 is one repeat unit larger than A2 and depth(A3)/depth(A2) or depth(A4)/depth(A2) >= R3. If this condition is met, A1 A3 or A1 A4 is the real genotype; if not, the locus is scored as "unscorable unscorable" because it may have three alleles. If depth(A3 or A4)/depth(A2) is smaller than R2, A1 A2 is the real genotype.
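The case just described (A1 smaller than A2) can be sketched in Python as follows. This is one reading of the rule, not MEGASAT's actual code. It assumes each peak is given as a (length, depth) pair, that "smaller" and "larger" refer to allele length, that a single candidate stutter peak is supplied by the caller, and that the R2 test compares depth(A3)/depth(A2) with R2. The function name `call_a1_smaller` is hypothetical; the default ratios are the documented MEGASAT defaults for R1, R2 and R3.

```python
UNSCORABLE = ("unscorable", "unscorable")


def call_a1_smaller(a1, a2, stutter, repeat_len, r1=0.15, r2=0.4, r3=0.7):
    """Call a genotype when the largest peak A1 is shorter than A2.

    a1, a2     -- (length, depth) of the largest and second largest peaks
    stutter    -- (length, depth) of a candidate A3/A4 peak, or None
    repeat_len -- length of one repeat unit, in the same units as length
    """
    len1, depth1 = a1
    len2, depth2 = a2
    if depth2 / depth1 < r1:
        return (len1, len1)  # A2 too weak: homozygous A1 A1
    if stutter is None:
        return (len1, len2)  # no candidate third peak: A1 A2
    len3, depth3 = stutter
    if len3 <= len2 or depth3 / depth2 < r2:
        return (len1, len2)  # third peak not large or deep enough: A1 A2
    if len3 - len2 == repeat_len and depth3 / depth2 >= r3:
        return (len1, len3)  # A2 treated as stutter of the longer peak
    return UNSCORABLE        # possibly three alleles
```

For example, with a 2-base repeat unit, peaks (100, 200) and (104, 100) plus a candidate peak (106, 80) give the call (100, 106) under these assumptions.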

When A1 is larger (longer) than A2, check whether either A1-A2 >= 3 and depth(A2)/depth(A1) >= R4, or A1-A2 <= 2 and depth(A2)/depth(A1) >= R5 (requirement 1). If requirement 1 is met, check whether there is a stutter peak A3 > A1 with depth(A3)/depth(A1) > R6. If there is, the locus is scored as "unscorable unscorable"; if not, A2 A1 is the real genotype. If requirement 1 is not met, check whether A3 > A1 and depth(A3)/depth(A1) >= R6 (requirement 2). If requirement 2 is met, check whether there is a stutter peak A4 that is one repeat unit larger than A3 with depth(A4)/depth(A3) >= R3. If so, A1 A4 is the real genotype; if not, A1 A3 is the real genotype. If neither requirement 1 nor requirement 2 is met, the locus is scored as homozygous, that is, A1 A1 are the real alleles.

To use the default ratios (0.15,0.4,0.7,0.6,0.8,0.2) to predict genotypes, omit this column from the primer file. To change only some of the six ratios, write your own values in the corresponding positions in column 7 and leave a space in the positions you do not want to change. For example, to change only the first ratio to 0.3, column 7 would be (0.3, , , , , ). Do not write the brackets in column 7.
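The fallback to default ratios can be illustrated with a short Python sketch. It is an illustration only, not MEGASAT's own parser; the function name `parse_ratios` is hypothetical, and rejecting an entry that does not have exactly six comma-separated positions is an assumption.

```python
DEFAULT_RATIOS = (0.15, 0.4, 0.7, 0.6, 0.8, 0.2)


def parse_ratios(column7=None):
    """Return six ratios, using the default wherever a position is blank."""
    if column7 is None or not column7.strip():
        return DEFAULT_RATIOS  # column omitted: all defaults
    fields = column7.split(",")
    if len(fields) != len(DEFAULT_RATIOS):
        raise ValueError("column 7 must contain six comma-separated positions")
    return tuple(
        float(field) if field.strip() else default
        for field, default in zip(fields, DEFAULT_RATIOS)
    )
```

With the example from the text, `parse_ratios("0.3, , , , , ")` replaces only R1 and keeps the other five defaults.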

Here is an example primer file.

Input sequence file

Input sequence read files can be in standard FASTQ format or FASTA format.

Here is an example of a short FASTQ file that will work with the primer file above.

MEGASAT_Update.pl

Scores file

“MEGASAT_Update.pl” requires an input file that lists each original genotype together with the new genotype it should be updated to. This scores txt file must be in comma-separated format, with the following columns:
- Column1: locus name
- Column2 & Column3: original genotype
- Column4 & Column5: new genotype
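To make the scores file layout concrete, here is a minimal Python sketch that reads it into a lookup table. It assumes the first line is a header and that alleles are kept as text. The function name `read_scores` is hypothetical, and how MEGASAT_Update.pl itself applies these entries is not shown here.

```python
import csv


def read_scores(lines):
    """Map (locus, original genotype) to the new genotype."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header line (assumed to be present)
    updates = {}
    for row in reader:
        if len(row) < 5:
            continue  # skip blank or incomplete lines
        locus, old1, old2, new1, new2 = (cell.strip() for cell in row[:5])
        updates[(locus, (old1, old2))] = (new1, new2)
    return updates
```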

Original genotype file

The other input file is the original genotyping txt file that was generated by "MEGASAT_Genotype.pl". This "Genotype.txt" file contains the genotyping information for all individuals and loci.

Mplot.R

"Mplot.R" requires an input folder that is the output folder generated by "MEGASAT_Genotype.pl".

Running MEGASAT

Running “MEGASAT_Genotype.pl”

If you don’t want to use the command line to invoke the scripts, a simple GUI is provided for Windows and Mac users. Double-clicking “MEGASAT_GUI” opens a window. In the tab page "main GUI", click the first “Open” button to select your input primers file. The text field under the “Open” button displays the path of your primers file. The second, small text field is for the number of mismatches tolerated in the forward and reverse primers. For the five prime flank and three prime flank, the number of mismatches is set based on their lengths. The next small text field is for the minimum depth threshold (we set it to 20 in our experiment). The second “Open” button selects the data set folder that contains the input sequence read files. The “Choose” button selects the directory in which to save your output folder. Two radio buttons on this page let you choose whether or not to compress the output folder. Once all these parameters are filled in, click “Run the program” to run the Perl scripts.

To run the scripts from the command line on Windows, make sure Perl is already installed on your system. We assume that “MEGASAT_Genotype.pl”, your primers txt file “primers.txt” and the data set folder “dataset” are all saved in the directory “C:\Users\Andy\Downloads”. To run the Perl script, open the command prompt and type “perl C:\Users\Andy\Downloads\MEGASAT_Genotype.pl C:\Users\Andy\Downloads\primers.txt 2 20 C:\Users\Andy\Downloads\dataset C:\Users\Andy\Desktop”.

The first command-line argument is the path to the primers txt file. The second command-line argument is the number of mismatches (2 is a good choice). The third command-line argument is the minimum depth threshold. The next argument is the path to the data set folder that contains the input sequence read files. The last command-line argument specifies the directory where you want to save your output. When the script finishes, an output folder called “Output_dataset” will be in the directory you specified on the command line.
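If you want to launch the script from another program, for instance to process several data sets in turn, the argument order above can be wrapped as in the Python sketch below. The helper names `genotype_command` and `run_genotype` are hypothetical, and the sketch assumes that `perl` is on the PATH.

```python
import subprocess


def genotype_command(script, primers, mismatches, min_depth, dataset, out_dir):
    """Build the argument list in the order described above."""
    return ["perl", script, primers, str(mismatches), str(min_depth),
            dataset, out_dir]


def run_genotype(script, primers, mismatches, min_depth, dataset, out_dir):
    """Run MEGASAT_Genotype.pl; raises CalledProcessError on a non-zero exit."""
    command = genotype_command(script, primers, mismatches, min_depth,
                               dataset, out_dir)
    subprocess.run(command, check=True)
```

Passing the arguments as a list rather than as one string means that paths containing spaces do not need extra quoting.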

Running “MEGASAT_Update.pl”

Double-clicking “MEGASAT_GUI” opens a window. In the tab page "update GUI", the first “Open” button selects your scores txt file, which contains the original genotypes and the new genotypes you want to update them to. The second “Open” button selects the original genotyping txt file generated by "MEGASAT_Genotype.pl". The “Choose” button selects the directory in which to save the new genotype file. Once all these parameters are filled in, click “Run the program” to run “MEGASAT_Update.pl”.

To run the scripts from the command line on Windows, make sure Perl is already installed on your system. We assume that “MEGASAT_Update.pl”, your scores txt file “Scores.txt” and the original genotyping txt file are all saved in the directory “C:\Users\Andy\Downloads”. To run the Perl script, open the command prompt and type “perl C:\Users\Andy\Downloads\MEGASAT_Update.pl C:\Users\Andy\Downloads\Scores.txt C:\Users\Andy\Downloads\Genotype.txt C:\Users\Andy\Desktop”.

The first command-line argument is the path to the scores txt file. The second command-line argument is the path to the original genotyping txt file "Genotype.txt" generated by "MEGASAT_Genotype.pl". The last command-line argument specifies the directory where you want to save your new txt file. When the script finishes, a new tab-separated txt file called “NewGenotype.txt” will be in the directory you specified on the command line.

Running “Mplot.R”

To check that R is correctly installed, type "R" at the command prompt (Windows) or in the terminal (Mac). If it is, R starts and shows its version information. Windows users should also add the path of R to the PATH environment variable.

Double-clicking "MEGASAT_GUI" opens a window. In the tab page "R GUI", the "Open" button selects the output folder that was generated by "MEGASAT_Genotype.pl". The "Choose" button selects the directory in which to save the histogram output folder. Once all these parameters are filled in, click “Run the program” to run “Mplot.R”.

To run the R script from the command line, type a command such as: Rscript /Users/Alex/Documents/Mplot.R /Users/Alex/Desktop/Output /Users/Alex/Documents. Here "Mplot.R" is assumed to be saved in "/Users/Alex/Documents". The first argument is the path to the input folder, which is the output folder generated by "MEGASAT_Genotype.pl". The second argument is the directory in which to save your output plot folder. When the program finishes, a folder called "Plots_Output" will have been created in "/Users/Alex/Documents".
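A common failure when calling the R script is that the R executables are not on the PATH. The Python sketch below checks for this before building the command. The helper name `mplot_command` is hypothetical, and `Rscript` is the launcher that ships with standard R installations.

```python
import shutil


def mplot_command(script, input_dir, output_dir, rscript="Rscript"):
    """Build the Mplot.R call, or raise if the launcher is not on the PATH."""
    executable = shutil.which(rscript)
    if executable is None:
        raise FileNotFoundError(
            f"{rscript} was not found on the PATH; is R installed?"
        )
    # Argument order: script, input folder, directory for the plot folder.
    return [executable, script, input_dir, output_dir]
```

The returned list can be passed to `subprocess.run(..., check=True)` to run the script.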

Output folder

MEGASAT_Genotype.pl

The output folder generated by “MEGASAT_Genotype.pl” contains three types of files. “Genotype.txt” is a tab-separated txt file that gives the genotype information for all individuals and loci. In this file, “X X” means that the locus does not occur in the individual, “0 0” means that the depth of the alleles is too small to score, and “Unscored Unscored” means that there are three possible real alleles, which makes the genotype difficult to determine. "Number_Discarded.txt" is a tab-separated txt file that counts the number of discarded sequences for all individuals and loci. Discarded sequences are sequences that have the 5' microsatellite primer but lack the flank, the repeat_unit_sequence or the reverse-complement of the 3' microsatellite primer. In "Number_Discarded.txt", "X" means that there are no discarded sequences for this individual at this locus. You can open these tab-separated txt files in Microsoft Excel, which makes them easier to read.
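The special codes in “Genotype.txt” can be decoded with a small helper such as the Python sketch below. It assumes that each genotype entry consists of two allele values separated by a space, as the codes above suggest. The function name `describe_genotype` is hypothetical.

```python
SPECIAL_CODES = {
    "X X": "locus not found in this individual",
    "0 0": "allele depth too small to score",
    "Unscored Unscored": "three possible alleles; genotype not determined",
}


def describe_genotype(entry):
    """Explain one genotype entry taken from Genotype.txt."""
    entry = " ".join(entry.split())  # normalise whitespace
    if entry in SPECIAL_CODES:
        return SPECIAL_CODES[entry]
    allele1, allele2 = entry.split(" ")  # assumes exactly two alleles
    kind = "homozygous" if allele1 == allele2 else "heterozygous"
    return f"{kind} genotype {allele1}/{allele2}"
```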

The txt files whose names start with “Genotype” followed by an individual name show the sequence length distribution for each microsatellite locus. In these files, the first row lists the different sequence lengths found across all loci in one individual. Each subsequent row shows the count of each length variant for one locus. The last column gives the genotype for each locus in that individual. The txt file whose name starts with "Ratios" followed by the data set name shows the ratios group for each microsatellite locus.

The split files whose names start with “Sorted”, followed by the individual name and locus name, contain all the non-trimmed sequences (only the 5' microsatellite primer is trimmed off) for one individual and one locus. The split files whose names start with “Trimmed” contain all the trimmed sequences. The remaining split files, whose names start with "Discarded", contain all the discarded sequences (the 5' microsatellite primer is trimmed off, but the 3' flank, 5' flank, repeat_unit_sequence or reverse-complement of the 3' microsatellite primer cannot be found).

MEGASAT_Update.pl

The Perl script “MEGASAT_Update.pl”, which provides a fast way to update Genotype.txt, outputs a tab-separated txt file called “NewGenotype.txt”. This file contains all the updated genotyping information.

Mplot.R

The output folder of "Mplot.R" contains one PDF file per locus. Each PDF holds histograms of the sequence length variation for that locus, showing the sequence-length frequency distributions of all individuals.