Difference between revisions of "MEGASAT"

From Bioinformatics Software

Revision as of 13:58, 4 May 2015

This page is currently under development. Please check back for release information about MEGASAT and updated documentation.

Overview

The current version of MEGASAT is 1.0. The MEGASAT scripts should work with any relatively recent version of Perl and have been tested with versions 5.18.2 and 5.16.3.

Installing MEGASAT

Windows

The “MEGASAT_1.0 for Windows” folder contains two Perl executable files, two executable JAR files and one R script. “MEGASAT_GUI” is the graphical user interface for running the main Perl executable file, “MEGASAT_Genotype”. “update_GUI” is the graphical user interface for running the Perl executable file “MEGASAT_Update”. The R script "Mplot.R" generates bar plots that display the length distribution for each individual and each locus.

If Perl is already installed on your computer, you can click the Start button and open your Perl interpreter to run the Perl scripts. If Perl is not installed and you still want to run the Perl scripts, the two main distributions for Windows are ActivePerl (http://www.activestate.com/activePerl) and Strawberry Perl (http://strawberryPerl.com/). We used the latter in the development and testing of MEGASAT, and recommend it. The scripts “MEGASAT_Genotype.pl” and "MEGASAT_Update.pl" use no complicated library functions.

To run "Mplot.R", an R interpreter must be installed on your computer. Because "Mplot.R" takes command-line arguments and RStudio cannot access command-line arguments, "Mplot.R" must be called from the command line. R can be downloaded from http://cran.r-project.org/bin/windows/base/.

Macintosh

If you are using a Macintosh system, Perl should already be installed; type “perl -v” at the command line to confirm this. The Perl scripts can then be invoked directly from the terminal. If you prefer not to run the scripts in the terminal, two simple GUIs are also provided to invoke the two executable Perl scripts.

To run "Mplot.R", an R interpreter must also be installed on your computer. R can be downloaded from http://cran.r-project.org/bin/macosx/.

Linux

Perl should already be installed on a Linux system, so the Perl scripts can be invoked directly from the terminal.

On a Linux system, an R interpreter must also be installed to run "Mplot.R". R can be downloaded from http://cran.r-project.org

Input file formats

MEGASAT_Genotype.pl

“MEGASAT_Genotype.pl” requires an input file with information about PCR primers, and a set of .fastq files representing reads from each sampled locus.

Primer file

The primer file must be in tab-separated format, with the following headers:
- Column1: locus name
- Column2: forward primers
- Column3: reverse primers
- Column4: 3’ flank
- Column5: 5’ flank
- Column6: the repeat array

A header line is required in this text file to name the columns. If a locus has no 3’ flank, the character “A” must be written in the 3’ flank column for that locus. If a locus has no 5’ flank, the 5’ flank column can be left empty.
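The primer file layout described above can be sketched in Python. This is only an illustration of the six-column, tab-separated format: the header names in PRIMER_COLUMNS, the function read_primer_file and the example locus and sequences are made up for this sketch and are not taken from MEGASAT itself.

```python
import csv
import io

# Hypothetical field names: the documentation fixes the column order,
# not the exact header text.
PRIMER_COLUMNS = ["locus", "forward_primer", "reverse_primer",
                  "flank_3prime", "flank_5prime", "repeat_array"]


def read_primer_file(handle):
    """Parse a tab-separated primer file into a list of dicts."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the required header line
    records = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        # Pad short rows so that empty trailing columns still parse
        row = row + [""] * (len(PRIMER_COLUMNS) - len(row))
        records.append(dict(zip(PRIMER_COLUMNS, row)))
    return records


# Made-up example: "A" marks a missing 3' flank, and the 5' flank is empty.
example = (
    "Locus\tForward\tReverse\t3' flank\t5' flank\tRepeat\n"
    "LocA\tACGTACGT\tTTGCAAGC\tA\t\tGT\n"
)
primers = read_primer_file(io.StringIO(example))
print(primers[0])
```

In this sketch the “A” placeholder is stored as-is; how MEGASAT interprets it internally is not described here.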

Here is an example primer file.

Input sequence file

Input sequence read files must be in standard FASTQ format.

Here is an example of a short FASTQ file that will work with the primer file above.

MEGASAT_Update.pl

Scores file

“MEGASAT_Update.pl” requires an input file that lists the original genotypes and the new genotypes that should replace them. This scores text file must be in comma-separated format, with the following headers:
- Column1: locus name
- Column2 & Column3: original genotype
- Column4 & Column5: new genotype
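As an illustration of the scores file layout, the following Python sketch reads a five-column, comma-separated file into (locus, original genotype, new genotype) tuples. It assumes that the first line is a header line; the header names, the function read_scores and the allele values are invented for the example.

```python
import csv
import io


def read_scores(handle):
    """Return (locus, original_genotype, new_genotype) tuples."""
    reader = csv.reader(handle)
    next(reader)  # assumed header line
    updates = []
    for row in reader:
        if len(row) < 5:
            continue  # ignore blank or incomplete lines
        locus, old_a, old_b, new_a, new_b = (f.strip() for f in row[:5])
        updates.append((locus, (old_a, old_b), (new_a, new_b)))
    return updates


# Made-up example values
example = (
    "Locus,Old1,Old2,New1,New2\n"
    "LocA,120,124,120,126\n"
)
updates = read_scores(io.StringIO(example))
print(updates)
```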

Original genotype file

The other input file is the original genotyping text file that was generated by "MEGASAT_Genotype.pl". This "Genotype.txt" file contains genotyping information for different individuals and loci.

Mplot.R

"Mplot.R" requires an input folder that contains all the csv files generated by "MEGASAT_Genotype.pl" whose names start with “Genotype” followed by the individual name.

Running MEGASAT

Running “MEGASAT_Genotype.pl”

If you prefer not to invoke the scripts from the command line, a simple GUI is provided for Windows, Mac and Linux users. Double-clicking “MEGASAT_GUI” opens a pop-up window. In this window, click the first “Open” button to select your input primers file; the text field under the button displays the path of the selected file. The second, smaller text field is for entering the number of mismatches. The second “Open” button selects the data set folder that contains the input sequence read files. The “Choose” button selects the directory in which to save the output folder. Two radio buttons let you choose whether or not to compress the output folder. Once all these parameters are filled in, click “Run the program” to run the Perl script.

To run the script from the command line on Windows, make sure that Perl is installed on your system. Assume that “MEGASAT_Genotype.pl”, your primers text file “primers.txt” and the data set folder “dataset” are all saved in the directory “C:\Users\Andy\Downloads”. To run the Perl script, open the command prompt and type “perl C:\Users\Andy\Downloads\MEGASAT_Genotype.pl C:\Users\Andy\Downloads\primers.txt 2 C:\Users\Andy\Downloads\dataset C:\Users\Andy\Desktop”.

The first command-line argument is the path of the primers text file. The second is the number of mismatches (2 is a good choice). The third is the path of the data set folder that contains the input sequence read files. The last argument specifies the directory in which to save the output. When the script finishes, an output folder called “Output_dataset” will be in the directory you specified.
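The argument order described above can also be assembled programmatically, for example when the script is launched from another program. The following Python sketch only builds the argument list; the helper genotype_command is hypothetical and the paths are the example paths used above.

```python
from pathlib import PureWindowsPath


def genotype_command(script, primers, mismatches, dataset, output_dir):
    """Build the argument list for MEGASAT_Genotype.pl in the documented order."""
    return ["perl", str(script), str(primers), str(mismatches),
            str(dataset), str(output_dir)]


downloads = PureWindowsPath(r"C:\Users\Andy\Downloads")
cmd = genotype_command(
    downloads / "MEGASAT_Genotype.pl",   # the script itself
    downloads / "primers.txt",           # primers text file
    2,                                   # number of mismatches
    downloads / "dataset",               # folder with the FASTQ files
    PureWindowsPath(r"C:\Users\Andy\Desktop"),  # where the output is saved
)
print(" ".join(cmd))
# To actually run it (requires Perl and the script to be present):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than as one string avoids quoting problems when a path contains spaces.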

Running “MEGASAT_Update.pl”

A simple GUI, “update_GUI”, is also provided for Windows, Mac and Linux users. Double-clicking “update_GUI” opens a pop-up window. In this window, the first “Open” button selects your scores text file, which contains the original genotypes and the new genotypes that should replace them. The second “Open” button selects the original genotyping text file generated by "MEGASAT_Genotype.pl". The “Choose” button selects the directory in which to save the new genotype file. Once all these parameters are filled in, click “Run the program” to run “MEGASAT_Update.pl”.

To run the script from the command line on Windows, make sure that Perl is installed on your system. Assume that “MEGASAT_Update.pl”, your scores text file “Scores.txt” and the original genotyping text file are all saved in the directory “C:\Users\Andy\Downloads”. To run the Perl script, open the command prompt and type “perl C:\Users\Andy\Downloads\MEGASAT_Update.pl C:\Users\Andy\Downloads\Scores.txt C:\Users\Andy\Downloads\Genotype.txt C:\Users\Andy\Desktop”.

The first command-line argument is the path of the scores text file. The second is the path of the original genotyping text file "Genotype.txt" generated by "MEGASAT_Genotype.pl". The last argument specifies the directory in which to save the new text file. When the script finishes, a new tab-separated text file called “NewGenotype.txt” will be in the directory you specified.

Running “Mplot.R”

Output folder

MEGASAT_Genotype.pl

The output folder generated by “MEGASAT_Genotype.pl” has three types of files. In this folder, “Genotype.txt” is a tab-separated text file that gives the genotype information for all individuals and loci. In this Genotype.txt file, “X X” means that the locus does not occur in the individual. “0 0” means that the depth of the alleles is too low to score. “Unscorable Unscorable” means that there are three possible real alleles, which makes the genotype difficult to determine. You can open this tab-separated text file in Microsoft Excel, which makes it easier to read.

The text files whose names start with “Genotype” followed by the individual name show the length distribution of each microsatellite locus. In these files, the first row lists the different lengths for all the loci in one individual. Each row below the first gives the number of occurrences of each length for one locus. The last column holds the genotype information for all loci in that individual.

The split files whose names start with “Sorted”, followed by the individual name and the locus name, contain all the untrimmed sequences for one individual and one locus. The split files whose names start with “Trimmed” contain the corresponding trimmed sequences.
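When post-processing Genotype.txt, the three special values described above can be told apart from ordinary scored genotypes. The following Python sketch assumes that a genotype is available as two whitespace-separated allele values; the function name, the status labels and the example alleles “120 124” are invented for illustration.

```python
def classify_genotype(cell):
    """Map a genotype entry to a status label using the documented special values."""
    alleles = tuple(cell.split())  # splits on spaces or tabs
    if alleles == ("X", "X"):
        return "locus absent"
    if alleles == ("0", "0"):
        return "depth too low"
    if alleles == ("Unscorable", "Unscorable"):
        return "unscorable"
    return "scored"


cells = ["X X", "0 0", "Unscorable Unscorable", "120 124"]
statuses = [classify_genotype(c) for c in cells]
print(statuses)
```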

MEGASAT_Update.pl

The Perl script “MEGASAT_Update.pl” provides a fast way to update Genotype.txt. Its output is a tab-separated text file called “NewGenotype.txt”, which contains all the updated genotyping information.

Mplot.R

The output folder of "Mplot.R" contains multiple PDF files, each holding bar plots that represent the length distribution for each individual at each locus. Each PDF shows the length distribution of all individuals for one locus.