Difference between revisions of "MEGASAT"

From Bioinformatics Software

Revision as of 16:27, 1 May 2015

This page is currently under development. Please check back for release information about MEGASAT and updated documentation.

Overview

The current version of MEGASAT is 1.0. The MEGASAT scripts should work with any relatively recent version of Perl and have been tested with versions 5.18.2 and 5.16.3.

Installing MEGASAT

Windows

The “MEGASAT_1.0 for Windows” folder contains two Perl executable files and two executable JAR files. “MEGASAT_GUI” is the graphical user interface for running the main Perl executable file, called “MS”. “update_GUI” is the graphical user interface for running the Perl executable file called “updateMatrix”.

If Perl is already installed on your computer, you can click the Start button, open your Perl interpreter, and run the Perl scripts from there. If Perl is not installed, the two main distributions for Windows are ActivePerl (http://www.activestate.com/activePerl) and Strawberry Perl (http://strawberryPerl.com/). We used Strawberry Perl in the development and testing of MEGASAT and recommend it. The script “SatGenotype.pl” uses no complicated library functions, while the script “updateMatrix.pl” uses two packages, “Spreadsheet::ParseExcel” and “Spreadsheet::WriteExcel”. Installation instructions for “Spreadsheet::WriteExcel” are available at http://www.j-tsurugashima.com/cgi/lib/Spreadsheet/WriteExcel/doc/install.html.

Macintosh

If you are using a Macintosh system, Perl should already be installed; type “perl -v” at the command line to confirm this. The Perl scripts can then be invoked directly from the terminal. If you prefer not to use the terminal, two simple GUIs are also provided to invoke the two executable Perl scripts. To run “updateMatrix.pl”, the two packages “Spreadsheet::ParseExcel” and “Spreadsheet::WriteExcel” must be installed. Installation instructions are available at http://www.j-tsurugashima.com/cgi/lib/Spreadsheet/WriteExcel/doc/install.html.

Linux

Perl should already be installed on Linux systems, so the Perl scripts can be invoked directly from the terminal. To run “updateMatrix.pl”, install the two Spreadsheet packages by following the installation instructions linked above.
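The installation checks described above can be automated. The following is a minimal sketch in Python, not part of MEGASAT: it looks for a `perl` executable on the PATH and then asks Perl to load each of the two Spreadsheet packages using the standard `perl -MModule -e 1` idiom. The function names are hypothetical and chosen only for illustration.

```python
import shutil
import subprocess

# The two packages that updateMatrix.pl needs, as named in the text above.
REQUIRED_MODULES = ["Spreadsheet::ParseExcel", "Spreadsheet::WriteExcel"]


def module_check_command(module, perl="perl"):
    """Build a command that exits with status 0 only if Perl can load `module`."""
    return [perl, "-M" + module, "-e", "1"]


def missing_modules(modules=REQUIRED_MODULES, perl="perl"):
    """Return the modules Perl cannot load, or None if Perl itself is not found."""
    if shutil.which(perl) is None:
        return None
    missing = []
    for module in modules:
        result = subprocess.run(module_check_command(module, perl),
                                capture_output=True)
        if result.returncode != 0:
            missing.append(module)
    return missing


if __name__ == "__main__":
    result = missing_modules()
    if result is None:
        print("perl was not found on PATH")
    elif result:
        print("Missing Perl modules:", ", ".join(result))
    else:
        print("Perl and both Spreadsheet packages are available")
```

If a package is reported as missing, follow the installation instructions linked above before running “updateMatrix.pl”.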

Input file formats

“SatGenotype.pl” requires an input file with information about PCR primers, and a set of .fastq files representing reads from each sampled locus.

Primer file

The primer file must be in a tab-separated format, with the following columns:
- Column1: locus name
- Column2: forward primers
- Column3: reverse primers
- Column4: 3’ flank
- Column5: 5’ flank
- Column6: the repeat array

The first line of this txt file must be a header line that gives the column names. If a locus has no 3’ flank, write the single character “A” in its 3’ flank column. If a locus has no 5’ flank, leave its 5’ flank column empty.
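The column layout and the placeholder rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not MEGASAT code: the column names in `COLUMNS` are hypothetical (the text fixes only the column order and requires that a header line exists), and the locus name and sequences in the usage example are made up.

```python
import csv
import io

# Hypothetical header names; only the column order comes from the text.
COLUMNS = ["locus", "forward_primer", "reverse_primer",
           "3_flank", "5_flank", "repeat"]


def primer_row(locus, forward, reverse, flank3, flank5, repeat):
    """Return one primer-file row, applying the placeholder rule for flanks."""
    return [locus, forward, reverse,
            flank3 if flank3 else "A",  # a missing 3' flank is written as "A"
            flank5 if flank5 else "",   # a missing 5' flank is left empty
            repeat]


def write_primer_file(handle, rows):
    """Write the header line and the rows as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)


def check_primer_file(handle):
    """Return a list of problems found in a tab-separated primer file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return ["file is empty (a header line is required)"]
    problems = []
    if len(header) != 6:
        problems.append(f"line 1: expected 6 column names, found {len(header)}")
    for number, row in enumerate(reader, start=2):
        if len(row) != 6:
            problems.append(f"line {number}: expected 6 columns, found {len(row)}")
        elif not row[3]:
            problems.append(f"line {number}: 3' flank is empty; write A instead")
    return problems


if __name__ == "__main__":
    buffer = io.StringIO()
    # Made-up locus with no 3' flank and no 5' flank.
    write_primer_file(buffer, [primer_row("Locus1", "ACGT", "TGCA", "", "", "AC")])
    print(buffer.getvalue())
    print(check_primer_file(io.StringIO(buffer.getvalue())))
```

The checker only verifies the column count and the 3’ flank rule; it does not validate the sequences themselves.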

Here is an example primer file.

Input sequence file

Input sequence read files must be in standard FASTQ format.
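As a quick sanity check before running MEGASAT, a read file can be tested against the FASTQ layout. The sketch below is an illustration, not MEGASAT code, and assumes the common four-line record form: a header line starting with “@”, the sequence, a separator line starting with “+”, and a quality string of the same length as the sequence. The function name `read_fastq` is hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from four-line FASTQ records.

    Raises ValueError on the first malformed record.
    """
    record_number = 0
    while True:
        header = handle.readline()
        if header == "":          # end of file
            return
        if header.strip() == "":  # tolerate blank lines between records
            continue
        record_number += 1
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"record {record_number}: header does not start with '@'")
        if not separator.startswith("+"):
            raise ValueError(f"record {record_number}: separator does not start with '+'")
        if len(sequence) != len(quality):
            raise ValueError(f"record {record_number}: sequence and quality lengths differ")
        yield header[1:].strip(), sequence, quality


if __name__ == "__main__":
    # Made-up single-read example.
    example = io.StringIO("@read1\nACGT\n+\nIIII\n")
    for read_id, sequence, quality in read_fastq(example):
        print(read_id, sequence, quality)
```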

Put example sequence file here

“updateMatrix.pl” requires a scores file that lists the original genotypes and the new genotypes they should be updated to. This scores txt file must be in comma-separated format, with the following columns:
- Column1: locus name
- Column2 & Column3: original genotype
- Column4 & Column5: new genotype

The second input file is your original genotyping Excel file, which contains the genotyping information for the different individuals and loci.
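The five-column scores layout can be read as sketched below. This is a minimal illustration, not the logic of “updateMatrix.pl”: it assumes the first line of the file is a header line, and the header names, locus name, and allele values in the usage example are made up.

```python
import csv
import io


def read_scores(handle):
    """Return a list of (locus, original_genotype, new_genotype) tuples.

    Each genotype is a pair of allele strings. The first line is assumed
    to be a header line and is skipped.
    """
    reader = csv.reader(handle)
    next(reader, None)  # skip the assumed header line
    updates = []
    for number, row in enumerate(reader, start=2):
        if len(row) != 5:
            raise ValueError(f"line {number}: expected 5 columns, found {len(row)}")
        locus, old1, old2, new1, new2 = (field.strip() for field in row)
        updates.append((locus, (old1, old2), (new1, new2)))
    return updates


if __name__ == "__main__":
    example = io.StringIO("locus,orig1,orig2,new1,new2\nLocus1,100,104,100,102\n")
    for locus, original, new in read_scores(example):
        print(locus, original, "->", new)
```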

Running MEGASAT

Running “SatGenotype.pl”

If you prefer not to use the command line to invoke the scripts, a simple GUI is provided for Windows, Mac and Linux users. Double-clicking “MEGASAT_GUI” opens a pop-up window. In this window, click the first “Open” button to select your input primers file; the text field under the button then displays the location of the primers file. The second, smaller text field is for entering the number of mismatches. The second “Open” button selects the data set folder that contains the input sequence read files. The “Choose” button selects the directory in which the output folder will be saved. Two radio buttons let you choose whether or not to compress the output folder. Once all of these parameters are filled in, click “Run the program” to run the Perl script.

To run the script from the command line on Windows, first make sure that Perl is installed on your system. Assume that “SatGenotype.pl”, your primers txt file “primers.txt”, and the data set folder “dataset” are all saved in the directory “C:\Users\Andy\Downloads”. To run the Perl script, open the command prompt and type “perl C:\Users\Andy\Downloads\SatGenotype.pl C:\Users\Andy\Downloads\primers.txt 2 C:\Users\Andy\Downloads\dataset C:\Users\Andy\Desktop”.

The first command-line argument is the path to the primers txt file. The second is the number of mismatches (2 is a good choice). The third is the path to the data set folder that contains the input sequence read files. The last specifies the directory in which the output should be saved. When the script completes, an output folder called “Output_dataset” will be in the output directory given on the command line.

Running “updateMatrix.pl”

A simple GUI, “update_GUI”, is also provided for Windows, Mac and Linux users. Double-clicking “update_GUI” opens a pop-up window. In this window, the first “Open” button selects your scores txt file, which contains the original genotypes and the new genotypes they should be updated to. The second “Open” button selects the original genotyping Excel file. The “Choose” button selects the directory in which the new Excel file will be saved. Once all of these parameters are filled in, click “Run the program” to run “updateMatrix.pl”.

To run the script from the command line on Windows, first make sure that Perl is installed on your system. Assume that “updateMatrix.pl”, your scores txt file “Scores.txt”, and the original genotyping Excel file “output.xls” are all saved in the directory “C:\Users\Andy\Downloads”. To run the Perl script, open the command prompt and type “perl C:\Users\Andy\Downloads\updateMatrix.pl C:\Users\Andy\Downloads\Scores.txt C:\Users\Andy\Downloads\output.xls C:\Users\Andy\Desktop”.

The first command-line argument is the path to the scores txt file. The second is the path to the original genotyping Excel file. (This Excel file comes from the output.txt in the output folder generated by “SatGenotype.pl”; you can save output.txt as an Excel file.) The last argument specifies the directory in which the new Excel file should be saved. When the script completes, a new Excel file called “Newoutput.xls” will be in the output directory given on the command line.
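The two command lines described in this section can also be assembled programmatically, which avoids typing mistakes such as stray spaces inside a path. The sketch below is an illustration in Python, not part of MEGASAT: it only builds the argument lists in the order described above, using the example paths from the text. The helper names are hypothetical, and actually running the commands (for example with `subprocess.run`) requires Perl and the scripts to be present.

```python
from pathlib import PureWindowsPath

# Example locations taken from the text above.
DOWNLOADS = PureWindowsPath(r"C:\Users\Andy\Downloads")
DESKTOP = PureWindowsPath(r"C:\Users\Andy\Desktop")


def satgenotype_command(script, primers, mismatches, dataset, out_dir, perl="perl"):
    """Argument order: primers file, mismatches, data set folder, output directory."""
    return [perl, str(script), str(primers), str(mismatches),
            str(dataset), str(out_dir)]


def updatematrix_command(script, scores, genotypes_xls, out_dir, perl="perl"):
    """Argument order: scores file, original genotyping Excel file, output directory."""
    return [perl, str(script), str(scores), str(genotypes_xls), str(out_dir)]


genotype_cmd = satgenotype_command(DOWNLOADS / "SatGenotype.pl",
                                   DOWNLOADS / "primers.txt",
                                   2,
                                   DOWNLOADS / "dataset",
                                   DESKTOP)
update_cmd = updatematrix_command(DOWNLOADS / "updateMatrix.pl",
                                  DOWNLOADS / "Scores.txt",
                                  DOWNLOADS / "output.xls",
                                  DESKTOP)

if __name__ == "__main__":
    # Print the commands as they would be typed at the command prompt.
    print(" ".join(genotype_cmd))
    print(" ".join(update_cmd))
```

Passing the arguments as a list rather than a single string means that paths containing spaces do not need extra quoting when the list is handed to `subprocess.run`.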

Output folder

The output folder generated by “SatGenotype.pl” contains three types of files. The first, “output.txt”, is a comma-separated txt file that gives the genotype information for all individuals and loci. In this file, “X X” means that the locus does not occur in the individual, “0 0” means that the depth of the alleles is too low to score, and “Unscorable Unscorable” means that there are three possible real alleles, which makes the genotype difficult to determine. Because the file is comma-separated, you can open it in Microsoft Excel, which makes it easier to read.
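The three special codes in output.txt can be told apart from scored genotypes with a small lookup, sketched below in Python. This is an illustration only: the exact cell layout of output.txt is an assumption here (the two alleles of a genotype are taken as one string separated by whitespace or a comma), the function name is hypothetical, and the allele values in the usage example are made up.

```python
# Meaning of the special genotype codes, as described in the text above.
STATUS_BY_CODE = {
    ("X", "X"): "locus not found in this individual",
    ("0", "0"): "allele depth too low to score",
    ("Unscorable", "Unscorable"): "three possible real alleles",
}


def classify_genotype(cell):
    """Return a description for a special code, or "scored" for any other genotype."""
    alleles = tuple(cell.replace(",", " ").split())
    if len(alleles) != 2:
        raise ValueError(f"expected two alleles, found {len(alleles)}: {cell!r}")
    return STATUS_BY_CODE.get(alleles, "scored")


if __name__ == "__main__":
    for cell in ["X X", "0 0", "Unscorable Unscorable", "100 104"]:
        print(cell, "->", classify_genotype(cell))
```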

The txt files whose names start with “output” followed by the individual name show the length distribution of each microsatellite locus. In these files, the first row lists the different lengths for all the loci in one individual. Each subsequent row shows the number of occurrences of each length for one locus. The last column gives the genotype of each locus in that individual.

The split files whose names start with “Sorted”, followed by the individual name and locus name, contain all the non-trimmed sequences for one individual and one locus. The corresponding split files whose names start with “Trimmed” contain the trimmed sequences.

The other Perl script, “updateMatrix.pl”, which helps to update output.xls quickly, produces an Excel file called “Newoutput.xls”. This file contains all the updated genotyping information.