QDD

Program for microsatellite selection and primer design

Naming output files in command line version

All output files are written to the output folder (-out_folder). If -outfile_string is not specified, output files are named after the input file, followed by a complementary string that refers to the content of the file (e.g. input: sample.fas, output: sample_pipe1_log.txt). If -outfile_string is set (e.g. -outfile_string test), the complementary string referring to the content of the output file is appended to the -outfile_string value (e.g. test_pipe1_log.txt).

If the output file name already exists, the files are numbered, e.g. sample_pipe1_log_v1.txt
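
The naming rule above can be sketched in Python as follows. This is only an illustration, not QDD's own code: the function name output_name and its arguments are hypothetical, and the assumption that numbering continues as _v2, _v3, ... after _v1 is not stated in this documentation.

```python
from pathlib import Path


def output_name(out_folder, input_file, suffix, outfile_string=None):
    """Return an output path such as <out_folder>/sample_pipe1_log.txt."""
    # Prefix: the -outfile_string value if given, otherwise the input
    # file name without its extension (sample.fas -> sample).
    prefix = outfile_string if outfile_string else Path(input_file).stem
    stem, ext = suffix.rsplit(".", 1)  # "pipe1_log.txt" -> "pipe1_log", "txt"
    candidate = Path(out_folder) / f"{prefix}_{stem}.{ext}"
    version = 1
    # Assumption: existing names are avoided by adding _v1, _v2, ...
    while candidate.exists():
        candidate = Path(out_folder) / f"{prefix}_{stem}_v{version}.{ext}"
        version += 1
    return candidate
```

For example, output_name("results", "sample.fas", "pipe1_log.txt") gives results/sample_pipe1_log.txt when no such file exists yet, and passing outfile_string="test" gives results/test_pipe1_log.txt.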


Output files of PIPE1

File names in Galaxy are in parentheses
  1. xxx_pipe1_log.txt (pipe1 Log file) => Parameters of the run and summary
  2. xxx_pipe1_for_pipe2.fas (Input for pipe2) => Fasta file with microsatellite-containing sequences longer than a preset limit (-length_limit) after adapter clipping; this is the input file for pipe2.pl
  3. xxx_pipe1_wov.fas (pipe1 Length selected fasta) => Fasta file with sequences longer than -length_limit after adapter clipping
  4. xxx_pipe1_ms.csv (pipe1 Microsatellite positions) => File with information on microsatellite motifs and positions in each sequence. Semicolons are used for separating columns
    • Column1: Sequence code
    • Column2: number of microsatellites in the sequence
    • Column3: length of the sequence
    • Column4: motif of the first microsatellite
    • Column5: first position of the microsatellite
    • Column6: last position of the microsatellite
    • Column7: number of repeats of the microsatellite
    • Columns4-7 are repeated for all microsatellites
  5. xxx_pipe1_length_info.tabular (pipe1 Length info) => Information on sequence length and adapter clipping; tab separated columns
    • Column1: Sequence code
    • Column2: Original length of the sequence
    • Column3: Number of bases cut from the beginning of the sequence
    • Column4: Number of bases cut from the end of the sequence
    • Column5: Length of the sequence after cutting adapter/vector
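
As an illustration of the two column layouts above, here is a minimal Python sketch for reading one line of each file. The function names are hypothetical, and it assumes that positions and repeat numbers are written as integers and that the files have no header line; neither point is stated in this documentation.

```python
def parse_ms_line(line):
    """Parse one semicolon-separated line of xxx_pipe1_ms.csv."""
    fields = line.strip().split(";")
    # Drop empty fields left by a trailing semicolon, if any.
    while fields and fields[-1] == "":
        fields.pop()
    record = {
        "code": fields[0],
        "n_ms": int(fields[1]),
        "length": int(fields[2]),
        "microsatellites": [],
    }
    # Columns 4-7 (motif, first position, last position, repeats)
    # are repeated once per microsatellite.
    for i in range(3, len(fields) - 3, 4):
        motif, first, last, repeats = fields[i:i + 4]
        record["microsatellites"].append({
            "motif": motif,
            "first": int(first),
            "last": int(last),
            "repeats": int(repeats),
        })
    return record


def parse_length_line(line):
    """Parse one tab-separated line of xxx_pipe1_length_info.tabular."""
    code, original, cut_start, cut_end, final = line.rstrip("\n").split("\t")
    return {
        "code": code,
        "original_length": int(original),
        "cut_from_start": int(cut_start),
        "cut_from_end": int(cut_end),
        "final_length": int(final),
    }
```

With the made-up line seq1;2;250;AC;10;29;10;AGG;100;123;8, parse_ms_line returns a record with two microsatellites, the second one having motif AGG.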


Output files of PIPE2

File names in Galaxy are in parentheses
  1. xxx_pipe2_log.txt (pipe2 Log file) => Parameters of the run and summary
  2. xxx_pipe2_for_pipe3.fas (Input for pipe3) => Fasta file with all unique sequences (singletons and consensus); Input file for pipe3.pl
  3. xxx_pipe2_singleton.fas (pipe2 Singleton) => Fasta file with singletons (the only BLAST hit is autohit)
  4. xxx_pipe2_nohit_css.fas (pipe2 Nohit CSS) => Fasta file with low complexity sequences (no BLAST hit to itself)
  5. xxx_pipe2_multihit_css.fas (pipe2 Multihit) => Fasta file with putative minisatellites (more than one hit (local alignment) between a pair of sequences)
  6. xxx_pipe2_grouped.fas (pipe2 Grouped) => Fasta file with sequences (including consensuses) that had a BLAST hit to other sequences, with identity of the overlapping region below the limit. This can be either a partial similarity (only a region of the two sequences can be aligned), or the two sequences align over their entire length but the percentage of similarity is below the limit. Regions covered by BLAST hits are masked by lower case letters
  7. xxx_pipe2_consensus.fas (pipe2 Consensus) => Fasta file with all unique (no hit to grouped sequences) consensus sequences.
    Sequence codes have the format cons_grX_Y, where X is the identifier of a contig and Y is the number of sequences in the contig. If microsatellite polymorphism is detected, the sequence identifier is followed by a space and the microsatellite motif with its first and last position.
  8. xxx_pipe2_cons_subs.fas (pipe2 Cons+reads) => Fasta file with each consensus and the aligned reads used to build it
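
The cons_grX_Y sequence code described above can be split with a short Python sketch like the one below. The function name is hypothetical, and the exact layout of the polymorphism annotation after the space is not specified here, so it is returned as an unparsed string.

```python
import re

# cons_grX_Y: X is the contig identifier, Y the number of sequences in the contig.
CONS_RE = re.compile(r"^cons_gr(?P<contig>.+)_(?P<n_reads>\d+)$")


def parse_consensus_header(header):
    """Split a consensus fasta header into contig id, read count and annotation."""
    header = header.lstrip(">").strip()
    # Anything after the first space is the optional polymorphism annotation.
    code, _, annotation = header.partition(" ")
    match = CONS_RE.match(code)
    if match is None:
        return None  # not a consensus code (e.g. a singleton)
    return {
        "contig": match.group("contig"),
        "n_reads": int(match.group("n_reads")),
        "polymorphism": annotation or None,
    }
```

Headers that do not follow the cons_grX_Y pattern return None, which gives a simple way to tell consensus sequences from singletons in a mixed file.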


Output files of PIPE3

File names in Galaxy are in parentheses
  1. xxx_pipe3_log.txt (pipe3 Log file) => Parameters of the run and summary
  2. xxx_pipe3_targets.fas (Sequence with primers) => Sequences with successful primer design
  3. xxx_pipe3_primers.tabular (Table with primers) => Information on primers and target regions, with the following columns:
    • SEQUENCE_CODE: Original codes for singletons and cons_grX_Y codes for consensus sequences
    • NUMBER_OF_READS: The number of reads underlying the sequence. 1 for singletons, >1 for consensus.
    • TARGET_REGION_FIRST_POS: First position of target region in the sequence
    • TARGET_REGION_LENGTH_IN_BP: Length of the target region in base pairs. If there is only one microsatellite targeted, the target region covers the microsatellite (compound or pure). Otherwise the target includes the two most distant target microsatellites and the sequence between them.
    • TARGET_MS_LENGTH_IN_REPEAT_NUMBER: Length of the target microsatellite in repeat number. If the microsatellite is compound, it is the number of repeats in the longest uninterrupted stretch. If there is more than one microsatellite in the target region, the target MS info refers to the longest (in repeat numbers) of the target microsatellites.
    • NUMBER_OF_MS: The number of microsatellites in the target region. 1 for one pure microsatellite, 1.5 for one compound microsatellite; values >1.5 give the number of microsatellites (regardless of whether they are pure or compound)
    • MOT_TRANS: Repeat motif type, where circular permutations and their reverse complementary sequences are pooled (e.g. AC refers to AC, CA, TG, GT). If there is more than one microsatellite in the target region, it refers to the longest (in repeat numbers) of the target microsatellites.
    • TARGET_REGION_SEQ: Sequence of the target region as found in the read/consensus
    • POLYMORPH: If polymorphism is detected, the repeat motif and its position are indicated. NA for singletons, NO if the MS has the same length in all reads of a consensus.
    • ONE_PRIMER_FOR_EACH_SEQ: Only one primer pair is selected for each sequence. Selecting lines with 1 in this column gives the total number of sequences with primers. The selection among the “best primer pairs” of each target region is based on the number of microsatellites in the target region (NUMBER_OF_MS; the lower the better) and the length of the microsatellite (TARGET_MS_LENGTH_IN_REPEAT_NUMBER; the higher the better). This ordering is based on lab tests of PCR success rate and polymorphism of different primers (Meglecz et al. Submitted).
    • ONE_PRIMER_FOR_EACH_TARGET_REGION: Only one primer pair is selected for each target region. Selecting lines with 1 in this column gives the total number of target regions with primers. There can be more than one target region per sequence, so some of the markers are strongly linked. The selection is based on the alignment score between the primers and the amplicon (PCR_PRIMER_ALIGNSCORE; the lower the better), on the distance between the primer and the target region (MIN_PRIMER_TARGET_DIST; the higher the better) and on the size of the PCR product (PCR_PRODUCT_SIZE; the lower the better). This ordering is based on lab tests of PCR success rate of different primers (Meglecz et al. Submitted).
    • PCR_PRIMER_ALIGNSCORE: From version 3.1.2, the maximum alignment score between the primers and the sequence (excluding primers). In versions 3.1 and 3.1.1, the maximum alignment score between the primers and the amplicon (excluding primers).
    • MIN_PRIMER_TARGET_DIST: The smallest distance between the 3' end of the two primers and the target region. If a primer matches the sequence more than once, the distance is calculated for the annealing site closer to the target region.
    • PCR_PRODUCT_SIZE: PCR product size in bp including primers. If a primer matches the sequence more than once, the size of the longest PCR product is given here.
    • PCR_PRODUCT_SEQ: Sequence of the amplicon including primers. If a primer matches the sequence more than once, the longest PCR product is given here.
    • PRIMER_LEFT_SEQUENCE: Sequence of the left primer
    • PRIMER_RIGHT_SEQUENCE: Sequence of the right primer
    • PRIMER_LEFT_DIST_FROM_MS: Distance between the target MS and the left primer in bp.
    • PRIMER_RIGTH_DIST_FROM_MS: Distance between the target MS and the right primer in bp
    • PRIMER_LEFT_FIRST_POS: 5’ end position of the left primer in the sequence
    • PRIMER_LEFT_LENGTH: in bp
    • PRIMER_RIGHT_FIRST_POS: 5’ end position of the right primer in the sequence
    • PRIMER_RIGHT_LENGTH: in bp
    • PRIMER_LEFT_TM: Melting temperature (Tm) of the left primer; see documentation of Primer3
    • PRIMER_RIGHT_TM: Melting temperature (Tm) of the right primer; see documentation of Primer3
    • PRIMER_LEFT_END_STABILITY: see documentation of Primer3
    • PRIMER_RIGHT_END_STABILITY: see documentation of Primer3
    • PRIMER3_PENALTY: Primer pair penalty (see documentation of Primer3)
    • DESIGN: guides for target region complexity

      Table of PCR primer pair design strategies

      + allowed, but not necessarily present; - not allowed

    • SEQUENE_LENGTH: Length of the read or consensus
    • SEQUENCE: The whole sequence, with homopolymers, microsatellites and nanosatellites printed in lower case
    • CONTIG_CODE (If contig = 1): id of the contig
    • FIRST_POS_ON_CONTIG (If contig = 1): First position of the extracted fragment on its contig. These last two columns help to avoid choosing markers too close to each other on the same contig


Output files of PIPE4

File names in Galaxy are in parentheses
  1. xxx_pipe4_log.txt (pipe4 Log file) => Parameters of the run and summary
  2. xxx_pipe4_primers.tabular (Table with primers, RepeatMasker and NCBI BLAST hit info) => Same information as in xxx_pipe3_primers.tabular, completed by
    • If check_contamination = 1 => Information on the best hit against the nt database of NCBI (accession, description, e-value, score); taxonomic classification of the species of the best hit (Kingdom, Phylum, Class, Family, Genus, Species)
    • If rm = 1 => Information on the best hit to the interspersed elements library.
