motiph [options] <alignment file> <tree file>
        <motif file>
    Scan a DNA multiple sequence alignment for occurrences of given
      motifs, taking into account the phylogenetic tree relating the
      sequences. Any sequences containing nothing but gaps are removed from
      the alignment. 
      motiph supports several evolutionary models that it uses to compute
      the score for aligning the motif to each possible region of the
      multiple alignment.
      Read about the algorithm for more details.
The name of a file containing a DNA multiple alignment in ClustalW format. Alternatively, if the --list option is used, this file may contain a list of alignment files. In this case each of the alignment files will be scanned for each of the motifs.
The name of a file containing a phylogenetic tree in Phylip Newick format. This tree may contain additional species not represented in the alignment.
The name of a file containing MEME formatted motifs.
Output is created as a table in HTML (.html) and text (.txt) formats. Each line in the table represents a match of the motif to a "window" in the alignment. Matches to the motif and its reverse complement are listed separately. The output columns contain the motif name, the alignment name, the start position of the match in the alignment, the end position of the match, the match score, the match p-value, the match q-value, and the matched portion of the first sequence in the alignment, respectively.
Output is also created in GFF format (.gff) and in CISML (.xml) format.
| Option | Parameter | Description | Default Behavior | |||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| General Options | ||||||||||||||||||||||||||||||
| --bg | mutation rate | The mutation rate for sites in the background model. | A background mutation rate of 1 is used. | |||||||||||||||||||||||||||
| --bfile | background file | Read background frequencies from
            background file. The file should be in 
            MEME background file format.
            If the argument is the keyword motif-file, then the frequencies will be taken from
            the motif file. If both strands are being scored, the background model 
	    will be modified by averaging the frequencies of letters and their reverse complements. | Use frequencies embedded in the application from an old version of the NCBI non-redundant DNA database. | |||||||||||||||||||||||||||
| --fg | mutation rate | The mutation rate for sites in the foreground model(s). | A foreground mutation rate of 1 is used. | |||||||||||||||||||||||||||
| --gap | skip|fixed|wildcard|minimum|model | Specifies the gap handling strategy. 
 | Behaves as if --gap skip had been specified. | |||||||||||||||||||||||||||
| --gap-cost | cost | Specifies the costs for gaps when using the fixed gap handling strategy. | A gap cost of zero is used. | |||||||||||||||||||||||||||
| --hb | Use the Halpern-Bruno modification to the evolutionary model. | |||||||||||||||||||||||||||||
| --list | Treat the second required input as a list of alignments, rather than a single alignment. | The second required argument is treated as a single alignment. | ||||||||||||||||||||||||||||
| --max-stored-scores | storage | Set the maximum number of scores that will be stored. Keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped. In this case, the list of reported motifs may be incomplete and the q-value calculation will be approximate. | The maximum number of stored matches is set to 100,000. | |||||||||||||||||||||||||||
| --model | single|average|jc|k2|f81|f84|hky|tn | The evolutionary model to use. 
 | Behaves as if --model f81 was specified. | |||||||||||||||||||||||||||
| --motif | ID | Use only the motif identified by ID. This option may be repeated. | All motifs are used. | |||||||||||||||||||||||||||
| --no-qvalue | Do not compute a q-value for each p-value. The q-value calculation is that of Benjamini and Hochberg (1995). | The program computes q-values. | ||||||||||||||||||||||||||||
| --norc | Do not score the reverse complement DNA strand. | Both strands are scored. | ||||||||||||||||||||||||||||
| --output-pthresh | p-value threshold | The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. If both the --output-pthresh and --output-qthresh options appear on the command line, whichever appears later on the command line will be applied. | The p-value threshold is set to 1e-4. | |||||||||||||||||||||||||||
| --output-qthresh | q-value threshold | The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. | No q-value threshold is applied. | |||||||||||||||||||||||||||
| --pseudocounts | pseudocount | A pseudocount to be added to each count in the motif matrix, weighted by the background frequencies for the nucleotides (Dirichlet prior), before converting the motif to probabilities. | A pseudocount of 0.1 is added. | |||||||||||||||||||||||||||
| --pur-pyr | ratio | The ratio of the purine transition rate to pyrimidine transition rate. This parameter is used by the Tamura-nei model. | The ratio is set to 1.0. | |||||||||||||||||||||||||||
| --seed | n | Seed for random number generator. | The seed is set to the current time. | |||||||||||||||||||||||||||
| --text | Limits output to plain text sent to standard out. This mode allows the program to search an arbitrarily large database, because results are not stored in memory. | |||||||||||||||||||||||||||||
| --transition-transversion | ratio | The ratio of the transition rate to the transversion rate. This parameter is used by the Kimura 2-parameter, F84, HKY, and Tamura-nei models. | The ratio is set to 0.5. | |||||||||||||||||||||||||||
The algorithm is as follows:
A = a multiple alignment
T = a phylogenetic tree
M = a motif frequency matrix
B = the background frequencies of the four bases in A
U = a background evolutionary model with equilibrium frequencies B
// Build evolutionary models.
for j in positions of M {
  E[j] = an evolutionary model with equilibrium frequencies 
    from the jth position of M
}
// Build matrix of all possible scores
for i in all possible alignment columns {
  for j in positions of M {
    foreground = site_likelihood(E[j], A[:][i], T)
    background = site_likelihood(U, A[:][i] T)
    score_matrix[i][j] = log_2(foreground / background)
  }    
}
// Calculate p-values of all possible scores for motif sized windows
// windows in the alignment.
pvalues = get_pv_lookup(score_matrix, B)
// Calculate score for each motif sized window in the alignment.
for i in columns of A {
  score = 0
  for j in positions of M {
    index = calculate the index of A[:][i + j] in the array
      of all possible alignment columns
    score = score + score_matrix[index][j]
  }    
  print pvalues[score]
}
      
      The core of the algorithm is a routine (site_likelihood)
      for scoring a particular column of the multiple alignment using a
      given evolutionary model and a given phylogenetic tree.  The alignment
      site provides the observed nucleotides at the base of the tree, and we
      sum over the unobserved nucleotides in the rest of the tree,
      conditioning on the equilibrium distribution from the evolutionary
      model at the root of the tree (Felsenstein Pruning Algorithm).  The
      tree must be a maximum likelihood tree, of the kind generated by DNAml
      from Phylip or by FastDNAml.  Branch lengths in the tree are converted
      to conditional probabilities using the specified evolutionary
      model.