motiph

Usage: motiph [options] <alignment file> <tree file> <motif file>

Description:

Scans multiple sequence alignments for occurrences of given motifs, taking into account the phylogenetic tree relating the sequences.

The algorithm is as follows:

A = a multiple alignment
T = a phylogenetic tree
M = a motif frequency matrix
B = the background frequencies of the four bases in A
U = a background evolutionary model with equilibrium frequencies B

// Build evolutionary models.
for j in positions of M {
  E[j] = an evolutionary model with equilibrium frequencies 
    from the jth position of M
}

// Build matrix of all possible scores
for i in all possible alignment columns {
  for j in positions of M {
    foreground = site_likelihood(E[j], A[:][i], T)
    background = site_likelihood(U, A[:][i], T)
    score_matrix[i][j] = log_2(foreground / background)
  }    
}
// Calculate p-values of all possible scores for motif-sized
// windows in the alignment.
pvalues = get_pv_lookup(score_matrix, B)

// Calculate the score for each motif-sized window in the alignment.
for i in columns of A {
  score = 0
  for j in positions of M {
    index = calculate the index of A[:][i + j] in the array
      of all possible alignment columns
    score = score + score_matrix[index][j]
  }    
  print pvalues[score]
}
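
The window-scoring loop above can be sketched in Python as follows. This is an illustrative sketch only, not the motiph source: the helper names column_index and window_scores are hypothetical, the alignment is assumed to be a list of equal-length strings over A, C, G, T with no gaps, and score_matrix is assumed to be precomputed with one row per possible alignment column. The column index is taken to be the base-4 encoding of the column, which is one natural choice but is an assumption here. Unlike the pseudocode, the sketch only visits windows that fit entirely inside the alignment.

```python
BASES = "ACGT"

def column_index(column):
    """Encode an alignment column (one base per sequence) as a base-4 integer.

    Assumption: columns contain only A, C, G, T (no gaps or ambiguity codes).
    """
    index = 0
    for base in column:
        index = index * 4 + BASES.index(base)
    return index

def window_scores(alignment, score_matrix, motif_width):
    """Sum the per-column log-odds scores over each motif-sized window.

    alignment    -- list of equal-length strings, one per sequence
    score_matrix -- score_matrix[index][j] is the score of the column with
                    the given index at motif position j
    """
    n_cols = len(alignment[0])
    scores = []
    for i in range(n_cols - motif_width + 1):
        score = 0.0
        for j in range(motif_width):
            column = "".join(seq[i + j] for seq in alignment)
            score += score_matrix[column_index(column)][j]
        scores.append(score)
    return scores
```

With N sequences there are 4^N possible columns, so score_matrix has 4^N rows; precomputing it is practical only for a modest number of sequences.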

The core of the algorithm is a routine (site_likelihood) that scores a single column of the multiple alignment under a given evolutionary model and a given phylogenetic tree. The alignment column provides the observed nucleotides at the leaves of the tree. The routine sums over the unobserved nucleotides at the internal nodes, using the equilibrium distribution of the evolutionary model as the distribution at the root of the tree (Felsenstein's pruning algorithm). The tree must be a maximum likelihood tree, of the kind generated by DNAml from PHYLIP or by fastDNAml. Branch lengths in the tree are converted to conditional probabilities using the specified evolutionary model.
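
As a rough illustration of what such a routine computes, here is a minimal Python sketch of the pruning recursion. It is not the motiph implementation. Several things are assumptions made for the example: the tree is represented as nested lists of (child, branch_length) pairs with sequence names at the leaves; the substitution model is a simple F81-style model, in which a site either stays unchanged with probability exp(-t) or is redrawn from the equilibrium frequencies; and any character other than A, C, G, T is treated as missing data.

```python
import math

BASES = "ACGT"

def f81_transition(pi, t):
    """Return P where P[a][b] is the probability of base b after time t,
    starting from base a, under an F81-style model (an assumption here)."""
    decay = math.exp(-t)
    return [[decay * (a == b) + (1.0 - decay) * pi[b] for b in range(4)]
            for a in range(4)]

def partial_likelihoods(node, column, pi):
    """Return L where L[x] = P(observed leaves below node | node has base x)."""
    if isinstance(node, str):            # leaf: the name of a sequence
        observed = column[node]
        if observed not in BASES:        # gap or ambiguity: treat as missing
            return [1.0] * 4
        return [1.0 if BASES[x] == observed else 0.0 for x in range(4)]
    result = [1.0] * 4
    for child, branch_length in node:    # internal node: list of children
        child_l = partial_likelihoods(child, column, pi)
        p = f81_transition(pi, branch_length)
        for x in range(4):
            # Sum over the unobserved base y at the child end of the branch.
            result[x] *= sum(p[x][y] * child_l[y] for y in range(4))
    return result

def site_likelihood(pi, column, tree):
    """Likelihood of one alignment column, given equilibrium frequencies pi
    (order A, C, G, T), a dict mapping sequence name to base, and a tree."""
    root = partial_likelihoods(tree, column, pi)
    return sum(pi[x] * root[x] for x in range(4))
```

Under these assumptions, the log-odds score for a column at motif position j would be math.log2(site_likelihood(motif_freqs[j], column, tree) / site_likelihood(background, column, tree)), where motif_freqs and background are hypothetical names for the jth column of M and for B.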

Input:

Output:

The output is in GFF (version 2) format. The output contains two types of lines: one line per sequence, and one line per motif occurrence. Sequences receive no score. For motif occurrences, the score is an unadjusted p-value. In addition, for motif lines, the optional "attribute" field contains two attributes: the motif number from the MEME file, and the sequence that the motif matches.
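
For readers who want to post-process the output, the following Python sketch splits one GFF version 2 line into its nine standard tab-separated fields. It relies only on the general GFF2 column layout. The function name parse_gff2_line is hypothetical, the use of "." for a missing score follows the usual GFF2 convention and is assumed rather than confirmed for motiph, and the attribute field is returned as a raw string because the exact attribute syntax is not specified here.

```python
GFF2_FIELDS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute")

def parse_gff2_line(line):
    """Split one GFF2 line into a dict keyed by the standard field names."""
    parts = line.rstrip("\n").split("\t", 8)
    if len(parts) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    parts += [""] * (9 - len(parts))     # the attribute field is optional
    record = dict(zip(GFF2_FIELDS, parts))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Assumption: "." marks a missing score, as on lines that carry no score.
    if record["score"] == ".":
        record["score"] = None
    else:
        record["score"] = float(record["score"])
    return record
```

Motif occurrences could then be filtered by p-value with an expression such as [r for r in records if r["score"] is not None and r["score"] < 1e-4], where the threshold is an arbitrary example value.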

Options:

Warning messages: None

Bugs:
