motiph
Usage:
motiph [options] <alignment file> <tree file> <motif file>Description:
Scans multiple sequence alignments for occurrences of given motifs, taking into account the phylogenetic tree relating the sequences.The algorithm is as follows:
A = a multiple alignment T = a phylogenetic tree M = a motif frequency matrix B = the background frequencies of the four bases in A U = a background evolutionary model with equilibrium frequencies B // Build evolutionary models. for j in positions of M { E[j] = an evolutionary model with equilibrium frequencies from the jth position of M } // Build matrix of all possible scores for i in all possible alignment columns { for j in positions of M { foreground = site_likelihood(E[j], A[:][i], T) background = site_likelihood(U, A[:][i] T) score_matrix[i][j] = log_2(foreground / background) } } // Calculate p-values of all possible scores for motif sized windows // windows in the alignment. pvalues = get_pv_lookup(score_matrix, B) // Calculate score for each motif sized window in the alignment. for i in columns of A { score = 0 for j in positions of M { index = calculate the index of A[:][i + j] in the array of all possible alignment columns score = score + score_matrix[index][j] } print pvalues[score] }The core of the algorithm is a routine (
site_likelihood) for scoring a particular column of the multiple alignment using a given evolutionary model and a given phylogenetic tree. The alignment site provides the observed nucleotides at the base of the tree, and we sum over the unobserved nucleotides in the rest of the tree, conditioning on the equilibrium distribution from the evolutionary model at the root of the tree (Felsenstein Pruning Algorithm). The tree must be a maximum likelihood tree, of the kind generated by DNAml from Phylip or by FastDNAml. Branch lengths in the tree are converted to conditional probabilities using the specified evolutionary model.Input:
<alignment file>is a multiple alignment in ClustalW format. Alternatively, if the--listoption is used, this file may contain a list of alignment files. In this case each of the alignment files will be scanned for each of the motifs.<tree file>is a phylogenetic tree in Phylip Newick format. This tree may contain additional species not represented in the alignment.<motif file>is a list of motifs, in MEME format.Output:
The output is in GFF (version 2) format. The output contains two types of lines: one line per sequence, and one line per motif occurrence. Sequences receive no score. For motif occurrences, the score is an unadjusted p-value. In addition, for motif lines, the optional "attribute" field contains two attributes: the motif number from the MEME file, and the sequence that the motif matches.
Options:
--hb Use the Halpern-Bruno modification to the evolutionary model.
--list Treat the alignment file as a list of alignment files.
Each alignment file in the list will be scanned for each motif.
--model [single|average|jc|k2|f81|f84|hky|tn] The evolutionary model to use.
The available models are:
single - Scores the first sequence in the alignment using only the position specific
scoring matrix (PSSM) for the motif. This model ignores the phylogenetic tree.
average - Scores the alignment by taking the column average of the PSSM
for each position in the alignment. This model ignores the phylogenetic tree.
jc - Jukes-Cantor: equilibrium base frequencies are all 1/4;
the only free parameter is the mutation rate.k2 - Kimura 2-parameter: equilibrium base frequencies are all 1/4;
the free parameters are the mutation rate and the transition/transversion rate ratio.f81 - Felsenstein 1981 (F81): equilibrium base frequencies are taken from the motif file;
the only free parameter is the mutation rate.f84 - Felsenstein 1984 (F84): equilibrium base frequencies are taken from the motif file;
the free parameters are the mutation rate and the transition/transversion rate ratio.
The ratio of purine-purine to pyrimidine->pyrimidine transitions is assumed to be 1.hky - Hasegawa-Kishino-Yano (HKY): equilibrium base frequencies
are taken from the motif file;
the free parameters are the mutation rate and the transition/transversion rate ratio.
The ratio of purine-purine to pyrimidine-pyrimidine transitions is assumed to be equal to the
ratio of purines to pyrimidines.tn - Tamura-Nei: equilibrium base frequencies are taken from the motif file;
the free parameters are the mutation rate, the transition/transversion rate ratio,
and the ratio of purine-purine transitions to pyrimidine-pyrimidine transitions.
--pur-pyr <float>
Ratio of purine-purine to pyrimidine-pyrimidine transitions.
This parameter is used by the Tamura-Nei model.
The default value is 1.
--transition-transversion <value>
The ratio of the transition rate to the transversion rate.
This parameter is used by the Kimura 2-parameter, F84, HKY,
and Tamura-Nei models.
The default value is 0.5.--fg <value> The mutation rate for sites in the foreground model.
The default value is 1.--bg <value> The mutation rate for sites in the background model.
The default value is 1.--gap <method> Specifies the gap handling strategy.
Allowed values for method are:
skip - Skip those sites where any position in the alignment
window contains a gap. This is the default gap handling strategy.
fixed - Sites containing gaps are assigned a fixed
score, specified by --gap-cost.
wildcard - The gap character matches any base, and the
score is the product of the corresponding probabilities.
minimum - The gap character is assigned the score
corresponding to the least likely letter at the given position.
--gap-cost <float> Specifies the costs for gaps when
using the fixed gap handling strategy. Default is 0.0.
--gap-freq <float> Specifies the background frequency
for gaps. Default is derived from the alignment.
--bgfile <background file>
The name of a file specifying background frequencies for each
of the nucleotides.
--column-freqs [simulated|empirical]
The way to compute the frequencies of all possible columns.
simulated - Use the evolutionary model to compute the frequency of each
possible column of letters.
empirical - Count the numbers of each column in the input multiple
alignments. All alignments using the same (sub)set of species are
counted together. Frequencies are computed by dividing each by the
total counts for that (sub)set of species.
simulated. These frequencies are used for
computing p-values for scores. If simulated is used,
the accuracy of the p-values depends strongly on the accuracy of the
evolutionary model.
--motif <int>
Use only the specified motif from the motif file.
The default behavior is to scan using each motif from the file in turn.
--motif-name <string> The motif ID to appear in the
<feature>
field of the output. The default value is "motif".--pthresh <float> The P-value threshold for displaying features.
If the p-value of a feature is greater then this value, the feature will not be printed.
The default value is 1e-6.
--sequence-name <string> The sequence ID to appear in the
<seqname> field of the output. The default value is
"<file name>.<sequence ID>", where
"<file name>" is the name of the alignment file and
"<sequence ID>" is the ID of the first sequence in the
alignment.--pseudocounts <float>
A pseudocount to be added to each count in the motif matrix,
weighted by the background frequencies for the nucleotides (Dirichlet prior),
before converting the motif to probabilities.
The default value is 0.1.
--verbosity [1|2|3|4|5|6] Set the verbosity of status
reports to standard error. The default value is 2.Warning messages: None
Bugs:
--motif option should allow multiple motifs
to be selected from the motif file.
# species_list_1 AAA f_AAA ... TTT f_TTT # species_list_2 AAAAA f_AAAAA ... TTTTT f_TTTTT ...