fimo
Usage:
fimo [options] <sequence file> <motif file>
Description:
Scan a sequence for occurrences of given motifs.
The algorithm is as follows:
  S = a DNA sequence
  M = a motif frequency matrix
  B = the background frequencies of the four bases in S

  // Build matrix of all possible scores
  for i in nucleotide bases {
    for j in positions of M {
      foreground = M[j][i]
      background = B[i]
      score_matrix[i][j] = log_2(foreground / background)
    }
  }

  // Calculate p-values of all possible scores for motif-sized
  // windows in the sequence.
  pvalues = get_pv_lookup(score_matrix, B)

  // Calculate score for each motif-sized window in the sequence.
  for i in window start positions of S {
    score = 0
    for j in positions of M {
      index = calculate the index of S[i + j] in the nucleotide alphabet
      score = score + score_matrix[index][j]
    }
    print pvalues[score]
  }
Input:
<sequence file> is a sequence in FASTA format. Only the first sequence in the file is used.
<motif file> is a list of motifs, in MEME format.
Output:
The output is in GFF (version 2) format. The output contains two types of lines: one line per sequence, and one line per motif occurrence. Sequences receive no score. For motif occurrences, the score is an unadjusted p-value. In addition, for motif lines, the optional "attribute" field contains two attributes: the motif number from the MEME file, and the sequence that the motif matches.
Options:
--motif <int>
    Use only the specified motif from the motif file. The default behavior is to scan using each motif from the file in turn.
--motif-name <string>
    The motif ID to appear in the <feature> field of the output. The default value is "motif".
--pthresh <float>
    The p-value threshold for displaying features. If the p-value of a feature is greater than this value, the feature will not be printed. The default value is 1e-6.
--bgfile <background file>
    The name of a file specifying background frequencies for each of the nucleotides.
--max-seq-length <int>
    The maximum length of an input sequence. The default value is 1e6.
--sequence-name <string>
    The sequence ID to appear in the <seqname> field of the output. The default value is the ID of the sequence from the FASTA file.
--wiggle
    Produce wiggle, rather than GFF, output format. The wiggle file contains one track per motif, sequence and strand. Note that, because wiggle format does not allow overlapping occurrences of motifs, the output assigns each motif to the single left-most base in the occurrence. Also, to allow reasonable visualization in the UCSC Genome Browser, scores in the wiggle output are reported as negative log (base 10) p-values.
--pseudocounts <float>
    A pseudocount to be added to each count in the motif matrix, weighted by the background frequencies for the nucleotides (Dirichlet prior), before converting the motif to probabilities. The default value is 0.1.
--verbosity [1|2|3|4|5|6]
    Set the verbosity of status reports to standard error. The default value is 2.
Warning messages:
None
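As a rough illustration of how the scoring steps in the Description fit together with the --pseudocounts and --pthresh options, here is a minimal Python sketch. It is not the fimo implementation. All function names (counts_to_probs, build_score_matrix, window_scores, exact_pvalue, scan) are hypothetical, and the p-value step is a brute-force enumeration over every possible window, swapped in for the get_pv_lookup table, whose internals this page does not describe. It assumes an uppercase ACGT-only sequence, a single strand, a background that sums to 1, and a motif short enough for enumeration to be practical.

```python
import math
from itertools import product

ALPHABET = "ACGT"


def counts_to_probs(counts, background, pseudocount=0.1):
    """Add a background-weighted pseudocount to each count, then normalize.

    counts[j][i] is the count of base i at motif position j.
    """
    probs = []
    for column in counts:
        total = sum(column) + pseudocount  # assumes background sums to 1
        probs.append([(c + pseudocount * b) / total
                      for c, b in zip(column, background)])
    return probs


def build_score_matrix(probs, background):
    """score_matrix[i][j] = log2(M[j][i] / B[i]) for base i, position j."""
    return [[math.log2(probs[j][i] / background[i])
             for j in range(len(probs))]
            for i in range(len(ALPHABET))]


def window_scores(sequence, score_matrix):
    """Yield (start, score) for every motif-sized window of the sequence."""
    width = len(score_matrix[0])
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # ALPHABET.index raises ValueError on non-ACGT characters.
        yield start, sum(score_matrix[ALPHABET.index(base)][j]
                         for j, base in enumerate(window))


def exact_pvalue(score, score_matrix, background):
    """Probability that a random background window scores >= score.

    Brute force over all 4**width windows; a stand-in for a lookup table.
    """
    width = len(score_matrix[0])
    total = 0.0
    for kmer in product(range(len(ALPHABET)), repeat=width):
        kmer_score = sum(score_matrix[i][j] for j, i in enumerate(kmer))
        if kmer_score >= score - 1e-9:  # tolerance for float rounding
            total += math.prod(background[i] for i in kmer)
    return total


def scan(sequence, counts, background, pseudocount=0.1, pthresh=1e-6):
    """Return (start, score, p-value) for windows with p-value <= pthresh."""
    probs = counts_to_probs(counts, background, pseudocount)
    score_matrix = build_score_matrix(probs, background)
    hits = []
    for start, score in window_scores(sequence, score_matrix):
        pvalue = exact_pvalue(score, score_matrix, background)
        if pvalue <= pthresh:
            hits.append((start, score, pvalue))
    return hits


# Toy example: a width-3 motif that strongly prefers "ACG".
toy_counts = [[10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0]]
uniform = [0.25, 0.25, 0.25, 0.25]
hits = scan("TTACGTT", toy_counts, uniform, pthresh=0.02)
for start, score, pvalue in hits:
    print(start, round(score, 3), pvalue)
```

With a width-3 motif the smallest attainable p-value under a uniform background is 1/64, so the toy call uses a much looser threshold than the 1e-6 default. A negative log (base 10) score of the kind described for --wiggle would be -math.log10(pvalue) in this sketch.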
Bugs and future enhancements:
- The --motif option should allow multiple motifs to be selected from the motif file.
- Print motif consensus as part of feature properties.
- Currently, the program prints one sequence line every time it encounters a new motif, resulting in multiple lines per sequence.