The FASTA program package Introduction This documentation describes the version 2.0x of the FASTA program package (see W. R. Pearson and D. J. Lipman (1988), "Improved Tools for Biological Sequence Analysis", PNAS 85:2444- 2448, and W. R. Pearson (1990) "Rapid and Sensitive Sequence Comparison with FASTP and FASTA" Methods in Enzymology 183:63- 98). Version 2.0 modifies version 1.8 to include explicit statistical estimates for similarity scores based on the extreme value distribution. In addition, FASTA protein alignments now use the Smith-Waterman algorithm with no limitation on gap size. FASTA and SSEARCH now use the BLOSUM50 matrix by default, with options to change gap penalties on the command line. Version 1.7 replaces rdf2 and rss with prdf and prss, which use the extreme- value distribution to calculate accurate probability estimates. The FASTA sequence comparison programs on this disk are improved versions of the FASTP program, originally described in Science (Lipman and Pearson, (1985) Science 227:1435-1441). We have made several improvements. First, the library search programs use a more sensitive method for the initial comparison of two sequences which allows the scores of several similar regions to be combined. As a result, the results of a library search are now given with three scores, initn (the new initial score which may include several similar regions), init1 (the old fastp initial score from the best initial region), and opt (the old fastp optimized score allowing gaps in a 32 residue wide band). 3. Using the FASTA Package 3.1. Overview The FASTA sequence comparison programs all require similar information, the name of a query sequence file, a library file, and the ktup parameter. All of the programs can accept arguments on the command line, or they will prompt for the file names and ktup value. To use FASTA, simply type: FASTA and you will be prompted for : the name of the test sequence file the name of the library file and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences) ktup of 2 is about 5 times faster than ktup = 1. For a 200 aa sequence against a 10,000,000 aa library, the program takes about 30 min with ktup = 2, 150 min with ktup = 1, on a 12 Mhz 286 IBM-PC. The program can also be run by typing FASTA test.aa /lib/bigfile.lib ktup (1 or 2) Included with the package are the test files, MUSPLFM.AA, LCBO.AA, MCHU.AA and BOVPRL.SEQ. To check to make certain that everything is working, you can try: fasta musplfm.aa lcbo.aa and tfasta musplfm.aa bovprl.seq To test the local similarity programs LFASTA and PLFASTA, try: lfasta mchu.aa mchu.aa and plfasta mchu.aa mchu.aa MCHU (calmodulin) has four duplicated calcium binding sites that are clearly detected by LFASTA. For a more complicated example, try MWRTC1.aa, myosin heavy chain. 3.2. Sequence files The FASTA programs know about three kinds of sequence files (four under VMS): (1) plain sequence files that can only be used as query sequences or for LFASTA, PRDF, and ALIGN. (2) Standard library files. These are the same as plain sequence files, each sequence is preceded by a comment line with a '>' in the first column. (3) distributed sequence libraries (this is a broad class that includes the NBRF/PIR VMS and blocked ascii formats, Genbank flat-file format, EMBL flat-file format, and Intelligenetics format. All of the files that you create should be of type (1) or (2). Type (2) files (ones with a be used as query or library sequence files by all of the programs. I have included several sample test files, *.AA. The first line may begin with a '>' or ';' followed by a comment. The text after ';' in other lines will be ignored. Spaces and tabs (and anything else that is not an amino-acid code) are ignored. Library files should have the form: >Sequence name and identifier A F A S Y T .... actual sequence. F S S .... second line of sequence. >Next sequence name and identifier This is often referred to as "FASTA" or "Pearson" format. You can build your own library by concatenating several sequence files. Just be sure that each sequence is preceded by a line beginning with a '>' with a sequence name. The test file should not have lines longer than 120 characters, and sequences entered with word processors should use a document mode, with normal carriage returns at the end of lines. Program Summary 3.3. Sequence search programs FASTA universal sequence comparison. Defaults to comparing protein sequences; if the sequences are > 85% A+C+G+T or the -n option is used, a DNA sequence is assumed. FASTX Search a protein sequence library using amino acid sequence comparison to the forward three frames of a translated DNA query sequence. (The reverse frames are specified with the -i option.) Alignment scores allow frameshifts; the final alignment uses a Smith-Waterman type alignment routine (no limit on gaps) that allows frameshifts. TFASTA Search DNA library for a protein sequence by translating the DNA sequence to protein in all six frames (three forward frames with the -3 command line option). TFASTA with ktup=2 is about as fast as a DNA FASTA with ktup=4, and is substantially more sensitive. (also reads the GENBANK library) TFASTX Search DNA library for a protein sequence by translating the DNA sequence to protein in all six frames (three forward frames with the -3 command line option) calculating similarity scores that allow frameshifts. TFASTX produces an optimal Smith-Waterman alignment of the query and translated-library sequence. SSEARCH Universal sequence comparison using the Smith-Waterman algorithm ( T. F. Smith and M. S. Waterman (1981) J. Mol. Biol. 147:195-197). This program uses code developed by Huang and Miller (X. Huang, R. C. Hardison, W. Miller (1990) CABIOS 6:373-381) for calculating the local similarity score and code from the ALIGN program (see below) for calculating the local alignment. SSEARCH is about 50-times slower than FASTA with ktup=2 (for proteins). ALIGN optimal global alignment of two sequences with no short-cuts. This program is a slightly modified version of one taken from E. Myers and W. Miller. The algorithm is described in E. Myers and W. Miller, "Optimal Alignments in Linear Space" (CABIOS (1988) 4:11-17). 3.5. Statistical Significance With version 2.0 of the FASTA program distribution, FASTA, TFASTA, and SSEARCH now provide estimates of statistical significance for library searches. Work by Altschul, Arratia, Karlin, Mott, Waterman, and others (see Altschul et al. (1994) Nature Genetics 6:119 for an excellent review) suggests that local sequence similarity scores follow the extreme value distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u)) where u = ln(Kmn)/lambda and m,m are the lengths of the query and library sequence. This formula can be rewritten as: 1 - exp(-Kmn exp(-lambda x), which shows that the average score for an unrelated library sequence increases with the logarithm of the length of the library sequence. FASTA and SSEARCH use simple linear regression against the the log of the library sequence length to calculate a normalized "z-score" with mean 50, regardless of library sequence length, and variance 10. These z-scores can then be used with the extreme value distribution and the poisson distribution (to account for the fact that each library sequence comparison is an independent test) to calculate the number of library sequences to obtain a score greater than or equal to the score obtained in the search. The original idea and routines to do the linear regression on library sequence length were provided Phil Green, U. Washington. This version of FASTA and SSEARCH uses a slightly different strategy for fitting the data than those originally provided by Dr. Green. The expected number of sequences is plotted in the histogram using an "*". Since the parameters for the extreme value distribution are not calculated directly from the distribution of similarity scores, the pattern of "*'s" in the histogram gives a qualitative view of how well the statistical theory fits the similarity scores calculated by FASTA and SSEARCH. For FASTA, if optimized scores are calculated for each sequence in the database (the default), the agreement between the actual distribution of "z-scores" and the expected distribution based on the length dependence of the score and the extreme value distribution is usually very good. Likewise, the distribution of SSEARCH Smith- Waterman scores typically agrees closely with the actual distribution of "z-scores." The agreement with unoptimized scores, ktup=2, is often not very good, with too many high scoring sequences and too few low scoring sequences compared with the predicted relationship between sequence length and similarity score. In those cases, the expectation values may be overestimates. The statistical routines assume that the library contains a large sample of unrelated sequences. If this is not the case, then the expectation values are meaningless. Likewise, if there are fewer than 20 sequences in the library, the statistical calculations are not done. For protein searches, library sequences with E() values < 0.01 for searches of a 10,000 entry protein database are almost always homologous. Frequently sequences with E()-values from 1 - 10 are related as well. Remember, however, that these E() values also reflect differences between the amino acid composition of the query sequence and that of the "average" library sequence. Thus, when searches are done with query sequences with "biased" amino-acid composition, unrelated sequences may have "significant" scores because of sequence bias. The programs below, PRDF and PRSS, can address this problem by calculating similarity scores for random sequences with the same length and amino acid composition. If optimization is not used ("-o"), E-values for DNA sequences overestimate the significance of the scores that are obtained and unrelated sequences frequently have E()-values < 0.0005. With optimization, the agreement between E()-value compares favorably with protein sequence comparison. This is in part due to the use of more stringent gap penalties for DNA sequence comparison, -16, -4 rather than -12, -2. With the latter penalties, many unrelated sequences appear to have significant similarity. Nevertheless, since protein sequence comparison is much more sensitive, DNA sequence comparison should not be used to identify sequences that encode protein. Even with ktup=6, optimization rarely increases run-times more than 50% with mRNA-size query sequences. Optimization should be used whenever possible. Similar comments apply to TFASTA, where higher gap penalties (-16,-4) are required for accurate statistical estimates. Because TFASTA produces so many artificial "coding" sequences with atypical amino acid compositions, the statistical estimates with TFASTA are often over estimates. With optimized scores, ktup=1, and gap penalties of -16, -4, unrelated sequences will sometimes have E() values of 0.1. If initn scores are used, unrelated sequences may have have E() values < 0.01. PRDF improved version of RDF program that includes accurate probability estimates for all three scoring methods (includes local or window shuffle routine) PRSS A version of PRDF that uses the rigorous Smith-Waterman calculation used by SSEARCH. RANDSEQ produces a randomly shuffled sequence from a query sequence. RELATE significance program described by Dayhoff (Atlas of Protein Sequence and Structure, Vol. 5, Supplement 3). Each chunk of 25 residues in one sequence is compared to every 25 residue fragment of the second sequence. Sequences which are genuinely related will have a large number of scores greater than 3 standard deviations above the mean score of all of the comparisons. 3.8. Command line options It is now possible to specify several options on the command line, instead of using environment variables. The command line options are preceded by a dash; the following options are available: -a same as showall=1 -A force Smith-Waterman alignments for DNA sequences and TFASA. By default, only FASTA protein sequence comparisons use Smith-Waterman alignments. -b # Number of sequence scores to be shown on output. In the absence of this option, fasta (and tfasta and ssearch) display all library sequences obtaining similarity scores with expectations less than 10.0 if optimized score are used, or 2.0 if they are not. The -b option can limit the display further, but it will not cause additional sequences to be displayed. -c # Threshold score for optimization (OPTCUT). Set "-c 1" to optimize every sequence in a database. (This slows the program down about 5-fold). -E # Limit the number of scores and alignments shown based on the expected number of scores. Used to override the expectation value of 10.0 used by default. When used with -Q, -E 2.0 will show all library sequences with scores with an expectation value <= 2.0. -d # Number of alignments to be reported by default. (Used in conjunction with -Q). No longer necessary, see "-b" above. -f Penalty for the first residue in a gap (-12 by default for proteins, -16 for DNA or for TFASTA). -g Penalty for additional residues in a gap (-2 by default for proteins, -4 for DNA and TFASTA ). -h Penalty for frameshift (FASTX, TFASTX only). -H Omit histogram. -i Invert (reverse complement) the query sequence if it is DNA. For TFASTX, search the reverse complement of the library sequence only. -k # Threshold for joining init1 segments to build an initn score (GAPCUT). -l file Location of library menu file (FASTLIBS). -L Display more information about the library sequence in the alignment. -m # MARKX = # (0, 1, 2, 3, 4, 10) -n Force the query sequence to be treated as a DNA sequence. This is particularly useful for query sequences that contain a large number of ambiguous residues, e.g. transcription factor binding sites. -O Send copy of results to "filename." Helpful for environments without STDOUT. -o Turn off default optimization of all scores greater than OPTCUT. Sort results by "initn" scores. -Q,-q Quiet - does not prompt for any input. Writes scores and alignments to the terminal or standard output file. -r file Save a results summary line for every sequence in the sequence library. The summary line includes the sequence identifier, superfamily number (if available) position in the library, and the similarity scores calculated. This option can be used to evaluate the sensitivity and selectivity of different search strategies (see W. R. Pearson (1991) Genomics 11:635- 650.) -s file SMATRIX is read from file. Several SMATRIX files are provided with the standard distribution. For protein sequences: codaa.mat - based on minimum mutation matrix; idnaa.mat - identity matrix; pam250.mat - the PAM250 matrix developed by Dayhoff et al (Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, 1978); pam120.mat - a PAM120 matrix. The default scoring matrix is BLOSUM50, PAM250 is available with "-s 250", BLOSUM62 ("-s BL62") is also available. -v (LINEVAL) values used for line styles in plfasta -w # Line length (width) = number (<200) -x Specifies offsets for the beginning of the query and library sequence. For example, if you are comparing upstream regions for two genes, and the first sequence contains 500 nt of upstream sequence while the second contains 300 nt of upstream sequence, you might try: fasta -x "-500 -300" seq1.nt seq2.nt If the -x option is not used, FASTA assumes numbering starts with 1. This option will not work properly with the translated library sequence with tfasta. (You should double check to be certain the negative numbering works properly.) -y Set the width of the band used for calculating "optimized" scores. For proteins and ktup=2, the width is 16. For proteins with ktup=1, the width is 32 by default. For DNA the width is 16. -z Turn off statistical calculations. -1 sort output by init1 score (as FASTP used to do). -3 (TFASTA, TFASTX only) translate only three forward frames For example: fasta -w 80 -a seq1.aa seq.aa would compare the sequence in seq1.aa to that in seq2.aa and display the results with 80 residues on an output line, showing all of the residues in both sequences. Be sure to enter the options before entering the file names, or just enter the options on the command line, and the program will prompt for the file names. Not all of these options are appropriate for all of the programs. The options above are used by FASTA and TFASTA. RELATE uses the -s option, ALIGN uses the -w, -m, and -s options, and the PRDF program uses -c, -f, -k, and -s. 4. Environment variable summary Environment variables allow you to set search parameters that will be used frequently when you run a program; for example, if you prefer to use the PAM250 scoring matrix, you might "set SMATRIX=250." Command line parameters, if used, always override environment variable settings. The following environment variables are used by this program: AABANK the file name of the default sequence library. FASTLIBS the location of the file which contains the list of library files to be searched. GAPCUT threshold used for joining init1 regions in the second step of FASTA. Normally set based on sequence length and ktup. LIBTYPE used to specify the format of the library sequence for FASTA and TFASTA. LINLEN output line length - can go up to 200 LINEVAL used by plfasta to determine the relationship between line style and similarity score (-v). This should be a string of three numbers, e.g. "200 100 50" MARKX symbol for denoting matches, mismatches. Note that this symbol is only used across the optimized local region; sequences that are outside this region are not marked. OPTCUT Set the threshold to be used for optimization in a band around the best initial region. Normally the OPTCUT value is calculated from the length of the sequence and the ktup value (for a 200 residue sequence, it is about 28). If OPTCUT=1, every sequence in the database will be optimized. This is the most sensitive option. PAMFACT This version of fasta uses a more sensitive method for identifying initial regions. Instead of using a constant factor (fact) for each match in a ktup, it uses the scoring matrix (PAM) scores. While this works well for protein sequences, it has not been as carefully tested for DNA sequences, so by default, this modification is used for proteins but not for DNA. Setting the PAMFACT environment variable to 1 forces the option on; PAMFACT=0 turns it off. SHOWALL on output, show the complete sequence instead of just the overlap of the two aligned sequences. SMATRIX alternative scoring matrix file. As always, please inform me of bugs as soon as possible. William R. Pearson Department of Biochemistry Box 440, Jordan Hall U. of Virginia Charlottesville, VA wrp@virginia.EDU