It is probably the only sequence simulation program capable of natively using the power-law model for indel lengths. Each simulation performed by Dawg started with a random sequence of nucleotides. For each ancestral sequence, a single descendant sequence was evolved by Dawg based on the branch length separating the ancestor from the descendant.
Because sequences were to be aligned using equal costs for each mismatch type, the sequences were evolved under the Jukes-Cantor substitution model [ 30 ]. Indels were created at a rate of 15 per substitutions [ 29 ], and their lengths were distributed via a truncated power-law with parameter of 1.
The observed distribution of gaps was checked to see if it obeyed a power-law, and the power-law parameter was estimated using maximum likelihood [ 31 ]. Dawg recorded the actual alignment of each sequence pair making it possible to measure the accuracy of alignments generated through dynamic programming. Pairwise, global alignments were done with Ngila [ 32 ], an implementation of the candidate-list dynamic programming algorithm of Miller and Myers [ 11 ] for logarithmic and affine gap costs.
Because gap costs are usually optimized for specific substitution costs [ 9 ], the cost of a match was chosen to be 0 and the cost of a mismatch to be 1. The alignment identity Equation 1 of each of these 2. The statistical software, R [ 33 ], was used to analyze the alignment identities and produce most figures. Fitting affine gap costs to the optimal gap costs was done via the method of least squares for gap sizes 1 to The squared error was minimized separately using the optimization procedures in PopTools 2.
In this appendix we will develop a statistical model for alignment similar to [ 17 ] but simpler. To find the most likely alignment we need a measurement of the likelihood of an alignment given the observed pair of sequences, A and B , and predetermined evolutionary parameters.
This likelihood is proportional to the density of the alignment given the sequence pair [ 36 ] p9 :. To calculate Equation 6 completely for two sequences related by a common ancestor, one would have to consider all sequences that could be the most recent common ancestor of A and B and all possible branch lengths between this ancestor and A and B.
However, our simulations assumed that the tree relating A and B was unrooted, and thus A was considered to be descended from B , eliminating the need to consider the set of all possible progenitors for both sequences.
We calculate Equation 6 based on the evolutionary distance or branch length t between sequences A and B :. It is possible to derive Equation 7 from an evolutionary process. Specifically the probability that B gave rise to A over evolutionary distance t with indels to produce alignment Aln. If L a , M , and R are respectively the length of sequence A , the number of matches in the alignment, and the number of replacements, then under the Jukes-Cantor model,. Putting this all together,.
For simplicity we will not integrate Equation 7 to find f Aln , A , B. Instead, we will approximate it based on the mean value of t :. The alignment is quantified by the number of matches M , the number of mismatches R , and the set of gap lengths, k 1. Furthermore, the log-likelihood is.
As developed by Smith et al. A minimum cost analog of Equation 11 is. Using this relationship, Equation 12 can be expressed as. Applying this method to Equation 10 such that the cost of a match is 0 and the cost of a mismatch is 1 produces the following equation for a gap cost:. Google Scholar. Systematic Biology , — Article Google Scholar. J Mol Biol , — Nucleic Acids Res , — Nucl Acids Res , 31 13 — Genome Research , — BMC Bioinformatics , 7: Cambridge University Press; Book Google Scholar.
Gotoh O: An improved algorithm for matching biological sequences. Bull Math Biol , 97— Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Evol , 38— Waterman MS: Efficient sequence alignment algorithms. J Theor Biol , — Advances in Mathematics , — Journal of Computational Biology , 5 3 — In Statistical Methods in Molecular Evolution.
Edited by: Nielsen R. Springer Verlag; — Mol Biol Evol , — Article PubMed Google Scholar. J Mol Evol , — J Mol Evol , 3— Chang MSS, Benner SA: Empirical analysis of protein insertions and deletions determining parameters for the correct placement of gaps in protein sequence alignments.
Science , — Gu X, Li WH: The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. Zhang Z, Gerstein M: Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes.
Oxford University Press, New York; Bioinformatics , 22 Suppl 3 :iiiiii In Mammalian Protein Metabolism. Volume 3. Edited by: Munro HN. Academic Press, New York; — Chapter Google Scholar. Eur Phys J B , — Hood G: PopTools. Wolfram Research, Inc: Mathematica 5. Wolfram Research, Inc. Edwards AWF: Likelihood. Download references. You can also search for this author in PubMed Google Scholar. Correspondence to Reed A Cartwright. This article is published under license to BioMed Central Ltd. Reprints and Permissions.
Cartwright, R. Logarithmic gap costs decrease alignment accuracy. BMC Bioinformatics 7, Download citation. Received : 24 August Accepted : 05 December Published : 05 December Anyone you share the following link with will be able to read this content:.
Sorry, a shareable link is not currently available for this article. Provided by the Springer Nature SharedIt content-sharing initiative. Skip to main content. Search all BMC articles Search. Download PDF. Abstract Background Studies on the distribution of indel sizes have consistently found that they obey a power law. Results A group of simulated sequences pairs were globally aligned using affine, logarithmic, and log-affine gap costs. Conclusion In contrast to initial expectations, logarithmic gap costs produce poor alignments and are actually not implied by the power-law behavior of gap sizes, given typical match and mismatch costs.
Background Sequence alignments are essential to the study of molecular biology and systematics because they purport to reveal regions in sequences that are homologous. The idea of using the generalized profile is that the same residue types or gap states on a column are treated together. The generalized profile consists of four vectors calculated from each column of a group: frequency, residue profile, and two kinds of static gap profile vectors.
These vectors can be obtained in advance of the DP process. With the frequency and residue profile vectors, S a i , b j is calculated by. A static gap consists of consecutive static nulls that already exist in each group, while a dynamic gap denotes a run of dynamic nulls that are inserted into each group during the DP process.
Note that we distinguish static and dynamic gaps for convenience of the description of our algorithm, while they contribute to the total alignment score in the same way. Let us consider an example of the gap opening penalty calculation when a 8 is aligned with b 12 Figure 1.
To keep the discussion simple, we consider A 1 and B 2 only. From simple observation, we find that a gap between A 1 and B 2 has already opened before a 8 is aligned with b For correct detection of gap opening, we need 'running gap states', each of which represents the sum of the numbers of consecutive static and dynamic nulls up to the current position.
For the example shown in Figure 1 , the running gap state for A 2 at column position a 7 is 11, which is composed of 7 static nulls and 4 dynamic nulls. Example of gap extension penalty calculation. This figure shows an example of columns a 8 and b 12 being aligned.
In what follows, we consider the non-trivial calculation of the gap extension penalty with respect to the gap of B 1 , the target gap. Consequently, the total number of nulls aligned with null columns of dynamic gaps to be removed is 2. By subtracting 5 from 8 the end position of the segment , the starting position of A , 3, is obtained. Note that A 1 is not involved in the gap extension penalty because a 1,8 is a null. Static gap states are compactly represented as a static gap profile.
A static gap profile is obtained by gathering static gap states with the same values at a column. Like a static gap profile, a running gap profile can represent running gap states compactly. In order to obtain a running gap profile, another data structure called a gap mediation profile is required. A gap mediation profile is defined for each path in the DP process, and records the total number of dynamic nulls inserted into static gaps that are currently open.
Each element of the gap mediation profile is expressed as s , d , where s is the length of a static gap and d is the summed length of dynamic gaps inserted within or after the static gap. We first consider the case where a i and b j are aligned.
The other gap mediation profile vectors are constructed in a similar way. In this section, we describe a novel group-to-group sequence alignment algorithm with a piecewise linear gap cost.
After explanation of the piecewise linear gap cost, we describe these algorithms in detail. The piecewise linear gap cost consists of several linear functions [ 18 ]:. This cost could alleviate over-penalizing long indels, because the inclination of g x , which corresponds to a gap extension penalty, u l , becomes small as gap length increases.
In other words, this cost calculates gap extension penalties based on gap length. As mentioned above, the calculation of gap extension penalties must be separated from the calculation of the substitution score S a i , b j in order to use the piecewise linear gap cost. Therefore, S a i , b j is expressed as:. Therefore, a term for gap extension penalty is added to the right hand side of equation 8.
Let us consider an example of gap extension penalty calculation for the null on b 12 aligned with the residues on a 8 the last column of Figure 1. The gap state at b 1,12 is 8.
With omission of nulls that are aligned with other nulls, the lengths of the gap on B 1 relative to A 2 , A 3 , and A 4 are 1, 4, and 6, respectively. This example indicates two important points for the exact calculation of gap extension penalty. First, each gap length is obtained by counting the residues on each row of a specific range in A called 'relevant segment', A 3, 8 in this example.
Second, the two nulls on B 1 at columns b 5 and b 11 are aligned with two separate dynamic gaps inserted into A. The first observation suggests that F 1 and F 2 for any segment may be calculated before the DP process. However, the second observation indicates that the number of null columns of dynamic gaps aligned with static nulls on a target gap must be subtracted from the gap state of the target gap for correct assignment of the relevant segment; without this subtraction, the relevant segment would be assigned as A 1, 8 in the above example.
We initially consider the first point, neglecting the presence of dynamic gaps for a moment. Without loss of generality, we assume that the residues on a i are aligned with a null of a target gap on b j whose gap state is g.
F 0 a i is the same as the null component of the frequency profile at a i , i. For actual calculations, we use an 'insertion length profile' associated with each column of a group. For example, dynamic null columns aligned with b 5 and b 11 in Figure 1 are removed.
To keep track of the dynamic gaps to be removed, we newly introduce the 'dynamic gap information' list. Each element of dynamic gap information is represented by p , l , where p and l indicate the position and the length of a dynamic gap, respectively.
The other dynamic gap information lists are obtained in a similar way. The information held in dynamic gap information is relevant to the group aligned with a gap in question group A in the present example , while that maintained in a gap mediation profile is relevant to the group containing the gap group B in the above example. Note that step 2c in this algorithm examines whether or not the dynamic gap in question is partially aligned with the target gap.
The gap extension penalty, u , summed over all the nulls on column b j can easily be calculated with the following algorithm:. To obtain the total gap extension penalty at each DP step, we must also consider the opposite situation where residues on b j are aligned with gaps on a i in a similar way.
The doubly nested randomized iterative strategy involves refinement of alignment, phylogenetic tree, and pair weights until these are mutually consistent [ 9 ]. After preparation of an initial alignment with such progressive methods as the oligomer counting based method [ 8 , 10 ], this strategy refines the initial alignment as follows:.
Repeat steps 1 to 4 until the weighted SP score of the alignment does not improve anymore at step 4. However, PRIME employs our proposed algorithm with a piecewise linear gap cost in contrast to Prrn that uses an affine gap cost. Another algorithmic difference between PRIME and Prrn is that the latter uses the candidate list paradigm in the group-to-group sequence alignment algorithm and the anchoring method, whereas the former adopts a simpler DP method without anchoring heuristics.
Initial MSAs are constructed using a simple progressive method with the proposed group-to-group sequence alignment algorithm based on a distance matrix calculated from pairwise sequence alignment. PSA piecewise and PSA affine use the piecewise linear gap cost and the affine gap cost, respectively. Reference 1 is further divided into two sub-references based on sequence identities. Each sequence in the full length set is not trimmed off non-homologous regions, whereas the homologous region set consists of alignments of trimmed sequences and hence corresponds to the previous BAliBASE.
However, reference 4 is excluded from the homologous region set due to its objective. The sum-of-pairs score SPS is defined as the proportion of correctly aligned residue pairs:. The column score CS represents the proportion of correctly aligned columns:.
Each alignment of the long gap set, a subset of the main set, contains one or more gaps whose lengths are more than The weighting set involves alignments each of which includes more sequences of one sub-family than that of the other sub-families. Note that if the reference alignment is pairwise alignment, quality score, sum-of-pairs score, and column score have the same value. The average sum-of-pairs and column scores of the full length set are shown in Tables 3 and 4 , respectively.
The last columns of both tables are the rank sum of the Friedman test. The program with the smallest rank sum means that the program consistently constructs the most accurate alignments even if it does not achieve the largest average score.
The p -values of the Friedman test are shown in the additional file [see Additional file 2 ]. The Friedman test indicates that the tested programs are classified into three groups according to their performances.
The average sum-of-pairs and column scores of the homologous region set are shown in Tables 5 and 6 , respectively. The p -values of the Friedman test are shown in the additional file [see Additional file 3 ]. Figure 3 shows the performance difference between them. Because the terminal non-homologous regions trimmed off in the homologous region set are usually long, for the full length set, the piecewise linear gap cost treats the long terminal gaps more effectively than the affine gap cost, and hence PRIME piecewise shows better performance than PRIME affine.
The relative performance of the nine programs examined is nearly the same as that of the full length set in a statistical sense. We examine the effects of non-homologous regions in more detail.
Therefore, the difference in alignment scores obtained by the same program for the corresponding members in the full length and homologous region sets indicates to what extent the program properly deals with terminal gaps. Figures 4 and 5 show the average difference of sum-of-pairs and column scores between the full length and homologous region sets.
Each difference score is calculated by subtracting the alignment score of the full length set from that of the homologous region set. A positive difference score means that the non-homologous regions adversely affect alignment accuracy, whereas a negative score indicates improvement in alignment accuracy due to the presence of such regions.
If the score difference is close to 0, the program is considered to be robust against the non-homologous regions. These observations indicate that PRlME piecewise deals with terminal gaps better than conventional global MSA programs, although not as well as those incorporating local alignment information.
Average sum-of-pairs score differences between full length and homologous region sets. Each point means average sum-of-pairs score difference in respective alignments on each reference of the full length and homologous region sets.
The smaller absolute value of a score indicates that the introduction of long terminal indels less affects the alignment accuracy of a program. Average column score differences between full length and homologous region sets.
Each point means average column score difference in respective alignments on each reference of the full length and homologous region sets. Waterman and W. Comparative biosequence metrics. Taylor, P. A fast homology program for aligning biological sequences. Ukkonen, E. On approximate string matching. Lectures Notes Comp. Waterman, M. Efficient sequence alignment algorithms.
MathSciNet Google Scholar. Line geometries for sequence comparisons. Smith and W. Some biological sequence metrics. Wilbur, W. Rapid similarity searches of nucleic acid and protein data banks.
Zwieb, C. Glotz and R. Secondary structure comparisons between small subunit ribosomal RNA molecules from six different species. Download references. You can also search for this author in PubMed Google Scholar. Reprints and Permissions. Optimal sequence alignment allowing for long gaps. Bltn Mathcal Biology 52, — Download citation. Received : 25 April Revised : 17 April Issue Date : May Anyone you share the following link with will be able to read this content:.
Sorry, a shareable link is not currently available for this article. Provided by the Springer Nature SharedIt content-sharing initiative. Skip to main content. Search SpringerLink Search. Abstract A new algorithm for optimal sequence alignment allowing for long insertions and deletions is developed.
Literature Aho, A. Google Scholar Anderson, S. Article Google Scholar Altschul, S. Google Scholar Fickett, J. Google Scholar Fitch, W.
0コメント