This website (http://www.santafe.edu/~btk/science-paper/bette.html) provides details concerning the phylogenetic tree building code, evolutionary models and phylogenetic trees used in: B Korber, M Muldoon, J Theiler, F Gao, R Gupta, A Lapedes, B Hahn, S Wolinsky, T Bhattacharya. Timing the Ancestor of the HIV-1 Pandemic Strains , Science (288:1757-1759 2000). The level of detail at this site was provided to respond to requests for additional information that came from the reviewers of the manuscript.

Versions of the code used for the study are also available here, although they are not always user friendly.

CONTENTS:

I) A description of the four alignments used in this study.

II) A description of the parallel version of the maximum likelihood code.

III) A description of the method used to assign different rates to different positions code -- a modified version of DNArates by G. Olsen.

IV) A likelihood comparison of three evolutionary models: F84, F84 plus rate-variation at different positions, and REV plus rate-variation at different positions.

V) The relationship of evolutionary rates within a tree using the method that allows the rate of evolution in a tree to evolve.

VI) A summary of the steps taken for the analyses of each of the four data sets studied.

VII) Details concerning the optimization of the trees used as a basis for the molecular clock estimations of ancestral points in the tree.

VIII) Available code and data.


I) A description of the four alignments used in this study.

Four HIV-1 sequence alignments were used for this study. The two basic alignments were the full-length env and gag alignments from the 1998 HIV-1 sequence database supplemented with several additional sequences newly available in 1999; these comprise the most comprehensive datasets available at the time the study was initiated. We included one complete gene sequence per individual with known year of sampling and no indication of intersubtype recombination. Full-length genes were used because longer sequences improve the quality of phylogenetic reconstructions. 159 full-length env and 64 full-length gag sequences were used to give two independent estimates, although the gag dataset was much more limited. After gapstripping, the gp160 alignment had 142 sequences, 2038 sites, and 1460 distinct data patterns. The env subregion alignment, including the fragment from the 1959 sample, had 143 sequences, 403 sites (limited to the available sequence from the 1959 sample), and 317 distinct data patterns. The env subregion-alignment including the E-like region of the CRF AE sequences had 159 sequences (parts of the evelope gene were excluded to avoid including gene regions with overt recombination), 1512 sites, and 1089 distinct data patterns. The gag alignment had 64 sequences, 1363 sites, and 685 distinct data patterns.

II) A description of the parallel version of the maximum likelihood code -- a modified version of fastDNAml by G. Olsen.

The parallel version of the fastDNAml code that we used was implemented in C with the internode communication done using MPI. It follows G. Olsen's fastDNAml very closely. In extending it to use the REV model, we needed a fast method for exponentiating a general reversible transition matrix T. This was done in two steps: first we construct a real symmetric matrix U related to T by the similarity transformation


where the are the equilibrium base frequencies. The matrix U was then diagonalized easily, and the log likelihood calculated from the eigenvalues and eigenvectors.

III) A description of the method used to assign different rates to different positions code -- a modified version of DNArates by G. Olsen.

The assignment of different rates to different positions has most commonly been done by estimating rate variation according to a gamma distribution with the shape determined by the data and the tree. The rates are then discretized into a user-specified number of categories, and each site is assigned a rate. (For more information about this strategy, see the documentation for PAML, a program which estimates the gamma distribution assigned rates, http://abacus.gene.ucl.ac.uk/software/paml.html .) The maximum likelihood strategy that we have used instead maximizes the likelihood of the rate for each site in an alignment, given the tree, adding as many parameters as there are positions in the alignment. Maximum likelihood estimates of rate variation are optimal only when adequate data are available to avoid over-parameterization.

For our estimates, we extended G. Olsen's DNArates program to use the REV model using the same strategy as for fastDNAml. To the estimation of the site-dependent rates, we added the calculation of the likelihood of the tree given the input model parameters. We iteratively maximized this likelihood over the model parameters using conjugate-gradient.

We have not implemented gamma distribution assigned rates into our parallel maximum likelihood code, and thus it was only possible to compare a gamma distribution with the maximum likelihood rate model directly for the gag tree, with a small enough number of taxa to test using PAUP* (Swofford, D. L. 1999. PAUP* Phylogenetic Analysis Using Parsimony (*and Other Methods). Sinauer Associates, Sunderland, MA http://abacus.gene.ucl.ac.uk/software/paml.html ). The env trees had too many taxa to use either PAUP or PAML. All trees incorporated a maximum likelihood estimate of base frequencies. For PAUP gag maximum likelihood trees including 66 taxa with 1363 sites: the F84 model with no rate variation at sites gave a log likelihood of -19349.7; a REV model gave -19294.0, and a REV model with gamma rates using 8 categories gave -17805.9 and estimated alpha to be 0.39659. Using our parallelized version of fastDNAml, the log likelihood using the F84 model was -19356.8, comparable to the log likelihood estimated using PAUP. However, the value using the REV model with maximum likelihood estimated rates assigned to 8 categories is -16328.5. The difference in the log likelihoods between gamma rates and maximum likelihood estimated rates, both tested with a REV model, was 1477, justifying the use of maximum likelihood estimated rates.

IV) A comparison of three evolutionary models:

We compared all four of the trees included in this study with three models that were available in the parallelized maximum likelihood tree building code: (i) F84, (ii) F84 with maximum likelihood rate variation for sites with 8 categories, and (iii) a REV model with maximum likelihood rate variation for sites with 8 categories. The more complicated model was justified in each case, as the likelihood difference was always much greater than the increase in the number of parameters.

Tree gp160 gag Esubtype 1959
Number of positions 2038 1363 1512 403
log Likelihood difference (i) vs. (ii) 9166 1775 7041 2488
log Likelihood difference (ii) vs. (iii) 586 204 442 49


V) The relationship of evolutionary rates using the method that allows the rate of evolution in a tree to evolve.

The method we used was adapted from the method used by J. Thorne and colleagues ( J. L. Thorne, H. Kishino, I. S. Painter, Mol Biol Evol 15:1647 (1998) ). Our version allows different times of origin for the sequences. We use as input a Hessian created from our parallel maximum likelihood trees, so we could incorporate rate variation at different positions and so we could use a REV model; Thorne et al. used an F84 model maximum likelihood tree. So, the output of our parallel maximum likelihood code is a tree, complete with branch lengths, that maximizes the likelihood, the gradient (first derivative with respect to the branch lengths, which is zero except when the optimal branch length is zero), and the Hessian (second derivative) of the log likelihood at the maximum. This information is used by our code that allows the rate of evolution to evolve, to make a quadratic approximation to the log likelihood or, equivalently, a multivariate Gaussian approximation to the likelihood.

The relationship between rates of evolution on neighboring branches in a tree is defined as follows:   say a child node has age Tc , its parent Tp and the parent's parent (the child's grandparent) Tg , then the rate associated with the child branch is expected to be drawn from a log-normal distribution whose mean is the same as the rate on the parent branch and whose variance is proportional to the difference in ages between the midpoints of the branches:



The constant of proportionality v is one of the parameters of the model, as are all the ages and rates associated with the internal nodes. One specifies the tree topology, a multivariate Gaussian approximation to the likelihood (as a function of branch lengths) and prior distributions for the rate and age of the root node and is left with a large Bayesian estimation problem. We follow Thorne et al. and answer these questions with a Markov-Chain Monte Carlo calculation using the Metropolis algorithm. (N. Metropolis et al., Journal of Chemical Physics 21, 1087 (1957)).

The model used here is designed to accomodate drift, rather than abrupt change, in the rate of molecular evolution. The important parameter is v, the rate heterogeneity parameter, and it is here given a prior distribution that allows for an expected rate variation of a factor of e = 2.718... at the leaf nodes or a factor of e ± 1/2 between the root and the leaves.

VI) A summary of the steps taken for the analyses of each of the four data sets studied.

The level of detail included in this section was requested by one of the reviewers of the Science paper. All input alignments were gapstripped. An F84 maximum likelihood tree assuming a constant rate of substitution across sites was created. This tree was used as an input to obtain REV model parameters, an eight-category maximum likelihood assignment of rate variation to all positions, and a likelihood estimate of base frequencies. This evolutionary model was fed back into the full maximum likelihood code, and a new tree was generated. In cases, 5 -- 7 jumbles of input order were tested, and trees were generated using regional rearrangements allowing three branches to cross during the sequential and final additions of taxa to the tree. Jumbling the input order of sequences could improve the log likelihood scores of the env trees by values approaching in the most extreme case 100 for envelope alignments, but for gag, with many fewer sequences, jumbling improved the log likelihood by only 6; after 5 -- 7 jumbles, the best scores tended to converge. Three evolutionary models, F84, F84 plus site-rate variation, and REV plus site-rate variation, were compared and in each case the most sophisticated model was justified by a substantial increase in likelihood. Best trees were created using (i) a SIVcpz sequence as an outgroup, and (ii) the consensus sequence as outgroup. Retaining the topology obtained in (ii), we replaced the consensus with the SIVcpz sequence and recalculated the branch lengths. The likelihood of these trees were almost identical to those found in (i). In each case, these branch lengths were then used to estimate the time of origin and rate of evolution of the M group based on a maximum likelihood fit that incorporated error on time of sampling as well as on branch lengths. We generated 120 bootstrap replicates to calculate the 95% CIs using the two strategies discussed in Figure 2. We then calculated the gradient and Hessian of the log likelihood from the best tree based on the REV with the site rate variation model, and used this as the input for the calculation of trees that allow variation in the rate of evolution.

VII) Details concerning the optimization of the trees used as a basis for the molecular clock estimations of ancestral points in the tree.

1. Jumbling the input order of sequences could improve likelihood scores by searching different regions in the tree space, so we jumbled until the likelihoods were converging to similar values; in each case the best score and the second best score are within 5 of each other. We would stop when we got to maximum likelihood scores that were very similar. The most extreme variation between log likelihoods with different input order was for the 1959 sequence, a difference of 113 between the best and the worst log likelihoods.

For the gag tree with fewer taxa, we found jumbling made very little difference.

The G (global) option was set to 3 to set the amount of branch crossing explored during tree rearrangements, during both the sequential addition phase of tree building and upon addition of the final taxa.

Gag -- 2 jumbles, SIV CPZ.US out

      master.out.257470: log Likelihood =   -15483.1441581
      master.out.73172: log Likelihood =   -15483.1441592

Env -- 7 jumbles, Consensus out

      master.out.156530: log Likelihood =   -55829.4682880
      master.out.35520: log Likelihood =   -55832.7448700
      master.out.18989: log Likelihood =   -55834.4910900
      master.out.44755: log Likelihood =   -55845.6608221
      master.out.51013: log Likelihood =   -55851.6229861
      master.out.120985: log Likelihood =   -55858.1982386
      master.out.59612: log Likelihood =   -55860.5584135

Env -- 2 jumbles, chimp out,

      master.out.318833: log Likelihood =   -57354.7457662
      master.out.501396: log Likelihood =   -57354.7457663

2) Reiterations recreating both the model and REV -- only done for gag and env gp160 coding region:

      master.out.131849: log Likelihood =   -55714.7125562
      master.out.381472: log Likelihood =   -55692.6834317
      master.out.418776: log Likelihood =   -55694.5408223
      master.out.46026: log Likelihood =   -55712.4821320
      master.out.975416: log Likelihood =   -55699.5099517

Env gp160 sequence, no E sequences were included to avoid recombinant sequences:

A first tree was generated using F84 and imperical base frequencies; base frequencies, the REV model and site rate variation were estimated from the initial tree. We did another tree from scratch using these parameters, got the following:

      log Likelihood = -57364.573106 (first tree)

We re-estimated the evolutionary model based on the second tree, using the same branching order as for the first tree for input and got the following:

      log Likelihood = -57365.981373 (revised branch length tree)

Finally, we redid the tree from scratch using the second set of parameters, and got:

      log Likelihood = -55692.683431

The 8 categories in each cases were very similar:

Normalized categories, Env iterations 1 and 2:

iteration 1 iteration 2
0.00159743 0.00148477
0.00430616 0.0040189
0.011312 0.0105515
0.0297151 0.0277027
0.0780587 0.0727328
0.205052 0.190958
0.538651 0.501356
1 1

85% of the sites were assigned the same category (1-8) in the two iterations; those that changed categories never did by more than 1; most of these went up one category in the second iteration, only 0.3% went down a category.

First iteration evolutionary model parameters:

model:   REV
order:   TCAG

freq:   0.218093 0.205726 0.372622 0.203559
ratepar:   0.966152 0.177914 0.242290 0.345885 0.155164
C 8     0.07581     0.20436     0.53684     1.41021     3.70448     9.73129     25.56309     47.45765

Categories

3445656234 3556664454 6456454455 1111244143 1511411111 4224214315
1224551163 2356541533 5513115413 2151153142 1444442431 3546554125
5651141144 1513512423 4115113214 1141142142 1551122441 2416675534
2121151241 1464121415 1132115444 5451221441 6225111315 4451144121
2512414111 2154244254 1422511221 1214345413 2141143241 1432542552
4646215566 4264643352 1533314455 4245664556 5442553415 4322655442
2161454556 4535515454 4541144354 2431455453 5244116325 2141141245
4564514432 6114114115 2133141441 1451412521 5142115414 4143451141
5761554656 5114454115 4651255551 1546543511 5254212251 2511521411
4115324165 1122244353 1431411452 4514512111 2411553231 2555624555
1244345563 2462556532 3465224411 4241155255 5765563551 4666234565
2145651142 3555544545 4445564546 5456665453 3445431445 5554555554
4543344345 2361411444 5624254665 6652115666 6425524665 6555126755
4543166566 5554211666 5665445341 2311412554 4114414554 4653441554
1213111344 3114214125 3141442155 1561253466 6415114576 4354664154
5411534411 2154214514 6341355432 2355434423 5452122135 1343143341
1422566622 5665415466 1255552143 3521411411 5525615515 5444352333
3646541522 4124454114 2245151121 5432432411 1133341344 1422414511
4114215314 5551441522 4514233624 1241244355 7431255124 3322244122
3341322122 4144734422 4143124134 1432443411 5113113141 1111111544
4716532212 1211411143 3121446551 5413225111 1121112511 3125255134
6132123552 1412441423 4134115124 5143234351 1321522311 1113315114
1121141444 1443213424 3235141142 2421112457 3112124122 2133152265
2531112511 3113114144 4432152125 3311434544 3115112544 6151213161
6442441366 5324214323 6666653222 3466641561 3245555314 1164631451
1422555511 1224555246 5145254543 1531544613 2545454524 1121621111
2466511461 3648111314 2111142132 6351412441 1345214131 4421441562
1115414134 4124344436 4441551541 4512444441 2214411141 1125151121
1551631621 6112142164 5621246654 6454115425 5142113145 4544411542
4414114213 1114344451 2522544444 1156675563 3434566611 4145245535
5166541113 5414742653 3253112444 3251425551 3125335624 4323135535
1454554443 2443234565 4243464454 4453335541 5446555134 1115564552
1541544445 4222134535 4555246255 2654546421 6424452314 3321354251
2421322565 4521334655 6553545446 6434124315 5473144445 5421445

Values from the second iteration:

order: 3 2 0 1
freqs: 0.218165 0.199517 0.377100 0.205218
rates: 0.953618 0.154045 0.226101 0.337060 0.140815

C 8     0.09497     0.25706     0.67490     1.77194     4.65219     12.21421     32.06810     63.96277

Categories

3434655133 1556663454 6445453455 1111144143 1511411111 4224214314
1224541163 1256541422 5512115413 2141153142 1444442421 2546554125
5651141144 1513512423 3115113214 1141132142 1541112341 2416675534
2121151131 1464121415 1132115444 5451221441 6225111315 4451134111
2412414111 2154244254 1422511121 1114345413 2141142241 1432542542
3636215566 4264643352 1533214455 4245664556 5432553414 4321654442
2161454555 2535514453 4541134354 2421455353 5244115325 2141141245
3564514432 6114114115 2133141441 1451412521 4142114414 4133451131
5861554656 5114354115 4641255551 1546543511 4254212251 2511521411
4114323165 1122244352 1431411452 4514412111 2411553231 1455624555
1244245563 2361555522 1465114311 4241154254 5765573551 4665134565
2145651142 3555444545 4345564536 5456655553 3445431445 5554555554
4442234345 2251411445 5624154665 6652115665 6415514665 6555126755
4542166466 5554111666 5665445341 2311412554 4113414554 4653441554
1213111354 3114214125 3141442155 1561152466 5414114576 4354655154
5411534411 1154214513 6331255422 2355434423 5452112135 1243133341
1422566622 5565415466 1255552133 3521411411 5525615414 4442251222
3546531421 4124354114 2245141121 5431421411 1133331334 1412414511
4114214313 5551341522 4514133623 1241244255 7331255124 1111144122
2331322122 4144734422 4133124123 1432443411 5113113141 1111111544
4716522212 1211411143 3111446451 4413225111 1121112511 2125254134
6131113552 1312441423 3134115124 5133234351 1211512311 1112214114
1111131444 1443212424 3235141142 2421112457 3112124122 1123152265
2531112511 3112114144 3321152114 2211334434 3115112544 6151213161
6431441155 5324213223 6666643212 2466641561 2245555314 1164621451
1412554511 1224555136 5145254543 1531544513 2545453424 1121611111
2466511461 2548111314 2111142132 6351312431 1344114131 3411441562
1115414134 4114344326 4441551551 4512444341 1214411141 1125141111
1551631621 6112142164 5621245654 6454115424 4141113144 4544411542
4314114113 1114244441 2522544444 1156575563 3434466511 4135245535
5165541113 5414642653 2253111244 3251315451 3125225513 4313135524
1444553431 1422134565 3131463254 4453325531 5346555133 1115553452
1541544445 4222134525 4555246255 2654546421 6424452314 3311354241
2321322565 3521334655 6552445436 6434124315 5473134445 54214351

Base Composition of the consensus:

One of the concerns for a potential bias in the consensus sequence was that the base composition would become imbalanced, over-weighted with the most common base, A, but this was not the case:

For Gag, the base composition of the consensus was (there were also 32 Ns):

A 0.370 493 A
C 0.192 255 C
G 0.240 319 G
T 0.198 264 T

The Empirical Base Frequencies for all other Gag sequences were (there was 1 N):

A 0.364 32257 A
C 0.193 17119 C
G 0.244 21669 G
T 0.198 17549 T

The gag maximum likelihood frequencies:

order: 3 2 0 1 (TCAG)

A 0.35411
C 0.21432
G 0.20836
T(U) 0.22320

For Env no E the base composition of the consensus was:

A 0.3454 704
C 0.1811 369
G 0.2287 466
T 0.2448 499

The Empirical Base Frequencies for all other Env sequences were (there was 1 N):

A 0.3403 97795 A
C 0.1809 51993 C
G 0.2312 66438 G
T 0.2475 71131 T

The Env maximum likelihood estimate for the base frequences:

A 0.37710
C 0.19952
G 0.20522
T 0.21816


IV) The place the outgroup joins the tree is a very shallow minimum, relative to jumbling differences:

Con-out means that the consensus was used as an outgroup from the start to create the tree.
Replace-CPZ means that the branching order of the Con-out tree was used, but that a chimpanzee sequence replaced the consensus and branch lengths were recalculated.
CPZ-start means that the chimpanzee sequence was used as an outgroup from the start to create the tree.

Final best trees:

gp160 env, no E subtype:
      Con-out: master.out.156530 log Likelihood =   -55829.47
      Replace-CPZ: outfile.16725 log Likelihood =   -57356.26
      CPZ-start: master.out.318833 log Likelihood =   -57354.75

Partial env, with E subtype:
      Con-out: master.out.33354 log Likelihood =   -42900.38
      Replace-CPZ: outfile.8884 log Likelihood =   -43984.31
      CPZ-start: master.out.103440: log Likelihood =   -43974.95

1959 Sequence:
      Con-out: master.out.131300:log Likelihood =   -11627.22
      Replace-CPZ: outfile.4974:log Likelihood =   -11900.26
      CPZ-start: master.out.38482:log Likelihood =   -11957.15

Gag:
      Con-out: treefile.312131 log Likelihood =   -14577.35
      Replace-CPZ: treefile.5065 log Likelihood =   -15485.16
      CPZ-start: treefile.73172: log Likelihood =   -15483.14


VIII) Available code and data.

Included here are links to some of the source code and documentation, and the alignments used in this study. The code was written by several different people and is not written to be user friendly. README files concerning how to run and compile the software will be available, but only very minimal software support will be provided, although we would welcome information about bugs. The code and README files can be retrieved via ftp as tar gzip directories.

Here is a list of what is available:

1) The maximum likelihood code for parallel computers that takes a REV model. This code was based on Gary Olsen's fastDNAml.

Input: Sequence alignment in fastDNAml format with slight modifications
Output: ML tree

This code is still preliminary. It has been run on a variety of parallel platforms supporting MPI, but its input, output, and error reporting is still rough. It is available for download (download ml.tar.gz). The code runs on serial machines as well (download mlser.tar.gz). If you need help, contact: tanmoy@lanl.gov.

2) A program that is an adaptation of the DNArates program by Gary Olsen and colleagues, that simultaneously calculates the REV model and the maximum likelihood rate variation per site.

Input: Sequence alignment in fastDNAml format with a USER tree
Output: REV model and maximum likelihood assigned rates/site

The code is written in C, is serial, and has been run on UNIX workstations. It is available for download (download RevDNArates.tar.gz).

3) A program that takes a maximum likelihood tree and calculates the Hessian for input into the code that estimates a tree using a relaxed molecular clock: hessian.tar.gz

Input: Sequence alignment in fastDNAml format with a USER tree
Output: The Hessian, or the co-variation matrix, a very large file, that is used as input for a tree that allows evolution of the rate of evolution.

This code is preliminary. It is available for download in a file that is combined with the serial machine maximum likelihood tree building code. (download mlser.tar.gz). Contact Tanmoy Bhattacharya: tanmoy@lanl.gov in case of problem.

4) A program that was a modified version of Jeff Thorne's code to relax the assumption of a molecular clock and allow the evolutionary rate to evolve.

Input: The sequence alignment and Hessian created by the program described in part 3, and some Bayesian priors.
Output: A phylogenetic tree where the branch length is time, and an outfile indicating the estimated evolutionary rate and time of each internal node, and the error.

The first version that was used in the Science paper is available for download, but is problematic because of a bias that inappropiately pushes back ages of internal nodes: (download sp.tar.gz).

A second, improved version of the code is available (manuscript in preparation, Mark Muldoon) that corrects the bias mentioned above: (download rate-evolution.tar.gz) .

5) A program that "colorizes" a relaxed molecular clock tree, breaking the estimated rates for each branch into five categories from fastest to slowest.

Input: A postscript tree created by the tool "treetool" based on the tree derived from program 4, and the output from program 4 that specifies a rate for each branch.
Output: A tree that has colored branches indicating the evolutionary rate of each branch.

The code is written in perl, and has been run on UNIX workstations. It is available for download (download TREEPAINT.tar.gz).

6) A series of programs that estimate the linear relationship between branch length and time using the model described in the paper. The full series of programs incorporates subroutines from Numerical recipes, and because of copyright issues can only be used by individuals who have a copy of Numerical Recipes for C; code could be provided upon request to individuals who do own Numerical Recipes btk@lanl.gov. This code was written very specifically for the HIV study at hand, and not readily generalized to other situations. It is a combination of shell scripts, perl scripts and C and has been run on linux and unix machines.

Parts of the code that may be generally useful are available for download, in particular code that sums branch lengths from tips to interior nodes in a tree and plots them against time with standard best fit lines:

Input: A PHYLIP or fastDNAml or modified-fastDNAml "outfile".
Output: Plots of branch lengths to any specified node against time, best fit lines, Monte Carlo respampling statistics.

(download BRANCHLENGTHxTIME.tar.gz).

And a general Perl code that provides the branch lengths of all tips in a tree to any specified node is available:

Input: A PHYLIP or fastDNAml or modified-fastDNAml "outfile".
Output: The sum of branch lengths to any specified node.

(download BRANCHLENGTH.tar.gz).

7) The alignments used for this study. BE WARNED -- more extensive alignments could be created from the Los Alamos Sequence database if desired, as this dataset was created from the data available in early 1999, and new sequences are constantly being generated.

It is available for download (download ALIGNMENTS.tar.gz).