Versions of the code used for the study are also available here, although they are not always user friendly.

I) A description of the four alignments used in this study.

II) A description of the parallel version of the maximum likelihood code.

III) A description of the method used to assign different rates to different positions code -- a modified version of DNArates by G. Olsen.

IV) A likelihood comparison of three evolutionary models: F84, F84 plus rate-variation at different positions, and REV plus rate-variation at different positions.

V) The relationship of evolutionary rates within a tree using the method that allows the rate of evolution in a tree to evolve.

VI) A summary of the steps taken for the analyses of each of the four data sets studied.

VII) Details concerning the optimization of the trees used as a basis for the molecular clock estimations of ancestral points in the tree.

VIII) Available code and data.

Four HIV-1 sequence alignments were used for this study. The two basic alignments were the full-length env and gag alignments from the 1998 HIV-1 sequence database supplemented with several additional sequences newly available in 1999; these comprise the most comprehensive datasets available at the time the study was initiated. We included one complete gene sequence per individual with known year of sampling and no indication of intersubtype recombination. Full-length genes were used because longer sequences improve the quality of phylogenetic reconstructions. 159 full-length env and 64 full-length gag sequences were used to give two independent estimates, although the gag dataset was much more limited. After gapstripping, the gp160 alignment had 142 sequences, 2038 sites, and 1460 distinct data patterns. The env subregion alignment, including the fragment from the 1959 sample, had 143 sequences, 403 sites (limited to the available sequence from the 1959 sample), and 317 distinct data patterns. The env subregion-alignment including the E-like region of the CRF AE sequences had 159 sequences (parts of the evelope gene were excluded to avoid including gene regions with overt recombination), 1512 sites, and 1089 distinct data patterns. The gag alignment had 64 sequences, 1363 sites, and 685 distinct data patterns.

The parallel version of the fastDNAml code that we used was implemented in C with the internode communication done using MPI. It follows G. Olsen's fastDNAml very closely. In extending it to use the REV model, we needed a fast method for exponentiating a general reversible transition matrix

where the are the equilibrium base frequencies. The matrix

The assignment of different rates to different positions has most commonly been done by estimating rate variation according to a gamma distribution with the shape determined by the data and the tree. The rates are then discretized into a user-specified number of categories, and each site is assigned a rate. (For more information about this strategy, see the documentation for PAML, a program which estimates the gamma distribution assigned rates, http://abacus.gene.ucl.ac.uk/software/paml.html .) The maximum likelihood strategy that we have used instead maximizes the likelihood of the rate for each site in an alignment, given the tree, adding as many parameters as there are positions in the alignment. Maximum likelihood estimates of rate variation are optimal only when adequate data are available to avoid over-parameterization.

For our estimates, we extended G. Olsen's DNArates program to use the REV model using the same strategy as for fastDNAml. To the estimation of the site-dependent rates, we added the calculation of the likelihood of the tree given the input model parameters. We iteratively maximized this likelihood over the model parameters using conjugate-gradient.

We have not implemented gamma distribution assigned rates into our parallel maximum likelihood code, and thus it was only possible to compare a gamma distribution with the maximum likelihood rate model directly for the gag tree, with a small enough number of taxa to test using PAUP* (Swofford, D. L. 1999. PAUP* Phylogenetic Analysis Using Parsimony (*and Other Methods). Sinauer Associates, Sunderland, MA http://abacus.gene.ucl.ac.uk/software/paml.html ). The env trees had too many taxa to use either PAUP or PAML. All trees incorporated a maximum likelihood estimate of base frequencies. For PAUP gag maximum likelihood trees including 66 taxa with 1363 sites: the F84 model with no rate variation at sites gave a log likelihood of -19349.7; a REV model gave -19294.0, and a REV model with gamma rates using 8 categories gave -17805.9 and estimated alpha to be 0.39659. Using our parallelized version of fastDNAml, the log likelihood using the F84 model was -19356.8, comparable to the log likelihood estimated using PAUP. However, the value using the REV model with maximum likelihood estimated rates assigned to 8 categories is -16328.5. The difference in the log likelihoods between gamma rates and maximum likelihood estimated rates, both tested with a REV model, was 1477, justifying the use of maximum likelihood estimated rates.

We compared all four of the trees included in this study with three models that were available in the parallelized maximum likelihood tree building code: (i) F84, (ii) F84 with maximum likelihood rate variation for sites with 8 categories, and (iii) a REV model with maximum likelihood rate variation for sites with 8 categories. The more complicated model was justified in each case, as the likelihood difference was always much greater than the increase in the number of parameters.

Tree | gp160 | gag | Esubtype | 1959 |
---|---|---|---|---|

Number of positions | 2038 | 1363 | 1512 | 403 |

log Likelihood difference (i) vs. (ii) | 9166 | 1775 | 7041 | 2488 |

log Likelihood difference (ii) vs. (iii) | 586 | 204 | 442 | 49 |

The method we used was adapted from the method used by J. Thorne and colleagues ( J. L. Thorne, H. Kishino, I. S. Painter, Mol Biol Evol 15:1647 (1998) ). Our version allows different times of origin for the sequences. We use as input a Hessian created from our parallel maximum likelihood trees, so we could incorporate rate variation at different positions and so we could use a REV model; Thorne

The relationship between rates of evolution on neighboring branches in a tree is defined as follows: say a child node has age

The constant of proportionality

The model used here is designed to accomodate drift, rather than abrupt change, in the rate of molecular evolution. The important parameter is

The level of detail included in this section was requested by one of the reviewers of the Science paper. All input alignments were gapstripped. An F84 maximum likelihood tree assuming a constant rate of substitution across sites was created. This tree was used as an input to obtain REV model parameters, an eight-category maximum likelihood assignment of rate variation to all positions, and a likelihood estimate of base frequencies. This evolutionary model was fed back into the full maximum likelihood code, and a new tree was generated. In cases, 5 -- 7 jumbles of input order were tested, and trees were generated using regional rearrangements allowing three branches to cross during the sequential and final additions of taxa to the tree. Jumbling the input order of sequences could improve the log likelihood scores of the env trees by values approaching in the most extreme case 100 for envelope alignments, but for gag, with many fewer sequences, jumbling improved the log likelihood by only 6; after 5 -- 7 jumbles, the best scores tended to converge. Three evolutionary models, F84, F84 plus site-rate variation, and REV plus site-rate variation, were compared and in each case the most sophisticated model was justified by a substantial increase in likelihood. Best trees were created using (i) a SIVcpz sequence as an outgroup, and (ii) the consensus sequence as outgroup. Retaining the topology obtained in (ii), we replaced the consensus with the SIVcpz sequence and recalculated the branch lengths. The likelihood of these trees were almost identical to those found in (i). In each case, these branch lengths were then used to estimate the time of origin and rate of evolution of the M group based on a maximum likelihood fit that incorporated error on time of sampling as well as on branch lengths. We generated 120 bootstrap replicates to calculate the 95% CIs using the two strategies discussed in Figure 2. We then calculated the gradient and Hessian of the log likelihood from the best tree based on the REV with the site rate variation model, and used this as the input for the calculation of trees that allow variation in the rate of evolution.

1. Jumbling the input order of sequences could improve likelihood scores by searching different regions in the tree space, so we jumbled until the likelihoods were converging to similar values; in each case the best score and the second best score are within 5 of each other. We would stop when we got to maximum likelihood scores that were very similar. The most extreme variation between log likelihoods with different input order was for the 1959 sequence, a difference of 113 between the best and the worst log likelihoods.

For the gag tree with fewer taxa, we found jumbling made very little difference.

The G (global) option was set to 3 to set the amount of branch crossing explored during tree rearrangements, during both the sequential addition phase of tree building and upon addition of the final taxa.

master.out.257470: log Likelihood | = | -15483.1441581 |

master.out.73172: log Likelihood | = | -15483.1441592 |

master.out.156530: log Likelihood | = | -55829.4682880 |

master.out.35520: log Likelihood | = | -55832.7448700 |

master.out.18989: log Likelihood | = | -55834.4910900 |

master.out.44755: log Likelihood | = | -55845.6608221 |

master.out.51013: log Likelihood | = | -55851.6229861 |

master.out.120985: log Likelihood | = | -55858.1982386 |

master.out.59612: log Likelihood | = | -55860.5584135 |

Env -- 2 jumbles, chimp out,

master.out.318833: log Likelihood | = | -57354.7457662 |

master.out.501396: log Likelihood | = | -57354.7457663 |

2) Reiterations recreating both the model and REV -- only done for gag and env gp160 coding region:

master.out.131849: log Likelihood | = -55714.7125562 |

master.out.381472: log Likelihood | = -55692.6834317 |

master.out.418776: log Likelihood | = -55694.5408223 |

master.out.46026: log Likelihood | = -55712.4821320 |

master.out.975416: log Likelihood | = -55699.5099517 |

Env gp160 sequence, no E sequences were included to avoid recombinant sequences:

A first tree was generated using F84 and imperical base frequencies; base frequencies, the REV model and site rate variation were estimated from the initial tree. We did another tree from scratch using these parameters, got the following:

log Likelihood = -57364.573106 (first tree)

We re-estimated the evolutionary model based on the second tree, using the same branching order as for the first tree for input and got the following:

log Likelihood = -57365.981373 (revised branch length tree)

Finally, we redid the tree from scratch using the second set of parameters, and got:

log Likelihood = -55692.683431

The 8 categories in each cases were very similar:

Normalized categories, Env iterations 1 and 2:

iteration 1 | iteration 2 |
---|---|

0.00159743 | 0.00148477 |

0.00430616 | 0.0040189 |

0.011312 | 0.0105515 |

0.0297151 | 0.0277027 |

0.0780587 | 0.0727328 |

0.205052 | 0.190958 |

0.538651 | 0.501356 |

1 | 1 |

First iteration evolutionary model parameters:

model: REV

order: TCAG

freq: 0.218093 0.205726 0.372622 0.203559

ratepar: 0.966152 0.177914 0.242290 0.345885 0.155164

C 8 0.07581 0.20436 0.53684 1.41021 3.70448 9.73129 25.56309 47.45765

3445656234 3556664454 6456454455 1111244143 1511411111 4224214315

1224551163 2356541533 5513115413 2151153142 1444442431 3546554125

5651141144 1513512423 4115113214 1141142142 1551122441 2416675534

2121151241 1464121415 1132115444 5451221441 6225111315 4451144121

2512414111 2154244254 1422511221 1214345413 2141143241 1432542552

4646215566 4264643352 1533314455 4245664556 5442553415 4322655442

2161454556 4535515454 4541144354 2431455453 5244116325 2141141245

4564514432 6114114115 2133141441 1451412521 5142115414 4143451141

5761554656 5114454115 4651255551 1546543511 5254212251 2511521411

4115324165 1122244353 1431411452 4514512111 2411553231 2555624555

1244345563 2462556532 3465224411 4241155255 5765563551 4666234565

2145651142 3555544545 4445564546 5456665453 3445431445 5554555554

4543344345 2361411444 5624254665 6652115666 6425524665 6555126755

4543166566 5554211666 5665445341 2311412554 4114414554 4653441554

1213111344 3114214125 3141442155 1561253466 6415114576 4354664154

5411534411 2154214514 6341355432 2355434423 5452122135 1343143341

1422566622 5665415466 1255552143 3521411411 5525615515 5444352333

3646541522 4124454114 2245151121 5432432411 1133341344 1422414511

4114215314 5551441522 4514233624 1241244355 7431255124 3322244122

3341322122 4144734422 4143124134 1432443411 5113113141 1111111544

4716532212 1211411143 3121446551 5413225111 1121112511 3125255134

6132123552 1412441423 4134115124 5143234351 1321522311 1113315114

1121141444 1443213424 3235141142 2421112457 3112124122 2133152265

2531112511 3113114144 4432152125 3311434544 3115112544 6151213161

6442441366 5324214323 6666653222 3466641561 3245555314 1164631451

1422555511 1224555246 5145254543 1531544613 2545454524 1121621111

2466511461 3648111314 2111142132 6351412441 1345214131 4421441562

1115414134 4124344436 4441551541 4512444441 2214411141 1125151121

1551631621 6112142164 5621246654 6454115425 5142113145 4544411542

4414114213 1114344451 2522544444 1156675563 3434566611 4145245535

5166541113 5414742653 3253112444 3251425551 3125335624 4323135535

1454554443 2443234565 4243464454 4453335541 5446555134 1115564552

1541544445 4222134535 4555246255 2654546421 6424452314 3321354251

2421322565 4521334655 6553545446 6434124315 5473144445 5421445

Values from the second iteration:

order: 3 2 0 1

freqs: 0.218165 0.199517 0.377100 0.205218

rates: 0.953618 0.154045 0.226101 0.337060 0.140815

C 8 0.09497 0.25706 0.67490 1.77194 4.65219 12.21421 32.06810 63.96277

3434655133 1556663454 6445453455 1111144143 1511411111 4224214314

1224541163 1256541422 5512115413 2141153142 1444442421 2546554125

5651141144 1513512423 3115113214 1141132142 1541112341 2416675534

2121151131 1464121415 1132115444 5451221441 6225111315 4451134111

2412414111 2154244254 1422511121 1114345413 2141142241 1432542542

3636215566 4264643352 1533214455 4245664556 5432553414 4321654442

2161454555 2535514453 4541134354 2421455353 5244115325 2141141245

3564514432 6114114115 2133141441 1451412521 4142114414 4133451131

5861554656 5114354115 4641255551 1546543511 4254212251 2511521411

4114323165 1122244352 1431411452 4514412111 2411553231 1455624555

1244245563 2361555522 1465114311 4241154254 5765573551 4665134565

2145651142 3555444545 4345564536 5456655553 3445431445 5554555554

4442234345 2251411445 5624154665 6652115665 6415514665 6555126755

4542166466 5554111666 5665445341 2311412554 4113414554 4653441554

1213111354 3114214125 3141442155 1561152466 5414114576 4354655154

5411534411 1154214513 6331255422 2355434423 5452112135 1243133341

1422566622 5565415466 1255552133 3521411411 5525615414 4442251222

3546531421 4124354114 2245141121 5431421411 1133331334 1412414511

4114214313 5551341522 4514133623 1241244255 7331255124 1111144122

2331322122 4144734422 4133124123 1432443411 5113113141 1111111544

4716522212 1211411143 3111446451 4413225111 1121112511 2125254134

6131113552 1312441423 3134115124 5133234351 1211512311 1112214114

1111131444 1443212424 3235141142 2421112457 3112124122 1123152265

2531112511 3112114144 3321152114 2211334434 3115112544 6151213161

6431441155 5324213223 6666643212 2466641561 2245555314 1164621451

1412554511 1224555136 5145254543 1531544513 2545453424 1121611111

2466511461 2548111314 2111142132 6351312431 1344114131 3411441562

1115414134 4114344326 4441551551 4512444341 1214411141 1125141111

1551631621 6112142164 5621245654 6454115424 4141113144 4544411542

4314114113 1114244441 2522544444 1156575563 3434466511 4135245535

5165541113 5414642653 2253111244 3251315451 3125225513 4313135524

1444553431 1422134565 3131463254 4453325531 5346555133 1115553452

1541544445 4222134525 4555246255 2654546421 6424452314 3311354241

2321322565 3521334655 6552445436 6434124315 5473134445 54214351

Base Composition of the consensus:

One of the concerns for a potential bias in the consensus sequence was that the base composition would become imbalanced, over-weighted with the most common base, A, but this was not the case:

For Gag, the base composition of the consensus was (there were also 32 Ns):

A | 0.370 | 493 | A |

C | 0.192 | 255 | C |

G | 0.240 | 319 | G |

T | 0.198 | 264 | T |

The Empirical Base Frequencies for all other Gag sequences were (there was 1 N):

A | 0.364 | 32257 | A |

C | 0.193 | 17119 | C |

G | 0.244 | 21669 | G |

T | 0.198 | 17549 | T |

The gag maximum likelihood frequencies:

order: 3 2 0 1 (TCAG)

A | 0.35411 |

C | 0.21432 |

G | 0.20836 |

T(U) | 0.22320 |

For Env no E the base composition of the consensus was:

A | 0.3454 | 704 |

C | 0.1811 | 369 |

G | 0.2287 | 466 |

T | 0.2448 | 499 |

The Empirical Base Frequencies for all other Env sequences were (there was 1 N):

A | 0.3403 | 97795 | A |

C | 0.1809 | 51993 | C |

G | 0.2312 | 66438 | G |

T | 0.2475 | 71131 | T |

The Env maximum likelihood estimate for the base frequences:

A | 0.37710 |

C | 0.19952 |

G | 0.20522 |

T | 0.21816 |

IV) The place the outgroup joins the tree is a very shallow minimum, relative to jumbling differences:

Con-out means that the consensus was used as an outgroup from the start to create the tree.

Replace-CPZ means that the branching order of the Con-out tree was used, but that a chimpanzee sequence replaced the consensus and branch lengths were recalculated.

CPZ-start means that the chimpanzee sequence was used as an outgroup from the start to create the tree.

Final best trees:

gp160 env, no E subtype:

Con-out: master.out.156530 log Likelihood | = | -55829.47 |

Replace-CPZ: outfile.16725 log Likelihood | = | -57356.26 |

CPZ-start: master.out.318833 log Likelihood | = | -57354.75 |

Partial env, with E subtype:

Con-out: master.out.33354 log Likelihood | = | -42900.38 |

Replace-CPZ: outfile.8884 log Likelihood | = | -43984.31 |

CPZ-start: master.out.103440: log Likelihood | = | -43974.95 |

1959 Sequence:

Con-out: master.out.131300:log Likelihood | = | -11627.22 |

Replace-CPZ: outfile.4974:log Likelihood | = | -11900.26 |

CPZ-start: master.out.38482:log Likelihood | = | -11957.15 |

Gag:

Con-out: treefile.312131 log Likelihood | = | -14577.35 |

Replace-CPZ: treefile.5065 log Likelihood | = | -15485.16 |

CPZ-start: treefile.73172: log Likelihood | = | -15483.14 |

Included here are links to some of the source code and documentation, and the alignments used in this study. The code was written by several different people and is not written to be user friendly. README files concerning how to run and compile the software will be available, but only very minimal software support will be provided, although we would welcome information about bugs. The code and README files can be retrieved via ftp as tar gzip directories.

Here is a list of what is available:

This code is still preliminary. It has been run on a variety of parallel platforms supporting MPI, but its input, output, and error reporting is still rough. It is available for download (download ml.tar.gz). The code runs on serial machines as well (download mlser.tar.gz). If you need help, contact: tanmoy@lanl.gov.

The code is written in C, is serial, and has been run on UNIX workstations. It is available for download (download RevDNArates.tar.gz).

This code is preliminary. It is available for download in a file that is combined with the serial machine maximum likelihood tree building code. (download mlser.tar.gz). Contact Tanmoy Bhattacharya: tanmoy@lanl.gov in case of problem.

The first version that was used in the Science paper is available for download, but is problematic because of a bias that inappropiately pushes back ages of internal nodes: (download sp.tar.gz).

A second, improved version of the code is available (manuscript in preparation, Mark Muldoon) that corrects the bias mentioned above: (download rate-evolution.tar.gz) .

The code is written in perl, and has been run on UNIX workstations. It is available for download (download TREEPAINT.tar.gz).

Parts of the code that may be generally useful are available for download, in particular code that sums branch lengths from tips to interior nodes in a tree and plots them against time with standard best fit lines:

(download BRANCHLENGTHxTIME.tar.gz).

And a general Perl code that provides the branch lengths of all tips in a tree to any specified node is available:

(download BRANCHLENGTH.tar.gz).

It is available for download (download ALIGNMENTS.tar.gz).