Download it from here: http://abacus.gene.ucl.ac.uk/software/paml.html#download
Windows (The bin/ folder already contains windows executables):
cd paml4.6/
Linux:
cd paml4.5/ rm bin/*.exe cd src make -f Makefile ls -l rm *.o mv baseml basemlg codeml pamp evolver yn00 chi2 ../bin chmod +x ../bin/*
MacOSX:
cd paml4.6/ rm bin/*.exe cd src cc -O2 -o baseml baseml.c tools.c -lm cc -O2 -o basemlg basemlg.c tools.c -lm cc -O2 -o codeml codeml.c tools.c -lm cc -O2 -o pamp pamp.c tools.c -lm cc -O2 -o mcmctree mcmctree.c tools.c -lm cc -O2 -o evolver evolver.c tools.c -lm cc -O2 -o yn00 yn00.c tools.c -lm cc -O2 -o chi2 chi2.c -lm ls -l rm *.o mv baseml basemlg codeml pamp evolver yn00 chi2 ../bin chmod +x ../bin/*
The selective pressure in protein coding genes can be detected within the framework of comparative genomics. The selective pressure is assumed to be defined by the ratio (ω) dN/dS. dS represents the synonymous rate (changing the amino acid) and dN the non-synonymous rate (keeping the amino acid). In the absence of evolutionary pressure, the synonymous rate and the non-synonymous rate are equal, so the dN/dS ratio is equal to 1. Under purifying selection, natural selection prevents the replacement of amino acids, so the dN will be lower than the dS, and dN/dS < 1. And under positive selection, the replacement rate of amino acid is favoured by selection, and dN/dS > 1.
CodeML is a program from the package PAML, based on Maximum Likelihood, and developed in the lab of Ziheng Yang, University College London.
It estimates various parameters (Ts/Tv, dN/dS, branch length) on the codon (nucleotide) alignment, based on a predefined topology (phylogenetic tree).
Different categories of codon models exist in CodeML:
First, we have to define the branch where we think that position could have occurred. We will call this branch the “foreground branch” and all other branches in the tree will be the “background” branches. The background branches share the same distribution of ω = dN/dS value among sites, whereas different values can apply to the foreground branch.
To compute the likelihood value, two models are computed: a null model, in which the foreground branch may have different proportions of sites under neutral selection to the background (i.e. relaxed purifying selection), and an alternative model, in which the foreground branch may have a proportion of sites under positive selection.
As the alternative model is the general case, it is easier to present it first.
Four categories of sites are assumed in the branch-site model:
Sites with identical dN/dS in both foreground and background branches:
Sites with different dN/dS between foreground and background branches:
For each category, we get the proportion of sites and the associated dN/dS values.
In the null model, the dN/dS (ω2) is fixed to 1:
Sites with identical dN/dS in both foreground and background branches:
Sites with different dN/dS between foreground and background branches:
For each model, we get the log likelihood value (lnL1 for the alternative and lnL0 for the null models), from which we compute the Likelihood Ratio Test (LRT).
The 2×(lnL1-lnL0) follows a χ² curve with degree of freedom of 1, so we can get a p-value for this LRT.
Let's go in details.
In the evolution vertebrates, we would like to know if the branch leading to the Teleost fishes (genes A50 to A54) in the serine/threonine-protein kinase PAK 2 gene was under positive selection or not. And if yes, which residues were under positive selection.
We need four files to run CodeML (unzip them all):
Run command file (alternative model):
We estimate the Ts/Tv ratio (fix_kappa = 0) and the dN/dS (fix_omega = 0). The branch-site model is specified by setting these two parameters:
seqfile = TF105351.Eut.3.phy * sequence data file name treefile = TF105351.Eut.3.53876.tree * tree structure file name outfile = TF105351.Eut.3.53876.mlc * main result file name noisy = 9 * 0,1,2,3,9: how much rubbish on the screen verbose = 1 * 1: detailed output, 0: concise output runmode = 0 * 0: user tree; 1: semi-automatic; 2: automatic * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs CodonFreq = 2 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table clock = 0 * 0: no clock, unrooted tree, 1: clock, rooted tree aaDist = 0 * 0:equal, +:geometric; -:linear, {1-5:G1974,Miyata,c,p,v} model = 2 * models for codons: * 0:one, 1:b, 2:2 or more dN/dS ratios for branches NSsites = 2 * 0:one w; 1:NearlyNeutral; 2:PositiveSelection; 3:discrete; * 4:freqs; 5:gamma;6:2gamma;7:beta;8:beta&w;9:betaγ10:3normal icode = 0 * 0:standard genetic code; 1:mammalian mt; 2-10:see below Mgene = 0 * 0:rates, 1:separate; 2:pi, 3:kappa, 4:all fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated kappa = 2 * initial or fixed kappa fix_omega = 0 * 1: omega or omega_1 fixed, 0: estimate omega = 1 * initial or fixed omega, for codons or codon-based AAs getSE = 0 * 0: don't want them, 1: want S.E.s of estimates RateAncestor = 0 * (0,1,2): rates (alpha>0) or ancestral states (1 or 2) Small_Diff = .45e-6 * Default value. cleandata = 1 * remove sites with ambiguity data (1:yes, 0:no)? fix_blength = 0 * 0: ignore, -1: random, 1: initial, 2: fixed
Run command file (null model):
The command file for the null model is the same as for the alternative model, except for two parameters (in red):
seqfile = TF105351.Eut.3.phy * sequence data file name treefile = TF105351.Eut.3.53876.tree * tree structure file name outfile = TF105351.Eut.3.53876.fixed.mlc * main result file name noisy = 9 * 0,1,2,3,9: how much rubbish on the screen verbose = 1 * 1: detailed output, 0: concise output runmode = 0 * 0: user tree; 1: semi-automatic; 2: automatic * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs CodonFreq = 2 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table clock = 0 * 0: no clock, unrooted tree, 1: clock, rooted tree aaDist = 0 * 0:equal, +:geometric; -:linear, {1-5:G1974,Miyata,c,p,v} model = 2 * models for codons: * 0:one, 1:b, 2:2 or more dN/dS ratios for branches NSsites = 2 * 0:one w; 1:NearlyNeutral; 2:PositiveSelection; 3:discrete; * 4:freqs; 5:gamma;6:2gamma;7:beta;8:beta&w;9:betaγ10:3normal icode = 0 * 0:standard genetic code; 1:mammalian mt; 2-10:see below Mgene = 0 * 0:rates, 1:separate; 2:pi, 3:kappa, 4:all fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated kappa = 2 * initial or fixed kappa fix_omega = 1 * 1: omega or omega_1 fixed, 0: estimate <- this line was changed, dN/dS is fixed to 1. omega = 1 * initial or fixed omega, for codons or codon-based AAs** getSE = 0 * 0: don't want them, 1: want S.E.s of estimates RateAncestor = 0 * (0,1,2): rates (alpha>0) or ancestral states (1 or 2) Small_Diff = .45e-6 * Default value. cleandata = 1 * remove sites with ambiguity data (1:yes, 0:no)? fix_blength = 0 * 0: ignore, -1: random, 1: initial, 2: fixed
Launch CodeML:
In MacOSX/Linux, this will look like:
codeml ./TF105351.Eut.3.53876.ctl codeml ./TF105351.Eut.3.53876.fixed.ctl
In Windows:
codeml.exe TF105351.Eut.3.53876.ctl codeml.exe TF105351.Eut.3.53876.fixed.ctl
Two mlc output files are produced (as it can take time, you can download them directly in the next step).:
We retieve the likelihood values lnL1 and lnL0 from TF105351.Eut.3.53876.mlc and TF105351.Eut.3.53876.fixed.mlc files, respectively.
We retieve the number of parameters np1 and np0 from TF105351.Eut.3.53876.mlc and TF105351.Eut.3.53876.fixed.mlc files, respectively.
We can construct the LRT (you can use your favourite spreadsheet for that. Or even better with R):
ΔLRT = 2×(lnL1 - lnL0) = 2×(-4707.210163 - (-4710.222252)) = 6.024178
The degree of freedom is 1 (np1 - np0 = 46 - 45).
(With OpenOffice: 1-CHISQDIST(6.024178;1))
p-value = 0.014104 (under χ²) ⇒ significant!
A significant result with the branch-site codon model means that positive selection affected a subset of sites during a specific evolutionary time (also called episodic model of protein evolution).
In the TF105351.Eut.3.53876.mlc, we can retrieve sites under positive selection using the Bayes Empirical Bayes (BEB) method:
Positive sites for foreground lineages Prob(w>1): 36 K 0.971* 159 C 0.993**
Amino acids K and C refer to the first sequence in the alignment.
You can visualise the multiple in Jalview.
- Open Jalview
- Load TF105351.Eut.3.phy
- Then: Calculate→translate cDNA. (tips: by moving the pointer on the amino acids alignment, you can see the corresponding codon in the nucleotide alignment.
Other models can be tested by changing these parameters model and NSsites.
Site model M1 (neutral):
Site model M2a (positive selection):
Then we can compare M1 and M2a by the likelihood ratio test.
Branch model M0:
Branch model M2 with different dN/dS (positive selection on selected branches):
Then we can compare M0 and M2 by the likelihood ratio test.