This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
| tutorials:eccb_t2_codeml [2012/09/06 16:59] – created romainstuder | tutorials:eccb_t2_codeml [2012/09/09 09:47] (current) – romainstuder | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | CODEML Lysozyme | + | ==== CodeML/PAML ==== |
| + | |||
| + | |||
| + | === Install of PAML === | ||
| + | |||
| + | Download it from here: http:// | ||
| + | |||
| + | Windows (The bin/ folder already contains windows executables): | ||
| + | < | ||
| + | cd paml4.6/ | ||
| + | </ | ||
| + | |||
| + | |||
| + | Linux: | ||
| + | < | ||
| + | cd paml4.5/ | ||
| + | rm bin/*.exe | ||
| + | cd src | ||
| + | make -f Makefile | ||
| + | ls -l | ||
| + | rm *.o | ||
| + | mv baseml basemlg codeml pamp evolver yn00 chi2 ../bin | ||
| + | chmod +x ../bin/* | ||
| + | </ | ||
| + | |||
| + | MacOSX: | ||
| + | < | ||
| + | cd paml4.6/ | ||
| + | rm bin/*.exe | ||
| + | cd src | ||
| + | cc -O2 -o baseml baseml.c tools.c -lm | ||
| + | cc -O2 -o basemlg basemlg.c tools.c -lm | ||
| + | cc -O2 -o codeml codeml.c tools.c -lm | ||
| + | cc -O2 -o pamp pamp.c tools.c -lm | ||
| + | cc -O2 -o mcmctree mcmctree.c tools.c -lm | ||
| + | cc -O2 -o evolver evolver.c tools.c -lm | ||
| + | cc -O2 -o yn00 yn00.c tools.c -lm | ||
| + | cc -O2 -o chi2 chi2.c -lm | ||
| + | ls -l | ||
| + | rm *.o | ||
| + | mv baseml basemlg codeml pamp evolver yn00 chi2 ../bin | ||
| + | chmod +x ../bin/* | ||
| + | </ | ||
| + | |||
| + | |||
| + | === Theoretical principles of the Branch-site model === | ||
| + | |||
| + | The selective pressure in protein coding genes can be detected within the framework of comparative genomics. The selective pressure is assumed to be defined by the ratio (ω) dN/dS. dS represents the synonymous rate (changing | ||
| + | |||
| + | == CodeML and substitutions models: == | ||
| + | |||
| + | CodeML is a program from the package PAML, based on Maximum Likelihood, and developed in the lab of [[http:// | ||
| + | |||
| + | It estimates various parameters (Ts/Tv, dN/dS, branch length) on the codon (nucleotide) alignment, based on a predefined topology (phylogenetic tree). | ||
| + | |||
| + | Different categories of codon models exist in CodeML: | ||
| + | * The model 0 estimates a unique dN/dS ratio for the whole alignment. Not really interesting, | ||
| + | * The site models estimate different dN/dS among sites (ie in the antigen-binding groove of the MHC). | ||
| + | * The branch-site models estimate different dN/dS among sites and among branches. It can detect episodic evolution in protein sequences. IMHO, the most interesting model. | ||
| + | |||
| + | First, we have to define the branch where we think that position could have occurred. We will call this branch the " | ||
| + | |||
| + | To compute the likelihood value, two models are computed: a null model, in which the foreground branch may have different proportions of sites under neutral selection to the background (i.e. relaxed purifying selection), and an alternative model, in which the foreground branch may have a proportion of sites under positive selection. | ||
| + | |||
| + | As the alternative model is the general case, it is easier to present it first. | ||
| + | |||
| + | Four categories of sites are assumed in the branch-site model: | ||
| + | |||
| + | Sites with identical dN/dS in both foreground and background branches: | ||
| + | |||
| + | * K0 : Proportion of sites that are under purifying selection (ω0 < 1) on both foreground and background branches. | ||
| + | * K1 : Proportion of sites that are under neutral evolution (ω1 = 1) on both foreground and background branches. | ||
| + | |||
| + | Sites with different dN/dS between | ||
| + | |||
| + | * K2a: Proportion of sites that are under positive selection (ω2 ≥ 1) on the foreground branch and under purifying selection (ω0 < 1) on background branches. | ||
| + | * K2b: Proportion of sites that are under positive selection (ω2 ≥ 1) on the foreground branch and under neutral evolution (ω1 = 1) on background branches. | ||
| + | |||
| + | For each category, we get the proportion of sites and the associated dN/dS values. | ||
| + | |||
| + | In the null model, the dN/dS (ω2) is fixed to 1: | ||
| + | |||
| + | Sites with identical dN/dS in both foreground and background branches: | ||
| + | |||
| + | * K0 : Sites that are under purifying selection (ω0 < 1) on both foreground and background branches. | ||
| + | * K1 : Sites that are under neutral evolution (ω1 = 1) on both foreground and background branches. | ||
| + | |||
| + | Sites with different dN/dS between | ||
| + | |||
| + | * K2a: Sites that are under neutral evolution (ω2 = 1) on the foreground branch and under purifying selection (ω0 < 1) on background branches. | ||
| + | * K2b: Sites that are under neutral evolution (ω2 = 1) on the foreground branch and under neutral evolution (ω1 = 1) on background branches. | ||
| + | |||
| + | For each model, we get the log likelihood value (lnL1 for the alternative and lnL0 for the null models), from which we compute the Likelihood Ratio Test (LRT). | ||
| + | |||
| + | The 2×(lnL1-lnL0) follows a χ² curve with degree of freedom of 1, so we can get a p-value for this LRT. | ||
| + | |||
| + | Let's go in details. | ||
| + | |||
| + | === Identification of positive seleciton in the serine/ | ||
| + | |||
| + | In the evolution vertebrates, | ||
| + | And if yes, which residues were under positive selection. | ||
| + | |||
| + | We need four files to run CodeML (unzip them all): | ||
| + | |||
| + | - The multiple nucleotide (CDS) alignment, in PHYLIP format. CodeML will strictly remove any position that contains at least one gap or an unknown " | ||
| + | - The phylogenetic tree in newick format, with the branch of interest specified by "# | ||
| + | - A command file where all parameters to run CodeML under the alternative model are specified: {{: | ||
| + | - A command file where all parameters to run CodeML under the null model are specified: {{: | ||
| + | |||
| + | The tree looks like: | ||
| + | {{: | ||
| + | |||
| + | == Execute CodeML == | ||
| + | |||
| + | Run command file (alternative model): | ||
| + | |||
| + | We estimate the Ts/Tv ratio (fix_kappa = 0) and the dN/dS (fix_omega = 0). The branch-site model is specified by setting these two parameters: | ||
| + | * model = 2 (different dN/dS for branches) | ||
| + | * NSsites value to 2 (which allows 3 categories for sites: purifying, neutral and positive selection). | ||
| + | |||
| + | < | ||
| + | | ||
| + | treefile = TF105351.Eut.3.53876.tree | ||
| + | | ||
| + | |||
| + | noisy = 9 * 0,1,2,3,9: how much rubbish on the screen | ||
| + | | ||
| + | | ||
| + | * 3: StepwiseAddition; | ||
| + | |||
| + | | ||
| + | | ||
| + | clock = 0 * 0: no clock, unrooted tree, 1: clock, rooted tree | ||
| + | aaDist = 0 * 0:equal, +: | ||
| + | model = 2 * models for codons: | ||
| + | * 0:one, 1:b, 2:2 or more dN/dS ratios for branches | ||
| + | | ||
| + | * 4:freqs; 5: | ||
| + | icode = 0 * 0:standard genetic code; 1:mammalian mt; 2-10:see below | ||
| + | Mgene = 0 * 0:rates, 1:separate; 2:pi, 3:kappa, 4:all | ||
| + | | ||
| + | kappa = 2 * initial or fixed kappa | ||
| + | | ||
| + | omega = 1 * initial or fixed omega, for codons or codon-based AAs | ||
| + | getSE = 0 * 0: don't want them, 1: want S.E.s of estimates | ||
| + | RateAncestor = 0 * (0,1,2): rates (alpha> | ||
| + | Small_Diff = .45e-6 | ||
| + | | ||
| + | | ||
| + | </ | ||
| + | |||
| + | Run command file (null model): | ||
| + | |||
| + | The command file for the null model is the same as for the alternative model, except for two parameters (in red): | ||
| + | |||
| + | - The name of the output file (outfile) is different. | ||
| + | - The dN/dS ratio is fixed to 1 (fix_omega = 1). | ||
| + | |||
| + | < | ||
| + | | ||
| + | treefile = TF105351.Eut.3.53876.tree | ||
| + | | ||
| + | |||
| + | noisy = 9 * 0,1,2,3,9: how much rubbish on the screen | ||
| + | | ||
| + | | ||
| + | * 3: StepwiseAddition; | ||
| + | |||
| + | | ||
| + | | ||
| + | clock = 0 * 0: no clock, unrooted tree, 1: clock, rooted tree | ||
| + | aaDist = 0 * 0:equal, +: | ||
| + | model = 2 * models for codons: | ||
| + | * 0:one, 1:b, 2:2 or more dN/dS ratios for branches | ||
| + | | ||
| + | * 4:freqs; 5: | ||
| + | icode = 0 * 0:standard genetic code; 1:mammalian mt; 2-10:see below | ||
| + | Mgene = 0 * 0:rates, 1:separate; 2:pi, 3:kappa, 4:all | ||
| + | | ||
| + | kappa = 2 * initial or fixed kappa | ||
| + | | ||
| + | omega = 1 * initial or fixed omega, for codons or codon-based AAs** | ||
| + | getSE = 0 * 0: don't want them, 1: want S.E.s of estimates | ||
| + | RateAncestor = 0 * (0,1,2): rates (alpha> | ||
| + | Small_Diff = .45e-6 | ||
| + | | ||
| + | | ||
| + | </ | ||
| + | |||
| + | Launch CodeML: | ||
| + | |||
| + | In MacOSX/ | ||
| + | |||
| + | < | ||
| + | codeml ./ | ||
| + | codeml ./ | ||
| + | </ | ||
| + | |||
| + | In Windows: | ||
| + | < | ||
| + | codeml.exe TF105351.Eut.3.53876.ctl | ||
| + | codeml.exe TF105351.Eut.3.53876.fixed.ctl | ||
| + | </ | ||
| + | |||
| + | Two mlc output files are produced (as it can take time, you can download them directly in the next step).: | ||
| + | * {{: | ||
| + | * {{: | ||
| + | |||
| + | |||
| + | === Analyse results === | ||
| + | |||
| + | == Step 1) Assign significance of the detection of positive selection on the selected branch: == | ||
| + | |||
| + | |||
| + | We retieve the likelihood values lnL1 and lnL0 from TF105351.Eut.3.53876.mlc and TF105351.Eut.3.53876.fixed.mlc files, respectively. | ||
| + | |||
| + | We retieve the number of parameters np1 and np0 from TF105351.Eut.3.53876.mlc and TF105351.Eut.3.53876.fixed.mlc files, respectively. | ||
| + | |||
| + | * lnL(ntime: 41 np: 46): **-4707.210163** | ||
| + | * lnL(ntime: 41 np: 45): **-4710.222252** | ||
| + | |||
| + | We can construct the LRT (you can use your favourite spreadsheet for that. Or even better with R): | ||
| + | |||
| + | ΔLRT = 2×(lnL1 - lnL0) = 2×(-4707.210163 - (-4710.222252)) = 6.024178 | ||
| + | |||
| + | The degree of freedom is 1 (np1 - np0 = 46 - 45). | ||
| + | |||
| + | (With OpenOffice: 1-CHISQDIST(6.024178; | ||
| + | |||
| + | p-value = 0.014104 (under χ²) => __significant!__ | ||
| + | |||
| + | A significant result with the branch-site codon model means that positive selection affected a subset of sites during a specific evolutionary time (also called episodic model of protein evolution). | ||
| + | |||
| + | == Step 2) If significant, | ||
| + | |||
| + | In the TF105351.Eut.3.53876.mlc, | ||
| + | < | ||
| + | Positive sites for foreground lineages Prob(w> | ||
| + | 36 K 0.971* | ||
| + | 159 C 0.993** | ||
| + | </ | ||
| + | Amino acids K and C refer to the first sequence in the alignment. | ||
| + | |||
| + | * Position 36 has a high probability (97.1%) of being under positive selection. It shifted from a lysine to a glycine. | ||
| + | * Position 159 has a very high probability (99.3%) of being under positive selection. | ||
| + | |||
| + | You can visualise the multiple in Jalview. | ||
| + | |||
| + | - Open Jalview | ||
| + | |||
| + | - Load TF105351.Eut.3.phy | ||
| + | |||
| + | - Then: Calculate-> | ||
| + | |||
| + | {{: | ||
| + | |||
| + | |||
| + | |||
| + | |||
| + | |||
| + | === Using other models === | ||
| + | |||
| + | Other models can be tested by changing these parameters model and NSsites. | ||
| + | |||
| + | == Example 1: == | ||
| + | |||
| + | Site model M1 (neutral): | ||
| + | * model = 0 (dN/dS doesn' | ||
| + | * NSsites = 1 (which allows 2 categories for sites: purifying and neutral). | ||
| + | |||
| + | Site model M2a (positive selection): | ||
| + | * model = 0 (dN/dS doesn' | ||
| + | * NSsites = 2 (which allows 3 categories for sites: purifying, neutral and positive selection). | ||
| + | |||
| + | Then we can compare M1 and M2a by the likelihood ratio test. | ||
| + | |||
| + | |||
| + | == Example 2: == | ||
| + | |||
| + | |||
| + | Branch model M0: | ||
| + | * model = 0 (dN/dS doesn' | ||
| + | * NSsites = 0 (dN/dS doesn' | ||
| + | |||
| + | Branch model M2 with different dN/dS (positive selection on selected branches): | ||
| + | * model = 2 (different dN/dS for branches) | ||
| + | * NSsites = 2 (dN/dS doesn' | ||
| + | |||
| + | Then we can compare M0 and M2 by the likelihood ratio test. | ||