| projects:gemma [2009/11/14 02:22] – cathteam | projects:gemma [2012/08/08 13:57] (current) – cathteam | ||
|---|---|---|---|
| + | ====== GeMMA ====== | ||
| + | ** As of 08/2012, this page will not be updated until further notice. See further notes below. ** | ||
| + | |||
| + | ===== Background ===== | ||
| + | |||
| + | [[wp> | ||
| + | |||
| + | The domain superfamilies in [[http:// | ||
| + | |||
| + | GeMMA [2] was designed to divide homologous protein and domain sequence families, such as those found in Gene3D and [[http:// | ||
| + | |||
| + | ===== The clustering algorithm ===== | ||
| + | |||
| + | GeMMA performs an all-by-all comparison of a set of sequence clusters and merges the most similar clusters, in an iterative manner. With each iteration, the clusters increase in size, whilst their total number decreases. We currently use [[http:// | ||
| + | |||
| + | {{: | ||
| + | |||
| + | A sequence profile is generated from the MSA of each sequence cluster and compared to all other cluster profiles. Profiles store the observed residue and gap frequencies for each position in an MSA and have been shown to detect remote homology much better than, for example, pairwise sequence comparisons. | ||
| + | |||
| + | {{: | ||
| + | |||
| + | ===== Use of domain sequence clusters ===== | ||
| + | |||
| + | Each GeMMA cluster is much more conserved than its superfamily as a whole; member domains will be highly similar in (partial protein) function and structure. Each cluster can be represented, | ||
| + | |||
| + | {{: | ||
| + | |||
| + | ===== Threshold optimization ===== | ||
| + | |||
| + | ** Note that there has been quite some development since the original GeMMA publication, | ||
| + | |||
| + | The clustering stops when no remaining cluster pair matches better than a given threshold E (a stochastic ' | ||
| + | |||
| + | {{: | ||
| + | |||
| + | ===== HPC implementation ===== | ||
| + | |||
| + | For processing large domain families, GeMMA was modified to run on the [[http:// | ||
| + | |||
| + | {{: | ||
| + | |||
| + | ** The following was written in 2010, shortly after we were awarded an Amazon EC2 research grant. ** | ||
| + | |||
| + | We are currently considering using resources outside UCL as well, in particular so-called [[wp> | ||
| + | |||
| + | If such projects are successful, bioinformatics groups or even whole universities could in future focus their resources on infrastructure-on-demand services rather than buying expensive new hardware every two years (enormous sums are currently spent on cooling and powering local HPC facilities). | ||
| + | |||
| + | Different ways of using compute clouds (for different algorithms) should be considered. Options include flow control from scripting languages via API libraries, middleware such as the SGE cloud adapter, and the use of (soon-to-be? | ||
| + | |||
| + | So-called 'cloud bursting', | ||
| + | |||
| + | ===== References ===== | ||
| + | |||
| + | - {{pubmed> | ||
| + | - {{pubmed> | ||
| + | - {{pubmed> | ||
| + | - {{pubmed> | ||
| + | - {{pubmed> | ||
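As an illustration of the iterative merging described under "The clustering algorithm" and the stopping rule described under "Threshold optimization", the following is a minimal sketch and not the GeMMA implementation. The names `merge_until_threshold` and `toy_score` are hypothetical. The sketch assumes that the comparison score behaves like an E-value (lower means more similar), and it replaces the profile-profile comparison of cluster MSAs with a simple mean Hamming distance between member sequences of equal length, purely so that the example is self-contained.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List

Cluster = FrozenSet[str]


def merge_until_threshold(
    sequences: Iterable[str],
    score: Callable[[Cluster, Cluster], float],
    threshold: float,
) -> List[Cluster]:
    """Greedy iterative merging (hypothetical helper, not GeMMA code).

    Assumption: `score` is lower-is-better, like an E-value. Merging stops
    when no remaining cluster pair scores better (lower) than `threshold`.
    """
    # Start with one cluster per sequence.
    clusters: List[Cluster] = [frozenset([s]) for s in sequences]

    while len(clusters) > 1:
        # All-by-all comparison of the current clusters.
        scored = [
            (score(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        ]
        best, i, j = min(scored)

        # Stop when even the best pair does not beat the threshold.
        if not best < threshold:
            break

        # Merge the most similar pair: one fewer cluster, one larger cluster.
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    return clusters


def toy_score(a: Cluster, b: Cluster) -> float:
    """Stand-in for a profile-profile comparison (assumption for this sketch).

    Returns the mean Hamming distance over all cross-cluster sequence pairs;
    sequences are assumed to have equal length.
    """
    pairs = [(x, y) for x in a for y in b]
    total = sum(sum(p != q for p, q in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    seqs = ["AAAA", "AAAT", "CCCC", "CCCG"]
    for cluster in merge_until_threshold(seqs, toy_score, threshold=2.0):
        print(sorted(cluster))
```

In this sketch every iteration re-scores all cluster pairs, which is quadratic in the number of clusters. A stricter (lower) threshold leaves more, smaller clusters, while a more permissive one merges further.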