====== GeMMA ======

** As of 08/2012, this page will not be updated until further notice. See further notes below. **

===== Background =====

[[wp>Protein domains]] are the fundamental units of protein sequence evolution. The majority of known proteins contain multiple domains, the particular combination of which can explain protein function.

The domain superfamilies in [[http://gene3d.biochem.ucl.ac.uk/Gene3D/|Gene3D]] [1] collect [[wp>homologous]] domain sequences from all available genomes, extending the [[:glossary:homologous superfamily|'H' level]] of the [[http://www.cathdb.info|CATH]] domain classification. Some superfamilies are very large and sequence-diverse and the proteins containing these domains can vary greatly in function.

GeMMA [2] was designed to divide homologous protein and domain sequence families, such as those found in Gene3D and [[http://pfam.sanger.ac.uk/|Pfam]] [3], into functionally conserved clusters. These will help us to study the evolution of sequence, structure, and function at the protein domain level in more detail.

===== The clustering algorithm =====

GeMMA performs an all-by-all comparison of a set of sequence clusters and iteratively merges the most similar ones. With each iteration, the clusters grow in size while their total number decreases. We currently use [[http://prodata.swmed.edu/compass/compass.php|COMPASS]] [4] for profile generation and comparison and [[http://align.bmr.kyushu-u.ac.jp/mafft/software/|MAFFT]] [5] for [[wp>multiple sequence alignment]] (MSA).

{{:projects:gemma:gemma_hag_algorithm.png?400|The agglomerative hierarchical clustering process}}
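The iterative merging described above can be illustrated with a minimal Python sketch. It is not the GeMMA implementation: ''cluster_similarity'' is a toy stand-in (mean pairwise identity of equal-length sequences) for the COMPASS profile-profile score, and the ''min_similarity'' stopping parameter is an assumption of the sketch in which higher values mean a better match.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_similarity(c1, c2):
    """Toy stand-in for a profile-profile score: mean pairwise identity."""
    scores = [identity(a, b) for a in c1 for b in c2]
    return sum(scores) / len(scores)


def agglomerate(sequences, min_similarity):
    """Merge the most similar cluster pair until no pair is similar enough."""
    clusters = [[s] for s in sequences]  # start with one cluster per sequence
    while len(clusters) > 1:
        # All-by-all comparison of the current clusters.
        scored = (
            ((i, j), cluster_similarity(clusters[i], clusters[j]))
            for i, j in combinations(range(len(clusters)), 2)
        )
        (i, j), best = max(scored, key=lambda item: item[1])
        if best < min_similarity:
            break  # hypothetical stopping rule for this sketch
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    for cluster in agglomerate(["AAAA", "AAAT", "CCCC", "CCCG"], 0.5):
        print(cluster)
```

The sketch merges a single pair per iteration and recomputes every score each time, which keeps the logic visible at the cost of efficiency.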

A sequence profile is generated from the MSA of each sequence cluster and compared to all other cluster profiles. Profiles store the observed residue and gap frequencies for each position in an MSA and have been shown to detect remote homology much better than, for example, pairwise sequence comparisons.

{{:projects:gemma:prof_prof_aln.png?400|The comparison of sequence profiles}}
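To make the notion of a profile concrete, the following Python sketch computes the observed residue and gap frequencies for each column of a small alignment. It is an illustration only: the function name ''column_frequencies'' is hypothetical, and the sequence weighting and pseudocounts that profile tools typically apply are omitted.

```python
from collections import Counter


def column_frequencies(msa):
    """Return one dict per alignment column mapping symbol -> relative frequency.

    `msa` is a list of aligned sequences of equal length; '-' denotes a gap.
    """
    if not msa:
        return []
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*msa):  # iterate over alignment columns
        counts = Counter(column)
        profile.append({symbol: n / len(msa) for symbol, n in counts.items()})
    return profile


if __name__ == "__main__":
    for position, freqs in enumerate(column_frequencies(["AC-", "ACG", "TCG", "A-G"])):
        print(position, freqs)
```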

===== Use of domain sequence clusters =====

Each GeMMA cluster is much more conserved than its superfamily as a whole; member domains will be highly similar in (partial protein) function and structure. Each cluster can be represented, stored, and queried as a profile [[wp>hidden Markov model]] (HMM) or as an MSA profile, similar to a [[wp>Position-Specific Scoring Matrix]] (PSSM). This can support annotation transfer to unknown sequences, depict family evolution, provide functional categories for comparative genomics studies, and help select targets for [[wp>Structural genomics]] initiatives more economically.

{{:projects:gemma:gemma_annotation.png?400|GeMMA clusters are structurally and functionally conserved}}

===== Threshold optimization =====

** Note that there has been considerable development since the original GeMMA publication: the clustering algorithm was retained, but the family identification step was entirely revamped. We are moving away from fixed profile similarity thresholds and towards using available high-quality protein function annotation data, i.e., towards a supervised protocol. **

The clustering stops when no remaining cluster pair matches better than a given threshold E (an expectation value, or E-value, as used in [[wp>BLAST]]). Expert-curated domain families from the [[http://sfld.rbvi.ucsf.edu/|Structure-Function Linkage Database]] (SFLD) currently serve as a gold standard to derive E. Good performance in this case means high functional conservation within clusters without over-division of the family.

{{:projects:gemma:gemma_threshold_optimization.png?400|The way generic profile-profile cluster similarity thresholds are derived}}
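The trade-off between functional conservation and over-division can be illustrated with a small Python sketch that scores a clustering against gold-standard family labels. The two measures used here (cluster purity and the mean number of clusters per gold-standard family) are assumptions chosen for illustration and are not necessarily the measures used to derive E for GeMMA; all names are hypothetical.

```python
from collections import Counter, defaultdict


def evaluate(clusters, gold):
    """Score a clustering against gold-standard family labels.

    clusters: list of lists of sequence identifiers
    gold:     dict mapping each identifier to its curated family label
    Returns (purity, mean number of clusters per gold-standard family).
    """
    total = sum(len(cluster) for cluster in clusters)
    majority = 0
    clusters_by_family = defaultdict(set)
    for index, cluster in enumerate(clusters):
        labels = Counter(gold[member] for member in cluster)
        majority += labels.most_common(1)[0][1]  # size of the dominant family
        for label in labels:
            clusters_by_family[label].add(index)
    purity = majority / total
    clusters_per_family = sum(len(v) for v in clusters_by_family.values()) / len(
        clusters_by_family
    )
    return purity, clusters_per_family


if __name__ == "__main__":
    gold = {"a": "F1", "b": "F1", "c": "F2", "d": "F2"}
    print(evaluate([["a", "b"], ["c", "d"]], gold))
    print(evaluate([["a"], ["b"], ["c", "d"]], gold))
    print(evaluate([["a", "b", "c", "d"]], gold))
```

Under these assumptions, a threshold that is too strict leaves families split across many clusters (a high clusters-per-family value), whereas one that is too permissive mixes families within a cluster (low purity).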

===== HPC implementation =====

To process large domain families, GeMMA was modified to run on the [[http://www.ucl.ac.uk/research-computing/information/services/cluster|UCL Legion]] HPC cluster. It can easily be set up to run in any environment based on [[wp>SGE|Sun Grid Engine]] or the [[wp>PBS|Portable Batch System]] and derivatives such as [[wp>TORQUE Resource Manager|Torque]]. However, the iterative nature of the algorithm poses a challenge to both HPC implementation and resources. For most superfamilies in Gene3D (i.e., the ~2,000 small and medium-sized ones), the current implementation produces clusters within 2-4 weeks.

{{:projects:gemma:gemma_hpc_workflow.png?400|The current HPC workflow of GeMMA}}
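One reason the all-by-all comparison lends itself to a batch system is that the pairwise comparisons within a single iteration are independent of each other. The Python sketch below splits the cluster pairs of one iteration into fixed-size batches, each of which could be handed to a separate batch job. The function name and batch size are hypothetical, and the sketch does not describe how the GeMMA workflow actually partitions its work.

```python
from itertools import combinations, islice


def comparison_batches(cluster_ids, batch_size):
    """Yield lists of cluster-ID pairs, each list sized for one batch job."""
    pairs = combinations(cluster_ids, 2)  # lazy all-by-all pair generator
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for number, batch in enumerate(comparison_batches(["c1", "c2", "c3", "c4"], 4), 1):
        print(number, batch)
```

Because merging changes the set of clusters, the next iteration can only start once all batches of the current one have finished, which is where the iterative nature of the algorithm limits parallelism.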

** The following was written in 2010, shortly after we were awarded an Amazon EC2 research grant. **

We are currently considering using resources outside UCL as well, in particular so-called [[wp>cloud computing]]. An early-adopter bioinformatics project using this technology (there are only a few at this point) could benefit both us and the bioinformatics (BI) community as a whole. It could also help companies providing on-demand HPC and storage resources, such as Amazon's [[wp>Amazon EC2|EC2]] and [[wp>Amazon S3|S3]] services, to develop solutions tailored more specifically towards BI applications.

If such projects are successful, bioinformatics groups or even whole universities could in future focus their resources on infrastructure-on-demand services rather than buying expensive new hardware every two years (enormous sums are currently spent on cooling and powering local HPC facilities).

Different ways of using compute clouds (for different algorithms) should be considered. Options include flow control from scripting languages via API libraries, middleware such as the SGE cloud adapter, and the use of (soon-to-be?) available EC2 [[wp>Amazon Machine Image|AMIs]] for bioinformatics, pre-configured with cluster operating systems such as [[wp>Rocks Cluster Distribution|Rocks Clusters]].

So-called 'cloud bursting', where external resources are recruited alongside local nodes so that they (seemingly) form a single cluster, could become a viable trade-off between availability and bandwidth (local resources) and scalability (grids and clouds), especially for bioinformatics applications, and should therefore be tested as well.

===== References =====

  - {{pubmed>long:19906693}}
  - {{pubmed>long:19923231}}
  - {{pubmed>long:18039703}}
  - {{pubmed>long:12547212}}
  - {{pubmed>long:18372315}}