Differences

This shows you the differences between two versions of the page.

--- index [2016/08/02 12:24]
sillitoe
+++ index [2024/11/26 14:54] (current)
sillitoe [Expansion in CATH structural data from AlphaFold Database]
@@ Line 3: / Line 3: @@
 The CATH database is a free, publicly available online resource that provides
 information on the evolutionary relationships of protein domains. It was
-created in the mid-1990s by Professor Christine Orengo and colleagues, and
+created in the mid-1990s by [[https://www.ucl.ac.uk/orengo-group/lab-members/christine-orengo|Professor Christine Orengo]] and colleagues, and continues to be developed by the [[cathteam:index|Orengo group]] at University College London.
-continues to be developed by the Orengo group at University College London.
 ===== How is CATH-Gene3D created? =====
@@ Line 12: / Line 11: @@
 applicable. Protein domains are identified within these chains using a mixture
 of automatic methods and manual curation. The domains are then classified within
-the CATH structural hierarchy: at the Class (C) level, domains are assigned
+the CATH structural hierarchy: at the [[glossary:class|Class]] (C) level, domains are assigned
 according to their secondary structure content, i.e. all alpha, all beta, a
-mixture of alpha and beta, or little secondary structure; at the Architecture
+mixture of alpha and beta, or little secondary structure; at the [[glossary:architecture|Architecture]]
 (A) level, information on the secondary structure arrangement in
-three-dimensional space is used for assignment; at the Topology/fold (T) level,
+three-dimensional space is used for assignment; at the [[glossary:topology|Topology/fold]] (T) level,
 information on how the secondary structure elements are connected and arranged
-is used; assignments are made to the Homologous superfamily (H) level if there
+is used; assignments are made to the [[glossary:homologous_superfamily|Homologous superfamily]] (H) level if there is good evidence that the domains are related by evolution, i.e. they are
-is good evidence that the domains are related by evolution, i.e. they are
+homologous. To browse the classification hierarchy, see [[http://cathdb.info/browse/tree|CATH hierarchy]].
-homologous.
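The four levels described above are commonly written as a dotted four-number code, one number per level (C.A.T.H). As a minimal sketch, assuming a code of that shape such as `3.40.50.300`, the helper below splits it into its levels. The names `CathCode` and `parse_cath_code` are illustrative and not part of any CATH software.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    """One number per level of the hierarchy (illustrative type)."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int

    def prefix(self, depth: int) -> str:
        """Dotted code truncated to the first `depth` levels (1 to 4)."""
        if not 1 <= depth <= 4:
            raise ValueError("depth must be between 1 and 4")
        return ".".join(str(n) for n in self[:depth])


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted C.A.T.H code into its four integer levels."""
    parts = code.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


code = parse_cath_code("3.40.50.300")
print(code.prefix(3))  # the Topology/fold level that contains this superfamily
```

Truncating the code with `prefix` is a convenient way to group domains that share a class, architecture or fold.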
 Additional sequence data for domains with no experimentally determined
-structures are provided by our sister resource, Gene3D, which are used to
+structures are provided by our sister resource, [[http://gene3d.biochem.ucl.ac.uk/Gene3D|Gene3D]], and are used to populate the homologous superfamilies. Protein sequences from UniProtKB and
-populate the homologous superfamilies. Protein sequences from UniProtKB and
 Ensembl are scanned against CATH HMMs to predict domain sequence boundaries and
 make homologous superfamily assignments.
+== Recognition as a Global Core BioData Resource ==
+CATH has been recognized as a Global Core BioData Resource (GCBR) by the Global Biodata Coalition. This endorsement reflects the database's significance as a reliable and comprehensive resource for protein structure classification in the life sciences community.
+===== Expansion in CATH structural data from AlphaFold Database =====
+We are pleased to announce the release of CATH v4.4 (October 2024; https://www.cathdb.info/), the latest update to the CATH (Class, Architecture, Topology, Homology) structural classification database. This release includes an up-to-date classification of PDB structures as well as over 90 million domain models from The Encyclopedia of Domains (TED) (https://ted.cathdb.info).
+== Integration of domains from The Encyclopedia of Domains (TED) ==
+CATH v4.4 incorporates approximately 600,000 newly classified domain structures from the Protein Data Bank (PDB) and maps over 90 million predicted domain structures from The Encyclopedia of Domains (TED) resource into CATH superfamilies, a joint effort between the Jones group (UCL Computer Science) and the Orengo group (UCL Structural and Molecular Biology). This integration has resulted in a 180-fold increase in structural information for CATH superfamilies.
+The inclusion of TED data has expanded the number of superfamilies from 5,841 to 6,573, folds from 1,349 to 2,081, and architectures from 41 to 77. It is important to note that the TED data comprises predicted structures, and these new folds and architectures remain hypothetical until experimentally confirmed.
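The relative growth implied by these counts can be worked out directly. The snippet below is plain arithmetic on the figures quoted above; the variable and function names are illustrative.

```python
# Counts quoted in the paragraph above: (before, after) including TED data.
counts = {
    "superfamilies": (5841, 6573),
    "folds": (1349, 2081),
    "architectures": (41, 77),
}


def growth(before: int, after: int) -> tuple[int, float]:
    """Return the absolute increase and the percentage increase."""
    added = after - before
    return added, 100.0 * added / before


for level, (before, after) in counts.items():
    added, pct = growth(before, after)
    print(f"{level}: +{added} ({pct:.1f}% increase)")
```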
+== Advancements in Domain Segmentation and Classification ==
+To manage the substantial volume of data from AlphaFold Protein Structure Database, our automated domain segmentation workflow has been enhanced. We have integrated a faster and more accurate in-house deep-learning approach called Chainsaw, along with the publicly available methods Merizo and UniDoc. For homologue detection and verification, we predicted CATH superfamilies using a deep-learning tool based on embeddings from protein language models, CATHe, and expanded our suite of protein structure comparison tools to include Foldseek. Domains from PDB and TED with a homology assignment were further validated using Hidden Markov Model matching with strict overlaps and manual curation. These advancements enable the classification of domains into CATH superfamilies using evidence from multiple independent approaches, including both structural and sequence-based methods.
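To illustrate the idea of requiring agreement between independent lines of evidence, here is a minimal sketch of a simple voting rule. It is an assumption made for illustration only: the text does not describe how the evidence is actually combined, and the method labels, the `min_support` threshold and the tie-handling below are hypothetical.

```python
from collections import Counter
from typing import Optional


def consensus_superfamily(predictions: dict[str, Optional[str]],
                          min_support: int = 2) -> Optional[str]:
    """Return the superfamily backed by at least `min_support` methods.

    `predictions` maps a (hypothetical) method label to its predicted
    superfamily code, or None when that method made no assignment.
    Returns None when support is insufficient or the top vote is tied.
    """
    votes = Counter(sf for sf in predictions.values() if sf is not None)
    if not votes:
        return None
    ranked = votes.most_common(2)
    top_sf, top_n = ranked[0]
    if top_n < min_support:
        return None
    if len(ranked) > 1 and ranked[1][1] == top_n:
        return None  # conflicting evidence: leave for manual curation
    return top_sf


# Hypothetical predictions for one domain from three kinds of evidence.
example = {
    "structure_search": "3.40.50.300",
    "embedding_classifier": "3.40.50.300",
    "hmm_scan": None,
}
print(consensus_superfamily(example))
```

Returning `None` on a tie, rather than picking a winner, reflects the role of manual curation mentioned above for cases the automated evidence cannot settle.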
+== Expansion of Functional Families (FunFams) ==
+Within superfamilies, CATH further subclassifies domains into coherent sets of sequences where functions are conserved, called Functional Families (FunFams). We updated the sequences in FunFams to UniProt release 2024_02, achieving a 276% increase in FunFam coverage. Additionally, the mapping of TED structural domains has resulted in a fourfold increase in FunFams with structural information, increasing the number of FunFams with at least one high-quality structural representative to 73,215.
+This expansion enhances our ability to analyze conserved residues within protein families and to identify putative functional sites, contributing to a deeper understanding of protein function and evolution.
+== Identification of Novel Folds and Architectures ==
+Analysis of TED data has led to the identification of 479 new folds and 34 new architectures, including structures such as the Alpha-propeller, Beta hairpin barrel, and Alpha-Beta flower. These new categories are currently hypothetical and await experimental confirmation.
+== Future Directions ==
+The extensive data integrated into CATH v4.4 presents opportunities for further exploration of protein structures and evolutionary relationships. Ongoing efforts will focus on refining algorithms and workflows to improve domain boundary assignments, particularly in complex structures such as repeats and proteins with large interfaces.
 ===== CATH Releases =====
-We aim to provide official releases of the CATH classification every 12 months.
+==== CATH (daily snapshot) ====
-This release process is important because it allows us to provide internal
-validation, extra annotations and analysis. However, it can mean that there is a
+We provide a daily snapshot of the very latest classifications and annotations as they happen in our pipeline. This enables users to find the most up-to-date information about their particular structure of interest. The amount of data we provide at this stage is limited mainly to domain boundaries and superfamily classification.
-time delay between new structures appearing in the PDB and the latest official
-CATH release,
+==== CATH-Plus (full release) ====
+We aim to provide full releases of CATH (CATH-Plus) every 12 months. CATH-Plus adds a significant amount of data on top of the core classification information available in CATH. The CATH-Plus release process includes a number of manual annotation checks in addition to adding a huge amount of information combining protein structure, sequence and function. As a result, there is a greater depth of information available in CATH-Plus, though it may be missing information on the most recent structures.
+CATH-Plus data includes:
+=== FunFams (Functional Families) ===
+The homologous superfamilies in CATH-Gene3D can often be functionally and structurally diverse even though they share a conserved structural core. Therefore, the superfamilies have been sub-classified into functional families (FunFams) using a subclassification protocol purely based on sequence patterns. Relatives within these FunFams are likely to share highly similar structures and functions. The FunFams are useful in function prediction and in providing information on the evolution of function.
+=== Structural clusters ===
+The structures within a homologous superfamily have been clustered at < 9 Å RMSD to form structural clusters, also known as structurally-similar groups (SSGs). These structural clusters are useful for understanding the structural diversity of a superfamily.
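One way to picture threshold-based clustering is single-linkage grouping over pairwise RMSD values. The sketch below is only an illustration under that assumption: the text does not state which linkage method or structure comparison tool is used, and the domain names and RMSD values in the example are made up.

```python
def cluster_by_rmsd(domains, pairwise_rmsd, threshold=9.0):
    """Single-linkage clustering: link two domains when RMSD < threshold.

    `pairwise_rmsd` maps frozenset({a, b}) to an RMSD in angstroms.
    Returns a sorted list of clusters, each a sorted list of domain names.
    """
    parent = {d: d for d in domains}

    def find(d):
        # Follow parent links to the cluster root, shortening the path.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for pair, rmsd in pairwise_rmsd.items():
        if rmsd < threshold:
            a, b = pair
            parent[find(a)] = find(b)

    clusters = {}
    for d in domains:
        clusters.setdefault(find(d), []).append(d)
    return sorted(sorted(members) for members in clusters.values())


# Made-up example: d1 and d2 are close, d3 is far from both.
domains = ["d1", "d2", "d3"]
rmsd = {
    frozenset({"d1", "d2"}): 3.2,
    frozenset({"d1", "d3"}): 12.5,
    frozenset({"d2", "d3"}): 11.0,
}
print(cluster_by_rmsd(domains, rmsd))
```

With single linkage, a domain joins a cluster if it is within the threshold of any member, so the choice of linkage affects how many structural clusters a diverse superfamily ends up with.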
+=== Structural superpositions ===
-In order to address this issue: CATH-B provides a limited amount of information
+The conserved structural core of a homologous superfamily can be observed from the structural superpositions generated from its representative domains by [[cath_tools#cath_tools|CATH Tools]]. These superpositions are an effective way of observing structural conservation and diversity across the superfamily.
-to the very latest domain annotations (e.g. domain boundaries and superfamily
-classifications).
-The latest release of CATH-Gene3D (v4.1) was released in July 2016 and
+See [[release_notes|release notes]] for information on the statistics for specific releases.
-consists of:
-  * 308,999    structural protein domain entries
+CATH and CATH-Plus data for all releases can be downloaded from [[data:index|Data Downloads]].
-  * 53,479,436 non-structural protein domain entries
-  * 2,737       homologous superfamily entries
-  * 92,882      functional family entries
 ===== Open Source Software =====
@@ Line 56: / Line 89: @@
 If you have any comments/suggestions/criticisms, please let us know:
-http://www.cathdb.info/support/contact
+https://www.cathdb.info/support/contact
