CATH Data Downloads

This page provides information on the data files that are available to download from the CATH FTP site:

ftp://orengoftp.biochem.ucl.ac.uk/cath

CATH (daily)

We provide a daily snapshot of our very latest classifications and annotations as they happen in our pipeline. This enables users to find the most up-to-date information about their particular structure of interest. The amount of data we provide at this stage is limited mainly to domain boundaries and superfamily classification.

ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/daily-release/newest/

File name Description
cath-b-newest-all.gz Lists the latest domain boundaries and superfamily (C.A.T.H) annotations for all CATH domains
cath-b-newest-names.gz Provides the names for each node in the CATH hierarchy
cath-b-newest-latest-release.gz Lists the latest domain boundaries and superfamily annotations for CATH domains in the most recent release of CATH-Plus
cath-b-newest-putative.gz Lists the latest domain boundaries and superfamily annotations for CATH domains released since the most recent release of CATH-Plus
cath-b-s35-newest.gz Lists the latest domain boundaries and sequence family (C.A.T.H.S) annotations for all non-redundant sequence representatives
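
As a rough illustration of how the daily files listed above might be fetched and read, the following Python sketch builds the FTP URL for a file and iterates over the lines of a downloaded copy. The helper names (daily_file_url, download, iter_gz_lines) are hypothetical, and the sketch assumes the .gz files contain plain text; the column layout inside each file is not described on this page, so the lines are returned unparsed.

```python
import gzip
import urllib.request

# Directory given above for the daily snapshot.
DAILY_BASE = "ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/daily-release/newest/"


def daily_file_url(file_name):
    """Return the full FTP URL for a file in the daily-release directory."""
    return DAILY_BASE + file_name


def download(file_name, dest_path):
    """Fetch one daily file to dest_path (requires network access)."""
    urllib.request.urlretrieve(daily_file_url(file_name), dest_path)
    return dest_path


def iter_gz_lines(path):
    """Yield the non-empty lines of a gzip-compressed text file.

    Assumption: the file is text once decompressed.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line
```

For example, download("cath-b-newest-all.gz", "cath-b-newest-all.gz") followed by a loop over iter_gz_lines("cath-b-newest-all.gz") would stream the latest annotations one line at a time.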

CATH-Plus

CATH-Plus adds a significant amount of data on top of the core classification information available in CATH. In addition to a number of manual annotation checks (e.g. looking for evidence that would support merging superfamilies, checking for errors, etc.), the CATH-Plus release process adds a large amount of information combining protein structure, sequence and function. As a result, CATH-Plus offers a greater depth of information, though it may not cover the most recent structures.

For information on the statistics from specific releases, see release notes.

Data related to the CATH classification

ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/latest-release/cath-classification-data/

File name Description
cath-chain-list-<version>.txt Lists all of the PDB chain IDs in CATH, whether they are chopped into domains or not.
cath-domain-boundaries-*-<version>.txt Description of domain and segment boundaries for domains classified into CATH.
cath-domain-description-file-<version>.txt Description of each protein domain in CATH
cath-domain-list-<S35|S60|S95|S100|all>-<version>.txt Lists of domains classified into CATH
cath-domain-pdb-*-<version>.txt Description of each domain PDB classified into CATH
cath-names-<version>.txt Name description of each node in the CATH hierarchy, along with an example domain
cath-superfamily-list-<version>.txt List of all the superfamilies in the CATH hierarchy
cath-unclassified-list-<version>.txt List of all unclassified protein chains and domains that are still being processed
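
The file-name patterns in the table above can be expanded programmatically. The sketch below does this for the domain list files only. It assumes the level tokens are S35, S60, S95, S100 and all, and that <version> takes a form such as v4_1_0 (the version string that appears in the non-redundant data set file names on this page). The function names are hypothetical.

```python
# Directory given above for the classification data.
CLASSIFICATION_BASE = (
    "ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/"
    "latest-release/cath-classification-data/"
)

# Assumed level tokens for the cath-domain-list files.
DOMAIN_LIST_LEVELS = ("S35", "S60", "S95", "S100", "all")


def domain_list_filename(level, version):
    """Expand cath-domain-list-<level>-<version>.txt for one level."""
    if level not in DOMAIN_LIST_LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    return f"cath-domain-list-{level}-{version}.txt"


def domain_list_urls(version):
    """Return the FTP URLs of all domain list files for a version."""
    return [
        CLASSIFICATION_BASE + domain_list_filename(level, version)
        for level in DOMAIN_LIST_LEVELS
    ]
```

Calling domain_list_urls("v4_1_0") yields five URLs, one per level. Whether a given version is actually present under latest-release should be checked against the FTP listing.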

Data related to non-redundant data sets

ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/latest-release/non-redundant-data-sets/

File name Description
cath-dataset-nonredundant-S[20|40]-v4_1_0.atom.fa The ATOM sequences of the domains in the dataset (which only contain residues that have ATOM records in the PDB file)
cath-dataset-nonredundant-S[20|40]-v4_1_0.fa The sequences of the domains in the dataset
cath-dataset-nonredundant-S[20|40]-v4_1_0.list A list of the domains in the dataset; one domain ID per line
cath-dataset-nonredundant-S[20|40]-v4_1_0.pdb.tgz A gzipped tar file containing the PDB files of the domains in the dataset
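
Because the .list file holds one domain ID per line, reading it needs very little code. The sketch below builds a data set file name from the pattern in the table above and loads the IDs from a local copy of the .list file. The function names are hypothetical, and the sketch assumes blank lines (if any) can be ignored.

```python
def dataset_filename(cutoff, suffix, version="v4_1_0"):
    """Expand cath-dataset-nonredundant-S[20|40]-<version>.<suffix>."""
    if cutoff not in (20, 40):
        raise ValueError("cutoff must be 20 or 40")
    return f"cath-dataset-nonredundant-S{cutoff}-{version}.{suffix}"


def read_domain_ids(path):
    """Return the domain IDs in a .list file, one ID per line.

    Assumption: surrounding whitespace and blank lines carry no meaning.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

For instance, read_domain_ids(dataset_filename(40, "list")) returns the S40 domain IDs as a Python list, provided the file has already been downloaded to the working directory.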

Data related to sequence data

ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/latest-release/sequence-data/

File name Description
cath-domain-seqs-*-<version>.fa Sequences for each CATH domain
cath-S35-<version>-hmm3.lib.gz HMMs for each CATH representative domain from the sequence clusters at 35% sequence identity
funfam-hmm3-<version>.lib.gz HMMs for each functional family (FunFam)
cath-superfamily-seqs-<superfamily>-<version>.fa Sequences for each CATH superfamily in FASTA format
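
The .fa files listed above are in FASTA format, so a small parser is enough to load them. The following is a minimal sketch using only the Python standard library. The function name parse_fasta is hypothetical, and because the layout of the CATH header lines is not described on this page, each header is returned as an unparsed string.

```python
def parse_fasta(lines):
    """Parse FASTA text into a list of (header, sequence) tuples.

    `lines` is any iterable of strings, such as an open file handle.
    The header is returned without the leading '>' and is not parsed
    further.
    """
    records = []
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped across several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

A downloaded file can be read with `with open(path) as handle: records = parse_fasta(handle)`. For the larger files, a generator-based variant would avoid holding every sequence in memory at once.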