====== FAQ: Questions about the definition of CATH code ======
The following email exchange may help others understand how CATH codes are numbered at the sequence cluster (SOLID) levels.
Received on Aug 21, 2008 (reproduced with permission)
Hi, CATH team:
After reading the original paper published in 1997 and visiting your website, I am
still confused about the definition of the CATH code at the sequence family levels.
Taking the three proteins below from CATH 3.1 as an example, my questions are as
follows:
1) the code in CATHSOLI level of 2a8vA01 and 1a8vA01 are all the same, but why
their codes in D level were different?
Does 2a8vA01 and 1a8vA01 not belong to the same s100 family?
2) the code in CATHSO ID level of 1a8vA01 and 1a62001 are all the same, but why
their codes in L level were different?
If 1a8vA01 and 1a62001 are 100% sequence identical, why they were assigned to
different 95% sequence group?
Sincerely.
backy
                             35%  60%  95%  100%
           C  A   T    H   S    O    L    I     D
  2a8vA01  1  10  720  10  2    1    2    1     3   47  2.400
  1a8vA01  1  10  720  10  2    1    2    1     1   49  2.000
  1a62001  1  10  720  10  2    1    1    1     1   44  1.550
The reply on Aug 21, 2008
Hi Backy,
Thanks for getting in touch with us, hopefully I can answer your questions below:
1) the code in CATHSOLI level of 2a8vA01 and 1a8vA01 are all the same, but why
their codes in D level were different?
The D level stands for "Domain Count" and is just there to provide a unique code
for every domain - so if two domains are identical (i.e. they share everything up
to the I, or 100% Identical, code) then we use the D level to differentiate
between them - this is just a sequential counter.
Does 2a8vA01 and 1a8vA01 not belong to the same s100 family?
Yes, they do - they share up to the I count so they are 100% identical - as mentioned
above - the domain level is just a counter to differentiate between domains in the
same I cluster.
2) the code in CATHSO ID level of 1a8vA01 and 1a62001 are all the same, but
why their codes in L level were different?
You need to bear in mind that CATH is a tree-like hierarchy with the trunk of the
tree represented on the left of the CATHSOLID classification (e.g. the C code) and
the leaves of the tree on the right (e.g. the D code). In the example you give above
- you have to read the CATH codes from left to right and stop the first time one of
the codes differs. In this case, they differ at the 'L' code so they are in different
S95% clusters. It doesn't matter that the numbers after this (I, D) are the same,
because they refer to different branches of the tree.
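The left-to-right rule described above can be sketched in a few lines of Python. This is only an illustration, not part of any CATH software: the function name `first_difference` and the `codes` dictionary are hypothetical, and the codes are the three dotted CATHSOLID codes from this example.

```python
# Minimal sketch: find the first CATHSOLID level at which two codes differ.
# LEVELS, first_difference and codes are illustrative names, not a CATH API.
LEVELS = "CATHSOLID"


def first_difference(code_a, code_b):
    """Return the level letter where two dotted codes first differ, or None."""
    parts_a = code_a.split(".")
    parts_b = code_b.split(".")
    # Read from the trunk (C) towards the leaves (D) and stop at the first mismatch.
    for level, a, b in zip(LEVELS, parts_a, parts_b):
        if a != b:
            return level
    return None


codes = {
    "2a8vA01": "1.10.720.10.2.1.2.1.3",
    "1a8vA01": "1.10.720.10.2.1.2.1.1",
    "1a62001": "1.10.720.10.2.1.1.1.1",
}

print(first_difference(codes["2a8vA01"], codes["1a8vA01"]))  # D
print(first_difference(codes["1a8vA01"], codes["1a62001"]))  # L
```

Everything to the right of the returned level should be ignored when comparing the two domains, since those numbers are counters within different branches.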
If 1a8vA01 and 1a62001 are 100% sequence identical, why they were assigned
to different 95% sequence group?
The simple answer is that they aren't 100% identical - they have a sequence identity
of 94.7%, which is below the 95% threshold, so they have different L codes. As
mentioned above, the I and D happen to be the same, but that doesn't mean anything
if the L code is different (CATHSOLID needs to be read from left to right).
So for the following three domains:
2a8vA01 1.10.720.10.2.1.2.1.3
1a8vA01 1.10.720.10.2.1.2.1.1
1a62001 1.10.720.10.2.1.1.1.1
The tree/hierarchy would look something like:
  C               1
  A               10
  T              720
  H               10
  S               2
  O               1
  L        1              2
  I        1              1
  D        1         1         3
        1a62001   1a8vA01   2a8vA01
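For readers who want to reproduce this hierarchy programmatically, here is a minimal Python sketch that builds the same tree as nested dictionaries. The helper names `build_tree` and `render` are hypothetical and chosen for illustration; the input is just the three dotted codes listed above.

```python
# Minimal sketch: build the CATHSOLID hierarchy as nested dicts.
# build_tree and render are illustrative helpers, not part of any CATH tool.
LEVELS = "CATHSOLID"

codes = {
    "2a8vA01": "1.10.720.10.2.1.2.1.3",
    "1a8vA01": "1.10.720.10.2.1.2.1.1",
    "1a62001": "1.10.720.10.2.1.1.1.1",
}


def build_tree(codes):
    """Nest the codes level by level; the D level maps to the domain name."""
    tree = {}
    for domain, code in codes.items():
        numbers = [int(n) for n in code.split(".")]
        node = tree
        for number in numbers[:-1]:
            node = node.setdefault(number, {})
        node[numbers[-1]] = domain
    return tree


def render(node, depth=0):
    """Return indented lines, one per node, labelled with the level letter."""
    lines = []
    for number in sorted(node):
        child = node[number]
        label = "  " * depth + f"{LEVELS[depth]} {number}"
        if isinstance(child, dict):
            lines.append(label)
            lines.extend(render(child, depth + 1))
        else:
            lines.append(f"{label}  {child}")
    return lines


tree = build_tree(codes)
print("\n".join(render(tree)))
```

In the printed outline, 1a8vA01 and 2a8vA01 appear under the same I node (same S100 cluster, different D counters), while 1a62001 sits under a separate L node.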
This seems like a good question/answer to add to our FAQ section of the website - would you mind?
Best wishes,
Ian Sillitoe
CATH Team