

Protein Three-Dimensional Structural Databases: Domains, Structurally Aligned Homologues and Superfamilies
aDepartment of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge
CB2 1QW, England, bKunming Institute of Zoology, The Chinese Academy of Sciences, Eastern Jiaochang Road,
Kunming, Yunnan 650223, People's Republic of China, and cPfizer Central Research, Sandwich, Kent CT13 9NJ, England
*Correspondence e-mail: [email protected]
This paper reports the availability of a database of protein structural domains (DDBASE),
an alignment database of homologous proteins (HOMSTRAD) and a database of structurally
aligned superfamilies (CAMPASS) on the World Wide Web (WWW). DDBASE contains information
on the organization of structural domains and their boundaries; it includes only one
representative domain from each of the homologous families. This database has been
derived by identifying the presence of structural domains in proteins on the basis
of inter-secondary structural distances using the program DIAL [Sowdhamini & Blundell (1995), Protein Sci. 4, 506–520]. The alignment of proteins in superfamilies has been performed on the basis
of the structural features and relationships of individual residues using the program
COMPARER [Sali & Blundell (1990), J. Mol. Biol. 212, 403–428]. The alignment databases contain information on the conserved structural
features in homologous proteins and those belonging to superfamilies. Available data
include the sequence alignments in structure-annotated formats and the provision for
viewing superposed structures of proteins using a graphical interface. Such information,
which is freely accessible on the WWW, should be of value to crystallographers in
the comparison of newly determined protein structures with previously identified protein
domains or existing families.
1. Introduction
The Brookhaven Protein Data Bank (PDB) (Bernstein et al., 1977) currently contains over 7000 entries; after removing the repeated entries of identical
proteins (such as the same protein in different complexes or at different resolutions),
there remain 1729 proteins (Brenner et al., 1997), including many homologues (see Fig. 1). If only representative structures from the homologous protein `family' are retained
such that no two proteins have more than 25% sequence identity (Hobohm et al., 1992; May 1997 release), the resultant data set still includes 687 proteins. This corresponds
to 463 superfamilies of protein domains with 96 superfamilies arising from more than
one family (Brenner et al., 1997).
Figure 1 A cartoon representation of the classification and alignment of proteins at various structural hierarchies. The HOMSTRAD database contains alignments of homologous sequences. Some of them exist as multi-domain proteins (denoted by different coloured spheres). DDBASE is a compilation of structural domains found in representatives of homologous proteins. CAMPASS is a database of aligned protein domains belonging to superfamilies.
Proteins that have diverged but retain high sequence identity fold into similar three-dimensional
structures and usually perform similar functions – these clearly belong to a homologous
family (Richardson, 1981; Rossmann & Argos, 1977; Chothia, 1984; Overington et al., 1990, 1993). Proteins or domains of proteins that adopt the same three-dimensional fold despite
poor sequence identity and perform remotely similar functions (Blundell & Humbel, 1980; Murzin & Chothia, 1992; Murzin et al., 1995; Murzin, 1996) are termed superfamilies. The identification of new members belonging to pre-existing
families and superfamilies is straightforward only when contiguous residues forming
a functional motif are conserved, where PROSITE searches may be appropriate (Bairoch, 1991). Furthermore, these should be distinguished from proteins with no sequence identity
and no similarity of functions that nevertheless have the same fold or superfolds
(Orengo et al., 1994).
An analysis of protein sequence and structure entries indicates that about 50% of
the `new' sequences can be assigned a previously known function and roughly 20%
of the sequences have homologues of known structure (Bork et al., 1992, 1994; Koonin et al., 1994). When the structure of a `new' protein is determined, it is important to compare it with
previously determined structures. This is facilitated by the existence of databases
of aligned protein structures and sequences (Overington et al., 1990, 1993; Johnson et al., 1993).
Often homology or structural similarity exists between parts of two different proteins;
one or two domains only may be conserved (Wetlaufer, 1973; Richardson, 1981; Wodak & Janin, 1981; Go, 1981). Although algorithms to identify such compact sub-structures have been developed
(Schulz, 1977; Crippen, 1978; Rose, 1979; Zehfus & Rose, 1986), it is convenient to use automatic methods so that the information on domain organization
can be compiled for the large number of protein structures now available (Islam et al., 1995; Siddiqui & Barton, 1995; Swindells, 1995; Nichols et al., 1995). We have constructed a database of protein structural domains (DDBASE) (Sowdhamini et al., 1996) using the procedure DIAL (Sowdhamini & Blundell, 1995).
Structure-based alignment of sequences of related protein domains provides a basis
for understanding evolutionary relationships as well as diversity in function and
specificity. Such alignments can be used to derive information on amino-acid replacements
which are of value also in comparative modelling and fold recognition (Overington
et al., 1990). Databases of structural alignments of homologous proteins (HOMSTRAD: HOMologous
STRucture Alignment Database) (Overington et al., 1990, 1993; Mizuguchi et al., 1998) and protein superfamilies (CAMPASS: CAMbridge database of Protein Alignments organized
as Structural Superfamilies) (Sowdhamini et al., 1998) are described in this paper. Because of the low percentage of sequence identities
amongst distantly related proteins, it is difficult, on the basis of sequence alone,
to obtain reliable alignments where secondary structures and functionally important
residues are aligned correctly. Alignment of proteins in superfamilies, therefore,
is based on the conservation of structural features and relationships using the program
COMPARER (Sali & Blundell, 1990; Zhu et al., 1992). The three databases described here are available on the WWW (http://www-cryst.bioc.cam.ac.uk/~ddbase for DDBASE, http://www-cryst.bioc.cam.ac.uk/~homstrad for HOMSTRAD and http://www-cryst.bioc.cam.ac.uk/~campass for CAMPASS).
2. DDBASE
2.1. Description and availability
DDBASE is a compilation of the information on structural domains that are present
in a representative set of 436 protein chains (Sowdhamini et al., 1996). The identification of structural domains in a protein chain was performed using
the program DIAL (Sowdhamini & Blundell, 1995), where elements of secondary structure are clustered on the basis of their proximity
to one another. This gave rise to 695 structural domains, of which 206 are α-rich, 191 are β-rich and 294 fall under the α-and-β class. Of these domains, 63% are from multi-domain proteins and 73% have
fewer than 150 residues.
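The idea of clustering secondary-structure elements by proximity can be illustrated with a toy sketch. This is not the DIAL implementation: the average-linkage rule, the function name `cluster_elements` and the distance values are assumptions made purely for illustration, and the actual procedure is described in Sowdhamini & Blundell (1995).

```python
from itertools import combinations


def cluster_elements(labels, dist):
    """Agglomerative (average-linkage) clustering of secondary-structure elements.

    labels: element names, e.g. ["H1", "H2", "S1"] (hypothetical naming).
    dist:   dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the merges as (cluster_a, cluster_b, linkage_distance) tuples,
    i.e. the information from which a dendrogram can be drawn.
    """

    def linkage(pair):
        # Mean distance over all element pairs drawn from the two clusters.
        a, b = pair
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in labels]
    merges = []
    while len(clusters) > 1:
        # Merge the two closest clusters and record the step.
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical distances (arbitrary units): two tight pairs that lie far apart.
labels = ["H1", "H2", "S1", "S2"]
dist = {
    frozenset(("H1", "H2")): 5.0,
    frozenset(("S1", "S2")): 4.0,
    frozenset(("H1", "S1")): 20.0,
    frozenset(("H1", "S2")): 22.0,
    frozenset(("H2", "S1")): 21.0,
    frozenset(("H2", "S2")): 23.0,
}
merges = cluster_elements(labels, dist)
```

In this toy case the two strands merge first, then the two helices, and the final merge joins the two pairs at a much larger distance, which is the kind of separation that would suggest two candidate domains.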
The organization of structural domains in individual protein chains is described on
the WWW page assigned to that protein chain; an example is shown in Fig. 2. Secondary-structural dendrograms are provided that correspond to the clustering
based on distances between all possible pairs of secondary structures. All possible
combinations of nodes in the secondary-structural dendrogram are automatically examined
for compactness of putative domains corresponding to clusters and listed with their
disjoint-factor values (see Sowdhamini & Blundell, 1995, for details). The user can extract the domain boundaries corresponding
to any listed combination by clicking on that entry. However, the `best' domain boundaries,
defined by the program, have been identified and the domain organization may be viewed
on graphics using RasMol (Sayle & Milner-White, 1995). Each domain can be identified by its unique six-character code (the first four
characters correspond to the PDB code of the protein, the fifth to the chain identifier
and the sixth, as a subscript, corresponds to the domain numbering as in the individual
domain pages).
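The six-character domain code described above can be split into its parts with a few lines of code. The sketch below is illustrative only: the names `DomainId` and `parse_domain_code` are hypothetical, the example code `1abrB2` is invented for the abrin B chain, and a single-digit domain number is assumed.

```python
from typing import NamedTuple


class DomainId(NamedTuple):
    pdb_code: str  # first four characters: PDB code of the protein
    chain: str     # fifth character: chain identifier
    domain: int    # sixth character: domain number within the chain


def parse_domain_code(code: str) -> DomainId:
    """Split a six-character domain code into PDB code, chain and domain."""
    if len(code) != 6:
        raise ValueError(f"expected six characters, got {len(code)}: {code!r}")
    pdb_code, chain, domain = code[:4], code[4], code[5]
    if not domain.isdigit():
        raise ValueError(f"domain number must be a digit: {code!r}")
    return DomainId(pdb_code.lower(), chain, int(domain))


# Hypothetical example: second domain of chain B of PDB entry 1abr.
example = parse_domain_code("1abrB2")
```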
Figure 2 Domain database (DDBASE) WWW page for the B chain of abrin (PDB code, 1abr) as an example. Domains have been identified using the program DIAL (Sowdhamini & Blundell, 1995).
2.2. Application
DDBASE can be used to trace similarities where particular domains are shared between
proteins. It is especially useful where there are discontinuous domains. 400 large
domains (those with seven or more secondary structures) can be grouped into 30 classes on
the basis of the structural similarity estimated from structural environments of individual
secondary structures (Rufino & Blundell, 1994; Sowdhamini et al., 1996). The clustering of individual protein domains into structurally similar classes
can also be examined on the DDBASE WWW page.
3. HOMSTRAD and CAMPASS
3.1. Description and availability
HOMSTRAD and CAMPASS are databases of structure-based alignments of protein sequences,
grouped into homologous families and superfamilies, respectively. Aligned sequences
of families of homologous protein structures are available in HOMSTRAD (Overington
et al., 1990, 1993) and are categorized according to secondary-structural class. There are 130 homologous
protein families with at least two members in the March 1998 version. The sequences
of homologous proteins within a family are initially aligned using the rigid-body
superposition program MNYFIT (Sutcliffe et al., 1987) or COMPARER (Sali & Blundell, 1990; Zhu et al., 1992) and later subjected to a careful manual examination. Similar types of information
are available for CAMPASS, the database of proteins (or domains) belonging to superfamilies
(Sowdhamini et al., 1998). Superfamilies of structural domains were selected initially on the basis of structural
environment at secondary structural units (Rufino & Blundell, 1994; Sowdhamini et al., 1996). The selection of superfamilies has been extended by referring to SCOP (Murzin et al., 1995) and by including smaller domains like the cystine-knots, not considered earlier
in the clustering analysis since they were not easy to compare using automatic structure-based
procedures. 367 of 451 superfamilies annotated in SCOP have single families (Brenner
et al., 1997; the more recent February 1998 release of SCOP has 419 of the 571 superfamilies with
single families). Superfamily members were chosen such that no two domains within
a superfamily share more than 25% sequence identity (alignments of closely related
proteins are available in HOMSTRAD). This cut-off is consistent with the DDBASE definition
in choosing representative protein chains. A rigorous sequence-alignment program,
COMPARER (Sali & Blundell, 1990; Zhu et al., 1992), was used to align the members of a superfamily on the basis of structural features
and relationships, which are equivalenced using simulated annealing. Table 1 lists protein superfamilies, with at least two members within the above-defined cut-off
of sequence identity, whose alignments have been compiled in the March 1998 version.
This includes 67 multi-member superfamilies, which involve 293 domains representing
464 homologous proteins. There are a further 357 superfamilies, annotated in SCOP,
which have single members (Murzin et al., 1995; Brenner et al., 1997). A few other multi-member superfamilies included in SCOP, such as the DNA-binding
HMG box, pheromones, annexins and the insulin superfamily, were excluded from CAMPASS
as members exhibited more than 25% sequence identity.
‡This family is yet to be added in the homologous alignment database.
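The 25% sequence-identity cut-off used to choose superfamily members can be illustrated with a simple greedy filter. This is a sketch under stated assumptions, not the selection procedure used for CAMPASS or the algorithm of Hobohm et al. (1992): the function name `select_representatives`, the domain labels and the identity values are all hypothetical.

```python
def select_representatives(names, identity, cutoff=25.0):
    """Greedily keep domains so that no kept pair exceeds `cutoff` % identity.

    names:    candidate domains, in order of preference.
    identity: dict mapping frozenset({a, b}) to pairwise percentage identity.
    """
    kept = []
    for name in names:
        # Accept a candidate only if it is distant from everything kept so far.
        if all(identity[frozenset((name, k))] <= cutoff for k in kept):
            kept.append(name)
    return kept


# Hypothetical pairwise identities (%) between four domains.
identity = {
    frozenset(("dA", "dB")): 40.0,
    frozenset(("dA", "dC")): 12.0,
    frozenset(("dA", "dD")): 18.0,
    frozenset(("dB", "dC")): 10.0,
    frozenset(("dB", "dD")): 30.0,
    frozenset(("dC", "dD")): 22.0,
}
representatives = select_representatives(["dA", "dB", "dC", "dD"], identity)
```

A greedy pass depends on the order of the candidates, so in practice the list would be sorted by some preference (for example, resolution) before filtering; that ordering is an assumption here and is not stated in the text.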
3.2. Availability
The WWW site of HOMSTRAD (Mizuguchi et al., 1998) provides a page for each of the families. The name of the protein, source, resolution
and R factor are given for each family member corresponding to a PDB entry. The alignment
of sequences is formatted using JOY (Overington et al., 1990), which highlights the conservation of local-residue structural features such as secondary
structure, solvent accessibility and hydrogen bonding. Fig. 3 shows the alignment of cytochrome c from different sources and its homologues (cytochrome c2 and cytochrome c550), as an example.
Figure 3 HOMSTRAD database. Structure-based alignment of proteins in the family of cytochrome c. The first four characters of the code of the protein correspond to the PDB code. Numbers in brackets correspond to residue numbers and residues are shown in single letter code. The alignment has been formatted using JOY (Overington et al., 1990).
CAMPASS, on the WWW, provides information on the superfamilies: for each superfamily
member, the name, source, resolution and domain boundaries are given. The beginning
and end residue numbers for each segment of discontinuous domains are recorded. The
pairwise percentage identity matrix of the members is provided. The structure-based
alignment in the JOY-annotated form (Overington et al., 1990), similar to that described for HOMSTRAD, is shown and is also available for extraction
as PostScript, LaTeX, HTML or plain-text files.
Fig. 4 shows the alignment of the cytochrome superfamily as an example. A single representative
(1ycc) of the nine cytochrome homologues (see above and Fig. 3) has been aligned with rather distantly related proteins such as cytochrome c6 and c551. The structures of the proteins within a family/superfamily have been superposed
using MNYFIT (Sutcliffe et al., 1987), where the equivalent residues correspond to the final alignment. These superposed
structures can be viewed on the WWW using the RasMol graphics interface (Sayle & Milner-White, 1995).
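A pairwise percentage identity matrix of the kind provided for each superfamily can be computed directly from an alignment. The sketch below is a minimal illustration: the paper does not state which denominator is used, so counting only positions where both sequences have a residue is an assumption, and the function names and toy sequences are hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percentage of identical residues over positions aligned in both sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where either sequence has a gap (assumed convention).
    aligned = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not aligned:
        return 0.0
    matches = sum(x == y for x, y in aligned)
    return 100.0 * matches / len(aligned)


def identity_matrix(alignment):
    """Return {(name_p, name_q): percentage identity} for all ordered pairs."""
    names = list(alignment)
    return {
        (p, q): percent_identity(alignment[p], alignment[q])
        for p in names
        for q in names
    }


# Toy alignment, invented for illustration only.
alignment = {
    "seqA": "GDVEKGKK",
    "seqB": "GDAEKG-K",
    "seqC": "-DVAKGQK",
}
matrix = identity_matrix(alignment)
```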
Figure 4 CAMPASS database. Structure-based alignment of the cytochrome superfamily including distantly related proteins such as c550. Helix 2 of 1ycc, conserved within the homologues (see Fig. 3)
Fig. 5 shows the distribution of pairwise percentage identities in the two alignment databases.
Protein pairs in HOMSTRAD have a broad range of pairwise sequence identities with
a slightly bimodal distribution (237 pairs have sequence identities between 25 and 30% and 121 pairs have sequence
identities between 60 and 65% out of a total of 1962 pairs). However, the majority
of homologous proteins in the database have sequence identities between 15 and 65%.
The distribution of pairwise sequence identity of members within superfamilies (CAMPASS)
is restricted to a maximum of 25%. Roughly two-thirds of the protein pairs (449 out of 665)
have pairwise percentage identities between 5 and 15%.
Figure 5 Distribution of pairwise percentage sequence identities amongst members in the homologue alignment database (HOMSTRAD) and superfamily alignment database (CAMPASS).
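The counts quoted for the two distributions can be turned into shares of all pairs, and a distribution of this kind can be tabulated by binning identities. The sketch below assumes 5% bins, matching the ranges quoted in the text; the function names and the example identity values are hypothetical, and only the counts 237, 121, 1962, 449 and 665 come from the paper.

```python
from collections import Counter


def bin_identities(values, width=5):
    """Count pairwise percentage identities in bins of `width` percent.

    Each bin is keyed by its lower edge; a value of exactly 100 is placed
    in the top bin rather than opening a new one.
    """
    counts = Counter()
    for value in values:
        lower = min(int(value // width) * width, 100 - width)
        counts[lower] += 1
    return dict(sorted(counts.items()))


def share(count, total):
    """Percentage of `total` represented by `count`, to one decimal place."""
    return round(100.0 * count / total, 1)


# Hypothetical identities, for illustration only.
histogram = bin_identities([7.5, 12.0, 14.9, 27.0, 29.9, 62.0, 100.0])

# Shares implied by the counts quoted in the text.
homstrad_25_30 = share(237, 1962)
homstrad_60_65 = share(121, 1962)
campass_5_15 = share(449, 665)
```

With the quoted counts, the 25–30% bin holds about 12.1% of HOMSTRAD pairs, the 60–65% bin about 6.2%, and the 5–15% range about 67.5% of CAMPASS pairs.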
4. Conclusions
HOMSTRAD and CAMPASS are distinct from but complementary to other databases. SCOP
(Murzin et al., 1995) has classified the entire Protein Data Bank at different levels of structural hierarchy
and structural domains are defined. There is emphasis on functionality in the clustering
of folds. SCOP does not attempt to perform or present sequence or structural alignments.
CATH (Orengo et al., 1993, 1994) was originally designed and developed for whole proteins, where the authors took
particular care to exclude multi-domain proteins. Subsequently, the structures
have been systematically classified at the level of domains (Orengo et al., 1997). CATH does not include structure-based alignments of sequences. FSSP (Holm & Sander, 1994) is most similar to HOMSTRAD and CAMPASS because it also provides structure-based sequence alignments, even incorporating remote homologues.
However, the alignments do not distinguish homologues and superfamilies from those
which only share a similar fold. The databases described in this paper contain structure-based
alignments that have been specially annotated to describe the structural environment
at residue positions. This should provide extra information useful in the comparison
of protein structures.
References
Bairoch, A. (1991). Nucleic Acids Res. 19, 2013–2018.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535–542.
Blundell, T. L. & Humbel, R. E. (1980). Nature (London), 287, 781–787.
Bork, P., Ouzounis, C. & Sander, C. (1994). Curr. Opin. Struct. Biol. 4, 393–403.
Bork, P., Ouzounis, C., Sander, C., Scharf, M., Schneider, R. & Sonnhammer, E. (1992). Nature (London), 358, 287–287.
Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1997). Curr. Opin. Struct. Biol. 7, 369–376.
Chothia, C. (1984). Ann. Rev. Biochem. 53, 537–572.
Crippen, G. M. (1978). J. Mol. Biol. 126, 315–332.
Go, M. (1981). Nature (London), 291, 90–92.
Hobohm, U., Scharf, M., Schneider, R. & Sander, C. (1992). Protein Sci. 1, 409–417.
Holm, L. & Sander, C. (1994). Nucleic Acids Res. 22, 3600–3609.
Hubbard, T. J. P. & Blundell, T. L. (1987). Protein Eng. 1, 159–171.
Islam, S. A., Luo, J. & Sternberg, J. E. (1995). Protein Eng. 8, 513–525.
Johnson, M. S., Overington, J. P. & Blundell, T. L. (1993). J. Mol. Biol. 231, 735–752.
Koonin, E. V., Bork, P. & Sander, C. (1994). EMBO J. 13, 493–503.
Mizuguchi, K., Deane, C., Overington, J. P. & Blundell, T. L. (1998). Protein Sci. In the press.
Murzin, A. G. (1996). Curr. Opin. Struct. Biol. 6, 386–394.
Murzin, A. G. & Chothia, C. (1992). Curr. Opin. Struct. Biol. 2, 895–903.
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). J. Mol. Biol. 247, 536–540.
Nichols, W. L., Rose, G. D., Eyck, L. F. T & Zimm, B. H. (1995). Proteins, 23, 38–48.
Orengo, C. A., Flores, T. P., Taylor, W. R. & Thornton, J. M. (1993). Protein Eng. 6, 485–500.
Orengo, C. A., Jones, D. T. & Thornton, J. M. (1994). Nature (London), 372, 631–634.
Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). Structure, 5, 1093–1108.
Overington, J. P., Johnson, M. S., Sali, A. & Blundell, T. L. (1990). Proc. R. Soc. London Ser. B, 241, 132–145.
Overington, J. P., Zhu, Z.-Y., Sali, A., Johnson, M. S., Sowdhamini, R., Louie, G. V. & Blundell, T. L. (1993). Biochem. Soc. Trans. 21, 597–604.
Richardson, J. S. (1981). Adv. Protein Chem. 34, 167–339.
Rose, G. D. (1979). J. Mol. Biol. 134, 447–470.
Rossmann, M. G. & Argos, P. (1977). J. Mol. Biol. 109, 99–129.
Rufino, S. D. & Blundell, T. L. (1994). Comput. Aided Mol. Design, 8, 5–27.
Sali, A. & Blundell, T. L. (1990). J. Mol. Biol. 212, 403–428.
Sayle, R. A. & Milner-White, E. J. (1995). Trends Biochem. Sci. 20, 374–376.
Schulz, G. E. (1977). Angew. Chem. Intl Ed. 16, 23–33.
Siddiqui, A. S. & Barton, G. J. (1995). Protein Sci. 4, 872–884.
Sowdhamini, R. & Blundell, T. L. (1995). Protein Sci. 4, 506–520.
Sowdhamini, R., Burke, D. F., Huang, J.-F., Mizuguchi, K., Nagarajaram, H. J., Srinivasan, N., Steward, R. E. & Blundell, T. L. (1998). Structure. In the press.
Sowdhamini, R., Rufino, S. D. & Blundell, T. L. (1996). Folding Design, 1, 209–220.
Sutcliffe, M. J., Haneef, I., Carney, D. & Blundell, T. L. (1987). Protein Eng. 1, 377–384.
Swindells, M. B. (1995). Protein Sci. 4, 103–112.
Wetlaufer, D. B. (1973). Proc. Natl Acad. Sci. USA, 70, 697–701.
Wodak, S. J. & Janin, J. (1981). Biochemistry, 20, 6544–6553.
Zehfus, M. H. & Rose, G. D. (1986). Biochemistry, 25, 5759–5765.
Zhu, Z.-Y., Sali, A. & Blundell, T. L. (1992). Protein Eng. 5, 43–51.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.