The three-dimensional structures of globins are known, from crystallographic analyses, to be very similar. Their amino acid sequences, however, differ greatly. Only two residues are absolutely conserved in all sequences, and some pairs of sequences share only 16% residue identity. We have determined the nature and exact extent of the sequence variations and the extent to which the conserved features of the globin sequences are unique to this family. The 226 globin sequences now known were aligned and analysed. Because distantly related protein sequences cannot be aligned correctly without the use of structural data, we developed a method that incorporated structural information into the alignment procedure. Analysis of the aligned sequences shows that:
(1) Although individual chains vary in size between 132 and 157 residues, deletions and insertions result in there being only 102 residue sites common to all globins. These sites form six separate regions. Insertions and deletions between these regions mean that their separations can vary in different sequences.
(2) Within the conserved regions there are 32 sites that almost always contain hydrophobic residues. In the known structures, these sites are in the protein interior. We measured the variations in the size of the residues that occur in the 226 sequences at these sites. At six sites the residues differ in size by less than 40 Å³, at 11 sites they differ by 40 to 100 Å³, and at 15 sites they differ by more than 100 Å³. There are two other conserved buried sites: one contains the His linked to the haem iron and the other usually contains a His involved with the haem ligand.
(3) Within the conserved regions there are another 32 sites that are almost always occupied by charged, polar or small non-polar (Gly or Ala) residues. In the known structures, these sites are on the protein surface.
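The size comparison in item (2) can be illustrated with a minimal Python sketch. It assumes that the "difference in size" at a site is the largest minus the smallest residue volume observed in that alignment column; the function names and the `TOY_VOLUME` table are hypothetical, and the volumes are placeholder numbers for illustration, not the values used in the analysis.

```python
from typing import Dict, Iterable, List

# Placeholder residue volumes in cubic angstroms (illustrative only,
# NOT the values used in the original analysis).
TOY_VOLUME: Dict[str, float] = {
    "G": 60.0, "A": 90.0, "V": 140.0, "L": 165.0, "F": 190.0,
}


def volume_range(column: Iterable[str], volume: Dict[str, float]) -> float:
    """Largest minus smallest residue volume seen at one alignment site."""
    vols = [volume[res] for res in set(column)]
    return max(vols) - min(vols)


def size_class(range_a3: float) -> str:
    """Bin a volume range into the three classes quoted in the text."""
    if range_a3 < 40:
        return "<40"
    if range_a3 <= 100:
        return "40-100"
    return ">100"


def classify_sites(columns: List[str], volume: Dict[str, float]) -> List[str]:
    """Classify each alignment column (one string of residues per site)."""
    return [size_class(volume_range(col, volume)) for col in columns]


if __name__ == "__main__":
    # Three toy columns: residues observed at three buried sites.
    print(classify_sites(["LLV", "AF", "GF"], TOY_VOLUME))
```

Each input string holds the residues observed at one site across the aligned sequences, so a real run would pass one column per buried site together with a measured volume table.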
To determine the extent to which the conserved features found for the globin sequences are unique to that protein family, the following procedure was used. The six conserved regions, and the residue restrictions that occur at the 66 sites within these regions, were encoded into two "templates". One was based only on the sequences so far determined; the other was extended to include as yet unobserved substitutions that seemed plausible on the basis of size, hydrophobicity and polarity. Each of the 3286 non-globin sequences in the data bank was then examined by a computer program to see how closely it could be matched to these templates. No non-globin made an exact match to either template. A fairly close match could be found for only a small number of non-globins, nearly all of which are very long and most of which are polyproteins. Thus, the features of the globin sequences that are conserved and that define the globin fold and function are essentially unique to that family.