Does ConSurf-DB Use Too Many Sequences?

In ConSurf-DB, I suppose that no single compromise between too many and too few sequences will produce optimal results for all proteins. Nevertheless, I found the following case informative.

I conclude that perhaps we should caution users: ConSurf-DB is designed to include a wide range of sequences in its multiple-sequence alignments (MSA) and analyses. Often, the MSA will include a substantial number of sequences for proteins whose functions differ from that of the query protein. Consequently, amino acids that are colored as highly conserved are truly highly conserved across a wide range of sequence-similar proteins. However, amino acids that are highly conserved in proteins with the same function as the query protein may not appear conserved in ConSurf-DB results. To identify these residues, examine the sequences gathered by PSI-BLAST in a ConSurf run, and then set the "Maximum Number of Homologues" equal to the number of sequences representing proteins of the same function as the query protein. An example of such an analysis is available.
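The counting step in this procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hit descriptions below are made up, the names `count_leading_matches` and `is_mhc_class_i` are hypothetical, and matching on the description text is an assumed stand-in for inspecting the PSI-BLAST output by eye.

```python
import re
from typing import Callable, List


def count_leading_matches(descriptions: List[str],
                          is_same_function: Callable[[str], bool]) -> int:
    """Count hits from the top of a ranked list until the first non-matching hit."""
    count = 0
    for description in descriptions:
        if not is_same_function(description):
            break
        count += 1
    return count


# Assumed pattern: "Class I" as a whole word, so that "Class II" does not match.
MHC_CLASS_I = re.compile(r"\bclass I\b", re.IGNORECASE)


def is_mhc_class_i(description: str) -> bool:
    return MHC_CLASS_I.search(description) is not None


# Made-up hit descriptions, ordered from most to least similar to the query.
hits = [
    "Class I histocompatibility antigen, alpha chain (example A)",
    "Class I histocompatibility antigen, alpha chain (example B)",
    "Hereditary hemochromatosis protein",
    "Class II histocompatibility antigen, alpha chain",
]

cutoff = count_leading_matches(hits, is_mhc_class_i)
print(cutoff)
```

The resulting count is the value one would then enter as the "Maximum Number of Homologues". Stopping at the first non-matching hit is a deliberate choice: it keeps only the block of same-function sequences at the top of the ranked list.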

The example follows.

I was surprised to notice that Q226 in the alpha chain of MHC class I receives a conservation grade of 4 in ConSurf-DB. Q226 is involved in recognizing the CD8 protein, and the binding of CD8 to MHC class I is crucial to MHC class I's function in stimulating T lymphocytes to respond to foreign peptide antigens presented by MHC class I. Therefore I expect Q226 to be highly conserved.

Examining the MSA utilized by ConSurf-DB, I found many proteins other than MHC class I proteins. When I eliminated these, the conservation grade of Q226 went from 4 to 9. Details follow.

Functional Loop 220-230 in MHC Alpha Chain

In MHC Class I (mouse and human) Q226 is part of the binding site for CD8[1]. Regarding the alpha chain (A) in human MHC 1akj, the authors state:

A flexible loop of the alpha3 domain (residues 223-229) is clamped between the complementarity-determining region (CDR)-like loops of the two CD8 subunits in the classic manner of an antibody-antigen interaction ....

Hydrogen bonds between CD8 and the alpha chain of HLA-A2 that involve this loop are (from Table 2[1]):

CD8:D  T30.OG1 : T225.O     HLA-A2:A  2.7 Å
CD8:E  S34.OG  : Q226.NE2             3.0
CD8:D  S100.O  : Q226.NE2             3.0
CD8:D  S100.OG : Q226.O               2.7
CD8:E  Y51.OH  : D227.OD2             3.0
CD8:D  N99.OD1 : L230.N               3.0
CD8:D  N99.ND2 : L230.O               3.4
CD8:D  S27.OG1 : E232.OE1             2.7
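As a quick tally of the hydrogen bonds listed above, the following Python sketch counts how many of them involve each HLA-A2 residue. The tuples are copied from the list; the tuple layout and the function name `bonds_per_residue` are illustrative choices, not part of the cited work.

```python
from collections import Counter

# (CD8 chain, CD8 atom, HLA-A2 chain A atom, distance in angstroms),
# copied from the hydrogen-bond list above.
HBONDS = [
    ("D", "T30.OG1", "T225.O", 2.7),
    ("E", "S34.OG", "Q226.NE2", 3.0),
    ("D", "S100.O", "Q226.NE2", 3.0),
    ("D", "S100.OG", "Q226.O", 2.7),
    ("E", "Y51.OH", "D227.OD2", 3.0),
    ("D", "N99.OD1", "L230.N", 3.0),
    ("D", "N99.ND2", "L230.O", 3.4),
    ("D", "S27.OG1", "E232.OE1", 2.7),
]


def bonds_per_residue(hbonds):
    """Count hydrogen bonds per HLA-A2 residue (the part before the atom name)."""
    return Counter(hla_atom.split(".")[0] for _, _, hla_atom, _ in hbonds)


counts = bonds_per_residue(HBONDS)
for residue, n in counts.most_common():
    print(residue, n)
```

Q226 accounts for three of the eight hydrogen bonds listed, more than any other residue in the loop.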

Similar interactions occur in the mouse in 1bqh[2]. These two (1akj and 1bqh) are the only two crystal structures of CD8:MHC class I complexes that I found in the PDB. Sequence comparison of the alpha chain of MHC in the CD8-binding regions:

195-198 220-230

RED amino acids have conservation grade 8 or 9 for an MSA containing only MHC Class I sequences (136 sequences, see below).

ConSurf Results for 2VAA:A 220-230

The first 136* sequences found are for MHC class I molecules from diverse species. After that come non-MHC class I sequences, namely multiple sequences each of Hereditary hemochromatosis protein, Zinc-alpha-2-glycoprotein, IgG receptor FcRn large subunit, and MHC class II (which does not bind to CD8).

All jobs shown here utilized default parameters, except for the maximum number of sequences to use.

2VAA chain A

Server      Number of   Cons. Grades    Av. Cons. Gr.  APD   Job Link
            Sequences   220-230
ConSurf-DB  144         8 57776 48775   7.1            1.72  consurfdb
ConSurf     all=218     8 76565 45553   5.9            1.17  1237248584
ConSurf     150         8 66356 66365   6.0            0.52  1237414642
ConSurf     139         8 98466 98577   7.7            0.36  1237421568
ConSurf     136*        8 98466 98587   7.8            0.33  1237421568
ConSurf     100         8 99688 99899   9.2            0.20  1237327837
ConSurf     70          7 99?88 98799   8.3            0.22  1237327964

The Q226 conservation grade is the first digit of the last group of five grades in each row.
? = insufficient data.
APD = Average Pairwise Difference in the Multiple Sequence Alignment.
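To read the grade of a single residue out of the table, the grade strings can be indexed by residue number. The sketch below assumes that the eleven grade symbols in each row cover residues 220-230 in order (an assumption, although it agrees with the Q226 grades of 4 and 9 quoted in the text); the function name `grade_at` is hypothetical.

```python
FIRST_RESIDUE = 220  # assumption: the 11 grade symbols cover residues 220-230 in order

# (server, number of sequences, grade string), copied from the table above.
ROWS = [
    ("ConSurf-DB", "144", "8 57776 48775"),
    ("ConSurf", "all=218", "8 76565 45553"),
    ("ConSurf", "150", "8 66356 66365"),
    ("ConSurf", "139", "8 98466 98577"),
    ("ConSurf", "136*", "8 98466 98587"),
    ("ConSurf", "100", "8 99688 99899"),
    ("ConSurf", "70", "7 99?88 98799"),
]


def grade_at(grade_string, residue_number, first_residue=FIRST_RESIDUE):
    """Return the grade symbol for one residue; '?' means insufficient data."""
    symbols = grade_string.replace(" ", "")
    return symbols[residue_number - first_residue]


q226_grades = {n_seqs: grade_at(grades, 226) for _, n_seqs, grades in ROWS}
for n_seqs, grade in q226_grades.items():
    print(n_seqs, grade)
```

Grades are kept as single-character strings rather than integers so that the "?" symbol for insufficient data needs no special handling.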


* When I specified 136 as the maximum number of sequences to be used (based on the examination of the "PSI-BLAST output" in job 1237421568), I expected the most distant sequence used to be the 136th in the list, namely "sp|P15979|HA1F_CHICK Class I histocompatibility antigen, F10 alp...". However, the next two sequences (Q9GL43, Q9GL42) were included in the list "Unique Sequences Used", which did include exactly 136 sequences. This I do not understand.

Conservation Grade Distribution for 2VAA:A

Here are the distributions of conservation grades for 2VAA:A using different total numbers of sequences in the multiple sequence alignment (from the jobs linked in the above table).

This graph was prepared with Google Spreadsheets.
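A distribution of this kind can also be tallied with a short script instead of a spreadsheet. The sketch below is a minimal illustration: the per-residue grades are made up, the function name `grade_distribution` is hypothetical, and residues with insufficient data are assumed to be represented as `None` and skipped.

```python
from collections import Counter


def grade_distribution(grades):
    """Count residues at each conservation grade 1-9, skipping None (insufficient data)."""
    counts = Counter(g for g in grades if g is not None)
    return {grade: counts.get(grade, 0) for grade in range(1, 10)}


# Made-up per-residue grades for a short chain; None marks insufficient data.
example_grades = [9, 9, 8, 5, 4, 4, 1, None, 9, 6]

distribution = grade_distribution(example_grades)
for grade, n_residues in distribution.items():
    print(grade, n_residues)
```

Running this once per job and placing the results side by side gives the same kind of comparison as the graph, with one column of counts per alignment size.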

I was not surprised to see:

  1. The number of residues with conservation level 9 goes down dramatically as the number of sequences and the APD[3] increase.
  2. A compensating rise in the number of residues with intermediate conservation levels, notably levels 4-6.

I was surprised to see:

  1. The number of sequences and APD[3] have very little effect on the numbers of residues in conservation levels 1-3 (gray zone) and a minor effect on levels 7-8 (light gray zone). The constancy of level 1 was particularly surprising.
  2. Almost no difference between the distributions for 218 sequences in ConSurf vs. 144 in ConSurf-DB, surprising given that the APDs[3] were 1.17 and 1.72 respectively.

Of course some of these observations may not generalize to other protein chains. It would be useful to analyze more cases.


