Stephen
L. Mayo *
The first fully automated design
and experimental validation of a novel sequence for an entire protein is
described.
A computational design algorithm
based on physical chemical potential functions and stereochemical
constraints was used to screen a combinatorial library of 1.9x1027
possible amino acid sequences for compatibility with the design
target, a
protein motif based on the polypeptide
backbone structure of a zinc finger domain. A BLAST search shows
that the designed sequence,
full sequence design 1 (FSD-1), has
very low identity to any
known protein sequence. The solution structure of FSD-1 was
solved by nuclear magnetic resonance spectroscopy and indicates
that FSD-1 forms a compact well-ordered
structure, which is in excellent agreement with the design
target structure. This result demonstrates
that computational methods can perform the immense combinatorial
search required for protein design,and
it suggests that an unbiased and quantitative algorithm canbe used in various
structural contexts.
B I. Dahiyat, Division of Chemistry and Chemical Engineering, California
Institute of Technology, mail code
147-75, Pasadena, CA 91125, USA.
* To whom correspondence
should be addressed. E-mail:
steve@mayo.caltech.edu
Present address: Xencor, Pasadena,
CA 91106, USA.
We have sought to expand the range of computational protein design to residues of all parts of a protein: the buried core, the solvent-exposed surface, and the boundary between core and surface (4-6). Our goal is an unbiased, quantitative design algorithm that is based on the physical properties that determine protein structure and stability and that is not limited to specific folds or motifs. Such a method should escape the lack of generalityof design approaches based on system-specific heuristics or subjective considerations or both. We have developed our algorithm by combining theory, computation, and experiment in a cycle that has improved our understanding of the physical chemistry governing protein design (4). We now report the successful design by the algorithm of an original sequence for an entire protein and the experimental validation of the protein's structure.
Sequence selection. Our design methodology begins with a backbone fold and we attempt to select an amino acid sequence that will stabilize this target structure. The method consists of an automated side-chain selection algorithm that explicitly and quantitatively considers specific interactions between (i) side chain and backbone and (ii) side chain and side chain (4).The side chain selection algorithm screens all possible amino acid sequences and finds the optimal sequence and side-chain orientationsfor a given backbone. In order to correctly account for the torsional flexibility of side chains and the geometric specificity of side-chainplacement, we consider a discrete set of all allowed conformers of each side chain, called rotamers (7). The sizable search problem presented by rotamer sequence optimization is overcome by application of the dead-end elimination (DEE) theorem (8). Our implementation of the DEE theorem extends its utility to sequence design and rapidly finds the globally optimal sequence in its optimal conformation (4).
Previously we determined the different
contributions of core, surface, and boundary residues to the scoring of
a sequence arrangement. The sequence predictions of a scoring
function, or a combination of scoring functions, were experimentally
tested in order to assess
the accuracy of the algorithm and to deriveimprovements
to it. We successfully redesigned the
core of a coiled coil and of the streptococcal protein G 1
(G
1) domain usinga van der
Waals potential to account for steric constraints and an atomic
solvation potential favoring the burial and penalizing the exposure
of nonpolar surface area (4, 6). Effectivesolvation
parameters and the appropriate balance between packingand solvation terms
were found by systematic analysis of experimentaldata and feedback into
the simulation. Solvent-exposed residueson the surface of a protein were
designed
with the use of a hydrogen-bond potential and secondary structure
propensities in addition to a van der
Waals potential. Coiled coils designed
with such a scoring function were 10° to 12°C more thermally
stable than the naturally occurring analog (5).
Residues that form the boundary between the core and surface
require a combination of the core and the surface scoring functions.
The algorithm considers both hydrophobicand
hydrophilic amino acids at boundary positions, whereas corepositions are
restricted to hydrophobic amino acids and surfacepositions are restricted
to hydrophilic amino acids.
In order to assess the capability
of our design algorithm, we have computed
the entire amino acid sequence for a small protein motif. We
sought a protein fold that would be small enough to be both
computationally and experimentally tractable, yet large enough
to form an independently
folded structure in the absenceof disulfide
bonds or metal binding. We chose the
motif typified by the zinc finger DNA binding module (9).
Although this motif consists of fewer than 30 residues, it does
contain sheet, helix, and turn structures. The ability of this
fold to form in the absence of metal ions or disulfide
bonds has been demonstrated by Imperialiand
co-workers, who designed a 23-residue
peptide, containingan unusual amino
acid (D-proline) and a nonnatural amino acid
[3-(1,10-phenanthrol-2-yl)-L-alanine], which
achieved this fold (10); our initial characterization
of a partially computed sequence indicated that it also forms
this fold (11). In computing thefull sequence for this
target fold, we use the scoring functionsfrom our previous work without
modification (12). The
motif was not used in any of our prior work to develop
the design methodology and
therefore provides a test of the algorithm's
generality.
The sequence selection algorithm requires structure coordinates that
define
the target motif's backbone (N, C, C,
and O atomsand C
-C
vectors). The Brookhaven Protein Data Bank (PDB) (13)was
examined for high-resolution structures of the
motif, and the second zinc finger module of the DNA binding
protein Zif268 was selected as our design
template (9,
14). In order
to assign the residue positions in the template structure into
core, surface, or boundary classes, the orientation of the C
-C
vectors was assessed relative to a solvent-accessible surface
computed with only the template C
atoms (15). The small size of this motif
limits to one (position 5) the number of residues that can be
assigned unambiguously to the core, whereas seven residues (positions
3, 7, 12, 18, 21, 22, and 25) were classified as boundary and
the remaining 20 residues were assigned to the surface. Whereas three
of the zinc binding positions of Zif268 are in the boundary or
core, one residue, position 8, has a C
-C
vector directed away from the geometric center of the protein
and is classified as a surface position. As in our previous
studies, the amino acids considered
at the core positions during sequence selection were Ala, Val,
Leu, Ile, Phe, Tyr, and Trp; the amino acids consideredat
the surface positions were Ala, Ser, Thr, His, Asp, Asn, Glu,Gln, Lys,
and Arg; and the combined core and surface amino acidsets (16 amino acids)
were considered at the boundary positions.
Two of the residue positions (9 and 27) have
angles greaterthan 0°and are set to Gly by the sequence selection algorithmto
minimize backbone strain.
The total number of amino acid sequences that must be considered
by the design algorithm is the product
of the number of possible amino acids at each residue position.
The
motif residue classification described
above results in a virtual combinatorial library of 1.9x1027
possible amino acid sequences (16). This library size
is 15 orders of magnitude
larger than that accessible by experimental random library approaches.
A corresponding peptide library consistingof
only a single molecule for each 28-residue sequence would havea mass of
11.6 metric tons (17). In order
to accurately model the
geometric specificity of side-chain
placement, we explicitly consider
the torsional flexibility of amino acid side
chains in our sequence scoring by representing each amino acid
with a discrete set of allowed conformations, called rotamers
(18). As a result, the
design
algorithm must consider all rotamers
for each possible amino acid at each residue position. The total
size of the search space for the
motif is therefore 1.1 × 1062 possible rotamer sequences.
We use a search algorithm based on an extension of the DEE
theorem to solve the rotamer sequence optimization problem (4,
8).
Efficient implementation of the DEE
theorem has made complete protein sequence
design
tractable for about 50 residues on current parallel computers
in a single calculation. The rotamer optimization problem for
the
motif required 90 CPU hours to find the optimal sequence (19,
20).
The optimal sequence (Fig. 1) is called full sequence
design
(FSD-1). Even though all of the hydrophilic amino acids were considered
at each of the boundary positions, the algorithm selected only
nonpolar amino acids. The eight core and boundary positions are
predicted to form a well-packed buried cluster. The Phe sidechains
selected by the algorithm at positions 21 and 25, the zinc-bindingHis positions
of Zif268, are more than 80 percent buried, andthe Ala at position 5 is
100 percent buried but the Lys at position8 is more than 60 percent exposed
to solvent (Fig.
2). The other boundary positions
demonstrate
the steric constraints on buried residues by packing similar
side chains in an arrangement similarto
that of Zif268 (Fig.
2). The calculated optimal configuration
for core and boundary residues buries ~1150 Å2 of
nonpolar surface area. On the helix surface, the algorithm places
Asn14 with a hydrogen bond between its side-chain
carbonyl oxygen and the backbone amide
proton of residue 16. The eight charged residues on the helix
form three pairs of hydrogen bonds, although in our coiled-coil
designs,
helical surface hydrogen bonds appeared to be less important
than the overall helix propensity of the sequence (5).
Positions 4 and 11 on the exposed sheet surface were selected by
the program to be Thr, one of the best -sheet
forming residues (21).
Alignment of the sequences for FSD-1 and Zif268 (Fig. 1)
indicates that only 6 of the 28 residues (21 percent) are identicaland
only 11 (39 percent) are similar. Four of the identities
are in the buried cluster, which is consistent with the expectationthat
buried residues are more conserved than solvent-exposed residuesfor a given
motif (22). A BLAST (23) search of
the FSD-1 sequence against the nonredundant protein sequence
database of the National Center for Biotechnology Information
did not reveal any zinc finger protein sequences. Further, the
BLAST search found only low identitymatches
of weak statistical significance to fragments of variousunrelated proteins.
The highest identity matches were 10
residues (36 percent) with P values ranging from 0.63
to 1.0, where P is the probability of a match being a
chance occurrence. Random 28-residue sequences that consist
of amino acids allowed in the
position classification described
above produced similar BLAST search results, with 10- or 11-residue
identities (36 to 39 percent) and P
values ranging from 0.35 to 1.0, further suggesting that the
matches for FSD-1 are statistically insignificant. The low identity
with any known protein sequence
demonstrates
the novelty of the FSD-1 sequence and underscores
that no sequence information from any protein motif was used
in our sequence scoring function.
In order to examine the robustness of the computed sequence, we used the sequence of FSD-1 as the starting point of a Monte Carlo simulated annealing run. The Monte Carlo search revealed high scoring, suboptimal sequences in the neighborhood of the optimal solution (4). The energy spread from the ground-state solution to the 1000th most stable sequence is about 5 kcal/mol, an indication that the density of states is high. The amino acids comprising the core of the molecule, with the exception of position 7, are essentially invariant (Fig. 1). Almost all of the sequence variation occurs at surface positions, and typically involves conservative changes. Asn14, which is predicted to form a stabilizing hydrogen bond to the helix backbone, is among the most conserved surface positions. The strong sequence conservation observed for critical areas of the molecule suggests that, if a representative sequence folds into the design target structure, then many sequences whose variations do not disrupt the critical interactions may be equally competent. Even if billions of sequences would successfully achieve the target fold, they would represent only a very small proportion of the 1027 possible sequences.
Experimental validation. FSD-1 was synthesized in order
to allow us to characterize its structure and assess the performanceof
the design algorithm (24).
The far-ultraviolet (UV) circular dichroism (CD) spectrum of
FSD-1 shows minima at 220 nm and 207 nm, which is indicative
of a folded structure (Fig. 3A)
(25). The thermal melt is weakly cooperative,
with an inflection point at 39°C (Fig.
3B),
and is completely reversible. The broad melt is consistent with
a low enthalpy of folding which is expected for a motif with
a small hydrophobic core. This behavior contrasts the uncooperative
thermal unfolding transitions observed for other folded
short peptides (26).
FSD-1 is highly soluble (greater than 3 mM), and equilibrium
sedimentation studies at 100 µM, 500 µM, and 1 mM
show the protein to be monomeric (27). The sedimentationdata
fit well to a single species, monomer model
with a molecular mass of 3630 at 1 mM, in good agreement with
the calculated monomer mass of 3488. Also, far UV-CD spectra
showed no concentration dependence
from 50 µM to 2 mM, and nuclear magnetic resonance (NMR)
COSY spectra taken at 100 µM and 2 mM were essentially identical.
The solution structure of FSD-1 was solved by means of homonuclear 2D
1H
NMR spectroscopy (28). NMR spectra were well dispersed,
indicating an ordered protein
structure and easing resonance assignments. Proton chemical
shift assignments were determined with
standard homonuclear methods (29). Unambiguous
sequential and short-range NOEs (Fig.
4) indicate
helical secondary structure from residues 15 to 26 in agreement
with the design target. Representative
long-range NOEs from the helix to Ile7 and Phe12
indicate a hydrophobic core consistent with the
desired
tertiary structure (Fig. 4B).
The structure of FSD-1 was determined
from 284 experimental restraints (10.1 restraints per residue) that were
nonredundant with covalent structure including 274 NOE distance
restraints and 10 hydrogen bond restraints involving slowly
exchanging amide protons
(30). Structure calculations were performed with X-PLOR(31)
with the use of standard protocols for hybrid distance geometry-simulated
annealing (32). An ensemble of 41 structures converged
with good covalent geometry and no distance restraint violations
greater than 0.3 Å (Fig.
5 and Table
1).
The backboneof FSD-1 is well defined
with a root-mean-square (rms) deviationfrom
the mean of 0.54 Å (residues 3 to 26). Consideration
of the buried side chains
(Tyr3, Ala5, Ile7, Phe12, Leu18,
Phe21, Ile22, and Phe25) along with the
backbone gives an rms deviation of
0.99 Å, indicating that the core of the molecule is well
ordered. The stereochemical quality
of the ensemble of structures was examined with PROCHECK (33).
Apart from the disordered termini and
the glycine residues, 87 percent of the residues fall in the
most favored region and the remainder
in the allowed region of ,
space. Modest heterogeneity is
evident in the first strand (residues
3 to 6), which has an average backbone angular order
parameter,
S
(34), of 0.96 ± 0.04 compared to the
second strand (residues 9 to 12) with an
S
= 0.98 ± 0.02 and the helix (residues 15 to 26) with an
S
=
0.99 ± 0.01. Overall, FSD-1 is notably well ordered
and, to our knowledge, is the shortest sequence consisting entirely
of naturally occurring amino acids that folds to a well-ordered
structure without metal binding, oligomerization, or disulfide
bond formation (35).
|
The packing pattern of the hydrophobic core of the NMR structure ensemble
of FSD-1 (Tyr3, Ile7, Phe12, Leu18,
Phe21, Ile22, and Phe25) is similar to
the computed packing arrangement. Five of the seven residues
have 1 angles in the same
gauche+, gauche
or trans category
as the design target, and three residues
match both
1 and
2
angles. The two residues that do not match their computed
1
angles are Ile7 and Phe25, which is consistent with
their location at the less constrained open end of the molecule.
Ala5 is not involved in its expected extensive packing interactionsand
instead exposes about 45 percent of its surface area becauseof the displacement
of the strand 1 backbone relative to the designtemplate.
Conversely, Lys8 behaves as predicted by the algorithm with
its solvent exposure (60 percent) and
1
and
2 angles matching the
computed structure. Because there are few NOEs involving solvent-exposed
side chains,most of these side
chains are disordered in the solution
structure,a state that precludes examination
of the predicted surface residue hydrogen bonds. However, Asn14
forms a hydrogen bond from its side
chain carbonyl oxygen as predicted, but to the amide
of Glu17, not Lys16 as expected from the
design.
This hydrogen bond is present in 95 percent of the structure
ensemble and has a donor-acceptor distance of 2.6 ± 0.06
Å. In general, the side chains
of FSD-1 correspond well with the design
algorithm predictions, but further refinement of the scoring
function and rotamer library should improve sequence selection
and side chain placement and improvethe
correlation between the predicted and observed structures.
We compared the average restrained minimized structure of FSD-1 and
the design target (Fig. 6).
The overall backbone rms deviationof
FSD-1 from the design target is 1.98
Å for residues 3 to 26 and only 0.98 Å for residues
8 to 26 (Table
2). The largest differencebetween FSD-1
and the target structure occurs from residues 4to 7, with a displacement
of 3.0 to 3.5 Å of the backbone atompositions of strand 1. The agreement
for strand 2, the strand-to-helixturn, and the helix is remarkable, with
the differences nearlywithin the accuracy of the structure determination.
For this region of the structure, the rms difference of ,
angles between FSD-1 and the design
target is only 14 ± 9°. In order
to quantitatively assess the similarity of FSD-1 to the global
fold of the target, we calculated their supersecondary structure
parameter values (Table 2) (36,
37),
which describe the relative orientationsof
secondary structure units in proteins. The values of
,
the inclination of the helix relative to the sheet, and
,
the dihedral angle between the helix axis and the strand axes
(see legend to Table 2), are nearly identical.
The height of the helix above the sheet, h, is only 1
Å greater in FSD-1. A study of protein core design
as a function of helix height for G
1
variants demonstrated that
up to 1.5 Å variation in helix height has little effect on sequence
selection (37). The comparison of supersecondary structureparameter
values and backbone coordinates highlights the excellentagreement between
the experimentally determined structure
of FSD-1 and the design
target, and demonstrates the success
of our algorithm at computing a sequence for this
motif.
|
The quality of the match between FSD-1 and the design
target
demonstrates the ability of
our algorithm to design a sequencefor
a fold that contains the three major secondary structure elementsof proteins:
sheet, helix, and turn. Since the
fold is different from those used to develop
the sequence-selection methodology, the
design
of FSD-1 represents a successful transfer of our algorithm to
a new motif. Further tests of the performance of the algorithm on
several different motifs are necessary, although its basis in
physical chemistry and the absence of heuristics and subjective considerations
should allow the algorithm to be used in many different structural
contexts. Also, the generation of various kinds of backbone
templates for use as input to our fully automated sequence selection
algorithm could enable the design of
new protein folds. Recent results indicate that the sequence
selection algorithm is not sensitive to even fairly large perturbations
in backbone geometry and should be robust enough to accommodate
computer-generated backbones (37).
The key to using a quantitative method for the FSD-1
design,
and for the continued development of
the methodology, is the tight coupling of theory, computation,
and experiment used to improve the accuracy of the physical
chemical potential functions in our algorithm. When combined
with these potential functions, computational optimization methods
such as DEE can rapidly find sequences
for structures too large for experimental library screening or
too complex for subjective approaches. Given that the FSD-1 sequence
was computed with only a 4-GigaFLOPS computer (19),and
that TeraFLOPS computers are now available with PetaFLOPScomputers on the
drawing board, the prospect for pursuing evenlarger and more complex
designs
is excellent.
![]() |
|
||||||||||
![]() ![]() |
This article has been cited by other articles: