
De Novo Protein Design: Fully Automated Sequence Selection

Bassil I. Dahiyat†, Stephen L. Mayo^*

The first fully automated design and experimental validation of a novel sequence for an entire protein is described. A computational design algorithm based on physical chemical potential functions and stereochemical constraints was used to screen a combinatorial library of 1.9 × 10²⁷ possible amino acid sequences for compatibility with the design target, a ββα protein motif based on the polypeptide backbone structure of a zinc finger domain. A BLAST search shows that the designed sequence, full sequence design 1 (FSD-1), has very low identity to any known protein sequence. The solution structure of FSD-1 was solved by nuclear magnetic resonance spectroscopy and indicates that FSD-1 forms a compact well-ordered structure, which is in excellent agreement with the design target structure. This result demonstrates that computational methods can perform the immense combinatorial search required for protein design, and it suggests that an unbiased and quantitative algorithm can be used in various structural contexts.

B. I. Dahiyat, Division of Chemistry and Chemical Engineering, California Institute of Technology, mail code 147-75, Pasadena, CA 91125, USA.
S. L. Mayo, Howard Hughes Medical Institute and Division of Biology, California Institute of Technology, mail code 147-75, Pasadena, CA 91125, USA.
^* To whom correspondence should be addressed. E-mail: steve@mayo.caltech.edu

† Present address: Xencor, Pasadena, CA 91106, USA.

Significant advances have been made toward the design of stable, well-folded proteins with novel sequences (1). These efforts have generated insight into the factors that control protein folding and have suggested new approaches to biotechnology (2). In order to broaden the scope and power of protein design techniques, several groups have developed and experimentally tested systematic quantitative methods for protein design directed toward developing general design algorithms (3, 4). These techniques, which have been used to screen possible sequences for compatibility with the desired protein fold, have been focused mostly on the redesign of protein cores.

We have sought to expand the range of computational protein design to residues of all parts of a protein: the buried core, the solvent-exposed surface, and the boundary between core and surface (4-6). Our goal is an unbiased, quantitative design algorithm that is based on the physical properties that determine protein structure and stability and that is not limited to specific folds or motifs. Such a method should escape the lack of generality of design approaches based on system-specific heuristics or subjective considerations or both. We have developed our algorithm by combining theory, computation, and experiment in a cycle that has improved our understanding of the physical chemistry governing protein design (4). We now report the successful design by the algorithm of an original sequence for an entire protein and the experimental validation of the protein's structure.

Sequence selection. Our design methodology begins with a backbone fold and we attempt to select an amino acid sequence that will stabilize this target structure. The method consists of an automated side-chain selection algorithm that explicitly and quantitatively considers specific interactions between (i) side chain and backbone and (ii) side chain and side chain (4). The side-chain selection algorithm screens all possible amino acid sequences and finds the optimal sequence and side-chain orientations for a given backbone. In order to correctly account for the torsional flexibility of side chains and the geometric specificity of side-chain placement, we consider a discrete set of all allowed conformers of each side chain, called rotamers (7). The sizable search problem presented by rotamer sequence optimization is overcome by application of the dead-end elimination (DEE) theorem (8). Our implementation of the DEE theorem extends its utility to sequence design and rapidly finds the globally optimal sequence in its optimal conformation (4).
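As a rough illustration of how dead-end elimination prunes the search, the following Python sketch applies the pairwise elimination inequality quoted in note 8 to a toy problem. It is not the authors' implementation: the energy tables are random placeholders, the names `pair_energy`, `dee_prune`, `total_energy`, and `brute_force` are hypothetical, and the sequence-design extension described in (4) is not reproduced. The brute-force search is included only to check, on this small example, that pruning never discards a rotamer belonging to the global minimum.

```python
import itertools
import random


def pair_energy(pair_e, i, r, j, s):
    """Return E(i_r j_s); the table stores each position pair once, with i < j."""
    if i < j:
        return pair_e[(i, j)][r][s]
    return pair_e[(j, i)][s][r]


def dee_prune(self_e, pair_e):
    """Repeatedly remove rotamers that satisfy the elimination inequality."""
    n = len(self_e)
    alive = [set(range(len(row))) for row in self_e]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for r in sorted(alive[i]):
                for t in sorted(alive[i] - {r}):
                    # E(i_r) - E(i_t) + sum_j min_s [E(i_r j_s) - E(i_t j_s)]
                    gap = self_e[i][r] - self_e[i][t]
                    for j in range(n):
                        if j != i:
                            gap += min(
                                pair_energy(pair_e, i, r, j, s)
                                - pair_energy(pair_e, i, t, j, s)
                                for s in alive[j]
                            )
                    if gap > 0:  # r is worse than t in every context: dead-ending
                        alive[i].discard(r)
                        changed = True
                        break
    return alive


def total_energy(self_e, pair_e, choice):
    """Sum of self energies plus all pairwise energies for one rotamer choice."""
    n = len(choice)
    energy = sum(self_e[i][choice[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy += pair_e[(i, j)][choice[i]][choice[j]]
    return energy


def brute_force(self_e, pair_e):
    """Exhaustive search; feasible only for toy problems."""
    ranges = [range(len(row)) for row in self_e]
    return min(itertools.product(*ranges),
               key=lambda c: total_energy(self_e, pair_e, c))


# Toy problem: 4 positions with 3 or 4 rotamers each and random energies
# (placeholder values, not derived from any force field).
rng = random.Random(0)
counts = [3, 4, 3, 4]
self_e = [[rng.uniform(-1, 1) for _ in range(c)] for c in counts]
pair_e = {
    (i, j): [[rng.uniform(-1, 1) for _ in range(counts[j])]
             for _ in range(counts[i])]
    for i in range(len(counts)) for j in range(i + 1, len(counts))
}

alive = dee_prune(self_e, pair_e)
best = brute_force(self_e, pair_e)
print("surviving rotamers per position:", [sorted(a) for a in alive])
print("global minimum rotamers:", best)
```

Because a rotamer is removed only when another rotamer at the same position is strictly better against every surviving partner, each position keeps at least one rotamer and the global minimum survives the pruning.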

Previously we determined the different contributions of core, surface, and boundary residues to the scoring of a sequence arrangement. The sequence predictions of a scoring function, or a combination of scoring functions, were experimentally tested in order to assess the accuracy of the algorithm and to derive improvements to it. We successfully redesigned the core of a coiled coil and of the streptococcal protein G β1 (Gβ1) domain using a van der Waals potential to account for steric constraints and an atomic solvation potential favoring the burial and penalizing the exposure of nonpolar surface area (4, 6). Effective solvation parameters and the appropriate balance between packing and solvation terms were found by systematic analysis of experimental data and feedback into the simulation. Solvent-exposed residues on the surface of a protein were designed with the use of a hydrogen-bond potential and secondary structure propensities in addition to a van der Waals potential. Coiled coils designed with such a scoring function were 10° to 12°C more thermally stable than the naturally occurring analog (5). Residues that form the boundary between the core and surface require a combination of the core and the surface scoring functions. The algorithm considers both hydrophobic and hydrophilic amino acids at boundary positions, whereas core positions are restricted to hydrophobic amino acids and surface positions are restricted to hydrophilic amino acids.

In order to assess the capability of our design algorithm, we have computed the entire amino acid sequence for a small protein motif. We sought a protein fold that would be small enough to be both computationally and experimentally tractable, yet large enough to form an independently folded structure in the absence of disulfide bonds or metal binding. We chose the ββα motif typified by the zinc finger DNA binding module (9). Although this motif consists of fewer than 30 residues, it does contain sheet, helix, and turn structures. The ability of this fold to form in the absence of metal ions or disulfide bonds has been demonstrated by Imperiali and co-workers, who designed a 23-residue peptide, containing an unusual amino acid (D-proline) and a nonnatural amino acid [3-(1,10-phenanthrol-2-yl)-L-alanine], which achieved this fold (10); our initial characterization of a partially computed sequence indicated that it also forms this fold (11). In computing the full sequence for this target fold, we use the scoring functions from our previous work without modification (12). The ββα motif was not used in any of our prior work to develop the design methodology and therefore provides a test of the algorithm's generality.

The sequence selection algorithm requires structure coordinates that define the target motif's backbone (N, Cα, C, and O atoms and Cα-Cβ vectors). The Brookhaven Protein Data Bank (PDB) (13) was examined for high-resolution structures of the ββα motif, and the second zinc finger module of the DNA binding protein Zif268 was selected as our design template (9, 14). In order to assign the residue positions in the template structure into core, surface, or boundary classes, the orientation of the Cα-Cβ vectors was assessed relative to a solvent-accessible surface computed with only the template Cα atoms (15). The small size of this motif limits to one (position 5) the number of residues that can be assigned unambiguously to the core, whereas seven residues (positions 3, 7, 12, 18, 21, 22, and 25) were classified as boundary and the remaining 20 residues were assigned to the surface. Whereas three of the zinc binding positions of Zif268 are in the boundary or core, one residue, position 8, has a Cα-Cβ vector directed away from the geometric center of the protein and is classified as a surface position. As in our previous studies, the amino acids considered at the core positions during sequence selection were Ala, Val, Leu, Ile, Phe, Tyr, and Trp; the amino acids considered at the surface positions were Ala, Ser, Thr, His, Asp, Asn, Glu, Gln, Lys, and Arg; and the combined core and surface amino acid sets (16 amino acids) were considered at the boundary positions. Two of the residue positions (9 and 27) have φ angles greater than 0° and are set to Gly by the sequence selection algorithm to minimize backbone strain.
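The distance thresholds given in note 15 for the core, surface, and boundary classes can be sketched as a small Python function. This is an illustrative reading of that note rather than the authors' code: the function name `classify_position` and its two arguments are assumptions, the second distance is taken here to be measured from the Cβ atom, and the distances themselves would have to come from a separately computed solvent-accessible surface, which is not shown.

```python
def classify_position(d_along_vector, d_nearest):
    """Classify a residue position as 'core', 'surface', or 'boundary'.

    d_along_vector: distance (in angstroms) from the C-alpha atom, along the
        C-alpha to C-beta vector, to the solvent-accessible surface.
    d_nearest: distance (in angstroms) from the C-beta atom (assumed) to the
        nearest point on that surface.
    Thresholds follow note 15: 5.0 and 2.0 for core, 2.7 for surface.
    """
    if d_along_vector > 5.0 and d_nearest > 2.0:
        return "core"
    if d_along_vector + d_nearest < 2.7:
        return "surface"
    return "boundary"


# Hypothetical distances, chosen only to exercise each branch.
examples = {
    "deeply buried": (6.0, 3.0),
    "pointing out": (1.0, 1.0),
    "in between": (4.0, 1.0),
}
for label, (d_vec, d_near) in examples.items():
    print(label, "->", classify_position(d_vec, d_near))
```

Note 15 also records a manual adjustment that such a function would not capture: positions 1, 17, and 23 were moved from the boundary class to the surface class.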

The total number of amino acid sequences that must be considered by the design algorithm is the product of the number of possible amino acids at each residue position. The ββα motif residue classification described above results in a virtual combinatorial library of 1.9 × 10²⁷ possible amino acid sequences (16). This library size is 15 orders of magnitude larger than that accessible by experimental random library approaches. A corresponding peptide library consisting of only a single molecule for each 28-residue sequence would have a mass of 11.6 metric tons (17). In order to accurately model the geometric specificity of side-chain placement, we explicitly consider the torsional flexibility of amino acid side chains in our sequence scoring by representing each amino acid with a discrete set of allowed conformations, called rotamers (18). As a result, the design algorithm must consider all rotamers for each possible amino acid at each residue position. The total size of the search space for the ββα motif is therefore 1.1 × 10⁶² possible rotamer sequences. We use a search algorithm based on an extension of the DEE theorem to solve the rotamer sequence optimization problem (4, 8). Efficient implementation of the DEE theorem has made complete protein sequence design tractable for about 50 residues on current parallel computers in a single calculation. The rotamer optimization problem for the ββα motif required 90 CPU hours to find the optimal sequence (19, 20).
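The library size and mass quoted above can be checked with a few lines of Python. The sketch below follows the arithmetic of notes 16 and 17; the variable and function names are illustrative, and the Avogadro constant is the only value not taken from the text.

```python
# Amino acid sets per position class, in single-letter code.
CORE_AA = set("AVLIFYW")          # Ala, Val, Leu, Ile, Phe, Tyr, Trp
SURFACE_AA = set("ASTHDNEQKR")    # Ala, Ser, Thr, His, Asp, Asn, Glu, Gln, Lys, Arg
BOUNDARY_AA = CORE_AA | SURFACE_AA
GLY_ONLY = {"G"}

N_RESIDUES = 28
CORE_POS = {5}
BOUNDARY_POS = {3, 7, 12, 18, 21, 22, 25}
GLY_POS = {9, 27}                 # positions with phi greater than 0 degrees


def allowed(pos):
    """Amino acids considered at a 1-based residue position."""
    if pos in CORE_POS:
        return CORE_AA
    if pos in BOUNDARY_POS:
        return BOUNDARY_AA
    if pos in GLY_POS:
        return GLY_ONLY
    return SURFACE_AA             # the remaining 18 positions


# Library size: product of the number of choices at each position.
n_sequences = 1
for pos in range(1, N_RESIDUES + 1):
    n_sequences *= len(allowed(pos))

# Mass of one molecule of every sequence, using the 3712 Da mean from note 17.
AVOGADRO = 6.02214076e23          # molecules per mole
MEAN_MASS_DA = 3712
mass_grams = n_sequences * MEAN_MASS_DA / AVOGADRO
mass_tonnes = mass_grams / 1e6

print(f"sequences: {n_sequences:.3e}")
print(f"mass: {mass_tonnes:.1f} metric tons")
```

The product is 7 × 16⁷ × 10¹⁸, about 1.88 × 10²⁷, and the mass comes out near 11.6 metric tons, in line with the figures in the text.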

The optimal sequence (Fig. 1) is called full sequence design 1 (FSD-1). Even though all of the hydrophilic amino acids were considered at each of the boundary positions, the algorithm selected only nonpolar amino acids. The eight core and boundary positions are predicted to form a well-packed buried cluster. The Phe side chains selected by the algorithm at positions 21 and 25, the zinc-binding His positions of Zif268, are more than 80 percent buried, and the Ala at position 5 is 100 percent buried, whereas the Lys at position 8 is more than 60 percent exposed to solvent (Fig. 2). The other boundary positions demonstrate the steric constraints on buried residues by packing similar side chains in an arrangement similar to that of Zif268 (Fig. 2). The calculated optimal configuration for core and boundary residues buries ~1150 Å² of nonpolar surface area. On the helix surface, the algorithm places Asn¹⁴ with a hydrogen bond between its side-chain carbonyl oxygen and the backbone amide proton of residue 16. The eight charged residues on the helix form three pairs of hydrogen bonds, although in our coiled-coil designs, helical surface hydrogen bonds appeared to be less important than the overall helix propensity of the sequence (5). Positions 4 and 11 on the exposed sheet surface were selected by the program to be Thr, one of the best β-sheet forming residues (21).

Fig. 1. Sequence of FSD-1 aligned with the second zinc finger of Zif268. The bar at the top of the figure shows the residue position classifications: the solid bar indicates the single core position, the hatched bars indicate the seven boundary positions, and the open bars indicate the 20 surface positions. The alignment matches positions of FSD-1 to the corresponding backbone template positions of Zif268. Of the six identical positions (21 percent) between FSD-1 and Zif268, four are buried (Ile⁷, Phe¹², Leu¹⁸, and Ile²²). The zinc binding residues of Zif268 are boxed. Representative nonoptimal sequence solutions determined by means of a Monte Carlo simulated annealing protocol are shown with their rank. Vertical lines indicate identity with FSD-1. The symbols at the bottom of the figure show the degree of sequence conservation for each residue position computed across the top 1000 sequences: filled circles indicate more than 99 percent conservation, half-filled circles indicate conservation between 90 and 99 percent, open circles indicate conservation between 50 and 90 percent, and the absence of a symbol indicates less than 50 percent conservation. The consensus sequence determined by choosing the amino acid with the highest occurrence at each position is identical to the sequence of FSD-1. Single-letter abbreviations for amino acid residues are as follows: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; and Y, Tyr.

Fig. 2. Comparison of Zif268 (9) and computed FSD-1 structures. (A) Stereoview of the second zinc finger module of Zif268 showing its buried residues and zinc binding site. (B) Stereoview of the computed orientations of buried side chains in FSD-1. For clarity, only side chains from residues 3, 5, 8, 12, 18, 21, 22, and 25 are shown. Color figures were created with MOLMOL (38).

Alignment of the sequences for FSD-1 and Zif268 (Fig. 1) indicates that only 6 of the 28 residues (21 percent) are identical and only 11 (39 percent) are similar. Four of the identities are in the buried cluster, which is consistent with the expectation that buried residues are more conserved than solvent-exposed residues for a given motif (22). A BLAST (23) search of the FSD-1 sequence against the nonredundant protein sequence database of the National Center for Biotechnology Information did not reveal any zinc finger protein sequences. Further, the BLAST search found only low identity matches of weak statistical significance to fragments of various unrelated proteins. The highest identity matches were 10 residues (36 percent) with P values ranging from 0.63 to 1.0, where P is the probability of a match being a chance occurrence. Random 28-residue sequences that consist of amino acids allowed in the ββα position classification described above produced similar BLAST search results, with 10- or 11-residue identities (36 to 39 percent) and P values ranging from 0.35 to 1.0, further suggesting that the matches for FSD-1 are statistically insignificant. The low identity with any known protein sequence demonstrates the novelty of the FSD-1 sequence and underscores that no sequence information from any protein motif was used in our sequence scoring function.

In order to examine the robustness of the computed sequence, we used the sequence of FSD-1 as the starting point of a Monte Carlo simulated annealing run. The Monte Carlo search revealed high scoring, suboptimal sequences in the neighborhood of the optimal solution (4). The energy spread from the ground-state solution to the 1000th most stable sequence is about 5 kcal/mol, an indication that the density of states is high. The amino acids comprising the core of the molecule, with the exception of position 7, are essentially invariant (Fig. 1). Almost all of the sequence variation occurs at surface positions, and typically involves conservative changes. Asn¹⁴, which is predicted to form a stabilizing hydrogen bond to the helix backbone, is among the most conserved surface positions. The strong sequence conservation observed for critical areas of the molecule suggests that, if a representative sequence folds into the design target structure, then many sequences whose variations do not disrupt the critical interactions may be equally competent. Even if billions of sequences would successfully achieve the target fold, they would represent only a very small proportion of the 10²⁷ possible sequences.
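The idea of sampling near-optimal sequences by starting a Monte Carlo simulated annealing run from the optimum can be sketched in Python as follows. The text does not specify the move set, cooling schedule, or acceptance rule, so everything here is an assumption: single-position substitutions, a geometric cooling schedule, the Metropolis criterion, and a toy energy that simply counts mismatches to a hypothetical five-residue optimum. The names `anneal`, `toy_energy`, `OPTIMUM`, and `CHOICES` are illustrative only.

```python
import math
import random


def anneal(energy, start, choices, steps=2000, t_start=2.0, t_end=0.05, seed=0):
    """Metropolis simulated annealing over sequences.

    Returns every accepted sequence with its energy, ranked from lowest to
    highest energy, so the list can be read like a table of suboptimal solutions.
    """
    rng = random.Random(seed)
    current = list(start)
    e_cur = energy(current)
    visited = {tuple(current): e_cur}
    for step in range(steps):
        # Geometric cooling from t_start to t_end (an assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        # Propose a substitution at one randomly chosen position.
        pos = rng.randrange(len(current))
        trial = list(current)
        trial[pos] = rng.choice(choices[pos])
        e_try = energy(trial)
        # Metropolis rule: always accept downhill, sometimes accept uphill.
        if e_try <= e_cur or rng.random() < math.exp((e_cur - e_try) / temp):
            current, e_cur = trial, e_try
            visited[tuple(current)] = e_cur
    return sorted(visited.items(), key=lambda item: item[1])


# Toy problem: a hypothetical 5-residue optimum and allowed residues per position.
OPTIMUM = "AKTFL"
CHOICES = ["AVL", "KRE", "TSN", "FYW", "LIV"]


def toy_energy(seq):
    """Placeholder energy: number of positions differing from OPTIMUM."""
    return float(sum(a != b for a, b in zip(seq, OPTIMUM)))


ranked = anneal(toy_energy, OPTIMUM, CHOICES)
best_seq, best_e = ranked[0]
print("lowest-energy sequence:", "".join(best_seq), best_e)
print("distinct sequences visited:", len(ranked))
```

Ranking the visited sequences by energy mirrors how the near-optimal solutions are reported by rank in Fig. 1, although a real run would use the full scoring function rather than this placeholder.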

Experimental validation. FSD-1 was synthesized in order to allow us to characterize its structure and assess the performance of the design algorithm (24). The far-ultraviolet (UV) circular dichroism (CD) spectrum of FSD-1 shows minima at 220 nm and 207 nm, which is indicative of a folded structure (Fig. 3A) (25). The thermal melt is weakly cooperative, with an inflection point at 39°C (Fig. 3B), and is completely reversible. The broad melt is consistent with a low enthalpy of folding, which is expected for a motif with a small hydrophobic core. This behavior contrasts with the uncooperative thermal unfolding transitions observed for other folded short peptides (26). FSD-1 is highly soluble (greater than 3 mM), and equilibrium sedimentation studies at 100 µM, 500 µM, and 1 mM show the protein to be monomeric (27). The sedimentation data fit well to a single-species monomer model with a molecular mass of 3630 at 1 mM, in good agreement with the calculated monomer mass of 3488. Also, far-UV CD spectra showed no concentration dependence from 50 µM to 2 mM, and nuclear magnetic resonance (NMR) COSY spectra taken at 100 µM and 2 mM were essentially identical.

Fig. 3. Circular dichroism (CD) measurements of FSD-1. (A) Far-UV CD spectrum of FSD-1 at 1°C. The minima at 220 and 207 nm indicate a folded structure. (B) Thermal unfolding of FSD-1 monitored by CD. The melting curve has an inflection point at 39°C. To illustrate the cooperativity of the thermal transition, the melting curve was fit to a two-state model (39), and the derivative of the fit is shown (inset). The melting temperature determined from this fit is 42°C.

The solution structure of FSD-1 was solved by means of homonuclear 2D ¹H NMR spectroscopy (28). NMR spectra were well dispersed, indicating an ordered protein structure and easing resonance assignments. Proton chemical shift assignments were determined with standard homonuclear methods (29). Unambiguous sequential and short-range NOEs (Fig. 4) indicate helical secondary structure from residues 15 to 26 in agreement with the design target. Representative long-range NOEs from the helix to Ile⁷ and Phe¹² indicate a hydrophobic core consistent with the desired tertiary structure (Fig. 4B).

Fig. 4. NOE contacts for FSD-1. (A) Sequential and short-range NOE connectivities. The d denotes a contact between the indicated protons. All adjacent residues are connected by Hα-HN, HN-HN, or Hβ-HN NOE crosspeaks. The helix (residues 15 to 26) is well defined by short-range connections, as is the hairpin turn at residues 7 and 8. (B) Representative NOE contacts from aromatic to methyl protons. Several long-range NOEs from Ile⁷ and Phe¹² to the helix help define the fold of the protein. The starred peak has an ambiguous F1 assignment, Ile²² Hd1 or Leu¹⁸ Hd2.

The structure of FSD-1 was determined from 284 experimental restraints (10.1 restraints per residue) that were nonredundant with covalent structure, including 274 NOE distance restraints and 10 hydrogen bond restraints involving slowly exchanging amide protons (30). Structure calculations were performed with X-PLOR (31) with the use of standard protocols for hybrid distance geometry-simulated annealing (32). An ensemble of 41 structures converged with good covalent geometry and no distance restraint violations greater than 0.3 Å (Fig. 5 and Table 1). The backbone of FSD-1 is well defined with a root-mean-square (rms) deviation from the mean of 0.54 Å (residues 3 to 26). Consideration of the buried side chains (Tyr³, Ala⁵, Ile⁷, Phe¹², Leu¹⁸, Phe²¹, Ile²², and Phe²⁵) along with the backbone gives an rms deviation of 0.99 Å, indicating that the core of the molecule is well ordered. The stereochemical quality of the ensemble of structures was examined with PROCHECK (33). Apart from the disordered termini and the glycine residues, 87 percent of the residues fall in the most favored region and the remainder in the allowed region of φ, ψ space. Modest heterogeneity is evident in the first strand (residues 3 to 6), which has an average backbone angular order parameter, S (34), of 0.96 ± 0.04 compared to the second strand (residues 9 to 12) with S = 0.98 ± 0.02 and the helix (residues 15 to 26) with S = 0.99 ± 0.01. Overall, FSD-1 is notably well ordered and, to our knowledge, is the shortest sequence consisting entirely of naturally occurring amino acids that folds to a well-ordered structure without metal binding, oligomerization, or disulfide bond formation (35).

Fig. 5. Solution structure of FSD-1. Stereoview showing the best-fit superposition of the 41 converged simulated annealing structures from X-PLOR (31). The backbone Cα trace is shown in blue and the side-chain heavy atoms of the hydrophobic residues (Tyr³, Ala⁵, Ile⁷, Phe¹², Leu¹⁸, Phe²¹, Ile²², and Phe²⁵) are shown in magenta. The amino terminus is at the lower left of the figure and the carboxyl terminus is at the upper right of the figure. The structure consists of two antiparallel strands from positions 3 to 6 (back strand) and 9 to 12 (front strand), with a hairpin turn at residues 7 and 8, followed by a helix from positions 15 to 26. The termini, residues 1, 2, 27, and 28, have very few NOE restraints and are disordered.

Table 1. NMR structure determination: distance restraints, structural statistics, and atomic root-mean-square (rms) deviations. ⟨SA⟩ are the 41 simulated annealing structures, SA is the average structure before energy minimization, (SA)_r is the restrained energy minimized average structure, and SD is the standard deviation.

Distance restraints
  Intraresidue                                    97
  Sequential                                      83
  Short range (|i - j| = 2 to 5 residues)         59
  Long range (|i - j| > 5 residues)               35
  Hydrogen bond                                   10
  Total                                          284

Structural statistics
  rms deviations                  ⟨SA⟩ ± SD            (SA)_r
  Distance restraints (Å)         0.043 ± 0.003        0.038
  Idealized geometry
    Bonds (Å)                     0.0041 ± 0.0002      0.0037
    Angles (degrees)              0.67 ± 0.02          0.65
    Impropers (degrees)           0.53 ± 0.05          0.51

Atomic rms deviations (Å)^*
                                      ⟨SA⟩ versus SA ± SD    ⟨SA⟩ versus (SA)_r ± SD
  Backbone                            0.54 ± 0.15            0.69 ± 0.16
  Backbone + nonpolar side chains†    0.99 ± 0.17            1.16 ± 0.18
  Heavy atoms                         1.43 ± 0.20            1.90 ± 0.29

^* Atomic rms deviations are for residues 3 to 26, inclusive. Residues 1, 2, 27, and 28 were disordered [φ, ψ angular order parameters (34) < 0.78] and had only sequential and |i - j| = 2 NOEs.

† Nonpolar side chains are from residues Tyr³, Ala⁵, Ile⁷, Phe¹², Leu¹⁸, Phe²¹, Ile²², and Phe²⁵, which constitute the core of the protein.

The packing pattern of the hydrophobic core of the NMR structure ensemble of FSD-1 (Tyr³, Ile⁷, Phe¹², Leu¹⁸, Phe²¹, Ile²², and Phe²⁵) is similar to the computed packing arrangement. Five of the seven residues have χ₁ angles in the same gauche⁺, gauche⁻, or trans category as the design target, and three residues match both χ₁ and χ₂ angles. The two residues that do not match their computed χ₁ angles are Ile⁷ and Phe²⁵, which is consistent with their location at the less constrained open end of the molecule. Ala⁵ is not involved in its expected extensive packing interactions and instead exposes about 45 percent of its surface area because of the displacement of the strand 1 backbone relative to the design template. Conversely, Lys⁸ behaves as predicted by the algorithm, with its solvent exposure (60 percent) and χ₁ and χ₂ angles matching the computed structure. Because there are few NOEs involving solvent-exposed side chains, most of these side chains are disordered in the solution structure, a state that precludes examination of the predicted surface residue hydrogen bonds. However, Asn¹⁴ forms a hydrogen bond from its side-chain carbonyl oxygen as predicted, but to the amide of Glu¹⁷, not Lys¹⁶ as expected from the design. This hydrogen bond is present in 95 percent of the structure ensemble and has a donor-acceptor distance of 2.6 ± 0.06 Å. In general, the side chains of FSD-1 correspond well with the design algorithm predictions, but further refinement of the scoring function and rotamer library should improve sequence selection and side-chain placement and improve the correlation between the predicted and observed structures.

We compared the average restrained minimized structure of FSD-1 and the design target (Fig. 6). The overall backbone rms deviation of FSD-1 from the design target is 1.98 Å for residues 3 to 26 and only 0.98 Å for residues 8 to 26 (Table 2). The largest difference between FSD-1 and the target structure occurs from residues 4 to 7, with a displacement of 3.0 to 3.5 Å of the backbone atom positions of strand 1. The agreement for strand 2, the strand-to-helix turn, and the helix is remarkable, with the differences nearly within the accuracy of the structure determination. For this region of the structure, the rms difference of φ, ψ angles between FSD-1 and the design target is only 14 ± 9°. In order to quantitatively assess the similarity of FSD-1 to the global fold of the target, we calculated their supersecondary structure parameter values (Table 2) (36, 37), which describe the relative orientations of secondary structure units in proteins. The values of θ, the inclination of the helix relative to the sheet, and Ω, the dihedral angle between the helix axis and the strand axes (see legend to Table 2), are nearly identical. The height of the helix above the sheet, h, is only 1 Å greater in FSD-1. A study of protein core design as a function of helix height for Gβ1 variants demonstrated that up to 1.5 Å variation in helix height has little effect on sequence selection (37). The comparison of supersecondary structure parameter values and backbone coordinates highlights the excellent agreement between the experimentally determined structure of FSD-1 and the design target, and demonstrates the success of our algorithm at computing a sequence for this ββα motif.

Fig. 6. Comparison of the FSD-1 structure (blue) and the design target (red). Stereoview of the best-fit superposition of the restrained energy minimized average NMR structure of FSD-1 and the backbone of Zif268. Residues 3 to 26 are shown.

Table 2. Comparison of the FSD-1 experimentally determined structure and the design target structure. The FSD-1 structure is the restrained energy minimized average from the NMR structure determination. The design target structure is the second DNA binding module of the zinc finger Zif268 (9).

Atomic rms deviations (Å)
  Backbone, residues 3 to 26          1.98
  Backbone, residues 8 to 26          0.98

Super-secondary structure parameters^*
                      FSD-1      Design target
  h (Å)                9.9        8.9
  θ (degrees)         14.2       16.5
  Ω (degrees)         13.1       13.5

^* h, θ, and Ω are calculated as described (36, 37). h is the distance between the centroid of the helix Cα coordinates (residues 15 to 26) and the least-squares plane fit to the Cα coordinates of the sheet (residues 3 to 12); θ is the angle of inclination of the principal moment of the helix Cα atoms with the plane of the sheet; Ω is the angle between the projection of the principal moment of the helix onto the sheet and the projection of the average least-squares fit line to the strand Cα coordinates (residues 3 to 6 and 9 to 12) onto the sheet.

The quality of the match between FSD-1 and the design target demonstrates the ability of our algorithm to design a sequence for a fold that contains the three major secondary structure elements of proteins: sheet, helix, and turn. Since the ββα fold is different from those used to develop the sequence-selection methodology, the design of FSD-1 represents a successful transfer of our algorithm to a new motif. Further tests of the performance of the algorithm on several different motifs are necessary, although its basis in physical chemistry and the absence of heuristics and subjective considerations should allow the algorithm to be used in many different structural contexts. Also, the generation of various kinds of backbone templates for use as input to our fully automated sequence selection algorithm could enable the design of new protein folds. Recent results indicate that the sequence selection algorithm is not sensitive to even fairly large perturbations in backbone geometry and should be robust enough to accommodate computer-generated backbones (37).

The key to using a quantitative method for the FSD-1 design, and for the continued development of the methodology, is the tight coupling of theory, computation, and experiment used to improve the accuracy of the physical chemical potential functions in our algorithm. When combined with these potential functions, computational optimization methods such as DEE can rapidly find sequences for structures too large for experimental library screening or too complex for subjective approaches. Given that the FSD-1 sequence was computed with only a 4-GigaFLOPS computer (19), and that TeraFLOPS computers are now available with PetaFLOPS computers on the drawing board, the prospect for pursuing even larger and more complex designs is excellent.

REFERENCES AND NOTES

M. H. J. Cordes, A. R. Davidson, R. T. Sauer, Curr. Opinion Struct. Biol. 6, 3 (1996).
D. Y. Jackson, et al., Science 266, 243 (1994)[ISI][Medline]; B. Li, et al., ibid.270, 1657 (1995)[Abstract]; J. S. Marvin, et al., Proc. Natl. Acad. Sci. U.S.A. 94, 4366 (1997)[ISI][Abstract/Full Text].
H. W. Hellinga, J. P. Caradonna, F. M. Richards, J. Mol. Biol. 222, 787 (1991)[ISI][Medline]; J. H. Hurley, W. A. Baase, B. W. Matthews, ibid.224, 1143 (1992)[Medline]; J. R. Desjarlais and T. M. Handel, Protein Sci. 4, 2006 (1995)[ISI][Medline]; P. B. Harbury, B. Tidor, P. S. Kim, Proc. Natl. Acad. Sci. U.S.A.92, 8408 (1995)[ISI][Abstract]; M. Klemba, K. H. Gardner, S. Marino, N. D. Clarke, L. Regan, Nature Struc. Biol. 2, 368 (1995); S. F. Betz and W. F. Degrado, Biochemistry 35, 6955 (1996)[ISI][Medline].
B. I. Dahiyat and S. L. Mayo, Protein Sci. 5, 895 (1996)[ISI][Medline].
___, ibid.6, 1333 (1997)[Medline].
___, Proc. Natl. Acad. Sci. U.S.A. 94, 10172 (1997)[Abstract/Full Text].
J. W. Ponder and F. M. Richards, J. Mol. Biol. 193, 775 (1987)[ISI][Medline].
[See J. Desmet, M. De Maeyer, B. Hazes, I. Lasters, Nature 356, 539 (1992)[ISI]; R. F. Goldstein, Biophys. J. 66, 1335 (1994)[ISI][Abstract]; M. De Maeyer, J. Desmet, I. Laster, Folding Design 2, 53 (1997)] DEE finds and eliminates rotamers that are mathematically provable to be inconsistent (or dead-ending) with the global minimum energy solution of the system. A rotamer r at some residue position i will be dead-ending if, when compared with some other rotamer t, at the same residue position the following inequality is satisfied:

<IT>E</IT>(<IT>i</IT><SUB><IT>r</IT></SUB>) − <IT>E</IT>(<IT>i</IT><SUB><IT>t</IT></SUB>) + <LIM><OP>∑</OP><LL><IT>j</IT></LL></LIM><LIM><OP> min</OP><LL><IT>s</IT></LL></LIM>[<IT>E</IT>(<IT>i<SUB>r</SUB>j</IT><SUB><IT>s</IT></SUB>) − <IT>E</IT>(<IT>i<SUB>t</SUB>j</IT><SUB><IT>s</IT></SUB>)] > 0

i_r

i_t

i_rj_s

i_tj_s

min_s

N. P. Pavletich and C. O. Pabo, Science 252, 809 (1991)[Medline].
M. D. Struthers, R. P. Cheng, B. Imperiali, ibid.271, 342 (1996)[Abstract].
B. I. Dahiyat and S. L. Mayo, unpublished results.
Potential functions and parameters for van der Waals interactions, solvation, hydrogen bonding, and secondary structure propensity are described in our previous work (4-6). A secondary structure propensity potential was used for surface -sheet positions where the i 1 and i + 1 residues were also in -sheet conformations (5). Propensity values from Serrano and co-workers were used [ V. Munoz and L. Serrano, Proteins Struct. Funct. Genet. 20, 301 (1994)[ISI]].
F. C. Bernstein, et al., J. Mol. Biol. 112, 535 (1977)[ISI][Medline].
The coordinates of PDB record 1zaa (9, 13) from residues 33 to 60 were used as the structure template. In our numbering, position 1 corresponds to 1zaa position 33. The program BIOGRAF (Molecular Simulations, Inc., San Diego, CA) was used to generate explicit hydrogens on the structure which was then conjugate-gradient minimized for 50 steps by means of the Dreiding force field (40).
A solvent-accessible surface was generated using the Connolly algorithm (41) with a probe radius of 8.0 Å, a dot density of 10 Å⁻², and a Cα radius of 1.95 Å. A residue was classified as a core position if the distance from its Cα, along its Cα-Cβ vector, to the solvent-accessible surface was greater than 5.0 Å, and if the distance from its Cβ to the nearest surface point was greater than 2.0 Å. The remaining residues were classified as surface positions if the sum of the distances from their Cα, along their Cα-Cβ vector, to the solvent-accessible surface plus the distance from their Cβ to the closest surface point was less than 2.7 Å. All remaining residues were classified as boundary positions. The classifications for Zif268 were used as computed except that positions 1, 17, and 23 were converted from the boundary to the surface class to account for end effects from the proximity of chain termini to these residues in the tertiary structure and inaccuracies in the assignment.
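The three-way classification in this note reduces to two threshold tests applied in order. The following Python sketch assumes the two distances have already been measured from the solvent-accessible surface; the function and argument names are hypothetical, and the manual reassignment of positions 1, 17, and 23 is not modeled.

```python
def classify_position(d_along_vector: float, d_nearest: float) -> str:
    """Classify a residue position from two distances in angstroms.

    d_along_vector: distance to the solvent-accessible surface measured
                    along the residue's direction vector.
    d_nearest:      distance to the nearest surface point.
    Both argument names are illustrative, not taken from the paper.
    """
    # Core: buried on both measures.
    if d_along_vector > 5.0 and d_nearest > 2.0:
        return "core"
    # Surface: tested only for residues that were not classified as core.
    if d_along_vector + d_nearest < 2.7:
        return "surface"
    # Everything else falls in between.
    return "boundary"


for distances in [(6.0, 2.5), (1.0, 1.0), (4.0, 1.5)]:
    print(distances, classify_position(*distances))
```

Because the core test runs first, a residue that fails only one of the two core thresholds can still end up as a boundary position rather than a surface one.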
One core position (7 possible amino acids), 7 boundary positions (16 possible amino acids), 18 surface positions (10 possible amino acids), and 2 positions with φ greater than 0° (1 possible amino acid) result in 7 × 16⁷ × 10¹⁸ × 1² = 1.88 × 10²⁷ possible amino acid sequences.
1.88 × 10²⁷ peptide molecules, with an average mass of 3712 daltons for the possible compositions allowed by the residue position classification, would weigh (1.88 × 10²⁷ * 3712 daltons) = 1.159 × 10⁷ g = 11.6 metric tons.
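The library size and mass quoted in these two notes can be checked with a few lines of Python. The dalton-to-gram conversion factor is a standard physical constant supplied here, not a value from the paper, and the variable names are illustrative.

```python
from math import prod

# (number of positions, amino acids allowed at each), as listed in the note.
position_classes = [
    (1, 7),    # core
    (7, 16),   # boundary
    (18, 10),  # surface
    (2, 1),    # positions restricted to a single amino acid
]

# Total number of sequences: product of choices over all 28 positions.
library_size = prod(n_aa ** n_pos for n_pos, n_aa in position_classes)

DALTON_IN_GRAMS = 1.66053907e-24  # standard conversion, not from the paper
AVERAGE_MASS_DALTONS = 3712       # average peptide mass given in the note

# One copy of every sequence, converted to grams and then metric tons.
mass_grams = library_size * AVERAGE_MASS_DALTONS * DALTON_IN_GRAMS
mass_tonnes = mass_grams / 1e6

print(f"{library_size:.2e} sequences")
print(f"{mass_grams:.3e} g = {mass_tonnes:.1f} metric tons")
```

Carrying the unrounded library size through gives about 1.158 × 10⁷ g, which agrees with the quoted 11.6 metric tons after rounding.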
As in our previous work (5), a backbone-dependent rotamer library was used [R. L. Dunbrack and M. Karplus, J. Mol. Biol. 230, 543 (1993)]. All His rotamers were protonated on both Nδ and Nε.
All calculations were performed on a Silicon Graphics Power Challenge server with 10 R10000 processors running in parallel. Peak performance is 3.9 GigaFLOPS (FLOPS = floating point operations per second).
The sequence optimization consists of two phases: pairwise rotamer energy calculations and DEE searching. The DEE optimization was initially run with control parameters set for optimal speed followed by a DEE-based, residue-pairwise, round-robin optimization. The energy calculations took 53 CPU (central processing unit) hours and sequence optimizations took 37 CPU hours.
C. W. A. Kim and J. M. Berg, Nature 362, 267 (1993); D. L. Minor and P. S. Kim, ibid. 367, 660 (1994); C. K. Smith, J. M. Withka, L. Regan, Biochemistry 33, 5510 (1994).
J. U. Bowie, J. F. Reidhaar-Olson, W. A. Lim, R. T. Sauer, Science 247, 1306 (1990).
S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, J. Mol. Biol. 215, 403 (1990).
FSD-1 was synthesized by means of standard solid-phase Fmoc chemistry. The peptide was cleaved from the resin with trifluoroacetic acid and purified by reversed-phase high-performance liquid chromatography. Peptide was lyophilized and stored at −20°C. Matrix-assisted laser desorption mass spectrometry yielded a molecular weight of 3489.7 daltons (3489.0 calculated).
Protein concentration was 50 µM in 50 mM sodium phosphate at pH 5.0. The spectrum was acquired at 1°C in a 1-mm cuvette and was baseline-corrected with a buffer blank. The spectrum is the average of 3 scans with a 1-s integration time and 1-nm increments. All CD data were acquired on an Aviv 62DS spectrometer equipped with a thermoelectric temperature control unit. Thermal unfolding was monitored at 218 nm in a 1-mm cuvette with 2° increments and an averaging time of 40 s and an equilibration time of 120 s per increment. Reversibility was confirmed by comparison of 1°C CD spectra before and after heating to 99°C. Peptide concentrations were determined by UV spectrophotometry.
J. M. Scholtz, et al., Proc. Natl. Acad. Sci. U.S.A. 88, 2854 (1991); M. A. Weiss and H. T. Keutmann, Biochemistry 29, 9808 (1990); M. D. Struthers, R. P. Cheng, B. Imperiali, J. Am. Chem. Soc. 118, 3073 (1996).
Sedimentation equilibrium studies were performed on a Beckman XL-A ultracentrifuge equipped with an An-60 Ti analytical rotor at a speed of 40,000 rpm. Protein concentration was 100 µM, 500 µM, or 1 mM in 50 mM sodium phosphate at pH 5.0 and 7°C. Absorption was monitored at 286 nm (500 µM and 1 mM) or 234 nm (100 µM). Concentration profiles were fit to an ideal single species model which resulted in randomly distributed residuals.
NMR data were collected on a Varian Unityplus 600 MHz spectrometer equipped with a Nalorac inverse probe with a self-shielded z-gradient. NMR samples (~2 mM) were prepared in H₂O-D₂O (90:10) or in 99.9 percent D₂O with 50 mM sodium phosphate at pH 5.0 (uncorrected glass electrode). All spectra were collected at 7°C. DQF-COSY [U. Piantini, O. W. Sorensen, R. R. Ernst, J. Am. Chem. Soc. 104, 6800 (1982)], TOCSY [A. Bax and D. G. Davis, J. Magnetic Reson. 65, 355 (1985)], and NOESY [J. Jeener, B. H. Meier, P. Bachmann, R. R. Ernst, J. Chem. Phys. 71, 4546 (1979)] spectra were acquired to accomplish resonance assignments and structure determination. NOESY spectra were recorded with mixing times of 200 ms for use during resonance assignments and 100 ms to derive distance restraints. Water suppression was accomplished either with presaturation during the relaxation delay or pulsed field gradients [M. Piotto, V. Saudek, V. Sklenar, J. Biomol. NMR 2, 661 (1992)]. Spectra were processed with VNMR (Varian Associates, Palo Alto, CA), and spectra were assigned with ANSIG [P. J. Kraulis, J. Magnetic Reson. 24, 627 (1989)].
K. Wuthrich, NMR of Proteins and Nucleic Acids (Wiley, New York, 1986).
NOEs were classified into three distance-bound ranges based on cross-peak intensity calibrated to the Tyr³ Hδ-Hε crosspeak: strong (1.8 to 2.7 Å), medium (1.8 to 3.3 Å), and weak (1.8 to 5.0 Å). Upper bounds for restraints involving methyl protons were increased by 0.5 Å to account for the increased intensity of methyl resonances. All partially overlapped NOEs were set to weak restraints. Hydrogen bond restraints were derived from hydrogen-deuterium exchange kinetics measurements followed by one-dimensional ¹H spectroscopy. Unambiguously assigned amide peaks for Tyr³, Phe¹², Leu¹⁸, Phe²¹, and Phe²⁵ were protected from exchange at 7°C, pH 5.0. Hydrogen bond restraints (two per hydrogen bond) were included only at the late stages of structure refinement, when initial calculations indicated the donor-acceptor pairings.
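The mapping from NOE class to distance bounds stated in this note can be written as a small lookup. The sketch below is illustrative only: the function and flag names are hypothetical, the intensity thresholds that assign a cross-peak to a class are not given in the note and are therefore not modeled, and applying the methyl correction to an overlapped (weak) restraint is an assumption about how the two rules combine.

```python
from typing import Tuple

# Distance bounds in angstroms for each NOE intensity class, as given in the note.
NOE_BOUNDS = {
    "strong": (1.8, 2.7),
    "medium": (1.8, 3.3),
    "weak": (1.8, 5.0),
}
METHYL_CORRECTION = 0.5  # added to the upper bound for methyl protons


def restraint_bounds(intensity_class: str,
                     involves_methyl: bool = False,
                     partially_overlapped: bool = False) -> Tuple[float, float]:
    """Return (lower, upper) distance bounds for one NOE restraint."""
    # Partially overlapped cross-peaks are always treated as weak.
    if partially_overlapped:
        intensity_class = "weak"
    lower, upper = NOE_BOUNDS[intensity_class]
    # Assumption: the methyl correction also applies to overlapped peaks.
    if involves_methyl:
        upper += METHYL_CORRECTION
    return lower, upper


print(restraint_bounds("strong"))
print(restraint_bounds("medium", involves_methyl=True))
print(restraint_bounds("strong", partially_overlapped=True))
```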
A. T. Brünger, X-PLOR, version 3.1, A System for X-ray Crystallography and NMR (Yale Univ. Press, New Haven, CT, 1992).
Standard hybrid distance geometry-simulated annealing protocols were followed [M. Nilges, G. M. Clore, A. M. Gronenborn, FEBS Lett. 229, 317 (1988); M. Nilges, J. Kuszewski, A. T. Brünger, in Computational Aspects of the Study of Biological Macromolecules by NMR, J. C. Hoch, Ed. (Plenum, New York, 1991); J. Kuszewski, M. Nilges, A. T. Brünger, J. Biomol. NMR 2, 33 (1992)]. Distance geometry structures (100) were generated, regularized, and refined, resulting in an ensemble, called SA, of 41 structures with no restraint violations greater than 0.3 Å, rms deviations from idealized bond lengths less than 0.01 Å, and rms deviations from idealized bond angles and impropers less than 1°. An average structure was generated by superimposing and then averaging the coordinates of the ensemble, followed by refinement and restrained minimization.
R. A. Laskowski, M. W. MacArthur, D. S. Moss, J. M. Thornton, J. Appl. Crystallogr. 26, 283 (1993).
S. G. Hyberts, M. S. Goldberg, T. F. Havel, G. Wagner, Protein Sci. 1, 736 (1992).
C. J. McKnight, P. T. Matsudaira, P. S. Kim, Nature Struct. Biol. 4, 180 (1997).
J. Janin and C. Chothia, J. Mol. Biol. 143, 95 (1980); F. E. Cohen, M. J. E. Sternberg, W. R. Taylor, ibid. 156, 821 (1982).
A. Su and S. L. Mayo, Protein Sci. 6, 1701 (1997).
R. Koradi, M. Billeter, K. Wuthrich, J. Mol. Graph. 14, 51 (1996).
W. J. Becktel and J. A. Schellman, Biopolymers 26, 1859 (1987).
S. L. Mayo, B. D. Olafson, W. A. Goddard III, J. Phys. Chem. 94, 8897 (1990).
M. L. Connolly, Science 221, 709 (1983).
We thank P. Poon and T. Laue for sedimentation equilibrium measurements and discussions, A. Su for assistance calculating super-secondary structure parameters, S. Ross for assistance with NMR measurements, G. Hathaway for mass spectrometry, J. Abelson and P. Bjorkman for critical reading of the manuscript, and R. A. Olofson for helpful discussions. Supported by the Howard Hughes Medical Institute (S.L.M.), the Rita Allen Foundation, the Chandler Family Trust, the Booth Ferris Foundation, the David and Lucile Packard Foundation, the Searle Scholars Program and The Chicago Community Trust, and grant GM08346 from the National Institutes of Health (B.I.D.). Coordinates and NMR restraints have been deposited in the Brookhaven Protein Data Bank with accession numbers 1FSD and R1FSDMR, respectively.

16 June 1997; accepted 8 September 1997
