PUMA: an Integration of Biological Data
to Support the Interpretation of Genomes

Terry Gaasterland / Natalia Maltsev / Ross Overbeek / Evgeni Selkov

What Is PUMA?

PUMA is a web-based system that offers integrated access to biological data. It is intended as an environment to support the interpretation and presentation of genomes.

It is now clear that the biological community will have access to at least three or four complete genomes before the end of this year, and it seems likely that this number will grow rapidly during the next few years. It is important that we construct an environment to support effective analysis of these genomes. PUMA is a system that attempts to integrate access to the emerging body of biological sequence data from within a functional context; it is an integration based on a functional overview of an organism and offers access through structured presentations of the data -- most notably, metabolic pathways, alignments, and phylogenetic trees. We believe that this organization of data will be both useful and necessary as researchers attempt to fill in the emerging image of life at the molecular level.

Enough discussion; let's look at the overview for a single organism to get an idea of exactly what we mean. When you click on the following link, you will move to the PUMA overview for Escherichia coli. When you get there, you will be positioned on a functional outline which contains metabolic pathways, protein sequences, and alignments -- all grouped into categories representing a functional breakdown of the components of the organism. We suggest that you explore the overview for a few minutes and then return to this point, so that we can continue our discussion of what we have built and why we built it.

Functional Overview for Escherichia coli.

So, now that you have had an initial exposure, let us return to the question "What is PUMA?" We have organized the data in PUMA in terms of typed objects with relationships to other objects. The functional overview is designed to offer a context in which you can locate objects. To help you migrate between the different classes of objects, here is a EMP). The metabolic pathways have been made available on the web, although the EMP itself remains proprietary.

Metabolic Pathway objects are connected to enzyme and compound objects. To see what we mean, you might click on the following link to a standard version of glycolysis:

Glycolysis

If you peruse the diagram, you should note that most of the enzymes and compounds are links (to enzyme and compound objects).

Enzymes

Enzyme objects are related to metabolic pathways, alignments and to the component of the Japanese GenomeNet database dealing with data on enzymes. If you click on the following link, you will go to the enzyme object for enzyme 2.7.1.11 (6-phosphofructokinase).

2.7.1.11

When you get to the enzyme object, the title contains a link to the Japanese database. Below the title are links to alignment objects which contain protein sequences that are part of specific instances of the enzyme. That list of links is followed by links to the metabolic pathway objects that utilize the enzyme. Note that there are often numerous pathways that are essentially the same; they often contain slight variations which become important for specific organisms. When you see the actual overview for an organism, only the appropriate variant will be referenced from the overview. However, the enzyme object refers to the generic notion and connects to all variants of the pathway.

Compounds

Compound objects connect to metabolic pathway objects. They normally connect to three distinct sets: pathways that utilize the compound as a substrate, those that produce the compound as a product, and those in which the compound occurs as an intermediate. To see an instance of this, click on the following link:

phosphoenolpyruvate
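The three sets attached to a compound object can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the pathway representation (a list of reactions, each a pair of substrate and product sets), the toy pathway fragments with cofactors omitted, and the rule that a compound both consumed and produced within one pathway counts as an intermediate. PUMA's actual data model is not described in this text.

```python
from collections import defaultdict

# Hypothetical representation: each pathway is a list of reactions, and each
# reaction is a (substrates, products) pair. Cofactors are omitted on purpose.
PATHWAYS = {
    "glycolysis (fragment)": [
        ({"2-phosphoglycerate"}, {"phosphoenolpyruvate"}),
        ({"phosphoenolpyruvate"}, {"pyruvate"}),
    ],
    "PEP carboxylation (toy)": [
        ({"phosphoenolpyruvate"}, {"oxaloacetate"}),
    ],
}

def compound_roles(pathways):
    """Map each compound to the pathways where it is substrate, product, or intermediate."""
    roles = defaultdict(lambda: {"substrate": set(), "product": set(), "intermediate": set()})
    for name, reactions in pathways.items():
        consumed = set().union(*(subs for subs, _ in reactions))
        produced = set().union(*(prods for _, prods in reactions))
        for compound in consumed | produced:
            if compound in consumed and compound in produced:
                roles[compound]["intermediate"].add(name)   # made and used inside the pathway
            elif compound in consumed:
                roles[compound]["substrate"].add(name)      # enters the pathway from outside
            else:
                roles[compound]["product"].add(name)        # leaves the pathway
    return roles

roles = compound_roles(PATHWAYS)
pep = roles["phosphoenolpyruvate"]
```

In this toy data, phosphoenolpyruvate is an intermediate of the glycolysis fragment and a substrate of the carboxylation fragment, which gives two of the three link sets a compound object would carry.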

Alignments

Alignments represent one of the most powerful tools for comparative analysis. We constructed a large set from entries in the Swiss Protein Database. However, after a comparison with the set generated by Randy Smith at the Baylor College of Medicine, we decided that Randy had a more complete set, that he was already maintaining the set (to support his efforts relating to more sensitive similarity searches), and that it made sense to utilize his collection.

As recently as just two or three years ago, the creation and maintenance of such a collection of alignments appeared to be a truly formidable task. Now that initial collections of alignments have appeared, and alignments done manually by experts are being released, we have the ability to offer the casual user access to a wealth of data.

If you click on the following link, you will gain access to one of the alignment objects:

448.29

Note that we use Randy's identifiers. The alignment object itself has links to Randy's alignment in report or FastA format. If you look at the report format, you can gain access to information about each of the related sequences.

Below the links offering access to Randy's alignments, you will note a link to the "corresponding DNA". This is a little additional feature that we have added. We have linked many of the Swiss Protein Database entries to the corresponding DNA sequences from EMBL. When you access this link, we get as many versions of the DNA as we can, we group the upstream regions (of up to 200 characters), we align the DNA (to agree exactly with Randy's protein alignments), and then group the downstream regions. This should allow researchers easy access to upstream regions of the proteins for use in searching for regulatory signals.
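The grouping and alignment steps just described might be sketched as follows. This is not PUMA's implementation: the function names, the coordinate convention (0-based, end-exclusive), and the assumption that each coding sequence has exactly one codon per residue are ours. Only the 200-character upstream limit comes from the text above.

```python
def split_regions(dna, cds_start, cds_end, flank=200):
    """Return (upstream, coding, downstream), keeping at most `flank` upstream bases."""
    upstream = dna[max(0, cds_start - flank):cds_start]
    return upstream, dna[cds_start:cds_end], dna[cds_end:]

def thread_dna(aligned_protein, cds, gap="-"):
    """Expand a gapped protein row into a gapped DNA row, one codon per residue.

    Each protein gap becomes three DNA gaps, so the DNA columns agree
    exactly with the columns of the protein alignment.
    """
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(cds) < 3 * residues:
        raise ValueError("coding sequence is shorter than the protein requires")
    out, pos = [], 0
    for ch in aligned_protein:
        if ch == gap:
            out.append(gap * 3)
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)

# Toy record: 4 upstream bases, a 3-codon coding region, 5 downstream bases.
dna = "GGCC" + "ATGAAATTT" + "TAAGG"
upstream, coding, downstream = split_regions(dna, cds_start=4, cds_end=13)
aligned_dna = thread_dna("M-KF", coding)   # protein row with one gap column
```

Because the gaps are copied from the protein alignment rather than recomputed, the DNA rows cannot disagree with the protein rows; the upstream and downstream segments are simply kept alongside, unaligned.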

Finally, we offer a link to a phylogenetic tree computed from the protein alignment using the Fitch tool distributed by Joseph Felsenstein in the Phylip package. These trees are unrooted. Further, it is clear that more carefully done alignments would be useful in many cases (however, there are over 12,000 alignments, so we had to cut a few corners). Gradually, we hope to build up a reliable set. For now, they are intended only as aids to allow users to phylogenetically group the sequences if they extract them and perform a more extensive analysis. Note that there are often paralogous copies of proteins from the same organism, so some care should be exercised in interpreting these trees.

After you have played with one of Randy's alignments, we feel we must show you an alternative representation:

A3176

This was one of our old alignments. We like the format a bit better, and we are encouraging Randy to switch. Our version includes a mask on top that shows both conservation and an estimate of reliability. Links to a descriptive page, an alignment editor, and a phylogenetic tree representation are available. Further, we offer a section that shows relationships between alignments (in which subsections of the alignments were recognizably homologous).

Now, before moving on to discuss the motivation and research goals addressed by PUMA, let us briefly discuss the overall structure of the home page for PUMA:

PUMA

This link takes us to a main page, which just asks you to choose between versions with or without graphical images. Once you pick one, you will be given a choice between

A general functional overview
Functional overviews corresponding to nodes in the phylogenetic tree
Indexed access to PUMA objects

The first option gives you access to our "general functional overview", which is a concept we describe below. The second allows you to move through the phylogenetic tree, examining functional overviews of specific organisms or ancestral nodes in the tree. You have already seen the functional overview for E. coli. The last option is just a simple forms-based access to specific PUMA objects via IDs. Like many aspects of PUMA, we hope to improve our indexed access over the next few months; by December, we must be ready to offer good access to the burgeoning set of complete genomes!

Maintaining Overviews for Organisms

PUMA attempts to maintain more-or-less current overviews of those organisms for which substantial sequencing is taking place. There are a growing number of organisms for which at least 50 sequences now exist in the Swiss Protein Database (over 300 such organisms). It seems likely that there will be thousands of such organisms within two or three years. This raises the question: "Should any attempt be made to maintain overviews for more than just a few genomes?" The answer to this question must hinge on how costly it will be to actually maintain a large collection, as opposed to maintaining overviews for only a limited set of model organisms.

Our attempt to place a large number of organisms within a coherent integration of alignments and pathways has been based on the generation of tools to automate the process. Our approach proceeded roughly as follows:

The Outline
First, we constructed a general outline which attempts to group lower-level functions within a single structured view of the subsystems that make up organisms. To attempt to create such a single overview that effectively captures the groupings that exist within the diverse organisms requires a certain level of bravado. No matter what overview is compiled, it will be difficult to characterize the result as "natural" or "compelling". The general overview that we show in this version of PUMA has been subjected to criticism from a number of knowledgeable biologists. We intend to continue allowing others to review and alter it.
Grouping Pathways and Alignments into Functional Topics
The outline itself is maintained as an ASCII file. We attach to each leaf in the outline patterns that can be used to scan through pathways and alignments for those entries to be placed within the lowest-level topic. A given pathway or alignment is often placed under several leaves, and that seems reasonable to us. Unfortunately, it is often the case that patterns pick up entries which really do not belong. The process of gradually refining the pattern set, altering the descriptions of objects that are scanned using the patterns, and evaluating the results has been time-consuming, and much remains to be done. However, the version that we are now presenting does, in our view, offer a resource to the community that did not exist before.
Specializing the Outline to Specific Organisms

We have collected a set of tools that scan all protein sequences for a given organism, note what alignments, enzymes, and pathways they correspond to, and connect them to the leaves in a copy of the outline for the particular organism. Topics in the general outline for which no sequences exist are automatically deleted from the customized version. Thus, in the outline for E. coli, the subtopic of methanogenesis has been deleted. When this operation is performed, the aspects of the general outline that seem most controversial (e.g., Should special topics that relate only to very small groups of organisms be included?) disappear; we believe that the process produces fairly usable overviews, and that the overall cost of generating hundreds of such overviews is not significantly larger than generating the set required to capture just the model organisms.
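The last two steps above, grouping entries under leaves by pattern and then specializing the outline to one organism, can be sketched together. This is an illustration under stated assumptions, not the PUMA tools themselves: the `Topic` class, the use of regular expressions as the pattern language, the entry identifiers, and the description strings are all hypothetical.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Topic:
    title: str
    patterns: list = field(default_factory=list)   # regexes; only used on leaves
    children: list = field(default_factory=list)
    entries: list = field(default_factory=list)    # IDs of pathways/alignments

def assign(topic, descriptions):
    """Attach each entry to every leaf whose patterns match its description."""
    if not topic.children:
        topic.entries = [
            entry_id for entry_id, text in descriptions.items()
            if any(re.search(p, text, re.IGNORECASE) for p in topic.patterns)
        ]
    for child in topic.children:
        assign(child, descriptions)

def specialize(topic, present):
    """Copy the outline, keeping only entries in `present`; drop empty topics."""
    if not topic.children:
        kept = [e for e in topic.entries if e in present]
        return Topic(topic.title, topic.patterns, [], kept) if kept else None
    kids = [k for k in (specialize(c, present) for c in topic.children) if k]
    return Topic(topic.title, topic.patterns, kids) if kids else None

# Hypothetical general outline and entry descriptions.
general = Topic("Metabolism", children=[
    Topic("Glycolysis", [r"glycoly", r"phosphofructokinase"]),
    Topic("Methanogenesis", [r"methanogen", r"methyl-coenzyme M"]),
])
descriptions = {
    "P1": "6-phosphofructokinase, glycolysis",
    "P2": "methyl-coenzyme M reductase",
    "A1": "pyruvate kinase; glycolytic enzyme",
}
assign(general, descriptions)

# Entries for which the organism has at least one sequence (made-up set).
ecoli = specialize(general, present={"P1", "A1"})
ecoli_topics = [child.title for child in ecoli.children]
```

The pruning is a bottom-up pass: a leaf survives only if it keeps at least one entry, and an inner topic survives only if it keeps at least one child. The Methanogenesis topic therefore drops out of the specialized copy, while the general outline is left untouched for the next organism.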

The goal that has often been articulated is to visualize the overall phylogenetic tree and to gradually characterize what is known about the organisms represented by each node. This would allow researchers, instructors and students to follow the central design commitments represented at each node and allow one to reach a more meaningful integration of the wealth of detailed data that now exists. PUMA does not achieve the framework to support such a view, but it does represent a significant step along the way to constructing it.

On a Somewhat More Philosophical Note

For the past six months, we have posed the following question to a number of senior biologists: "How many genomes do you expect to be completely sequenced by the turn of the century?" The answers have ranged from 5 to 300. People whose opinions we sincerely respect answered at both the low and high ends of those estimates. We started asking the question because our own estimate ("hundreds") appeared somewhat counter to current wisdom. In fact, now that it has been established that complete microbial genomes can be sequenced for well under a dollar per base, the answer probably hinges on what is achieved from access to the first few completely sequenced genomes.

What Biological Problems Will Be Addressed Using Complete Genomes?

A cursory analysis of computational biology and genetic sequence analysis might lead one to believe that most of the central issues involve the relationship between molecular structure and function. The role of molecular structure is certainly critical, but it is worth emphasizing that a great deal can be achieved without a detailed knowledge of structural relationships. This is important, since most of what is learned from these early genomes will have an indirect and fairly long-range impact on the essential problems relating to determination of the structural basis of molecular functions. We believe that the ultimate solution of some of these structural issues will grow out of much more rapid advances that emerge more directly from the analysis of sequence.

So, what are the central problems that will be addressed by genetic sequence analysis during the next few years?

The Metabolic Graph

A rapid estimation of the metabolic pathways present in microbial organisms can be achieved from a partial list of enzymes, which in turn can be acquired directly from sequence similarity data. To do this requires a fairly comprehensive collection of known pathways (with enzymes identified via EC numbers), along with protein sequences connected to the enzyme numbers. These now exist and are available. The related problem of determining a set of enzymes that must almost certainly exist, but for which no sequence has yet been identified, becomes relatively easy in the presence of a large, organized body of known pathways. Connecting enzymes that must be present to sequences for which the function is not yet known should be simplified by the presence of multiple complete genomes (e.g., one has the added constraint that one is looking for a sequence or set of sequences that must occur in one set of genomes, but not in others).
Metabolic Regulation

Progress in "engineering the metabolism" of microbes will often require at least a rudimentary understanding of metabolic regulation. If we draw the distinction between metabolic and genetic regulation of pathways (with the metabolic regulation occurring at far more rapid timescales), effective modification of organisms for bioremediation and related applications will almost certainly require at least a grasp of local metabolic regulatory mechanisms. It seems likely that the essential features of such regulation will often be understood long before the exact structural basis for the regulation can be clarified.
Protein Families

While we have expressed some skepticism relating to the speed with which structure/function relationships can be explicated, fundamental insights will almost certainly emerge rapidly from detailed analysis of the evolution of protein families. These insights will play a major supporting role in numerous specific problems. The clarification of how function evolved in specific cases will frequently be based initially on alignments, phylogenetic evidence, and correlation with known structural data. The current efforts towards integrating these categories of data are proceeding at an astonishing rate.
Regulation of Expression

With the availability of a number of phylogenetically diverse, complete genomes, we will be in a position to more rapidly identify operon structures and regulatory signals. In many instances, partial genomes from closely-related organisms will be used to advantage.
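The estimation described under "The Metabolic Graph" above can be sketched in a few lines. The scoring rule (fraction of a pathway's EC numbers observed), the 0.5 threshold, and the pathway fragments are assumptions made for illustration; the text does not say how PUMA ranks pathways.

```python
# Hypothetical pathway collection: pathway name -> set of EC numbers.
# These are small illustrative subsets, not complete pathways.
KNOWN_PATHWAYS = {
    "glycolysis (fragment)": {"2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13"},
    "methanogenesis (fragment)": {"2.8.4.1", "1.12.98.1"},
}

def score_pathways(pathways, observed_ecs):
    """Rank pathways by the fraction of their enzymes found by sequence similarity."""
    report = []
    for name, ecs in pathways.items():
        found = ecs & observed_ecs
        missing = ecs - observed_ecs
        report.append((len(found) / len(ecs), name, sorted(missing)))
    return sorted(report, reverse=True)

def expected_but_unseen(report, threshold=0.5):
    """Enzymes of well-supported pathways for which no sequence was identified."""
    return {ec for coverage, _, missing in report if coverage >= threshold for ec in missing}

# EC numbers assigned to an organism's sequences (made-up partial list).
observed = {"2.7.1.1", "5.3.1.9", "2.7.1.11"}
report = score_pathways(KNOWN_PATHWAYS, observed)
candidates = expected_but_unseen(report)
```

Here the glycolysis fragment is three-quarters covered, so its one unseen enzyme becomes a candidate for an enzyme that "must almost certainly exist". The methanogenesis fragment has no support and contributes nothing. With several complete genomes, such a candidate list could be intersected with sequences of unknown function that occur only in the genomes expected to carry the pathway.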

How Can Integrated Data Support the Analysis?

The broad classes of problems discussed above are deeply interrelated. Advances on any single problem directly impact the others. For example, one might point to the obvious relationships between

metabolic organization/metabolic regulation/genetic regulation/operon structure
protein families/conservation in alignments/functional sites
phylogeny of organisms/phylogeny of proteins/functional correspondences

In each of these cases, comparative analysis will play a central role. Effective access to diverse categories of data will accelerate processes which now often require a number of steps. Rapid access to both protein and DNA alignments, phylogenetically organized and annotated, will be essential.

Each of the broad categories of problems mentioned above is significant in its own right, and progress will emerge in unpredictable ways on many fronts. However, we do suggest that the following question imposes a certain mental perspective that is useful: "How long will it take before we reach the point where we have a single genome in which we have an approximate estimate of the function of every gene?" Will it be five years or fifty years? The value of such an "anchor point" becomes apparent when you consider the inferences that could immediately be drawn about broad classes of organisms distributed throughout the phylogenetic tree. While it is true that many, many detailed questions would remain completely unresolved, it is breathtaking to consider the foundation that would be established for addressing questions relating to more complex genomes, for numerous applications, and for a qualitatively deeper grasp of how life works.