PUMA is a web-based system that offers integrated access to biological data. It is intended as an environment to support the interpretation and presentation of genomes.
It is now clear that the biological community will have access to at least three or four complete genomes before the end of this year, and it seems likely that this number will grow rapidly during the next few years. It is important that we construct an environment to support effective analysis of these genomes. PUMA attempts to integrate access to the emerging body of biological sequence data within a functional context: the integration is based on a functional overview of an organism and offers access through structured presentations of the data -- most notably, metabolic pathways, alignments, and phylogenetic trees. We believe that this organization of data will be both useful and necessary as researchers attempt to fill in the emerging image of life at the molecular level.
Enough discussion; let's look at the overview for a single organism to get an idea of exactly what we mean. The following link takes you to the PUMA overview for Escherichia coli. When you get there, you will be positioned on a functional outline which contains metabolic pathways, protein sequences, and alignments -- all grouped into categories representing a functional breakdown of the components of the organism. We suggest that you explore the overview for a few minutes and then return to this point, so that we can continue our discussion of what we have built and why we built it.
Functional Overview for Escherichia coli.
So, now that you have had an initial exposure, let us return to the
question "What is PUMA?" We have organized the data in PUMA in terms
of typed objects with relationships to other objects. The
functional overview is designed to offer a context in which you can
locate objects. To help you migrate between the different classes of
objects, here is a EMP). The metabolic pathways have
been made available on the web, although the EMP itself remains
proprietary.
Metabolic Pathway objects are connected to enzyme and compound objects. To see what we mean, you might click on the following link to a standard version of glycolysis:
If you peruse the diagram, you should note that most of the enzymes and compounds are links (to enzyme and compound objects).
Enzyme objects are related to metabolic pathways, alignments and to the component of the Japanese GenomeNet database dealing with data on enzymes. If you click on the following link, you will go to the enzyme object for enzyme 2.7.1.11 (6-phosphofructokinase).
When you get to the enzyme object, the title contains a link to
the Japanese database. Below the title are links to alignment
objects which contain protein sequences that are part of specific
instances of the enzyme. That list of links is followed by links to
the metabolic pathway objects that utilize the enzyme. Note
that there are often numerous pathways that are essentially the same;
they often contain slight variations which become important for
specific organisms. When you see the actual overview for an organism,
only the appropriate variant will be referenced from the overview.
However, the enzyme object refers to the generic notion and
connects to all variants of the pathway.
Compound objects connect to metabolic pathway objects.
They normally connect to three distinct sets: pathways that utilize
the compound as a substrate, those that produce the compound as a
product, and those in which the compound occurs as an intermediate.
To see an instance of this, click on the following link:
phospho`enol`pyruvate
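The three sets of links on a compound object can be sketched as a small data model. This is only an illustration under stated assumptions: the `Pathway` class, the `compound_links` function, and the two "toy" pathways are hypothetical and do not describe how PUMA actually stores its objects.

```python
from dataclasses import dataclass, field


@dataclass
class Pathway:
    """Hypothetical pathway record: a name plus the compounds in each role."""
    name: str
    substrates: set[str] = field(default_factory=set)
    products: set[str] = field(default_factory=set)
    intermediates: set[str] = field(default_factory=set)


def compound_links(compound: str, pathways: list[Pathway]) -> dict[str, list[str]]:
    """Group pathway names by the role the compound plays in each pathway."""
    links: dict[str, list[str]] = {"substrate": [], "product": [], "intermediate": []}
    for pathway in pathways:
        if compound in pathway.substrates:
            links["substrate"].append(pathway.name)
        if compound in pathway.products:
            links["product"].append(pathway.name)
        if compound in pathway.intermediates:
            links["intermediate"].append(pathway.name)
    return links


# Toy data: only the glycolysis entry reflects real biochemistry;
# the other two pathways are invented placeholders.
PATHWAYS = [
    Pathway("glycolysis", substrates={"glucose"}, products={"pyruvate"},
            intermediates={"phosphoenolpyruvate"}),
    Pathway("toy pathway consuming PEP", substrates={"phosphoenolpyruvate"},
            products={"compound X"}),
    Pathway("toy pathway producing PEP", substrates={"compound Y"},
            products={"phosphoenolpyruvate"}),
]

PEP_LINKS = compound_links("phosphoenolpyruvate", PATHWAYS)
```

Each of the three lists in `PEP_LINKS` corresponds to one group of links that a compound page would display.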
Alignments represent one of the most powerful tools for comparative analysis. We constructed a large set from entries in the Swiss Protein Database. However, after a comparison with the set generated by Randy Smith at the Baylor College of Medicine, we decided that Randy had a more complete set, that he was already maintaining the set (to support his efforts relating to more sensitive similarity searches), and that it made sense to utilize his collection.
As recently as two or three years ago, the creation and maintenance of such a collection of alignments appeared to be a truly formidable task. Now that initial collections of alignments have appeared, and alignments done manually by experts are being released, we have the ability to offer the casual user access to a wealth of data.
If you click on the following link, you will gain access to one of the
alignment objects:
448.29
Note that we use Randy's identifiers. The alignment object
itself has links to Randy's alignment in report or FastA format. If
you look at the report format, you can gain access to information
about each of the related sequences.
Below the links offering access to Randy's alignments, you will note a link to the "corresponding DNA". This is a little additional feature that we have added. We have linked many of the Swiss Protein Database entries to the corresponding DNA sequences from EMBL. When you access this link, we get as many versions of the DNA as we can, we group the upstream regions (of up to 200 characters), we align the DNA (to agree exactly with Randy's protein alignments), and then group the downstream regions. This should allow researchers easy access to upstream regions of the proteins for use in searching for regulatory signals.
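The two steps described above, cutting out the upstream region and aligning the DNA so that it agrees with the protein alignment, can be sketched as follows. This is a minimal illustration, not PUMA's actual code: the function names, the 0-based coordinates, and the assumption that the coding sequence is one uninterrupted run of codons (no introns, no frameshifts) are all ours.

```python
def split_record(dna: str, cds_start: int, cds_end: int, flank: int = 200):
    """Return (upstream, coding, downstream) for a DNA entry.

    Coordinates are 0-based and half-open (an assumption). At most
    `flank` characters of upstream sequence are kept, matching the
    200-character limit mentioned in the text.
    """
    upstream = dna[max(0, cds_start - flank):cds_start]
    return upstream, dna[cds_start:cds_end], dna[cds_end:]


def thread_dna(aligned_protein: str, cds: str) -> str:
    """Lay a coding sequence onto an aligned protein row.

    Every residue consumes one codon; every gap character in the
    protein row becomes a three-character gap in the DNA row.
    """
    residues = aligned_protein.replace("-", "")
    if len(cds) < 3 * len(residues):
        raise ValueError("coding sequence is too short for the protein")
    pieces = []
    position = 0
    for symbol in aligned_protein:
        if symbol == "-":
            pieces.append("---")
        else:
            pieces.append(cds[position:position + 3])
            position += 3
    return "".join(pieces)


# Toy example: a three-residue protein (M, K, L) aligned with one gap.
upstream, coding, downstream = split_record("GGGATGAAACTGTAA", 3, 12, flank=2)
aligned_dna = thread_dna("MK-L", coding)
```

Because the gaps are copied from the protein row, the DNA rows stay in register with the protein alignment column for column, while the ungapped upstream regions can be grouped separately.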
Finally, we offer a link to a phylogenetic tree computed from the protein alignment using the Fitch tool distributed by Joseph Felsenstein in the Phylip package. These trees are unrooted. Further, it is clear that more carefully done alignments would be useful in many cases (however, there are over 12,000 alignments, so we had to cut a few corners). Gradually, we hope to build up a reliable set. For now, they are intended only as aids to allow users to phylogenetically group the sequences if they extract them and perform a more extensive analysis. Note that there are often paralogous copies of proteins from the same organism, so some care should be exercised in interpreting these trees.
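Fitch is a distance-matrix method, so each protein alignment must first be reduced to pairwise distances. How PUMA computes those distances is not stated here; the sketch below uses the simple fraction of differing residues (the "p-distance") as a stand-in, and the function names are hypothetical.

```python
from itertools import combinations


def p_distance(row_a: str, row_b: str) -> float:
    """Fraction of differing residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("rows share no ungapped columns")
    return differing / compared


def distance_matrix(alignment: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise p-distances, keyed by alphabetically ordered name pairs."""
    return {
        (name_a, name_b): p_distance(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(sorted(alignment), 2)
    }


# Toy alignment with invented sequences.
TOY_ALIGNMENT = {"seqA": "MKLV", "seqB": "MKIV", "seqC": "M-IA"}
TOY_DISTANCES = distance_matrix(TOY_ALIGNMENT)
```

A corrected distance that accounts for multiple substitutions would be a better input for real data; the p-distance is used here only because it is easy to verify by hand.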
After you have played with one of Randy's alignments, we feel we must
show you an alternative representation:
A3176
This was one of our old alignments. We like the format a bit better,
and we are encouraging Randy to switch. Our version includes a mask on top
that shows both conservation and an estimate of reliability. Links to a
descriptive page, an alignment editor, and a phylogenetic tree representation
are available. Further, we offer a section that shows relationships between
alignments (in which subsections of the alignments were recognizably
homologous).
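To give a feel for what a conservation mask conveys, here is a minimal sketch that scores each column by how often its most common residue occurs. The 0-9 scale and the function name are assumptions; the reliability estimate is not sketched because the text does not say how it is computed.

```python
from collections import Counter


def conservation_mask(rows: list[str]) -> str:
    """Return one digit (0-9) per alignment column.

    The digit is the share of rows carrying the column's most common
    residue, in tenths, capped at 9. Gaps count against conservation.
    """
    if not rows or len({len(row) for row in rows}) != 1:
        raise ValueError("need at least one row, all of equal length")
    digits = []
    for column in zip(*rows):
        residues = [symbol for symbol in column if symbol != "-"]
        if not residues:
            digits.append("0")  # a column of gaps carries no signal
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        digits.append(str(min(9, (10 * top_count) // len(column))))
    return "".join(digits)


# Toy alignment with invented sequences.
TOY_ROWS = ["MKLV", "MKIV", "M-IA"]
TOY_MASK = conservation_mask(TOY_ROWS)
```

Printed above the alignment, such a mask lets the reader spot fully conserved columns (the digit 9) at a glance.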
Now, before moving on to discuss the motivation and research goals addressed by PUMA, let us briefly discuss the overall structure of the home page for PUMA:
This link takes us to a main page, which just asks you to choose between versions with or without graphical images. Once you pick one, you will be given a choice between
The first option gives you access to our "general functional overview", which is a concept we describe below. The second allows you to move through the phylogenetic tree, examining functional overviews of specific organisms or ancestral nodes in the tree. You have already seen the functional overview for E. coli. The last option is just a simple forms-based access to specific PUMA objects via IDs. Like many aspects of PUMA, we hope to improve our indexed access over the next few months; by December, we must be ready to offer good access to the burgeoning set of complete genomes!
PUMA attempts to maintain more-or-less current overviews of those organisms for which substantial sequencing is taking place. There are a growing number of organisms for which at least 50 sequences now exist in the Swiss Protein Database (over 300 such organisms). It seems likely that there will be thousands of such organisms within two or three years. This raises the question: "Should any attempt be made to maintain overviews for more than just a few genomes?" The answer to this question must hinge on how costly it will be to actually maintain a large collection, as opposed to maintaining overviews for only a limited set of model organisms.
Our attempt to place a large number of organisms within a coherent integration of alignments and pathways has been based on the generation of tools to automate the process. Our approach proceeded roughly as follows:
First, we constructed a general outline which attempts to group lower-level functions within a single structured view of the subsystems that make up organisms. To attempt to create such a single overview that effectively captures the groupings that exist within the diverse organisms requires a certain level of bravado. No matter what overview is compiled, it will be difficult to characterize the result as "natural" or "compelling". The general overview that we show in this version of PUMA has been subjected to criticism from a number of knowledgeable biologists. We intend to continue allowing others to review and alter it.
The outline itself is maintained as an ASCII file. We attach to each
leaf in the outline patterns that can be used to scan through pathways and
alignments for those entries to be placed within the lowest-level
topic. A given pathway or alignment is often placed under several
leaves, and that seems reasonable to us. Unfortunately, it is often
the case that patterns pick up entries which really do not belong. The
process of gradually refining the pattern set, altering the
descriptions of objects that are scanned using the patterns, and
evaluating the results has been time-consuming, and much remains to be
done. However, the version that we are now presenting does, in our
view, offer a resource to the community that did not exist before.
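The pattern mechanism can be illustrated with a short sketch. The outline leaves, the regular expressions, and the `place` function below are hypothetical; the format of the real outline file and its patterns is not described here.

```python
import re

# Hypothetical outline: each leaf carries patterns that are matched
# against the descriptions of pathways and alignments.
OUTLINE = {
    "Metabolism / Carbohydrates / Glycolysis": [r"\bglycoly", r"phosphofructokinase"],
    "Metabolism / Amino acids": [r"\b(amino acid|aminotransferase)\b"],
}


def place(description: str, outline: dict[str, list[str]]) -> list[str]:
    """Return every leaf with at least one pattern matching the description."""
    return [
        leaf
        for leaf, patterns in outline.items()
        if any(re.search(pattern, description, re.IGNORECASE) for pattern in patterns)
    ]


enzyme_leaves = place("6-phosphofructokinase (EC 2.7.1.11)", OUTLINE)
shared_leaves = place("Pathway linking glycolysis to amino acid biosynthesis", OUTLINE)
unplaced = place("Hypothetical protein of unknown function", OUTLINE)
```

The second description lands under both leaves, which mirrors the multiple placement described above. The same looseness is what lets a pattern pick up entries that do not belong, so refinement means tightening the patterns or rewording the descriptions.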
The goal that has often been articulated is to visualize the overall phylogenetic tree and to gradually characterize what is known about the organisms represented by each node. This would allow researchers, instructors and students to follow the central design commitments represented at each node and allow one to reach a more meaningful integration of the wealth of detailed data that now exists. PUMA does not achieve the framework to support such a view, but it does represent a significant step along the way to constructing it.
For the past six months, we have posed the following question to a
number of senior biologists: "How many genomes do you expect to be
completely sequenced by the turn of the century?" The answers have
ranged from 5 to 300. People whose opinions we sincerely respect
answered at both the low and high ends of those estimates. We started
asking the question because our own estimate ("hundreds") appeared
somewhat counter to current wisdom. In fact, now that it has been
established that complete microbial genomes can be sequenced for
well under a dollar per base, the answer probably hinges on what is
achieved from access to the first few completely sequenced genomes.
A cursory analysis of computational biology and genetic sequence analysis might lead one to believe that most of the central issues involve the relationship between molecular structure and function. The role of molecular structure is certainly critical, but it is worth emphasizing that a great deal can be achieved without a detailed knowledge of structural relationships. This is important, since most of what is learned from these early genomes will have an indirect and fairly long-range impact on the essential problems relating to determination of the structural basis of molecular functions. We believe that the ultimate solution of some of these structural issues will grow out of much more rapid advances that emerge more directly from the analysis of sequence.
So, what are the central problems that will be addressed by genetic sequence analysis during the next few years?
The broad classes of problems discussed above are deeply interrelated. Advances on any single problem directly impact the others. For example, one might point to the obvious relationships between
In each of these cases, comparative analysis will play a central role. Effective access to diverse categories of data will accelerate processes which now often require a number of steps. Rapid access to both protein and DNA alignments, phylogenetically organized and annotated, will be essential.
Each of the broad categories of problems mentioned above is significant in its own right, and progress will emerge in unpredictable ways on many fronts. However, we do suggest that the following question imposes a certain mental perspective that is useful: "How long will it take before we reach the point where we have a single genome in which we have an approximate estimate of the function of every gene?" Will it be five years or fifty years? The value of such an "anchor point" becomes apparent when you consider the inferences that could immediately be drawn about broad classes of organisms distributed throughout the phylogenetic tree. While it is true that many, many detailed questions would remain completely unresolved, it is breathtaking to consider the foundation that would be established for addressing questions relating to more complex genomes, for numerous applications, and for a qualitatively deeper grasp of how life works.