Database Processing

The most common application of dock is to process a database of molecules to find potential inhibitors or ligands of a target macromolecule. However, with the separation of components introduced in version 4.0, the database processing tools can also be combined with other tasks, such as stand-alone Scoring, Score Optimization, or Chemical Screen.

Database processing is signalled with the multiple_ligands parameter. A subset of the database may be processed using the ligands_maximum, initial_skip, and interval_skip reading parameters and the heavy_atoms_minimum and heavy_atoms_maximum size-selection parameters. If scoring has been selected, then molecules can be output as a ranked list using the rank_ligands parameter. When comparing molecules, the score of large molecules may be penalized using the contact_size_penalty parameter, and likewise with the corresponding parameter for each other scoring function. If ligands are not ranked, then all orientations recorded for each molecule are written (see Orientation Search on page 28 for how multiple orientations are recorded). When no orientation search is performed (i.e. stand-alone scoring), molecules are written if they pass a score cutoff set by the contact_maximum parameter, and likewise for each other scoring function.

Database jobs produce two output files in addition to the molecule output files. The restart_file parameter specifies the file that stores the current rank_ligands list. If the job is terminated prematurely, it may be restarted with the -r flag (see Command-line Arguments on page 53), and the run is then initialized with the information in the existing restart file. The frequency at which the restart file is updated is specified with the restart_interval parameter. In addition, the info_file parameter specifies the file that stores information about the current progress of the run.

A running database job can also be controlled by creating either of two files. The dump_file parameter specifies a file whose presence triggers the job to write the current results to the info file, the restart file, and the molecule output files. The user may create this file at any time to inspect the current results. The dock job automatically deletes the dump file after a dump has taken place. In addition, the quit_file parameter specifies a file whose presence triggers the job to write out results (like a dump request) and then terminate execution. This parameter is useful for gently terminating a job without loss of information, so that it can be restarted at a later time. The presence of either file is only checked between molecules, so it may take up to a minute for such a file to be noticed.
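
A dump request can be scripted from the shell. The following is a minimal sketch, not part of dock itself: the function name request_dump and the file name dock.dump are hypothetical, and the file name must match whatever was given to the dump_file parameter. The function creates the trigger file and then polls until the job has deleted it, giving up after a timeout.

```shell
#!/bin/sh
# request_dump FILE [TIMEOUT_SECONDS]
# FILE is the name given to the dump_file parameter ("dock.dump" is only an
# assumed example). The function creates FILE and waits until it disappears,
# which is how the running job signals that the dump has taken place.
request_dump() {
    dump_file=$1
    timeout=${2:-120}                 # seconds to wait before giving up
    waited=0
    touch "$dump_file" || return 2    # create the trigger file
    while [ -e "$dump_file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "no response after ${timeout}s; is the job running?" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "dump completed after about ${waited}s"
}
```

Calling request_dump dock.dump from the job's working directory would then return once the info file, restart file, and molecule output files have been refreshed. The same polling approach should work for the quit file if that file is also removed by the job, which is an assumption not confirmed here.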

Preliminary Docking

Docking an entire database can take a considerable amount of time. The length of time depends primarily on the sampling parameters for the orientation and conformation search and on the minimization parameters. Even when docking is distributed over multiple workstations, the calculation can take several days or weeks. Since the optimal search parameters are site dependent, it is important to do some preliminary docking calculations with a subset of the database to identify good parameters.

As sampling parameters are increased, the results will initially improve but will eventually converge. The optimal parameters correspond to the point where the results have just converged. Multiple short docking jobs can be submitted using unix shell scripts. The results that should be monitored are presented in Ewing and Kuntz [6]. The most important is the weighted rank correlation, which reports how well the rankings of the top-scoring molecules have been predicted.
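
One way to script such a series of short jobs is sketched here. It is only an illustration under stated assumptions: the template file and its @VALUE@ placeholder are hypothetical, the run_VALUE.in and run_VALUE.out names are arbitrary, and the script assumes that dock reads its parameter file from the name given with -i and writes to the name given with -o. Which sampling parameter is varied is left entirely to the template.

```shell
#!/bin/sh
# Command used to launch each job; override DOCK_CMD to use another path.
DOCK_CMD=${DOCK_CMD:-dock}

# sweep TEMPLATE VALUE...
# For each VALUE, copy TEMPLATE to run_VALUE.in with every occurrence of the
# hypothetical placeholder @VALUE@ replaced by VALUE, then run one job on it.
# VALUEs are expected to be plain numbers (no "/" or "&" characters).
sweep() {
    template=$1
    shift
    for value in "$@"; do
        sed "s/@VALUE@/$value/g" "$template" > "run_$value.in" || return 1
        # Assumed invocation: -i parameter file, -o output file.
        $DOCK_CMD -i "run_$value.in" -o "run_$value.out" || return 1
    done
}
```

For example, sweep template.in 100 200 400 800 would run four jobs in sequence; the monitored results from successive runs can then be compared to see where they stop changing.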

The following is a discussion of different ways to construct a subset of the molecule database.

Extracting specific molecules

Since the PTR format database file contains the molecule name and description, entries can be retrieved based on these fields. Use unix fgrep to select one or more molecules.

fgrep " BENZENE " database.ptr > benzene.ptr

fgrep -f subset.list database.ptr > subset.ptr

The file subset.list would contain a list of the names you would like to extract. The subset molecules can be readily converted to a format for viewing with the following command.

dock -i subset.ptr -o subset.pdb

Extracting a random subset

A random subset of a molecular database can be used to help identify appropriate docking parameters before docking an entire database. Selecting a random subset is easy using a PTR format database file. Use the following unix nawk command to select an average of one out of every 1000 entries in the database.

nawk '{if (rand() < 0.001) print}' database.ptr > subset.ptr
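
When the same subset must be reused across several trial runs, the random number generator can be seeded explicitly. The following is a minimal sketch, assuming a POSIX awk (nawk accepts the same program) and one database entry per line, as the commands in this section already do. The function name random_subset and the seed value are arbitrary choices for illustration.

```shell
#!/bin/sh
# random_subset FRACTION SEED < database.ptr > subset.ptr
# Keeps each input line with probability FRACTION. With an explicit SEED the
# same lines are selected on every run of a given awk implementation.
random_subset() {
    awk -v frac="$1" -v seed="$2" '
        BEGIN { srand(seed) }      # seed the generator explicitly
        rand() < frac { print }    # rand() returns a value in [0, 1)
    '
}
```

For example, random_subset 0.001 42 < database.ptr > subset.ptr selects on average one entry in 1000, and repeating the command with the same seed reproduces the same subset.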

Extracting an interval subset

Alternatively, if you literally want one molecule for every 1000 molecules without any randomness, then use a different call to unix nawk.

nawk '{if ((++n % 1000) == 0) print}' database.ptr > subset.ptr

This can also be achieved by using the interval_skip multiple ligand parameter in dock. This latter method is much slower, however, because the coordinates of all the skipped molecules must be read.

Parallel Jobs

Since a database docking calculation is ideal for parallelization over multiple computers, the parallel jobs feature was added to ease the organizational burden of this task. This feature is activated with the parallel_jobs parameter. With this feature, a dock job can be launched on every workstation or cpu at a user's disposal. These jobs process a single database, each at its own pace. To prevent the jobs from duplicating one another's work, a server job is used to parse the database and hand out molecules, one at a time, to each client job. Each client job and the server job requires its own input file and output files. When all processing is complete, the user must coalesce the results from each client job.

Client or server behavior is designated using the parallel_server parameter. Setting it to "yes" causes server behavior; "no" causes client behavior. The server name is defined with the server_name parameter. Any number of client jobs can be delegated to the server using the client_total parameter. Each client name must be supplied with the client_name_1 and subsequent parameters. It is recommended that the server job be executed on the computer which stores the molecule database. Then in the event of any network difficulties, the server job is never disconnected from the database.

The client jobs need to have the parallel_jobs parameter activated, but not the parallel_server parameter. The server_name parameter must be consistent with that supplied to the server job. The name of the client job is specified with the client_name parameter, which must be one of the client names supplied to the server. The client jobs need to be launched from the same directory as the server job, since they communicate via local temporary files. Consequently, client jobs can only be launched on machines that cross-mount the working directory. Clients may be taken off-line (via the quit_file) and restarted without disrupting the server or other client jobs. If the server job is given a quit signal, then it automatically signals all client jobs to quit as well.

Client jobs may be instructed either to store a ranked list or to write out all results to file. Since the PTR format files take up so little disk space, an entire database can be written out without taking up more space than the top few thousand molecules written in SYBYL MOL2 format (about 25 megabytes of disk space). If the clients store ranked lists, then make sure that the list length for each client is equal to the total length of interest (a few hundred at least). This rule helps avoid artifacts from the parallelization when the results are coalesced.

When all jobs are complete, the results must be combined. This can be done by using the unix cat command on the output molecule files. If PTR format is used, then molecules can be reranked using the unix sort command on the score field. If SYBYL MOL2 format is used, then perform stand-alone scoring on the concatenated file and output the molecules in a ranked list.
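
For PTR output, the cat and sort steps can be combined in a small shell function. This is a sketch under stated assumptions rather than a description of the PTR layout: it assumes one entry per line, whitespace-separated fields, and that a lower score is better. The position of the score field is not specified here, so it is passed as an argument; the function name merge_ranked and the file names in the example are hypothetical.

```shell
#!/bin/sh
# merge_ranked SCORE_FIELD TOP_N FILE...
# Concatenates the per-client PTR files, sorts the entries numerically on
# field number SCORE_FIELD (lowest value first), and prints the first TOP_N.
merge_ranked() {
    field=$1
    top=$2
    shift 2
    # LC_ALL=C keeps "." as the decimal point regardless of the user's locale.
    cat "$@" | LC_ALL=C sort -n -k "$field,$field" | head -n "$top"
}
```

For example, merge_ranked 3 500 client_*.ptr > ranked.ptr would keep the 500 best entries if the score were in the third field. Because each client list is truncated independently, the merged top 500 is only reliable if every client stored at least 500 entries, which is the reason for the list length rule given above.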