EPMR: Molecular replacement by evolutionary search

User’s Guide for openEPMR, Version 8.03

Copyright © 2005-2008 by Charles R. Kissinger. All rights reserved.

Last updated: April 1, 2008

Contents

Introduction
Usage
Options
Acknowledgements
References
Examples

Introduction

EPMR is a general-purpose molecular replacement program. Unlike most molecular replacement programs, it does not divide the problem into separate rotation and translation searches. Instead, it uses an evolutionary search algorithm to simultaneously optimize the orientation and position of a search model (1,2). The program operates as follows:

  • An initial set of random solutions (random orientations and positions for the search model) is generated.
  • The correlation coefficient is calculated for each trial solution.
  • A fraction of the highest scoring solutions are retained and used to regenerate a complete set of new trial solutions. This is done by applying random alterations to the orientation angles and translations for each “surviving” solution.
  • The correlation coefficients for the new population are calculated, the population is again regenerated from the top scoring solutions, and this procedure is repeated for a specified number of cycles.

The algorithm provides broad, stochastic, initial sampling of the search space while gradually focusing in on the most promising regions. It allows for efficient searching of the six-dimensional (or higher) space. In general, it is several orders of magnitude faster than a brute-force, systematic, 6-D search. At the end of the evolutionary optimization, a local minimization is performed on the best solution. This is simply a rigid-body refinement of the search molecule.
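The cycle described above can be sketched in a few lines of Python. This is only a toy illustration, not EPMR's implementation: the score function is a stand-in for the correlation coefficient, and the survivor fraction, the perturbation size, its gradual narrowing and the choice to carry survivors over unchanged are all assumptions made for the example.

```python
import random

# Hypothetical "true" solution for the toy problem (3 angles + 3 translations,
# all scaled to the range 0..1 for simplicity).
TARGET = [0.3, 0.7, 0.5, 0.2, 0.9, 0.4]

def toy_score(solution):
    # Stand-in for the correlation coefficient: higher is better, 0 is a perfect match.
    return -sum((x - t) ** 2 for x, t in zip(solution, TARGET))

def evolve(score, n_params=6, population=300, generations=50,
           survivor_fraction=0.1, step=0.2, seed=1):
    """Toy evolutionary search; returns (best_score, best_solution)."""
    rng = random.Random(seed)
    # Initial set of random trial solutions.
    trials = [[rng.random() for _ in range(n_params)] for _ in range(population)]
    n_keep = max(1, int(population * survivor_fraction))
    for _ in range(generations):
        # Score every trial and retain the highest scoring fraction.
        survivors = sorted(trials, key=score, reverse=True)[:n_keep]
        # Regenerate a complete population by randomly altering the survivors.
        trials = list(survivors)
        while len(trials) < population:
            parent = rng.choice(survivors)
            trials.append([x + rng.gauss(0.0, step) for x in parent])
        # Assumed schedule: shrink the alterations so the search focuses in.
        step *= 0.9
    best = max(trials, key=score)
    return score(best), best

if __name__ == "__main__":
    best_score, best = evolve(toy_score)
    print(best_score, best)
```

The defaults of 300 trial solutions and 50 generations mirror the program defaults documented under the -p and -g options; everything else in the sketch is illustrative.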

The program calculates structure factors rapidly by indexing into a molecular transform using the method of Huber and Schneider (3). A traditional structure factor calculation is done only once, for the search model set at the origin of a P1 cell. Subsequent structure factor calculations are done by transforming reflection indices according to the rotations and translations applied to the model and the relationship between the P1 and real cells, interpolating into the grid of P1 structure factors and summing over the symmetry operators of the crystal. This is much faster than an FFT calculation. A simple Babinet-type solvent correction is applied to the calculated structure factors. The values of the solvent correction parameters (k, B) are optimized during the search.

Because of the stochastic nature of the evolutionary optimization process, the correct solution will not be obtained on every run, even with a very good search model. The success rate is dependent on the quality of the model (2). By default, 10 optimization attempts are done, and more will probably be required if you have a difficult problem. For search models that are poor and at the limit of detection, the search efficiency can be quite low. If you have a molecular replacement problem that has not yielded a solution by any other means, a reasonable last resort is to set up EPMR to do as many runs as your patience and computing resources will allow. As long as the true solution represents the global maximum in the correlation coefficient between Fo and Fc, even if by the slimmest of margins, the algorithm will eventually find it.
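As a rough guide to how many runs a difficult problem might need, the following sketch works out the chance of at least one success in a series of runs. It assumes that every run succeeds independently with the same probability; the per-run success rates used here are hypothetical, since the real rate depends on the search model and data.

```python
import math

def prob_at_least_one_success(p_single, n_runs):
    """Chance that at least one of n_runs independent runs finds the solution."""
    return 1.0 - (1.0 - p_single) ** n_runs

def runs_needed(p_single, confidence=0.95):
    """Smallest number of runs giving at least the requested chance of success."""
    if not 0.0 < p_single < 1.0:
        raise ValueError("p_single must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_single))

if __name__ == "__main__":
    # Hypothetical per-run success rates, chosen only for illustration.
    for p in (0.5, 0.1, 0.02):
        print(p, prob_at_least_one_success(p, 10), runs_needed(p))
```

Under these assumptions a model that succeeds in one run out of ten has about a 65% chance of yielding a solution within the default 10 attempts, which is why raising the number of attempts with -n is the first thing to try for a weak model.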

EPMR includes the following features:

  • the ability to automatically search for multiple copies of a molecule in the asymmetric unit, either sequentially or concurrently
  • the ability to search with multiple models, either sequentially or simultaneously (i.e., in competition with each other)
  • the ability to use multiple coordinate sets as parts of an “assembly” that comprises the complete search model
  • an option to search over all related space groups, either sequentially or simultaneously
  • rotation-only and translation-only search modes
  • an option to provide static, partial structure
  • independent optimization of each segment of a search model during the final rigid-body refinement step
  • an option to bypass the evolutionary search and do only local, rigid-body optimization of the model

This version of EPMR is free, open-source software, distributed under the terms of the GNU General Public License. This documentation describes a developmental version of the program. New and better versions of EPMR will be released on a regular basis. The latest version of this program is always available for download from http://www.epmr.info.

Feedback is welcome. E-mail concerning the program can be sent to Chuck Kissinger at ckissinger@epmr.info.

If you publish results obtained using EPMR, please cite Charles R. Kissinger, Daniel K. Gehlhaar & David B. Fogel, “Rapid automated molecular replacement by evolutionary search”, Acta Crystallographica, D55, 484-491 (1999).

Usage

The program requires either two or three input files on the command line depending upon whether a CCP4 mtz file is used for both cell constants and reflection data. The first file specified on the command line is used as the source of information on the unit cell and space group. The file can be either an mtz file or a text file specifying the cell constants and space group number in the order:

a b c alpha beta gamma space_group_number

These are free-format and can be divided between any number of lines. An example of appropriate contents for the text file is:

40.76 18.49 22.33 90 90.61 90 4
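If you generate or check cell files from a script, the free-format layout is easy to handle. The function below is a minimal sketch of a reader for such a file; the function name is hypothetical, and the strict checks (exactly seven values, space group number between 1 and 230) are assumptions that may not match how EPMR itself treats unusual input.

```python
def parse_cell_text(text):
    """Return (a, b, c, alpha, beta, gamma, space_group_number) from free-format text."""
    # Values may be split across any number of lines, so split on all whitespace.
    tokens = text.split()
    if len(tokens) != 7:
        raise ValueError(f"expected 7 values, found {len(tokens)}")
    a, b, c, alpha, beta, gamma = (float(t) for t in tokens[:6])
    space_group = int(tokens[6])
    if not 1 <= space_group <= 230:
        raise ValueError(f"space group number out of range: {space_group}")
    return a, b, c, alpha, beta, gamma, space_group

if __name__ == "__main__":
    # The example cell from this guide, divided over three lines.
    print(parse_cell_text("40.76 18.49 22.33\n90 90.61 90\n4"))
```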

All 230 space groups are available in their standard settings. For rhombohedral space groups, the hexagonal setting is expected, and your data needs to be indexed accordingly.

The second file on the command line should be a standard PDB format file containing the search model. Any lines in this input file that are not ATOM or HETATM records are ignored. The program expects correctly formatted PDB ATOM records. The names of atoms with two-letter element symbols (e.g., Fe, Ca, Se) must be appropriately left-shifted to be correctly recognized. If the PDB file contains multiple segment identifiers in columns 73 to 76, these will be used to subdivide the search model during the final rigid-body refinement. Each segment is optimized independently during that step.

If the first file on the command line is an mtz file, and you also want to use it as the source of your reflection data, it is not necessary to provide a third file on the command line. The program will read the Fobs data from any data column in the mtz file labeled “F” or “FP” or from the first column with a label beginning with “F”. If you are not using an mtz file, the final input file is a text file that contains the observed structure factors. The only requirement is that the file has H, K, L, and Fobs as the first four items on each line, separated by spaces.
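The text reflection format can be read with a few lines of code, which is handy for checking a file before a run. This is a sketch only: the function name is hypothetical, the reflection values in the usage lines are made up, and skipping lines with fewer than four items is an assumption rather than documented EPMR behavior.

```python
def read_reflections(lines):
    """Collect (h, k, l, fobs) from lines whose first four items are H, K, L, Fobs."""
    reflections = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # assumption: ignore blank or short lines
        h, k, l = (int(x) for x in fields[:3])
        fobs = float(fields[3])
        # Any further items on the line (e.g. a sigma column) are ignored.
        reflections.append((h, k, l, fobs))
    return reflections

if __name__ == "__main__":
    # Made-up reflections, for illustration only.
    print(read_reflections(["1 0 0 125.3 4.1", "", "0 -2 3 58.0"]))
```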

The command line:

epmr example.cell example.pdb example.hkl

or:

epmr example.mtz example.pdb

will run the program in its default mode. In this mode, the program will search for a single copy of the search molecule in the asymmetric unit. It will run the evolutionary search procedure up to ten times, or until a solution with a correlation coefficient of 0.65 is obtained. Data in the resolution range between 1000 and 4 Angstroms will be used in the search. The top solution found will be written to a file called “epmr.best.pdb”.

If you wish to provide multiple search models, put the list of file names in a file, and supply that file to the program in place of a coordinate file, with a “@” symbol before it:

epmr example.mtz @example.filelist

A single search model can be read from the standard input by putting a “-” (dash) in place of the coordinate file name on the command line:

epmr example.cell - example.hkl <example.pdb >example.log

When the program is started, it will print some information about the input data and program settings, do the initial FFT structure factor calculation, and then start the optimization runs. At the end of each evolutionary search, a local, rigid-body refinement is performed on the result. The final orientation for each run will be reported on a line that begins with “Solution”, followed by the run number, rotation (alpha, beta, gamma as defined for the CCP4 programs), translation (in fractional coordinates for the model after it has been centered at the origin) and then the space group, model number, correlation coefficient and R factor. Theta1, theta2, and theta3 in the CNS convention are printed on a subsequent line. (If you intend to make use of the rotation and translation values outside of the program, they must be applied to the search model after it has been centered at the origin.) The angular and translational relationships to the best previous solution obtained are also printed. You can use the Linux/UNIX command:

grep Solution epmr.log

to print just the solutions. On Linux systems, the command:

grep Solution epmr.log | sort -n -k 11r,11 -k 12

will list the solutions sorted by correlation coefficient, then R-factor.
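On systems without grep and sort, the same ranking can be done with a short script. The sketch below assumes that each “Solution” line is whitespace-separated with the correlation coefficient in field 11 and the R factor in field 12, as implied by the sort command above. The sample lines are invented placeholders that only follow the field order described in this guide; they are not real EPMR output.

```python
def sort_solutions(log_lines):
    """Rank 'Solution' lines by correlation coefficient (descending), then R factor."""
    rows = [line.split() for line in log_lines if line.startswith("Solution")]
    # Field 11 (index 10) is assumed to hold the correlation coefficient and
    # field 12 (index 11) the R factor.
    return sorted(rows, key=lambda f: (-float(f[10]), float(f[11])))

# Invented placeholder lines: run, rotation (3), translation (3), space group,
# model number, correlation coefficient, R factor.
sample_log = [
    "Solution 1 10.0 20.0 30.0 0.100 0.200 0.300 4 1 0.412 0.561",
    "some other log output",
    "Solution 2 15.0 25.0 35.0 0.150 0.250 0.350 4 1 0.688 0.431",
    "Solution 3 15.1 25.2 35.1 0.151 0.249 0.350 4 1 0.688 0.402",
]

if __name__ == "__main__":
    for fields in sort_solutions(sample_log):
        print(" ".join(fields))
```

If the layout of the “Solution” line in your version of the program differs, adjust the two field indices accordingly.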


Options

The operation of the program is controlled by a set of command line options. (If the program is run without any command line arguments, a brief summary of the available options will be listed.) The possible options are:

-A Treat models as an assembly, search with all simultaneously

This option is off by default. It causes each input model to be treated as an independently positioned and oriented segment of the complete search model. (By default, each input model is treated as an entirely separate, alternative search model.) When this option is turned on, the program will attempt to optimize the orientation and position for all search models simultaneously.

This is an experimental option and is not yet recommended for general use.

-a Treat models as an assembly, search with each sequentially

This option is off by default. It causes each input model to be treated as an independently positioned and oriented segment of the complete search model. (By default, each input model is treated as an entirely separate, alternative search model.) When this option is turned on, the program will find the optimal solution for the first model, keep that as static structure, search for the optimal solution for the next model, store that as static structure, and so on. Both this option and the “-A” option can be combined with “-m” or “-M” to search for multiple copies of an assembly.

-b number The minimum “bump” distance: the smallest unpenalized distance between the center of mass of a solution and that of any symmetry mates

The default value is 0.0 (no packing restrictions). This is applied to all trial solutions that are generated during the course of the search. When a solution violates this minimum distance, the correlation coefficient calculated for that solution is scaled down by the ratio of the shortest observed distance over the minimum allowed distance. In the case of searches for multiple molecules in the asymmetric unit, this also sets the minimum distance between a solution and any previously found solutions. (This applies both to previous solutions found during the run and to partial structure entered with the -s option. See the description of the -s option below for instructions on entering multiple fragments of partial structure for use with this option.)

This option imposes a simple penalty on solutions that pack poorly. It can be helpful in some searches, but decreases the efficiency of others, particularly if a large value is used. This option appears to be less useful in single molecule searches than in searches for multiple molecules in the asymmetric unit, but definitely should be tried if you are having trouble getting a solution that packs well. It is up to you to decide what an appropriate minimum intermolecular distance should be. (Remember that it is the distance between centers of mass.) It is best to be conservative: the program will not search efficiently without some room to move solutions around through positions that pack poorly.
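The scaling rule for the bump penalty can be written out as a small function. This is a sketch of the rule as stated above, not code taken from the program, and the function and parameter names are hypothetical.

```python
def penalized_cc(cc, shortest_distance, min_distance):
    """Scale a correlation coefficient down when a solution packs too closely.

    shortest_distance: shortest observed center-of-mass distance (Angstroms)
    min_distance:      the -b value; 0.0 means no packing restriction
    """
    if min_distance <= 0.0 or shortest_distance >= min_distance:
        return cc  # no violation, so the score is left unchanged
    # Violation: scale by (shortest observed distance) / (minimum allowed distance).
    return cc * (shortest_distance / min_distance)

if __name__ == "__main__":
    # With -b15.0, a solution 12 Angstroms from its nearest mate keeps 80% of its score.
    print(penalized_cc(0.5, 12.0, 15.0))
```

Because the penalty grows smoothly as solutions move closer together, poorly packed trial solutions are discouraged rather than forbidden, which is what leaves the search room to pass through them.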

-C Search simultaneously over all space groups in the same point group and crystal class as the input space group

This option is off by default. The space group is treated as an additional variable in the evolutionary search. In other words, the alternative space groups compete with each other during the evolution. Currently, this option works well for routine molecular replacement calculations with a good search model. For difficult cases, it is safer to use the “-c” (lowercase) option instead, which will search each of the candidate space groups separately.

-c Search sequentially over all space groups in the same point group and crystal class as the input space group

This option is off by default. Each space group is tried in turn. The number of optimization runs specified by the “-n” option will be completed for each space group choice, and all input models will be tried in each space group before moving to the next.

-e integer Set the seed value for the random number generator to a specific value.

By default (or if the seed value is set to a value of zero using this option), the seed is generated from the system clock at run time. This option is not necessary for normal operation of the program, but can be useful for testing purposes. Two separate runs of the program that use the same seed value will produce identical results.

-g integer The number of “generations” (cycles of optimization).

The default value is 50. It is not normally necessary (or recommended) to change this value. However, setting this value to zero (-g0) will bypass the evolutionary search and feed your input model directly to the local optimizer. This allows you to use EPMR as a convenient rigid-body refinement program.

-h number High-resolution limit for diffraction data used in the search (in Angstroms)

The default value is 4.0 Angstroms. It can be useful to try different high-resolution limits. It is rarely effective to set this value to less than 5.0.

-l number Low-resolution limit for diffraction data (Angstroms).

The default value is 1000.0 Angstroms (effectively no cutoff). MR calculations can be highly sensitive to the inclusion or exclusion of the lowest resolution data, and also to the accuracy of that data. In some cases, poorly measured low-resolution data can negatively impact the search. If you are having problems, you could try excluding low resolution data using this option. A low-resolution limit of 15 Angstroms is often effective.
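To see which reflections a given pair of -h and -l limits would select, the d-spacing of each reflection can be computed from the cell constants. The sketch below uses the standard triclinic d-spacing formula; the function names are hypothetical, and treating both limits as inclusive is an assumption about the program's behavior.

```python
import math

def d_spacing(h, k, l, a, b, c, alpha, beta, gamma):
    """d-spacing in Angstroms for reflection (h, k, l); angles in degrees.

    Not defined for (0, 0, 0).
    """
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sa, sb, sg = (math.sin(math.radians(x)) for x in (alpha, beta, gamma))
    volume = a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)
    inv_d2 = (
        (h * b * c * sa) ** 2 + (k * a * c * sb) ** 2 + (l * a * b * sg) ** 2
        + 2 * h * k * a * b * c * c * (ca * cb - cg)
        + 2 * k * l * a * a * b * c * (cb * cg - ca)
        + 2 * h * l * a * b * b * c * (cg * ca - cb)
    ) / volume ** 2
    return 1.0 / math.sqrt(inv_d2)

def in_resolution_range(d, high=4.0, low=1000.0):
    """True if d lies between the high- and low-resolution limits (assumed inclusive)."""
    return high <= d <= low

if __name__ == "__main__":
    cell = (40.76, 18.49, 22.33, 90.0, 90.61, 90.0)  # example cell from this guide
    d = d_spacing(0, 1, 0, *cell)
    print(d, in_resolution_range(d), in_resolution_range(d, low=15.0))
```

With the example cell, the (0, 1, 0) reflection has a d-spacing equal to b (18.49 Angstroms), so it is used with the default limits but would be excluded by a low-resolution limit of 15 Angstroms.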

-M integer The number of copies of the molecule in the asymmetric unit to find simultaneously

The default value is 1. Values greater than 1 cause multiple orientations and positions to be optimized for the search model.

This option has not yet been optimized in this new version of the program and is not yet recommended for general use.

-m integer The number of copies of the molecule in the asymmetric unit to find sequentially

The default value is 1. A value of 2 would cause the program to search for one copy of the molecule, save the solution as partial structure and continue searching for a second solution.

-n integer The number of independent optimization attempts

The default value is 10, which is intended for routine cases. For difficult cases, values up to 100 are worth trying. The program will stop before the completion of the number of runs specified here if a solution is obtained that has a correlation coefficient that exceeds a specified threshold (the -t option, below).

-o name The file name prefix for the output coordinate files

The default is “epmr”. If you run multiple jobs in the same directory, you will have to use this flag to avoid writing over files from other runs. If you specify “-” as the file name prefix, the coordinates will be written to the standard output of the program instead of a separate file.

If you specify the option “-w2”, the program will name the output files for each run as “prefix.run_number.pdb” (e.g., epmr.1.pdb). If you are searching sequentially over multiple space groups, search models, and/or copies in the asymmetric unit, the file name will have additional numeric indicators for those (e.g., epmr.3.4.2.5.pdb would indicate the third space group choice, fourth model, second copy, fifth optimization attempt). The coordinate file containing the top solution from all runs with a specific combination of space group, model and copy-number is named with “best” replacing the run number.
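When collecting results from a script, it can help to build the expected output file names. The helper below is a hypothetical sketch that reproduces only the naming examples given above; exactly which numeric indicators the program includes in a given search mode is not specified here, so the caller is assumed to supply them.

```python
def output_name(prefix="epmr", run=None, indicators=()):
    """Build an output file name following the examples in this guide.

    run:        optimization attempt number, or None for the top ("best") solution
    indicators: extra numbers preceding the run number, e.g.
                (space group choice, model, copy) for sequential searches
    """
    last = "best" if run is None else str(run)
    parts = [prefix, *(str(i) for i in indicators), last]
    return ".".join(parts) + ".pdb"

if __name__ == "__main__":
    print(output_name(run=1))                        # first run, default prefix
    print(output_name(run=5, indicators=(3, 4, 2)))  # 3rd space group, 4th model, 2nd copy
    print(output_name())                             # top solution
```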

-p integer The population size (number of trial solutions evaluated in each cycle of optimization)

The default value is 300. Increasing this value beyond 300 will increase the search efficiency, but you will get more benefit from increasing the number of optimization attempts (-n) instead.

-R Rotation search only

-S Try all search models simultaneously

This option is off by default. The choice of the model is an additional variable in the evolutionary search. This is an experimental option still under development.

-s filename Read static structure from the specified file

If you have partial structure to input, include this flag and follow it with the name of the PDB file containing the correctly positioned partial structure. You can separate the partial structure into as many files as you wish and use this flag multiple times on the command line. It is only necessary to divide the partial structure up this way, however, if you are using the -b flag (see above) and are inputting multiple “pieces” of partial structure (e.g., multiple monomers). The minimum packing distance calculation will treat the partial structure within each separate file as a separate fragment.

-T Translation search only

This option is off by default. It will cause the program to search only translation space, keeping the orientation of the search model unchanged. This could be useful, for instance, when you have a search model that has been pre-oriented by another program or through knowledge of non-crystallographic symmetry. Note that the orientation WILL be optimized during the final rigid-body optimization after the evolutionary search, so the orientation is likely to change slightly and could change significantly during this step.

-t number The threshold value of the correlation coefficient that indicates an acceptable solution, which will stop the program

The default value is 0.65. The program will stop when a solution correlation coefficient exceeds this threshold; if you want the program to continue for a specified number of runs no matter what, set this value to 1.0. It is best to keep this value relatively high to avoid having the program stop on an incompletely converged solution. Unlike previous versions of EPMR, this value is not adjusted during sequential searches for multiple copies of a molecule.

-v integer Verbosity: controls the amount of information written to the standard output of the program

The default is 1. A value of 0 (zero) results in nothing being written (unless the coordinates are written to the standard output using the option “-o -”). A value of 2 causes more detailed information on the progress of the evolutionary search to be written.

-w integer The quantity of solutions you want to write out to PDB files

The default value is 1. A value of 0 (zero) here means no coordinates will be written out. A value of 1 means only the top solution from all of the runs will be written out. A value of 2 means all solutions will be written out. (The -o flag controls the name of the output PDB files.)

Acknowledgements

The original EPMR program was written by Chuck Kissinger and Dan Gehlhaar at Agouron Pharmaceuticals. David Fogel (Natural Selection, Inc.) contributed numerous ideas that were essential to the original development of the program. Bradley Smith (Pfizer) was responsible for rewriting and improving the original program and testing numerous alternative algorithms and ideas. SGX Pharmaceuticals and Anadys Pharmaceuticals have supported the development of this open-source version of EPMR.

This program is distributed under the GNU General Public License. The program makes use of the VecMath class library implemented by Kenji Hiranabe, which is distributed under a different license and which requires the following copyright and permission notice:

Copyright (C) 1997,1998,1999 Kenji Hiranabe, Eiwa System Management, Inc. This program is free software. Implemented by Kenji Hiranabe (hiranabe@esm.co.jp), conforming to the Java(TM) 3D API specification by Sun Microsystems. Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation. Kenji Hiranabe and Eiwa System Management, Inc. makes no representations about the suitability of this software for any purpose. It is provided "AS IS" with NO WARRANTY.


References

1) Kissinger, CR, Gehlhaar, DK & Fogel, DB, “Rapid automated molecular replacement by evolutionary search”, Acta Crystallographica, D55, 484-491 (1999).

2) Kissinger, CR, Smith, BA, Gehlhaar, DK & Bouzida, D, “Molecular replacement by evolutionary search”, Acta Crystallographica, D57, 1474-1479 (2001).

3) Huber, R. & Schneider, M., “A group refinement procedure in protein crystallography using Fourier transforms”, J. Appl. Cryst. 18, 165-169 (1985).

Examples

Example 1

Search for one molecule in the asymmetric unit, perform up to 10 attempts of the evolutionary search procedure (or until a cc above 0.65 is obtained), and write out the best solution to the file “epmr.best.pdb”:

epmr example.cell example.pdb example.hkl > example.log

Example 2

Do a translation search, use static partial structure and do up to 50 runs for each molecule:

epmr -T -s example_A.pdb -n 50 example.cell example_B.pdb example.hkl > example.log

Example 3

Search for one molecule in the asymmetric unit (default), use data from 80 to 3.5 Å, do up to twenty runs, write out only the top solution (default) to a file with the prefix “example_solution”, and use static structure:

epmr -l 80 -h 3.5 -s example_partial.pdb -o example_solution -n 20 example.cell example.pdb example.hkl > example.log

Example 4

Search for three identical molecules in the asymmetric unit, write out all solutions, up to 10 runs (default) for each molecule. Input two fragments of partial structure. Penalize solutions that pack within 15.0 Å (center-to-center distance) of any symmetry mates or any partial structure that was input or generated during the run:

epmr -m3 -w2 -s example_partial1.pdb -s example_partial2.pdb -b15.0 example.cell example.pdb example.hkl > example.log

Example 5

Use EPMR in a program pipeline. The search model is read from the output of the mythical program “search_model_generator” and the solutions are written to the input of “solution_evaluator”. Cell and reflection data are read from an mtz file. Search all related space groups sequentially, search sequentially for two identical molecules in the asymmetric unit, write out all solutions, and do up to 100 optimization runs for each molecule:

search_model_generator | epmr -o- -c -m2 -w2 -n 100 example.mtz - | solution_evaluator