ACMI v1.3.1 manual

(c) 2007 Frank DiMaio
(c) 2008 Ameet Soni


1 INTRODUCTION
--------------

ACMI is a suite of programs designed to automate construction of a 
macromolecular model from an electron density map.  It is specifically
developed to handle low-resolution (3-4 Angstrom resolution) and 
poorly phased electron density maps.

Details of ACMI's algorithms are available from the publications listed at 
the end of this manual.  The manual is meant as an introduction to using ACMI
to construct protien models from electron density, and a reference to each
program's command-line parameters.

Generally, a user wanting to construct an all-atom protein model from an 
electron density map will want to use the following program pipeline:
     
    a) SHFEAR -- given a protein amino-acid sequence and a density map -- 
       searches the map for a set of templates corresponding to each 
       amino-acid residue in the protein.  It computes the prior probability
       distribution that of each amino-acid's presence at every location in
       the map.
     
    b) BP uses structural knowledge of the protein to refine the probabilities
       computed by SHFEAR.  Its probabilistic inference algorithm computes the
       posterior distribution of each amino acid's presence at every location
       in the map; that is, the probabilty taking into account the location
       of every other amino acid.
     
    c) SIDECHAIN_PF uses a Monte-Carlo sampling algorithm over BP's posterior 
       distributions to generate a _set_ of physically feasible protein 
       structures, that best explains the computed posterior distribution.

These three programs are connected using the idea of a working directory. Each
program writes its output to the working directory; the next program in the
pipeline reads in from this working directory, processes this input, then
writes its output to this same directory.




2 TUTORIAL
----------

A small, well-phased 2.5-Angstrom-resolution density map -- with corresponding
amino-acid sequence and crystal parameters -- is provided in the folder 
'test/data/'.  The density map is contained in the file 'test.fs4.map', the 
amino-acid sequence in 'test.seq', and the crystal parameters in 'test.cryst'.

The script 'do_test.sh' shows a simple pipeline for the provided data.  Below, 
we show each command in the pipeline, annotated with the output produced.


    ./shfear -w test/  -m test/data/test.fs4.map \
                       -c test/data/test.cryst \
                       -s test/data/test.seq \
                       -r 2.5 -o 0.9 -N 20 -p 1 1
					   
NOTE: Before running this command, make sure the the environment variable
"ACMI_NMERDB" is set to specify the folder containing the nmer database.
                       
This command runs SHFEAR's template searching over the density map.  For each
amino-acid position in test.seq, SHFEAR considers the 5-amino-acid sequence 
_centered_ at this position, and  searches ACMI's fragment database for 
structures with a 5-AA sequence with high local similarity.  It performs an 
exhaustive 6D search over the map for each matching structure, computing a 
correlation coefficient at every location and rotation.

The density map, crystal parameters (identical to the 'CRYST1' line in a 
PDB file) and sequence information (one-letter codes, each chain on a line, 
with multiple asymmetric copies indicated with multiple lines) are specified
with the '-m', '-c', and '-s' flags, respectively.  The '-r' flag optionally
provides the density map resolution.  

The '-o' flag says to ignore 90% of the map in this search -- that 
is, only look for fragments in the best 10% of the map.  We set this value 
higher than the default 0.8 (ignoring only 80% of the map) because this is a 
well-phased map; setting it to 0.9 should find as good a model in about half
the time.

The '-N' flag specifies the minimum number of templates for which to search
per residue.  The default value, 50, typically works well; we set it to
20 here for performance reasons.

Finally, the -p flag is used for parallelizing the algorithm.  The default,
'-p 1 1' (read 1-of-1) runs the entire job on a single processor.  If there
are two machines available, running two jobs, one specifying '-p 1 2', the 
other '-p 2 2' will finish about twice as fast.

The working directory is specified with the '-w' flag (here it is the folder 
'test/').  SHFEAR creates a subdirectory in the working directory named 
'priors'.  In this folder, for each 5-amino-acid sequence XXXXX, SHFEAR 
creates 5 files:

    match_XXXXX.cc.mat 
        - a map of correlation coefficients between the best matching
          templates (C-alpha-centered) and the density map.
          
    match_XXXXX.betaF.mat,  match_XXXXX.gammaF.mat
        - a map of the "preferred" location of the adjacent N-terminal
          C-alpha, specified in spherical coordinates.
        
    match_XXXXX.betaB.mat,  match_XXXXX.gammaB.mat
        - a map of the "preferred" location of the adjacent C-terminal
          C-alpha, specified in spherical coordinates.
        
Each of these maps is in MATLAB v6 format, and can be loaded and visualized 
in MATLAB.  Data is extended to cover the entire unit cell (not just the
asymmetric unit).

                       
    ./BP -w test/  -c test/data/test.cryst \
                   -s test/data/test.seq  
                   
This command runs ACMI's inference algorithm on the previously computed 
priors.  The arguments -c and -s are identical 

The working directory -- again indicated with '-w' -- must be the
same working directory in the previous step.  BP reads in the contents of the
'priors' folder, and creates an output directory, 'posteriors'.  

In this folder, for each residue X of chain Y, BP creates the output file 
'test_chnY_resX_marginal.mat'.  This file contains the posterior distribution
(that is, the distribution taking all other amino acid locations into account) 
of residue X's C-alpha location.  Each of these maps are also in MATLAB v6 
format, and can be loaded and visualized in MATLAB.

BP also creates a subfolder to the working directory, 'backbone_models'.  At
each iteration of the algorithm, BP computes a backbone (C-alpha-only) model
of the protein, and places it in this folder.
                  
                  
    ./sidechain_pf -w test/  -m test/data/test.fs4.map \
                             -c test/data/test.cryst \
                             -s test/data/test.seq \
                             -r 2.5 -b 1-1

This command uses the posterior disributions to guide a Monte Carlo sampling
algorithm, producing an ensemble of all-atom protein models.  The arguments
'-m', '-s', and '-c' indicate the map, crystal, and sequence files, as in the 
other core programs.  Resolution is indicated with -r, while an additional
argument, '-b' indicates a range of structures to generate.  In this case,
we generate just 1 structure.  

Allowing a range of indices makes parallelizing straightforward, e.g. one can 
compute structures 1-5 on one machine, and 6-10 on another (repeated runs with
a the same index yield the same structure.

The working directory, specified with '-w', is the same as before. 
SIDECHAIN_PF reads the contents of the 'posteriors' directory from this 
folder, and creats a new output folder, 'Nparticle_models' (where N is the 
number of particles per run, by default 100).  It outputs a set of all-atom
models, named prot_allatom_pf.K.pdb, where K is the structure's index
(corresponding to the range specified with '-b').

SIDECHAIN_PF also creates a directory 'fragment_cache'

This set of structures comprises the final model produced by ACMI.




3 CORE PROGRAMS
---------------

This section describes individual command line arguments to the core programs
SHFEAR, BP, and SIDECHAIN_PF.



3.1 SHFEAR 

Given the amino acid sequence and a map, this program looks up each 5-mer in the
database, finds a number of instances, and searches the map for these instances.
For each residue in the protein, it outputs a map of correlation coefficients
between the best-matching instance and the electron density map.

Usage: ./shfear [options [args]]

  --help,       -h          Displays this information
  --working,    -w    [REQUIRED] Working Directory
        This is the top-level output directory, shared by all programs in ACMI
        SHFEAR writes its output maps to a subdirectory, 'priors'.
  
  --map,        -m    [REQUIRED] Input fs4 or CCP4 map file
        Full path to the fs4-format or CCP4-format density map.
        
  --cryst,      -c    [REQUIRED] Input crystal file
        A one-line text file that has the exact same syntax as the 'CRYST1'
        line in a PDB file.
  
  --seq,        -s    [REQUIRED] Input sequence file
        The protein's sequence as a one-letter code ('_'s indicate gaps). 
        Each chain is specified on a separate line.
        NOTE! If there are multiple copies in the asymmetric unit, 
              you will need to enter a line for each.
        
  --reso,       -r    Density map resolution (default 2.0)
  --bandwidth,  -B    Spherical harmonic bandwidth (default 12)
  --radius,     -R    Fragment radius (default 5)
  --nfrags,     -N    Number of fragments considered per amino acid
                            (default 50)
        Search parameters.  These are optional, and defaults give
        reasonable results in most cases.  Higher values for 'bandwidth' and
        'nfrags' may yield higher-quality results at the cost of additional 
        runtime.
        
  --grid,       -x    Approximate map grid spacing (default 1A)
        Grid spacing is now computed automatically, given an approximate
        grid spacing (works the same as 'fft' in CCP4).
        The default is reasonable in most cases.
  
  --omit,       -o    Using density values, omit fraction of map 
                            (default 0.8)
        A first-pass filter, based on density values throws away this 
        fraction of points.  If the map is especially well-phased, you 
        may consider increasing this to 0.9 (reducing the runtime by half).
        
  --parallel,   -p    Parallelize  of 
        This flag allows parallelism of shfear jobs.  Individual amino-acids
        are run on separate machines.
        
        Parallelization of jobs is done as 'k of N'; e.g., -p 1 12 indicates
        this job is the first of 12 parallel runs.
  
  --analogues,  -a    List of structural analogues for fragment generation
        Allows specification of structural analogues.  The search procedure
        will use templates from this structure, and only go to the database if
        there are no instances in the analogue.  Specifically a list of PDB files, 
        delimited by spaces.
  
  --force,      -f          Force priors to be overwritten
        Normally, if an ouput file exists in 'priors' from a previous run of
        shfear, it will not be overwritten.  
        This flag forces overwriting in such a case.
  --partial,	-P    Use partial solution PDB structure
		If a user has a partial solution already determined	(e.g. from a 
		molecular replacement solution), shfear can skip the template search
		for any solved residue by using the coordinates in the partial solution.  
		A normal search is done for all non-specified residues in the PDB file.
		*NOTE see below for partial solution format requirements
  --seleniums,	-e	 Substitute sulfurs with seleniums in Methionine residues   
		When generating templates densities from PDB pentapeptides, sulfurs in methionines are replaced with 			the density of seleniums when methionine is the central residue.

3.2 BP

   Given the output of shfear, BP runs the inference algorithm to
   compute marginal distributions of each residue's position.

Usage: ./BP [options [args]]
  
  --help,       -h          Displays this information
  --working,    -w    [REQUIRED] Working Directory
        These are the same as in shfear.  BP reads from the priors directory,
        and creates two directories for output, 'posteriors', and 
        'backbone_models'.
  
  --cryst,      -c    [REQUIRED] Input crystal file
  --seq,        -s    [REQUIRED] Input sequence file
        These are the same as in shfear.
  
  --sigma,      -o    Average squared rotation error (default 1.0)
  --m-est,      -m    Probability of incorrect rotational alignment
                            (default 0.1)
  --iters,      -i    Iterations of BP (default 25)
        Search parameters.  These are optional, and defaults give
        reasonable results in most cases. 
          
  --seleniums,  -e    PDB file containing locations of selenium atoms
  --se-prior,   -p    If Se file specified, wgt on specified Se locations
                            (default 0.9)
        These two parameters allow you to specify the location of all the 
        selenium atoms (from phasing experiments) and a weight to apply to
        these locations.  The default weight should work fine in most cases.
                            
  --memory,     -x    Maximum memory usage before paging (in MB)
  --scratch,    -d    Scratch directory for paging 
                            (will use a unique subdir)
        With especially large maps, BP often need significant amounts of
        memory, which can lead to swapping and significant performance penalty.
        
        These flags allows specification of a maximum amount of memory to 
        allocate, and a scratch  directory, which will be used for temporary 
        storage, only if required memory is >= the memory  limit.  
        Default is 4000 MB and '/scratch', respectively.
  --partial,	-P    Use partial solution PDB structure
		If a partial solution was used for shfear, the same file needs to supplied
		to BP, otherwise it will not be able to interpret the shfear results properly
		*NOTE see below for partial solution format requirements
  --fixed,	-f	    Fix residues in partial solution
		By default, the partial solution will be treated as a normal probability
		distribution, meaning BP may modify the solution.  This flag will cause BP 
		to lock the solution for the residues in the partial solution.  
  --secondary,	-s    Input secondary structure sequence file
		Input secondary structure prediction sequence.  Should be in same format as sequence file with same 			lengths.  Currently supports three classes: helix (H), loops/other (L), beta (E)


3.3 SIDECHAIN_PF

   Given the posteriors from BP, use particle filtering to obtain a set of 
   all-atom models.
   
Usage: ./sidechain_pf [options [args]]

  --help,       -h          Displays this information
  --working,    -w    [REQUIRED] Working Directory
        Same as in shfear.  sidechain_pf reads from 'posteriors', and writes to 
        a folder Nparticle_models, where 'N' is the # of particles.
  
  --map,        -m    [REQUIRED] Input fs4 or CCP4 map file
  --cryst,      -c    [REQUIRED] Input crystal file
  --seq,        -s    [REQUIRED] Input sequence file
        The same as shfear/BP.
  
  --reso,       -r    Density map resolution (default 2.0)
  --bandwidth,  -B    Spherical harmonic bandwidth (default 16)
  --rad-align,  -g    Fragment radius for alignment (default 5)
  --rad-mask,   -k    Fragment radius for computing CC (default 8)
  --mask-dist,  -d    Mask dist for computing CC if rad-mask > rad-align
                            (default 3.0)
  --nfrags,     -N    Number of fragments considered per amino acid
                            (default 50)
        Search parameters.  These are optional, and defaults give
        reasonable results in most cases. Many are the same as in shfear.
            
                            
  --npart,      -p    Number of particles (default 100)
        The number of point estimates used to model the posterior.
        The default is probably reasonable.
        
  --buckets,    -b    Bucket number (or range) of this particle set 
                            (default 1-10)
        Range of buckets, or "populations" of particles over which to run 
        (e.g -b 3-6 runs buckets 3,4,5, and 6).  No resampling occurs
        between different buckets, so each bucket produces an independent
        interpretation.  Using '-b' it is possible to parallelize PF runs: 
        for example, running buckets 1-5 on 1 machine, and 6-10 simultaneously
        on another.        
  --b-min,      -y    Min belief to continue a trace (default 0.001)
  --M-max,      -z    Max entropy to start a trace (default 2.8)
        Parameters controlling when to abort chain growth and to 
        terminate interpretation early.  Defaults give reasonable results.         
  --analogues,  -a    A list of structural analogues to use for 
                            fragment generation
  --partial,	-P    Use partial solution PDB structure
		If a partial solution is available, sidechain_pf can skip the sampling
		procedure for any pre-determined residues.  This will seed all particles
		with the partial solution first.
		*NOTE see below for partial solution format requirements           
    



4 EXTRA UTILITIES
-----------------

Several additional programs are provided with the ACMI distribution.

The program GET_RMSD is used to evaluate an ACMI model against a 
crystallographer-built model.

The program SIDECHAIN_NAIVE takes the backbone model produced by
BP and stitches sidechains on this predicted backbone.

Finally, the program BUILD_NMER_DB is provided as a reference to users
wishing to construct their own 5-mer fragment library.



4.1 GET_RMSD

   A short program to analyze computed results.  Unless the 'verbose' option
   is specified, arguments are output in a single comma-separated line, in
   the following order:
   
       1) Number of chains in the model
       2) Percentage of "true" C-alphas found
       3) Percentage of "true" sidechains correctly identified
       4) Percentage of predicted structure corresponding to "true" structure
       5) Percentage of predicted structure corresponding to 
          correctly-identified "true" structure
       6-8) C-alpha, all-atom, and sidechain-only RMS error of 
            WHAT WAS CORRECTLY PREDICTED
    
    Values 2 and 3 correspond to the "recall" of the model, while
    values 4 and 5 correspond to the "precision".
    
    If 'nobreak' was not specified, 8 additional values provide the same 
    metrics for the model in which the predicted chain was optimally broken.
   
   
   Usage: ./get_RMSd [options [args]]

  --help,       -h          Displays this information
  --true,       -t    [REQUIRED] True PDB
  --pred,       -p    [REQUIRED] Predicted PDB
        Paths to the true and the predicted PDB files.
        
  --correct,    -c    Consider AA correct if CA is within this distance 
                            of the true CA location (default=2A)
        Self-explanitory.
                            
  --outpdb,     -o          Write PDB in correct asymm unit to STDOUT
        Outputs to STDOUT the predicted PDB, but rotated to lie in the same 
        asymmetric unit as the true PDB.  Useful for displaying the results.
        
  --stats,      -s          Write matching statistics to STDOUT
        Prints a comma-seperated list comparing prediction confidence 
        (specified in the B-factor column of the predicted PDB), and the 
        distance from the true CA location.
  
  --nobreak,    -n          Don't try to break chains
        By default, get_RMSd will try to break false connections in the model.
        Use this flag to disable this behavior.

  --verbose,    -v          Use verbose output
        By default, get_RMSd will output on a single comma-separated line.
        This argument forces human-readable output.
        
        
        
4.2 SIDECHAIN_NAIVE

    Given a backbone model produced by BP, this program naively creates an
    all-atom model by stitching sidechains on this backbone-only model.
    
    The arguments are similar to SIDECHAIN_PF.  SIDECHAIN_NAIVE writes to 
    STDOUT.
    
Usage: ./sidechain_pf [options [args]]

  --help,       -h          Displays this information
  --working,    -w    [REQUIRED] Working Directory
        Same as in shfear.  sidechain_pf reads from 'posteriors', and writes to 
        a folder Nparticle_models, where 'N' is the # of particles.
  
  --map,        -m    [REQUIRED] Input fs4 or CCP4 map file
        The same as shfear/sidechain_pf.
        
  --cryst,      -p    [REQUIRED] Input backbone model
        A backbone model (C-alpha-only) of the protein of interest.
  
  --reso,       -r    Density map resolution (default 2.0)
  --bandwidth,  -B    Spherical harmonic bandwidth (default 16)
  --rad-align,  -g    Fragment radius for alignment (default 5)
  --rad-mask,   -k    Fragment radius for computing CC (default 8)
  --mask-dist,  -d    Mask dist for computing CC if rad-mask > rad-align
                            (default 3.0)
  --nfrags,     -N    Number of fragments considered per amino acid
                            (default 50)
        Search parameters.  These are optional, and defaults give
        reasonable results in most cases.
            
                            
  --analogues,  -a    A list of structural analogues to use for 
                            fragment generation



4.3 BUILD_NMER_DB

This program allows one to construct a custom fragment library.  
The program takes two command line arguments: 
    
    a) A text file where the first four charaters of each line 
       give a PDB accession code, and the fifth gives a chain 
       identifier (with a single header line).
    
    b) The location of the fragment database.

Before building the database, the PDB files should be downloaded and placed in 
'nmerdb/fraglib', and named 'pdbXXXX.pdb', where XXXX is the accession code.

A sample such text file is given in 
'nmerdb/cullpdb_pc30_res3.0_R1.0_d050604_chains4257.txt' 
The intended use of the program is to use the Dunbrack lab's culled PDB lists
at 'http://dunbrack.fccc.edu/Guoli/pisces_download.php'.

The "live" database is always at 'nmerdb/nmers.db' (and ACMI always looks for
PDB files in 'nmerdb/fraglib'). 



5 PARALLELIZATION
-----------------

The utility SHFEAR is the component of ACMI that requires the most significant
computation time.  However, it is trivially parallelized using the '-p' flag. 
Inputs are specified in a k-of-n format, e.g. '-p 12 35' indicates that this run
of SHFEAR is to be job #12 after breaking the task into 35 jobs.  Jobs are split
at the amino-acid level; specifying more parallel jobs than amino acids in the
protein will not provide additional parallelism (extra jobs will immediately
halt).

SHFEAR's parallelism works extremely well with the CONDOR distributed computing
environment.  Details of CONDOR are available on the project's home page, at
http://www.cs.wisc.edu/condor/.

Included with this distribution is a sample SHFEAR condor-submission file,
'run.condor'.  In this file, the testset protein in the folder '_02b_70653' is
split into 323 jobs (one for each amino-acid).  These jobs are submitted to the job
queuing system with the command 'condor_submit run.condor'.




6 ACMI PUBLICATIONS
-------------------

SHFEAR is described in:

F. DiMaio, A. Soni, G. Phillips and J. Shavlik (2007).  Improved methods for 
template-matching in electron density maps using spherical harmonics.  Proc. BIBM.


BP is described in:

F. DiMaio, J. Shavlik, and G. Phillips (2006).  A probabilistic approach to
protein backbone tracing in electron density maps. Bioinformatics 22.  e81-e89.


SIDECHAIN_PF is described in:

F. DiMaio, D. Kondrashov, E. Bitto, C. Bingman, A. Soni,  G. Phillips, and 
J. Shavlik (under review). Creating All-Atom Protein Models from 
Electron-Density Maps using Particle-Filtering Methods.

7 INPUTTING PARTIAL SOLUTION 
----------------------------

The take home summary is:
add -P  flag to shfear, BP, and/or sidechain_pf to input a 
partial structure.  Add -f to BP to treat this partial solution as known truth, 
no need to modify the partial solution.  The pdb file needs to meet strict 
requirements in terms of how residues are numbered as well as format:

*The file needs to be in PDB format.
*The only atom that is read in is the CA of a residue.  Each residue in the partial
	solution must have one and only one CA specified.
*You do not need to remove other atoms - they will be ignored.
*Residues are numbered based on their position in the sequence file, where the first
position in the sequence is 1 (not 0).  If the first residue defined in the PDB
is sequence position 12, it must have a residue ID of 12 (NOT 1!).

shfear will place a smoothed peak of probability around the position specified in the PDB
as well as all symmetric copies, with the -w flag defining the width of the Gaussian.
BP uses the sequence file for identification issues.  If the --fixed flag is not used,
BP treats all shfear files(i.e. from 5mer search and partial solution) the same, meaning
the partial solution may be modified.  If --fixed is used, those residues pre-defined in
the partial solution are fixed in location, providing a speed up in performance since
calculations are not done for these residue.
Even if a partial solution is used for shfear and BP, sidechain_pf can optionally take
in the structure.  If no structure is given, it uses the normal sampling procedure using
the information from BP (which processed the partial solution information).  If the 
structure is given, sidechain_pf skips sampling of residues in the structure and uses
the partial solution as a seed for the start of the sampling of the remaining residues.
    

8 CHANGES LOG
---------------------------
From 1.3.1 to 1.3.0
Major:
-ACMI can handle CCP4 density maps (in addition to fsfour maps).  
-BP can incorporate provided secondary structure (in development phase)
-Ability to substitute seleniums into methionines for shfear

Minor:
-Seleniums only boost probabilities for methionines now
-Fixed bug causing BP to crash when shfear files are not complete
-Fixed bug to handle improper space group names. 

*****************

From 1.2.1 to 1.3.0
Major:
-sidechain_pf is able to handle partial solutions.  The given partial solution is used
	as the seed for each particle
-shfear is able to define confidence in the partial solution via a Gaussian width parameter

Minor:
shfear:
-place Gaussian convolution on given partial solution atom locations
-added parameters to define kernel width for Gaussian
-removed mixture parameter


sidechain_pf:
-output solution contains chainID and resID based on original sequence file
-output raw file before moving molecule to one assymetric unit (some chains get broken up by shift)

*****************

From 1.2 to 1.2.1:

Major changes:

src/shfear.cpp - added capability to handle partial solutions to the structure

Minor fixes:
BP - removed bounding box option, fixed paging bug, pre-compute forward messages

******************

From 1.1 to 1.2

src/BP.cpp, src/BP.h
        added ability to include partial structure (untested)