cmcalibrate help
cmcalibrate is slow and requires a lot of memory. This page is meant to help users having difficulty with cmcalibrate.
First, cmcalibrate is only required if you are going to use your CM file with cmsearch or cmscan. Otherwise, there is no reason to run cmcalibrate.
To calibrate the CM file RF00001.cm, do:
$ cmcalibrate RF00001.cm
You should see output like this:
# cmcalibrate :: fit exponential tails for CM E-values
# INFERNAL 1.1.2 (July 2016)
# Copyright (C) 2016 Howard Hughes Medical Institute.
# Freely distributed under a BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# CM file: RF00001.cm
# number of worker threads: 32
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Calibrating CM(s):
#
# predicted actual
# running time percent complete running time
# model name (hr:min:sec) [........25........50........75..........] (hr:min:sec)
# -------------------- ------------ ------------------------------------------ ------------
5S_rRNA 00:01:53 [========================================] 00:03:44
#
# Calibration summary statistics:
#
# exponential tail fit mu exponential tail fit lambda total number of hits
# ------------------------------- ------------------------------- -------------------------------
# model name glc cyk glc ins loc cyk loc ins glc cyk glc ins loc cyk loc ins glc cyk glc ins loc cyk loc ins
# -------------------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
5S_rRNA -6.37 -1.61 0.62 3.39 0.410 0.425 0.677 0.595 17589 17573 339357 213632
#
# CPU time: 5466.31u 30.45s 01:31:36.76 Elapsed: 00:03:44.48
[ok]
To see the available command-line options for cmcalibrate, do:
$ cmcalibrate -h
# cmcalibrate :: fit exponential tails for CM E-values
# INFERNAL 1.1.2 (July 2016)
# Copyright (C) 2016 Howard Hughes Medical Institute.
# Freely distributed under a BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: cmcalibrate [-options] <cmfile>
Basic options:
-h : show brief help on version and usage
-L <x> : set random seq length to search in Mb to <x> [1.6] (0.01<=x<=160.)
Options for predicting running time and memory requirements:
--forecast : don't do calibration, predict running time and exit
--nforecast <n> : w/--forecast, predict time with <n> processors (maybe for MPI)
--memreq : don't do calibration, print required memory and exit
--noforecast : do calibration, but skip running time prediction
Options controlling exponential tail fits:
--gtailn <n> : fit the top <n> hits/Mb in histogram for glocal modes [df: 250]
--ltailn <n> : fit the top <n> hits/Mb in histogram for local modes [df: 750]
--tailp <x> : set fraction of histogram tail to fit to exp tail to <x>
Optional output files:
--hfile <f> : save fitted score histogram(s) to file <f>
--sfile <f> : save survival plot to file <f>
--qqfile <f> : save Q-Q plot for score histograms to file <f>
--ffile <f> : save lambdas for different tail fit probs to file <f>
--xfile <f> : save scores in fit tail to file <f>
Other options:
--seed <n> : set RNG seed to <n> (if 0: one-time arbitrary seed)
--beta <x> : set tail loss prob for query dependent banding (QDB) to <x>
--nonbanded : do not use QDB
--nonull3 : turn OFF the NULL3 post hoc additional null model
--random : use GC content of random null background model of CM
--gc <f> : use GC content distribution from file <f>
--cpu <n> : number of parallel CPU workers to use for multithreads
To get an estimate of the running time required for cmcalibrate on a CM file RF00001.cm, do:
$ cmcalibrate --forecast RF00001.cm
You should see output like this:
# cmcalibrate :: fit exponential tails for CM E-values
# INFERNAL 1.1.2 (July 2016)
# Copyright (C) 2016 Howard Hughes Medical Institute.
# Freely distributed under a BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# CM file: RF00001.cm
# forecast mode (no calibration): on
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Forecasting running time for CM calibration(s) on 32 cpus:
#
# predicted
# running time
# model name (hr:min:sec)
# -------------------- ------------
5S_rRNA 00:01:53
#
# CPU time: 0.41u 0.01s 00:00:00.42 Elapsed: 00:00:00.45
[ok]
Note that it lists the running time for 32 cpus. This should be the number of cores on the machine you are using, and cmcalibrate will use all cores by default. If you want to forecast the running time for <n> cores, use the --nforecast <n> option, like:
cmcalibrate --forecast --nforecast 8 RF00001.cm
When you perform the calibration, you can specify the number of cores that cmcalibrate will use with the --cpu <n> option.
You may want to use fewer cores if the required memory (see below) is too high.
As a special case, if you want to run on a single core, specify --cpu 0.
To get an estimate of required memory for cmcalibrate, do:
$ cmcalibrate --memreq RF00001.cm
and you should see output like:
# cmcalibrate :: fit exponential tails for CM E-values
# INFERNAL 1.1.2 (July 2016)
# Copyright (C) 2016 Howard Hughes Medical Institute.
# Freely distributed under a BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# CM file: RF00001.cm
# memory-requirement mode (no calibration): on
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Predicting required memory for calibration:
#
# total Mb total Mb
# single CPU 32 CPUs
# -------------------- ---------- ----------
5S_rRNA 77.4 2424.5
#
# To enforce a single CPU be used, use the '--cpu 0' option.
# To enforce <n> CPUs be used, use '--cpu <n>'.
# By default (if '--cpu' is not used), 32 CPUs will be used.
#
# CPU time: 0.00u 0.00s 00:00:00.00 Elapsed: 00:00:00.00
[ok]
This means if you were to run cmcalibrate with the default 32 CPUs, it would require roughly 2424.5 Mb of RAM. If you were to use --cpu 0 to specify a single core be used, it would only require about 77.4 Mb of RAM. If you were to use --cpu 4, it would require roughly 320 Mb of RAM.
These estimates are rough. As a usually safe rule of thumb, I make sure that twice as much memory is available as --memreq estimates when I run cmcalibrate.
There is no way to reduce the required memory for calibration below the amount reported for a single CPU.
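The memory reasoning above can be sketched as a small shell helper. This is only an illustration: the function names parse_memreq and max_cpus are hypothetical, and the linear interpolation between the single-CPU and all-CPU estimates is an assumption, not documented cmcalibrate behavior. For the example above, the interpolation gives about 305 Mb for 4 CPUs, in the same range as the rough 320 Mb figure.

```shell
#!/bin/sh
# Sketch: choose a CPU count from the two numbers that --memreq reports.
# ASSUMPTION: memory grows roughly linearly between the single-CPU and
# all-CPU estimates. Function names here are hypothetical helpers.

# parse_memreq: read `cmcalibrate --memreq` output on stdin and print
# "<model> <single_cpu_mb> <all_cpus_mb>" for each model line.
parse_memreq() {
    awk '!/^#/ && NF == 3 { print $1, $2, $3 }'
}

# max_cpus SINGLE_MB ALL_MB NCPU AVAIL_MB
# Print the largest CPU count (1..NCPU) whose interpolated estimate,
# doubled per the rule of thumb, still fits in AVAIL_MB.
# Prints 0 if even the single-CPU estimate does not fit.
# A result of 1 corresponds to the single-CPU column (requested with --cpu 0).
max_cpus() {
    awk -v s="$1" -v a="$2" -v n="$3" -v avail="$4" 'BEGIN {
        per = (n > 1) ? (a - s) / (n - 1) : 0   # extra Mb per added CPU
        best = 0
        for (i = 1; i <= n; i++)
            if (2 * (s + (i - 1) * per) <= avail) best = i
        print best
    }'
}
```

With the numbers from the example output, `max_cpus 77.4 2424.5 32 1024` prints 6, meaning that with 1024 Mb available the doubled estimate fits for up to 6 CPUs under the linear assumption. The first three arguments can be taken from `cmcalibrate --memreq RF00001.cm | parse_memreq`.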
The -L <f> option controls the total length of random sequence that cmcalibrate searches, where <f> is in Mb. The default value for <f> is 1.6 (Mb). Smaller values of <f> will result in quicker searches but less accurate E-value statistics. The default value of 1.6 was chosen as a good compromise between cmcalibrate running time and resulting E-value accuracy. I do not recommend using <f> values less than 0.4.
With low values of <f>, you may get error messages like "Not enough hits to fit exponential tail" and the calibration will fail. This is because there's a minimum number of high scoring hits the calibration needs to fit an exponential tail, and reducing the search size from 1.6 down to 0.4 (for example) may mean that the minimum number of hits is not achieved. This 'not enough hits' error is more likely to happen with large models, unfortunately.
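One way to cope with the 'not enough hits' failure is to try a small -L value first and fall back to larger ones. The sketch below assumes that cmcalibrate exits with a non-zero status when calibration fails. The function name calibrate_with_fallback is hypothetical.

```shell
#!/bin/sh
# Sketch: try several -L values in order, stopping at the first that works.
# ASSUMPTION: cmcalibrate returns a non-zero exit status on failure
# (for example on "Not enough hits to fit exponential tail").

# calibrate_with_fallback CMFILE L1 [L2 ...]
# Prints the -L value that succeeded on stdout; cmcalibrate's own output
# is sent to stderr so the two do not mix. Returns 1 if every value fails.
calibrate_with_fallback() {
    cmfile=$1
    shift
    for L in "$@"; do
        echo "trying -L $L" >&2
        if cmcalibrate -L "$L" "$cmfile" >&2; then
            echo "$L"
            return 0
        fi
    done
    return 1
}
```

For example, `calibrate_with_fallback RF00001.cm 0.4 0.8 1.6` would attempt the fastest setting first and only pay for the default 1.6 Mb search if the smaller ones fail.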
You can use -L <f> in combination with the --forecast option to see how changing <f> impacts the forecasted running time, like:
cmcalibrate -L 0.4 --forecast RF00001.cm
The --beta <x> option can also speed up cmcalibrate by setting <x> to a value higher than the default of 1E-15. The --beta option controls the width of the bands used during the DP search (see Nawrocki, Eddy, 2007: http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.0030056 for more info). The <x> value for --beta is the amount of probability mass allowed outside the bands, so greater probability loss makes the bands tighter, the DP go faster, and thus the calibration go faster. The default is 1E-15, so setting it to a higher value like 1E-4 (0.0001) will accelerate the search.
You can use --beta <x> in combination with the --forecast option to see how changing <x> impacts the forecasted running time, like:
cmcalibrate --beta 1E-4 --forecast RF00001.cm
or in combination with -L too:
cmcalibrate --beta 1E-4 -L 0.4 --forecast RF00001.cm
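To compare several -L and --beta settings at once, the forecast commands above can be wrapped in a loop. This is a rough sketch: forecast_time and sweep_forecast are hypothetical helper names, and the parsing assumes the forecast output keeps the two-column "model name, hr:min:sec" layout shown earlier on this page.

```shell
#!/bin/sh
# Sketch: tabulate forecasted running times for combinations of -L and --beta.
# ASSUMPTION: each model line of --forecast output has exactly two fields,
# the model name and a predicted time in hr:min:sec.

# forecast_time: read `cmcalibrate --forecast` output on stdin and print
# "<model> <hr:min:sec>" for each model line.
forecast_time() {
    awk '!/^#/ && NF == 2 && $2 ~ /^[0-9]+:[0-9][0-9]:[0-9][0-9]$/ { print $1, $2 }'
}

# sweep_forecast CMFILE "L_VALUES" "BETA_VALUES"
# For every combination, run a forecast and print "<L> <beta> <model> <time>".
sweep_forecast() {
    cmfile=$1
    for L in $2; do
        for beta in $3; do
            cmcalibrate -L "$L" --beta "$beta" --forecast "$cmfile" |
                forecast_time |
                while read -r model t; do
                    echo "$L $beta $model $t"
                done
        done
    done
}
```

For example, `sweep_forecast RF00001.cm "1.6 0.8 0.4" "1E-15 1E-4"` would print one line per combination, which makes it easier to see which setting buys the most time.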
The first step of cmcalibrate is to run a short simulation to estimate the total running time that the full calibration will take. This 'short' simulation can take a long time for large models. To skip it, use the --noforecast option.