Frequently Asked Questions
Because we just love British comedies.
Delete the following directory: rm $HOME/.bigjob/python/lib/python2.<PYTHON_MINOR_VERSION>/site-packages/BigJob-<BIGJOB_VERSION>-py2.X.egg/
To update the bigjob package, execute:
easy_install -U bigjob
Q: Can tasks be distributed across two machines of two different infrastructures (e.g.: Eric on LONI & Ranger on XSEDE) which require different credential (SSH user or Globus certificates)?
Currently, only one SAGA Context is supported. If SSH is used, different per-host credentials can be configured in the ~/.ssh/config file (see man ssh_config). SAGA-Python (Bliss) currently does not support Globus.
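Per-host SSH credentials can be kept in ~/.ssh/config, so each infrastructure is reached with its own user and key. A minimal sketch (the hostnames, usernames, and key file names below are placeholders, not real accounts):

```
# ~/.ssh/config -- per-host credentials (all values are placeholders)
Host eric.loni.org
    User my_loni_user
    IdentityFile ~/.ssh/id_rsa_loni

Host ranger.tacc.utexas.edu
    User my_xsede_user
    IdentityFile ~/.ssh/id_rsa_xsede
```

With such entries in place, a plain `ssh eric.loni.org` (and hence BigJob's SSH adaptor) picks up the matching user and key automatically.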
Redis is the fastest and most stable backend (requires Python >2.5) and the recommended way of using BigJob. Redis can easily be run in user space. It can be downloaded at: http://redis.io/download (just ~500 KB). Once you have downloaded and compiled Redis, start a Redis server on the machine of your choice:
$ redis-server
[489] 13 Sep 10:11:28 # Warning: no config file specified, using the default config. In order to specify a config file use 'redis-server /path/to/redis.conf'
[489] 13 Sep 10:11:28 * Server started, Redis version 2.2.12
[489] 13 Sep 10:11:28 * The server is now ready to accept connections on port 6379
[489] 13 Sep 10:11:28 - 0 clients connected (0 slaves), 922160 bytes in use
Then set the COORDINATION_URL parameter (found at the top of most examples) to the Redis endpoint of your Redis installation, e.g.
redis://<hostname>:6379
The coordination URL is passed to the constructor of the PilotComputeService and the PilotDataService, respectively, e.g.
pilot_compute_service = PilotComputeService(coordination_url="redis://<hostname>:6379")
It is recommended to set up a password for your Redis server. Otherwise, other users will be able to access and manipulate the data stored in your Redis server.
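With a password configured (requirepass in redis.conf), the password is typically embedded in the coordination URL. A minimal sketch, assuming the standard redis://:password@host:port URL scheme (the helper function, password, and hostname below are illustrative, not part of the BigJob API):

```python
# Sketch: assembling a Redis coordination URL, optionally password-protected.
# The redis://:password@host:port form follows the standard Redis URL scheme;
# all concrete values below are placeholders.
def make_coordination_url(host, port=6379, password=None):
    """Assemble a redis:// coordination URL for BigJob."""
    if password:
        return "redis://:%s@%s:%d" % (password, host, port)
    return "redis://%s:%d" % (host, port)

print(make_coordination_url("localhost"))                     # redis://localhost:6379
print(make_coordination_url("myhost", 6380, "s3cret"))        # redis://:s3cret@myhost:6380
```

The resulting string is what you would pass as coordination_url to the PilotComputeService constructor.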
The UNIX screen tool can (and should) be used to re-connect to a running BigJob session on a remote machine. For documentation on screen, please see the screen manpage. You should not submit a BigJob from your local machine to a remote host and then close the terminal without using screen.
Q: If BigJob is being used to launch Pilot-Jobs on multiple machines, does SAGA have to be installed on the all the machines?
It is recommended to have SAGA-Python (Bliss) installed on the resources running the pilot. BigJob will work without SAGA-Python on the resource, but will then not support file staging.
Please make sure that the resource has a suitable Python version installed. The following command should return a valid Python version (Python 2.7 in the optimal case):
$ ssh localhost "python -V"
Python 2.7.2
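The version string returned by the remote `python -V` can also be checked programmatically. A small sketch (the helper functions are illustrative; the only assumption is the "Python X.Y.Z" output format, which Python 2 prints to stderr):

```python
# Sketch: check whether a "python -V" output string satisfies BigJob's
# minimum requirement (Python > 2.5, with 2.7 being optimal).
def parse_python_version(output):
    """Parse a string like 'Python 2.7.2' into a tuple like (2, 7, 2)."""
    # Note: Python 2 prints its version to stderr, so capture both streams
    # when running "ssh <host> python -V" from a script.
    token = output.strip().split()[-1]          # e.g. "2.7.2"
    return tuple(int(p) for p in token.split("."))

def is_supported(output):
    """True if the reported version is above BigJob's Python 2.5 minimum."""
    return parse_python_version(output) > (2, 5)

print(is_supported("Python 2.7.2"))   # True
print(is_supported("Python 2.4.6"))   # False
```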
Q: Does BigJob stage work & data unit files onto the target resource where the work unit has to execute?
Yes, there is SSH-based support for file stage-in.
The BigJob manager expects a URL to a SAGA Job Service as a parameter (lrms_url). The respective SAGA adaptor needs to be installed and working (please test the adaptor properly with SAGA before using BJ). Currently, BigJob works with the following SAGA Job adaptors:
SAGA/PBS: lrms_url = "pbs://localhost"
SAGA/SSH: lrms_url = "ssh://oliver2.loni.org"
SAGA/PBS+SSH: lrms_url = "pbs+ssh://oliver2.loni.org"
SAGA/Globus: lrms_url = "gram://oliver1.loni.org/jobmanager-pbs" (only SAGA C++, deprecated)
Bliss (>0.2.3) is the best supported SAGA version for BigJob and the default.
Yes, but it is deprecated.
Q: My stdout file doesn't contain the output of /bin/date but "ssh: connect to host localhost port 22: Connection refused"
BigJob utilizes SSH for the execution of sub-jobs. Please ensure that your local SSH daemon is up and running and that you can log in without a password.
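Whether sshd is reachable can be verified with a plain TCP probe before digging further; the "Connection refused" message corresponds to such a probe failing. A minimal sketch (the port_open helper is illustrative; port 22 is the standard SSH port):

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return True
    except socket.error:
        return False
    finally:
        s.close()

# A False result here matches the "Connection refused" error in the stdout file:
print(port_open("localhost", 22))
```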
BigJob attempts to install itself if it cannot find a valid BJ installation on a resource (i.e. if import bigjob fails). By default, BigJob searches $HOME/.bigjob/python for a working BJ installation. Please make sure that the correct Python is found in your default paths. If BJ attempts to install itself despite already being installed on a resource, this can be a sign that the wrong Python is found.
BigJob utilizes a configuration file named bigjob.conf located in the root of the BigJob installation (e.g. $HOME/.bigjob/python/lib/python2.<PYTHON_MINOR_VERSION>/site-packages/BigJob-<BIGJOB_VERSION>-py2.X.egg/):
# Logging config
# logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR, logging.CRITICAL
logging.level=logging.INFO
Alternatively you can set the logging level in the code:
import logging
from bigjob import logger
logger.setLevel(logging.FATAL)
or via the environment variable BIGJOB_VERBOSE. For example, for full debug log output use:
export BIGJOB_VERBOSE=5
The BigJob logger can be obtained, and further handlers can be added to the logger object. For example, with the instructions below, the application and BigJob debug messages are written to the namd_bigwork.log file:
import logging

logger = logging.getLogger('bigjob')
fh = logging.FileHandler('namd_bigwork.log',mode='w')
fh.setLevel(logging.DEBUG)
logger.addHandler(fh)
logger.debug("Logging to namd_bigwork.log at DEBUG level")
BigJob expands the tokens $HOME and ~ in the executable and working_directory attributes of the Compute Unit Description with the home directory of the respective resource.
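The substitution can be illustrated locally; a sketch of the expansion behavior (the expand_home helper is illustrative, not BigJob's actual implementation, and the paths are placeholders):

```python
# Illustrative sketch: replace a leading ~ or $HOME with the home directory
# of the target resource, as BigJob does for executable/working_directory.
def expand_home(path, home):
    """Expand a leading ~ or $HOME in path using the given home directory."""
    if path == "~" or path.startswith("~/"):
        return home + path[1:]
    if path.startswith("$HOME"):
        return home + path[len("$HOME"):]
    return path

print(expand_home("$HOME/agent/work", "/home/alice"))  # /home/alice/agent/work
print(expand_home("~/bin/namd2", "/home/alice"))       # /home/alice/bin/namd2
```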
Yes, if your BigJob manager (or application) terminates before all ComputeUnits have finished, you can reconnect to a running pilot by providing a pilot_url to the PilotCompute constructor. For example:
pilot = PilotCompute(pilot_url="redis://localhost:6379/bigjob:bj-a7bfae68-25a0-11e2-bd6c-705681b3df0f:localhost")
By default, BJ creates a directory structure relative to the BJ working directory specified in start_pilot_job:
<BIGJOB_WORKING_DIRECTORY>/bj-54aaba6c-32ec-11e1-a4e5-00264a13ca4c/sj-55010912-32ec-11e1-a4e5-00264a13ca4c
<BIGJOB_WORKING_DIRECTORY>/bj-54aaba6c-32ec-11e1-a4e5-00264a13ca4c/sj-55153072-32ec-11e1-a4e5-00264a13ca4c
For each sub-job, its own directory is created. Sub-jobs can be executed in any directory by setting the working directory to the desired path in the sub-job description:
jd.working_directory = "<your directory of choice>"
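The default layout shown above can be reproduced when pre-staging files. A sketch that builds the bj-<uuid>/sj-<uuid> paths (the naming pattern is taken from the listing above; the subjob_dir helper itself is hypothetical, not part of BigJob):

```python
import os

def subjob_dir(working_directory, bj_id, sj_id):
    """Hypothetical helper: the per-sub-job path BigJob creates by default."""
    return os.path.join(working_directory, "bj-%s" % bj_id, "sj-%s" % sj_id)

# IDs taken from the directory listing above:
print(subjob_dir("/scratch/user",
                 "54aaba6c-32ec-11e1-a4e5-00264a13ca4c",
                 "55010912-32ec-11e1-a4e5-00264a13ca4c"))
# /scratch/user/bj-54aaba6c-32ec-11e1-a4e5-00264a13ca4c/sj-55010912-32ec-11e1-a4e5-00264a13ca4c
```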
Yes, it works. However, there are limitations: Kraken requires the user to use aprun to launch jobs, and aprun can only be called once per batch job. BJ's compute unit launch mechanism, which spawns one process per compute unit, is not compatible with aprun.
You can, however, execute a single compute unit concurrently by setting the NUMBER_SUBJOBS variable:
jd.environment=["NUMBER_SUBJOBS=2"]
Very likely, the SAGA C++ adaptor is not correctly configured. If the PBSPro adaptor is used, export the PBS_HOME environment variable; if the Torque adaptor is used, export the TORQUE_HOME environment variable, pointing to the corresponding scheduler installation location. For example:
$ which qsub
/usr/local/bin/qsub
$ export PBS_HOME=/usr/local
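The exported value is simply the qsub path with the trailing bin/qsub stripped off, as in the example above. A small sketch of that derivation (the scheduler_home helper is illustrative):

```python
import os

def scheduler_home(qsub_path):
    """Derive the PBS_HOME/TORQUE_HOME prefix from the full path of qsub."""
    # /usr/local/bin/qsub -> /usr/local/bin -> /usr/local
    return os.path.dirname(os.path.dirname(qsub_path))

print(scheduler_home("/usr/local/bin/qsub"))  # /usr/local
```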
Yes, it is possible. The prerequisites are: 1. SAGA, 2. Globus client tools, 3. adding the cluster's (LONI) CA to the list of trusted certificate authorities. The three steps below set up the LONI certificates; for more information please see https://docs.loni.org/wiki/LONI_Certificates.
cd $HOME/.globus/certificates
wget https://docs.loni.org/mediawiki-1.13.3-docsloni/images/9/9c/a3bf9f3c.0 --no-check-certificate
wget https://docs.loni.org/mediawiki-1.13.3-docsloni/images/d/d9/a3bf9f3c.signing_policy --no-check-certificate