BigJob Tutorial Part 6: Coupled Ensemble Example

pradeepmantha edited this page Aug 22, 2012 · 3 revisions

This page is part of the BigJob Tutorial.


Coupled Ensembles

The script implements a simple workflow: it submits a set of jobs (A) and a set of jobs (B), waits until all of them have completed, and then submits a third set of jobs (C). It demonstrates the synchronization mechanisms provided by the SAGA Pilot-API.

What types of workflows would this be useful for?

This pattern is useful when the executable in set C depends on output generated by the jobs in sets A and B.
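The dependency pattern itself can be tried out without a cluster. The following is a minimal sketch that uses Python's standard `concurrent.futures` module as a stand-in for the Pilot-API; it is not BigJob code, and the names `run_task` and `run_coupled_ensemble` are hypothetical helpers introduced only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, wait

NUMBER_JOBS = 4  # tasks per set, mirroring the tutorial script


def run_task(set_name, index, log):
    """Hypothetical stand-in for a compute unit: record which set ran."""
    log.append(set_name)
    return "%s-%d" % (set_name, index)


def run_coupled_ensemble(number_jobs=NUMBER_JOBS):
    """Run sets A and B, wait for both, then run set C. Returns the run order."""
    log = []  # list.append is safe to call from several threads in CPython
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Submit sets A and B together; their tasks may finish in any order.
        first_stage = [pool.submit(run_task, name, i, log)
                       for name in ("A", "B")
                       for i in range(number_jobs)]
        # Barrier: block until every task of A and B has finished.
        wait(first_stage)
        # Only now submit set C, which may rely on the results of A and B.
        second_stage = [pool.submit(run_task, "C", i, log)
                        for i in range(number_jobs)]
        wait(second_stage)
    return log


if __name__ == "__main__":
    print(run_coupled_ensemble())
```

The order of the A and B entries in the printed list can vary between runs, but every C entry appears after all of them, because set C is not submitted until the first `wait` call returns.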

In your $HOME directory, open a new file coupled_ensemble.py with your favorite editor (e.g., vim) and paste the following content:

import os
import time
import sys
from pilot import PilotComputeService, ComputeDataService, State
    	
### Number of compute units to submit in each task set (A, B, and C)
NUMBER_JOBS=4
COORDINATION_URL = "redis://[email protected]:6379"
    
if __name__ == "__main__":
    
    pilot_compute_service = PilotComputeService(COORDINATION_URL)
    
    pilot_compute_description = { "service_url": "sge://localhost",
                                  "number_of_processes": 12,
                                  "allocation": "XSEDE12-SAGA",
                                  "queue": "development",
                                  "working_directory": os.getenv("HOME")+"/agent",
                                  "walltime":10
                                }
    
    pilot_compute_service.create_pilot(pilot_compute_description=pilot_compute_description)
    
    compute_data_service = ComputeDataService()
    compute_data_service.add_pilot_compute_service(pilot_compute_service)
    
    print ("Finished Pilot-Job setup. Submitting compute units")
    
    # submit a set of CUs, call it A
    for i in range(NUMBER_JOBS):
        compute_unit_description = { "executable": "/bin/echo",
                                     "arguments": ["Hello","$ENV1","$ENV2"],
                                     "environment": ['ENV1=env_arg1','ENV2=env_arg2'],
                                     "number_of_processes": 1,            
                                     "output": "A_stdout.txt",
                                     "error": "A_stderr.txt"
                                    }    
        compute_data_service.submit_compute_unit(compute_unit_description)
    
    
    # submit a set of CUs, call it B
    for i in range(NUMBER_JOBS):
        compute_unit_description = { "executable": "/bin/date",
                                     "arguments": [],
                                     "environment": {},
                                     "number_of_processes": 1,
                                     "output": "B_stdout.txt",
                                     "error": "B_stderr.txt",
                                    }
        compute_data_service.submit_compute_unit(compute_unit_description)
     
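    # Synchronization point: wait() blocks until the CUs of sets A and B
    # have finished, so set C is only submitted afterwards
    print ("Wait for CUs of task set A & B to complete")
    compute_data_service.wait()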
    
    # submit a set of CUs, call it C
    for i in range(NUMBER_JOBS):
        compute_unit_description = { "executable": "/bin/echo",
                                     "arguments": ["Hello","$ENV1","$ENV2"],
                                     "environment": ['ENV1=env_arg1','ENV2=env_arg2'],
                                     "number_of_processes": 1,
                                     "spmd_variation":"single",
                                     "output": "C_stdout.txt",
                                     "error": "C_stderr.txt",
                                    }
        compute_data_service.submit_compute_unit(compute_unit_description)
    
    print ("Wait for CUs of task set C to complete")
    compute_data_service.wait()
    
    print ("Terminate Pilot Jobs")
    compute_data_service.cancel()    
    pilot_compute_service.cancel()

Execute the script with the following command:

python coupled_ensemble.py

Can you verify that the jobs of set C executed after the jobs of sets A and B?

To check, go into the working directory ($HOME/agent in this case), then into the directory named after the pilot-service (bj-####), and then into the compute unit directories associated with that pilot-service (sj-###). Here bj-#### is the unique identifier of each Pilot-Job, and sj-### is the unique identifier of each subjob. When creating your own jobs, you can give the subjob directories more human-readable names.

Check the time stamps of the compute unit output files (A_stdout.txt, B_stdout.txt, and C_stdout.txt in this script).
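Instead of comparing time stamps by hand, the check can be scripted. The sketch below is one possible approach, not part of BigJob: it assumes the output files keep the names set in the script's "output" entries (A_stdout.txt, B_stdout.txt, C_stdout.txt) and are written somewhere below the directory you pass in. The helper names `collect_mtimes` and `c_ran_after_a_and_b` are hypothetical.

```python
import os


def collect_mtimes(root):
    """Map each task set (A, B, C) to the modification times of its stdout files."""
    mtimes = {"A": [], "B": [], "C": []}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            for task_set in mtimes:
                # Assumed file naming: <set>_stdout.txt, as in the tutorial script
                if name == task_set + "_stdout.txt":
                    path = os.path.join(dirpath, name)
                    mtimes[task_set].append(os.path.getmtime(path))
    return mtimes


def c_ran_after_a_and_b(mtimes):
    """True if every C output is at least as new as every A and B output."""
    earlier = mtimes["A"] + mtimes["B"]
    if not earlier or not mtimes["C"]:
        return False  # nothing to compare
    return min(mtimes["C"]) >= max(earlier)


if __name__ == "__main__":
    # Working directory used in the tutorial script: $HOME/agent
    agent_dir = os.path.join(os.path.expanduser("~"), "agent")
    times = collect_mtimes(agent_dir)
    if c_ran_after_a_and_b(times):
        print("Set C finished writing after sets A and B")
    else:
        print("Could not confirm the ordering; inspect the files manually")
```

If you have run the tutorial more than once, point the script at a single bj-#### directory rather than at $HOME/agent, since otherwise output from different runs is mixed. Time stamps may also coincide on file systems with coarse time resolution, which is why the comparison accepts equal values.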