Example Jobs
C example
The Problem and the Code
Consider the following simple job that is well suited to a cluster: comparing independent Monte Carlo calculations of π. The following C program randomly samples points within a square bounding a circle. (The probability of landing inside the circle can be shown to be π/4.) <syntaxhighlight lang="c">
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[]) {
    int i, N, incirc = 0;
    double x, y, circrad2;

    // get iteration number from input
    if (argc < 2 || sscanf(argv[1], "%d", &N) != 1 || N <= 0) {
        fprintf(stderr, "usage: %s N\n", argv[0]);
        return 1;
    }
    srand(time(NULL));                         // seed random number generator

    circrad2 = 1.0 * RAND_MAX;
    circrad2 *= circrad2;                      // define radius squared

    for (i = 0; i < N; i++) {
        x = 1.0 * rand();
        y = 1.0 * rand();                      // get random point and
        incirc += (x * x + y * y) < circrad2;  // check if inside circle
    }

    printf("pi=%.12f\n", 4.0 * incirc / N);    // display probability
    return 0;
} </syntaxhighlight>
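The π/4 probability quoted above follows from an area ratio. As a short sketch, write r for the circle radius (RAND_MAX in the program) and note that the program samples only the quadrant with x, y ≥ 0, so the square has side r and contains a quarter of the circle:

```latex
P(\text{inside})
  = \frac{\text{area of quarter circle}}{\text{area of square}}
  = \frac{\tfrac{1}{4}\pi r^{2}}{r^{2}}
  = \frac{\pi}{4},
\qquad
\hat{\pi} = 4\,\frac{N_{\text{inside}}}{N}
```

Here N_inside corresponds to the counter incirc and N to the iteration count read from the command line.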
Compiling this program (saved, for example, as calcpi.c) with <syntaxhighlight lang="bash"> gcc calcpi.c -o calcpi </syntaxhighlight> yields an executable calcpi that is ready for submission.
Preparation for Job Submission
To prepare the job execution space and inform Condor of the appropriate run environment, create a job description file (e.g. calcpi.condor)
Executable = calcpi
Requirements = ParallelSchedulingGroup == "stats group"
Universe = vanilla
output = calcpi$(Process).out
error = calcpi$(Process).err
Log = calcpi.log
Arguments = 100000000
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Queue 50
The last line specifies that 50 instances should be scheduled on the cluster. The description file specifies the executable and the arguments passed to it during execution. (In this case we are requesting that all instances iterate 10^8 times in the program's sampling loop.) The Requirements field insists that the job stay on the Statistics Cluster. (All statistics nodes are labeled with "stats group" in their Condor ClassAds.) The output and error files are the targets for the standard output and standard error streams, respectively. The log file is used by Condor to record the progress of job processing in real time. Note that this setup labels output files by process number to prevent one job instance from overwriting files belonging to another. The current values imply that all files are to be found in the same directory as the description file.
The Universe variable specifies the Condor runtime environment. For the purposes of these independent jobs, the simplest "vanilla" universe suffices. More complicated parallel tasks (with checkpointing and migration, MPI calls, etc.) employ more advanced runtime environments, which often require specialized linking of the binaries. The lines specifying transfer settings are important because they avoid any assumptions about accessibility over NFS. They should be included whether or not any output files (aside from standard output and error) are necessary.
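One caveat with the C program above: it seeds its generator with srand(time(NULL)), and time() has a resolution of one second, so instances that start within the same second may receive the same seed and return identical estimates. The following is a minimal sketch of one way around this, not part of the original example. It assumes the process number is passed as a second argument (e.g. Arguments = 100000000 $(Process)); the helpers make_seed and seed_from_args and the mixing constant are hypothetical, illustrative choices.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: combine a base seed (e.g. the start time) with the
 * Condor process number so that instances started in the same second
 * still get distinct seeds. */
unsigned int make_seed(unsigned int base, unsigned int procid) {
    /* Multiplying by an odd constant maps distinct process numbers to
     * distinct, well-separated values; XOR then folds in the base. */
    return base ^ ((procid + 1u) * 2654435761u);
}

/* Hypothetical helper: seed rand() from the current time and from
 * argv[2], which is ASSUMED to carry $(Process). Falls back to process
 * number 0 when the argument is absent. Returns the seed that was used. */
unsigned int seed_from_args(int argc, char *argv[]) {
    unsigned int procid = 0;
    if (argc > 2)
        procid = (unsigned int)strtoul(argv[2], NULL, 10);
    unsigned int seed = make_seed((unsigned int)time(NULL), procid);
    srand(seed);
    return seed;
}
```

Under these assumptions, calling seed_from_args(argc, argv) would take the place of srand(time(NULL)) in the program.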
Job Submission and Management
While logged in on stats, submit the job with: <syntaxhighlight lang="bash"> condor_submit calcpi.condor </syntaxhighlight> The cluster can be queried before or after submission to check its availability. Two versatile commands exist for this purpose: condor_status and condor_q. The former returns the status of the nodes (broken down by virtual machines, each of which can handle one job instance). The latter shows the job queue, including the individual instances of every job and their status (e.g. idle, running or held). Using condor_q a few seconds after submission shows:
-- Submitter: stat31.phys.uconn.edu : <192.168.1.41:44831> : stat31.phys.uconn.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  33.3   prod            1/30 15:37   0+00:00:02 R  0   9.8  calcpi 100000000
  33.4   prod            1/30 15:37   0+00:00:00 R  0   9.8  calcpi 100000000
  33.5   prod            1/30 15:37   0+00:00:00 R  0   9.8  calcpi 100000000
  33.6   prod            1/30 15:37   0+00:00:00 R  0   9.8  calcpi 100000000
  33.7   prod            1/30 15:37   0+00:00:00 R  0   9.8  calcpi 100000000
  33.8   prod            1/30 15:37   0+00:00:00 R  0   9.8  calcpi 100000000

6 jobs; 0 idle, 6 running, 0 held
By this time, only 6 jobs are left on the cluster, all with status 'R' (running). Various statistics are given, including a job ID number. This handle is useful if intervention is required, such as manual removal of frozen job instances from the cluster. Comparing the results (e.g. with the command cat calcpi*.out) shows
...
pi=3.141215440000
pi=3.141447360000
pi=3.141418120000
pi=3.141797520000
...
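To compare the instances quantitatively, the individual estimates can be reduced to a mean and a sample variance. The following is a minimal sketch, not part of the original example. The function name summarize is hypothetical, and it assumes one estimate per line, either in the pi=... form printed by calcpi or as a bare number.

```c
#include <stdio.h>

/* Hypothetical helper: read one estimate per line from `in`, accepting
 * either "pi=3.14..." or a bare number, and compute the mean and the
 * unbiased sample variance. Returns the number of values read. */
int summarize(FILE *in, double *mean, double *variance) {
    char line[256];
    double v, m = 0.0, m2 = 0.0;
    int n = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (sscanf(line, " pi=%lf", &v) != 1 && sscanf(line, "%lf", &v) != 1)
            continue;               /* skip lines that hold no estimate */
        n++;
        /* Welford's running update of the mean and of the sum of
         * squared deviations from the mean */
        double delta = v - m;
        m += delta / n;
        m2 += delta * (v - m);
    }
    *mean = m;
    *variance = (n > 1) ? m2 / (n - 1) : 0.0;
    return n;
}
```

A small main that calls summarize(stdin, &mean, &variance) and prints the results could then be fed the output files, e.g. cat calcpi*.out | ./avgpi, where avgpi is a hypothetical name for the compiled program. The square root of the variance gives the standard deviation of the estimates.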
R example
The Problem and the Code
Consider the following simple job that is well suited to a cluster: independent Monte Carlo calculations of π. The following R program randomly samples points within a square bounding a circle. (The probability of landing inside the circle can be shown to be π/4.)
#!/home/statsadmin/R/bin/Rscript

# Prepare: collect command line arguments,
# set iteration number and a unique seed
args <- commandArgs()
set.seed(Sys.time())
n <- as.numeric(args[length(args)-1])

# Collect n samples
x <- runif(n)
y <- runif(n)

# Compute and output the value of pi
pihat <- sum(x * x + y * y < 1) / n * 4
pihat
write(pihat, args[length(args)])

proc.time()
Let us save this script as calcpi.R. Note the very important first line of this script. Without it, executing the script would require a command like Rscript calcpi.R. Specifying the location of the interpreter in the first line after '#!' and adding permission to execute the script with the command:
chmod a+x calcpi.R
greatly simplifies the handling of this program, which is especially useful when submitting it to the cluster.
Preparation for Job Submission
To prepare the job execution space and inform Condor of the appropriate run environment, create a job description file (e.g. Rcalcpi.condor)
executable = calcpi.R
universe = vanilla
Requirements = ParallelSchedulingGroup == "stats group"
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
arguments = 10000000 pihat-$(Process).dat
output = pi-$(Process).Rout
error = pi-$(Process).err
log = pi.log
Queue 50
The last line specifies that 50 instances should be scheduled on the cluster. The description file specifies the executable, an independent process universe called "vanilla" and a requirement that the job be confined to the Statistics Cluster. Next, the important "transfer files" parameters specify that any necessary input files (not relevant here) should be transferred to the execution nodes and that all files generated by the program should be transferred back to the launch directory. (These settings avoid any assumptions about directory accessibility over NFS.)
The arguments passed to the executable are just what the script expects: the iteration number and the output file name. The output, error and log parameters name the target files for stdout, stderr and the Condor job log, respectively. Note the unique labeling of these files according to the associated process with the $(Process) placeholder.
Job Submission and Management
The job is submitted with: <syntaxhighlight lang="bash"> condor_submit Rcalcpi.condor </syntaxhighlight> The cluster can be queried before or after submission to check its availability. Two versatile commands exist for this purpose: condor_status and condor_q. The former returns the status of the nodes (broken down by virtual machines, each of which can handle one job instance). The latter shows the job queue, including the individual instances of every job and their status (e.g. idle, running or held). Using condor_q some time after submission shows:
-- Submitter: stat31.phys.uconn.edu : <192.168.1.41:44831> : stat31.phys.uconn.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   7.0   stattestusr     3/25 15:03   0+00:00:00 R  0   9.8  calcpi.R 10000000
   7.6   stattestusr     3/25 15:03   0+00:00:04 R  0   9.8  calcpi.R 10000000
   7.10  stattestusr     3/25 15:03   0+00:00:00 R  0   9.8  calcpi.R 10000000
   7.28  stattestusr     3/25 15:03   0+00:00:00 R  0   9.8  calcpi.R 10000000
   7.45  stattestusr     3/25 15:03   0+00:00:00 R  0   9.8  calcpi.R 10000000
   7.49  stattestusr     3/25 15:03   0+00:00:00 R  0   9.8  calcpi.R 10000000

6 jobs; 0 idle, 6 running, 0 held
By this time, only 6 jobs are left on the cluster, all with status 'R' (running). Various statistics are given, including a job ID number. This handle is useful if intervention is required, such as manual removal of frozen job instances from the cluster. The command condor_rm 7.28 would remove just that instance, whereas condor_rm 7 would remove the entire job. Comparing the results (e.g. with the command cat pihat-*.dat) shows
...
3.141672
3.141129
3.14101
3.142149
3.141273
...
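The scatter among these estimates can be checked against what binomial sampling predicts. As a sketch, each sample lands inside the circle with probability p = π/4, so for n samples the standard deviation of the estimate is:

```latex
\sigma_{\hat{\pi}} = 4\sqrt{\frac{p\,(1-p)}{n}}, \qquad p = \frac{\pi}{4}
\quad\Longrightarrow\quad
\sigma_{\hat{\pi}} \approx \frac{1.64}{\sqrt{n}}
```

For n = 10^7 this gives about 5 × 10^-4, in line with the spread of the values listed above. For the 10^8 iterations used in the C example it gives about 1.6 × 10^-4.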
Acknowledgement
Examples provided by Igor Senderovich