== Job policy ==
    
=== ClassAds ===
 
The Statistics Cluster is equipped with a powerful job queuing system called [http://research.cs.wisc.edu/htcondor/ Condor]. This framework makes efficient use of resources by matching user needs to the available machines, taking into account both the priorities for the hardware and the preferences of the job. Matching resource requests to resource offers is accomplished through the <b><i>ClassAds</i></b> mechanism: each virtual machine publishes its parameters as a kind of <u>class</u>ified <u>ad</u>vertisement to attract jobs, and a job submitted to Condor for scheduling may list its own requirements and preferences.
 
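For illustration (the memory threshold and the <tt>Rank</tt> expression here are hypothetical examples, not site policy), a job can state a hard requirement and a soft preference in its submit file like this:

<pre>Requirements = (ParallelSchedulingGroup == "stats group") && (Memory >= 1024)
Rank = Memory</pre>

<tt>Requirements</tt> must be satisfied for a machine to match at all; <tt>Rank</tt> only orders the machines that already match, preferring larger values.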
=== AccountingGroup ===
Four types of jobs can be submitted to the cluster:
{| class="wikitable" style="text-align: center; background: #A9A9A9"
| style="background: #f9f9f9; width: 200px" | <b>Job Type</b>
| style="background: #f9f9f9; width: 200px" | <b>Resource Quota</b>
| style="background: #f9f9f9; width: 200px" | <b>Maximum Runtime</b>
| style="background: #f9f9f9; width: 400px" | <b>Line Required in Submit File</b>
|-
| style="background: #f9f9f9; width: 200px" | standardjob
| style="background: #f9f9f9; width: 200px" | 450
| style="background: #f9f9f9; width: 200px" | 24 hours
| style="background: #f9f9f9; width: 400px" | default, no additional line in submit file
|-
| style="background: #f9f9f9; width: 200px" | longjob
| style="background: #f9f9f9; width: 200px" | 250
| style="background: #f9f9f9; width: 200px" | 48 hours
| style="background: #f9f9f9; width: 400px" | +AccountingGroup = "group_statistics_longjob.username"
|-
| style="background: #f9f9f9; width: 200px" | shortjob
| style="background: #f9f9f9; width: 200px" | 250
| style="background: #f9f9f9; width: 200px" | 8 hours
| style="background: #f9f9f9; width: 400px" | +AccountingGroup = "group_statistics_shortjob.username"
|-
| style="background: #f9f9f9; width: 200px" | testjob
| style="background: #f9f9f9; width: 200px" | 50
| style="background: #f9f9f9; width: 200px" | 20 minutes
| style="background: #f9f9f9; width: 400px" | +AccountingGroup = "group_statistics_testjob.username"
|}
*When jobs are submitted to the cluster, Condor assigns resources to satisfy each group's resource quota. If 1000 jobs of each group are submitted, each group fills its quota and the remaining jobs wait in the queue for the next available resource.

*To prevent users from holding onto resources, maximum runtimes are enforced. When a job runs beyond its maximum runtime, a job waiting in the queue may preempt it.

*Jobs may also be preempted if one group is over quota and new jobs from a different group are submitted. The new group can preempt the extra jobs up to its own quota, without reducing the running group's quota.

<b>Users are expected to adjust their jobs to meet these run time requirements.</b>
    
=== User Priority ===
 
When jobs are submitted, Condor must allocate available resources to the requesting users. In addition to honoring the accounting groups, it does so using a value called <i>userprio</i> (user priority). The lower the <i>userprio</i> value, the higher the user's priority; for example, a user with <i>userprio</i> 5 has a higher priority than a user with <i>userprio</i> 50. Condor continuously recalculates the share of available machines each user should be allocated, based on that user's recent resource use. If a user holds more machines than this fair share, their <i>userprio</i> worsens by increasing over time; if they hold fewer, it improves by decreasing over time. This is how Condor fairly distributes machine resources among users.
    
On the stats cluster, each student and faculty member is given a <i>priority factor</i> of 1000, which is used to calculate the user's effective priority. Any non-UConn user of the cluster has a priority factor of 2000, so that priority is given to UConn users. As users claim machines, their effective priority adjusts accordingly.
 
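Current priorities can be inspected with Condor's <tt>condor_userprio</tt> tool (run from a submit node; the exact columns displayed depend on the Condor version installed):

<pre>condor_userprio -allusers</pre>

The effective priority reported there is the usage-based priority scaled by the user's priority factor, so lower numbers still mean better standing.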
== Job Submission ==
    
=== Submit File ===
 
</pre>
 
A simple submit file for the standard group (the default) goes as follows:
    
<pre>Requirements = ParallelSchedulingGroup == "stats group"
 
when_to_transfer_output = ON_EXIT
 
on_exit_remove = (ExitCode =?= 0)
transfer_output_remaps = "<default_output_filename> = /home/<username>/jobs/<updated_output_path_and_filename>"
    
Queue 50</pre>
Make sure that the <b>last line</b> in your submit file is "Queue <number>".
Most of the variables are self-explanatory. The <b>executable</b> is a path to the program binary or executable script. The shown use of the <b>requirements</b> variable is important here to constrain job assignment to Statistics Cluster nodes only. All available nodes are tagged with the <i>ParallelSchedulingGroup</i> variable in their ClassAds, so this is an effective way to direct execution to particular cluster segments. The <b>output</b>, <b>error</b>, and <b>log</b> variables create the respective records for each job, numbered by Condor with the <i>$(Process)</i> variable. A detailed example of a job is available [http://gryphn.phys.uconn.edu/statswiki/index.php/Example_Jobs here].
If your job requires input from another file, the following can be added above the output line:
 
<pre>
 
input = input.file
</pre>
For optimal allocation of resources, <b><i>serial jobs ought to be submitted to Condor as well</i></b>. This is accomplished by omitting the number of job instances, leaving only the directive <i>Queue</i> on the last line of the job description file outlined above. The <i>$(Process)</i> placeholder is then no longer necessary, since there is no enumeration of output files.
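As a sketch (the executable name and file names are placeholders), a serial job's submit file differs from the example above only in dropping <i>$(Process)</i> and ending with a bare <i>Queue</i>:

<pre>Requirements = ParallelSchedulingGroup == "stats group"
Universe   = vanilla
Executable = myprog

output    = myprog.out
error     = myprog.err
Log       = myprog.log

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

Queue</pre>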
=== AccountingGroup Example ===
A simple submit file for the testjob group goes as follows:
<pre>Requirements = ParallelSchedulingGroup == "stats group"
+AccountingGroup = "group_statistics_testjob.username"
Universe   = vanilla
Executable = myprog
Arguments  = $(Process)
request_cpus = 1

output    = myprog-$(Process).out
error     = myprog-$(Process).err
Log       = myprog.log

transfer_input_files = myprog
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
on_exit_remove = (ExitCode =?= 0)
transfer_output_remaps = "<default_output_filename> = /home/<username>/jobs/<updated_output_path_and_filename>"

Queue 50</pre>
Remember to replace ".username" with your stats cluster username. This sample submit script can be used for shortjob and longjob groups by replacing "testjob" with either "shortjob" or "longjob".
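Assuming the description above is saved as <tt>testjob.sub</tt> (the filename is arbitrary), the job cluster is then submitted and monitored with:

<pre>condor_submit testjob.sub
condor_q</pre>

<tt>condor_q</tt> lists your queued and running jobs, which is a quick way to confirm the submission was accepted.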