Difference between revisions of "Held Job Troubleshooting"
(Created page with "== Overview == There exist various administrative scripts which run on the cluster automatically. If you find that your job has been held, designated by 'H' as a status, pleas...") |
|||
Line 6: | Line 6: | ||
<pre> | <pre> | ||
− | condor_q -l <jobid> | + | condor_q -l <jobid> | grep HoldReason |
</pre> | </pre> | ||
== Hold Reasons == | == Hold Reasons == | ||
=== Over Maximum Run Count === | === Over Maximum Run Count === | ||
+ | The ClassAd <i>HoldReason</i> states | ||
+ | <pre> | ||
+ | <user> job <jobid> removed because its RunCount # > 99 | ||
+ | </pre> | ||
+ | |||
+ | This means that your job has already started 99 times and is attempting to start again. Typically, this indicates a problem with the job, and the job should be removed. Examine the code to find out why it continually fails. | ||
=== Used More Memory Than Requested === | === Used More Memory Than Requested === | ||
=== Used More Memory Than Slot Provided === | === Used More Memory Than Slot Provided === | ||
=== Used More Disk Than Requested === | === Used More Disk Than Requested === |
Revision as of 20:26, 5 April 2017
Overview
Various administrative scripts run on the cluster automatically. If you find that your job has been held, shown by a status of 'H', use the following guidelines to understand why.
Viewing Job ClassAds
When Condor holds a job, the ClassAd 'HoldReason' is set to explain the cause. To see the HoldReason ClassAd of a job, use the command
condor_q -l <jobid> | grep HoldReason
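A plain grep for HoldReason also matches related attributes whose names begin with the same text, such as HoldReasonCode. The following is a minimal sketch, not part of the cluster tooling, of a filter that prints only the HoldReason value. The function name hold_reason_from_ad and the sample ClassAd text (user name, job id, and numbers) are made up for illustration.

```shell
# Extract the value of the HoldReason attribute from "condor_q -l" style output.
# Reads ClassAd text on stdin and prints the reason without the surrounding quotes.
# (hold_reason_from_ad is a hypothetical helper name, not a Condor command.)
hold_reason_from_ad() {
    # Anchor on the exact attribute name so HoldReasonCode etc. are skipped.
    sed -n 's/^HoldReason = "\(.*\)"$/\1/p'
}

# Illustrative sample only; the user, job id, and values are placeholders.
sample_ad='HoldReasonCode = 1
HoldReason = "alice job 1234.0 removed because its RunCount 100 > 99"
JobStatus = 5'

printf '%s\n' "$sample_ad" | hold_reason_from_ad

# On the cluster, the same filter would be fed from condor_q, e.g.:
#   condor_q -l 1234.0 | hold_reason_from_ad
```

Depending on the HTCondor version installed, `condor_q -hold` may also list held jobs together with their hold reasons.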
Hold Reasons
Over Maximum Run Count
The ClassAd HoldReason states
<user> job <jobid> removed because its RunCount # > 99
This means that your job has already started 99 times and is attempting to start again. Typically, this indicates a problem with the job, and the job should be removed. Examine the code to find out why it continually fails.
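As a rough sketch of acting on this hold reason, the helper below prints a suggested command for a held job based on its HoldReason text. It executes nothing itself. The function name suggest_action and the sample job id and message are hypothetical; condor_rm is the standard Condor command for removing a job from the queue.

```shell
# Print a suggested next step for a held job, given its id and HoldReason text.
# Nothing is executed; the function only prints the command to consider.
# (suggest_action is a hypothetical helper name used for illustration.)
suggest_action() {
    job_id=$1
    reason=$2
    case "$reason" in
        *RunCount*)
            # Restart limit reached: removing the job is the usual outcome.
            echo "condor_rm $job_id"
            ;;
        *)
            echo "no suggestion for job $job_id; read the HoldReason text"
            ;;
    esac
}

# Illustrative call with a made-up job id and message.
suggest_action 1234.0 "alice job 1234.0 removed because its RunCount 100 > 99"
```

Printing the command instead of running it leaves the decision with the user, who should first work out why the job keeps restarting.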