Held Job Troubleshooting
Overview
Several administrative scripts run automatically on the cluster. If your job has been held (shown with the status 'H'), use the following guidelines to find out why.
Viewing Job ClassAds
When Condor holds a job, the ClassAd attribute 'HoldReason' is set to explain the cause. To see it, list the job's ClassAds and filter for HoldReason with the command
condor_q -l <jobid> | grep HoldReason
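As a minimal sketch, the filter above can be wrapped in a small shell function that prints only the text of the hold reason. The function name hold_reason is hypothetical, and the sketch assumes the long-format output contains a line of the form HoldReason = "...".

```shell
# Hypothetical helper: read ClassAd text (as printed by `condor_q -l`) on stdin
# and print the value of the HoldReason attribute without the surrounding quotes.
# Lines for other attributes, such as HoldReasonCode, do not match the pattern.
hold_reason() {
    sed -n 's/^HoldReason = "\(.*\)"$/\1/p'
}

# Usage on the cluster (replace <jobid> with the job's ID):
#   condor_q -l <jobid> | hold_reason
```

If the job is not held, the attribute is absent and the function prints nothing.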
Hold Reasons
Over Maximum Run Count
The ClassAd HoldReason states
<user> job <jobid> removed because its RunCount # > 99
This means that your job has already started 99 times and is attempting to start again. Typically, this indicates a problem with the job, and the job should be removed. Examine the code to find out why it continually fails.
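A held job can be taken out of the queue with condor_rm. The sketch below checks whether a hold reason mentions the run-count limit before removing the job. The function name is_runcount_hold is hypothetical, and the check assumes the message contains the word RunCount, as in the example above.

```shell
# Hypothetical check: succeed if a HoldReason string reports the run-count limit.
is_runcount_hold() {
    case "$1" in
        *RunCount*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the cluster (replace <jobid> with the job's ID):
#   reason=$(condor_q -l <jobid> | grep HoldReason)
#   is_runcount_hold "$reason" && condor_rm <jobid>
```

Removing the job does not fix the underlying failure, so check the job's output and error files before resubmitting.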