Held Job Troubleshooting
Overview
Various administrative scripts run automatically on the cluster. If your job has been held, shown by a status of 'H', use the following guidelines to understand why.
Viewing Job ClassAds
When Condor holds a job, the ClassAd 'HoldReason' is set to explain the cause. To see it, list the ClassAds of the job and filter for HoldReason with the command
condor_q -l <jobid> | grep HoldReason
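The same lookup can be scripted. The sketch below is a hypothetical helper, not part of Condor. It assumes the long-format output of condor_q -l prints one "Name = value" pair per line with string values in double quotes. The sample text and its values are invented for illustration.

```python
import re

def extract_hold_reason(classad_text):
    """Return the HoldReason string from `condor_q -l` style text, or None."""
    for line in classad_text.splitlines():
        # Match only the HoldReason attribute itself, not attributes that
        # merely start with the same prefix (for example HoldReasonCode).
        match = re.match(r'\s*HoldReason\s*=\s*"(.*)"\s*$', line)
        if match:
            return match.group(1)
    return None

# Invented sample input; real output contains many more attributes.
sample = '''JobStatus = 5
HoldReasonCode = 3
HoldReason = "alice job 123.0 removed because its RunCount 100 > 99"
'''
print(extract_hold_reason(sample))
```

Note that a plain grep for HoldReason also prints any attribute whose name begins with HoldReason; the anchored pattern above keeps only the reason text.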
Hold Reasons
Over Maximum Run Count
The ClassAd HoldReason states
<user> job <jobid> removed because its RunCount # > 99
This means that your job has started # times, which is more than the maximum of 99 starts allowed. Typically, this indicates a problem with the job, and the job should be removed. Examine the code to find out why it continually fails.
Used More Memory Than Requested
The ClassAd HoldReason states
<user> job <jobid> removed because its MemoryUsage # > 1200 and # > <RequestedMemory> * 1.2
This means that your job used more memory than the default minimum memory (1200) and also more than 1.2 times the requested memory. Both conditions must hold. If a user does not explicitly request memory, Condor calculates the requested value from a formula.
The user should either
- Request memory slightly larger than the used memory OR
- Alter the code to produce a smaller memory footprint. This might involve breaking the code into smaller steps
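As a worked illustration of the two conditions in this hold message, here is a minimal sketch. The function name is hypothetical, the constants are copied from the message, and the values are assumed to be in the same unit Condor reports for memory (normally megabytes, though this page does not say).

```python
DEFAULT_MIN_MEMORY = 1200  # floor taken from the HoldReason message
SCALE = 1.2                # factor taken from the HoldReason message

def exceeds_requested_memory(memory_usage, requested_memory):
    """True when both conditions in the hold message are met."""
    return (memory_usage > DEFAULT_MIN_MEMORY
            and memory_usage > requested_memory * SCALE)

# Requested 2000, so the scaled limit is 2000 * 1.2 = 2400.
print(exceeds_requested_memory(2500, 2000))  # above both 1200 and 2400
print(exceeds_requested_memory(2300, 2000))  # above 1200 but below 2400
print(exceeds_requested_memory(1100, 500))   # above 600 but below the 1200 floor
```

Only the first call meets both conditions. In that case, requesting somewhat more than the 2500 actually used would keep the job under the scaled limit.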
Used More Memory Than Slot Memory Allocation
The ClassAd HoldReason states
<user> job <jobid> removed because its MemoryUsage # > 1200 and # > <SlotMemory> * 1.2 + 500
This means that your job used more memory than the default minimum memory (1200) and also more than 1.2 times the allocated slot memory plus 500. Both conditions must hold.
The user should either
- Request memory slightly larger than the used memory OR
- Alter the code to produce a smaller memory footprint. This might involve breaking the code into smaller steps
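Because both conditions in this message must hold, the job is held only when its usage exceeds the larger of the two limits. The sketch below computes that effective limit for a given slot size. The function names are hypothetical and the constants are copied from the message; units are assumed to match whatever Condor reports for memory.

```python
def slot_memory_limit(slot_memory):
    """Largest usage that does not trigger the slot-memory hold."""
    # Held only if usage > 1200 AND usage > slot_memory * 1.2 + 500,
    # which is the same as usage > max of the two limits.
    return max(1200, slot_memory * 1.2 + 500)

def exceeds_slot_memory(memory_usage, slot_memory):
    return memory_usage > slot_memory_limit(slot_memory)

# Slot of 2000: limit is 2000 * 1.2 + 500 = 2900.
print(slot_memory_limit(2000))
print(exceeds_slot_memory(3000, 2000))  # over the 2900 limit
print(exceeds_slot_memory(2800, 2000))  # under the 2900 limit
```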
Used More Disk Than Requested
The ClassAd HoldReason states
<user> job <jobid> removed because its RequestDisk # > 12000000 and # > <RequestedDisk> * 1.2
This means that your job used more disk than the default minimum disk space (12000000) and also more than 1.2 times the requested disk. Both conditions must hold. If a user does not explicitly request disk, Condor calculates the requested value from a formula.
The user should either
- Request disk slightly larger than the used disk OR
- Alter the code to use less disk space.
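The disk check follows the same two-condition pattern with a different floor. A minimal sketch, with a hypothetical function name and constants copied from the message; the unit is assumed to be the one Condor reports for disk (normally kilobytes, though this page does not say).

```python
DEFAULT_MIN_DISK = 12_000_000  # floor taken from the HoldReason message
SCALE = 1.2                    # factor taken from the HoldReason message

def exceeds_requested_disk(disk_usage, requested_disk):
    """True when both conditions in the hold message are met."""
    return (disk_usage > DEFAULT_MIN_DISK
            and disk_usage > requested_disk * SCALE)

# Requested 15,000,000, so the scaled limit is 18,000,000.
print(exceeds_requested_disk(20_000_000, 15_000_000))  # over both limits
print(exceeds_requested_disk(17_000_000, 15_000_000))  # under the scaled limit
print(exceeds_requested_disk(10_000_000, 5_000_000))   # under the 12,000,000 floor
```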