Held Job Troubleshooting
Revision as of 20:34, 5 April 2017
Overview
Various administrative scripts run automatically on the cluster. If you find that your job has been held, shown by a status of 'H', use the following guidelines to understand why.
Viewing Job ClassAds
When Condor holds a job, the ClassAd attribute 'HoldReason' can be set to explain the cause. To see the HoldReason of a job, use the command:
condor_q -l <jobid> | grep HoldReason
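The same lookup can be done in a script. The following is a minimal sketch, assuming the long-form output contains a line of the form HoldReason = "..."; the function name extract_hold_reason and the sample text (user, job id, run count) are hypothetical and only for illustration.

```python
import re

def extract_hold_reason(classad_text):
    """Return the quoted HoldReason value from long-form ClassAd text, or None."""
    # Assumption: the attribute appears on its own line as: HoldReason = "text"
    # Requiring "=" right after the name skips related attributes such as HoldReasonCode.
    match = re.search(r'^HoldReason\s*=\s*"(.*)"\s*$', classad_text, re.MULTILINE)
    return match.group(1) if match else None

# Hypothetical sample resembling the output of: condor_q -l <jobid>
sample = (
    'JobStatus = 5\n'
    'HoldReasonCode = 1\n'
    'HoldReason = "alice job 12.0 removed because its RunCount 100 > 99"\n'
)
print(extract_hold_reason(sample))
```

Unlike a plain grep for HoldReason, this pattern returns only the message text and ignores other attributes whose names merely begin with HoldReason.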
Hold Reasons
Over Maximum Run Count
The ClassAd HoldReason states
<user> job <jobid> removed because its RunCount # > 99
This means that your job has started # times, which is more than the maximum of 99 allowed starts. This typically indicates a problem with the job, and the job should be removed. Examine the code to find out why it continually fails.
Used More Memory Than Requested
The ClassAd HoldReason states
<user> job <jobid> removed because its MemoryUsage # > 1200 and # > <RequestedMemory> * 1.2
This means that your job used more memory than the default minimum of 1200 and also used more than 1.2 times the requested memory. Both conditions must hold for the job to be held. If a user does not explicitly request memory, the requested value is calculated by a formula in Condor.
The user should either:
- Request slightly more memory than the job actually used, OR
- Alter the code to produce a smaller memory footprint. This might involve breaking the code into smaller steps.
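The two-part condition in this hold reason can be checked by hand. The sketch below restates it in Python using the constants from the message (1200 and 1.2); the function name and the sample numbers are hypothetical, and the values are assumed to be in the same units that MemoryUsage reports.

```python
MIN_MEMORY = 1200  # default minimum taken from the hold message
SCALE = 1.2        # scale factor applied to the requested memory

def memory_hold_triggered(memory_usage, requested_memory):
    """True when usage exceeds both the minimum and the scaled request."""
    return memory_usage > MIN_MEMORY and memory_usage > requested_memory * SCALE

print(memory_hold_triggered(2500, 2000))  # True: 2500 > 1200 and 2500 > 2400
print(memory_hold_triggered(2300, 2000))  # False: 2300 is not above 2400
print(memory_hold_triggered(1100, 500))   # False: 1100 is not above 1200
```

The last case shows why a small job is not held even when it far exceeds a small request: usage must also pass the default minimum.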
Used More Memory Than Slot Memory Allocation
The ClassAd HoldReason states
<user> job <jobid> removed because its MemoryUsage # > 1200 and # > <SlotMemory> * 1.2 + 500
This means that your job used more memory than the default minimum of 1200 and also used more than 1.2 times the allocated slot memory plus 500.
The user should either:
- Request slightly more memory than the job actually used, OR
- Alter the code to produce a smaller memory footprint. This might involve breaking the code into smaller steps.
Used More Disk Than Requested
The ClassAd HoldReason states
<user> job <jobid> removed because its RequestDisk # > 12000000 and # > <RequestedDisk> * 1.2
This means that your job used more disk than the default minimum of 12000000 and also used more than 1.2 times the requested disk. If a user does not explicitly request disk, the requested value is calculated by a formula in Condor.
The user should either:
- Request slightly more disk than the job actually used, OR
- Alter the code to use less disk space.