Changes

1,021 bytes added, 23:03, 24 February 2017

no edit summary

Line 26: Line 26:

condor_q | less

</pre>

+

== Held job? ==

+

If you check the status of your job and find that it has been held, there are a couple of possible causes.
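As a minimal sketch of how to see which jobs are held and why, the helper below wraps <code>condor_q -hold</code>, which lists held jobs together with their hold reasons. The function name <code>list_held_jobs</code> is made up for this example, and the output layout depends on the HTCondor version installed on the cluster.

```shell
# List held jobs for one user, along with the reason each job was held.
# list_held_jobs is a hypothetical helper name, not part of HTCondor.
list_held_jobs() {
    # Default to the current user when no user name is given.
    owner="${1:-$(id -un)}"
    condor_q -hold "$owner"
}

# Usage:
#   list_held_jobs             # your own held jobs
#   list_held_jobs someuser    # "someuser" is a placeholder user name
```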

+

=== Memory thrashing ===

+

Your job is using more memory than you requested, which forces the server to use swap (hard disk) instead of memory. This thrashes the system and degrades performance for everything running on that machine. A script runs periodically to check for such jobs; any job found thrashing a machine is held.

+

Check the log file for the job ID in question and find the entry saying that your job was held and that it was checkpointed. It should list the disk and memory your job used and requested. If the disk usage is significantly larger than what you requested, the job was using swap.
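The log check described above can be sketched with <code>grep</code>. This assumes the job log contains the event text "Job was evicted" or "Job was held" and that the usage/request table follows within a few lines of it; the exact layout may differ on your HTCondor version. The function name <code>show_hold_events</code> and the default of 12 context lines are choices made for this example, and <code>-A</code> requires GNU or BSD grep.

```shell
# Print every "evicted" or "held" event from a job log, plus the lines
# that follow it (where the disk and memory table is expected to be).
# show_hold_events is a hypothetical helper name.
show_hold_events() {
    logfile="$1"
    context="${2:-12}"   # trailing lines to show per event (assumed default)
    grep -E -A "$context" 'Job was (evicted|held)' "$logfile"
}

# Usage (myjob.log is a placeholder for the log file named in your submit file):
#   show_hold_events myjob.log
#   show_hold_events myjob.log 20
```

In the output, compare the used and requested values on the disk and memory rows.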

+

=== Submitted too many jobs ===

+

The cluster can only negotiate a limited number of queued jobs at one time. If you submit tens of thousands of jobs, they will be held.

+

Never submit more than 5,000 jobs at a time. Remember that other users may also be submitting many jobs.
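One way to stay under the 5,000-job limit is to count what you already have queued before submitting more. The sketch below is an assumption-laden example, not a cluster-provided tool: it reads the limit conservatively (queued jobs plus new jobs must not exceed 5,000), and the names <code>MAX_QUEUED</code>, <code>queued_jobs</code> and <code>can_submit</code> are invented here. It relies on <code>condor_q -format</code> printing one line per job.

```shell
MAX_QUEUED=5000   # the limit stated above

# Count the jobs a user currently has in the queue (one output line per job).
queued_jobs() {
    owner="${1:-$(id -un)}"
    condor_q "$owner" -format "%d\n" ClusterId | wc -l
}

# Succeed only if adding $1 new jobs keeps the current user at or below the limit.
can_submit() {
    new_jobs="$1"
    current=$(queued_jobs)
    [ $((current + new_jobs)) -le "$MAX_QUEUED" ]
}

# Usage (my_jobs.sub is a placeholder submit file that queues 1000 jobs):
#   can_submit 1000 && condor_submit my_jobs.sub
```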

== Check which machine the job is running on ==

Barnes


How to Monitor Jobs (view source)

Revision as of 23:03, 24 February 2017
