If you check the status of your job and find that it has been held, there are a couple of possible causes.
=== Memory thrashing ===
Your job is using more memory than you requested, which forces the server to use swap space (on hard disk) instead of RAM. This thrashes the system and slows down everything running on that machine. A script runs periodically to check for such jobs, and any job found thrashing a machine is held.

Check the log file for the job ID in question and find the entry stating that the job was held and checkpointed. It should list the disk and memory that the job used and requested. If the memory used is significantly larger than the memory requested, the job was using swap.
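The comparison of used against requested resources can be automated. The following is a minimal Python sketch, not the cluster's own tooling: the log excerpt in `SAMPLE_LOG`, its column layout, and the function names `usage_report` and `over_request` are all assumptions made for illustration, so adjust the pattern to match what your log files actually contain.

```python
import re

# Hypothetical log excerpt; the real layout on your cluster may differ.
SAMPLE_LOG = """\
Job was held.
    Resources     :  Usage  Request
    Disk (KB)     :   1500  1000000
    Memory (MB)   :   3900     2048
"""

# Matches rows such as "Memory (MB) : 3900 2048" (assumed format).
ROW = re.compile(r"^\s*(Disk|Memory) \((\w+)\)\s*:\s*(\d+)\s+(\d+)", re.MULTILINE)


def usage_report(log_text):
    """Return {resource: (used, requested, unit)} for each matching row."""
    return {
        name.lower(): (int(used), int(requested), unit)
        for name, unit, used, requested in ROW.findall(log_text)
    }


def over_request(log_text, resource="memory", factor=1.0):
    """True if the job used more than `factor` times what it requested."""
    used, requested, _unit = usage_report(log_text)[resource]
    return used > requested * factor
```

Calling `over_request(SAMPLE_LOG)` on the sample above flags the job, since it used 3900 MB against a 2048 MB request. Raise `factor` if you only want to flag jobs that exceed their request by a wide margin.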
=== Submitted too many jobs ===
The cluster can only process a limited number of queued jobs. If you submit tens of thousands of jobs, they will be held, because the cluster cannot negotiate that many queued jobs at one time.

Never submit more than 5,000 jobs at a time. Remember that other users may be submitting many jobs as well.
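One way to stay under the limit is to split a large workload into batches before submitting. The Python sketch below only does the splitting. The helper name `batches` and the example parameter list are hypothetical, and the actual submission step depends on your cluster's tools, so it is not shown.

```python
def batches(jobs, limit=5000):
    """Yield successive slices of `jobs`, each holding at most `limit` items."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for start in range(0, len(jobs), limit):
        yield jobs[start:start + limit]


# Example: 12,000 hypothetical parameter sets split into three batches.
params = list(range(12000))
sizes = [len(batch) for batch in batches(params)]
```

Submit one batch, wait until most of it has left the queue, then submit the next. Choosing a `limit` well below 5,000 leaves room for other users' jobs.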