NCAR
Last update: 09/14/2005


Tips for using lightning more effectively

Stopping a runaway batch job

Many batch jobs on NCAR supercomputers are long-term integrations that consist of a sequence of jobs. As each job finishes, a script submits the next job. (An example of a resubmitting job script appears in the IBM SP-cluster systems reference manual at Control resubmission of your batch job.)

However, if a job has an error that causes it to fail soon after starting, the script submits the next job, which also fails quickly, and the resubmissions become a runaway process. This situation wastes resources and can produce a flood of unwanted email or output. Possibly worst of all, the jobs may be hard to kill with the usual batch kill command, because they fail and resubmit so quickly that the user cannot identify them in time to kill them.

Here's a solution: Instead of using the bkill command, examine the script that is resubmitting the jobs to determine the name of the file that it submits to the batch queue, then move that file to another location. The next resubmission attempt cannot find the file, so the resubmissions stop.
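
The two steps above can be sketched as a pair of small shell functions. This is only an illustration under stated assumptions: the function names show_submit_lines and park_job_script, the default holding directory, and the file names in the usage note are hypothetical, and the sketch assumes the resubmitting script calls the batch submit command bsub by name.

```shell
#!/bin/sh
# Illustrative sketch only: the function names and the default holding
# directory below are hypothetical, not part of any NCAR-provided tool.

# Step 1: print the lines of a resubmitting script that mention bsub,
# so you can read off the name of the file being submitted.
# Assumes the script invokes the submit command as "bsub".
show_submit_lines() {
    grep -n 'bsub' "$1"
}

# Step 2: move the submitted file to a holding directory so that the
# next resubmission attempt cannot find it.
park_job_script() {
    job_file=$1
    park_dir=${2:-"$HOME/parked_jobs"}   # hypothetical default location

    if [ ! -f "$job_file" ]; then
        echo "no such file: $job_file" >&2
        return 1
    fi

    mkdir -p "$park_dir" || return 1
    mv "$job_file" "$park_dir/" || return 1
    echo "moved $job_file to $park_dir"
}
```

For example, with a hypothetical driver script run_model.sh that submits next_job.lsf, you might run show_submit_lines run_model.sh to confirm the file name and then park_job_script next_job.lsf. Moving the file rather than deleting it keeps it available, so you can move it back once the underlying error is fixed.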



If you have questions about this document, please contact CISL Customer Support. You can reach us by telephone 24 hours a day, seven days a week at 303-497-1278, by email at consult1@ucar.edu, or, during business hours, in person at NCAR Mesa Lab Suite 39.

© Copyright 2005. University Corporation for Atmospheric Research (UCAR). All Rights Reserved.

Address of this page: http://www.cisl.ucar.edu/docs/lightning/tips.jsp