Tips for using lightning more effectively
Many batch jobs on NCAR supercomputers are long-term integrations
that run as a sequence of jobs. As each job finishes, a script
submits the next job. (An example of a resubmitting job script appears
in the IBM SP-cluster systems reference manual, in the section
"Control resubmission of your batch job.")
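For illustration, the resubmission step at the end of such a job script might look like the sketch below. It assumes an LSF-style submit command (bsub, the counterpart of the bkill command) that reads the job script from standard input. The SUBMIT_CMD variable, the resubmit_next function, and the file name in the usage line are hypothetical, and this is not the script from the reference manual.

```shell
#!/bin/sh
# Sketch of the last step of a self-resubmitting job script.
# Assumption: the batch system accepts "bsub < jobfile"; override
# SUBMIT_CMD if your system uses a different submit command.
SUBMIT_CMD=${SUBMIT_CMD:-bsub}

resubmit_next() {
    next_job=$1
    # If the job file is not where the script expects it,
    # there is nothing to submit and the sequence ends here.
    if [ ! -f "$next_job" ]; then
        echo "resubmit_next: $next_job not found; not resubmitting" >&2
        return 1
    fi
    # The submit command reads the job script from standard input.
    $SUBMIT_CMD < "$next_job"
}
```

A job script would call this as its final step, for example: resubmit_next "$HOME/run/next_job.lsf" (a hypothetical path).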
However, if a job has an error that causes it to fail soon after
starting, the script may still submit the next job, which fails just
as quickly, and the resubmissions become a runaway process. This
situation wastes resources and can produce a flood of unwanted email
or output. Possibly worst of all, the jobs can be hard to kill with
the usual batch kill command, because they fail and resubmit so
quickly that the user cannot identify them in time to kill them.
Here's a solution: Instead of using the bkill command, examine the
script that is resubmitting the jobs to find the name of the file
that it submits to the batch queue, then move that file to another
location. Once the file is no longer at the path the script expects,
the script has nothing to submit, and the resubmissions stop.
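The step above can be sketched as a small shell function. This is a minimal illustration, not an NCAR-provided tool: the function name stop_resubmission, the default holding directory, and the file name in the usage line are all hypothetical. Substitute the file name you find in your own resubmitting script.

```shell
#!/bin/sh
# stop_resubmission FILE [HOLD_DIR]
# Move the job file that the resubmitting script submits to the batch
# queue into a holding directory, so the next attempt finds no file.
stop_resubmission() {
    job_file=$1
    hold_dir=${2:-"$HOME/held_jobs"}   # hypothetical holding location
    if [ ! -f "$job_file" ]; then
        echo "stop_resubmission: no such file: $job_file" >&2
        return 1
    fi
    mkdir -p "$hold_dir" || return 1
    # Move rather than delete, so the job file can be restored
    # after the underlying error is fixed.
    mv -- "$job_file" "$hold_dir/" || return 1
    echo "Moved $job_file to $hold_dir"
}
```

For example: stop_resubmission "$HOME/run/next_job.lsf". Moving the file back to its original location after correcting the error lets the sequence be restarted by hand.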
If you have questions about this document, please contact
CISL Customer Support.
You can also reach us by telephone 24 hours a day, seven days a week at
303-497-1278.
Additional contact methods: consult1@ucar.edu, and during business
hours in NCAR Mesa Lab Suite 39.
© Copyright 2005. University Corporation for Atmospheric
Research (UCAR). All Rights Reserved.
Address of this page:
http://www.cisl.ucar.edu/docs/lightning/tips.jsp