Jobs queried by qstat can be in different queue states.
The most typical problem in SGE is a stalled job, or a situation in which a node does not accept any jobs (see also Job or Queue Reported in Error State E).
As with any troubleshooting, the first step is to read the SGE logs. See SGE log files.
There are several reasons such a situation can arise.
'qstat -f' will show any nodes marked as being in an error state. 'qstat -g c' will give you a quick summary of total slots, slots in use, slots offline/errored, etc.
qstat -s p shows pending jobs, i.e. all those in state "qw" and "hqw".
qstat -s h shows held jobs, i.e. all those in state "hqw".
qstat -u "*" | grep " qw"
A small shell function (written here for bash) gives a quick summary of running and pending jobs:

jobstat () {
    echo "Running jobs: " $(qstat -u '*' | awk '$5 == "r"' | wc -l)
    echo "Pending jobs: " $(qstat -u '*' | awk '$5 == "qw" || $5 == "hqw"' | wc -l)
    echo "------------------------"
    echo "Total: " $(qstat -u '*' | wc -l)    # note: this count includes the two qstat header lines
}
The first thing to do in such a situation is to look at the output of the command qstat -f (see Queue states for an explanation of the possible states).
Here is an example of qstat -f output:

root@node17: # qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node16                   BIP   0/0/32         0.00     lx24-amd64
---------------------------------------------------------------------------------
all.q@node17                   BIP   0/0/12         0.06     lx24-amd64
---------------------------------------------------------------------------------
all.q@node52                   BIP   0/0/12         12.02    lx24-amd64
---------------------------------------------------------------------------------
all.q@node53                   BIP   0/0/80         39.72    lx24-amd64
---------------------------------------------------------------------------------
all.q@node54                   BIP   0/0/80         0.02     lx24-amd64
---------------------------------------------------------------------------------
all.q@wx3481-ustc              BIP   0/0/8          -NA-     lx24-amd64    au
---------------------------------------------------------------------------------
c12.q@node52                   BIP   0/0/12         12.02    lx24-amd64
---------------------------------------------------------------------------------
c32.q@node16                   BIP   0/0/32         0.00     lx24-amd64
---------------------------------------------------------------------------------
c32.q@node53                   BIP   0/0/32         39.72    lx24-amd64
---------------------------------------------------------------------------------
c32.q@node54                   BIP   0/0/32         0.02     lx24-amd64    E
---------------------------------------------------------------------------------
c40.q@node53                   BIP   0/0/40         39.72    lx24-amd64
---------------------------------------------------------------------------------
c40.q@node54                   BIP   0/0/40         0.02     lx24-amd64    E
---------------------------------------------------------------------------------
m12a.q@node52                  BIP   0/12/12        12.02    lx24-amd64
---------------------------------------------------------------------------------
m32a.q@node16                  BIP   0/0/32         0.00     lx24-amd64
---------------------------------------------------------------------------------
m40a.q@node54                  BIP   0/0/40         0.02     lx24-amd64    E
---------------------------------------------------------------------------------
m40b.q@node53                  BIP   0/40/40        39.72    lx24-amd64

To clear the E (error) state of a queue you can use the qmod -c command (you can issue qmod -c "*" to do a blanket reset of the error state on all queues on all nodes).
For example:
qmod -c c40.q c32.q m40a.q
root@node17 changed state of "c40.q@node54" (no error)
Queue instance "c40.q@node53" is already in the specified state: no error
Queue instance "c32.q@node16" is already in the specified state: no error
root@node17 changed state of "c32.q@node54" (no error)
Queue instance "c32.q@node53" is already in the specified state: no error
root@node17 changed state of "m40a.q@node54" (no error)

qhost prints out the execution host configuration and load. For example:
# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT   MEMUSE   SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -        -        -       -       -
node16                  lx24-amd64     32  0.01   126.0G  1018.1M   62.5G     0.0
node17                  lx24-amd64     12  0.09     5.8G   637.4M   11.7G  204.0K
node52                  lx24-amd64     12 12.02    47.2G     2.0G   49.1G     0.0
node53                  lx24-amd64     80 39.56   126.0G     1.4G   62.5G     0.0
node54                  lx24-amd64     80  0.02   126.0G   502.9M   62.5G     0.0
node8                   lx24-amd64      8     -    15.7G        -   15.6G       -
qstat -g c Print out the current queue utilization
qstat -u <user> Only show jobs of a specific user
qstat -j <job id> Print out detailed information about the job with the specified job id, including the reason why a pending job is not being scheduled. If scheduling information is not shown, you need to enable this type of reporting (see Enabling scheduling information in qstat -j).
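If "qstat -j <job id>" does not print a "scheduling info:" section, the scheduler is most likely configured with schedd_job_info set to false. A minimal sketch of enabling it (requires SGE manager rights; the parameter name is standard, but check the sched_conf man page for your version):

qconf -msconf              # opens the scheduler configuration in $EDITOR
# change the line
#     schedd_job_info   false
# to
#     schedd_job_info   true
# and save; subsequent "qstat -j <job id>" calls will include scheduling info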
The most typical problem is that your job "starves" in the waiting queue.
Problem description | Possible reason | Example | Solution
Your job "starves" in the waiting queue | The farm is full | | Check the output of "qstat -g c" for available nodes
Your job "starves" in the waiting queue | You requested resources which cannot be fulfilled | -l h_cpu > 48:00:00 | You can only request CPU time < 49 hours
Only some of a set of identical jobs die | You did not specify your requirements correctly | You did not specify h_cpu | If h_cpu is not specified, your job might run on short queues. If your job needs more than 30 minutes of CPU time, it will be killed
Only some of a set of identical jobs die | Too many jobs access data on the same file server at once | | Use AFS! Do not submit too many jobs at once. If you really need to, try using the qsub "-hold_jid" option.
All your jobs die at once | There are problems writing the log files (job's STDOUT/STDERR) | The log directory (located in AFS) contains too many files. SGE's error mail (qsub parameter '-m a') contains a line saying something like "/afs/naf.desy.de/...: File too large". | Do not store more than 1000 output files per directory.
All your jobs die at once | There are problems writing the log files (job's STDOUT/STDERR) | The output directory is not writable. SGE's error mail contains a line saying something like "/afs/naf.desy.de/...: permission denied". | Check directory permissions.
All your jobs die at once | There are problems writing the log files (job's STDOUT/STDERR) | The log directory does not exist on the execution host. | You can only use network-enabled filesystems (AFS, NFS) as the log directory. Local directories (e.g. /usr1/scratch) won't work.
Your job uses threads and works perfectly if not run under the batch system, but under SGE it dies with obscure error messages saying it cannot start threads | SGE sets the stack size limit to the same value as h_vmem by default. That should be considered a bug of SGE. | | Specify the h_stack job resource like this: -l h_stack=10M. 10M is a good value for most applications
qrsh fails with an error message complaining 'Your "qrsh" request could not be scheduled, try again later' | The farm is full and qrsh wants to occupy a slot at once. | | Try "qrsh -now n"
qrsh starts normally but finishes unexpectedly during program execution | You did not request enough memory (the default is 128M) | | Request more memory, e.g. 1 GB: qrsh -l h_vmem=1G
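To illustrate the resource requests mentioned in the table above, here is a minimal job script sketch. The resource names (h_cpu, h_vmem, h_stack) and the -m/-M and -hold_jid options are standard SGE, but the actual limits, the mail address, the job id to wait for, and the program name are assumptions you must adapt to your site:

#!/bin/sh
#$ -cwd                      # run in the submission directory
#$ -l h_cpu=47:00:00         # CPU time request (below the 48-hour limit mentioned above)
#$ -l h_vmem=1G              # memory request per slot
#$ -l h_stack=10M            # stack size limit; avoids thread-creation failures
#$ -m a                      # send mail if the job is aborted
#$ -M myaddress@work         # hypothetical mail address
#$ -hold_jid 12345           # hypothetical: start only after job 12345 has finished
./my_program                 # hypothetical executable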
biowiki.org
After submitting your job to Grid Engine you may track its status using the qstat command, the GUI interface QMON, or email notifications.
Monitoring with qstat
The qstat command provides the status of all jobs and queues in the cluster. The most useful options are:
- qstat: Displays list of all jobs with no queue status information.
- qstat -u hpc1***: Displays list of all jobs belonging to user hpc1***
- qstat -f: Gives full information about jobs and queues.
- qstat -j [job_id]: Gives the reason why the pending job (if any) is not being scheduled.
You can refer to the man pages for a complete description of all the options of the qstat command.
Monitoring Jobs by Electronic Mail
Another way to monitor your jobs is to make Grid Engine notify you by email about the status of the job.
In your batch script or on the command line, use the -m option to request that an email should be sent and the -M option to specify the email address where it should be sent. This will look like:
#$ -M myaddress@work
#$ -m beas
where the -m option selects the events after which you want to receive email: b = beginning of the job, e = end of the job, a = job aborted, s = job suspended (see the sample script lines above).
And from the command line you can use the same options (for example):
qsub -M myaddress@work -m be job.sh
How do I control my jobs
Based on the status of the job displayed, you can control the job by the following actions:
- Modify a job: As a user, you have certain rights that apply exclusively to your own jobs. The Grid Engine command-line tool used is qmod. Check the man pages for the options that you are allowed to use.
- Suspend (or resume) a job: This uses the UNIX kill command and applies only to running jobs. In practice you type:
qmod -s job_id (or qmod -r job_id to resume), where job_id is given by qstat or qsub.
- Delete a job: You can delete a job that is running or spooled in the queue by using the qdel command like this
qdel job_id (where job_id is given by qstat or qsub).
Monitoring and controlling with QMON
You can also use the GUI QMON, which gives a convenient window dialog specifically designed for monitoring and controlling jobs; the buttons are self-explanatory.
For further information, see the SGE User's Guide ( PDF, HTML).
May 07, 2017 | biowiki.org
Does your job show "Eqw" or "qw" state when you run qstat , and just sits there refusing to run? Get more info on what's wrong with it using:
$ qstat -j <job number>
Does your job actually get dispatched and run (that is, qstat no longer shows it - because it was sent to an exec host, ran, and exited), but something else isn't working right? Get more info on what's wrong with it using:
$ qacct -j <job number> (especially see the lines "failed" and "exit_status")
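For a quick look at just those two fields you can filter the output, for example:

$ qacct -j <job number> | egrep 'failed|exit_status'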
If any of the above have an "access denied" message in them, it's probably a permissions problem. Your user account does not have the privileges to read from/write to where you told it to (this often happens with the -e and -o options to qsub). So check to make sure you do. Try, for example, to SSH into the node on which the job is trying to run (or just any node) and make sure that you can actually read from/write to the desired directories from there. While you're at it, just run the job manually from that node and see if it runs - maybe there's some library it needs that the particular node is missing.
To avoid permissions problems, cd into the directory on the NFS where you want your job to run, and submit from there using qsub -cwd to make sure it runs in that same directory on all the nodes.
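For example (the directory and file names here are hypothetical):

$ cd /nfs/projects/myrun
$ qsub -cwd -o myjob.out -e myjob.err myjob.sh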
Not a permissions problem? Well, maybe the nodes or the queues are unreachable. Check with:
qstat -f
or, for even more detail:
qstat -F
If the "state" column in qstat -f has a big E , that host or queue is in an error state due to... well, something. Sometimes an error just occurs and marks the whole queue as "bad", which blocks all jobs from running in that queue, even though there is nothing otherwise wrong with it. Use qmod -c <queue list> to clear the error state for a queue.
Maybe that's not the problem, though. Maybe there is some network problem preventing the SGE master from communicating with the exec hosts, such as routing problems or a firewall misconfiguration. You can troubleshoot these things with qping , which will test whether the SGE processes on the master node and the exec nodes can communicate.
N.B.: remember, the execd process on the exec node is responsible for establishing a TCP/IP connection to the qmaster process on the master node , not the other way around. The execd processes basically "phone home". So you have to run qping from the exec nodes , not the master node!
Syntax example (I am running this on an exec node, and sheridan is the SGE master):
$ qping sheridan 536 qmaster 1
where 536 is the port that qmaster is listening on, and 1 simply means that I am trying to reach a daemon. Can't reach it? Make sure your firewall has a hole on that port, that the routing is correct, that you can ping using the good old ping command, that the qmaster process is actually up, and so on.
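If you want to run that check from every exec node in one go, a quick loop over the output of qconf -sel (which lists the configured execution hosts) is handy. This is only a sketch: it assumes passwordless SSH to the nodes and that the SGE environment is set up in the remote shell:

for node in $(qconf -sel); do
    echo "== $node =="
    ssh "$node" "qping sheridan 536 qmaster 1"
done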
Of course, you could ping the exec nodes from the master node, too, e.g. I can see if I can reach exec node kosh like this:
$ qping kosh 537 execd 1
but why would you do such a crazy thing? execd is responsible for reaching qmaster , not the other way around.
If the above checks out, check the messages log in /var/log/sge_messages on the submit and/or master node (on our Babylon Cluster , they're both the node sheridan ):
$ tail /var/log/sge_messages
Personally, I like running:
$ tail -f /var/log/sge_messages
before I submit the job, and then submit a job in a different window. The -f option will update the tail of the file as it grows, so you can see the message log change "live" as your job executes and see what's happening as things take place.
(Note that the above is actually a symbolic link I put in to the messages log in the qmaster spool directory, i.e. /opt/sge/default/spool/qmaster/messages .)
One thing that commonly goes wrong is permissions. Make sure that the user that submitted the job using qsub actually has the permissions to write error, output, and other files to the paths you specified.
For even more precise troubleshooting... maybe the problem is unique only to some nodes(s) or some queue(s)? To pin it down, try to run the job only on some specific node or queue:
$ qsub -l hostname=<node/host name> <other job params>
$ qsub -l qname=<queue name> <other job params>
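For instance, using the host and queue names from the qstat -f output near the top of this page (myjob.sh stands for your own job script):

$ qsub -l hostname=node54 myjob.sh
$ qsub -l qname=all.q myjob.sh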
Maybe you should also try to SSH into the problem nodes directly and run the job locally from there, as your own user, and see if you can get any more detail on why it fails.
If all else fails...
Sometimes, the SGE master host will become so FUBARed that we have to resort to brute, traumatizing force to fix it. The following solution is equivalent to fixing a wristwatch with a bulldozer, but seems to do more good than harm (although I can't guarantee that it doesn't cause long-term harm in favor of a short-term solution).
Basically, you wipe the database that keeps track of SGE jobs on the master host, taking any problem "stuck" jobs with it. (At least that's what I think this does...)
I've found this useful when:
- You submit >10,000 jobs to SGE, which uses up so many system resources that the jobs cannot be dispatched to exec hosts, and you start getting the "failed receiving gdi request" error on something as simple as qstat. You can't use qdel to wipe the jobs due to the same error.
- A job is stuck in the r state (and if you try to delete it, the dr state) despite the fact that the exec host is not running the job, nor is even aware of it. This can happen if you reboot a stuck/unresponsive exec host.
The solution:
ssh sheridan
su -
service sgemaster stop
cd /opt/sge/default/
mv spooldb spooldb.fubared
mkdir spooldb
cp spooldb.fubared/sge spooldb/
chown -R sgeadmin:sgeadmin spooldb
service sgemaster start

Wipe spooldb.fubared when you are confident that you won't need its contents again.
Jul 19, 2010 | Rocks-Discuss thread "Some compute nodes not accepting jobs"
Mike Hanby mhanby at uab.edu
Mon Jul 19 12:27:15 PDT 2010
Sometimes, if a job fails in a way that makes SGE think the node might be at fault, SGE will mark the node as in error.
'qstat -f' will show any nodes marked as errored. 'qstat -g c' will give you a quick summary of total slots, slots in use, slots offline/errored, etc.
You may be able to find some information in /opt/gridengine/default/spool/qmaster/messages.
It's a good idea to run the 'qstat -g c' command periodically. It can help head off some support calls, especially if you have some users who are node watchers ;-)
Also, check 'qhost' from time to time to make sure none of your nodes are overloaded (i.e. jobs behaving badly) as it will display load, memory used and swap used.
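One way to make that periodic check automatic is a small cron entry that mails you the summary. This is only a sketch: the schedule, the recipient, and the settings-file path (/opt/gridengine/default/common/settings.sh, the usual location when SGE_ROOT is /opt/gridengine as on this cluster) are assumptions to adapt to your site:

0 * * * *  . /opt/gridengine/default/common/settings.sh && { qstat -g c; qhost; } | mail -s 'SGE cluster summary' root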
Mike
try qalter -w v jobnumber
imperial.ac.uk
SGE jobs can fail for a variety of reasons, which may either be scheduling problems preventing the job from being executed, or runtime errors due to scripting errors etc.
Scheduling Problems
A job will remain in a queued state if SGE cannot allocate the resources requested in the job submission. This may be because jobs of other users on the cluster are ahead of the queued job in the queue, or because the requested resources are not available on the cluster, i.e. more memory was requested than is provided by any cluster node, or the job has been submitted by a user not registered to use the cluster.
The 'qalter' command provides a method for verifying whether a job is capable of being run, assuming no other jobs are present in the queue. Executing 'qalter -w v [jobid]' will sequentially check each queue and report whether a queue capable of executing the job is present on the system.
Example: Unregistered User Submitting Jobs
A job submitted by a user who is not registered to submit jobs to the cluster will appear in the queue but remain in a queued state. Executing 'qalter -w v [jobid]' in this circumstance will produce output such as that shown below. The job is reported as having no permission to run in any of the queues since the user is not registered. In this case, contact [email protected] to arrange access to the cluster.
[root@codon /]# qalter -w v 3811734
Job 3811734 has no permission for cluster queue "quick"
Job 3811734 has no permission for cluster queue "emaas"
Job 3811734 has no permission for cluster queue "1day_16"
Job 3811734 has no permission for cluster queue "3day_16"
Job 3811734 has no permission for cluster queue "3day_32"
Job 3811734 has no permission for cluster queue "6hour_16"
Job 3811734 has no permission for cluster queue "7day_16"
Job 3811734 has no permission for cluster queue "7day_32"
Job 3811734 has no permission for cluster queue "infinite_128"
Job 3811734 has no permission for cluster queue "infinite_16"
Job 3811734 has no permission for cluster queue "infinite_32"
verification: no suitable queues

Example: Incorrect memory request
A job submitted with a resource request that can never be fulfilled can also be identified by executing 'qalter -w v [jobid]'. For example, in the following case a job was submitted requesting a slot with 5 GB of memory; however, standard slots on the cluster offer either 2 GB or 4 GB of memory, so this request can not be fulfilled. Jobs requiring access to more memory than offered by a single slot should be submitted as SMP jobs requesting a number of slots (see submitting parallel jobs for details; a submission sketch also follows the qalter output below).
[bss-admin@codon ~]$ qalter -w v 3811742
Job 3811742 has no permission for cluster queue "quick"
Job 3811742 has no permission for cluster queue "emaas"
Job 3811742 (-l h_vmem=5G) cannot run in queue "1day_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "3day_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "3day_32" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "6hour_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "7day_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "7day_32" because of cluster queue
Job 3811742 has no permission for cluster queue "infinite_128"
Job 3811742 (-l h_vmem=5G) cannot run in queue "infinite_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "infinite_32" because of cluster queue
verification: no suitable queues
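As a sketch of the alternative mentioned above, a job that needs, say, 8 GB could request several slots in an SMP parallel environment so that the per-slot memory request stays within what the nodes offer (for parallel jobs, memory requests such as h_vmem apply per slot). The PE name "smp" and the script name are assumptions; check qconf -spl for the parallel environments available on your cluster:

qsub -pe smp 4 -l h_vmem=2G myjob.sh    # 4 slots x 2 GB per slot = 8 GB total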
qping -info myhost16 6445 execd 1 # check status of execd from master
[ID 1288901.1]  Modified: 19-APR-2012  Type: DIAGNOSTIC TOOLS  Status: PUBLISHED
This documentation contains the scripts for data collection and configuration information in order to troubleshoot and resolve HPC Grid Engine issues.
Attachments
- collects info on qmaster (11.55 KB) -- this is a shell script
- collects grid config info (32.09 KB) -- this is a Perl script