The best way to manage licenses in SGE is to use consumable resources (CR). Floating licenses can easily be managed with a global CR. The classic example of a built-in consumable resource in SGE is slots.
The SGE batch scheduling system allows arbitrary "consumable resources" to be created that users can then make requests against. They can therefore be used to limit access to software licenses based on the availability of license tokens. When a job that uses a particular software package starts, it requests one (or more) licenses from SGE, and the consumable resource bookkeeping decrements the counter for that license pool. If no more resources are available (i.e. the internal counter is at 0), then the job is delayed until a currently-used resource is freed up.
The consumable parameter can have three values: NO, YES and JOB.
It can be set to YES or JOB only for numeric attributes (INT, DOUBLE, MEMORY, TIME - see the type field). If set to YES or JOB, consumption of the corresponding resource is managed by Sun Grid Engine's internal bookkeeping. In this case Sun Grid Engine accounts for the consumption of this resource for all running jobs and ensures that jobs are only dispatched if the internal bookkeeping indicates enough available consumable resources. Consumables are an efficient means to manage limited resources such as available memory, free space on a file system, network bandwidth or floating software licenses.
There are two types of consumables: per slot (consumable set to YES) and per job (consumable set to JOB).
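To illustrate the difference, here is a minimal sketch (the complex name lic is hypothetical): with a per-slot consumable a parallel job is charged its request once per granted slot, whereas a per-job consumable is charged only once regardless of the slot count.

% qsub -pe mpi 4 -l lic=1 myjob.sh
#   if lic has consumable=YES  -> 4 units of lic are charged (1 per slot)
#   if lic has consumable=JOB  -> 1 unit of lic is charged (once per job)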
Consumables can be combined with default or user defined load parameters (see sge_conf(5) and host_conf(5)), i.e. load values can be reported for consumable attributes or the consumable flag can be set for load attributes.
In this case the Sun Grid Engine consumable resource management takes both the load (measuring availability of the resource) and the internal bookkeeping into account, and ensures that neither of the two exceeds a given limit.
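As a rough sketch (the host name and memory size below are hypothetical), a load value such as mem_free can be flagged consumable so that the scheduler honours whichever of the measured load and the bookkeeping value is more restrictive:

% qconf -mc
#name      shortcut  type    relop  requestable  consumable  default  urgency
mem_free   mf        MEMORY  <=     YES          YES         0        0

% qconf -me node01
complex_values mem_free=16G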
To enable consumable resource management, the basic availability of a resource has to be defined. This can be done on a cluster-global, per-host or per-queue basis, and these categories may supersede each other in the given order (i.e. a host can restrict availability of a cluster resource, and a queue can restrict host and cluster resources).
The definition of resource availability is performed with the complex_values entry in host_conf(5) and queue_conf(5).
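A minimal sketch of this layering (host, queue and resource names are hypothetical): the global pool is defined on the pseudo host "global", and a host or queue can then advertise a smaller amount, which caps what jobs on that host or in that queue instance can consume.

% qconf -me global
complex_values lic=20        # 20 licenses cluster-wide

% qconf -me node01
complex_values lic=4         # node01 never runs jobs holding more than 4

% qconf -mq short.q
complex_values lic=2         # each instance of short.q is limited to 2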
Basically, a complex is a resource or value that can be requested by a job with the -l switch to qsub. Setting a complex to be consumable means that when a job requests that complex, the number available is decreased.
The complex_values definition of the "global" host specifies cluster global consumable settings. To each consumable complex attribute in a complex_values list a value is assigned which denotes the maximum available amount for that resource. The internal bookkeeping will subtract from this total the assumed resource consumption by all running jobs as expressed through the jobs' resource requests.
Notes:
See the Sun Grid Engine Installation and Administration Guide for examples on the usage of the consumable resources facility.
Here is how to achieve "license token" consumption in SGE (aka license token management):

% qconf -mc
#name    shortcut  type  relop  requestable  consumable  default  urgency
accel    accel     INT   <=     YES          JOB         0        0

% qconf -me global
complex_values accel=19

% qsub -l accel=10 -pe mpi 8 <myjob.sh>
The "per job" setting ensure that the requested tokens are *not* multiplied with the number of requested slots.
Another example from Setting Up A Global Consumable Resource in Grid Engine
Step 1: Configure the "global" complex
First create/modify a complex called "global" (the name is reserved, just as the complexes managing resources on a per-host/per-queue basis are called "host" and "queue"). This can be found by clicking the "Complexes Configuration" button in qmon.
Enter the following values for the complex (verilog is used in this example):
#name     shortcut  type  value  relop  requestable  consumable  default
#-----------------------------------------------------------------------
verilog   vl        INT   0      <=     YES          YES         0

The above says: there is a complex attribute called "verilog" with the shortcut name "vl" and it is of type integer. The "value" field has no meaning for consumable resources (therefore it is 0). This resource is requestable (YES), and it is consumable (YES). The "default" field should be set to 0 (it is a default value for users who don't request anything, but it is not useful for a global value here).
When using qmon, do not forget to press the "Add" button to add the new complex definition to the table below before applying with the "Ok" button.
After the complex is configured, it can be viewed by running the following command at the prompt:
% qconf -sc global

Step 2: Configure the "global" host
Since a global consumable resource is being created (all hosts have access to this resource), the pseudo host "global" must be configured.
Using qmon:
qmon -> Host Configuration -> Execution host
Select the "global" host and click on "Modify". Select the tab titled "Consumable/Fixed Attributes". It is correct that the "global" complex does not show in the window (the global host has it by default, just as a host has the "host" complex by default).
Now click on the "Name/Value" title bar on the right (above the trash bin icon). A window pops up and there will be the resource "verilog". Select OK and verilog will be added to the first column of the table. Now enter the number of licenses of verilog in the second column.
Press "Ok" and the new resource and number in the will appear in the "Consumables/Fixed Attributes" window. Click the "Done" button to close this window.
Step 3: View the consumable attribute
To view the attribute, type the following:
% qstat -F verilog
queuename      qtype  used/tot.  load_avg  arch       states
---------------------------------------------------------------------------
balrog.q       BIC    0/4        0.45      solaris64
	gc:verilog=10.000000
---------------------------------------------------------------------------
bilbur.q       BIC    0/4        0.46      solaris
	gc:verilog=10.000000
---------------------------------------------------------------------------
dwain.q        BIC    0/4        0.82      irix6
	gc:verilog=10.000000

See qstat(1) for the various meanings of "gc", etc. (Try "qstat -F" to see the full list of attributes associated with each queue.)
"gc" means it is a (g)lobal (c)onsumable resource
Since it is global, all queues have inherited this value.
Step 4: Use the consumable attribute
The following submits a job, and requests the verilog resource:
% qsub -l vl=1 myjob.sh

When the job is running, the effect can be seen by running qstat:

% qstat -F vl
queuename      qtype  used/tot.  load_avg  arch       states
----------------------------------------------------------------------------
balrog.q       BIC    0/4        0.40      solaris64
	gc:verilog=9.000000
----------------------------------------------------------------------------
gloin.q2       BIC    0/4        0.02      osf4
	gc:verilog=9.000000
----------------------------------------------------------------------------
lis.q          BIC    0/4        0.35      glinux
	gc:verilog=9.000000
----------------------------------------------------------------------------
ori.q          BIC    1/4        0.15      glinux
	gc:verilog=9.000000
	3026 0 sleeper.sh  andy  t  11/02/1999 15:55:25  MASTER

To see which running job requested which resources:
% qstat -F vl -r -s r
queuename      qtype  used/tot.  load_avg  arch       states
----------------------------------------------------------------------------
[...]
----------------------------------------------------------------------------
ori.q          BIC    1/4        0.12      glinux
	gc:verilog=9.000000
	3026 0 sleeper.sh  andy  r  11/02/1999 15:55:25  MASTER
	      Full jobname:   sleeper.sh
	      Hard Resources: verilog=1
	                      h_fsize=0 (default)
Server Fault
We're using SGE (Sun Grid Engine). We have some limitations on the total number of concurrent jobs from all users. I would like to know if it's possible to set a temporary, voluntary limit on the number of concurrent running jobs for a specific user.

For example, user dave is about to submit 500 jobs, but he would like no more than 100 to run concurrently, e.g. because he knows the jobs do lots of I/O which bogs down the filesystem (true story, unfortunately). Is that possible?
asked Sep 24 '10 at 0:25

Answer by Kamil Kisiel:
You can define a complex with qconf -mc. Call it something like high_io or whatever you'd like, and set the consumable field to YES. Then either in the global configuration with qconf -me global, or in a particular queue with qconf -mq <queue name>, set high_io=500 in the complex values. Now tell your users to specify -l high_io=1, or however many "tokens" you'd like them to use. This will limit the number of concurrent jobs to whatever you set the complex value to.

The other way to do this is with quotas. Add a quota with qconf -arqs that looks something like:

{
   name         dave_max_slots
   description  "Limit dave to 500 slots"
   enabled      true
   limit        users {dave} to slots=500
}

Thanks Kamil and sorry for the late reply. A couple of follow-ups, since I'm quite new to qconf. Regarding your first suggestion, could you be a bit more explicit? What is "consumable"? After configuring as mentioned, do I simply tell the user to qsub with -l high_io=1? – David B Sep 28 '10 at 9:39

Basically, a complex is a resource or value that can be requested by a job with the -l switch to qsub. By setting a complex to be consumable, it means that when a job requests that complex the number available is decreased. So if a queue has 500 of the high_io complex, and a job requests 20, there will be 480 available for other jobs. You'd request the complex just as in your example. – Kamil Kisiel Sep 28 '10 at 22:42

Thank you Kamil. Sorry I can't vote up (not enough reputation yet). – David B Oct 1 '10 at 9:08
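Putting the first suggestion together as one command-line sequence might look roughly like this (the complex name, shortcut and the value of 100 are illustrative; qconf -mc and -me open an editor in which the shown lines are added):

% qconf -mc
#name      shortcut  type  relop  requestable  consumable  default  urgency
high_io    hio       INT   <=     YES          YES         0        0

% qconf -me global
complex_values high_io=100

# dave submits his jobs with one token each; at most 100 run concurrently
% qsub -l high_io=1 myjob.sh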
SGE - MerlinWiki
- matyldaX
- scratchX
- ram_free, mem_free
- disk_free, tmp_free
- gpu
We have found that for some tasks it is advantageous to tell SGE about the required resources. This makes sense when heavy use of RAM or network storage is expected. Limits can be soft or hard (parameters -soft, -hard), and the limits themselves take the form:
-l resource=value

For example, in case a job needs at least 400MB RAM:

qsub -l ram_free=400M my_script.sh

Another often requested resource is space in /tmp:
qsub -l tmp_free=10G my_script.sh
Or both:
qsub -l ram_free=400M,tmp_free=10G my_script.sh

Of course, it is possible (and preferable if the number does not change) to put the directive #$ -l ram_free=400M directly in the script. The current status of a given resource on all nodes can be obtained with qstat -F ram_free, or for several resources with qstat -F ram_free,tmp_free.
Details on other standard resources are in /usr/local/share/SGE/doc/load_parameters.asc. If you do not specify a value for a given resource, a default value is used (1GB for space in /tmp, 100MB for RAM).
WARNING: You need to distinguish whether you are requesting resources that must be available at the time of submission (so-called non-consumable resources), or whether you need to reserve a given resource for the whole runtime of your computation. For example, your program may need 400MB of memory but allocate only 100MB during its first 10 minutes. If you use the standard resource mem_free and other jobs are submitted to the same node during those first 10 minutes, SGE interprets the situation as follows: you asked for 400MB but are currently using only 100MB, so the remaining 300MB can be given to someone else (i.e. it will schedule other jobs requesting that memory).
For these purposes it is better to use consumable resources, which are accounted independently of the current status of the task - ram_free for memory, tmp_free for disk. For example, the resource ram_free does not look at the actual free RAM; it tracks RAM occupation based solely on the requests of the individual jobs. It starts with the RAM size of the given machine and subtracts the amount requested by each job scheduled to that machine. If a job does not specify ram_free, the default value of ram_free=100M is used.
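For completeness, a consumable like ram_free is presumably defined once as a consumable complex and then initialized per execution host with the machine's RAM size (the host name and size below are hypothetical):

% qconf -mc
#name      shortcut  type    relop  requestable  consumable  default  urgency
ram_free   ram_free  MEMORY  <=     YES          YES         100M     0

% qconf -me node01
complex_values ram_free=32G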
For disk space in /tmp (tmp_free) the situation is trickier: if a job does not clean up its mess after it finishes, the disk can actually have less free space than the resource bookkeeping claims. Unfortunately, nothing can be done about this.
Known problems with SGE
- Use of paths - for the home directory it is necessary to use the official path, i.e. /homes/kazi/... or /homes/eva (or simply the variable $HOME). If the internal mount point of the automounter is used, i.e. /var/mnt/..., an error will occur. (This is not an SGE error; the internal path is not fully functional for access.)
- Availability of nodes - because some nodes have limited access (employees' PCs), it is necessary to specify a list of nodes on which your job can run. This is done with the parameter -q. The machines that are generally available are the IBM Blade nodes and some computer labs, provided the machines are left on over night. The list of queues for -q must be on a single line, even if it is very long. To address whole groups of nodes, the parameter -q can be used as follows:
#$ -q all.q@@blade,all.q@@PCNxxx,all.q@@servers

The main groups of computers are: @blade, @servers, @speech, @PCNxxx, @PCN2xxx - the full and current list can be obtained with qconf -shgrpl.
- The syntax for addressing is QUEUE@OBJECT, i.e. all.q@OBJECT. The object is either a single computer, for example all.q@svatava, or a group of computers (whose name itself begins with @, e.g. @blade), i.e. all.q@@blade.
- The computers in the labs are sometimes restarted by students during a computation - we can't do much about this. If you really need the computation to finish (i.e. it is not easy to re-run a job if it is brutally killed), use the newly defined groups of computers:
@stable - @blade, @servers - servers that run all the time without restarting
@PCOxxx, @PCNxxx - computer labs; any node might be restarted at any time, or a student may shut a machine down by error or "by error". It is more or less certain that these machines will run smoothly over night and during weekends. There is also a group for each individual lab, e.g. @PCN103.
- Running scripts other than bash - it is necessary to specify the interpreter on the first line of your script (it is probably already there), for example #!/usr/bin/perl, etc.
- Does your script generate heavy traffic on the matyldas? Then set -l matyldaX=10 (for example 10, i.e. at most 100/10 = 10 concurrent jobs on the given matyldaX), where X is the number of the matylda used (if you use several matyldas, specify -l matyldaX=Y several times). We have created an SGE resource for each matylda (each matylda has 100 points in total), and jobs using -l matyldaX=Y are started only while the given matylda has free points. This can be used to balance the load of a given storage server from the user side; see the sketch after this list. The same holds for the scratch0X servers.
- Be careful with the parameter -cwd; it is not guaranteed to work every time. It is better to use cd /where/do/i/want at the beginning of your script.
- If a node is restarted, a job will still be shown in SGE even though it is no longer running. This is because SGE waits until the node confirms termination of the computation (i.e. until it boots Linux again and starts the SGE client). If you use qdel to delete such a job, it will only be marked with the flag d. Jobs marked with this flag are automatically deleted by the server every hour.
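As a sketch of the matylda convention described in the list above (the matylda numbers and point counts are only examples):

# reserve 10 of matylda5's 100 points -> at most 100/10 = 10 such jobs run at once
qsub -l matylda5=10 my_script.sh

# a job hitting two storage servers reserves points on both
qsub -l matylda3=5 -l matylda5=5 my_script.sh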
Parallel jobs - OpenMP
For parallel tasks using threads, it is enough to use the parallel environment smp and to set the number of threads:
#!/bin/sh
#
#$ -N OpenMPjob
#$ -o $JOB_NAME.$JOB_ID.out
#$ -e $JOB_NAME.$JOB_ID.err
#
# PE_name  CPU_Numbers_requested
#$ -pe smp 4
#
cd SOME_DIR_WITH_YOUR_PROGRAM
export OMP_NUM_THREADS=$NSLOTS
./your_openmp_program [options]

Parallel jobs - OpenMPI
- Open MPI is now fully supported, and it is the default parallel environment (mpirun is by default Open MPI)
- The SGE parallel environment is openmpi
- The allocation rule is $fill_up, which means that the preferred allocation is on the same machine (slots on one host are filled before moving to the next).
- Open MPI is compiled with tight SGE integration:
- mpirun will automatically submit to machines reserved by SGE
- qdel will automatically clean all MPI stubs
- For parallel tasks, do not forget (preferably directly in the script) to use the parameter -R y. This turns on slot reservation, i.e. your job won't be starved by jobs requesting fewer slots.
- If a parallel task is launched using qlogin, there is no variable containing information on which slots were reserved. A useful tool is qstat -u `whoami` -g t | grep QLOGIN, which shows where the parallel jobs are running.
Listing follows:
#!/bin/bash
# ---------------------------
# our name
#$ -N MPI_Job
#
# use reservation to stop starvation
#$ -R y
#
# pe request
#$ -pe openmpi 2-4
#
# ---------------------------
#
# $NSLOTS - the number of tasks to be used
echo "Got $NSLOTS slots."
mpirun -n $NSLOTS /full/path/to/your/executable
Jul 16, 2008
The SGE batch scheduling system allows for arbitrary "consumable resources" to be created that users can then make requests against. In general, this is used to limit access to a pool of software licenses or make sure that memory usage is planned for properly. E.g. when a user wants to use a special software package, they request 1 license from SGE and it will decrement its internal counter for that license pool. If no more resources are available (i.e. the internal counter is at 0), then the job will be delayed until a currently-used resource is freed up.
We can also create arbitrary consumable resources to help users self-limit their usage of the DSCR. We can set up a resource, or counter, that will be decremented every time you submit a job. This way, you can submit 1000's of jobs to SGE, but you won't be swamping the machines or otherwise impeding other users.
If a user is given their own job-control resource, say 'cpus_user001', they should then submit jobs with an extra resource request using the '-l' option:
% qsub -l cpus_user001=1 myjob.q

Before running the job, SGE will make sure that there are sufficient resources. Thus, if there are 100 resources set aside for 'cpus_user001', then the 101st simultaneous job request will have to wait for one of the previous jobs to complete even if there are empty machines in the cluster.

Alternately, you can embed this within your SGE submission script. At the top of the file ("myjob.q" in the above example), you can insert:
#$ -l cpus_user001=1
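A minimal submission script using such a directive might look like this (the resource name cpus_user001 comes from the example above; the job name and program are illustrative):

#!/bin/bash
#$ -N my_limited_job
#$ -cwd
#$ -l cpus_user001=1

./my_program --input data.txt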
Server Fault
I am using a tool called starcluster (http://star.mit.edu/cluster) to boot up an SGE-configured cluster in the Amazon cloud. The problem is that it doesn't seem to be configured with any pre-set consumable resources, except for SLOTS, which I don't seem to be able to request directly with qsub -l slots=X. Each time I boot up a cluster, I may ask for a different type of EC2 node, so the fact that this slot resource is preconfigured is really nice. I can request a certain number of slots using a pre-configured parallel environment, but the problem is that it was set up for MPI, so requesting slots using that parallel environment sometimes grants the job slots spread out across several compute nodes.

Is there a way to either 1) make a parallel environment that takes advantage of the existing pre-configured HOST=X slots settings that starcluster sets up, where you request slots on a single node, or 2) use some kind of resource that SGE is automatically aware of? Running qhost makes me think that even though NCPU and MEMTOT are not defined anywhere I can see, SGE is somehow aware of those resources. Are there settings where I can make those resources requestable without explicitly defining how much of each is available?

Thanks for your time!
qhost
output:

HOSTNAME   ARCH       NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global     -          -     -     -       -       -       -
master     linux-x64  2     0.01  7.3G    167.4M  0.0     0.0
node001    linux-x64  2     0.01  7.3G    139.6M  0.0     0.0
qconf -mc
output:

#name               shortcut  type      relop  requestable  consumable  default  urgency
#----------------------------------------------------------------------------------------
arch                a         RESTRING  ==     YES          NO          NONE     0
calendar            c         RESTRING  ==     YES          NO          NONE     0
cpu                 cpu       DOUBLE    >=     YES          NO          0        0
display_win_gui     dwg       BOOL      ==     YES          NO          0        0
h_core              h_core    MEMORY    <=     YES          NO          0        0
h_cpu               h_cpu     TIME      <=     YES          NO          0:0:0    0
h_data              h_data    MEMORY    <=     YES          NO          0        0
h_fsize             h_fsize   MEMORY    <=     YES          NO          0        0
h_rss               h_rss     MEMORY    <=     YES          NO          0        0
h_rt                h_rt      TIME      <=     YES          NO          0:0:0    0
h_stack             h_stack   MEMORY    <=     YES          NO          0        0
h_vmem              h_vmem    MEMORY    <=     YES          NO          0        0
hostname            h         HOST      ==     YES          NO          NONE     0
load_avg            la        DOUBLE    >=     NO           NO          0        0
load_long           ll        DOUBLE    >=     NO           NO          0        0
load_medium         lm        DOUBLE    >=     NO           NO          0        0
load_short          ls        DOUBLE    >=     NO           NO          0        0
m_core              core      INT       <=     YES          NO          0        0
m_socket            socket    INT       <=     YES          NO          0        0
m_topology          topo      RESTRING  ==     YES          NO          NONE     0
m_topology_inuse    utopo     RESTRING  ==     YES          NO          NONE     0
mem_free            mf        MEMORY    <=     YES          NO          0        0
mem_total           mt        MEMORY    <=     YES          NO          0        0
mem_used            mu        MEMORY    >=     YES          NO          0        0
min_cpu_interval    mci       TIME      <=     NO           NO          0:0:0    0
np_load_avg         nla       DOUBLE    >=     NO           NO          0        0
np_load_long        nll       DOUBLE    >=     NO           NO          0        0
np_load_medium      nlm       DOUBLE    >=     NO           NO          0        0
np_load_short       nls       DOUBLE    >=     NO           NO          0        0
num_proc            p         INT       ==     YES          NO          0        0
qname               q         RESTRING  ==     YES          NO          NONE     0
rerun               re        BOOL      ==     NO           NO          0        0
s_core              s_core    MEMORY    <=     YES          NO          0        0
s_cpu               s_cpu     TIME      <=     YES          NO          0:0:0    0
s_data              s_data    MEMORY    <=     YES          NO          0        0
s_fsize             s_fsize   MEMORY    <=     YES          NO          0        0
s_rss               s_rss     MEMORY    <=     YES          NO          0        0
s_rt                s_rt      TIME      <=     YES          NO          0:0:0    0
s_stack             s_stack   MEMORY    <=     YES          NO          0        0
s_vmem              s_vmem    MEMORY    <=     YES          NO          0        0
seq_no              seq       INT       ==     NO           NO          0        0
slots               s         INT       <=     YES          YES         1        1000
swap_free           sf        MEMORY    <=     YES          NO          0        0
swap_rate           sr        MEMORY    >=     YES          NO          0        0
swap_rsvd           srsv      MEMORY    >=     YES          NO          0        0
qconf -me master
output (one of the nodes as an example):

hostname              master
load_scaling          NONE
complex_values        NONE
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
qconf -msconf
output:

algorithm                          default
schedule_interval                  0:0:15
maxujobs                           0
queue_sort_method                  load
job_load_adjustments               np_load_avg=0.50
load_adjustment_decay_time         0:7:30
load_formula                       np_load_avg
schedd_job_info                    false
flush_submit_sec                   0
flush_finish_sec                   0
params                             none
reprioritize_interval              0:0:0
halftime                           168
usage_weight_list                  cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor                5.000000
weight_user                        0.250000
weight_project                     0.250000
weight_department                  0.250000
weight_job                         0.250000
weight_tickets_functional          0
weight_tickets_share               0
share_override_tickets             TRUE
share_functional_shares            TRUE
max_functional_jobs_to_schedule    200
report_pjob_tickets                TRUE
max_pending_tasks_per_job          50
halflife_decay_list                none
policy_hierarchy                   OFS
weight_ticket                      0.010000
weight_waiting_time                0.000000
weight_deadline                    3600000.000000
weight_urgency                     0.100000
weight_priority                    1.000000
max_reservation                    0
default_duration                   INFINITY
qconf -mq all.q
output:

qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make orte
rerun                 FALSE
slots                 1,[master=2],[node001=2]
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
John St. John
The solution I found is to make a new parallel environment that has the $pe_slots allocation rule (see man sge_pe). I set the number of slots available to that parallel environment to the maximum, since $pe_slots limits the slot usage to per-node. Since starcluster sets up the slots at cluster boot time, this seems to do the trick nicely. You also need to add the new parallel environment to the queue. So, just to make this dead simple:

qconf -ap by_node
and here are the contents after I edited the file:
pe_name            by_node
slots              9999999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
Also modify the queue (called all.q by starcluster) to add this new parallel environment to the list:

qconf -mq all.q
and change this line:
pe_list make orte
to this:
pe_list make orte by_node
I was concerned that jobs spawned from a given job would be limited to a single node, but this doesn't seem to be the case. I have a cluster with two nodes, and two slots each.
I made a test file that looks like this:
#!/bin/bash
qsub -b y -pe by_node 2 -cwd sleep 100
sleep 100
and executed it like this:
qsub -V -pe by_node 2 test.sh
After a little while, qstat shows both jobs running on different nodes:

job-ID  prior    name       user  state  submit/start at      queue          slots  ja-task-ID
-----------------------------------------------------------------------------------------------
    25  0.55500  test       root  r      10/17/2012 21:42:57  all.q@master       2
    26  0.55500  sleep      root  r      10/17/2012 21:43:12  all.q@node001      2
I also tested submitting 3 jobs at once requesting the same number of slots on a single node, and only two run at a time, one per node. So this seems to be properly set up!
Stack Overflow
We have a cluster of machines, each with 4 GPUs. Each job should be able to ask for 1-4 GPUs. Here's the catch: I would like SGE to tell each job which GPU(s) it should take. Unlike the CPU, a GPU works best if only one process accesses it at a time. So I would like:

Job #1: GPU 0, 1, 3
Job #2: GPU 2
Job #4: wait until 1-4 GPUs are available
The problem I've run into, is that the SGE will let me create a GPU resource with 4 units on each node, but it won't explicitly tell a job which GPU to use (only that it gets 1, or 3, or whatever).
I thought of creating 4 resources (gpu0, gpu1, gpu2, gpu3), but am not sure if the -l flag will take a glob pattern, and can't figure out how SGE would tell the job which GPU resources it received. Any ideas?
Daniel Blezek

When you have multiple GPUs and you want your jobs to request a GPU, but the Grid Engine scheduler should handle and select the free GPUs, you can configure an RSMAP (resource map) complex (instead of an INT). This allows you to specify the amount as well as the names of the GPUs on a specific host in the host configuration. You can also set it up as a HOST consumable, so that independent of the slots you request, the number of GPU devices requested with -l gpu=2 is 2 per host (even if the parallel job got, say, 8 slots spread over different hosts).

qconf -mc
#name   shortcut  type   relop  requestable  consumable  default  urgency
#------------------------------------------------------------------------
gpu     gpu       RSMAP  <=     YES          HOST        0        0
In the execution host configuration you can initialize your resources with ids/names (here simply GPU1 and GPU2).
qconf -me yourhost

hostname         yourhost
load_scaling     NONE
complex_values   gpu=2(GPU1 GPU2)
Then, when requesting -l gpu=1, the Univa Grid Engine scheduler will select GPU2 if GPU1 is already in use by a different job. You can see the actual selection in the qstat -j output. The job gets the selected GPU by reading the $SGE_HGR_gpu environment variable, which in this case contains the chosen id/name "GPU2". This can be used to access the right GPU without collisions.
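A rough sketch of how a job script might consume that variable (mapping the granted names GPU1/GPU2 to CUDA device indices is an assumption of this example, not something Grid Engine does for you):

#!/bin/bash
#$ -l gpu=1

# $SGE_HGR_gpu holds the granted resource name, e.g. "GPU2"
case "$SGE_HGR_gpu" in
  GPU1) export CUDA_VISIBLE_DEVICES=0 ;;
  GPU2) export CUDA_VISIBLE_DEVICES=1 ;;
esac

./my_cuda_program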
If you have a multi-socket host you can even attach a GPU directly to some CPU cores near the GPU (near the PCIe bus) in order to speed up communication between GPU and CPUs. This is possible by attaching a topology mask in the execution host configuration.
qconf -me yourhost

hostname         yourhost
load_scaling     NONE
complex_values   gpu=2(GPU1:SCCCCScccc GPU2:SccccSCCCC)
Now when the UGE scheduler selects GPU2 it automatically binds the job to all 4 cores (C) of the second socket (S) so that the job is not allowed to run on the first socket. This does not even require the -binding qsub param.
More configuration examples can be found on www.gridengine.eu.
Note that all these features are only available in Univa Grid Engine (8.1.0/8.1.3 and higher), not in SGE 6.2u5 or other Grid Engine versions (like OGE, Son of Grid Engine, etc.). You can try it out by downloading the 48-core limited free version from univa.com.