Bright Cluster Manager
While it is called a cluster manager, this is essentially a fairly generic Linux configuration management 
system with a cluster tilt. It allows "bare metal" reimaging of nodes from 
a set of images (a server can be enrolled into a group, each of which can be assigned an image; if a 
node is not assigned to any group, the default image is used). Images are typically stored on the headnode 
or on a special provisioning server (for really large clusters). There can be multiple images, one for each type 
of node. 
      
This is a distribution-agnostic tool that is able to
support heterogeneous hardware, not only HPC clusters.  This allows the deployment of any kind of
Linux distribution (standard or customized) to any kind of target machine.
Other than HPC clusters, suitable deployments might include computer labs and render farms.
Like SystemImager, Bright CM works with file-based (rather than block-based) system images using rsync. 
An image is stored as a directory hierarchy of files representing a snapshot of a machine, containing all 
the files and directories from the root of that machine's file system. Images can be acquired in multiple 
ways, including retrieval from a sample system.
One method of image creation is using a pre-installed node (the 
golden-client). In this way, the user can customize and tweak the golden-client's configuration 
according to his needs and verify its proper operation. This helps to assure that the image, once deployed, 
will behave in the same way as the golden-client. Incremental updates are possible by syncing an updated 
golden-client to the image, then syncing that image to deployed machines.
Images are hosted in a central repository on the headnode of the cluster, or on a special server called the 
image server, and they are distributed to the nodes using rsync. 
 
This is commercial software developed by Bright Computing, whose development office is in Amsterdam, 
NL, with ING Bank as a shareholder. 
Along with supplying a valuable (albeit complex) package, they also provide value as 
pretty capable packagers of important software such as workload managers (SGE, Torque, PBS Pro), etc.  This part 
of their added value should probably be taken into account when you purchase the license, because 
the core reimaging functionality is almost identical to that of SystemImager (which became abandonware in 2015).  
With Red Hat introducing Red Hat for HPC (later renamed Red Hat for Scientific Computing), the classic 
compute node license model no longer works. If a node has this flavor of RHEL installed, 
you need to customize each node, as registration is now separate for each node. The idea of DHCP-based 
provisioning with identical images for all nodes, which Bright Cluster Manager relies upon, is broken. 
You also can't patch the image on the headnode via chroot, because your nodes are now 
licensed for a different distribution, which is only a subset of the Enterprise one -- as if it were a 
different flavor of the OS.  
This problem is not fatal -- you can always create the image from a node instead (working with 
the selected node as a golden image) -- but this is a 
new situation: you now need to preserve licensing information for each and every node separately and 
populate it as an additional step after the installation.  
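As a rough sketch of that extra per-node step (hypothetical values: ORG_ID and ACT_KEY stand for your Red Hat 
organization ID and activation key, and pdsh with the category genders is assumed to be set up as in the 
Mellanox example further down):
# Register every node of the "default" category with its own RHEL subscription
# after the node has been provisioned (placeholders, not real credentials):
pdsh -g category=default "subscription-manager register --org=ORG_ID --activationkey=ACT_KEY"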
Like many complex Unix management systems, Bright CM modifies many system files in ways you do not understand, 
which makes integration of new software more complex and troubleshooting almost impossible. 
Look, for example, at the definition of the parallel environment 
in SGE: it contains references to some CM scripts. The SGE environment is loaded via modules which are 
also in CM directories. Changing certain parameters in SGE (for example the default number of 
cores on a node) actually requires changing them in CM too.  It is unclear why, but that's 
how it is. See also: 
Why is 
my workload manager configuration changed by Bright?
You can learn some useful stuff from those modifications, but they create unique troubleshooting 
problems. The problem is that you need to guess that CM is the culprit, which is not obvious in the case 
I described.  I, for example, guessed it only because the previous behaviour returned after I 
manually changed the SGE configuration to correct the wrong core count on the servers.
If you manage a small cluster (let's say less than 50 nodes), at some point you might ask yourself 
whether the game is worth the candle. The key attraction of CM -- the ability to seamlessly 
restore a computational node from an image -- can be implemented in several other ways. Beyond that, Bright 
Cluster Manager does not provide any indispensable functionality for small clusters.  
It has three  typical problems inherent in such systems: 
   - It creates an additional, pretty complex layer that obstructs viewing and understanding of the lower 
   layers.  This is especially important for troubleshooting, which is badly affected if you 
   need to debug issues that touch CM functionality.  For example, after CM is installed on the 
   headnode you can't change the hostname easily.  Also, the default setup, in which nodes use a dedicated 
   private network, is suboptimal in cases where you need to connect nodes to an external environment during 
   computations (unless you have an extra interface, which is often not the case for blades; you probably 
   can use virtual interfaces, though). 
 
   - It introduces a custom command language which, if used only episodically, is a pain in the 
   neck and can be used only on the basis of previously documented examples.  As the language is not used often, 
   you do need a cheat sheet for the most typical commands. 
   They are not intuitive, and the syntax is sometimes pretty weird. For anything more complex than typical 
   operations, you depend on CM support, which, actually, is pretty good. 
 
   - Documentation is weak. Typically there is no attempt to provide the most common usage 
   examples, or to explain the ideas behind a particular feature.  There is too much focus on describing the 
   full capabilities of the system, which are admittedly impressive -- it can manage really big clusters, 
   including exotic features needed only or mainly for large clusters, etc. In this sense CM documentation 
   is really bad.  The node provisioning chapter, the chapter about the feature that is the crown jewel 
   of the package, is poorly written and does not have examples of typical usage. For example, there is no 
   way to figure out from it how to create a new image (hint: you first need to clone an 
   existing image and then overwrite it).  No explanation of key ideas, no detailed examples of 
   typical usage. Nothing. Those guys simply can't understand the importance of documentation and instead 
   are engaged in a rat race of adding features (malignant featurism ;-) 
 
There are several interesting parts of Bright CM. Among them: 
   - Working with images. This is the most impressive part. Bright created a pretty elaborate 
   system of working with images:
   
      - Different images can be assigned to groups of nodes. 
 
      - A different mode of synchronization with the image can be assigned for each node.
 
      - An image can be created from any node, so instead of working with the image in chroot mode you 
      can work with the real node and recreate an image from it. 
 
   
    
    - Schedulers offered by Bright (SGE, Torque, etc.) provide considerable additional value, 
    helping to offset the licensing cost of the manager itself.  It is not that easy to 
    install Torque correctly (the RPMs floating around are often junk) or SGE (Son of Grid Engine is the only 
    usable free distribution), and Bright provides some support for the versions it bundles with the cluster 
    manager. They can also be used in production environments outside typical clusters.
 
   - Boot process. The idea of associating different ways of booting the server with different 
   functionality of the manager is pretty slick. For PXE mode you can additionally specify various ways 
   of synchronization with the image.
 
   - The ability to work directly with Dell DRAC. For example, you can reboot a server, or 
   a group of servers, directly from Bright CM, without the hassle of opening DRAC for each such node. 
   
 
   - Environment modules integration. Provisioning of several free schedulers (SGE, Torque, etc.).
 
   - A usable, well-designed GUI client based on Firefox. Many operations on nodes can be performed 
   using it instead of the less user-friendly cmsh. It has some useful monitoring capabilities and as such 
   provides additional value to the package. 
 
   - The ability to update Dell firmware automatically. 
 
   - The Bright cluster management daemon monitors several important aspects of the functioning of every 
   node, and reports any problems it detects in the software or the hardware, so that you can take 
   action. 
 
   - Provisioning of Puppet.
 
   - Power management capabilities
 
Working with images is the most interesting part of Bright Cluster Manager. Few systems 
implement it as consistently as CM. Here the designers demonstrated some original thinking (for example, 
the role of the boot record as an indicator of how the node should behave).  You can create an image from 
a node and distribute it to other nodes; CM takes care of all the customization needed. If a node is configured 
for network boot (or if the boot record is absent), CM automatically reimages the node, or synchronizes it 
with the image if one already exists. Otherwise you get a regular boot. That means that installing or removing 
the boot record changes the behavior of a group of servers in a very useful way.  
You can have multiple images, each stored as an actual directory tree of files. 
Managing images is done using chroot and is not very convenient, but since it is possible to 
create an image from a node, you can do everything on a selected node instead, then create an image from 
that node and distribute it to other nodes.  
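A hedged sketch of that workflow in cmsh (image and category names here are invented; grabimage options may 
differ between Bright versions, so treat this as an outline rather than exact syntax):
# Clone an existing image as the starting point for a new one:
cmsh -c "softwareimage; clone default-image compute-image; commit"
# Point a node category at the new image:
cmsh -c "category use default; set softwareimage compute-image; commit"
# Pull the current filesystem of a tuned node back into that image:
cmsh -c "device use node001; grabimage -w -i compute-image"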
Bright CM can be used with non-cluster blade enclosures (sets of 16 or more real servers), if you 
can justify the cost of the license.  This is an interesting expandable and scalable turnkey solution 
for managing blades, which is especially attractive for Dell blades, since Bright interacts well with 
DRAC -- for example, for a webserver farm or DMZ servers.  
It also installs a lot of useful software, such as pdsh and environment modules. The latter are installed 
with integrated examples of packages which can serve as a framework for developing your own set of environment 
modules. Generally, the environment modules supplied are of high quality. 
By default, nodes boot from the network when using Bright Cluster Manager. This is called 
a network boot, or sometimes a PXE boot. The head node (or some other node, called the provisioning node) runs 
a tftpd server from within xinetd. It supplies the boot loader for the default image or the image 
assigned to the node.
You can also install a regular boot record on the node and use PXE boot only as needed.
Bright can provision and sync nodes only via PXE boot. 
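For example (a hedged sketch: the installbootrecord property is described in the Bright manuals, but verify 
the name against your version; nextinstallmode nosync is the same setting used in the flash-update FAQ below):
# Have nodes of the "default" category write a boot record to the local disk,
# so they can boot from disk when PXE is not used:
cmsh -c "category use default; set installbootrecord yes; commit"
# On the next PXE boot, skip re-synchronization with the image for one node:
cmsh -c "device use node001; set nextinstallmode nosync; commit"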
Aspects of power management in Bright Cluster Manager include:
   - managing the main power supply to nodes through the use of power distribution units, baseboard 
   management controllers, or CMDaemon (mainly with Dell) 
 
   - monitoring power consumption over time 
 
   - setting CPU scaling governors for power-saving 
 
   - setting power-saving options in workload managers 
 
   - ensuring the passive head node can safely take over from the active head during failover 
 
   - allowing cluster burn tests to be carried out 
 
This creates some opportunities for power savings, which is extremely important in large clusters. You 
can, for example, shut down inactive nodes and bring them back when there are jobs in the queue waiting for 
resources.
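A hedged sketch of the manual side of this from cmsh (the power subcommands in device mode are documented by 
Bright; tying this to the queue state requires your own scripting against the workload manager, which is not 
shown here):
# Check and cut power for a range of idle nodes via BMC/PDU:
cmsh -c "device power status -n node010..node020"
cmsh -c "device power off -n node010..node020"
# Bring them back when jobs start waiting for resources:
cmsh -c "device power on -n node010..node020"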
As clusters are often used by a large number of researchers, user management presents some problems. 
Bright CM allows (via the usernodelogin setting of cmsh) restricting direct user logins from outside 
the HPC scheduler, and is thus one way of preventing users from using node resources in an unaccountable 
manner. The usernodelogin setting is applicable to node categories only, not to individual nodes. For example:
# cmsh
[bright71]% category use default
[bright71->category[default]]% set usernodelogin onlywhenjob
[bright71->category*[default*]]% commit
The attributes for usernodelogin are:
   - always (the default): This allows all users to ssh directly 
   into a node at any time.
 
   - never: This allows no user other than root to directly 
   ssh into the node.
 
   - onlywhenjob: This allows the user to ssh directly into 
   the node when a job is running on it. 
 
Bright Cluster Manager runs its own LDAP service to manage users, rather than using Unix user and 
group files. That means that users and groups are managed via the centralizing LDAP database server 
running on the head node (accessible via cmgui), and not via entries in the /etc/passwd or /etc/group files.
You can use cmsh too, for example:
[root@bright71 ~]# cmsh
[bright71]% user
[bright71->user]%
[bright71->user]% add user maureen
[bright71->user*[maureen*]]%
[bright71->user*[maureen*]]% commit
[bright71->user[maureen]]% show
You can set user and group properties via  the set command. Typing set and then either using tab 
to see the possible completions, or following it up with the enter key, suggests several parameters 
that can be set, one of which is password:
Example
[bright71->user[maureen]]% set
Name:
set - Set specific user or group property
Usage:
set <parameter>
set user <name> <parameter>
set group <name> <parameter>
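A brief usage sketch continuing the example above (assuming set password prompts interactively, as it does in 
recent Bright versions):
[bright71->user[maureen]]% set password
enter new password:
retype new password:
[bright71->user*[maureen*]]% commit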
You can edit groups with the append and removefrom commands.   They are used to add extra 
users to, and remove extra users from, a group. For example, it may be useful to have a compiler group 
so that several users can share access to the Intel compiler. 
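A hedged sketch (the group name compilergroup is invented, and the group mode with append/removefrom follows 
the Bright manuals but may vary by version):
# cmsh
[bright71]% group use compilergroup
[bright71->group[compilergroup]]% append members maureen
[bright71->group*[compilergroup*]]% commit
[bright71->group[compilergroup]]% removefrom members maureen
[bright71->group*[compilergroup*]]% commit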
Dell BIOS management in Bright Cluster Manager means that for nodes that run on Dell hardware, the 
BIOS settings and BIOS firmware updates can be managed via the standard Bright front end utilities to 
CMDaemon, cmgui and cmsh. 
In turn, CMDaemon configures the BIOS settings and applies firmware updates to each node via a standard 
Dell utility called racadm. The racadm utility is part of the Dell OpenManage software stack. The Dell 
hardware supported includes R430, R630, R730, R730XD, R930, FC430, FC630, FC830, M630, M830 and C6320. 
The utility racadm must be present on the Bright Cluster Manager head node. The utility is installed 
on the head node if Dell is selected as the node hardware manufacturer during Bright Cluster Manager 
installation. IPMI must be working on all of the servers. This means that it should be possible to communicate 
out-of-band from the head node to all of the compute nodes, via the IPMI IP address. 
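A quick way to sanity-check that out-of-band access works before relying on the Dell integration (a generic 
ipmitool probe, not a Bright command; the BMC IP address, user, and password are placeholders for your 
environment):
# Verify that the head node can reach a compute node's BMC via IPMI:
ipmitool -I lanplus -H 10.148.255.1 -U bmcadmin -P 'secret' chassis status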
Such weak documentation is typical for complex software packages, but it is still pretty annoying.  Truth be told, 
cmsh has built-in help which partially compensates for the absence of detailed documentation on it. But the 
absence of typical usage examples is really bad. 
Important nuances are not mentioned.  Generally, this documentation is useful only in one case: 
if you never read it and rely on CM customer support. If they point you to the documentation, just ignore 
it. Just record how they solved the problem and create your own custom documentation from it. After several 
tickets you will have a valuable private database. 
They also run a knowledge base that might contain valuable information. 
CM changes the behavior of some components, for example SGE, in a way that complicates troubleshooting. 
For example, in one case it enforced the wrong number of cores on the servers, and if you correct it in the 
SGE all.q, after a while it reverts to the incorrect number. 
If the initial configuration is incorrect, you are in trouble in more than one way. For example, with 
SGE I noticed a very interesting bug: if your server has 24 cores and all.q was initially misconfigured 
with the number of slots equal to 12, you are in trouble. You change it via the qconf 
command in SGE and think that you are done. Wrong. After a while it reverts to the incorrect number. 
At this moment you want to kill the CM designers, because they are clearly amateurs. 
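For reference, the SGE side of such a fix looks like this, and, as an assumption based on how Bright keeps 
workload manager settings in roles, the slot count also has to be corrected in the corresponding CM client 
role, otherwise CMDaemon keeps writing the stale value back:
# SGE side: inspect and edit the slots attribute of all.q (the -mq form opens an editor):
qconf -sq all.q | grep slots
qconf -mq all.q
# Bright side (hedged: the sgeclient role and its slots parameter are taken
# from the Bright manuals and may differ by version):
cmsh -c "category use default; roles; use sgeclient; set slots 24; commit"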
Another case I already mentioned: if a node does not have a boot record, it can be reimaged from 
the image, and if there are differences between the current state of the node and the image, all those 
differences are lost. In the ideal case there should be none, but life is far from ideal.  
NOTE: what follows is Microsoft-style advertising of the product. They present a nice GUI, but forget 
to mention that the GUI is not everything and you can't manage the cluster from it alone. 
== quote ==
The sophisticated node provisioning and image management system in Bright Cluster Manager® 
allows you to do the following:
 
   - Install individual nodes or complete clusters from bare metal within minutes. This applies to 
   big data clusters and OpenStack private clouds in addition to HPC clusters.
 
   - Create, manage and use as many node images as required.
 
   - Create, manage and use images that are very different (for example, based on different Linux 
   kernels or distributions of Linux, Apache Hadoop and OpenStack).
 
   - Create or change images substantially without breaking compatibility with application software.
 
   - Assign images to individual nodes or groups of nodes with a single command or mouse click.
 
   - Make changes to node images on the head node, without having to login to regular nodes.
 
   - Synchronize a regular node image on the head node from a hard disk on a regular node.
 
   - Apply RPM package commands 
   to node images, manually or automatically (for example, using
   Yum).
 
   - Update images incrementally, only transferring changes to the nodes.
 
   - Update images live, without having to reboot nodes.
 
   - Configure how disks should be partitioned (optionally using software
   RAID and/or
   LVM ).
 
   - Protect disks or disk partitions from being overwritten.
 
   - Provision images to memory and run nodes diskless.
 
   - Use revision control to keep track of changes to node images.
 
   - Return to a previously stored node image if and when required.
 
   - Backup all node images by backing up only the head node.
 
   - Automatically update BIOS images or change BIOS configurations without keyboard or console access 
   to the nodes.
 
Bright Computing engineers will be on hand to demonstrate all the 7.1 updates that enable customers 
to deploy, manage, use, and maintain complete HPC clusters over bare metal or in the cloud even more 
effectively. Leading the list of enhancements is fully integrated support for
Intel® Enterprise Edition for Lustre (IEEL), integrated Dell BIOS operations, and
open source 
Puppet. Improved integration with several workload managers and a refactored Web portal round out 
the exciting enhancements.
Those who need to deploy, use and maintain a POSIX-compliant parallel file system will find the integrated 
IEEL support lets them do so efficiently and with the well-known Bright Cluster Manager interface. Fully 
integrated support for Puppet ensures the right services are up and running on the right platforms, 
through enforced configurations. With integrated support for Dell BIOS firmware and configuration settings, 
users can deploy and maintain supported Dell servers from the BIOS level, using Bright's familiar interface.
Broader and deeper support for Slurm, Sun Grid Engine, and Univa Grid Engine ensures that Bright 
Cluster Manager for HPC fully integrates the capability to optimally manage HPC workloads. Users can 
quickly and easily monitor their HPC workloads through the updated web interface provided by Bright's 
user portal. Version 7.1 also incorporates refactored internals for improved performance, as well as 
finer-grained management control that includes customized kernels. 
"We are excited to share the latest updates and enhancements we've made to Bright Cluster Manager 
for HPC. Collectively, they further reduce the complexity of on-premise HPC and help our customers extend 
their on-premise HPC environment into the cloud," said Matthijs van Leeuwen, Bright Computing Founder 
and CEO. "The latest version allows our customers to manage their HPC environment alongside their platforms 
for Big Data Analytics, based on Apache Hadoop and Apache Spark, from a single management interface."
For more information, visit
http://www.brightcomputing.com/Solutions-HPC
- 20170713 : How do I know when a clone operation has completed? (Jul 13, 2017) 
- 20170713 : How do I upgrade a Torque package? 
- 20170713 : How do I integrate a custom Torque installation with Bright Cluster Manager? 
- 20170713 : Installing bacula installs pbspro, which I don't want. What should I do? 
- 20170713 : How do I add a Bright ISO as a YUM repository? 
- 20170713 : When does a node need to be restarted? 
- 20170713 : How can I flash update all my nodes (from Linux)? 
- 20170619 : How to easily install & configure the Torque-Maui open source scheduler in Bright, by Robert Stober (Jun 19, 2017, www.brightcomputing.com) 
- 20170619 : OpenStack Neutron Mellanox ML2 Driver Configuration in Bright 
- 20170619 : Bright Cluster Manager 7 for HPC - New 
- 200102 : How do I set up a local Bright repository? 
 
  We would like to know when exactly the clone of an image has completed. This is so we can automate 
  some image update and test processes. I.e.: we clone an image, apply updates to the clone, assign that 
  updated image to a category, and reboot a node for testing the updated image.
  However, the current "clone/commit" process goes into the background. This makes programmatically 
  determining when it has finished rather difficult. Can we make the commit of an image clone wait for 
  completion in the cmsh shell so our script will wait before attempting to apply updates?
  In 6.0 the --wait option to the commit command makes cmsh wait for any background task to complete. 
  A list of tasks that are waiting for completion can be seen with cmsh -A -c "task list"
   For versions of BCM prior to 6.0, the following technique can be used:
   The CMDaemon will not start the background copy operation if the target directory already exists. 
   So what you can do from a bash script is something like this:
cp -a /cm/images/default-image /cm/images/new-image
cmsh -c "softwareimage; clone default-image new-image; commit"
   The first line guarantees the copy is done (and exits after the cp is done). That means that the 
   second line does pretty much nothing except for housekeeping, which lets cmd know about new-image. 
   In particular, for the second line, cloning, which normally runs in the background to carry out the 
   copy, doesn't do any copying because that was already done.
   Applying updates to the images can then be carried out without needing to test whether the clone 
   has completed.
   
      
          When does a node need to be restarted? Why does a node need to be restarted? Can I ignore it? 
          How do I clear that status?
          Can I ignore it?
          Not really, unless you really know what you are doing. You can see if a node needs 
          restarting from the device status command (alias: ds).
          In cmsh:
          bright60% device status
          apc01 .................... [   UP   ] health check failed
          devhp .................... [   UP   ] health check failed
          node001 .................. [   UP   ] restart-required
          node002 .................. [   UP   ] health check failed
          Or from cmgui -> nodes[node001] -> hostname[state]: restart-required.
          When does a node need to be restarted?
          A restart-required flag is set when a commit is done on a node that changes the state of: 
          category, image, IP, hostname, disk setup, PXE label, initialize script, finalize script, or 
          install boot record.
          Similar rules apply for category and image commits.
          These settings all have fields used by the node-installer.
          It is possible to get false positives. For example, adding a newline to a script will mark the 
          node as restart-required.
          There are, however, potentially many things that can differ when changes are made, and no guarantee 
          that all settings from the new category have been applied until you reboot the node. The reason 
          a restart-required message is there is to warn you that the node may be in a weird state 
          (e.g., if moving a node from category B to a new category A, it may still be using the software 
          image that has been set for category B).
          Why does a node need to be restarted? 
          The reason for the failure is often given within parentheses:
          bright60% device status
          node060 .................. [ UP ] (eth0 changed) restart-required
          node061 .................. [ UP ] (category changed) restart-required
          Sometimes the info message gives a clue about the reason for failure:
          [bright60->device]% status node001
          node001 .................. [  DOWN  ] pingable, restart-required, health check failed
          In which case you can investigate the reason further, e.g., check the health checks with:
          [bright60->device]% latesthealthdata node001
          Health Check     Severity  Value  Age (sec.)  Info Message
          ---------------- --------- ------ ----------- ------------------------------
          nanchecker       10        FAIL   1090
          DeviceIsUp       40        FAIL   10
          ssh2node         0         PASS   1090        Not UP according to CMDaemon
          [bright60->device]%
          How do I clear that status?
          You can clear the restart-required flag without a reboot in cmsh by closing and opening 
          the node:
          device open --reset -n node001..node100
      
    
   
      
          How can I flash update all my nodes (from Linux)? (For update via DOS, see
          /faq/index.php?action=artikel&cat=20&id=94)
          Some manufacturers like Dell provide a flash BIOS upgrade utility that is run from within 
          Linux. Such a utility typically requires the node to reboot from the hard drive after running 
          the utility, and only then will the upgrade be complete. It therefore typically does not work 
          with the nodes of a cluster, because nodes by default do a PXE boot.
          Because the flash upgrade utility is usually a binary, it is unclear how it works. The procedure 
          to make it work, described next, is based on some commonsense guesswork. The manufacturer should 
          be contacted to confirm how their utility works before trying out the procedure described 
          next.
          The procedure described next should be done with care, because a node that has a damaged BIOS 
          may not function at all. Such a node is said to be "bricked" because it may be as much use as 
          a brick for its intended purpose.
          The procedure is based on the likelihood that the utility modifies something on the local 
          drive, probably a service which is loaded on system startup.
          The trouble with the firmware update utility doing that (a modification that is to run on 
          startup) is that a regular node in the cluster normally PXE boots from a software image, and 
          not from the regular node hard drive that the utility has modified. So this is why, in a default 
          cluster, updating a flash BIOS from Linux will not succeed for regular nodes.
          For such a case, the utility can however usually be made to work by simply setting the node 
          to non-sync on the next install. For example:
          cmsh -c "device use node001; set nextinstallmode nosync; commit"
          After running the firmware installation utility, if the node has the updated BIOS on 
          it after reboot and the node is behaving OK, the following procedure can then install the firmware 
          on the remaining nodes:
          1. Make sure the firmware binary is on all nodes by placing it in your software image, or 
          by using pcopy in cmsh. Download the firmware binary and save it to /opt on the head node:
          [bright60 ~]# chmod 755 /opt/C6220_BIOS_15R41_LN32_1.1.9.BIN
          [bright60 ~]# cmsh
          [bright60]% device pcopy -n node002..node200 /opt/C6220_BIOS_15R41_LN32_1.1.9.BIN /opt
          [bright60]% device commit
          2. Set the nextinstallmode to NOSYNC. Example:
          [bright60]% device foreach -n node002..node200 (set nextinstallmode nosync)
          [bright60]% device commit
          3. Use pexec to call the firmware upgrade utility on all nodes. Example:
          [bright60]% device pexec -n node002..node200 "/opt/C6220_BIOS_15R41_LN32_1.1.9.BIN -q"
          Note that by setting a node's nextinstallmode to NOSYNC, you are telling it to skip image 
          re-synchronization on the next boot. After this boot, everything will be back to normal. It 
          is probably wise to do the update on a small number of nodes first (e.g., 5-10) so that not all 
          the nodes are bricked if something goes wrong. Starting cautiously by doing it on one node 
          is probably a good idea.
      
    
 
                                                            
How to easily install & configure the Torque/Maui open source scheduler in Bright
by Robert Stober | August 14, 2012 | Tags: workload manager, HPC job scheduler, Maui, Torque
Bright Cluster Manager makes most cluster management tasks very easy to perform, and installing workload 
managers is one of them. There are many workload managers that are pre-configured, admin-selectable 
options when you install Bright, including PBS Pro, SLURM, LSF, openlava, Torque, and Grid Engine. 
The open source scheduler Maui is not pre-configured, but it's really easy to install and configure this 
software in Bright Cluster Manager. This article shows you how. 
The process is to download and install the Maui scheduler, then to configure Bright to use Maui to 
schedule Torque jobs.
Getting Started
Step 1: Download the Maui scheduler from the Adaptive Computing website. You will need to register 
on their site before you can download it. 
Step 2: Install it as shown below. This command will overwrite the Bright zero-length Maui placeholder file.
# cp -f maui-3.3.1.tar.gz /usr/src/redhat/SOURCES/maui-3.3.1.tar.gz
Step 3: Build the Maui RPM.
# rpmbuild -bb /usr/src/redhat/SPECS/maui.spec
Step 4: Install the RPM.
# rpm -ivh /usr/src/redhat/RPMS/x86_64/maui-3.3.1-59_cm6.0.x86_64.rpm
Preparing...    ########################################### [100%]
   1:maui       ########################################### [100%]
Then select the node that is running the Torque server (usually the head node) resource, then the 
"roles" tab. Configure the "scheduler" property of the Torque Server role to use the Maui scheduler.
Step 5: Load the Torque and Maui modules. This adds the Maui commands to your PATH in the current shell.
$ module load torque
$ module load maui
The "initadd" command adds the Torque and Maui modules to your environment so that next time you log 
in they're automatically loaded.
$ module initadd torque maui
Step 6: Submit a simple Torque job.
$ qsub stresscpu.sh
5.torque-head.cm.cluster
The job has been submitted and is running.
                                                               
$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
5.torque-head             stresscpu        rstober         0        R shortq
The Maui showq command displays information about active, eligible, blocked, and/or recently completed 
jobs. Since Torque is not actually scheduling jobs, the showq command displays the actual job ordering.
$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
5                  rstober     Running     1 99:23:59:28  Thu Aug  9 11:40:45
     1 Active Job
IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
0 Idle Jobs
BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0
The Maui checkjob command displays detailed job information for queued, blocked, active, and recently 
completed jobs.
$ checkjob 5
checking job 5
State: Running
Creds:  user:rstober  group:rstober  class:shortq  qos:DEFAULT
WallTime: 00:01:31 of 99:23:59:59
SubmitTime: Thu Aug  9 11:40:44
  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)
StartTime: Thu Aug  9 11:40:45
Total Tasks: 1
Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Allocated Nodes:
[node003.cm.cluster:1]
IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE
Reservation '5' (-00:01:31 -> 99:23:58:28  Duration: 99:23:59:59)
PE:  1.00  StartPriority:  1
   
   Download the latest Mellanox OFED package for Centos/RHEL 6.5
   http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers
    The package name looks like this: MLNX_OFED_LINUX-<version>-rhel6.5-x86_64 (the package can be 
    downloaded either as an ISO or a tarball).
    The OFED package is to be copied (one way or another) to all the compute hosts which require an 
    upgrade of the firmware. (Note: only at a later stage of the article will we describe the actual 
    installation of the OFED package into the software images. Right now we only want the file on the 
    live node.)
    An efficient way to upgrade the firmware on multiple hosts would be to extract (in the case of a tar.gz 
    file) or copy (in the case of an ISO) the OFED package directory to a shared location such as /cm/shared 
    (which is mounted on compute nodes by default).
   Then we can use the pdsh tool in combination with category names to parallelize the upgrade.
   In our example we extract the OFED package to /cm/shared/ofed.
   Before we begin the upgrade we need to remove the cm-config-intelcompliance-slave package to avoid 
   conflicts:
   [root@headnode ~]# pdsh -g category=openstack-compute-hosts-mellanox "yum remove -y cm-config-intelcompliance-slave"
   (For now we will only remove it from live nodes. We will remove it from the software image later 
   in the article. Do not forget to also run this command on the headnode)
    In some cases the package 'qlgc-ofed.x86_64' may also need to be removed; in such a case the mlnxofedinstall 
    will not proceed. A log of the installer can always be viewed in /tmp/MLNX_OFED_LINUX-<version>.<num>.logs/ofed_uninstall.log 
    to determine which package is conflicting, and it can then be removed manually.
   And then run the firmware upgrade:
   [root@headnode ~]# pdsh -g category=openstack-compute-hosts-mellanox "cd /cm/shared/ofed/MLNX_OFED_LINUX-2.3-1.0.1-rhel6.5-x86_64/ 
   && echo \"y\" | ./mlnxofedinstall --enable-sriov" | tee -a /tmp/mlnx-firmware-upgrade.log
   (Do not forget to execute these two steps on the network node and the headnode)
    Note that we are outputting both to the screen and to a temporary file (/tmp/mlnx-firmware-upgrade.log). 
    This can help in spotting any errors that might occur during the upgrade. Running the 'mlnxofedinstall 
    --enable-sriov' utility does two things:
   
      - installs OFED on the live nodes
 
      - updates the firmware on the InfiniBand cards and enables the SR-IOV functionality.
 
   
    Notice that in the case of the compute nodes (node001-node003), at this point we're mostly interested 
    in the latter (firmware update and enabling SR-IOV). Since we've run this command on the live nodes, 
    the filesystem changes have not been propagated to the software image used by the nodes (i.e., at 
    this point they would be lost on reboot). We will take care of that later on in this article by also 
    installing the OFED into the software image. 
    In the case of the headnode, however, running this command also effectively installs OFED and 
    updates the firmware, which is exactly what we want.
   Bright Cluster Manager 7
   
   
   Bright Cluster Manager for HPC lets customers deploy complete HPC clusters on bare metal and 
   manage them effectively. It provides single-pane-of-glass management for the hardware, operating 
   system, HPC software, and users. With Bright Cluster Manager for HPC, system administrators can get 
   their clusters up and running quickly and keep them running reliably throughout their life cycle 
   – all with the ease and elegance of a fully featured, enterprise-grade cluster manager.
   With the latest release, we've added some great new features that make Bright Cluster Manager 
   for HPC even more powerful. 
   New Feature Highlights
   Image Revision Control – We've added revision control capability which means 
   you can track changes to software images using standardized methods.
   Integrated Cisco UCS Support – With the new integrated support for Cisco UCS 
   rack servers, you can rapidly introduce flexible, multiplexed servers into your HPC environment.
   Native AWS Storage Service Support – Bright Cluster Manager 7 now supports native 
   AWS storage which means that you can use inexpensive, secure, durable, flexible and simple storage 
   services for data use, archiving and backup in the AWS cloud.
   Intelligent Dynamic Cloud Provisioning – By only instantiating compute resources 
   in AWS when they're actually needed – such as after the data to be processed has been uploaded, or 
   when on-site workloads reach a certain threshold – Bright Cluster Manager 7 can save you money.
    Bright Cluster Manager Images
       - The Cluster Management GUI of Bright Cluster Manager 7 illustrating queued jobs. Some jobs are 
       running on compute nodes that have been dynamically provisioned in the AWS cloud. 
       - The Cluster Management GUI of Bright Cluster Manager 7 capturing a summarized description.
   
    How do I set up a local Bright repository?
1. Copy the Bright yum repo file, /etc/yum.repos.d/cm.repo, from the head node to the server where 
   you're going to create the local mirror.
2. Get the repository ID (on the mirror server):
# yum clean all
# yum repolist
[...]
cm-rhel6-7.0          Cluster Manager 7.0 for Red Hat Enterprise Linux 6            301+8
cm-rhel6-7.0-updates  Cluster Manager 7.0 for Red Hat Enterprise Linux 6 - Updates  371
[...]
3. Sync the repositories locally:
# mkdir -p /path/to/local/yum/repo/cm-rhel6-7.0
# reposync --gpgcheck -l --repoid=cm-rhel6-7.0 -n
# createrepo -v /path/to/local/yum/repo/cm-rhel6-7.0
# mkdir -p /path/to/local/yum/repo/cm-rhel6-7.0-updates
# reposync --gpgcheck -l --repoid=cm-rhel6-7.0-updates -n
# createrepo -v /path/to/local/yum/repo/cm-rhel6-7.0-updates
4. You may need to create local repositories for ceph-* and epel as well, since some Bright packages 
   may have dependencies which are provided by these repositories.