The problem of scale also changes how we think about systems management, sometimes in surprising or counterintuitive ways. For example, an admin over 20,000 systems is far more likely to be running a configuration management engine such as Puppet/Chef or CFEngine and will therefore have fewer qualms about host-centric configuration. The large installation administrator knows that he can make configuration changes to all of the hosts centrally. It’s no big deal. Smaller installations instead tend to favor tools that minimize the necessity to configure individual hosts.
Large installation admins are rarely concerned about individual node failures. Designs that incorporate single points of failure are generally to be avoided in large application frameworks, where it can be safely assumed, given the sheer amount of hardware involved, that some percentage of nodes are always going to be on the fritz. Smaller installations tend to favor monitoring tools that strictly define individual hosts centrally and alert on individual host failures. This sort of behavior quickly becomes unwieldy and annoying in larger networks.
If you think about it, the monitoring systems we’re used to dealing with all work the way they do because of this “little network” mind set. This tendency to centralize and strictly define the configuration begets a central daemon that sits somewhere on the network and polls every host every so often for status. These systems are easy to use in small environments: just install the (usually bloated) agent on every system and configure everything centrally, on the monitoring server. No per-host configuration required.
This approach, of course, won’t scale. A single daemon will always be capable of polling only so many hosts, and every host that gets added to the network increases the load on the monitoring server. Large installations sometimes resort to installing several of these monitoring systems, often inventing novel ways to roll up and further centralize the data they collect. The problem is that even using roll-up schemes, a central poller can poll an individual agent only so fast, and there’s only so much polling you can do before the network traffic becomes burdensome. In the real world, central pollers usually operate on the order of minutes.
Ganglia, by comparison, was born at Berkeley, in an academic, Grid-computing culture. The HPC-centric admins and engineers who designed it were used to thinking about massive, parallel applications, so even though the designers of other monitoring systems looked at tens of thousands of hosts and saw a problem, it was natural for the Berkeley engineers to see those same hosts as the solution.
Ganglia’s metric collection design mimics that of any well-designed parallel application. Every individual host in the grid is an active participant, and together they cooperate, organically distributing the workload while avoiding serialization and single points of failure. The data itself is replicated and dispersed throughout the Grid without incurring a measurable load on any of the nodes. Ganglia’s protocols were carefully designed, optimizing at every opportunity to reduce overhead and achieve high performance.
This cooperative design means that every node added to the network only increases Ganglia’s polling capacity and that the monitoring system stops scaling only when your network stops growing. Polling is separated from data storage and presentation, both of which may also be redundant. All of this functionality is bought at the cost of a bit more per-host configuration than is employed by other, more traditional monitoring systems.
Nagios is probably the most popular open source monitoring system in existence today, and it is generally credited with, if not inventing, then certainly perfecting the centralized polling model employed by myriad monitoring systems, both commercial and free. Nagios has been imitated, forked, reinvented, and commercialized, but in our opinion, it's never been beaten, and it remains the yardstick by which all monitoring systems are measured.
Under the hood, Nagios is really just a special-purpose scheduling and notification engine. By itself, it can’t monitor anything. All it can do is schedule the execution of little programs referred to as plug-ins and take action based on their output.
Nagios plug-ins return one of four states: 0 for “OK,” 1 for “Warning,” 2 for “Critical,” and 3 for “Unknown.” The Nagios daemon can be configured to react to these return codes, notifying administrators via email or SMS, for example. In addition to the codes, the plug-ins can also return a line of text, which will be captured by the daemon, written to a log, and displayed in the UI. If the daemon finds a pipe character in the text returned by a plug-in, the first part is treated normally, and the second part is treated as performance data.
Performance data doesn’t really mean anything to Nagios; it won’t, for example, enforce any rules on it or interpret it in any way. The text after the pipe might be a chili recipe, for all Nagios knows. The important point is that Nagios can be configured to handle the post-pipe text differently than pre-pipe text, thereby providing a hook from which to obtain metrics from the monitored hosts and pass those metrics to external systems (like Ganglia) without affecting the human-readable summary provided by the pre-pipe text.
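To make the plug-in contract concrete, here is a minimal sketch of a hypothetical plug-in (the file name, metric name, and thresholds are invented for illustration) that returns one of the status codes above and appends performance data after the pipe character:

#!/bin/bash
# check_tmp_usage.sh -- hypothetical example plug-in, not shipped with Nagios.
# Reports /tmp utilization and emits performance data after the pipe character.

USED=$(df -P /tmp | awk 'NR==2 {gsub("%","",$5); print $5}')

if [ "$USED" -ge 95 ]; then
    echo "TMP CRITICAL - ${USED}% used|tmp_used=${USED}"
    exit 2    # Critical
elif [ "$USED" -ge 85 ]; then
    echo "TMP WARNING - ${USED}% used|tmp_used=${USED}"
    exit 1    # Warning
else
    echo "TMP OK - ${USED}% used|tmp_used=${USED}"
    exit 0    # OK
fi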
Nagios's performance data handling feature is an important hook. There are quite a few Nagios add-ons that use it to export metrics from Nagios for the purpose of importing them into local RRDs. These systems typically point the service_perfdata_command attribute in nagios.cfg to a script that uses a series of regular expressions to parse out the metrics and metric names and then imports them into the proper RRDs. The same methodology can easily be used to push metrics from Nagios to Ganglia by pointing the service_perfdata_command to a script that runs gmetric instead of the RRDtool import command.
First, you must enable performance data processing in Nagios by setting process_performance_data=1 in the nagios.cfg file. Then you can specify the name of the command to which Nagios should pass all performance data it encounters using the service_perfdata_command attribute.
Let's walk through a simple example. Imagine a check_ping plug-in that, when executed by the Nagios scheduler, pings a host and then returns the following output:
PING OK - Packet loss = 0%, RTA = 0.40 ms|0;0.40
We want to capture this plug-in's performance data, along with details we'll need to pass to gmetric, including the name of the target host. Once process_performance_data is enabled, we'll tell Nagios to execute our own shell script every time a plug-in returns with performance data by setting service_perfdata_command=pushToGanglia in nagios.cfg. Then we'll define pushToGanglia in the Nagios object configuration like so:
define command{
   command_name pushToGanglia
   command_line /usr/local/bin/pushToGanglia.sh "$LASTSERVICECHECK$||$HOSTNAME$||$SERVICEDESC$||$SERVICEOUTPUT$||$SERVICEPERFDATA$"
}
With so many Nagios plug-ins, written by so many different authors, it's important to choose your delimiter carefully and avoid one that might appear in a plug-in's output. In our example command, we chose a double pipe as the delimiter, which can be awkward to parse in some languages. The tilde (~) character is another good choice.
The capitalized words surrounded by dollar signs in the command definition are Nagios macros. Using macros, we can request all sorts of interesting details about the check result from the Nagios daemon, including the non-performance-data section of the output returned from the plug-in. The Nagios daemon replaces these macros with their respective values at runtime, so when Nagios runs our pushToGanglia command, our input will wind up looking something like this:
1338674610||dbaHost14.foo.com||PING||PING OK - Packet loss = 0%, RTA = 0.40 ms||0;0.40
Our pushToGanglia.sh script takes this input and compares it against a series of regular expressions to detect what sort of data it is. When it matches the PING regex, the script parses out the relevant metrics and pushes them to Ganglia using gmetric. It looks something like this:
#!/bin/bash
# pushToGanglia.sh -- the <<< herestrings below require bash rather than plain sh
while read IN
do
   #check for output from the check_ping plug-in
   if [ "$(awk -F '[|][|]' '$3 ~ /^PING$/' <<<${IN})" ]
   then
      #this looks like check_ping output all right, parse out what we need
      read BOX CMDNAME PERFOUT <<<$(awk -F '[|][|]' '{print $2" "$3" "$5}' <<<${IN})
      read PING_LOSS PING_MS <<<$(tr ';' ' ' <<<${PERFOUT})
      #OK, we have what we need. Send it to Ganglia.
      #(metric names and the float type are passed via gmetric's -n and -t flags)
      gmetric -S ${BOX} -n "${CMDNAME}_MS" -t float -v ${PING_MS}
      gmetric -S ${BOX} -n "${CMDNAME}_LOSS" -t float -v ${PING_LOSS}
   #check for output from the check_cpu plug-in
   elif [ "$(awk -F '[|][|]' '$3 ~ /^CPU$/' <<<${IN})" ]
   then
      #do the same sort of thing but with CPU data
      :
   fi
done
This is a popular solution because it's self-documenting, keeps all of the metrics collection logic in a single file, detects new hosts without any additional configuration, and works with any kind of Nagios check result, including passive checks. It does, however, add a nontrivial amount of load to the Nagios server. Consider that any time you add a new check, the result of that check for every host must be parsed by the pushToGanglia script. The same is true when you add a new host or even a new regex to the pushToGanglia script. In Nagios, process_performance_data is a global setting, and so are the ramifications that come with enabling it.
It probably makes sense to process performance data globally if you rely heavily on Nagios for metrics collection. However, for the reasons we outlined in Chapter 1, we don’t think that’s a good idea. If you’re using Ganglia along with Nagios, gmond is the better-evolved symbiote for collecting the normal litany of performance metrics. It’s more likely that you’ll want to use gmond to collect the majority of your performance metrics, and less likely that you’ll want Nagios churning through the result of every single check in case there might be some metrics you’re interested in sending over to Ganglia.
If you're interested in metrics from only a few Nagios plug-ins, consider leaving process_performance_data disabled and instead writing "wrappers" for the interesting plug-ins. Here, for example, is what a wrapper for the check_ping plug-in might look like:
#!/bin/bash
# check_ping wrapper -- the <<< herestring below requires bash rather than plain sh
ORIG_PLUGIN='/usr/libexec/check_ping_orig'
CMDNAME='PING'   #metric name prefix (not set in the original snippet)

#get the target host from the -H option
while getopts "H:" opt
do
   if [ "${opt}" == 'H' ]
   then
      BOX=${OPTARG}
   fi
done

#run the original plug-in with the given options, and capture its output
OOUT=$(${ORIG_PLUGIN} "$@")
OEXIT=$?

#parse out the perfdata we need
read PING_LOSS PING_MS <<<$(echo ${OOUT} | cut -d\| -f2 | tr ";" " ")

#send the metrics to Ganglia (metric names and the float type via gmetric's -n/-t flags)
gmetric -S ${BOX} -n "${CMDNAME}_MS" -t float -v ${PING_MS}
gmetric -S ${BOX} -n "${CMDNAME}_LOSS" -t float -v ${PING_LOSS}

#mimic the original plug-in's output back to Nagios
echo "${OOUT}"
exit ${OEXIT}
The wrapper approach takes a huge burden off the Nagios daemon but is more difficult to track. If you don’t carefully document your changes to the plug-ins, you’ll mystify other administrators, and upgrades to the Nagios plug-ins will break your data collection efforts.
The general strategy is to replace the check_ping plug-in with a small shell script that calls the original check_ping, intercepts its output, and sends the interesting metrics to Ganglia. The imposter script then reports back to Nagios with the output and exit code of the original plug-in, and Nagios has no idea that anything extra has transpired. This approach has several advantages, the biggest of which is that you can pick and choose which plug-ins will process performance data.
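Deploying a wrapper is mostly a matter of moving the real plug-in aside and dropping the script in its place. A minimal sketch, assuming the stock plug-ins live in /usr/libexec (as the ORIG_PLUGIN path above suggests) and the wrapper has been saved as /usr/local/src/check_ping_wrapper.sh (a hypothetical location):

cd /usr/libexec

#preserve the original plug-in under the name the wrapper expects
mv check_ping check_ping_orig

#install the wrapper in place of the original and make it executable
cp /usr/local/src/check_ping_wrapper.sh check_ping
chmod 755 check_ping

#Nagios keeps calling "check_ping" as before; no object configuration changes are needed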
Because Nagios has no built-in means of polling data from remote hosts, Nagios users have historically employed various remote execution schemes to collect a litany of metrics with the goal of comparing them against static thresholds. These metrics, such as the available disk space or CPU utilization of a host, are usually collected by services like NSCA or NRPE, which execute scripts on the monitored systems at the Nagios server’s behest, returning their results in the standard Nagios way. The metrics themselves, once returned, are usually discarded or in some cases fed into RRDs by the Nagios daemon in the manner described previously.
This arrangement is expensive, especially considering that most of the metrics administrators tend to collect with NRPE and NSCA are collected by gmond out of the box. If you’re using Ganglia, it’s much cheaper to point Nagios at Ganglia to collect these metrics.
To that end, the Ganglia project began including a series of official Nagios plug-ins in gweb versions as of 2.2.0. These plug-ins enable Nagios users to create services that compare metrics stored in Ganglia against alert thresholds defined in Nagios. This is, in our opinion, a huge win for administrators, in many cases enabling them to scrap entirely their Nagios NSCA infrastructure, speed up the execution time of their service checks, and greatly reduce the monitoring burden on both Nagios and the monitored systems themselves.
There are five Ganglia plug-ins currently available:

check_heartbeat: verify that a host is alive, based on Ganglia's heartbeat counter.
check_ganglia_metric: compare a single metric on a given host against a Nagios threshold.
check_multiple_metrics: check several metrics on the same host in a single service.
check_host_regex: check one or more metrics across a regex-defined range of hosts.
check_value_same_everywhere: verify that one or more values is the same across a set of hosts.
The plug-ins interact with a series of gweb PHP scripts that were created expressly for this purpose; see Figure 7-1. The check_host_regex.sh plug-in, for example, interacts with the PHP script http://your.gweb.box/nagios/check_host_regex.php. Each PHP script takes the arguments passed from the plug-in, parses a cached copy of the XML dump of the grid state obtained from gmetad's xml_port to retrieve the current metric values for the requested entities, and returns a Nagios-style status code (see the gmetad documentation for details on xml_port). You must enable the server-side PHP scripts before they can be used and also define the location and refresh interval of the XML grid state cache by setting the following parameters in the gweb conf.php file:
$conf['nagios_cache_enabled'] = 1;
$conf['nagios_cache_file'] = $conf['conf_dir'] . "/nagios_ganglia.cache";
$conf['nagios_cache_time'] = 45;
Figure 7-1. Plug-in principle of operation
Consider storing the cache file on a RAMDisk or tmpfs to increase performance.
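For example, here is a minimal sketch of carving out a small tmpfs for the cache; the mount point and size are arbitrary choices of ours, and the conf.php line shows where you would point the cache afterward:

#create a mount point and mount a small tmpfs for the gweb Nagios cache
mkdir -p /var/cache/ganglia-nagios
mount -t tmpfs -o size=16m tmpfs /var/cache/ganglia-nagios

#to make the mount persistent across reboots, add to /etc/fstab:
#  tmpfs  /var/cache/ganglia-nagios  tmpfs  size=16m  0 0

#then point gweb at it in conf.php:
#  $conf['nagios_cache_file'] = "/var/cache/ganglia-nagios/nagios_ganglia.cache";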
If you define a service check in Nagios to use hostgroups instead of individual hosts, Nagios will schedule the service check for all hosts in that hostgroup at the same time, which may cause a race condition if gweb's grid state cache changes before the service checks finish executing. To avoid cache-related race conditions, use the warmup_metric_cache.sh script in the web/nagios subdirectory of the gweb tarball, which will ensure that your cache is always fresh.
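One way to keep the cache warm is to run that script from cron. The sketch below is an assumption on our part: the install path is hypothetical, and we assume the script needs no arguments (check its header for the actual invocation); the one-minute interval simply needs to be shorter than your service check interval:

# /etc/cron.d/ganglia-cache-warmup  (hypothetical file; hypothetical script path)
* * * * *  root  /usr/local/bin/warmup_metric_cache.sh >/dev/null 2>&1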
Internally, Ganglia uses a heartbeat counter to determine whether a machine is up. This counter is reset every time a new metric packet is received for the host, so you can safely use this plug-in in lieu of the Nagios check_ping plug-in. To use it, first copy the check_heartbeat.sh script from the Nagios subdirectory in the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL inside the script is correct. By default, it is set to:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_heartbeat.php"
Next, define the check command in Nagios. The threshold is the amount of time since the last reported heartbeat; that is, if the last packet received was 50 seconds ago, you would specify 50 as the threshold:
define command {
   command_name   check_ganglia_heartbeat
   command_line   $USER1$/check_heartbeat.sh host=$HOSTADDRESS$ threshold=$ARG1$
}
Now, for every host or host group you want monitored, change check_command to:
check_command check_ganglia_heartbeat!50
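Tying it together, here is a minimal sketch of a host definition that uses the plug-in as its host check; the address and the generic-host template are assumptions based on a typical Nagios setup, and 50 is the threshold in seconds, as above:

define host {
   use             generic-host           ; assumes the stock generic-host template
   host_name       dbaHost14.foo.com
   address         192.168.1.14           ; hypothetical address
   check_command   check_ganglia_heartbeat!50
}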
The check_ganglia_metric plug-in compares a single metric on a given host against a predefined Nagios threshold. To use it, copy the check_ganglia_metric.sh script from the Nagios subdirectory in the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL inside the script is correct. By default, it is set to:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_metric.php"
Next, define the check command in Nagios like so:
define command {
   command_name   check_ganglia_metric
   command_line   $USER1$/check_ganglia_metric.sh host=$HOSTADDRESS$ metric_name=$ARG1$ operator=$ARG2$ critical_value=$ARG3$
}
Next, add the check command to the service checks for any hosts you want monitored. For instance, if you wanted to be alerted when the 1-minute load average for a given host goes above 5, add the following directive:
check_command check_ganglia_metric!load_one!more!5
To be alerted when the disk space for a given host falls below 10 GB, add:
check_command check_ganglia_metric!disk_free!less!10
The operators specified in the Nagios definitions for the Ganglia plug-ins always indicate the "critical" state. If you use a notequal operator, for example, the state is critical whenever the value is not equal to the one specified.
The check_multiple_metrics plug-in is an alternate implementation of the check_ganglia_metric script that can check multiple metrics on the same host. For example, instead of configuring separate checks for disk utilization on /, /tmp, and /var (which could produce three separate alerts), you could instead set up a single check that alerts any time disk utilization falls below a given threshold.
To use it, copy the check_multiple_metrics.sh script from the Nagios subdirectory of the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the variable GANGLIA_URL in the script is correct. By default, it is set to:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_multiple_metrics.php"
Then define a check command in Nagios:
define command {
   command_name   check_ganglia_multiple_metrics
   command_line   $USER1$/check_multiple_metrics.sh host=$HOSTADDRESS$ checks='$ARG1$'
}
Then add a list of checks that are delimited with a colon. Each check consists of:
metric_name,operator,critical_value
For example, the following service would monitor the disk utilization for root (/) and /tmp:
check_command check_ganglia_multiple_metrics!disk_free_rootfs,less,10:disk_free_tmp,less,20
Anytime you define a single service to monitor multiple entities in Nagios, you run the risk of losing visibility into "compound" problems. For example, a service configured to monitor both /tmp and /var might only notify you of a problem with /tmp, when in fact both partitions have reached critical capacity.
Use the check_host_regex plug-in to check one or more metrics on a regex-defined range of hosts. This plug-in is useful when you want to get a single alert if a particular metric is critical across a number of hosts.

To use it, copy the check_host_regex.sh script from the Nagios subdirectory in the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL inside the script is correct. By default, it is:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_host_regex.php"
Next, define a check command in Nagios:
define command {
   command_name   check_ganglia_host_regex
   command_line   $USER1$/check_host_regex.sh hreg='$ARG1$' checks='$ARG2$'
}
Then add a list of checks that are delimited with a colon. Each check consists of:
metric_name,operator,critical_value
For example, to check free space on / and /tmp for any machine whose name starts with web- or app-, you would use something like this:

check_command check_ganglia_host_regex!^web-|^app-!disk_free_rootfs,less,10:disk_free_tmp,less,10
Combining multiple hosts into a single service check will prevent Nagios from correctly respecting host-based external commands. For example, Nagios will send notifications if a host listed in this type of service check goes critical, even if the user has placed the host in scheduled downtime. Nagios has no way of knowing that the host has anything to do with this service.
Use the check_value_same_everywhere plug-in to verify that one or more metrics on a range of hosts have the same value. For example, let's say you wanted to make sure the SVN revision of the deployed program listing was the same across all servers. You could send the SVN revision as a string metric and then list it as a metric that needs to be the same everywhere.
To use the plug-in, copy the check_value_same_everywhere.sh script from the Nagios subdirectory of the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL variable inside the script is correct. By default, it is:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_value_same_everywhere.php"
Then define a check command in Nagios:
define command {
   command_name   check_value_same_everywhere
   command_line   $USER1$/check_value_same_everywhere.sh hreg='$ARG1$' checks='$ARG2$'
}
For example:
check_command check_value_same_everywhere!^web-|^app-!svn_revision,num_config_files
In Nagios 3.0, the action_url attribute was added to the host and service object definitions. When specified, the action_url attribute creates a small icon in the Nagios UI next to the host or service name to which it corresponds. If a user clicks this icon, the UI will direct them to the URL specified by the action_url attribute for that particular object.
If your host and service names are consistent in both Nagios and Ganglia, it's pretty simple to point any service's action_url back to Ganglia's graph.php using built-in Nagios macros, so that when a user clicks the action_url icon for that service in the Nagios UI, he or she is presented with a graph of that service's metric data. For example, if we had a host called host1, with a service called load_one representing the one-minute load history, we could ask Ganglia to graph it for us with:
http://my.ganglia.box/graph.php?c=cluster1&h=host1&m=load_one&r=hour&z=large
The hiccup, if you didn't notice, is that Ganglia's graph.php requires a c= attribute, which must be set to the name of the cluster to which the given host belongs. Nagios has no concept of Ganglia clusters, but it does provide you with the ability to create custom variables in any object definition. Custom variables must begin with an underscore, and they are available as macros in any context a built-in macro would be available. Here's an example of a custom variable in a host object definition defining the Ganglia cluster name to which the host belongs:
define host{
   host_name         host1
   address           192.168.1.1
   _ganglia_cluster  cluster1
   ...
}
Read more about custom object variables and macros in the Nagios documentation.
You can also use custom variables to correct differences between the Nagios and Ganglia namespaces, creating, for example, a _ganglia_service_name macro in the service definition to map a service called "CPU" in Nagios to a metric called "load_one" in Ganglia.
To enable the action_url attribute, we find it expedient to create a template for the Ganglia action_url, like so:
define service {
   name        ganglia-service-graph
   action_url  http://my.ganglia.host/ganglia/graph.php?c=$_GANGLIA_CLUSTER$&h=$HOSTNAME$&m=$SERVICEDESC$&r=hour&z=large
   register    0
}
This makes it easy to toggle the action_url graph for some services but not others by including use ganglia-service-graph in the definition of any service that you want to graph. As you can see, the action_url we've specified combines the custom _ganglia_cluster macro we defined in the host object with the HOSTNAME and SERVICEDESC built-in macros. If the Nagios service name is not the same as the Ganglia metric name (which is likely the case in real life), we would define our own _ganglia_service_name variable in the service definition and refer to that macro in the action_url instead of the SERVICEDESC built-in.
The Nagios UI also supports custom CGI headers and footers, which make it possible to implement rollover pop-ups for the action_url icon containing graphs from Ganglia's graph.php. This approach requires some custom development on your part and is outside the scope of this book, but we wanted you to know it's there. If that sounds like a useful feature to you, we suggest starting with the Nagios documentation on custom CGI headers and footers.
When Ganglia is running, it's a great way to aggregate metrics, but when it breaks, tracking down the cause can be frustrating. Thankfully, there are a number of points you can monitor to help stave off an inconvenient breakage.
Using check_nrpe (or even check_procs directly), the daemons that support Ganglia can be monitored for any failures. It is most useful to monitor gmetad and rrdcached on the aggregation hosts and gmond on all hosts. The pertinent snippets for local monitoring of a gmond process are:
define command {
   command_name   check_gmond_local
   command_line   $USER1$/check_procs -C gmond -c 1:2
}

define service {
   use                  generic-service
   host_name            localhost
   service_description  GMOND
   check_command        check_gmond_local
}
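The same pattern extends to the aggregation host. A hedged sketch for gmetad and rrdcached follows; the process-count thresholds are assumptions and should match how many instances you actually run:

define command {
   command_name   check_gmetad_local
   command_line   $USER1$/check_procs -C gmetad -c 1:1
}

define command {
   command_name   check_rrdcached_local
   command_line   $USER1$/check_procs -C rrdcached -c 1:1
}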
A more "functional" type of monitoring is checking for connectivity on the TCP ports the various services listen on: gmetad, for example, listens on ports 8651 and 8652, and gmond listens on port 8649. Checking these ports, with a reasonable timeout, gives a reasonably good idea as to whether the daemons are functioning as expected.
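A minimal sketch using the standard check_tcp plug-in; the command names and the 5-second timeout are our own choices:

define command {
   command_name   check_gmond_tcp
   command_line   $USER1$/check_tcp -H $HOSTADDRESS$ -p 8649 -t 5
}

define command {
   command_name   check_gmetad_tcp
   command_line   $USER1$/check_tcp -H $HOSTADDRESS$ -p 8651 -t 5
}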
Cron collection jobs, run by your system's periodic scheduling daemon, are another way of collecting metrics without using gmond modules. Because such scripts are extremely heterogeneous and share little common structure, failures in them can go unnoticed and turn into fairly serious collection failures. These can, for the most part, be avoided by following a few basic suggestions (illustrated by the sketch after this list).
Use the logger utility in shell scripts (or any of the variety of syslog submission mechanisms available) so you can see what your scripts are doing, instead of being bombarded by logwatch emails or only noticing when collection for certain metrics stops.
Touch a stamp file to allow other monitoring tools to detect the last run of your script. That way, you can monitor the stamp file for becoming stale in a standard way. Be wary of permissions issues, as test-running a script as a user other than the one who will be running it in production can cause silent failures.
Too many cron jobs are written assuming things like "the network is always available," "a file I'm monitoring exists," or "some third-party dependency will never fail." These assumptions eventually lead to error conditions that either break collection completely or, worse, submit incorrect metrics.
If you're using netcat, telnet, or other network-facing methods to gather metrics data, there is a possibility that they will fail to return data before the next polling period, potentially causing a pile-up or resulting in other nasty behavior. Use common sense to figure out how long you should be waiting for results, then exit gracefully if you haven't gotten them.
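Putting those suggestions together, here is a minimal sketch of a defensive cron collection job; the probe command, metric name, and stamp-file path are all hypothetical:

#!/bin/bash
# collect_room_temp.sh -- hypothetical cron collection job: logs via syslog,
# touches a stamp file on success, and bounds the probe with a timeout.

STAMP=/var/run/collect_room_temp.stamp
METRIC=room_temperature                       #hypothetical metric name

#bound the probe: give up after 30 seconds rather than piling up on the next run
VALUE=$(timeout 30 /usr/local/bin/read_room_temp.sh 2>/dev/null)

if [ -z "$VALUE" ]; then
    logger -t collect_room_temp "probe failed or timed out; no metric submitted"
    exit 1
fi

if ! gmetric --name "$METRIC" --value "$VALUE" --type float --units Fahrenheit; then
    logger -t collect_room_temp "gmetric submission failed"
    exit 1
fi

#record a successful run so other tools can alert if this file goes stale
touch "$STAMP"
logger -t collect_room_temp "submitted $METRIC=$VALUE"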
It can be useful to collect metrics on the backlog and processing statistics for your rrdcached services (if you are using them to speed up your gmetad host). This can be done by querying the rrdcached stats socket and pushing those metrics into Ganglia using gmetric.
Excessive backlogs can be caused by high IO or CPU load on your rrdcached server, so this can be a useful tool to track down rogue cron jobs or other root causes:
#!/bin/bash
# rrdcache-stats.sh
#
# SHOULD BE RUN AS ROOT, OTHERWISE SUDO RULES NEED TO BE PUT IN PLACE
# TO ALLOW THIS SCRIPT, SINCE THE SOCKET IS NOT ACCESSIBLE BY NORMAL
# USERS!

GMETRIC="/usr/bin/gmetric"
RRDSOCK="unix:/var/rrdtool/rrdcached/rrdcached.sock"
EXPIRE=300

( echo "STATS"; sleep 1; echo "QUIT" ) | \
  socat - $RRDSOCK | \
  grep ':' | \
  while read X; do
    K="$( echo "$X" | cut -d: -f1 )"
    V="$( echo "$X" | cut -d: -f2 )"
    $GMETRIC -g rrdcached -t uint32 -n "rrdcached_stat_${K}" -v ${V} -x ${EXPIRE} -d ${EXPIRE}
  done
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. http://ganglia.info/
Ganglia is split into different parts with different functions:

gmond, the monitoring daemon that runs on every node and collects and announces metrics;
gmetad, the aggregation daemon that polls gmond (or other gmetad) sources and stores the data in RRD files;
the web frontend (ganglia-web), which graphs the data collected by gmetad;
gmetric, a command-line tool for publishing custom metrics.
Here we will be installing Ganglia on RHEL / CentOS 5 x86_64, from the EPEL repo. This was the path of least resistance for me. Another option would be to compile everything from source and copy the gmond binary and config file to each node, but I find that less manageable.
Install ganglia's gmetad, gmond & web interface on your head/management node:
# yum install rrdtool ganglia ganglia-gmetad ganglia-gmond ganglia-web httpd php
Ganglia can pass info over regular UDP or multicast. I do not recommend multicast unless you have an exceptionally large cluster and understand how multicast is routed. If you do define mcast in your config, you must add an exception to your iptables firewall by editing /etc/sysconfig/iptables and adding the following before the default REJECT rule:
-A RH-Firewall-1-INPUT -p udp -d 224.0.0.0/4 --dport 1025: -j ACCEPT
-A RH-Firewall-1-INPUT -p 2 -d 224.0.0.0/4 -j ACCEPT
In my simple setup, I have decided to use straight udp on port 8601, so add the following exception to the iptables config, maybe also limiting by subnet:
-A RH-Firewall-1-INPUT -p udp -m udp --dport 8601 -j ACCEPT
Then restart the iptables firewall:
# service iptables restart
In /etc/gmond.conf, define your cluster name and send/receive ports. With this example, 192.168.1.1 is our head node and 8601 the port.
cluster {
  name = "compressor"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
udp_send_channel {
  host = 192.168.1.1
  port = 8601
}
udp_recv_channel {
  port = 8601
  family = inet4
}

and start the service:
# chkconfig gmond on
# service gmond start

Edit /etc/gmetad.conf and define your cluster name as a data_source:
data_source "compressor" 127.0.0.1:8601

and start the service:
# chkconfig gmetad on
# service gmetad start

Start httpd to view the web interface:
# chkconfig httpd on
# service httpd start

Now on all your compute nodes, install gmond:
# yum install ganglia-gmond
# chkconfig gmond on

Make sure to define the cluster name and send/receive ports by copying your /etc/gmond.conf to each node. Then start the service on each node:
# service gmond start

After some data collection, you should be able to open http://localhost/ganglia/ on the head node in a web browser to see the graphs.
You can also graph your own data within ganglia with gmetric. Let's say we have a script called gprobe.sh that returns the temperature of the room. Cron the following to run every 5 or 10 minutes on your management node:
$ gmetric --name temperature --value `gprobe.sh` --type float --units Fahrenheit
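For completeness, a hedged sketch of the corresponding root crontab entry (crontab -e); the script and gmetric paths are assumptions, and the probe runs every 10 minutes:

#m    h  dom mon dow   command
*/10  *  *   *   *     /usr/bin/gmetric --name temperature --value "$(/usr/local/bin/gprobe.sh)" --type float --units Fahrenheit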
Ganglia lets you set up grids (locations) and clusters (groups of servers) for better organization. Thus, you can create a grid composed of all the machines in a remote environment, and then group those machines into smaller sets based on other criteria.
In addition, Ganglia's web interface is optimized for mobile devices and also allows you to export data in .csv and .json formats.

Our test environment will consist of a central CentOS 7 server (IP address 192.168.0.29) where we will install Ganglia, and an Ubuntu 14.04 machine (192.168.0.32), the box that we want to monitor through Ganglia's web interface.
Throughout this guide we will refer to the CentOS 7 system as the master node, and to the Ubuntu box as the monitored machine.
Installing and Configuring Ganglia
To install the monitoring utilities on the master node, follow these steps:
1. Enable the EPEL repository and then install Ganglia and related utilities from there:
# yum update && yum install epel-release
# yum install ganglia rrdtool ganglia-gmetad ganglia-gmond ganglia-web

The packages installed in the step above, along with ganglia (the application itself), perform the following functions:
rrdtool: the Round-Robin Database, a tool used to store and display the variation of data over time using graphs.
ganglia-gmetad: the daemon that collects monitoring data from the hosts that you want to monitor. On those hosts and on the master node it is also necessary to install ganglia-gmond (the monitoring daemon itself).
ganglia-web: provides the web frontend where we will view the historical graphs and data about the monitored systems.

2. Set up authentication for the Ganglia web interface (/usr/share/ganglia). We will use basic authentication as provided by Apache.
If you want to explore more advanced security mechanisms, refer to the Authorization and Authentication section of the Apache docs.
To accomplish this goal, create a username and assign it a password to access a resource protected by Apache. In this example, we will create a username called adminganglia and assign a password of our choosing, which will be stored in /etc/httpd/auth.basic (feel free to choose another directory and/or file name; as long as Apache has read permissions on those resources, you will be fine):

# htpasswd -c /etc/httpd/auth.basic adminganglia

Enter the password for adminganglia twice before proceeding.
3. Modify /etc/httpd/conf.d/ganglia.conf as follows:
Alias /ganglia /usr/share/ganglia

<Location /ganglia>
   AuthType basic
   AuthName "Ganglia web UI"
   AuthBasicProvider file
   AuthUserFile "/etc/httpd/auth.basic"
   Require user adminganglia
</Location>

4. Edit /etc/ganglia/gmetad.conf:
First, use the gridname directive followed by a descriptive name for the grid you're setting up:
gridname "Home office"

Then, use data_source followed by a descriptive name for the cluster (group of servers), a polling interval in seconds, and the IP addresses of the master and monitored nodes:
data_source "Labs" 60 192.168.0.29:8649  # Master node
data_source "Labs" 60 192.168.0.32       # Monitored node

5. Edit /etc/ganglia/gmond.conf.
a) Make sure the cluster block looks as follows:
cluster {
  name = "Labs"            # The name in the data_source directive in gmetad.conf
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}

b) In the udp_send_channel block, comment out the mcast_join directive:
udp_send_channel {
  #mcast_join = 239.2.11.71
  host = localhost
  port = 8649
  ttl = 1
}

c) Finally, comment out the mcast_join and bind directives in the udp_recv_channel block:
udp_recv_channel {
  #mcast_join = 239.2.11.71   ## commented out
  port = 8649
  #bind = 239.2.11.71         ## commented out
}

Save the changes and exit.
6. Open port 8649/udp and allow PHP scripts (run via Apache) to connect to the network using the necessary SELinux boolean:
# firewall-cmd --add-port=8649/udp
# firewall-cmd --add-port=8649/udp --permanent
# setsebool -P httpd_can_network_connect 1

7. Restart Apache, gmetad, and gmond. Also, make sure they are enabled to start on boot:
# systemctl restart httpd gmetad gmond
# systemctl enable httpd gmetad gmond

At this point, you should be able to open the Ganglia web interface at http://192.168.0.29/ganglia and log in with the credentials from Step 2.

8. On the Ubuntu host, we will only install ganglia-monitor, the equivalent of ganglia-gmond in CentOS:
$ sudo aptitude update && sudo aptitude install ganglia-monitor

9. Edit the /etc/ganglia/gmond.conf file on the monitored box. This should be identical to the same file on the master node, except that the lines commented out in the cluster, udp_send_channel, and udp_recv_channel blocks should be enabled (uncommented):
cluster {
  name = "Labs"            # The name in the data_source directive in gmetad.conf
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}

udp_send_channel {
  mcast_join = 239.2.11.71
  host = localhost
  port = 8649
  ttl = 1
}

udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}

Then, restart the service:
$ sudo service ganglia-monitor restart

10. Refresh the web interface and you should be able to view the statistics and graphs for both hosts inside the Home office grid / Labs cluster (use the dropdown menu next to Home office grid to choose a cluster, Labs in our case).
Using the menu tabs you can access lots of interesting information about each server, individually and in groups. You can even compare the stats of all the servers in a cluster side by side using the Compare Hosts tab: simply choose a group of servers using a regular expression and you will see a quick comparison of how they are performing.

One of the features I personally find most appealing is the mobile-friendly summary, which you can access using the Mobile tab. Choose the cluster you're interested in and then the individual host.
Summary
In this article we have introduced Ganglia, a powerful and scalable monitoring solution for grids and clusters of servers. Feel free to install, explore, and play around with Ganglia as much as you like (by the way, you can even try out Ganglia in the demo provided on the project's official website).
While you're at it, you will also discover that several well-known companies, both inside and outside the IT world, use Ganglia. There are plenty of good reasons for that besides the ones we have shared in this article, with ease of use and informative graphs and stats probably at the top of the list.
But don't just take our word for it, try it out yourself and don't hesitate to drop us a line using the comment form below if you have any questions.
Ganglia Monitoring Tool Installation & Configuration Guide
You may know that Ganglia is an open source monitoring tool for High Performance Computing (HPC) that works over multicast by default, but it can also be used to monitor a heterogeneous Unix environment. Here is the procedure to install and configure the tool to monitor Linux and IBM AIX servers that are interconnected across different network subnets.
Download Packages:
• For Linux:
wget http://download.fedora.redhat.com/pub/epel/5/x86_64/ganglia-gmetad-3.0.7-1.el5.x86_64.rpm
wget http://download.fedora.redhat.com/pub/epel/5/x86_64/ganglia-gmond-3.0.7-1.el5.x86_64.rpm
wget http://download.fedora.redhat.com/pub/epel/5/x86_64/ganglia-web-3.0.7-1.el5.x86_64.rpm
wget http://download.fedora.redhat.com/pub/epel/5/x86_64/ganglia-3.0.7-1.el5.x86_64.rpm

• For AIX:
wget http://www.oss4aix.org/download/ganglia/RPMs-3.0.7/aix53/ganglia-gmond-3.0.7-1.aix5.3.ppc.rpm
Installation on Master Node:
Select a Linux server as the master node for the tool and install all four Ganglia RPMs for Linux on it.
rpm -ivh ganglia-3.0.7-1.el5.x86_64.rpm
rpm -ivh ganglia-gmond-3.0.7-1.el5.x86_64.rpm
rpm -ivh ganglia-gmetad-3.0.7-1.el5.x86_64.rpm
rpm -ivh ganglia-web-3.0.7-1.el5.x86_64.rpm

Configuration Files Location:
S.No  File Description          Location
1     Gmond configuration       /etc/gmond.conf
2     Gmetad configuration      /etc/gmetad.conf
3     rrd file storage          /var/lib/ganglia/rrds/
4     Web files                 /usr/share/ganglia/
5     Ganglia's web conf file   /etc/httpd/conf.d/ganglia.conf

Configuration on Master Node:
1. Edit /etc/gmond.conf file
a. In the cluster tag, modify as shown below:
cluster {
name = "Cluster Name"
owner = "IT team"
latlong = "unspecified"
url = "unspecified"
}

b. In the udp_send_channel tag, add the master node's IP address, which will communicate with your LAN/WAN:
udp_send_channel {
mcast_join = IP Address of Master Node
port = 8649
}

c. Save & close the file.
2. Start the gmond daemon
/etc/init.d/gmond start
3. Run the following command to start the service automatically when the system reboots
chkconfig gmond on
4. Edit /etc/gmetad.conf
a. Add Grid name
gridname "Grid Name"
b. Add the datasource as follows
data_source "Cluster Name" IP Address of Master Node
c. Save & close the file
5. Start the gmetad daemon
/etc/init.d/gmetad start
6. Run the following command to start the service automatically when the system reboots
chkconfig gmetad on
7. Web Server configuration
Upon installation of ganglia-web-3.0.7-1.el5.x86_64.rpm, the ganglia.conf file is placed in the /etc/httpd/conf.d folder automatically.
Now the web service needs to be restarted to access the Ganglia pages:
service httpd reload
Installation on client nodes:
Install ganglia & ganglia-gmond rpms
rpm -ivh ganglia-3.0.7-1.el5.x86_64.rpm
rpm -ivh ganglia-gmond-3.0.7-1.el5.x86_64.rpm

Configuration on Client Nodes:
1. Edit /etc/gmond.conf file
a. Modify the cluster tag like this:
cluster {
name = "Cluster Name"
owner = "IT team"
latlong = "unspecified"
url = "unspecified"
}

b. In the udp_send_channel tag, add the master node's IP address:
udp_send_channel {
mcast_join = IP Address of Master Node
port = 8649
}

c. The udp_recv_channel tag should look like this:
udp_recv_channel {
port = 8649
}

d. Save & close the file.
2. Start the gmond daemon
/etc/init.d/gmond start
3. Run the following command to start the service automatically when the system reboots
chkconfig gmond on
Repeat the above three steps on all other client nodes.
Installation on Cluster Node (AIX):
Install only ganglia-gmond rpm for AIX 5.3
rpm -ivh ganglia-gmond-3.0.7-1.aix5.3.ppc.rpm
Configuration on Cluster Nodes (AIX)
1. Edit /etc/gmond.conf file
a. Modify the cluster tag like this:
cluster {
name = "Cluster Name"
owner = "IT team"
latlong = "unspecified"
url = "unspecified"
}

b. In the udp_send_channel tag, add the master node's IP address:
udp_send_channel {
mcast_join = IP Address of Master Node
port = 8649
}

c. The udp_recv_channel tag should look like this:
udp_recv_channel {
port = 8649
}

d. Save & close the file.
2. Start the gmond daemon
/etc/rc.d/init.d/gmond start
Now your Ganglia tool is ready for monitoring. Open the web URL http://master-server-ip/ganglia to monitor the configured servers.