Image Credits : xmodulo
So what is it that needs to be really monitored? It really depends from one organization to other. For a web development platform, response time of the page may matter a lot. The kind of traffic, 4xx's 5xx's could be a concern too. Disk Space, CPU, Memory, Swap space, particular processes and services running, DB server replication, read writes, no. of connections, query execution time and many more parameters together. Most of these checks are required by all organizations. Out of the many monitoring tools out there, one of the most used is Nagios.
Nagios is an open source software application that helps in monitoring systems, network and Infrastructure. Nagios is on top of the Linux and hence, whatever you could do with Linux could also be done with Nagios. The best part of using Nagios is the plugin based architecture and 100's and 1000's of plugins that it supports to literally allow you to monitor anything.
Nagios comes with multiple notable features that makes it distinguishing. It uses the standard protocols i.e TCP, UDP, ICMP for monitoring servers across network. You can perform multiple resource checks on any host using the NRPE addon, the checks varies from CPU, Disk RAM and many more. Not just resource checks, you could also add event handlers that perform certain actions when certain events are noticed. Checks are performed at the specified intervals, by default the interval is 5 minutes. There are 2 types of checks, Active - The one that are nags initiated. Passive - The one that are initiated externally.
Nagios consists of various objects that needs to be defined and used.
- Hosts : Hosts are the systems/ servers that need to be monitored in the infrastructure. Nagios also provides the facility to group set of hosts together to give a better monitoring experience. Say you can group all web servers together in a "WebServers" host group. Typically a host definition may look like : "define host{
use linux-boxhost_name test_hostalias CentOS 6address 5.175.142.66}"
- Services : Services are the checks that needs to be performed. There are a wide range of service checks that can be performed on any host. Just like host group, service checks can also be grouped together. E.g you may need to check the CPU utilization of all servers together, you may group it that way. A service definition may look like : "define service{
use generic-servicehost_name test_hostservice_description CPU Loadcheck_command check_nrpe!check_load}"
- Contacts : Contacts are the people who need to be contacted if a notification needs to be sent for any event that occurs. You can configure contacts to send emails, samosas, or even custom messages to any service that allows messaging. Contacts can also be grouped together into a contact group. E.g there is a notification about come process getting shut down on QA server that the Admin may not necessarily be bothered about, in such a case the notification can only be sent to QA group. A contact definition will look like : "[define contact{
name generic-contactservice_notification_period 24x7host_notification_period 24x7service_notification_options w,u,c,r,f,shost_notification_options d,u,r,f,sservice_notification_commands notify-service-by-emailhost_notification_commands notify-host-by-emailregister 0
}" - Commands : Commands define the exact command that will be executed on the remote hosts while executing a particular check. These are the simplest way to get particular check executed, you may also pass bash commands to perform any particular check. A command definition may look like : "define command{
command_name check_nrpecommand_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$}"
- Time Period : If a downtime is scheduled at a particular time regularly and you don't want Nagios to send you any alert at these hours, you can achieve this by adding a time period definition. This looks like : "define timeperiod{
timeperiod_name 24x7-except-night-12-2alias 24x7 Except 00:00 - 02:00sunday 02:00-23:59monday 02:00-23:59tuesday 02:00-23:59wednesday 02:00-23:59thursday 02:00-23:59friday 02:00-23:59saturday 02:00-23:59}"
You can also set a monitoring schedule for a particular object if you do not want to add it to the existing service/hosts check. This allows you to explicitly look at a particular check.
Sometimes writing the definition can become a real pain using the same definition for all services and hosts can be a real pain even if you decide to copy-paste the definitions. Templates come for help here. You can define a template with all the necessary details of definition and simply use the same template everywhere in the configs. A typical template definition look like :
define host{
name generic-host
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
notification_period 24x7
register 0
}
define contact{
name generic-contact
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r,f,s
host_notification_options d,u,r,f,s
service_notification_commands notify-service-by-email
host_notification_commands notify-host-by-email
register 0
}
Monitoring in Nagios is parallel, i.e a number of hosts and service checks will go simultaneously in parallel. This could be resource consuming but this is always better than sequential monitoring as you can be sure that all your servers are doing well and don't have to wait too long for any kind of update. The add ons for Nagios are simple to make and add to the Nagios community. The configs are all split and simple to understand too. Nagios has a huge documentation and help examples for quickly getting started.
Happy Monitoring!!