Saturday 10 October 2015

Infrastructure Monitoring with Nagios


Image Credits : xmodulo

Server management is a real pain, and the pain keeps getting worse as more and more servers get added to the infrastructure. So how do organizations cope with huge server farms and datacenters in place? How can super admins promise an SLA of 99.99% uptime with very low response and resolution times? Quite obviously, the answer is server monitoring solutions. It would be terribly tedious for a human to monitor servers 24x7, especially when most of the systems are stable and manual intervention is needed only once in a while.

So what is it that really needs to be monitored? It varies from one organization to another. For a web development platform, the response time of a page may matter a lot. The kind of traffic, the 4xx's and 5xx's, could be a concern too. Disk space, CPU, memory, swap space, particular processes and services, DB server replication, reads/writes, the number of connections, query execution times and many more parameters all matter together. Most of these checks are required by all organizations. Out of the many monitoring tools out there, one of the most used is Nagios.

Nagios is an open source software application that helps in monitoring systems, networks and infrastructure. Nagios sits on top of Linux, and hence whatever you can do with Linux can also be done with Nagios. The best part of using Nagios is its plugin-based architecture and the hundreds and thousands of plugins it supports, which literally allow you to monitor anything.

Nagios comes with multiple notable features that make it stand out. It uses the standard protocols, i.e. TCP, UDP and ICMP, for monitoring servers across the network. You can perform multiple resource checks on any host using the NRPE addon; the checks vary from CPU and disk to RAM and many more. Not just resource checks, you can also add event handlers that perform certain actions when certain events are noticed. Checks are performed at specified intervals, 5 minutes by default. There are 2 types of checks: Active, the ones that are Nagios-initiated, and Passive, the ones that are initiated externally.
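
To test a check by hand before wiring it into the configs, you can invoke the NRPE plugin directly. A minimal sketch, assuming the default source-install plugin path and reusing the host address from the example below :

/usr/local/nagios/libexec/check_nrpe -H 5.175.142.66 -c check_load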

Nagios consists of various objects that need to be defined and used.

  1. Hosts : Hosts are the systems/servers that need to be monitored in the infrastructure. Nagios also provides the facility to group a set of hosts together for a better monitoring experience; say, you can group all web servers together in a "WebServers" host group (a sketch of this follows the list). Typically a host definition may look like : "define host{
    use                             linux-box 
    host_name                       test_host 
    alias                           CentOS 6 
    address                         5.175.142.66 
    }"
  2. Services : Services are the checks that need to be performed. There is a wide range of service checks that can be performed on any host. Just like host groups, service checks can also be grouped together; e.g. if you need to check the CPU utilization of all servers together, you may group them that way. A service definition may look like : "define service{
            use                     generic-service
            host_name               test_host
            service_description     CPU Load
            check_command           check_nrpe!check_load 
            }"
  3. Contacts : Contacts are the people who need to be contacted when a notification has to be sent for any event that occurs. You can configure contacts to receive emails, SMSes, or even custom messages on any service that allows messaging. Contacts can also be grouped together into a contact group; e.g. a notification about some process getting shut down on a QA server may not necessarily bother the Admin, and in such a case the notification can be sent only to the QA group. A contact definition will look like : "define contact{
            name                            generic-contact
            service_notification_period     24x7
            host_notification_period        24x7
            service_notification_options    w,u,c,r,f,s
            host_notification_options       d,u,r,f,s
            service_notification_commands   notify-service-by-email
            host_notification_commands      notify-host-by-email
            register                        0
            }"
  4. Commands : Commands define the exact command that will be executed on the remote hosts while performing a particular check. These are the simplest way to get a particular check executed; you may also pass bash commands to perform any particular check. A command definition may look like : "define command{
            command_name check_nrpe
            command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
            }"
  5. Time Period : If a downtime is scheduled regularly at a particular time and you don't want Nagios to send you any alerts during those hours, you can achieve this by adding a time period definition. This looks like : "define timeperiod{
            timeperiod_name 24x7-except-night-12-2
            alias           24x7 Except 00:00 - 02:00
            sunday          02:00-23:59
            monday          02:00-23:59
            tuesday         02:00-23:59
            wednesday       02:00-23:59
            thursday        02:00-23:59
            friday          02:00-23:59
            saturday        02:00-23:59
    }"
You can also set a monitoring schedule for a particular object if you do not want it to follow the existing service/host checks. This lets you control exactly when a particular check runs.
Writing the same definition for every service and host can become a real pain, even if you decide to copy-paste the definitions. Templates come to help here. You can define a template with all the necessary details of a definition and simply use the same template everywhere in the configs. A typical template definition looks like :
define host{
        name                            generic-host    
        notifications_enabled           1               
        event_handler_enabled           1               
        flap_detection_enabled          1               
        process_perf_data               1               
        retain_status_information       1               
        retain_nonstatus_information    1               
        notification_period             24x7            
        register                        0               
        }

define contact{
        name                            generic-contact         
        service_notification_period     24x7                    
        host_notification_period        24x7                    
        service_notification_options    w,u,c,r,f,s             
        host_notification_options       d,u,r,f,s               
        service_notification_commands   notify-service-by-email 
        host_notification_commands      notify-host-by-email    
        register                        0                        
        }
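
The register 0 line marks these as templates rather than real monitored objects. Any object can then inherit the whole definition through the use directive; for instance, the host from the first example could simply be written as :

define host{
        use                             generic-host
        host_name                       test_host
        alias                           CentOS 6
        address                         5.175.142.66
        }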

Monitoring in Nagios is parallel, i.e. a number of host and service checks run simultaneously. This can be resource-consuming, but it is always better than sequential monitoring, as you can be sure that all your servers are doing well and don't have to wait too long for any kind of update. Add-ons for Nagios are simple to build and contribute to the Nagios community. The configs are all split up and simple to understand too. Nagios has huge documentation and help examples for quickly getting started.

Happy Monitoring!!

Tuesday 18 August 2015

Software Configuration Management System

Picture credits : Paul Downey

Any application would generally consist of web servers, application servers, Memcache systems, SQL and NoSQL database servers, load balancers, messaging queues, etc. Although this is pretty much enough, as a precaution we also ensure proper redundancies so that whenever there is a failure we have a backup plan in place to handle it. In order to keep track of server performance we also have logging servers, analytics servers and monitoring servers in place. All these servers need to be available again within no time in case something goes wrong (which it does).

In traditional systems the admin guy managed all this by managing the wiring of the servers, SSHing into them and maintaining them throughout. There was nothing wrong with the idea except the time taken to get the process done. When something went wrong, he would get into that machine and spend hours finding out what went wrong and correcting it, after declaring a good long downtime. With a configuration management (CM) system in place, we instead describe a state for a server and use a tool that just ensures the server stays in that state throughout. The CM system ensures that the right packages are installed, that config files have correct values and permissions, that the expected services are running on the host system, and much more.
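
As a small illustration of this declarative style, here is a hypothetical sketch using Ansible ad-hoc commands (Ansible being one of the tools mentioned below); the "webservers" group and the httpd package are assumed names :

# Ensure the right package is installed and the expected service is running
ansible webservers -m yum -a "name=httpd state=present"
ansible webservers -m service -a "name=httpd state=started"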

Software deployment is another concern that a DevOps person has to take care of, and it is at times addressed by CM tools too, although that may not always be considered good practice. Deployment is the process where the software written/developed by a company is built/compiled/processed, the required binaries, static files and other necessary files are copied to the server, and the expected services are started as well. This was mostly done using some scripting language, and now we have deployment-specific tools that have their own advantages over scripting languages, rollback being an important one. Capistrano and Fabric are famous ones.
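
With Capistrano, for instance, both the deploy and the rollback are single commands (the "production" stage name is an assumption for this sketch) :

cap production deploy            # build and ship the latest release
cap production deploy:rollback   # fall back to the previous release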

Many a time the deployment process involves multiple remote servers. In complex environments, the order of execution of tasks plays an important role in the deployment process. A deployment may fail if an expected event occurs before another: e.g. the database server needs to be up and running before the web server is brought up. Or, in a high availability environment, servers need to first be taken out of the load balancer one by one before deployment and later added back to the load balancer after a successful deployment. This automated arrangement, coordination and management of complex systems is called orchestration.
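
A rough shell sketch of that rolling pattern, where lb-ctl and deploy.sh are hypothetical stand-ins for the real load balancer API and deploy job :

for host in web01 web02 web03; do
    lb-ctl remove "$host"               # take the server out of rotation
    ssh "$host" 'bash -s' < deploy.sh   # deploy while it serves no traffic
    lb-ctl add "$host"                  # add it back after a successful deploy
done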

With a bunch of IaaS providers in the cloud market, virtualization has picked up huge pace. The evaluation of any new CM tool that comes to the IT world is largely done based on the number of cloud providers it supports. An important feature of a CM tool is provisioning, the process of spinning up servers on a cloud provider automatically. Many CM tool providers have plugins written to communicate with various cloud providers. Chef, Ansible, Puppet, CFEngine and Salt have already become favorites for many out there.

I have personally used Ansible and Chef as of now. Cloud is fun indeed .. :)

Thursday 23 April 2015

s3cmd to push large files greater than 5GB to Amazon S3

image credits:  Stefano Bertolo

Use the s3cmd command line utility to push files to Amazon S3.

Install s3cmd from s3tools.org or
apt-get install s3cmd OR yum install s3cmd

Configure s3cmd by
vim ~/.s3cfg

<Paste the following into it and add your access-key and secret-key>

[default]
access_key = YOUR-ACCESS-KEY-HERE
access_token = 
add_encoding_exts = 
add_headers = 
bucket_location = US
cache_file = 
cloudfront_host = cloudfront.amazonaws.com
default_mime_type = binary/octet-stream
delay_updates = False
delete_after = False
delete_after_fetch = False
delete_removed = False
dry_run = False
enable_multipart = True
encoding = UTF-8
encrypt = False
expiry_date = 
expiry_days = 
expiry_prefix = 
follow_symlinks = False
force = False
get_continue = False
gpg_command = /usr/bin/gpg
gpg_decrypt = %(gpg_command)s -d --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_encrypt = %(gpg_command)s -c --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_passphrase = 
guess_mime_type = True
host_base = s3.amazonaws.com
host_bucket = %(bucket)s.s3.amazonaws.com
human_readable_sizes = False
ignore_failed_copy = False
invalidate_default_index_on_cf = False
invalidate_default_index_root_on_cf = True
invalidate_on_cf = False
list_md5 = False
log_target_prefix = 
max_delete = -1
mime_type = 
multipart_chunk_size_mb = 15
preserve_attrs = True
progress_meter = True
proxy_host = 
proxy_port = 0
put_continue = False
recursive = False
recv_chunk = 4096
reduced_redundancy = False
restore_days = 1
secret_key = YOUR-SECRET-KEY-HERE
send_chunk = 4096
server_side_encryption = False
simpledb_host = sdb.amazonaws.com
skip_existing = False
socket_timeout = 300
urlencoding_mode = normal
use_https = True
use_mime_magic = True
verbosity = WARNING
website_endpoint = http://%(bucket)s.s3-website-%(location)s.amazonaws.com/
website_error = 
website_index = index.html

You can see how to use s3cmd at: http://s3tools.org/usage
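
A few of the basics from that page, reusing the bucket from the example below (access.log is just a placeholder file name) :

s3cmd mb s3://apache-logs                 # make a bucket
s3cmd put access.log s3://apache-logs/    # upload a file
s3cmd ls s3://apache-logs/                # list the bucket contents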

Here I came across a typical scenario where I could not upload files greater than 5GB (Amazon S3 caps a single PUT at 5GB, so anything bigger has to go via multipart upload). You could handle this in two ways:


  1. Using the --multipart-chunk-size-mb flag: s3cmd put --multipart-chunk-size-mb=4096 201412.tar.gz s3://apache-logs/. I could not do this since I had an older version of s3cmd installed and did not really have time to download and install the newer version.
  2. Splitting the large file into smaller files using the split command and then uploading them.
  • Original file
-rw-r--r--. 1 root root 5.4G Jan 20 06:54 201412.tar.gz
  • Split Command
split -b 3G 201412.tar.gz "201412"
  • Post Split

-rw-r--r--. 1 root root 3.0G Apr 23 06:41 201412aa
-rw-r--r--. 1 root root 2.4G Apr 23 06:43 201412ab

  • Upload the files

s3cmd put 201412* s3://apache-logs/
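
To restore the original archive later, fetch the parts back and stitch them together in order :

s3cmd get s3://apache-logs/201412aa
s3cmd get s3://apache-logs/201412ab
cat 201412aa 201412ab > 201412.tar.gz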

Saved some time :)