Vigilance Computer Applications Service Availability Monitoring.
Proactive server, applications, Frame Relay, WAN and network monitoring.
Home Based in Portland, Oregon and Seattle, Washington.
Monitoring 101 - The Basics
Looking for an expert, professional, NOC Design Engineer to design
and build your Network Operations Center initiative?
For most companies it is no longer acceptable to learn they have a computer
or network problem by receiving a call from one of their customers. In fact,
companies that view their computer operations as being mission critical want
to be warned well in advance so they can take corrective action before a problem
impacts customers or employees. Any IT organization that is charged with
meeting Service Level Agreements (SLAs) for its customers is going to
require some type of professional monitoring to meet those objectives.
Today's monitoring technologies can solve these problems if you know what
tools to employ and how to utilize them properly. This page discusses
terminology and monitoring capabilities that can be used to lower your
computer service delivery costs while improving performance and reliability.
Standard terminology: Some service providers use the term "7x24x365" to describe their
monitoring. Frequently what this means is that there is a computer
sitting in the corner sending out ping packets every 5-10 minutes.
Others claim to have "professionally staffed" monitoring when in fact
their monitoring software is sending alerts to a person who is carrying
the "duty pager". Proactive? In many cases, this means paging you about
something fairly meaningless such as a disk being 90% full or about CPU
usage momentarily running high. This is quite different from knowing
what is causing the disk to fill or what processes are causing the high
CPU usage. Do you really want a 3 AM page just because the CPU went to
90% for 60 seconds? You'll need to ask the right questions to determine
whether 7x24x365 professionally staffed proactive monitoring means the
same thing to a prospective MSP as it does to you and us.
A well implemented monitoring service should pay for itself by lowering
your costs and improving productivity.
Basic Ping and Port Check Monitoring
The most basic level of monitoring tells you if something is up or
down. Ping monitoring confirms whether an IP address is alive and port monitoring
confirms whether a specific port at an IP address is responding. URL
monitoring verifies that a specific web site page is reachable. These
tests do not measure performance. They merely tell you if something is
broken. This type of monitoring provides no advance warning, so the
alerts it generates will be reactive in nature. Most ping/port check
tools run unattended. They send out pages and e-mails if they do
not receive a reply from the host being tested within a certain amount of
time. It is not uncommon for these types of tools to send out frequent
false alarms because network latency or other factors delay the response
from the host being tested.
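To make the port check and the false alarm problem concrete, here is a minimal sketch in Python. The function names, the retry count and the timeouts are our own illustrative assumptions, not taken from any particular product. It reports a host as down only after several consecutive failed connection attempts, which is one simple way to avoid alerting on a single delayed reply.

```python
import socket
import time


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_with_retries(host, port, attempts=3, delay=1.0, timeout=3.0):
    """Report 'down' (False) only after several consecutive failures.

    Retrying before alerting reduces false alarms caused by one slow
    or lost reply. The defaults here are illustrative assumptions.
    """
    for attempt in range(attempts):
        if port_is_open(host, port, timeout=timeout):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Even with retries, a check like this still only tells you that something is already broken.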
There are hundreds if not thousands of people offering this service.
For most, this is a part time business which runs on a PC in
their garage. Typically, ping/port checking is implemented by running
free or nearly free open source software. These companies will claim to
provide 7x24x365 monitoring but will really just be pinging and port
checking your server once every 5-10 minutes and will spray out e-mails
and pages if their PC doesn't receive a response in time.
Most of these companies will have some bell or whistle that gives the
impression that they are doing more than ping & port checking.
However, even with an extra feature or two, this type of
service is still very low end and entry level.
The fee for this type of monitoring typically runs in the
$100-$150 per month per server range. You can save some money if you are
willing to have your server tested less often (e.g. one ping every 20
minutes). Personally, we think you should save even more money and just
let your customers call you when your server goes down.
Server Performance Monitoring
This is the next level of monitoring and is where the majority of
monitoring services end. The methodology to perform this monitoring
usually depends on tools included in the server's operating system. The items being monitored
would include such things as CPU usage, server load, disk utilization,
memory usage, and entries in selected log files. This monitoring can provide
some advance warning about impending system problems if the thresholds for the
alerts are set properly. For example, if you set the upper alert level
for disk usage at 90% you should receive an alarm in time to take action
before the disk becomes full. Of course these types of tools have no way
of knowing what is causing the disk to fill or how fast disk space is
being consumed. Like low end monitoring, this service
almost always runs unattended on "at home PCs" and sprays out e-mails and pages to a list
when a problem is detected.
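The gap between a simple threshold and knowing how fast a disk is filling can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names are hypothetical, and the estimate assumes that usage grows at a steady (linear) rate between two samples.

```python
import shutil


def percent_used(path="."):
    """Return the percentage of space in use on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def hours_until_full(pct_then, pct_now, hours_between):
    """Estimate hours until the disk is full from two usage samples.

    Assumes linear growth. Returns None when usage is flat or shrinking.
    """
    rate = (pct_now - pct_then) / hours_between  # percentage points per hour
    if rate <= 0:
        return None
    return (100.0 - pct_now) / rate
```

For example, a disk that went from 80% to 90% over five hours would be estimated to fill in about five more hours, which is a very different situation from a disk that has been sitting at 90% for a month.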
Many of the people in this business claim that they are providing
7x24x365 "proactive monitoring". We'll leave it to you to decide
whether that's true or not.
SNMP Monitoring
Almost every piece of equipment in a company network can be
monitored via SNMP polling. This methodology uses device specific Management Information
Bases (MIBs) to obtain additional information about a device's health.
This information is collected at regular polling intervals. SNMP
enabled devices can also be programmed to send SNMP trap information to
message handlers in real time. If your servers and network are
truly mission critical this is an increased level of monitoring that can be
quite important. Building and maintaining an SNMP monitoring
environment is a non-trivial undertaking. IT Managers will need to
purchase expensive Network Node Manager type software and commit internal resources to
bring this capability in-house or will need to outsource this to a
professional monitoring company.
Almost none of the "7x24x365 monitoring" companies you'll find in a Google
search will provide meaningful SNMP monitoring.
Applications Monitoring
Many monitoring tools can tell you if an application is alive or dead,
but not many monitor the actual health and well being of applications.
An application can be alive but performing so slowly
that customers or employees can't use it. For example, this type of
monitoring might be used to measure the speed of shopping cart
transactions for ecommerce companies or to determine how fast server based
applications respond to employees. An important applications monitoring
feature is the ability to monitor log files and to understand the meaning of
messages placed there by the application.
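The difference between "alive" and "usable" can be illustrated with a small timing wrapper. This is a sketch only: `timed_call` and the two second threshold are hypothetical, and `transaction` stands for whatever test action you choose to run against your own application (a shopping cart request, a database query and so on).

```python
import time


def timed_call(transaction, slow_after=2.0):
    """Run transaction() and classify the result.

    Returns (status, elapsed_seconds) where status is 'ok', 'slow'
    or 'failed'. The slow_after threshold is an assumed example value.
    """
    start = time.monotonic()
    try:
        transaction()
    except Exception:
        return "failed", time.monotonic() - start
    elapsed = time.monotonic() - start
    return ("slow" if elapsed > slow_after else "ok"), elapsed
```

A result of "slow" is exactly the case that up/down monitoring misses: the application answered, but not fast enough for anyone to use it.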
Bandwidth Monitoring
The high cost of telecom WAN pipelines makes bandwidth monitoring a
necessity. It provides a way to optimize the capacity of your circuits, identify bottlenecks,
plan future needs, verify bills, and eliminate illegal usage.
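Circuit utilization is usually derived from two readings of an interface octet counter taken some time apart. The sketch below shows the arithmetic under our own assumptions: the function name is hypothetical, the counter values are presumed to come from SNMP polling, and a 32 bit counter is assumed, which wraps back to zero when it overflows.

```python
def utilization_percent(octets_then, octets_now, seconds, link_bps, counter_bits=32):
    """Return average link utilization (percent) between two counter samples.

    Octet counters wrap around at 2**counter_bits, so the difference is
    taken modulo that value. This handles at most one wrap per interval.
    """
    delta_octets = (octets_now - octets_then) % (2 ** counter_bits)
    bits_per_second = delta_octets * 8 / seconds
    return 100.0 * bits_per_second / link_bps
```

As an example, 28,950,000 octets transferred in a 300 second interval on a 1.544 Mbps T1 circuit works out to 50% utilization. On fast circuits a 32 bit counter can wrap more than once between polls, so the polling interval needs to be short enough for the link speed.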
Agent Based Monitoring
Agent based monitoring is a requirement for IT Managers who are serious about
operating system and applications monitoring. It is
unlikely that even two 9's SLAs (99% uptime) can be reached
without employing this type of technology. Agents are
essentially daemons or processes that run on the server
and provide "hooks" for other pieces of code
that do much more precise levels of monitoring than
previously discussed. For example, a monitoring
software agent would usually operate with a Smart Plug In (SPI)
or Knowledge Module (KM) that was designed to monitor
specific operating systems and applications such as Solaris
and Oracle or IIS and SQL. Agents may operate
independently but more often they also communicate with
server consoles or enterprise managers located in a Network
Operations Center (NOC). In this configuration, a NOC
Tech can verify and troubleshoot equipment in real time as
well as receive and view asynchronous messages and alarms
from server agents.
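To put the "two 9's" figure mentioned above in perspective, the downtime budget behind an uptime percentage is simple arithmetic. The helper below is our own illustration, assuming a 365 day year.

```python
def allowed_downtime_hours(uptime_percent, period_hours=365 * 24):
    """Return the downtime budget, in hours, for an uptime target over a period."""
    return period_hours * (100.0 - uptime_percent) / 100.0
```

A 99% uptime SLA allows about 87.6 hours of downtime per year (roughly 3.65 days), while 99.9% allows only about 8.76 hours.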
Local vs Remote Monitoring
Your company firewall can prevent a remote monitoring company from having
access to the information needed for availability, performance and bandwidth
testing. Outsourced monitoring often requires the creation of a VPN
between your data center and the monitoring company. Another option is
to choose a monitoring company that places a monitoring appliance inside
your data center, behind your firewall. A third choice is to select a
monitoring company that utilizes software tools that have been specifically
designed to safely monitor devices from untrusted Internet space.
Obviously, security is a huge concern any time you allow anyone access to
your computing equipment. Make sure your MSP is not going to expose
your network to any type of security risk!
Reactive vs Proactive Monitoring
One of the primary purposes of monitoring is to warn you of problems so
you can take corrective action in a timely manner. Wherever possible you
want to implement monitoring that will give you as much advance warning as
possible before a problem impacts employees or customers. Ping and port check monitoring can
only inform you after an outage has occurred. Most of the other monitoring
methods discussed here can be configured in a way that will give you at least
some advance warning so you can take corrective action before an outage occurs. For
maximum proactive coverage, having actual Technicians viewing a monitoring
console in a NOC is going to be a requirement. This is really the only
way to get past two 9's SLAs. Even the best monitoring software is
going to generate false alarms or miss important events if implemented to
run unattended. For example, a disk drive could be failing and
generating error messages in the system log file, obviously an
important indication that some maintenance is going to be required very
soon. However, creating an alarm for every instance of the word
"error" in the log would be impractical since many unimportant
things also generate error messages. Having a Technician review
these log messages as they come in to make sure that nothing important is
missed is an important monitoring feature.
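One way to picture the log file problem is a three way triage: alarm on a short list of known serious messages, ignore known noise, and queue anything else containing "error" for a person to review. The sketch below is illustrative only. The pattern lists are hypothetical examples and would need to be tuned to the messages your own systems actually produce.

```python
import re

# Hypothetical patterns: replace with messages seen on your own systems.
URGENT = [re.compile(p, re.IGNORECASE) for p in (
    r"i/o error",
    r"disk failure",
    r"bad sector",
)]
NOISE = [re.compile(p, re.IGNORECASE) for p in (
    r"\b0 errors\b",
    r"error_log rotated",
)]


def triage(line):
    """Classify a log line as 'alarm', 'review' or 'ignore'."""
    if any(p.search(line) for p in URGENT):
        return "alarm"
    if any(p.search(line) for p in NOISE):
        return "ignore"
    if re.search(r"error", line, re.IGNORECASE):
        return "review"  # ambiguous: leave it for a Technician to judge
    return "ignore"
```

The "review" bucket is where unattended tools fall short, since software alone can only guess which of those messages matter.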
Event and Alert Escalation
You will need to determine how you want to handle alerts that have been
generated. Many companies will want custom alert event handling
depending on severity, time of day, service redundancy and other
factors. Alarms can be sent to pagers, e-mail addresses or a trouble
ticket system. Verified failures can also be reported directly to
Customer IT Administrators or to a service provider for resolution. You want to avoid
monitoring companies that only call the "on call Engineer" or that
spray out calls and pages to your entire IT Staff. A competent
monitoring company will isolate the problem sufficiently so that only the
right person needs to be notified (e.g. the Oracle DBA is
called for Oracle problems, not the Network or UNIX Admins). Or you can choose a monitoring company
that has the technical expertise to perform root cause analysis and
correct the problem for you.
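Custom alert handling of the kind described above usually comes down to a small set of routing rules. Here is a minimal sketch. The team names, the business hours and the rules themselves are hypothetical examples, not a recommended policy.

```python
from datetime import time

# Hypothetical routing table: alert category -> team to notify.
ROUTES = {
    "oracle": "oracle-dba",
    "network": "network-admin",
    "unix": "unix-admin",
}


def route_alert(category, severity, now, redundant=False):
    """Return (team, channel) for an alert.

    severity is 'critical' or 'warning'; now is a datetime.time.
    Unknown categories fall back to the NOC for further isolation.
    """
    team = ROUTES.get(category, "noc")
    business_hours = time(8, 0) <= now < time(18, 0)
    if severity == "critical" and not redundant:
        channel = "page"    # wake someone up
    elif business_hours:
        channel = "email"
    else:
        channel = "ticket"  # can wait until morning
    return team, channel
```

With these rules, a critical Oracle failure at 3 AM pages only the Oracle DBA, while a warning at the same hour simply opens a ticket.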
A complete monitoring offering will include an integrated trouble ticket
system. This system will automatically open a trouble ticket when an alert is
generated so you have an audit trail of the problem and actions that were
taken to correct it. This also could be used by your IT staff to track all IT
related tasks and inform the call center of actions in process.
© 2002-2004 Vigilance Monitoring