There is always a set of standard metrics that are universally monitored (disk usage, memory usage, load, pings, etc.). Beyond that, there are a lot of lessons I've learned from operating production systems that have helped shape the breadth of monitoring we perform day in and day out.
One of my favorite all-time tweets is from @DevOps_Borat:
"Law of Murphy for devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet."
Here is a small list of things we monitor regularly that have grown out of those (sometimes painful!) experiences.
1 - Process Creation Rate (Fork Rate)
We once had a problem where IPv6 was intentionally disabled on a box. This caused a significant and unexpected issue for us: each time a new network connection was created, modprobe would spawn a new process to evaluate IPv6 status. This rapid creation of new processes slowed our servers in much the same way a "fork bomb" does. We eventually tracked it down by noticing that the "processes" counter in /proc/stat was increasing by several hundred per second. Normally you would only expect a fork rate of 1-10/sec on a production server with steady traffic.
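As a minimal sketch of that check (not the script we actually run), the fork rate can be derived by sampling the cumulative "processes" line in /proc/stat twice and dividing the difference by the interval. The function names and the 5-second interval are illustrative assumptions.

```python
import time


def read_process_count(stat_text: str) -> int:
    """Return the cumulative fork count from the text of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "processes":
            return int(fields[1])
    raise ValueError("no 'processes' line found")


def fork_rate(count_before: int, count_after: int, seconds: float) -> float:
    """Forks per second between two samples taken `seconds` apart."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return (count_after - count_before) / seconds


def sample_fork_rate(interval: float = 5.0, path: str = "/proc/stat") -> float:
    """Sample the live counter twice (Linux only) and return forks/sec."""
    with open(path) as f:
        before = read_process_count(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = read_process_count(f.read())
    return fork_rate(before, after, interval)
```

Alerting when sample_fork_rate() stays well above the 1-10/sec range mentioned above would have caught our modprobe problem much earlier.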
2 - Flow Control Packets - Controlling Transmission
TL;DR: If your network configuration honors flow control packets rather than ignoring them, they can temporarily cause dropped traffic (if this doesn't sound like an outage, then I don't know what does).
$ /usr/sbin/ethtool -S eth0 | grep flow_control
Note: These flow control frames can cascade to switch-wide loss of connectivity if you use certain Broadcom NICs. You should also trend these metrics on your switch gear. While you're at it, watch your dropped frames.
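To trend these counters rather than eyeball them, the ethtool output can be parsed and compared between runs. This is a rough sketch: it assumes the driver reports counters with "flow_control" in their names (as the grep above does), and counter names vary by NIC driver, so treat the names in any sample output as examples only.

```python
from typing import Dict


def parse_flow_control(ethtool_output: str) -> Dict[str, int]:
    """Extract flow control counters from `ethtool -S <iface>` output."""
    counters = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and "flow_control" in name:
            try:
                counters[name] = int(value.strip())
            except ValueError:
                continue  # skip lines whose value is not a plain integer
    return counters


def counter_deltas(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, int]:
    """Change in each counter that appears in both samples."""
    return {
        name: current[name] - previous[name]
        for name in current
        if name in previous
    }
```

Any non-zero delta between two samples means pause frames were sent or received during that interval, which is worth graphing even if you don't alert on it.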
3 - Swap In/Out Rate: Boosting Memory Efficiency
It's common to check whether swap usage (space on your hard drive reserved to supplement your memory) is above a threshold. But even if only a small quantity of memory is swapped, it's the rate at which pages are swapped in and out that impacts performance, not the quantity. Monitor that swap in/out rate directly instead.
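One way to measure the rate directly, sketched here under the assumption of a Linux host, is to sample the cumulative pswpin and pswpout counters in /proc/vmstat (pages swapped in and out since boot) and divide the change by the sampling interval. The function names are illustrative.

```python
from typing import Tuple


def parse_swap_counters(vmstat_text: str) -> Tuple[int, int]:
    """Return (pswpin, pswpout) page counts from the text of /proc/vmstat."""
    values = {}
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("pswpin", "pswpout"):
            values[fields[0]] = int(fields[1])
    return values["pswpin"], values["pswpout"]


def swap_rates(before: Tuple[int, int], after: Tuple[int, int],
               seconds: float) -> Tuple[float, float]:
    """Pages swapped in and out per second between two samples."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    pages_in = (after[0] - before[0]) / seconds
    pages_out = (after[1] - before[1]) / seconds
    return pages_in, pages_out
```

A box with swap in use but both rates near zero is usually fine; sustained non-zero rates are the state worth alerting on.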
4 - Server Boot Notification
Unexpected reboots are part of life. Do you know when they happen on your servers? Most people don't. We use a simple init.d script that triggers an email on system boot. This is also a useful way to announce newly provisioned servers, and it helps capture state changes even when services handle the failure gracefully without alerting.
5 - NTP Clock Offset
If you don't monitor it, then yes, one of your servers' clocks is probably off. If you've never thought about clock skew, you might not even be running ntpd on your servers. Generally, there are three things to check for: 1) that ntpd is running; 2) clock skew inside your datacenter; and 3) clock skew from your master time servers to an external source. We use check_ntp_time for this check.
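For a feel of what an offset check measures, here is a small sketch of the standard NTP clock offset and round-trip delay calculation from the four timestamps of one request/response exchange. It is not how check_ntp_time is implemented; the function names and the warning/critical thresholds are assumptions for illustration.

```python
def ntp_offset(t0: float, t1: float, t2: float, t3: float) -> float:
    """Estimated offset of the server clock relative to the client clock.

    t0: client send time      t1: server receive time
    t2: server transmit time  t3: client receive time
    """
    return ((t1 - t0) + (t2 - t3)) / 2


def ntp_delay(t0: float, t1: float, t2: float, t3: float) -> float:
    """Round-trip network delay, excluding server processing time."""
    return (t3 - t0) - (t2 - t1)


def classify_offset(offset: float, warn: float = 0.5, crit: float = 1.0) -> str:
    """Map an offset in seconds to a Nagios-style state (thresholds are hypothetical)."""
    magnitude = abs(offset)
    if magnitude >= crit:
        return "CRITICAL"
    if magnitude >= warn:
        return "WARNING"
    return "OK"
```

For example, with t0=10.0, t1=10.75, t2=11.0 and t3=10.5, the offset is 0.625 seconds and the delay is 0.25 seconds, which the thresholds above would flag as a warning.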
6 - DNS Resolutions
Internal DNS: it's a hidden part of your infrastructure that you rely on more than you realize. The things to check for are 1) local resolutions from each server; 2) external resolution and quantity of queries (if you have DNS servers in your datacenter); and 3) the availability of each upstream DNS resolver you use.
External DNS: it's good to verify your external domains resolve correctly against each of your published external nameservers. We also rely on several ccTLDs and we monitor those authoritative servers directly as well (yes, it has happened that all authoritative nameservers for a TLD were offline).
7 - SSL Certificate Expiration
It's the thing everyone forgets about because it happens so infrequently. An expired SSL certificate can unexpectedly make a secure website unavailable. The fix is easy: check SSL expiration dates and get alerted with enough lead time to renew your SSL certificates.
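A minimal sketch of such a check using only the Python standard library follows. The function names are illustrative and you would choose your own alert threshold. Note that with a default verifying context, a certificate that has already expired fails the TLS handshake, so this is meant to run ahead of expiry.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before the given notAfter timestamp (negative if past)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400
```

A check could then alert whenever days_until_expiry(fetch_not_after(host)) drops below whatever renewal window you need.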
8 - DELL OpenManage Server Administrator (OMSA)
We run with a split across two data centers: the first is a managed environment with DELL hardware, and the second is Amazon EC2. The key is to proactively monitor data centers and ensure procedures are in place for regular checks. For our DELL hardware, it's important for us to monitor the outputs from OMSA. This alerts us to RAID status, failed disks (predictive or hard failures), RAM issues, power supply states and more.
9 - Connection Limits: Managing Database and Memory
You probably run things like memcached (for in-memory caching) and MySQL (for database storage) but you may not have realized these have default connection limits. Do you monitor how close you are to those limits as you scale out application tiers?
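As a rough sketch of that check, the snippet below parses the text returned by memcached's "stats" command for curr_connections and compares it with a limit you supply. The same ratio works for MySQL by comparing the Threads_connected status variable against the max_connections setting. The function names and the idea of passing the limit in by hand are assumptions for illustration.

```python
from typing import Dict


def parse_memcached_stats(stats_text: str) -> Dict[str, str]:
    """Parse 'STAT <name> <value>' lines from a memcached stats response."""
    stats = {}
    for line in stats_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def connection_utilization(current: int, limit: int) -> float:
    """Fraction of the connection limit currently in use (0.0 to 1.0+)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return current / limit
```

Alerting at, say, 80% utilization leaves room to raise the limit before new application servers start getting refused connections.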
10 - Load Balancer Status
We configure our Load Balancers with a health check, which we can easily force to fail in order to have any given server removed from rotation. We've found it important to have visibility into the health check state, so we monitor and alert based on the same health check (if you use EC2 Load Balancers, you can monitor the ELB state from Amazon APIs).
This only scratches the surface of what it takes to keep a production environment stable. Keep monitoring; consistency is the name of the game!