LINUX & APPLICATION SERVER MONITORING TRICKS
- Prevent servers from going down,
- detect and explain what caused a server to go down,
- get servers back up after a failure.
I – Regularly watch your monitoring tool(s) (Nagios, Wily, top, …)
Most of these tools can graph each app server's CPU load average, CPU usage percentage, disk usage percentage, memory usage percentage, network bandwidth, and swap usage percentage.
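When a graphing tool isn't handy, the same figures can be spot-checked from the shell (standard Linux commands; sar requires the sysstat package):

```shell
uptime                                  # CPU load averages (1, 5, 15 minutes)
free -m                                 # memory and swap usage, in MB
df -h                                   # disk usage per filesystem
command -v sar >/dev/null && sar -n DEV 1 3 || true   # network throughput, if sysstat is installed
```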
II – Check the ulimit (maximum number of files that can be opened by applications like Tomcat, Oracle, …)
[Server]# ulimit -n
If the number of open files is approaching the limit (1024 by default on many systems), increase the ulimit and work with the developers to identify and fix the process that is leaking file descriptors.
To increase the limit for a specific application account, run ulimit -n [value] as that user.
The ulimit can be set to whatever you want; it's one of those things put in place as a throttle to keep things from going too nuts. Some systems will actually just set it to unlimited.
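A quick sketch for comparing a process's open-file count against its limit (the PID below is a placeholder and /proc is Linux-specific; persistent per-user limits go in /etc/security/limits.conf):

```shell
PID=$$                                   # placeholder: substitute your app's PID
OPEN=$(ls /proc/$PID/fd | wc -l)         # file descriptors currently open
LIMIT=$(ulimit -n)                       # soft open-file limit for this user/shell
echo "process $PID has $OPEN of $LIMIT files open"
```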
III- Port Monitoring
Check number of connections to ports used by your apps
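For example, to count established connections to an app port (8080 is an assumed Tomcat port; adjust to yours):

```shell
# Classic netstat count of established connections on port 8080:
netstat -ant 2>/dev/null | awk '/:8080/ && /ESTABLISHED/ {n++} END {print n+0}'
# Equivalent with ss on newer systems (output includes a header line):
ss -tan state established '( sport = :8080 )' 2>/dev/null | wc -l
```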
IV- Thread dump (stack trace of all threads) if CPU usage is high
[Server]# kill -3 [java_pid] (the output is printed in catalina.out) to see what is causing this, and send it to the developers.
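A sketch of the full sequence, assuming a Tomcat JVM (the pgrep pattern and log path are assumptions):

```shell
PID=$(pgrep -f 'java.*catalina' | head -1)     # find the Tomcat JVM
if [ -n "$PID" ]; then
  kill -3 "$PID"       # SIGQUIT: the JVM dumps all thread stacks but keeps running
  sleep 2
  tail -200 /opt/tomcat/logs/catalina.out      # assumed log location
fi
```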
V- Disk space /[drive_name] filling up quickly
Identify the file(s) that are filling up the disk. Most of the time, it will be log files.
[Server]# du -ks /[drive_name]/* | sort -nr | head
Rerunning this command inside the largest directories will lead you to the files eating the disk space.
Back up, remove, or empty the file in question, provided that won't break the system.
If log files are responsible for the disk filling up, let the developers know so they can fix it. In the meantime, empty the log file with:
[Log_File_Location]# echo -n > Large_Log_File_Name.log
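Note that removing an open log file with rm does not free the space until the process closes it, while truncating in place does. A small demonstration on a temp file (lsof is assumed to be installed for the last step):

```shell
LOG=$(mktemp)                    # stand-in for the real log file
echo "old log data" > "$LOG"
: > "$LOG"                       # truncate in place; an app's open handle stays valid
ls -l "$LOG"                     # size is now 0
lsof +L1 2>/dev/null | head -5   # lists deleted-but-still-open files holding disk space
rm -f "$LOG"
```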
VI- Watch catalina.out and log4j.log after staging and live deploys, especially when you are restarting the servers.
[Server]# tail -f log4j.log
VII- Start app servers properly
Before restarting app servers, make sure there is no app PID still running for that specific server.
[Server_Name]$ ps -ef | grep oracle
If one is found, kill that PID before starting the server again.
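The check can be scripted; a sketch assuming a Tomcat process (the pgrep pattern, timeout, and start command are assumptions):

```shell
# Refuse to start while an old instance is still running; stop it first.
PIDS=$(pgrep -f 'java.*catalina' || true)
if [ -n "$PIDS" ]; then
  kill $PIDS                   # ask nicely first (SIGTERM)
  sleep 10
  kill -9 $PIDS 2>/dev/null    # force anything still alive
fi
# now safe to start:
# /opt/tomcat/bin/startup.sh   # assumed start script
```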
IX – CPU load level
As a rule of thumb, if CPU usage peaks under 70% during high traffic, you are doing well and have headroom. Ticking over at around 30% is a good steady-state level.
top – 12:37:29 up 47 days, 23:09, 4 users, load average: 0.20, 0.20, 0.22
Tasks: 189 total, 1 running, 178 sleeping, 10 stopped, 0 zombie
Cpu(s): 1.2%us, 0.1%sy, 0.0%ni, 97.5%id, 1.0%wa, 0.0%hi, 0.1%si, 0.0%st
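The load average figures in the top output are relative to the core count: a load of 4 is fine on 8 cores and saturation on 2. Quick checks (Linux-specific):

```shell
nproc                              # number of CPU cores
cut -d ' ' -f 1-3 /proc/loadavg    # 1-, 5-, and 15-minute load averages
```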
X- Server-specific status pings (to ensure the servers are up and serving content)
Write scripts for this.
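A minimal sketch of such a ping script, assuming curl is available; the health URL and alert hook are assumptions:

```shell
#!/bin/sh
URL="http://localhost:8080/health"            # assumed status endpoint
if curl -sf --max-time 10 "$URL" > /dev/null; then
  echo "OK: $URL is up"
else
  echo "ALERT: $URL is down"
  # mail -s "ALERT: $URL down" sysadmin@example.com < /dev/null   # wire up your alerting
fi
```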
XI- Garbage collection stats
If you are interested in garbage collection stats, there are the gc.log files on each of the app servers. The bad thing is that the JVM doesn't date-stamp the entries, so you can see how memory fluctuates, but it's difficult to chart over time (on newer JVMs, the -XX:+PrintGCDateStamps flag adds timestamps). One good idea is a cron job that archives the file daily, so you can at least break things down day by day.
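That daily archive can be a simple cron job, e.g. `0 0 * * * /usr/local/bin/rotate_gc.sh`, with a script as small as the sketch below (the script name and gc.log path are assumptions):

```shell
#!/bin/sh
# Daily gc.log rotation sketch; the path is an assumption.
LOG="${1:-/opt/tomcat/logs/gc.log}"
[ -f "$LOG" ] || exit 0          # nothing to rotate
cp "$LOG" "$LOG.$(date +%F)"     # archive with today's date, e.g. gc.log.2014-05-04
: > "$LOG"                       # truncate so the JVM appends to a fresh day's file
```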
XII- DB Connection
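One quick way to watch database connections from the app side is counting established sockets to the DB port (1521 is the Oracle listener default; adjust for your database):

```shell
netstat -ant 2>/dev/null | awk '/:1521/ && /ESTABLISHED/ {n++} END {print n+0}'
```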
XIII- Load Average Monitoring script
Set up a cron job that emails the sysadmin when the load average goes above 3.
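A minimal sketch of that script, run from cron every few minutes (the threshold, address, and mail command are assumptions):

```shell
#!/bin/sh
THRESHOLD=3
LOAD=$(cut -d ' ' -f 1 /proc/loadavg)   # 1-minute load average (Linux-specific)
# awk handles the floating-point comparison that [ ] can't:
if awk -v l="$LOAD" -v t="$THRESHOLD" 'BEGIN { exit !(l > t) }'; then
  # replace echo with: mail -s "High load on $(hostname)" sysadmin@example.com
  echo "ALERT: load average is $LOAD on $(hostname)"
fi
```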
XIV – Find out who is monopolizing or eating the CPUs
[Server]# ps -eo pcpu,pid,user,args | sort -k 1 -nr | head -10
(the -n makes the sort numeric, so 10.0 correctly sorts above 9.0)