Linux & App Servers Monitoring Tricks



  • Prevent the server from going down,
  • detect what caused the server to go down,
  • get the server back up after a failure,
  • explain what caused the server to fail.

I – Regularly watch your monitoring tool(s) (Nagios, Wily, top, …)

On some of these tools, you can see graphs of the app's CPU load average, CPU usage percentage, disk usage percentage, memory usage percentage, network bandwidth, and swap usage percentage.
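When no monitoring tool is handy, a quick snapshot of most of these metrics can be taken from a shell using only standard Linux interfaces (a minimal sketch, no agent required):

```shell
# CPU load averages over the last 1, 5 and 15 minutes
cat /proc/loadavg

# memory and swap usage
grep -E 'MemTotal|MemAvailable|SwapTotal|SwapFree' /proc/meminfo

# disk usage percentage per filesystem
df -h
```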

II – Check the ulimit (the maximum number of files that applications like Tomcat, Oracle, … may open)

[Server]# ulimit -n
If the number of open files is getting close to the ulimit (often 1024 by default), increase the ulimit and talk to the developers to identify and fix the process that is causing it.
To increase the ulimit for a specific application account, run the command ulimit -n [value]
The ulimit can be set to whatever you want. It's one of those things put in place as a throttle to keep things from going too nuts; some systems actually just set it to unlimited.
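For reference, the soft and hard limits can be inspected per shell; raising the soft limit above the hard limit requires root, and the usual persistent fix goes in /etc/security/limits.conf (assuming pam_limits is in use; the account name below is a placeholder):

```shell
ulimit -Sn    # current soft limit on open files
ulimit -Hn    # hard limit (the ceiling for non-root users)

# persistent example lines for /etc/security/limits.conf:
#   tomcat  soft  nofile  4096
#   tomcat  hard  nofile  8192
```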

III- Port Monitoring
Check the number of connections to the ports used by your apps.
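One way to do this is to tally established connections by local port. The helper below just counts `addr:port` lines from stdin; feeding it live data from `ss` (iproute2) is shown in the comment:

```shell
# count_by_port: read "addr:port" lines on stdin, print "count port", busiest first
count_by_port() {
  awk -F: '{print $NF}' | sort | uniq -c | sort -rn
}

# typical use (column 4 of `ss -tan` is the local address):
#   ss -tan state established | awk 'NR>1 {print $4}' | count_by_port
```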

IV- Thread dump (stack trace of all threads) when the CPU percentage is high


[Server]# kill -3 [pid]
The thread dump is printed to catalina.out; review it to see what is causing the high CPU and send it to the developers.
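A hedged sketch of automating this step (the process pattern is an assumption; adjust it to whatever identifies your JVM). SIGQUIT makes the JVM print a full thread dump without exiting:

```shell
# thread_dump: send SIGQUIT to the first process whose command line matches $1
thread_dump() {
  local pid
  pid=$(pgrep -f -- "$1" | head -1)
  # signal 3 (SIGQUIT) tells the JVM to append a thread dump to catalina.out
  [ -n "$pid" ] && kill -3 "$pid"
}

# usage:  thread_dump 'org.apache.catalina'   # then read catalina.out
```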

V- Disk space /[drive_name] filling up quickly
Identify the file(s) that are filling up the disk. Most of the time, it will be log files.
[Server]# du -ks /[drive_name]/* | sort -nr | head
5719076 /[drive_name]/catalina
3675672 /[drive_name]/data
3287436 /[drive_name]/source
2044316 /[drive_name]/servers
319404 /[drive_name]/images
16 /[drive_name]/lost+found
Re-run the command on the largest folders to drill down to the files eating the disk space.
Back up, remove, or empty the file in question, provided that doing so won't break the system.
If log files are responsible for the disk filling up, let the developers know so they can fix it. In the meantime, empty the log file with the command:

[Log_File_Location]# echo -n > Large_Log_File_Name.log


VI- Watch catalina.out and log4j.out after staging and live deploys, especially when restarting the servers.


[Server]# tail -f log4j.log
VII- Start app servers properly
Before restarting an app server, make sure no app pid is still running for that specific server.

[Server_Name]$ ps -ef | grep oracle
If a stale pid is found, kill it before starting the server.
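That pre-start check can be scripted; a minimal sketch (assumes pgrep from procps, and the pattern is whatever identifies your server's process, e.g. "oracle" as in the example above):

```shell
# ok_to_start: succeed only when no process command line matches $1
ok_to_start() {
  ! pgrep -f -- "$1" > /dev/null
}

# usage, before starting the server:
#   ok_to_start 'oracle' || { echo "stale pid still running" >&2; exit 1; }
```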

IX – CPU load level

As a rule of thumb, if the CPU peaks under 70% during high traffic, you are doing well and have headroom. A good level to be ticking over at is around 30%.
[Server]# top
top - 12:37:29 up 47 days, 23:09, 4 users, load average: 0.20, 0.20, 0.22
Tasks: 189 total, 1 running, 178 sleeping, 10 stopped, 0 zombie
Cpu(s): 1.2%us, 0.1%sy, 0.0%ni, 97.5%id, 1.0%wa, 0.0%hi, 0.1%si, 0.0%st

X- Server-specific status pings (to ensure the servers are up and serving content)
Write scripts for this.
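One way to script it, a minimal sketch using bash's built-in /dev/tcp so there are no external dependencies (host, port, and the email address are placeholders; for a real content check, curl the app's health URL instead):

```shell
#!/bin/bash
# port_open: succeed when host $1 accepts TCP connections on port $2
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# usage in a cron-driven check:
#   port_open localhost 8080 || echo "app server not answering" | mail -s "alert" sysadmin@example.com
#   # HTTP-level alternative:  curl -fsS --max-time 5 http://localhost:8080/health
```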

XI- Garbage collection stats

If you are interested in garbage collection stats, there are the gc.log files on each of the app servers (the downside is that the entries are not date-stamped, so you can see how memory fluctuates, but it is difficult to chart over time). In the past I have thought it might be a good idea to write a cron job that archives the file daily, so that you could at least break things down day by day.
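A sketch of that daily-archive idea (the log path is a placeholder): copy the file to a dated archive, then truncate it in place. Truncating with `: >` keeps the same inode, so the JVM, which holds the file open, keeps writing to it.

```shell
# rotate_log: archive $1 with today's date, then empty the original in place
rotate_log() {
  cp -- "$1" "$1.$(date +%F)" && : > "$1"
}

# example crontab entry, archiving just after midnight:
#   5 0 * * * /usr/local/bin/rotate_gc.sh /opt/tomcat/logs/gc.log
```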

XII- DB Connection
Check that the application can reach the database and that its connection pool is not exhausted.

XIII- Load Average Monitoring script
Set up a cron job that emails the sysadmin when the load average goes above 3.
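A sketch of that cron job (the threshold and email address are placeholders; shell arithmetic is integer-only, so awk handles the floating-point comparison):

```shell
#!/bin/bash
THRESHOLD=3

# load_too_high: succeed when load average $1 exceeds threshold $2
load_too_high() {
  awk -v l="$1" -v t="$2" 'BEGIN { exit !(l > t) }'
}

load=$(cut -d' ' -f1 /proc/loadavg)   # 1-minute load average
if load_too_high "$load" "$THRESHOLD"; then
  # in the real cron job, pipe this into:  mail -s "high load on $(hostname)" sysadmin@example.com
  echo "ALERT: load average is $load"
fi
```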


XIV – Find out who is monopolizing or eating the CPUs
[Server]# ps -eo pcpu,pid,user,args | sort -k 1 -r | head -10
