What to Check, and How, to Assess a Linux Server's Health

1. CPU Performance

  • Current Usage: top, htop, or mpstat
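  • Load Average: uptime, or the first line of top (the three numbers are the 1-, 5-, and 15-minute averages).
  • Compare the load average to the number of CPU cores (nproc); a load that stays above the core count suggests processes are queuing for CPU or I/O.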
  • Processes: Monitor high-CPU-consuming processes using top or ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu.
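
The load-versus-cores comparison can be sketched in a few lines of shell. This is a minimal sketch, assuming a Linux system where /proc/loadavg and nproc are available; the function name load_per_core is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not a standard tool): divide a load average by a core count.
load_per_core() {
    # $1 = load average, $2 = number of CPU cores
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# The first field of /proc/loadavg is the 1-minute load average.
load1=$(cut -d ' ' -f 1 /proc/loadavg)
cores=$(nproc)
echo "1-minute load per core: $(load_per_core "$load1" "$cores")"
```

A value that stays near or above 1.00 per core is usually worth a closer look with top or ps.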

2. Memory Usage

  • Total and Free Memory: free -h or vmstat -s.
  • Swap Usage: Check whether swap space is being heavily used (free -h or swapon --show; the older swapon -s output format is deprecated).
  • Processes Using Most Memory: top or ps -eo pid,ppid,cmd,%mem --sort=-%mem.
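
For a single "how full is memory" number, one option is to read /proc/meminfo directly. The sketch below assumes a kernel that exposes the MemAvailable field; mem_used_pct is a hypothetical helper name, not an existing command.

```shell
#!/bin/sh
# Hypothetical helper: percentage of memory in use, given total and
# available amounts in the same unit (truncated to a whole number).
mem_used_pct() {
    awk -v total="$1" -v avail="$2" 'BEGIN { printf "%d\n", (total - avail) * 100 / total }'
}

# MemTotal and MemAvailable are reported in kB in /proc/meminfo.
total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
avail_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
echo "Memory in use: $(mem_used_pct "$total_kb" "$avail_kb")%"
```

Using MemAvailable rather than MemFree avoids counting reclaimable page cache as "used" memory.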

3. Disk Usage

  • Available Space: df -h to check disk usage across filesystems.
  • Inode Usage: df -i to check inode utilization.
  • Disk I/O: iostat, iotop, or dstat.
  • Error Messages: Review dmesg and the logs in /var/log/ for disk-related errors, such as I/O errors or a filesystem remounted read-only.
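
The space check is easy to turn into a threshold alert. The sketch below is one possible approach, assuming the POSIX output format of df -P and mount points without spaces; over_threshold is a hypothetical helper name and the 90% limit is an arbitrary example.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` style output on stdin and print the
# mount point and use% of every filesystem at or above the limit in $1.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)              # strip the trailing percent sign
        if (pct + 0 >= limit) print $6 " " pct "%"
    }'
}

# Flag filesystems that are at least 90% full.
df -P | over_threshold 90
```

The same filter works on df -Pi output to flag inode exhaustion, since the column layout is the same.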

4. Network Performance

  • Network Usage: iftop, ip -s link, or netstat -i.
  • Connections: ss (or the older netstat) to check open connections and ports.
  • Packet Loss/Latency: ping, traceroute, or mtr.
  • Bandwidth Monitoring: vnstat, iftop, or nload.
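
To use packet loss in a script, the summary line that ping prints can be parsed. This is a rough sketch that assumes the summary format used by the common Linux ping ("N packets transmitted, N received, N% packet loss, ..."); packet_loss_pct is a hypothetical helper and example.com is a placeholder target.

```shell
#!/bin/sh
# Hypothetical helper: read ping output on stdin and print the
# packet-loss percentage from the summary line (without the % sign).
packet_loss_pct() {
    awk -F', ' '/packet loss/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /packet loss/) { split($i, part, "%"); print part[1] }
    }'
}

# Example usage (requires network access; the host is a placeholder):
# ping -c 4 example.com | packet_loss_pct
```

Other ping implementations may word the summary differently, so the pattern may need adjusting.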

5. System Logs

  • General System Logs: journalctl, or /var/log/syslog on Debian/Ubuntu and /var/log/messages on RHEL-based systems (for system-wide events).
  • Kernel Logs: dmesg or journalctl -k to check for hardware errors or warnings.

6. Uptime and System Load

  • Uptime: uptime command provides server uptime and load averages.
  • Load Analysis: Investigate load spikes with sar or atop.

7. Running Services and Processes

  • Service Status: systemctl status <service> or service <service> status.
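  • Zombie/Unnecessary Processes: ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/' to list zombie processes (a plain ps aux | grep Z also matches any line that merely contains the letter Z).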
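
Zombie detection is more reliable when it tests only the process state column rather than the whole line. The sketch below assumes a ps that supports the -o option with header-suppressing "=" suffixes (as procps does); zombies_only is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read "PID PPID STAT COMMAND" lines on stdin and
# keep only those whose state (third column) starts with Z (zombie).
zombies_only() {
    awk '$3 ~ /^Z/'
}

# The "=" after each column name suppresses the header line.
ps -eo pid=,ppid=,stat=,comm= | zombies_only
```

The PPID column identifies the parent that has not reaped the zombie, which is usually the process to investigate.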

8. Security

  • Users Logged In: who, w, or last.
  • Unauthorized Logins: Review /var/log/secure (RHEL-based systems) or /var/log/auth.log (Debian/Ubuntu).
  • Firewall Rules: iptables -L or ufw status.
  • Listening Ports: ss -tuln or netstat -tuln.
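
When reviewing the authentication log, a count of failed SSH password attempts per source address is a quick first pass. This sketch assumes the usual OpenSSH "Failed password for ... from ADDRESS port ..." message format; failed_logins_by_ip is a hypothetical helper name, and reading the logs normally requires root.

```shell
#!/bin/sh
# Hypothetical helper: read an auth log on stdin and print
# "COUNT ADDRESS" for failed SSH password attempts, highest count first.
failed_logins_by_ip() {
    awk '/Failed password/ {
        # The source address is the word that follows "from".
        for (i = 1; i < NF; i++) if ($i == "from") count[$(i + 1)]++
    }
    END { for (ip in count) print count[ip], ip }' | sort -rn
}

# Use whichever log file exists and is readable on this distribution.
for log in /var/log/auth.log /var/log/secure; do
    if [ -r "$log" ]; then
        failed_logins_by_ip < "$log"
    fi
done
```

On systems that log only to the journal, the same filter can be fed from journalctl output instead of a file.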

9. Hardware Health

  • Temperature and Fan Speed: sensors (part of lm-sensors package).
  • RAID Status: For Linux software RAID, check cat /proc/mdstat or mdadm --detail <device>; for hardware RAID, use the vendor's tools.

10. Scheduled Jobs

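  • Cron Jobs: crontab -l (current user's jobs), plus /etc/crontab and /etc/cron.d/ for system-wide jobs.
  • Failures: Examine /var/log/syslog (Debian/Ubuntu) or /var/log/cron (RHEL-based systems) for cron-related entries.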

11. Backup Status

  • Backup Logs: Review the backup logs to confirm that backups run as scheduled and complete without errors.
  • Verify Integrity: Test restore procedures periodically; a backup that cannot be restored is not a backup.

Automation Tools for Server Health Monitoring

  • Nagios, Zabbix, Prometheus, or Datadog for continuous monitoring.
  • Custom scripts combining commands like top, df, iostat, and log parsing can provide quick insights.
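
A custom script of this kind can be as small as a function that compares a measured value with a limit. The sketch below is illustrative only: report is a hypothetical helper, the 90% disk limit is an arbitrary example, and it assumes a Linux system with df, nproc, and /proc/loadavg.

```shell
#!/bin/sh
# Hypothetical helper: print OK or WARN depending on whether the
# measured value exceeds the limit.
report() {
    # $1 = label, $2 = measured value, $3 = warning limit
    awk -v label="$1" -v value="$2" -v limit="$3" 'BEGIN {
        status = (value + 0 > limit + 0) ? "WARN" : "OK"
        printf "%s %s=%s (limit %s)\n", status, label, value, limit
    }'
}

# Gather two example metrics: root filesystem use% and 1-minute load.
root_pct=$(df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
load1=$(cut -d ' ' -f 1 /proc/loadavg)

report "root_disk_pct" "$root_pct" 90
report "load_1min" "$load1" "$(nproc)"
```

Run from cron, a script like this can send its WARN lines to a mail or chat hook, though a dedicated monitoring tool scales better than ad hoc scripts.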

By reviewing these parameters periodically, you can keep a Linux server healthy and address potential issues before they cause outages.
