Monitoring system resources

Disk space

Troubleshooting workflow:

  1. Run df -h to find the volume that is low on space
  2. Run du -hsc * on that volume to find the directory using the most space
  3. Run ncdu [-x] for a deeper, interactive look

df

disk free - displays file system usage:

  • Lists all mounted volumes and how much space is left on each, in 1K blocks by default
  • Use the -h option for human-readable units (K, M, G)
  • Device names depend on the type of hardware that the underlying storage device is on
  • Also need to consider inodes in addition to the size of the data
    • An inode is a filesystem data structure that holds the metadata for each file or directory that you store
    • Metadata includes file owner, permissions, last modified date, etc.
    • If a volume runs out of inodes, the server is creating too many files, such as log files or email messages
df -h               # list disk space in human-readable format
df -i               # list inode usage
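
The space and inode checks above can be turned into a quick threshold filter. This is a minimal sketch: the function name volumes_over and the 90% threshold are illustrative choices, and it assumes mount points without spaces.

```shell
# volumes_over THRESHOLD
# Reads `df -P` (or `df -iP`) output on stdin and prints the mount point of
# every filesystem whose usage percentage is at or above THRESHOLD.
# Hypothetical helper for illustration, not a standard tool.
volumes_over() {
    awk -v limit="$1" '
        NR > 1 {                       # skip the header line
            pct = $5                   # column 5 is the usage percentage
            sub(/%/, "", pct)          # strip the trailing % sign
            # filesystems without inode counts report "-" and are skipped
            if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0)
                print $6               # column 6 is the mount point
        }'
}

# Usage on a live system:
#   df -P  | volumes_over 90    # volumes at or above 90% space used
#   df -iP | volumes_over 90    # volumes at or above 90% inodes used
```

The -P option keeps each filesystem on one line, which makes the columns safe to parse.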

ncdu

NCurses Disk Usage:

  • Scans disk usage and lets you browse the results interactively
  • Not installed by default, so you need to install it first
  • Can only scan dirs that the user can access
sudo apt install ncdu               # install
ncdu -x                             # view only current fs

Disk usage by directory

du

Shows how much space a directory is using:

  • scans the current directory and subdirectories that you have permissions to
    • run as root to get full picture
  • After you find the general location of the disk hog, cd into dirs and run du again
du -hsc *       # -h human-readable, -s summary per item, -c grand total
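
When a directory has many entries, sorting the du output makes the disk hog easier to spot. A minimal sketch, assuming GNU du, sort, and head; the function name largest_entries is made up for this example.

```shell
# largest_entries [DIR] [COUNT]
# Prints the COUNT largest entries directly under DIR, biggest first.
# Sizes are in KiB (du -k) so that a plain numeric sort works.
# Hypothetical helper for illustration.
largest_entries() {
    dir=${1:-.}
    count=${2:-10}
    # -x stays on one filesystem, -s summarizes each argument.
    # The second glob picks up dotfiles, which a bare * skips.
    # Errors (unmatched globs, unreadable dirs) are hidden; run as root
    # to get the full picture.
    ( cd "$dir" && du -xsk -- * .[!.]* 2>/dev/null ) | sort -rn | head -n "$count"
}

# Usage:
#   largest_entries /var 5
```

After you find the largest entry, run the function again on that directory to keep narrowing down.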

Memory usage

free

Displays the current memory usage in KiB:

  • To see if there is a problem, compare the Mem available value with the Mem total value.
  • free is the only memory that is not in use at all
  • available is an estimate of how much memory an app can get without swapping. It counts free memory plus the cache memory that the kernel can reclaim if an app needs it.
    • This is because any RAM that is not in use is wasted - it’s about efficiency
    • “Extra” RAM is given to the filesystem cache, which holds recently read files and data that is written to disk when the time is right (it is synchronized)
    • This makes your system faster, because the system doesn’t have to go to disk for recently used files - it goes to RAM
  • tmpfs is a temporary filesystem in Linux that resides in memory (RAM) rather than on a physical storage device. It is typically used for storing temporary files that don’t need to persist after a system reboot.
Column       Description
total        Total memory on the server
used         Memory that is in use. Older versions of free calculate this as total - free - buff/cache; newer versions use total - available
free         Memory not in use by anything
shared       Memory used by tmpfs and other shared resources
buff/cache   Memory used by kernel buffers and the page cache
available    Memory that is available for app use. Much of this is currently used for cache.
free                # memory usage in KiB
free -m             # memory usage in MiB (recommended)

# --- Example to understand columns --- #
free -m
               total        used        free      shared  buff/cache   available
Mem:            3915         533        2474           1        1198        3381
Swap:           2335           0        2335
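
To compare available with total as a single number, the Mem row can be reduced to a percentage. A minimal sketch, assuming a version of free that prints the available column as the seventh field; the function name mem_available_pct is illustrative.

```shell
# mem_available_pct
# Reads `free` output on stdin and prints available memory as a whole
# percentage of total memory (available column divided by total column).
# Hypothetical helper for illustration.
mem_available_pct() {
    awk '/^Mem:/ { printf "%d\n", $7 * 100 / $2 }'
}

# Usage:
#   free -m | mem_available_pct
```

With the sample output above (total 3915, available 3381), this prints 86.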

Swap

A disk partition or a file that acts like RAM when your server memory is saturated:

  • On disk, so much slower than RAM
  • Helps prevent the OOM killer from killing processes
  • After 16.04, Ubuntu uses a swap file, not a swap partition
    • Easier to grow and shrink a file than a partition
    • No need to make a swap file or partition yourself anymore, because the installer creates one
  • swap is listed in the /etc/fstab file
  • Only delete the swap file if you need to make a larger one
  • Some apps like K8s require that you disable swap
  • Recommend at least 2GB of swap on servers
  • swappiness controls how readily the kernel moves memory pages to swap
    • Set to 60 by default
    • The higher the value, the more likely the server uses swap
    • Change in /etc/sysctl.conf to persist swappiness after reboot
grep swap /etc/fstab        # swap file in /etc/fstab
/swap.img	none	swap	sw	0	0

swapon -a                       # finds swap with /etc/fstab, mounts it, activates it
swapoff -a                      # deactivates swap
cat /proc/sys/vm/swappiness     # view swappiness
sysctl vm.swappiness=30         # change swappiness until reboot

# --- Creating a swap file --- #
# 1. Create the file with fallocate
fallocate -l 2G /swapfile

# 2. Set permissions
chmod 0600 /swapfile

# 3. Convert to swap file
mkswap /swapfile

# 4. Add this entry to /etc/fstab (the fourth field, sw, is the mount options)
/swapfile   none    swap    sw    0   0

# 5. Activate new swapfile
swapon -a

# --- Change swappiness and persist after reboot --- #
# 1. Open /etc/sysctl.conf in an editor
sudoedit /etc/sysctl.conf

# 2. Add new val to bottom of file
vm.swappiness = 30

# 3. Load the new value without rebooting
sysctl -p
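
To confirm that the persisted value matches the running one, the config file can be read back and compared with /proc. A minimal sketch: the function name configured_swappiness is made up, and it only reads one file, so values set in /etc/sysctl.d/ are not considered.

```shell
# configured_swappiness [FILE]
# Prints the last vm.swappiness value set in a sysctl.conf-style file
# (default /etc/sysctl.conf). Prints nothing if the key is not set.
# Hypothetical helper for illustration.
configured_swappiness() {
    awk -F= '
        /^[[:space:]]*[#;]/ { next }             # skip comment lines
        {
            key = $1
            gsub(/[[:space:]]/, "", key)         # trim spaces around the key
        }
        key == "vm.swappiness" {
            val = $2
            gsub(/[[:space:]]/, "", val)         # trim spaces around the value
            found = val                          # a later line overrides an earlier one
        }
        END { if (found != "") print found }' "${1:-/etc/sysctl.conf}"
}

# Usage:
#   configured_swappiness            # value that persists after reboot
#   cat /proc/sys/vm/swappiness      # value in effect right now
```

If the two values differ, the running value was probably changed with sysctl and is lost at the next reboot.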

fallocate

Create a file with a preallocated size:

fallocate -l <size> <filename>
# -l - length of the file in bytes; accepts suffixes such as K, M, G

fallocate -l 4G /swapfile
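
The size suffixes are 1024-based, so 4G means 4 x 1024 x 1024 x 1024 bytes. A minimal sketch of that conversion, useful for checking a file size afterward; the function name to_bytes is illustrative and it only handles the K, M, and G suffixes.

```shell
# to_bytes SIZE
# Converts a size such as 512M or 4G to bytes using 1024-based multipliers,
# the same meaning fallocate gives to the K, M, and G suffixes.
# Hypothetical helper for illustration.
to_bytes() {
    num=${1%[KMG]}                     # strip a trailing K, M, or G
    case $1 in
        *K) echo $((num * 1024)) ;;
        *M) echo $((num * 1024 * 1024)) ;;
        *G) echo $((num * 1024 * 1024 * 1024)) ;;
        *)  echo "$1" ;;               # no suffix: already bytes
    esac
}

# Usage:
#   to_bytes 4G                # expected size in bytes
#   stat -c %s /swapfile       # actual size in bytes (GNU stat)
```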

Load average

Represents your server’s trend in CPU utilization over time:

  • Stored in /proc/loadavg
  • easier to view with uptime
  • Numbers are averages over the last 1 min, 5 min, and 15 min
    • Represent the average number of tasks that were running or waiting for CPU in that time period (on Linux, tasks waiting on disk I/O count too)
    • Less than 1 per CPU is good
    • If load avg = # CPUs on system, then they are all running at 100%
    • If load avg > # CPUs on system, tasks are waiting for CPU and you have an issue
    • Analogy: cashiers at a supermarket - if there are 4 cashiers and 4 customers checking out, they are running at capacity. If there are 6 customers, then the store is above capacity
  • Develop baselines for your server so you know what is normal. For example, if it goes from 1.x to 0.x, then you might be overspending on your server, or maybe a service is down
  • Better view of CPU usage than something like htop, because CPU usage can spike to 100% while a process is running and then drop when it completes, so the load average over time gives a better picture
  • A server can have multiple physical CPUs, each CPU can have multiple cores, and each core can present multiple logical cores (threads). The kernel treats each logical core as a CPU, which is the count that nproc reports
cat /proc/loadavg                   # view load avg in /proc
0.00 0.00 0.00 1/234 14112

uptime                              # view load avg, resets at reboot
nproc                               # get number of cores
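
The comparison between load average and CPU count can be scripted. A minimal sketch that follows the cashier analogy, using only the 1-minute value; the function name load_status and the wording of the verdicts are made up for this example.

```shell
# load_status LOADAVG_LINE CPU_COUNT
# Divides the 1-minute load average (first field of /proc/loadavg) by the
# number of CPUs and reports whether the server is below, at, or above capacity.
# Hypothetical helper for illustration.
load_status() {
    echo "$1" | awk -v cpus="$2" '{
        ratio = $1 / cpus
        if (ratio < 1)       verdict = "below capacity"
        else if (ratio == 1) verdict = "at capacity"
        else                 verdict = "above capacity"
        printf "%.2f per CPU: %s\n", ratio, verdict
    }'
}

# Usage on a live system:
#   load_status "$(cat /proc/loadavg)" "$(nproc)"
```

For example, a load of 6.00 on 4 CPUs prints "1.50 per CPU: above capacity", which matches 6 customers and 4 cashiers.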

View resource usage

htop

Provides an overall view of your server performance. Better than top:

  • Run as root for additional capabilities, like killing processes owned by other users
  • Add CPU average for all cores: F2 > Meters, then F5 to select CPU average
  • Can navigate with the mouse. Ex: click Quit at the bottom of the screen to exit
  • View by user by entering u
  • Tree view with F5
  • Refreshes every 2 seconds, but you can change that with the -d option and a number in tenths of a second
F2                  # Setup mode: view options, set colors, etc.
u                   # Open the "Show processes of:" menu to view processes for a specific user
F5                  # Enter/exit tree view
htop -d 70          # Refresh every 7 seconds
F9                  # Kill the selected process - will provide signal options