Monitoring system resources

Disk space

Troubleshooting workflow:

df -h to see what volume is causing issues
du -hsc * to find the directory causing issues
ncdu [-x] for deep dive

df

disk free - displays file system usage:

lists all mounted filesystems and how much space is left on each, in 1K blocks by default
use the -h option for human-readable units (K, M, G)
Device names are generated by the type of hardware that the underlying storage device is on
Also need to consider inodes in addition to the size of the data
- an inode is a filesystem data structure that holds the metadata for each file or directory that you're storing
- file owner, permissions, last modified date, etc
- If failure bc of inodes, server is creating too many files, such as log files or email messages

df -h               # list disk space in human-readable format
df -i               # list inode usage
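To act on both checks at once, the df output can be filtered for filesystems that are close to full. This is a minimal sketch, not a complete monitoring tool: the function name over_threshold and the 90 percent limit are made up for illustration, and it assumes the POSIX-style `df -P` layout (percentage in column 5, mount point in column 6) with no spaces in mount point names.

```shell
# over_threshold LIMIT: read `df -P` (or `df -Pi`) output on stdin and print
# "<mount point> <percent>" for every filesystem at or above LIMIT percent.
# Hypothetical helper; assumes mount points contain no spaces.
over_threshold() {
    awk -v limit="$1" '
        NR == 1 { next }                 # skip the header row
        {
            pct = $5
            sub(/%$/, "", pct)           # "91%" -> "91"
            # some filesystems report "-" for inodes; ignore non-numeric values
            if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0)
                print $6, pct "%"
        }
    '
}

# Usage on a real system (90 is an arbitrary example limit):
#   df -P  | over_threshold 90      # disk space
#   df -Pi | over_threshold 90      # inodes
```

Reading from stdin keeps the function easy to try against saved df output before pointing it at a live server.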

ncdu

NCurses Disk Usage. Scans your filesystem and shows a report on which directories are using the most space:

Get disk space and look through the results
Need to install first
Can only scan dirs that the user can access
Use the arrow keys and Enter to drill down into directories in the output.

sudo apt install ncdu               # install
ncdu -x                             # view only current fs

Disk usage by directory

du

Shows how much space a directory is using:

scans the current directory and subdirectories that you have permissions to
- run as root to get full picture
After you find the general location of the disk hog, cd into dirs and run du again

du -hsc *       # human-readable, summary, total usage
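Note that `*` does not match hidden entries, and the du output is not sorted. One possible variation is sketched below: it lists the immediate subdirectories of a directory, largest first. The function name biggest is hypothetical, and the sketch assumes a du that supports `-d` (GNU, BSD, and BusyBox versions do).

```shell
# biggest [DIR] [COUNT]: show the COUNT largest directories directly under DIR
# (default: current directory, 10 lines), sizes in KB, largest first.
# Hypothetical helper. The first line is normally DIR itself (the total).
biggest() {
    dir=${1:-.}
    count=${2:-10}
    # -x: stay on one filesystem, -k: report in KB, -d 1: one level deep
    du -xk -d 1 -- "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Usage:
#   sudo sh -c '. ./biggest.sh; biggest /var 5'   # run as root for the full picture
```

Files that sit directly in DIR count toward the total but do not get their own line; add `-a` to du if you want those listed too.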

Memory usage

free

Displays the current memory usage in KB:

To see if there is a problem, compare the available column with the total column on the Mem row.
free memory is the only memory that is actually not in use at all
available is an estimate of the memory that apps can claim without swapping: free memory plus the cache that the kernel can release when an app needs it.
- This is because any RAM that is not in use is wasted - it's about efficiency
- “Extra” RAM is given to the filesystem cache, which stores data that is written to disk when the time is right (it is synchronized)
- This makes your system faster, bc the system doesn’t have to read/write to disk for recently used files - it goes to RAM
tmpfs is a temporary filesystem in Linux that resides in memory (RAM) rather than on a physical storage device. It is typically used for storing temporary files that don’t need to persist after a system reboot.

Column	Description
total	Total memory on the server
used	Memory that is used. Current procps releases calculate used = total - available; older releases calculated used = total - free - buff/cache
free	Memory not in use by anything
shared	Memory used by `tmpfs` and other shared resources
buff/cache	Memory used by buffers and cache
available	Memory that apps can use without swapping. Much of this is currently used for cache.

free                # memory usage in KB
free -m             # memory usage in MB (recommended)

# --- Example to understand columns --- #
free -m
               total        used        free      shared  buff/cache   available
Mem:            3915         533        2474           1        1198        3381
Swap:           2335           0        2335
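The available-versus-total comparison can be reduced to a single percentage. A small sketch, assuming the seven-column Mem row shown in the example output (total in column 2, available in column 7); the function name mem_available_pct is made up for illustration.

```shell
# mem_available_pct: read `free` output on stdin and print the share of total
# memory that is available, as a whole-number percentage.
# Hypothetical helper; assumes a `free` that prints the "available" column.
mem_available_pct() {
    awk '$1 == "Mem:" { printf "%d\n", $7 * 100 / $2 }'
}

# Usage:
#   free -m | mem_available_pct
```

For the example output above, this prints 86 (3381 of 3915 MB available).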

Swap

A disk partition or a file that acts like RAM when your server memory is saturated:

On disk, so much slower than RAM
Helps keep the OOM (out-of-memory) killer from terminating processes
After 16.04, Ubuntu uses a swap file, not swap partition
- Easier to grow and shrink a file than partition
- No need to make a swap file or partition anymore
swap is listed in /etc/fstab file
Only delete swap file if you need to make a larger one
Some apps like K8s require that you disable swap
Recommend at least 2GB of swap on servers
swappiness is a kernel setting that controls how readily your server moves memory pages to swap
- Set to 60 by default
- higher the value, more likely the server uses swap
- Change in /etc/sysctl.conf to persist swappiness after reboot

grep swap /etc/fstab        # swap file in /etc/fstab
/swap.img	none	swap	sw	0	0

swapon -a                       # finds swap with /etc/fstab, mounts it, activates it
swapoff -a                      # deactivates swap
cat /proc/sys/vm/swappiness     # view swappiness
sysctl vm.swappiness=30         # change swappiness until reboot

# --- Creating a swap file --- #
# 1. Create the file with fallocate
fallocate -l 2G /swapfile

# 2. Set permissions
chmod 0600 /swapfile

# 3. Convert to swap file
mkswap /swapfile

# 4. Add it to /etc/fstab (six fields: device, mount point, type, options, dump, pass)
/swapfile   none    swap    sw  0   0

# 5. Activate new swapfile
swapon -a

# --- Change swappiness and persist after reboot --- #
# 1. Open /etc/sysctl.conf in an editor as root
sudoedit /etc/sysctl.conf

# 2. Add new val to bottom of file
vm.swappiness = 30
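After editing the file, it can help to confirm that the persisted value matches what the kernel is currently using. The sketch below extracts the last vm.swappiness assignment from sysctl.conf-style text. The function name persisted_swappiness is hypothetical, and the sketch only looks at the text it is given, so it ignores any overrides in other files such as those under /etc/sysctl.d/.

```shell
# persisted_swappiness: read sysctl.conf-style text on stdin and print the
# value of the last vm.swappiness assignment, if any.
# Hypothetical helper; lines starting with # or ; are treated as comments.
persisted_swappiness() {
    awk -F= '
        /^[[:space:]]*[#;]/ { next }
        {
            key = $1
            gsub(/[[:space:]]/, "", key)
        }
        key == "vm.swappiness" {
            val = $2
            gsub(/[[:space:]]/, "", val)
            last = val                   # a later assignment wins
        }
        END { if (last != "") print last }
    '
}

# Usage:
#   persisted=$(persisted_swappiness < /etc/sysctl.conf)
#   running=$(cat /proc/sys/vm/swappiness)
#   [ "$persisted" = "$running" ] || echo "run 'sysctl -p' or reboot to apply $persisted"
```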

fallocate

Create a file with a preallocated size:

fallocate -l <size> <filename>
# -l: length of the file in bytes; accepts size suffixes such as M and G

fallocate -l 4G /swapfile

Load average

Represents your server’s trend in CPU utilization over time:

Stored in /proc/loadavg
easier to view with uptime
numbers are 1 min, 5 min, 15 min
- Represent the average number of tasks that were running or waiting for CPU in that time period (on Linux, tasks in uninterruptible I/O wait are counted too)
- Less than 1 per CPU is good
- If load avg = # CPUs on system, then they are all running 100%
- If load avg > # CPUs on system, you have an issue
- Analogy: cashiers at a supermarket - if there are 4 cashiers and 4 customers checking out, they are running at capacity. If there are 6 customers, then the store is above capacity
Develop baselines for your server so you know what is normal. For example, if the load goes from 1.x to 0.x, then you might be overspending on your server, or a service might be down
Gives a better picture of CPU usage than a point-in-time view like htop: CPU usage can spike to 100% while a process runs and drop back when it completes, but the 1, 5, and 15 minute averages show the trend
A server can have multiple physical CPUs (sockets), each CPU can have multiple cores, and each core can expose multiple logical CPUs (hardware threads). The kernel schedules tasks on logical CPUs, so count those when you compare against the load average

cat /proc/loadavg                   # view load avg in /proc
0.00 0.00 0.00 1/234 14112

uptime                              # view load avg, resets at reboot
nproc                               # get number of available processing units (logical CPUs)
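Because the load average only means something relative to the CPU count, dividing one by the other gives a number that is easier to compare across servers. A minimal sketch; the function name load_per_core is made up for illustration.

```shell
# load_per_core LOADAVG CPUS: print the 1, 5, and 15 minute load averages
# divided by the number of CPUs. Values near or above 1.00 mean the CPUs
# are at or over capacity.
# Hypothetical helper; LOADAVG is the text of /proc/loadavg.
load_per_core() {
    printf '%s\n' "$1" |
        awk -v cpus="$2" '{ printf "%.2f %.2f %.2f\n", $1 / cpus, $2 / cpus, $3 / cpus }'
}

# Usage:
#   load_per_core "$(cat /proc/loadavg)" "$(nproc)"
```

In the supermarket analogy, this is the number of customers per cashier: 4 customers and 4 cashiers gives 1.00, and 6 customers gives 1.50.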

View resource usage

htop

Provides an overall view of your server performance. Better than top:

Run as root for additional capabilities, like killing processes that belong to other users
Add CPU average for all cores: F2 > Meters, then F5 to select CPU average
Can navigate with mouse. Ex: click Quit in bottom of page to exit
View by user by entering u
Tree view with F5
Refreshes every 2 seconds, but can change that with -d option and number in tenths of seconds

F2                  # Setup mode to view options, set colors, etc
u                   # Open the "Show processes of:" menu to view processes for a specific user
F5                  # Enter/exit tree view
htop -d 70          # Refresh every 7 seconds
F9                  # Kill the selected process - will provide signal options

Devices

lsusb

Lists attached USB devices:

lsusb
Bus 004 Device 033: ID 0bda:8153 Realtek Semiconductor Corp. RTL8153 Gigabit Ethernet Adapter
Bus 004 Device 032: ID 2109:0817 VIA Labs, Inc. USB3.0 Hub             
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
...

lshw

List all hardware devices on your system:

# list only network devices
lshw -class network
  *-network                 
       description: Ethernet interface
       product: Wi-Fi 6 AX200
       ...
  *-network
       description: Ethernet interface
       physical id: e
       ...


lshw -html > lshw-output.html      # output as html
lshw -c memory                     # memory info
lshw -c storage                    # storage
lshw -c multimedia                 # multimedia
lshw -c cpu                        # cpu