Text processing

Most system data on Linux is stored as plain text. Log files record application events, configuration files control behavior, CSV exports carry database contents, and /proc files expose kernel state. The command line provides a set of composable tools for filtering, transforming, and extracting data from these files without requiring a database query language or spreadsheet software.

The real power is in combining these tools in pipelines. Each command in a pipeline does one thing and does it well. A two-minute investigation of a production issue often comes down to three or four of these commands chained together. This page covers the tools you will reach for most frequently.

Viewing and counting text

wc

wc counts lines, words, and bytes in a file or from stdin. A newline counts as one byte.

The following table shows the common options:

Option	Counts
`-l`	Lines only
`-w`	Words only
`-c`	Bytes only
`-m`	Characters only (differs from `-c` for multibyte text such as UTF-8)

Count lines in a log file to check whether it has grown since last night:

wc -l /var/log/auth.log             # 1274 /var/log/auth.log
wc -w /etc/nginx/nginx.conf         # word count of a config file
ls -1 /etc | wc -l                  # number of entries in /etc

head

head prints the first 10 lines of a file. Specify a different number with -n:

head -n 20 /var/log/syslog          # first 20 lines
ls /bin | head -n 5                 # first 5 filenames in /bin
head -n 3 access.log | wc -w        # count words in the first 3 lines

tail

tail prints the last 10 lines of a file. Pass -f to follow a log file in real time as new lines are written:

tail -f /var/log/nginx/error.log                    # live log watching
tail -n 50 /var/log/syslog                          # last 50 lines
tail -n +3 /etc/hosts                               # everything starting at line 3
head -n 10 /var/log/auth.log | tail -n 3            # lines 8, 9, and 10
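The head and tail combination above generalizes to any line range. The following is a minimal sketch, assuming a hypothetical helper named line_range and a throwaway sample file created with mktemp:

```shell
# line_range FILE START END: print lines START through END (inclusive).
# line_range is a hypothetical helper name, not a standard command.
line_range() {
    head -n "$3" "$1" | tail -n "+$2"
}

# Build a six-line sample file for the demonstration.
sample=$(mktemp)
printf '%s\n' one two three four five six > "$sample"

middle=$(line_range "$sample" 3 5)
echo "$middle"        # three, four, five (one per line)

rm -f "$sample"
```

head trims everything after the end line, then tail -n +START drops everything before the start line.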

cat and tac

cat (concatenate) prints files in order from top to bottom. tac prints them in reverse, from bottom to top:

cat file1.txt file2.txt > combined.txt
tac /var/log/apache2/access.log | head -20    # most recent 20 requests

tac is useful for logs that are already in chronological order. sort -r would reorder the lines by their text rather than reverse their positions, which gives the wrong result when the timestamp appears mid-line.

Extracting columns and fields

cut

cut extracts columns from a line. Specify columns by field number (tab-delimited by default) or by character position.

The following table shows the main options:

Option	Behavior
`-fN`	Print field N (tab-delimited)
`-f2,4`	Print fields 2 and 4
`-f2-4`	Print fields 2 through 4
`-cN`	Print character N
`-c2-4`	Print characters 2 through 4
`-d,`	Change the field delimiter to `,`

Extract the username field from /etc/passwd:

cut -d: -f1 /etc/passwd             # first field, colon-delimited
cut -d: -f1,7 /etc/passwd           # usernames and shells
cut -f1-3 report.tsv                # first three tab-delimited columns
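To see what cut produces without touching a real system file, the following sketch runs the same options against two made-up lines in /etc/passwd format. The account names are assumptions for illustration only:

```shell
# Two sample lines in /etc/passwd format (made-up accounts).
passwd_sample='alice:x:1001:1001:Alice:/home/alice:/bin/bash
bob:x:1002:1002:Bob:/home/bob:/usr/sbin/nologin'

# Field 1 only: the usernames.
users=$(printf '%s\n' "$passwd_sample" | cut -d: -f1)

# Fields 1 and 7: cut joins them with the input delimiter (a colon here).
user_shells=$(printf '%s\n' "$passwd_sample" | cut -d: -f1,7)

echo "$users"
echo "$user_shells"
```

Note that when several fields are selected, cut separates them in the output with the same delimiter it used to split the input.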

Combining text

paste

paste combines files side by side, separating columns with a tab by default:

paste usernames.txt emails.txt                      # tab-delimited
paste -d, names.txt scores.txt                      # comma-delimited
paste -d "\n" list1.txt list2.txt                   # interleave lines from two files
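The difference between a comma delimiter and a newline delimiter is easiest to see on a small input. This sketch assumes two throwaway files with made-up names and scores:

```shell
# Create two small sample files (contents are illustrative).
names=$(mktemp)
scores=$(mktemp)
printf '%s\n' alice bob > "$names"
printf '%s\n' 90 85 > "$scores"

# Side by side, joined with a comma: one output line per input line.
csv=$(paste -d, "$names" "$scores")

# Joined with a newline: lines from the two files alternate.
interleaved=$(paste -d '\n' "$names" "$scores")

echo "$csv"
echo "$interleaved"

rm -f "$names" "$scores"
```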

diff

diff compares two files and prints their differences. The output notation uses < for lines only in the first file and > for lines only in the second:

diff /etc/hosts /etc/hosts.backup
1c1                                 # line 1 in file 1 differs from line 1 in file 2
< 192.168.1.100 web-01
---
> 192.168.1.101 web-01

Filter the output to see only the changed lines without the context markers:

diff config.current config.backup | grep '^[<>]'
diff config.current config.backup | grep '^[<>]' | cut -c3-     # remove the < > prefix

Transforming text

tr

tr translates characters: it takes two sets of characters and replaces each character in the first set with the corresponding character in the second set. Pass -d to delete characters instead.

Convert a colon-delimited string to newlines for readable output:

echo "$PATH" | tr ':' '\n'

Remove all spaces from a string:

echo "hello world" | tr -d ' '     # helloworld

Convert uppercase to lowercase:

echo "SERVER-01" | tr 'A-Z' 'a-z'        # server-01

rev

rev reverses the characters on each line. This is useful when you need to extract the last field from lines that have varying numbers of fields: reverse the line, cut the first field, then reverse the result back. For example, extract the last path component from each line regardless of how many directories precede it:

rev /etc/shells | cut -d'/' -f1 | rev       # extract shell name from each path
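The rev, cut, rev idiom can be checked on paths of different depths. A small sketch, using made-up sample paths rather than the real /etc/shells:

```shell
# Sample paths with different numbers of components (illustrative).
paths='/bin/sh
/usr/bin/zsh
/usr/local/bin/fish'

# Reverse each line, keep the first /-delimited field, reverse it back.
names=$(printf '%s\n' "$paths" | rev | cut -d/ -f1 | rev)
echo "$names"        # sh, zsh, fish (one per line)
```

After the first rev, the last component is always field 1, so a fixed field number works no matter how deep the path is.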

Sorting and deduplication

sort

sort reorders lines in ascending alphabetical order by default.

The following table shows common options:

Option	Behavior
`-r`	Reverse (descending) order
`-n`	Numeric sort
`-nr`	Numeric descending
`-u`	Remove duplicate lines
`-f`	Ignore case
`-k N`	Sort by field N (whitespace-delimited)
`-t ,`	Change field delimiter to `,`
`-o`	Write output to the specified file

Sort by a specific field, for example the HTTP status code in field 9 of an access log:

sort -k9,9n access.log              # sort numerically by status code (field 9 only)
sort -k3 -rn report.txt             # sort by field 3, numeric descending
cut -f3 data.tsv | sort -nr         # sort the third column numerically

Sort by characters 4 and 5 of field 2 (useful for date fields):

sort -k 2.4,2.5 filename.txt

uniq

uniq removes adjacent duplicate lines. Always sort first so that duplicates are adjacent.

The following table shows common options:

Option	Behavior
`-c`	Prefix each line with the count of occurrences
`-d`	Print only lines that appear more than once
`-u`	Print only lines that appear exactly once
`-i`	Ignore case when comparing
`-f N`	Skip the first N fields before comparing

Find the most common HTTP status codes in an access log:

awk '{print $9}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -10

Find the grade that appears most often in a grades file:

cut -f1 grades | sort | uniq -c | sort -nr | head -1 | cut -c9
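The sort | uniq -c | sort -rn pattern can be tried on inline data. The sketch below uses made-up status codes and reads the winning value with awk instead of a fixed character position, which is an alternative to cut -c9 that does not depend on how uniq pads its counts:

```shell
# Made-up status codes, one per line, standing in for a log column.
codes='200
404
200
500
200
404'

# Count each value, put the largest count first, and print "value count".
top=$(printf '%s\n' "$codes" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2, $1}')
echo "$top"        # 200 3
```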

Pattern matching with grep

grep searches files for lines that match a pattern and prints each matching line. It is the most frequently used filtering tool on the command line.

grep [OPTIONS] PATTERN [FILE...]

The following table shows common options:

Option	Behavior
`-c`	Print a count of matching lines instead of the lines themselves
`-E`	Enable extended regular expressions (ERE)
`-f`	Read patterns from a file
`-i`	Ignore case
`-l`	Print only filenames of files with at least one match
`-n`	Print the line number before each match
`-o`	Print only the matching text, not the full line
`-q`	Silent mode – exit status only, no output
`-r`	Recursively search directories
`-R`	Recursive search, following symbolic links
`-v`	Invert match: print lines that do not match
`-w`	Match whole words only
`-P`	Enable Perl-compatible regular expressions

Search for failed SSH login attempts in the auth log:

grep "Failed password" /var/log/auth.log
grep -c "Failed password" /var/log/auth.log     # count failures
grep -i "error" /var/log/syslog                  # case-insensitive
grep -R "PermitRootLogin" /etc/ssh/             # recursive config search
grep -v "^#" /etc/nginx/nginx.conf              # non-comment lines only

Double-filter to find root login failures specifically:

grep "authenticating" /var/log/auth.log | grep "root"

Anchors and patterns

Pattern anchors control where in the line a match must occur:

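grep '^root' /etc/passwd        # lines that begin with "root"
grep 'nologin$' /etc/passwd     # lines that end with "nologin"
grep -v '^$' file.txt           # non-blank lines (match blank lines and invert)
grep '......' file.txt          # lines with at least 6 characters
grep 'daemon.*nologin' /etc/passwd    # "daemon" followed later on the line by "nologin"

Quote every pattern so that the shell does not expand characters such as * and $ before grep sees them.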
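To see how an anchor narrows a match, the following sketch counts matches with and without anchors on three made-up lines in /etc/passwd format (the chroot account is an assumption chosen to contain "root" mid-word):

```shell
sample='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
chroot:x:999:999::/home/chroot:/bin/bash'

starts=$(printf '%s\n' "$sample" | grep -c '^root')        # anchored at line start
ends=$(printf '%s\n' "$sample" | grep -c 'nologin$')       # anchored at line end
anywhere=$(printf '%s\n' "$sample" | grep -c 'root')       # no anchor

echo "$starts $ends $anywhere"        # 1 1 2
```

The unanchored pattern also matches the chroot line, which is usually not what you want when searching for a username.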

Extended regular expressions

Pass -E to enable extended regex, which adds support for |, +, ?, {, and (. The older egrep command is equivalent, but current GNU grep releases mark it as obsolescent and print a warning:

grep -E "^root|^dbus" /etc/passwd               # lines beginning with root or dbus
grep -E '^(web|app|db)-[0-9]+' /etc/hosts       # match server naming patterns
grep -E "(daemon|nobody).*nologin" /etc/passwd

Regex reference

The following table describes regex metacharacters:

Metacharacter	Matches
`.`	Any single character except newline
`*`	Zero or more of the preceding character
`+`	One or more of the preceding character (ERE)
`?`	Zero or one of the preceding character (ERE)
`^`	Start of line
`$`	End of line
`[abc]`	Any character in the set
`[^abc]`	Any character not in the set
`[a-z]`	Any character in the range
`\`	Escape the following character

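The following table describes POSIX character classes. A class is only valid inside a bracket expression, so it is written with two pairs of brackets, for example [[:digit:]] or [[:alpha:]_]: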

Class	Matches
`[:alnum:]`	Alphanumeric characters
`[:alpha:]`	Alphabetic characters
`[:digit:]`	Digits
`[:lower:]`	Lowercase letters
`[:upper:]`	Uppercase letters
`[:space:]`	Whitespace including line breaks
`[:punct:]`	Punctuation
`[:xdigit:]`	Hexadecimal digits
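As an illustration of character classes inside bracket expressions, this sketch picks out lines that start with a name of the form letters, hyphen, digits. The host entries are made up:

```shell
hosts='web-01 10.0.0.5
db_02 10.0.0.6
# web-03 10.0.0.7'

# Letters, a hyphen, digits, then whitespace, anchored at the line start.
matched=$(printf '%s\n' "$hosts" | grep -E '^[[:alpha:]]+-[[:digit:]]+[[:space:]]')
echo "$matched"        # web-01 10.0.0.5
```

The second line is rejected because of the underscore, and the third because it starts with a comment marker.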

Perl regex shortcuts

These shortcuts require the -P flag:

Shortcut	Matches
`\s`	Any whitespace
`\S`	Any non-whitespace
`\d`	Any digit
`\D`	Any non-digit
`\w`	Word character (letter, digit, underscore)
`\W`	Non-word character

Quantifiers

Quantifiers control how many times the preceding expression must match:

grep -E 'T{5}'    file   # T appears exactly 5 times consecutively
grep -E 'T{3,6}'  file   # T appears 3 to 6 times
grep -E 'T{5,}'   file   # T appears 5 or more times

Back references

A back reference lets you refer to a previous capture group. This example matches any HTML opening and closing tag pair where the tag names match:

grep -E '<([a-zA-Z]*)>.*</\1>' file.html

The \1 means “whatever was matched inside the first set of parentheses.”
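A short sketch of the back reference on inline input, assuming GNU grep (back references in extended regex are an extension beyond the POSIX standard). The HTML fragments are made up:

```shell
html='<b>bold</b>
<i>mismatch</b>
<em>emphasis</em>'

# Keep only lines whose closing tag name equals the opening tag name.
matched=$(printf '%s\n' "$html" | grep -E '<([a-zA-Z]*)>.*</\1>')
echo "$matched"
```

The middle line is dropped because the group captured i, and no closing i tag follows.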

Stream editing with sed

sed transforms text by applying a sequence of instructions called a sed script to each line of input. The most common script replaces one string with another:

sed 's/regexp/replacement/' input-file

The s command replaces the first occurrence on each line. Append g to replace all occurrences:

echo "one one two" | sed 's/one/yes/g'          # yes yes two

The following table shows common sed commands:

Script	Effect
`s/old/new/`	Replace first occurrence per line
`s/old/new/g`	Replace all occurrences per line
`s/old/new/i`	Case-insensitive replacement
`s/old/new/gI`	Global case-insensitive replacement
`2s/old/new/`	Replace only on line 2
`s/old/new/3`	Replace only the third occurrence on each line
`-i`	Modify the file in place
`d`	Delete matching lines
`/pattern/d`	Delete lines matching a pattern
`nd`	Delete line number n
`-n 'Np'`	Print only line N
`-n '/pattern/p'`	Print lines matching a pattern
`y/abc/xyz/`	Translate characters (like `tr`)

Replace a hostname in a configuration file:

sed -i 's/old-server\.internal/new-server.internal/g' /etc/app/config.conf    # \. matches a literal dot

Delete comment lines and blank lines from a config file:

sed -e '/^#/d' -e '/^$/d' /etc/nginx/nginx.conf

Print a specific line range:

sed -n '5,10p' /var/log/syslog      # print lines 5 through 10

Subexpressions

Subexpressions let you capture and rearrange parts of a matched string. Wrap the part you want to capture in \( and \), then reference it in the replacement with \1, \2, and so on.

For example, reformat a date string from YYYY-MM-DD to DD/MM/YYYY:

echo "2025-03-29" | sed 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/\3\/\2\/\1/'
# 29/03/2025

Rename image files by moving a version number from the end of the name to before the extension:

sed "s/image\.jpg\.\([1-3]\)/image\1.jpg/"
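The image rename expression above has no input attached, so here is a minimal sketch that feeds it three made-up filenames. The second command is an assumption about how you might use the result: it prints mv commands for review rather than renaming anything, using & to stand for the whole matched text:

```shell
files='image.jpg.1
image.jpg.2
image.jpg.3'

# Move the captured version digit in front of the extension.
renamed=$(printf '%s\n' "$files" | sed 's/image\.jpg\.\([1-3]\)/image\1.jpg/')

# Print (do not run) a mv command per file; & is the entire match.
mv_cmds=$(printf '%s\n' "$files" | sed 's/image\.jpg\.\([1-3]\)/mv & image\1.jpg/')

echo "$renamed"
echo "$mv_cmds"
```

Printing the commands first lets you inspect them before piping the list to a shell.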

Long regex patterns

Break a long regex into named shell variables for readability:

areacode='\([0-9]*\)'
state='\([A-Z][A-Z]\)'
city='\([^@]*\)'

regexp="${areacode}@${state}@${city}@"
replacement='\1\t\2\t\3\n'

sed "s/$regexp/$replacement/g" data.txt
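The input format for the variables above is not shown, so the following sketch assumes records of the form areacode@STATE@city@ packed onto one line, and assumes GNU sed, which interprets \t and \n in the replacement. The sample record values are made up:

```shell
areacode='\([0-9]*\)'
state='\([A-Z][A-Z]\)'
city='\([^@]*\)'

regexp="${areacode}@${state}@${city}@"
replacement='\1\t\2\t\3\n'

# Two records on a single line, each field terminated by @ (assumed format).
record='212@NY@New York@646@NY@Brooklyn@'

table=$(printf '%s\n' "$record" | sed "s/$regexp/$replacement/g")
printf '%s\n' "$table"
```

Each record becomes one tab-separated line. The variables are expanded by the shell inside the double quotes before sed parses the script.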

Report generation with awk

awk processes structured text files and generates reports. It reads each line, splits it into fields, and runs your awk program against each line. Fields are referenced as $1, $2, and so on. $0 is the entire line and $NF is the last field.

awk '{print $2}' /etc/hosts                     # print second column of hosts file
awk -F: '{print $1}' /etc/passwd                # usernames, colon-delimited
df | awk 'FNR>1 {print $4}'                     # available space, skip header

Filtering with patterns

Run an awk program only on lines that match a pattern:

awk '/ERROR/' /var/log/app.log                  # lines containing ERROR
awk '!/^#/' /etc/hosts                          # non-comment lines
awk '$9 >= 500' access.log                      # HTTP 5xx errors (field 9 is status code)
awk 'NR>=10 && NR<=20' file.txt                 # lines 10 through 20

BEGIN and END blocks

BEGIN runs before processing any lines. END runs after all lines have been processed. Both are useful for printing headers, footers, and computed summaries:

awk -F'\t' \
'BEGIN {print "Recent entries:"} \
$3~/^2025/{print $4, "(" $3 ").", "\"" $2 "\""} \
END {print "End of report"}' \
data.tsv

Sum a column and print the total:

seq 1 100 | awk '{s+=$1} END {print "Total:", s}'     # Total: 5050

Arrays and loops in awk

awk arrays act as hash maps indexed by any string key. This makes them useful for counting occurrences:

awk '{counts[$9]++} END {for (code in counts) print counts[code], code}' access.log

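The previous command counts HTTP status codes across an access log. counts[$9]++ increments the count for each status code value in field 9. The END block prints each count followed by its status code. The order in which for (code in counts) visits keys is unspecified, so pipe the output to sort if you need a ranking.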
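To check the field numbering, this sketch runs the counting program on three made-up access log lines. The log layout is assumed to put the status code in field 9, as in the examples above:

```shell
log='10.0.0.1 - - [29/Mar/2025:14:30:15 +0000] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [29/Mar/2025:14:30:16 +0000] "GET /missing HTTP/1.1" 404 128
10.0.0.1 - - [29/Mar/2025:14:30:17 +0000] "GET /about HTTP/1.1" 200 640'

# Count by field 9, then sort because awk does not guarantee key order.
counts=$(printf '%s\n' "$log" \
    | awk '{counts[$9]++} END {for (code in counts) print counts[code], code}' \
    | sort -rn)
echo "$counts"
```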

Find duplicate files by checksum:

md5sum *.jpg | awk '{counts[$1]++} END {for (key in counts) print counts[key], key}' | sort -rn

awk cheat sheet

The following table shows common awk programs:

Program	Description
`'{print $1}'`	Print first column
`'{print $1, $3}'`	Print columns 1 and 3
`'NR==3'`	Print line 3
`'NR>=2 && NR<=4'`	Print lines 2 to 4
`'/pattern/'`	Print lines matching pattern
`'!/pattern/'`	Print lines not matching pattern
`'{print NR, $0}'`	Print lines with line numbers
`'$2 > 50'`	Print lines where field 2 is greater than 50
`'{sum+=$1} END {print sum}'`	Sum field 1
`'{sum+=$1} END {print sum/NR}'`	Average of field 1
`'BEGIN {FS=","} {print $1}'`	Comma-delimited: print first field
`'{print toupper($1)}'`	Convert first field to uppercase
`'{sub(/old/, "new"); print}'`	Replace first occurrence per line
`'{gsub(/old/, "new"); print}'`	Replace all occurrences per line
`'{print $NF}'`	Print last field

Generating text

date

date prints the current date and time in any format you specify. Format strings begin with + and contain % sequences.

The following table shows common format strings:

Format string	Example output	Description
`%Y-%m-%d`	2025-03-29	ISO 8601 date
`%d-%m-%Y`	29-03-2025	Day-Month-Year
`%m/%d/%Y`	03/29/2025	Month/Day/Year (US)
`%A, %B %d, %Y`	Saturday, March 29, 2025	Full date with names
`%H:%M:%S`	14:30:15	24-hour time
`%Y-%m-%d %H:%M:%S`	2025-03-29 14:30:15	Full timestamp
`%s`	1743258615	Unix epoch seconds (example value assumes UTC)

Generate a timestamped filename for a backup archive:

tar -czf "backup_$(date +%Y-%m-%d).tar.gz" /var/www/html

seq

seq prints a sequence of numbers. The basic form is seq LOW HIGH. Add a third argument between them to set the step:

seq 1 5             # 1 2 3 4 5
seq 1 2 10          # 1 3 5 7 9 (odd numbers)
seq 10 -1 1         # 10 9 8 ... 1 (countdown)
seq 0 0.5 2         # 0.0 0.5 1.0 1.5 2.0
seq -w 8 10         # 08 09 10 (zero-padded to equal width)
seq -s, 1 5         # 1,2,3,4,5 (comma-separated)
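seq output is often fed straight into other commands. The sketch below assumes GNU seq and uses a made-up report_ filename pattern to show zero padding with -w and with a printf-style -f format:

```shell
# Equal-width padding: narrower numbers get leading zeros.
padded=$(seq -w 8 10 | paste -s -d ' ' -)

# A printf-style format gives full control over each line.
names=$(seq -f 'report_%02g.txt' 1 3 | paste -s -d ' ' -)

# A custom separator instead of newlines.
csv=$(seq -s, 1 5)

echo "$padded"        # 08 09 10
echo "$names"         # report_01.txt report_02.txt report_03.txt
echo "$csv"           # 1,2,3,4,5
```

paste -s joins the lines onto one row here only to make the output compact.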

Brace expansion

Brace expansion generates sequences of numbers or letters directly in the shell without a separate program:

echo {1..5}                     # 1 2 3 4 5
echo {1..10..2}                 # 1 3 5 7 9 (step of 2)
echo {a..z}                     # a b c ... z
echo {A..Z} | tr -d ' '         # ABCDEFGHIJKLMNOPQRSTUVWXYZ (no spaces)
echo {A..Z} | tr ' ' '\n'       # one letter per line

Brace expansion is useful for creating numbered files or directories in bulk:

mkdir -p logs/{jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec}
touch report_{2023..2025}.txt

yes

yes prints the same line repeatedly until killed. By default it prints y. Pipe it to a command that requires interactive confirmation, or pipe it to head to generate repeated test input:

yes | apt-get install package-name       # answer y to every prompt (same effect as apt-get -y)
yes "test data" | head -100              # generate 100 lines of test data

Checksums and duplicate detection

md5sum and sha1sum

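md5sum computes a 32-character hash (checksum) of a file’s contents. Files with identical contents produce identical checksums. sha1sum produces a 40-character hash and is more collision-resistant, although practical collisions are known for both algorithms, so prefer sha256sum when a file comes from an untrusted source: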

md5sum /etc/hosts                       # hash the file
sha1sum deployment.tar.gz              # verify a downloaded archive

Verify a file’s integrity by comparing against a published checksum:

sha1sum -c checksums.txt               # -c reads lines of the form "hash  filename" and checks each file

Detecting duplicate files

This pipeline finds duplicate files in the current directory by comparing checksums. Walk through each step to understand how it builds:

Step 1: compute the checksum for all .txt files:

md5sum *.txt | cut -d' ' -f1

Step 2: sort so that identical checksums are adjacent:

md5sum *.txt | cut -d' ' -f1 | sort

Step 3: count occurrences of each checksum:

md5sum *.txt | cut -d' ' -f1 | sort | uniq -c

Step 4: sort by count descending so duplicates appear first:

md5sum *.txt | cut -d' ' -f1 | sort | uniq -c | sort -rn

Step 5: filter out non-duplicates (count of 1):

md5sum *.txt | cut -d' ' -f1 | sort | uniq -c | sort -rn | grep -v "^      1 "

Once you have a checksum for a duplicate, find the filenames:

md5sum *.txt | grep "5bbf5a52328e7439ae6e719dfe712200" | cut -d' ' -f3
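The steps above can also be collapsed into a single pass that reports the filenames directly. This is a sketch rather than a replacement for the pipeline: it assumes filenames without spaces and builds its own throwaway directory with made-up files so it can be run safely:

```shell
# Set up three sample files, two of them identical.
workdir=$(mktemp -d)
printf 'same\n' > "$workdir/a.txt"
printf 'same\n' > "$workdir/b.txt"
printf 'different\n' > "$workdir/c.txt"

# Group filenames by checksum and print only groups with more than one file.
# The command substitution runs in a subshell, so the cd does not leak out.
dupes=$(cd "$workdir" && md5sum *.txt | sort | awk '
    { n[$1]++; files[$1] = files[$1] " " $2 }
    END { for (h in n) if (n[h] > 1) print n[h] ":" files[h] }')

echo "$dupes"        # 2: a.txt b.txt
rm -rf "$workdir"
```

Keeping the filename next to its checksum avoids the second md5sum run that the grep step needs.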

Real-world pipeline example: Apache log analysis

An Apache access log has fields for IP address, timestamp, HTTP method, path, status code, and bytes transferred. This section shows how to investigate a spike in 404 errors.

Filter only 404 lines for today’s date:

grep "$(date +%d/%b/%Y)" /var/log/apache2/access.log | grep ' 404 '

Find the top 15 paths generating 404 errors:

grep ' 404 ' /var/log/apache2/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -15

Find the IP addresses making the most requests that result in 404 errors:

grep ' 404 ' /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10

Check whether a single IP is making an unusually high number of 401 errors (failed authentication):

grep ' 401 ' /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -5
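One caveat with grep ' 404 ' is that it matches the string anywhere on the line, including a response size of 404 bytes. The sketch below compares it with an exact test on field 9 in awk, using four made-up log lines in the combined log format:

```shell
log='10.0.0.1 - - [29/Mar/2025:14:30:15 +0000] "GET /old-page HTTP/1.1" 404 128 "-" "curl/8.0"
10.0.0.2 - - [29/Mar/2025:14:30:16 +0000] "GET /old-page HTTP/1.1" 404 128 "-" "curl/8.0"
10.0.0.1 - - [29/Mar/2025:14:30:17 +0000] "GET /favicon.ico HTTP/1.1" 404 64 "-" "curl/8.0"
10.0.0.3 - - [29/Mar/2025:14:30:18 +0000] "GET / HTTP/1.1" 200 404 "-" "curl/8.0"'

# The last line is a 200 response of 404 bytes, which grep also counts.
grep_hits=$(printf '%s\n' "$log" | grep -c ' 404 ')
awk_hits=$(printf '%s\n' "$log" | awk '$9 == 404' | wc -l | tr -d ' ')

# Top paths among real 404 responses; the final awk normalizes the padding.
top_paths=$(printf '%s\n' "$log" | awk '$9 == 404 {print $7}' \
    | sort | uniq -c | sort -rn | awk '{print $1, $2}')

echo "$grep_hits $awk_hits"        # 4 3
echo "$top_paths"
```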

Leveraging text files for recurring tasks

When data is structured and stored in a text file, you can write commands once and rerun them on updated data. The general process is:

1. Identify the business problem that involves data.
2. Store the data in a plain text file in a consistent format.
3. Write Linux commands to process the file and solve the problem.
4. Capture the commands in a script so they are easy to repeat.

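A practical example: check domain expiration dates from a list. The following script looks up a single domain, passed as its first argument, with whois and prints the expiry date followed by the domain name. It assumes the whois output contains a Registry Expiry Date: line, and the label varies between top-level domains: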

#!/bin/bash
expdate=$(date \
            --date "$(whois "$1" \
                | grep 'Registry Expiry Date:' \
                | awk '{print $4}')" \
            +'%Y-%m-%d')
echo "$expdate $1"

Call that script from a loop that reads from a file of domain names:

while read -r domain; do
    ./check-expiry "$domain"
    sleep 5
done < domains.txt