Text processing
Most system data on Linux is stored as plain text. Log files record application events, configuration files control behavior, CSV exports carry database contents, and /proc files expose kernel state. The command line provides a set of composable tools for filtering, transforming, and extracting data from these files without requiring a database query language or spreadsheet software.
The real power is in combining these tools in pipelines. Each command in a pipeline does one thing and does it well. A two-minute investigation of a production issue often comes down to three or four of these commands chained together. This page covers the tools you will reach for most frequently.
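As a taste of what such a pipeline looks like, the sketch below runs on sample data generated inline rather than a real log file:

```shell
# Which message appears most often? Group identical lines together,
# count each group, then rank the groups by count.
printf 'error\nok\nerror\nwarn\nerror\nok\n' \
  | sort \
  | uniq -c \
  | sort -rn
# prints "3 error", "2 ok", "1 warn" (counts are left-padded)
```

Each stage is covered in detail later in this page; the point here is that four small commands answer a real question in one line.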
Viewing and counting text
wc
wc counts lines, words, and bytes in a file or from stdin. A newline counts as one byte, so line and byte counts include the line endings.
The following table shows the common options:
| Option | Counts |
|---|---|
| -l | Lines only |
| -w | Words only |
| -c | Bytes only (use -m to count multibyte characters) |
Count lines in a log file to check whether it has grown since last night:
wc -l /var/log/auth.log # 1274 /var/log/auth.log
wc -w /etc/nginx/nginx.conf # word count of a config file
ls -1 /etc | wc -l # number of entries in /etc
head
head prints the first 10 lines of a file. Specify a different number with -n:
head -20 /var/log/syslog # first 20 lines
ls /bin | head -5 # first 5 filenames in /bin
head -3 access.log | wc -w # count words in the first 3 lines
tail
tail prints the last 10 lines of a file. Pass -f to follow a log file in real time as new lines are written:
tail -f /var/log/nginx/error.log # live log watching
tail -n 50 /var/log/syslog # last 50 lines
tail -n +3 /etc/hosts # everything starting at line 3
head -10 /var/log/auth.log | tail -3 # lines 8, 9, and 10
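The head | tail combination generalizes to any line range: to print lines N through M, take the first M lines and keep the last M-N+1 of them. A quick sketch using seq as input:

```shell
# Lines 8-10 of a 20-line stream: first 10 lines, then the last 3 of those.
seq 20 | head -10 | tail -3
# 8
# 9
# 10
```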
cat and tac
cat (concatenate) prints files in order from top to bottom. tac prints them in reverse, from bottom to top:
cat file1.txt file2.txt > combined.txt
tac /var/log/apache2/access.log | head -20 # most recent 20 requests
tac is useful for reversing chronological logs that sort -r cannot handle, such as when the timestamp appears mid-line rather than at the start of each line.
Extracting columns and fields
cut
cut extracts columns from a line. Specify columns by field number (tab-delimited by default) or by character position.
The following table shows the main options:
| Option | Behavior |
|---|---|
| -fN | Print field N (tab-delimited) |
| -f2,4 | Print fields 2 and 4 |
| -f2-4 | Print fields 2 through 4 |
| -cN | Print character N |
| -c2-4 | Print characters 2 through 4 |
| -d, | Change the field delimiter to , |
Extract the username field from /etc/passwd:
cut -d: -f1 /etc/passwd # first field, colon-delimited
cut -d: -f1,7 /etc/passwd # usernames and shells
cut -f1-3 report.tsv # first three tab-delimited columns
Combining text
paste
paste combines files side by side, separating columns with a tab by default:
paste usernames.txt emails.txt # tab-delimited
paste -d, names.txt scores.txt # comma-delimited
paste -d "\n" list1.txt list2.txt # interleave lines from two files
diff
diff compares two files and prints their differences. The output notation uses < for lines only in the first file and > for lines only in the second:
diff /etc/hosts /etc/hosts.backup
1c1 # line 1 in file 1 differs from line 1 in file 2
< 192.168.1.100 web-01
---
> 192.168.1.101 web-01
Filter the output to see only the changed lines without the context markers:
diff config.current config.backup | grep '^[<>]'
diff config.current config.backup | grep '^[<>]' | cut -c3- # remove the < > prefix
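A self-contained run of the same idea, using two throwaway files built with mktemp (the contents are invented for the demo):

```shell
# Create two files that differ on one line, then show only the changed lines.
a=$(mktemp)
b=$(mktemp)
printf 'alpha\nbravo\ncharlie\n' > "$a"
printf 'alpha\nbrav0\ncharlie\n' > "$b"
diff "$a" "$b" | grep '^[<>]' | cut -c3-
# bravo
# brav0
rm -f "$a" "$b"
```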
Transforming text
tr
tr translates characters: it takes two sets of characters and replaces each character in the first set with the corresponding character in the second set. Pass -d to delete characters instead.
Convert a colon-delimited string to newlines for readable output:
echo $PATH | tr ':' '\n'
Remove all spaces from a string:
echo "hello world" | tr -d ' ' # helloworld
Convert uppercase to lowercase using POSIX character classes:
echo "SERVER-01" | tr '[:upper:]' '[:lower:]' # server-01
rev
rev reverses the characters on each line. This is useful when you need to extract the last field from lines that have varying numbers of columns. For example, extract the last word from each line regardless of how many words precede it:
rev /etc/shells | cut -d'/' -f1 | rev # extract shell name from each path
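The same trick works on any input. A quick demonstration with inline sample lines:

```shell
# Last whitespace-delimited word of each line, regardless of word count:
# reverse the line, take the (now first) field, reverse it back.
printf 'one two three\nfour five\n' | rev | cut -d' ' -f1 | rev
# three
# five
```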
Sorting and deduplication
sort
sort reorders lines in ascending alphabetical order by default.
The following table shows common options:
| Option | Behavior |
|---|---|
| -r | Reverse (descending) order |
| -n | Numeric sort |
| -nr | Numeric descending |
| -u | Remove duplicate lines |
| -f | Ignore case |
| -k N | Sort by field N (whitespace-delimited) |
| -t , | Change field delimiter to , |
| -o | Write output to the specified file |
Sort a log file numerically by a chosen field (for example, the HTTP status code in field 9 of an Apache access log):
sort -k9 -n access.log # sort by status code (field 9)
sort -k3 -rn report.txt # sort by field 3, numeric descending
cut -f3 data.tsv | sort -nr # sort the third column numerically
Sort by characters 4 and 5 of field 2 (useful for date fields):
sort -k 2.4,2.5 filename.txt
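To see how character positions inside a field behave, the sketch below sorts inline sample lines by the month digits of an ISO date in field 2. It uses -t, so fields are exact; with default whitespace-separated fields, the leading blank counts as character 1 of the field unless you add -b:

```shell
# -k 2.6,2.7 = sort key is characters 6 through 7 of field 2 (the month).
printf 'a,2025-11-01\nb,2025-03-15\nc,2025-07-09\n' | sort -t, -k 2.6,2.7
# b,2025-03-15
# c,2025-07-09
# a,2025-11-01
```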
uniq
uniq removes adjacent duplicate lines. Always sort first so that duplicates are adjacent.
The following table shows common options:
| Option | Behavior |
|---|---|
| -c | Prefix each line with the count of occurrences |
| -d | Print only lines that appear more than once |
| -u | Print only lines that appear exactly once |
| -i | Ignore case when comparing |
| -f N | Skip the first N fields before comparing |
Find the most common HTTP status codes in an access log:
awk '{print $9}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -10
Find the grade that appears most often in a grades file:
cut -f1 grades | sort | uniq -c | sort -nr | head -1 | cut -c9
grep for pattern matching
grep stands for Global Regular Expression Print, taken from the ed command g/re/p. It searches files for lines that match a pattern and prints each matching line. It is the most frequently used filtering tool on the command line.
grep [OPTIONS] PATTERN [FILE...]
The following table shows common options:
| Option | Behavior |
|---|---|
| -c | Print a count of matching lines instead of the lines themselves |
| -e | Specify a pattern. Repeat the flag to match any of several patterns |
| -E | Enable extended regular expressions (ERE) |
| -f | Read patterns from a file |
| -i | Ignore case |
| -l | Print only filenames of files with at least one match |
| -n | Print the line number before each match |
| -o | Print only the matching text, not the full line |
| -q | Silent mode: exit status only, no output |
| -r | Recursively search directories |
| -R | Recursive search, following symbolic links |
| -v | Invert match: print lines that do not match |
| -w | Match whole words only |
| -P | Enable Perl-compatible regular expressions |
Search for failed SSH login attempts in the auth log:
grep "Failed password" /var/log/auth.log
grep -c "Failed password" /var/log/auth.log # count failures
grep -i "error" /var/log/syslog # case-insensitive
grep -R "PermitRootLogin" /etc/ssh/ # recursive config search
grep -v "^#" /etc/nginx/nginx.conf # non-comment lines only
Double-filter to find root login failures specifically:
grep "authenticating" /var/log/auth.log | grep "root"
Anchors and patterns
Pattern anchors control where in the line a match must occur:
grep '^root' /etc/passwd # lines that begin with "root"
grep 'nologin$' /etc/passwd # lines that end with "nologin"
grep -v '^$' file.txt # non-blank lines (match blank lines and invert)
grep '......' file.txt # lines with at least 6 characters
grep 'daemon.*nologin' /etc/passwd # quoting stops the shell from expanding *
Extended regular expressions
Pass -E (or run egrep) to enable extended regex, which adds support for |, +, ?, {, and (:
grep -E "^root|^dbus" /etc/passwd # lines beginning with root or dbus
grep -E '^(web|app|db)-[0-9]+' /etc/hosts # match server naming patterns
egrep "(daemon|nobody).*nologin" /etc/passwd
Matching multiple patterns
Two approaches let you match lines that contain any one of several patterns.
Use | inside a single quoted expression with -E. The shell passes the whole quoted string to grep as one pattern:
grep -E "error|warning|critical" /var/log/syslog
grep -E "^root|^daemon|^nobody" /etc/passwd
Use -e to supply each pattern as a separate flag. Each -e argument is an independent pattern and the result is identical, but the form is easier to read when patterns are long or contain characters that would conflict inside a single regex:
grep -e "error" -e "warning" -e "critical" /var/log/syslog
grep -e "^root" -e "^daemon" -e "^nobody" /etc/passwd
Both forms print every line that matches at least one pattern. For large sets of patterns, store them one per line in a file and load them with -f:
grep -f patterns.txt /var/log/syslog
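A self-contained sketch of the -f form, writing two patterns to a temporary file first:

```shell
pats=$(mktemp)
printf 'error\nwarning\n' > "$pats"          # one pattern per line
printf 'boot ok\ndisk error\nfan warning\n' | grep -f "$pats"
# disk error
# fan warning
rm -f "$pats"
```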
Regex reference
The following table describes regex metacharacters:
| Metacharacter | Matches |
|---|---|
| . | Any single character except newline |
| * | Zero or more of the preceding character |
| + | One or more of the preceding character (ERE) |
| ? | Zero or one of the preceding character (ERE) |
| ^ | Start of line |
| $ | End of line |
| [abc] | Any character in the set |
| [^abc] | Any character not in the set |
| [a-z] | Any character in the range |
| \ | Escape the following character |
The following table describes POSIX character classes. They are used inside a bracket expression, so a match for a single digit is written [[:digit:]]:
| Class | Matches |
|---|---|
| [:alnum:] | Alphanumeric characters |
| [:alpha:] | Alphabetic characters |
| [:digit:] | Digits |
| [:lower:] | Lowercase letters |
| [:upper:] | Uppercase letters |
| [:space:] | Whitespace including line breaks |
| [:punct:] | Punctuation |
| [:xdigit:] | Hexadecimal digits |
Perl regex shortcuts
These shortcuts require the -P flag:
| Shortcut | Matches |
|---|---|
| \s | Any whitespace |
| \S | Any non-whitespace |
| \d | Any digit |
| \D | Any non-digit |
| \w | Word character (letter, digit, underscore) |
| \W | Non-word character |
Quantifiers
Quantifiers control how many times the preceding expression must match:
grep -E 'T{5}' file # T appears exactly 5 times consecutively
grep -E 'T{3,6}' file # T appears 3 to 6 times
grep -E 'T{5,}' file # T appears 5 or more times
Back references
A back reference lets you refer to a previous capture group. This example matches any HTML opening and closing tag pair where the tag names match:
egrep '<([a-zA-Z]*)>.*</\1>' file.html
The \1 means “whatever was matched inside the first set of parentheses.”
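A quick check of the back reference on inline input (GNU grep supports \1 in -E mode, though POSIX leaves ERE back references undefined):

```shell
# Only the line whose closing tag repeats the captured opening tag matches.
printf '<b>bold</b>\n<i>oops</b>\n' | grep -E '<([a-z]+)>.*</\1>'
# <b>bold</b>
```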
sed for stream editing
sed stands for Stream EDitor. It transforms text by applying a sequence of instructions called a sed script to each line of input. The most common script replaces one string with another:
sed 's/regexp/replacement/' input-file
The s command replaces the first occurrence on each line. Append g to replace all occurrences:
echo "one one two" | sed 's/one/yes/g' # yes yes two
The following table shows common sed commands:
| Script | Effect |
|---|---|
| s/old/new/ | Replace first occurrence per line |
| s/old/new/g | Replace all occurrences per line |
| s/old/new/i | Case-insensitive replacement |
| s/old/new/gI | Global case-insensitive replacement |
| 2s/old/new/ | Replace only on line 2 |
| s/old/new/3 | Replace only the third occurrence on each line |
| -i | Modify the file in place |
| d | Delete lines (combined with an address) |
| /pattern/d | Delete lines matching a pattern |
| nd | Delete line number n |
| -n 'Np' | Print only line N |
| -n '/pattern/p' | Print lines matching a pattern |
| y/abc/xyz/ | Translate characters (like tr) |
Replace a hostname in a configuration file:
sed -i 's/old-server.internal/new-server.internal/g' /etc/app/config.conf
Delete comment lines and blank lines from a config file:
sed -e '/^#/d' -e '/^$/d' /etc/nginx/nginx.conf
Print a specific line range:
sed -n '5,10p' /var/log/syslog # print lines 5 through 10
Deleting lines
The d command deletes lines. Address a line by number, by $ for the last line, or by a pattern.
Delete the first line
Line 1 is the address. d deletes it and sed moves on to the next line.
sed '1d' file.txt
Delete the last line
$ matches the last line of the file regardless of how many lines there are.
sed '$d' file.txt
Delete a specific line by number
Replace 3 with the line number you want to remove.
sed '3d' file.txt
Delete a range of lines
Specify the start and end line numbers separated by a comma. Both lines are included in the deletion.
sed '2,4d' file.txt # delete lines 2 through 4
Delete multiple non-consecutive lines
Chain separate -e expressions to target lines that are not adjacent to each other.
sed -e '1d' -e '5d' -e '9d' file.txt
Printing line ranges
-n suppresses default output so only the lines you explicitly print with p appear. Without -n, matching lines would print twice.
Print a specific line
Supply the line number followed by p. The -n flag ensures only that line appears in the output.
sed -n '4p' file.txt
Print a range of lines
Separate the start and end line numbers with a comma. Both lines are included.
sed -n '5,10p' /var/log/syslog
Print from a line number to the end of the file
$ means the last line, so 20,$ reads as “from line 20 to the end.”
sed -n '20,$p' file.txt
Print lines matching a pattern
Wrap the pattern in / delimiters. sed prints every line where the pattern matches.
sed -n '/ERROR/p' /var/log/app.log
Editing files in place
-i writes the result back to the file instead of printing to stdout. Always test your expression without -i first.
Linux
Run without -i to preview, then add -i to apply the change.
sed 's/old-server/new-server/g' config.conf # preview the change
sed -i 's/old-server/new-server/g' config.conf # apply it
Pass a suffix to keep a backup copy before editing.
sed -i.bak 's/old/new/g' file.txt # original saved as file.txt.bak
macOS
-i requires an explicit backup extension. Pass an empty string to skip the backup.
sed -i '' 's/old/new/g' file.txt
Subexpressions
Subexpressions let you capture and rearrange parts of a matched string. Wrap the part you want to capture in \( and \), then reference it in the replacement with \1, \2, and so on.
For example, reformat a date string from YYYY-MM-DD to DD/MM/YYYY:
echo "2025-03-29" | sed 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/\3\/\2\/\1/'
# 29/03/2025
Generate renamed filenames for versioned images by moving the version number before the extension (sed only rewrites the text; pair the output with mv to actually rename):
ls | sed 's/image\.jpg\.\([1-3]\)/image\1.jpg/'
Removing whitespace
Remove leading spaces from the start of each line:
sed 's/^ *//' file.txt
Remove trailing spaces from the end of each line:
sed 's/ *$//' file.txt
Remove both leading and trailing spaces in a single command by chaining two substitutions with -e:
sed -e 's/^ *//' -e 's/ *$//' file.txt
Remove all spaces from every line:
sed 's/ //g' file.txt
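A quick demonstration of the chained trim on inline input:

```shell
# Leading and trailing spaces are stripped; interior spacing is untouched.
printf '   padded text   \n' | sed -e 's/^ *//' -e 's/ *$//'
# padded text
```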
Long regex patterns
Break a long regex into named shell variables for readability:
areacode='\([0-9]*\)'
state='\([A-Z][A-Z]\)'
city='\([^@]*\)'
regexp="${areacode}@${state}@${city}@"
replacement='\1\t\2\t\3\n'
sed "s/$regexp/$replacement/g" data.txt
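The data format here is assumed to be areacode@STATE@city@ records. A runnable version with inline sample data, using a plain space in the replacement so the output is easy to read:

```shell
areacode='\([0-9]*\)'
state='\([A-Z][A-Z]\)'
city='\([^@]*\)'
regexp="${areacode}@${state}@${city}@"
replacement='\1 \2 \3'
printf '212@NY@New York@\n415@CA@San Francisco@\n' | sed "s/$regexp/$replacement/"
# 212 NY New York
# 415 CA San Francisco
```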
awk for filtering fields
awk is named after its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan. It processes structured text files and generates reports. It reads each line, splits it into fields, and runs your awk program against each line. Fields are referenced as $1, $2, and so on. $0 is the entire line and $NF is the last field.
awk '{print $2}' /etc/hosts # print second column of hosts file
awk -F: '{print $1}' /etc/passwd # usernames, colon-delimited
df | awk 'FNR>1 {print $4}' # available space, skip header
-F: changing the field delimiter
By default, awk splits each line on whitespace. Whitespace means one or more spaces or tabs. Leading and trailing whitespace is ignored, so fields always start with a non-space character. Use -F to specify a different delimiter. The delimiter can be a single character or a regular expression:
awk -F: '{print $1}' /etc/passwd # split on colon, print username
awk -F, '{print $2}' report.csv # split on comma, print second column
awk -F'\t' '{print $3}' data.tsv # split on tab, print third column
awk -F/ '{print $NF}' /etc/shells # split on slash, print last field
You can also set the delimiter inside the program using the FS variable in a BEGIN block. This is useful when the delimiter itself is complex or when you want the program to be self-contained:
awk 'BEGIN {FS=":"} {print $1, $6}' /etc/passwd # username and home directory
awk 'BEGIN {FS=","} {print $1}' report.csv
NF: number of fields
NF is a built-in variable that holds the number of fields in the current line. Because $NF evaluates to the field at position NF, it always refers to the last field regardless of how many fields the line has. Use $(NF-1) to get the second-to-last field:
awk '{print NF}' /etc/passwd # print the field count for each line
awk '{print $NF}' /etc/passwd # print the last field of each line
awk '{print $(NF-1)}' /etc/passwd # print the second-to-last field
awk 'NF > 3' report.txt # print only lines with more than 3 fields
Extract the filename from each path in a list, without knowing how many directories deep each path goes:
echo "/var/log/nginx/access.log" | awk -F/ '{print $NF}' # access.log
Filtering with patterns
Run an awk program only on lines that match a pattern:
awk '/ERROR/' /var/log/app.log # lines containing ERROR
awk '!/^#/' /etc/hosts # non-comment lines
awk '$9 >= 500' access.log # HTTP 5xx errors (field 9 is status code)
awk 'NR>=10 && NR<=20' file.txt # lines 10 through 20
BEGIN and END blocks
BEGIN runs before processing any lines. END runs after all lines have been processed. Both are useful for printing headers, footers, and computed summaries:
awk -F'\t' \
'BEGIN {print "Recent entries:"} \
$3~/^2025/{print $4, "(" $3 ").", "\"" $2 "\""} \
END {print "End of report"}' \
data.tsv
Sum a column and print the total:
seq 1 100 | awk '{s+=$1} END {print "Total:", s}' # Total: 5050
Arrays and loops in awk
awk arrays act as hash maps indexed by any string key. This makes them useful for counting occurrences:
awk '{counts[$9]++} END {for (code in counts) print counts[code], code}' access.log
The previous command counts HTTP status codes across an access log. counts[$9]++ increments the count for each status code value in field 9. The END block prints each status code and its count.
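The same counting pattern on inline input, with sort -rn appended because for (key in array) visits keys in no particular order:

```shell
printf 'GET\nPOST\nGET\nGET\n' \
  | awk '{counts[$1]++} END {for (k in counts) print counts[k], k}' \
  | sort -rn
# 3 GET
# 1 POST
```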
Find duplicate files by checksum:
md5sum *.jpg | awk '{counts[$1]++} END {for (key in counts) print counts[key], key}' | sort -rn
awk cheat sheet
The following table shows common awk programs:
| Program | Description |
|---|---|
| '{print $1}' | Print first column |
| '{print $1, $3}' | Print columns 1 and 3 |
| 'NR==3' | Print line 3 |
| 'NR>=2 && NR<=4' | Print lines 2 to 4 |
| '/pattern/' | Print lines matching pattern |
| '!/pattern/' | Print lines not matching pattern |
| '{print NR, $0}' | Print lines with line numbers |
| '$2 > 50' | Print lines where field 2 is greater than 50 |
| '{sum+=$1} END {print sum}' | Sum field 1 |
| '{sum+=$1} END {print sum/NR}' | Average of field 1 |
| 'BEGIN {FS=","} {print $1}' | Comma-delimited: print first field |
| '{print toupper($1)}' | Convert first field to uppercase |
| '{sub(/old/, "new"); print}' | Replace first occurrence per line |
| '{gsub(/old/, "new"); print}' | Replace all occurrences per line |
| '{print $NF}' | Print last field |
Generating text
date
date prints the current date and time in any format you specify. Format strings begin with + and contain % sequences:
The following table shows common format strings:
| Format string | Example output | Description |
|---|---|---|
| %Y-%m-%d | 2025-03-29 | ISO 8601 date |
| %d-%m-%Y | 29-03-2025 | Day-Month-Year |
| %m/%d/%Y | 03/29/2025 | Month/Day/Year (US) |
| %A, %B %d, %Y | Saturday, March 29, 2025 | Full date with names |
| %H:%M:%S | 14:30:15 | 24-hour time |
| %Y-%m-%d %H:%M:%S | 2025-03-29 14:30:15 | Full timestamp |
| %s | 1743258615 | Unix epoch seconds |
Generate a timestamped filename for a log archive:
tar -czf "backup_$(date +%Y-%m-%d).tar.gz" /var/www/html
seq
seq prints a sequence of numbers. The basic form is seq LOW HIGH. Add a third argument between them to set the step:
seq 1 5 # 1 2 3 4 5
seq 1 2 10 # 1 3 5 7 9 (odd numbers)
seq 10 -1 1 # 10 9 8 ... 1 (countdown)
seq 0 0.5 2 # 0.0 0.5 1.0 1.5 2.0 (fractional step)
seq -w 1 10 # 01 02 ... 10 (zero-padded to equal width)
seq -s, 1 5 # 1,2,3,4,5 (comma-separated)
Brace expansion
Brace expansion generates sequences of numbers or letters directly in the shell without a separate program:
echo {1..5} # 1 2 3 4 5
echo {1..10..2} # 1 3 5 7 9 (step of 2)
echo {a..z} # a b c ... z
echo {A..Z} | tr -d ' ' # ABCDEFGHIJKLMNOPQRSTUVWXYZ (no spaces)
echo {A..Z} | tr ' ' '\n' # one letter per line
Brace expansion is useful for creating numbered files or directories in bulk:
mkdir -p logs/{jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec}
touch report_{2023..2025}.txt
yes
yes prints the same line repeatedly until killed. By default it prints y. Pipe it to a command that requires interactive confirmation, or pipe it to head to generate repeated test input:
yes | apt-get install package-name # answer y to every confirmation prompt
yes "test data" | head -100 # generate 100 lines of test data
Checksums and duplicate detection
md5sum and sha1sum
md5sum computes a 32-character hexadecimal hash (checksum) of a file’s contents. Files with identical contents produce identical checksums. sha1sum produces a 40-character hash and is more collision-resistant; neither algorithm is suitable for security purposes today, but both work well for spotting duplicate files:
md5sum /etc/hosts # hash the file
sha1sum deployment.tar.gz # verify a downloaded archive
Verify a file’s integrity by comparing against a published checksum:
sha1sum -c checksums.txt # -c reads a file of "hash  filename" lines and checks each one
Detecting duplicate files
This pipeline finds duplicate files in the current directory by comparing checksums. Walk through each step to understand how it builds:
Step 1: compute the checksum for all .txt files:
md5sum *.txt | cut -d' ' -f1
Step 2: sort so that identical checksums are adjacent:
md5sum *.txt | cut -d' ' -f1 | sort
Step 3: count occurrences of each checksum:
md5sum *.txt | cut -d' ' -f1 | sort | uniq -c
Step 4: sort by count descending so duplicates appear first:
md5sum *.txt | cut -d' ' -f1 | sort | uniq -c | sort -rn
Step 5: filter out non-duplicates (count of 1):
md5sum *.txt | cut -d' ' -f1 | sort | uniq -c | sort -rn | grep -v '^ *1 ' # uniq -c left-pads counts, so allow leading spaces
Once you have a checksum for a duplicate, find the filenames:
md5sum *.txt | grep "5bbf5a52328e7439ae6e719dfe712200" | cut -d' ' -f3 # field 3: md5sum separates hash and name with two spaces
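The whole procedure can be rehearsed end to end in a throwaway directory (file names and contents invented for the demo):

```shell
dir=$(mktemp -d)
printf 'same contents\n'  > "$dir/a.txt"
printf 'same contents\n'  > "$dir/b.txt"
printf 'other contents\n' > "$dir/c.txt"
# Only checksums that occur more than once survive the final filter.
md5sum "$dir"/*.txt | cut -d' ' -f1 | sort | uniq -c | sort -rn | grep -v '^ *1 '
# prints one line: a count of 2 followed by the shared checksum
rm -rf "$dir"
```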
Real-world pipeline example: Apache log analysis
An Apache access log has fields for IP address, timestamp, HTTP method, path, status code, and bytes transferred. This section shows how to investigate a spike in 404 errors.
Filter only 404 lines for today’s date:
grep "$(date +%d/%b/%Y)" /var/log/apache2/access.log | grep ' 404 '
Find the top 15 paths generating 404 errors:
grep ' 404 ' /var/log/apache2/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -15
Find the IP addresses making the most requests that result in 404 errors:
grep ' 404 ' /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
Check whether a single IP is making an unusually high number of 401 errors (failed authentication):
grep ' 401 ' /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -5
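To experiment without a real server, the same pipelines can be rehearsed against a few hand-written log lines in the Apache combined-log shape (IPs and paths invented):

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [29/Mar/2025:10:00:00 +0000] "GET /old-page HTTP/1.1" 404 153
10.0.0.2 - - [29/Mar/2025:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.1 - - [29/Mar/2025:10:00:02 +0000] "GET /old-page HTTP/1.1" 404 153
EOF
# Top 404 paths: filter by status, take field 7 (the path), count, rank.
grep ' 404 ' "$log" | awk '{print $7}' | sort | uniq -c | sort -rn
# prints a count of 2 followed by /old-page
rm -f "$log"
```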
Leveraging text files for recurring tasks
When data is structured and stored in a text file, you can write commands once and rerun them on updated data. The general process is:
- Identify the business problem that involves data.
- Store the data in a plain text file in a consistent format.
- Write Linux commands to process the file and solve the problem.
- Capture the commands in a script so they are easy to repeat.
A practical example: check domain expiration dates from a list. The following script queries each domain’s registrar with whois and extracts the expiry date:
#!/bin/bash
expdate=$(date \
--date "$(whois "$1" \
| grep 'Registry Expiry Date:' \
| awk '{print $4}')" \
+'%Y-%m-%d')
echo "$expdate $1"
Call that script from a loop that reads from a file of domain names:
while read -r domain; do
./check-expiry "$domain"
sleep 5
done < domains.txt