Text processing
Most system data on Linux is stored as plain text. Log files record application events, configuration files control behavior, CSV exports carry database contents, and /proc files expose kernel state. The command line provides a set of composable tools for filtering, transforming, and extracting data from these files without requiring a database query language or spreadsheet software.
The real power is in combining these tools in pipelines. Each command in a pipeline does one thing and does it well. A two-minute investigation of a production issue often comes down to three or four of these commands chained together. This page covers the tools you will reach for most frequently.
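As a taste of what such a pipeline looks like, the sketch below runs on sample data generated inline rather than a real log file:

```shell
# Which message appears most often? Group identical lines together,
# count each group, then rank the groups by count.
printf 'error\nok\nerror\nwarn\nerror\nok\n' \
  | sort \
  | uniq -c \
  | sort -rn
# prints "3 error", "2 ok", "1 warn" (counts are left-padded)
```

Each stage is covered in detail later in this page; the point here is that four small commands answer a real question in one line.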
Viewing and counting text
wc
wc counts lines, words, and bytes in a file or from stdin. A newline counts as one byte, so line and byte counts include the line endings.
The following table shows the common options:
| Option | Counts |
|---|---|
| -l | Lines only |
| -w | Words only |
| -c | Bytes only (use -m to count multibyte characters) |
Count lines in a log file to check whether it has grown since last night:
wc -l /var/log/auth.log # 1274 /var/log/auth.log
wc -w /etc/nginx/nginx.conf # word count of a config file
ls -1 /etc | wc -l # number of entries in /etc
head
head prints the first 10 lines of a file. Specify a different number with -n:
head -20 /var/log/syslog # first 20 lines
ls /bin | head -5 # first 5 filenames in /bin
head -3 access.log | wc -w # count words in the first 3 lines
tail
tail prints the last 10 lines of a file. Pass -f to follow a log file in real time as new lines are written:
tail -f /var/log/nginx/error.log # live log watching
tail -n 50 /var/log/syslog # last 50 lines
tail -n +3 /etc/hosts # everything starting at line 3
head -10 /var/log/auth.log | tail -3 # lines 8, 9, and 10
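The head | tail combination generalizes to any line range: to print lines N through M, take the first M lines and keep the last M-N+1 of them. A quick sketch using seq as input:

```shell
# Lines 8-10 of a 20-line stream: first 10 lines, then the last 3 of those.
seq 20 | head -10 | tail -3
# 8
# 9
# 10
```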
cat and tac
cat (concatenate) prints files in order from top to bottom. tac prints them in reverse, from bottom to top:
cat file1.txt file2.txt > combined.txt
tac /var/log/apache2/access.log | head -20 # most recent 20 requests
tac is useful for reversing chronological logs that sort -r cannot handle, such as when the timestamp appears mid-line rather than at the start of each line.
Extracting columns and fields
cut
cut extracts columns from a line. Specify columns by field number (tab-delimited by default) or by character position.
The following table shows the main options:
| Option | Behavior |
|---|---|
| -fN | Print field N (tab-delimited) |
| -f2,4 | Print fields 2 and 4 |
| -f2-4 | Print fields 2 through 4 |
| -cN | Print character N |
| -c2-4 | Print characters 2 through 4 |
| -d, | Change the field delimiter to , |
Extract the username field from /etc/passwd:
cut -d: -f1 /etc/passwd # first field, colon-delimited
cut -d: -f1,7 /etc/passwd # usernames and shells
cut -f1-3 report.tsv # first three tab-delimited columns
Combining text
paste
paste combines files side by side, separating columns with a tab by default:
paste usernames.txt emails.txt # tab-delimited
paste -d, names.txt scores.txt # comma-delimited
paste -d "\n" list1.txt list2.txt # interleave lines from two files
diff
diff compares two files and prints their differences. The output notation uses < for lines only in the first file and > for lines only in the second:
diff /etc/hosts /etc/hosts.backup
1c1 # line 1 in file 1 differs from line 1 in file 2
< 192.168.1.100 web-01
---
> 192.168.1.101 web-01
Filter the output to see only the changed lines without the context markers:
diff config.current config.backup | grep '^[<>]'
diff config.current config.backup | grep '^[<>]' | cut -c3- # remove the < > prefix
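A self-contained run of the same idea, using two throwaway files built with mktemp (the contents are invented for the demo):

```shell
# Create two files that differ on one line, then show only the changed lines.
a=$(mktemp)
b=$(mktemp)
printf 'alpha\nbravo\ncharlie\n' > "$a"
printf 'alpha\nbrav0\ncharlie\n' > "$b"
diff "$a" "$b" | grep '^[<>]' | cut -c3-
# bravo
# brav0
rm -f "$a" "$b"
```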
Transforming text
tr
tr translates characters: it takes two sets of characters and replaces each character in the first set with the corresponding character in the second set. Pass -d to delete characters instead.
Convert a colon-delimited string to newlines for readable output:
echo $PATH | tr ':' '\n'
Remove all spaces from a string:
echo "hello world" | tr -d ' ' # helloworld
Convert uppercase to lowercase using POSIX character classes:
echo "SERVER-01" | tr '[:upper:]' '[:lower:]' # server-01
rev
rev reverses the characters on each line. This is useful when you need to extract the last field from lines that have varying numbers of columns. For example, extract the last word from each line regardless of how many words precede it:
rev /etc/shells | cut -d'/' -f1 | rev # extract shell name from each path
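The same trick works on any input. A quick demonstration with inline sample lines:

```shell
# Last whitespace-delimited word of each line, regardless of word count:
# reverse the line, take the (now first) field, reverse it back.
printf 'one two three\nfour five\n' | rev | cut -d' ' -f1 | rev
# three
# five
```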
Sorting and deduplication
sort
sort reorders lines in ascending alphabetical order by default.
The following table shows common options:
| Option | Behavior |
|---|---|
| -r | Reverse (descending) order |
| -n | Numeric sort |
| -nr | Numeric descending |
| -u | Remove duplicate lines |
| -f | Ignore case |
| -k N | Sort by field N (whitespace-delimited) |
| -t , | Change field delimiter to , |
| -o | Write output to the specified file |
Sort a log file numerically by a chosen field (for example, the HTTP status code in field 9 of an Apache access log):
sort -k9 -n access.log # sort by status code (field 9)
sort -k3 -rn report.txt # sort by field 3, numeric descending
cut -f3 data.tsv | sort -nr # sort the third column numerically
Sort by characters 4 and 5 of field 2 (useful for date fields):
sort -k 2.4,2.5 filename.txt
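To see how character positions inside a field behave, the sketch below sorts inline sample lines by the month digits of an ISO date in field 2. It uses -t, so fields are exact; with default whitespace-separated fields, the leading blank counts as character 1 of the field unless you add -b:

```shell
# -k 2.6,2.7 = sort key is characters 6 through 7 of field 2 (the month).
printf 'a,2025-11-01\nb,2025-03-15\nc,2025-07-09\n' | sort -t, -k 2.6,2.7
# b,2025-03-15
# c,2025-07-09
# a,2025-11-01
```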
uniq
uniq removes adjacent duplicate lines. Always sort first so that duplicates are adjacent.
The following table shows common options:
| Option | Behavior |
|---|---|
| -c | Prefix each line with the count of occurrences |
| -d | Print only lines that appear more than once |
| -u | Print only lines that appear exactly once |
| -i | Ignore case when comparing |
| -f N | Skip the first N fields before comparing |
Find the most common HTTP status codes in an access log:
awk '{print $9}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -10
Find the grade that appears most often in a grades file:
cut -f1 grades | sort | uniq -c | sort -nr | head -1 | cut -c9
grep for pattern matching
grep stands for Global Regular Expression Print, taken from the ed command g/re/p. It searches files for lines that match a pattern and prints each matching line. It is the most frequently used filtering tool on the command line.
grep [OPTIONS] PATTERN [FILE...]
The following table shows common options:
| Option | Behavior |
|---|---|
| -c | Print a count of matching lines instead of the lines themselves |
| -e | Specify a pattern. Repeat the flag to match any of several patterns |
| -E | Enable extended regular expressions (ERE) |
| -f | Read patterns from a file |
| -i | Ignore case |
| -l | Print only filenames of files with at least one match |
| -n | Print the line number before each match |
| -o | Print only the matching text, not the full line |
| -q | Silent mode: exit status only, no output |
| -r | Recursively search directories |
| -R | Recursive search, following symbolic links |
| -v | Invert match: print lines that do not match |
| -w | Match whole words only |
| -P | Enable Perl-compatible regular expressions |
Search for failed SSH login attempts in the auth log:
grep "Failed password" /var/log/auth.log
grep -c "Failed password" /var/log/auth.log # count failures
grep -i "error" /var/log/syslog # case-insensitive
grep -R "PermitRootLogin" /etc/ssh/ # recursive config search
grep -v "^#" /etc/nginx/nginx.conf # non-comment lines only
Double-filter to find root login failures specifically:
grep "authenticating" /var/log/auth.log | grep "root"
Anchors and patterns
Pattern anchors control where in the line a match must occur:
grep '^root' /etc/passwd # lines that begin with "root"
grep 'nologin$' /etc/passwd # lines that end with "nologin"
grep -v '^$' file.txt # non-blank lines (match blank lines and invert)
grep '......' file.txt # lines with at least 6 characters
grep 'daemon.*nologin' /etc/passwd # quoting stops the shell from expanding *
Extended regular expressions
Pass -E (or run egrep) to enable extended regex, which adds support for |, +, ?, {, and (:
grep -E "^root|^dbus" /etc/passwd # lines beginning with root or dbus
grep -E '^(web|app|db)-[0-9]+' /etc/hosts # match server naming patterns
egrep "(daemon|nobody).*nologin" /etc/passwd
Matching multiple patterns
Two approaches let you match lines that contain any one of several patterns.
Use | inside a single quoted expression with -E. The shell passes the whole quoted string to grep as one pattern:
grep -E "error|warning|critical" /var/log/syslog
grep -E "^root|^daemon|^nobody" /etc/passwd
Use -e to supply each pattern as a separate flag. Each -e argument is an independent pattern and the result is identical, but the form is easier to read when patterns are long or contain characters that would conflict inside a single regex:
grep -e "error" -e "warning" -e "critical" /var/log/syslog
grep -e "^root" -e "^daemon" -e "^nobody" /etc/passwd
Both forms print every line that matches at least one pattern. For large sets of patterns, store them one per line in a file and load them with -f:
grep -f patterns.txt /var/log/syslog
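A self-contained sketch of the -f form, writing two patterns to a temporary file first:

```shell
pats=$(mktemp)
printf 'error\nwarning\n' > "$pats"          # one pattern per line
printf 'boot ok\ndisk error\nfan warning\n' | grep -f "$pats"
# disk error
# fan warning
rm -f "$pats"
```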
Regex reference
The following table describes regex metacharacters:
| Metacharacter | Matches |
|---|---|
| . | Any single character except newline |
| * | Zero or more of the preceding character |
| + | One or more of the preceding character (ERE) |
| ? | Zero or one of the preceding character (ERE) |
| ^ | Start of line |
| $ | End of line |
| [abc] | Any character in the set |
| [^abc] | Any character not in the set |
| [a-z] | Any character in the range |
| \ | Escape the following character |
The following table describes POSIX character classes. They are used inside a bracket expression, so a match for a single digit is written [[:digit:]]:
| Class | Matches |
|---|---|
| [:alnum:] | Alphanumeric characters |
| [:alpha:] | Alphabetic characters |
| [:digit:] | Digits |
| [:lower:] | Lowercase letters |
| [:upper:] | Uppercase letters |
| [:space:] | Whitespace including line breaks |
| [:punct:] | Punctuation |
| [:xdigit:] | Hexadecimal digits |
Perl regex shortcuts
These shortcuts require the -P flag:
| Shortcut | Matches |
|---|---|
| \s | Any whitespace |
| \S | Any non-whitespace |
| \d | Any digit |
| \D | Any non-digit |
| \w | Word character (letter, digit, underscore) |
| \W | Non-word character |
Quantifiers
Quantifiers control how many times the preceding expression must match:
grep -E 'T{5}' file # T appears exactly 5 times consecutively
grep -E 'T{3,6}' file # T appears 3 to 6 times
grep -E 'T{5,}' file # T appears 5 or more times
Back references
A back reference lets you refer to a previous capture group. This example matches any HTML opening and closing tag pair where the tag names match:
egrep '<([a-zA-Z]*)>.*</\1>' file.html
The \1 means “whatever was matched inside the first set of parentheses.”
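A quick check of the back reference on inline input (GNU grep supports \1 in -E mode, though POSIX leaves ERE back references undefined):

```shell
# Only the line whose closing tag repeats the captured opening tag matches.
printf '<b>bold</b>\n<i>oops</b>\n' | grep -E '<([a-z]+)>.*</\1>'
# <b>bold</b>
```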
sed for stream editing
sed stands for Stream EDitor. It transforms text by applying a sequence of instructions called a sed script to each line of input. The most common script replaces one string with another:
sed 's/regexp/replacement/' input-file
The s command replaces the first occurrence on each line. Append g to replace all occurrences:
echo "one one two" | sed 's/one/yes/g' # yes yes two
The following table shows common sed commands:
| Script | Effect |
|---|---|
| s/old/new/ | Replace first occurrence per line |
| s/old/new/g | Replace all occurrences per line |
| s/old/new/i | Case-insensitive replacement |
| s/old/new/gI | Global case-insensitive replacement |
| 2s/old/new/ | Replace only on line 2 |
| s/old/new/3 | Replace only the third occurrence on each line |
| -i | Modify the file in place |
| d | Delete lines (combined with an address) |
| /pattern/d | Delete lines matching a pattern |
| nd | Delete line number n |
| -n 'Np' | Print only line N |
| -n '/pattern/p' | Print lines matching a pattern |
| y/abc/xyz/ | Translate characters (like tr) |
Replace a hostname in a configuration file:
sed -i 's/old-server.internal/new-server.internal/g' /etc/app/config.conf
Delete comment lines and blank lines from a config file:
sed -e '/^#/d' -e '/^$/d' /etc/nginx/nginx.conf
Print a specific line range:
sed -n '5,10p' /var/log/syslog # print lines 5 through 10
Deleting lines
The d command deletes lines. Address a line by number, by $ for the last line, or by a pattern.
Delete the first line
Line 1 is the address. d deletes it and sed moves on to the next line.
sed '1d' file.txt
Delete the last line
$ matches the last line of the file regardless of how many lines there are.
sed '$d' file.txt
Delete a specific line by number
Replace 3 with the line number you want to remove.
sed '3d' file.txt
Delete a range of lines
Specify the start and end line numbers separated by a comma. Both lines are included in the deletion.
sed '2,4d' file.txt # delete lines 2 through 4
Delete multiple non-consecutive lines
Chain separate -e expressions to target lines that are not adjacent to each other.
sed -e '1d' -e '5d' -e '9d' file.txt
Printing line ranges
-n suppresses default output so only the lines you explicitly print with p appear. Without -n, matching lines would print twice.
Print a specific line
Supply the line number followed by p. The -n flag ensures only that line appears in the output.
sed -n '4p' file.txt
Print a range of lines
Separate the start and end line numbers with a comma. Both lines are included.
sed -n '5,10p' /var/log/syslog
Print from a line number to the end of the file
$ means the last line, so 20,$ reads as “from line 20 to the end.”
sed -n '20,$p' file.txt
Print lines matching a pattern
Wrap the pattern in / delimiters. sed prints every line where the pattern matches.
sed -n '/ERROR/p' /var/log/app.log
Editing files in place
-i writes the result back to the file instead of printing to stdout. Always test your expression without -i first.
Linux
Run without -i to preview, then add -i to apply the change.
sed 's/old-server/new-server/g' config.conf # preview the change
sed -i 's/old-server/new-server/g' config.conf # apply it
Pass a suffix to keep a backup copy before editing.
sed -i.bak 's/old/new/g' file.txt # original saved as file.txt.bak
macOS
-i requires an explicit backup extension. Pass an empty string to skip the backup.
sed -i '' 's/old/new/g' file.txt
Subexpressions
Subexpressions let you capture and rearrange parts of a matched string. Wrap the part you want to capture in \( and \), then reference it in the replacement with \1, \2, and so on.
For example, reformat a date string from YYYY-MM-DD to DD/MM/YYYY:
echo "2025-03-29" | sed 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/\3\/\2\/\1/'
# 29/03/2025
Generate renamed filenames for versioned images by moving the version number before the extension (sed only rewrites the text; pair the output with mv to actually rename):
ls | sed 's/image\.jpg\.\([1-3]\)/image\1.jpg/'
Removing whitespace
Remove leading spaces from the start of each line:
sed 's/^ *//' file.txt
Remove trailing spaces from the end of each line:
sed 's/ *$//' file.txt
Remove both leading and trailing spaces in a single command by chaining two substitutions with -e:
sed -e 's/^ *//' -e 's/ *$//' file.txt
Remove all spaces from every line:
sed 's/ //g' file.txt
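A quick demonstration of the chained trim on inline input:

```shell
# Leading and trailing spaces are stripped; interior spacing is untouched.
printf '   padded text   \n' | sed -e 's/^ *//' -e 's/ *$//'
# padded text
```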
Long regex patterns
Break a long regex into named shell variables for readability:
areacode='\([0-9]*\)'
state='\([A-Z][A-Z]\)'
city='\([^@]*\)'
regexp="${areacode}@${state}@${city}@"
replacement='\1\t\2\t\3\n'
sed "s/$regexp/$replacement/g" data.txt
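The data format here is assumed to be areacode@STATE@city@ records. A runnable version with inline sample data, using a plain space in the replacement so the output is easy to read:

```shell
areacode='\([0-9]*\)'
state='\([A-Z][A-Z]\)'
city='\([^@]*\)'
regexp="${areacode}@${state}@${city}@"
replacement='\1 \2 \3'
printf '212@NY@New York@\n415@CA@San Francisco@\n' | sed "s/$regexp/$replacement/"
# 212 NY New York
# 415 CA San Francisco
```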
awk for filtering fields
awk is named after its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan. It processes structured text files and generates reports. It reads each line, splits it into fields, and runs your awk program against each line. Fields are referenced as $1, $2, and so on. $0 is the entire line and $NF is the last field.
awk '{print $2}' /etc/hosts # print second column of hosts file
awk -F: '{print $1}' /etc/passwd # usernames, colon-delimited
df | awk 'FNR>1 {print $4}' # available space, skip header
-F: changing the field delimiter
By default, awk splits each line on whitespace. Whitespace means one or more spaces or tabs. Leading and trailing whitespace is ignored, so fields always start with a non-space character. Use -F to specify a different delimiter. The delimiter can be a single character or a regular expression:
awk -F: '{print $1}' /etc/passwd # split on colon, print username
awk -F, '{print $2}' report.csv # split on comma, print second column
awk -F'\t' '{print $3}' data.tsv # split on tab, print third column
awk -F/ '{print $NF}' /etc/shells # split on slash, print last field
You can also set the delimiter inside the program using the FS variable in a BEGIN block. This is useful when the delimiter itself is complex or when you want the program to be self-contained:
awk 'BEGIN {FS=":"} {print $1, $6}' /etc/passwd # username and home directory
awk 'BEGIN {FS=","} {print $1}' report.csv
NF: number of fields
NF is a built-in variable that holds the number of fields in the current line. Because $NF evaluates to the field at position NF, it always refers to the last field regardless of how many fields the line has. Use $(NF-1) to get the second-to-last field:
awk '{print NF}' /etc/passwd # print the field count for each line
awk '{print $NF}' /etc/passwd # print the last field of each line
awk '{print $(NF-1)}' /etc/passwd # print the second-to-last field
awk 'NF > 3' report.txt # print only lines with more than 3 fields
Extract the filename from each path in a list, without knowing how many directories deep each path goes:
echo "/var/log/nginx/access.log" | awk -F/ '{print $NF}' # access.log
Filtering with patterns
Run an awk program only on lines that match a pattern:
awk '/ERROR/' /var/log/app.log # lines containing ERROR
awk '!/^#/' /etc/hosts # non-comment lines
awk '$9 >= 500' access.log # HTTP 5xx errors (field 9 is status code)
awk 'NR>=10 && NR<=20' file.txt # lines 10 through 20
BEGIN and END blocks
BEGIN runs before processing any lines. END runs after all lines have been processed. Both are useful for printing headers, footers, and computed summaries:
awk -F'\t' \
'BEGIN {print "Recent entries:"} \
$3~/^2025/{print $4, "(" $3 ").", "\"" $2 "\""} \
END {print "End of report"}' \
data.tsv
Sum a column and print the total:
seq 1 100 | awk '{s+=$1} END {print "Total:", s}' # Total: 5050
Arrays and loops in awk
awk arrays act as hash maps indexed by any string key. This makes them useful for counting occurrences:
awk '{counts[$9]++} END {for (code in counts) print counts[code], code}' access.log
The previous command counts HTTP status codes across an access log. counts[$9]++ increments the count for each status code value in field 9. The END block prints each status code and its count.
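The same counting pattern on inline input, with sort -rn appended because for (key in array) visits keys in no particular order:

```shell
printf 'GET\nPOST\nGET\nGET\n' \
  | awk '{counts[$1]++} END {for (k in counts) print counts[k], k}' \
  | sort -rn
# 3 GET
# 1 POST
```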
Find duplicate files by checksum:
md5sum *.jpg | awk '{counts[$1]++} END {for (key in counts) print counts[key], key}' | sort -rn
awk cheat sheet
The following table shows common awk programs:
| Program | Description |
|---|---|
| '{print $1}' | Print first column |
| '{print $1, $3}' | Print columns 1 and 3 |
| 'NR==3' | Print line 3 |
| 'NR>=2 && NR<=4' | Print lines 2 to 4 |
| '/pattern/' | Print lines matching pattern |
| '!/pattern/' | Print lines not matching pattern |
| '{print NR, $0}' | Print lines with line numbers |
| '$2 > 50' | Print lines where field 2 is greater than 50 |
| '{sum+=$1} END {print sum}' | Sum field 1 |
| '{sum+=$1} END {print sum/NR}' | Average of field 1 |
| 'BEGIN {FS=","} {print $1}' | Comma-delimited: print first field |
| '{print toupper($1)}' | Convert first field to uppercase |
| '{sub(/old/, "new"); print}' | Replace first occurrence per line |
| '{gsub(/old/, "new"); print}' | Replace all occurrences per line |
| '{print $NF}' | Print last field |
Generating text
date
date prints the current date and time in any format you specify. Format strings begin with + and contain % sequences:
The following table shows common format strings:
| Format string | Example output | Description |
|---|---|---|
| %Y-%m-%d | 2025-03-29 | ISO 8601 date |
| %d-%m-%Y | 29-03-2025 | Day-Month-Year |
| %m/%d/%Y | 03/29/2025 | Month/Day/Year (US) |
| %A, %B %d, %Y | Saturday, March 29, 2025 | Full date with names |
| %H:%M:%S | 14:30:15 | 24-hour time |
| %Y-%m-%d %H:%M:%S | 2025-03-29 14:30:15 | Full timestamp |
| %s | 1743258615 | Unix epoch seconds |
Generate a timestamped filename for a log archive:
tar -czf "backup_$(date +%Y-%m-%d).tar.gz" /var/www/html
seq
seq prints a sequence of numbers. The basic form is seq LOW HIGH. Add a third argument between them to set the step:
seq 1 5 # 1 2 3 4 5
seq 1 2 10 # 1 3 5 7 9 (odd numbers)
seq 10 -1 1 # 10 9 8 ... 1 (countdown)
seq 0 0.5 2 # 0.0 0.5 1.0 1.5 2.0 (fractional step)
seq -w 1 10 # 01 02 ... 10 (zero-padded to equal width)
seq -s, 1 5 # 1,2,3,4,5 (comma-separated)
Brace expansion
Brace expansion generates sequences of numbers or letters directly in the shell without a separate program:
echo {1..5} # 1 2 3 4 5
echo {1..10..2} # 1 3 5 7 9 (step of 2)
echo {a..z} # a b c ... z
echo {A..Z} | tr -d ' ' # ABCDEFGHIJKLMNOPQRSTUVWXYZ (no spaces)
echo {A..Z} | tr ' ' '\n' # one letter per line
Brace expansion is useful for creating numbered files or directories in bulk:
mkdir -p logs/{jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec}
touch report_{2023..2025}.txt
yes
yes prints the same line repeatedly until killed. By default it prints y. Pipe it to a command that requires interactive confirmation, or pipe it to head to generate repeated test input:
yes | apt-get install package-name # answer y to every confirmation prompt
yes "test data" | head -100 # generate 100 lines of test data
Checksums and duplicate detection
md5sum and sha1sum
md5sum computes a 32-character hexadecimal hash (checksum) of a file’s contents. Files with identical contents produce identical checksums. sha1sum produces a 40-character hash and is more collision-resistant; neither algorithm is suitable for security purposes today, but both work well for spotting duplicate files:
md5sum /etc/hosts # hash the file
sha1sum deployment.tar.gz # verify a downloaded archive
Verify a file’s integrity by comparing against a published checksum:
sha1sum -c checksums.txt # -c reads a file of "hash  filename" lines and checks each one
Detecting duplicate files
This pipeline finds duplicate files in the current directory by comparing checksums. Walk through each step to understand how it builds:
Step 1: compute the checksum for all .txt files:
md5sum *.txt | cut -d' ' -f1
Step 2: sort so that identical checksums are adjacent:
md5sum *.txt | cut -d' ' -f1 | sort
Step 3: count occurrences of each checksum:
md5sum *.txt | cut -d' ' -f1 | sort | uniq -c
Step 4: sort by count descending so duplicates appear first:
md5sum *.txt | cut -d' ' -f1 | sort | uniq -c | sort -rn
Step 5: filter out non-duplicates (count of 1):
md5sum *.txt | cut -d' ' -f1 | sort | uniq -c | sort -rn | grep -v '^ *1 ' # uniq -c left-pads counts, so allow leading spaces
Once you have a checksum for a duplicate, find the filenames:
md5sum *.txt | grep "5bbf5a52328e7439ae6e719dfe712200" | cut -d' ' -f3 # field 3: md5sum separates hash and name with two spaces
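The whole procedure can be rehearsed end to end in a throwaway directory (file names and contents invented for the demo):

```shell
dir=$(mktemp -d)
printf 'same contents\n'  > "$dir/a.txt"
printf 'same contents\n'  > "$dir/b.txt"
printf 'other contents\n' > "$dir/c.txt"
# Only checksums that occur more than once survive the final filter.
md5sum "$dir"/*.txt | cut -d' ' -f1 | sort | uniq -c | sort -rn | grep -v '^ *1 '
# prints one line: a count of 2 followed by the shared checksum
rm -rf "$dir"
```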
Real-world pipeline example: Apache log analysis
An Apache access log has fields for IP address, timestamp, HTTP method, path, status code, and bytes transferred. This section shows how to investigate a spike in 404 errors.
Filter only 404 lines for today’s date:
grep "$(date +%d/%b/%Y)" /var/log/apache2/access.log | grep ' 404 '
Find the top 15 paths generating 404 errors:
grep ' 404 ' /var/log/apache2/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -15
Find the IP addresses making the most requests that result in 404 errors:
grep ' 404 ' /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
Check whether a single IP is making an unusually high number of 401 errors (failed authentication):
grep ' 401 ' /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -5
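To experiment without a real server, the same pipelines can be rehearsed against a few hand-written log lines in the Apache combined-log shape (IPs and paths invented):

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [29/Mar/2025:10:00:00 +0000] "GET /old-page HTTP/1.1" 404 153
10.0.0.2 - - [29/Mar/2025:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.1 - - [29/Mar/2025:10:00:02 +0000] "GET /old-page HTTP/1.1" 404 153
EOF
# Top 404 paths: filter by status, take field 7 (the path), count, rank.
grep ' 404 ' "$log" | awk '{print $7}' | sort | uniq -c | sort -rn
# prints a count of 2 followed by /old-page
rm -f "$log"
```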
Leveraging text files for recurring tasks
When data is structured and stored in a text file, you can write commands once and rerun them on updated data. The general process is:
- Identify the business problem that involves data.
- Store the data in a plain text file in a consistent format.
- Write Linux commands to process the file and solve the problem.
- Capture the commands in a script so they are easy to repeat.
A practical example: check domain expiration dates from a list. The following script queries each domain’s registrar with whois and extracts the expiry date:
#!/bin/bash
expdate=$(date \
--date "$(whois "$1" \
| grep 'Registry Expiry Date:' \
| awk '{print $4}')" \
+'%Y-%m-%d')
echo "$expdate $1"
Call that script from a loop that reads from a file of domain names:
while read -r domain; do
./check-expiry "$domain"
sleep 5
done < domains.txt