Ubuntu Text Processing
Introduction
Text processing is a fundamental skill for anyone working with Ubuntu or any Linux-based system. It involves manipulating, filtering, and transforming text data using powerful command-line tools. In Ubuntu, text processing allows you to search through files, extract specific information, modify text content, and automate repetitive text-related tasks.
Whether you're a system administrator, developer, data analyst, or just a curious Ubuntu user, mastering text processing will significantly enhance your productivity and problem-solving capabilities in the command line environment.
Core Text Processing Tools
Ubuntu provides several powerful text processing utilities that form the backbone of text manipulation in the shell. Let's explore the most essential ones:
1. grep - Global Regular Expression Print
grep is a powerful tool for searching text patterns within files using regular expressions.
Basic Usage:
grep pattern filename
Examples:
Finding lines containing "ubuntu" in a file (matching is case-sensitive by default):
$ grep "ubuntu" file.txt
Many servers run ubuntu server edition.
ubuntu repositories contain thousands of packages.
Case-insensitive search with the -i flag:
$ grep -i "ubuntu" file.txt
Ubuntu is a popular Linux distribution.
Many servers run ubuntu server edition.
Ubuntu Desktop provides a user-friendly interface.
ubuntu repositories contain thousands of packages.
Counting matching lines with the -c flag:
$ grep -c "ubuntu" file.txt
2
Displaying line numbers with the -n flag:
$ grep -n "ubuntu" file.txt
2:Many servers run ubuntu server edition.
4:ubuntu repositories contain thousands of packages.
Recursive search in directories with the -r flag:
$ grep -r "function" /path/to/project/
/path/to/project/script.js:function processData() {
/path/to/project/utils.js:function formatText(text) {
2. sed - Stream Editor
sed is primarily used for text substitution, deletion, and insertion operations.
Basic Usage:
sed 'command' filename
Examples:
Replacing text:
$ cat file.txt
Hello World
$ sed 's/World/Ubuntu/' file.txt
Hello Ubuntu
Global substitution (all occurrences) with the g flag:
$ cat file.txt
Hello World World
$ sed 's/World/Ubuntu/g' file.txt
Hello Ubuntu Ubuntu
Deleting specific lines:
$ cat numbers.txt
1
2
3
4
5
$ sed '3d' numbers.txt
1
2
4
5
Deleting a range of lines:
$ sed '2,4d' numbers.txt
1
5
In-place editing with the -i flag:
$ sed -i 's/old/new/g' file.txt
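Because -i overwrites the file, it is often safer to keep a backup copy: GNU sed accepts an optional suffix directly after -i. A minimal sketch (file name and contents are illustrative):

```shell
# Create a scratch file to edit
printf 'old value\nanother old line\n' > demo.txt

# Edit in place, keeping the original as demo.txt.bak
sed -i.bak 's/old/new/g' demo.txt

cat demo.txt       # edited file, now contains "new"
cat demo.txt.bak   # untouched backup
```

Note that on BSD/macOS sed the suffix must be a separate argument (`sed -i '.bak' …`); the form above is the GNU syntax Ubuntu ships with.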
3. awk - Aho, Weinberger, and Kernighan
awk is a powerful programming language designed for text processing and typically used as a data extraction and reporting tool.
Basic Usage:
awk 'pattern {action}' filename
Examples:
Printing specific columns:
$ cat data.txt
John 85 92
Lisa 93 88
Mike 78 85
$ awk '{print $1, $3}' data.txt
John 92
Lisa 88
Mike 85
Computing with columns:
$ awk '{print $1, $2 + $3}' data.txt
John 177
Lisa 181
Mike 163
Filtering rows with conditions:
$ awk '$2 >= 90 {print $1, "has high score:", $2}' data.txt
Lisa has high score: 93
Setting the field separator with -F:
$ cat comma_data.csv
John,85,92
Lisa,93,88
Mike,78,85
$ awk -F, '{print $1, $3}' comma_data.csv
John 92
Lisa 88
Mike 85
Computing averages:
$ awk '{sum += $2} END {print "Average:", sum/NR}' data.txt
Average: 85.3333
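The END pattern generalizes nicely: awk's associative arrays let you compute one result per key rather than one overall total. As a sketch on the same data.txt, averaging each student's two scores:

```shell
# Recreate the sample data from above
printf 'John 85 92\nLisa 93 88\nMike 78 85\n' > data.txt

# Store each student's total, then print one average per student
awk '{ total[$1] = $2 + $3 } END { for (n in total) print n, total[n] / 2 }' data.txt
```

The iteration order of `for (n in total)` is unspecified, so pipe the output through sort if you need it in a stable order.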
Text Analysis Tools
Beyond the core tools, Ubuntu offers additional utilities for text analysis:
1. sort - Sort Lines of Text
$ cat fruits.txt
banana
apple
orange
kiwi
$ sort fruits.txt
apple
banana
kiwi
orange
Numeric sort with the -n flag:
$ cat numbers.txt
10
2
100
5
$ sort -n numbers.txt
2
5
10
100
Reverse sorting with the -r flag:
$ sort -r fruits.txt
orange
kiwi
banana
apple
2. uniq - Report or Filter Repeated Lines
$ cat repeated.txt
apple
apple
banana
orange
orange
orange
$ uniq repeated.txt
apple
banana
orange
Counting occurrences with the -c flag:
$ uniq -c repeated.txt
2 apple
1 banana
3 orange
Note: uniq only detects adjacent duplicate lines, so it's commonly used after sort.
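This adjacency behavior is easy to see with a small demonstration (file name illustrative):

```shell
# Duplicates present, but never on consecutive lines
printf 'b\na\nb\na\n' > mixed.txt

uniq mixed.txt          # removes nothing: no duplicates are adjacent
sort mixed.txt | uniq   # sorting groups duplicates first, so they collapse
```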
3. wc - Word, Line, Character, and Byte Count
$ wc file.txt
5 20 100 file.txt
This shows:
- 5 lines
- 20 words
- 100 bytes
Counting only lines with -l, words with -w, or bytes with -c:
$ wc -l file.txt
5 file.txt
$ wc -w file.txt
20 file.txt
$ wc -c file.txt
100 file.txt
Advanced Text Processing Techniques
Let's explore some more advanced techniques that combine multiple tools:
1. Pipelines
One of the most powerful features in Ubuntu text processing is the ability to chain commands using pipes (|):
$ cat log.txt | grep "ERROR" | sort | uniq -c
5 ERROR: Failed to connect to database
2 ERROR: Invalid input parameters
10 ERROR: Server timeout
This pipeline:
- Reads log.txt
- Filters only lines containing "ERROR"
- Sorts the errors
- Counts unique error occurrences
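Appending another sort ranks the counted errors by frequency. A sketch with a small invented log:

```shell
# Build an illustrative log file
printf 'ERROR: timeout\nINFO: started\nERROR: timeout\nERROR: bad input\n' > log.txt

# Count each distinct ERROR line, most frequent first
grep "ERROR" log.txt | sort | uniq -c | sort -nr
```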
2. Regular Expressions
Regular expressions (regex) enable powerful pattern matching:
# Find all lines containing email addresses
$ grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" contacts.txt
# Find all lines with phone numbers in format XXX-XXX-XXXX
$ grep -E "[0-9]{3}-[0-9]{3}-[0-9]{4}" contacts.txt
3. Text Extraction With cut
The cut command extracts portions of text from each line:
$ cat names.csv
John,Doe,35
Jane,Smith,28
Michael,Johnson,42
# Extract first names
$ cut -d ',' -f 1 names.csv
John
Jane
Michael
# Extract first and last names
$ cut -d ',' -f 1,2 names.csv
John,Doe
Jane,Smith
Michael,Johnson
4. Transforming Data With tr
The tr command translates or deletes characters:
# Convert lowercase to uppercase
$ echo "hello world" | tr 'a-z' 'A-Z'
HELLO WORLD
# Delete specific characters
$ echo "hello world" | tr -d 'aeiou'
hll wrld
# Squeeze repeated characters
$ echo "hello world" | tr -s ' '
hello world
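tr also combines naturally with sort and uniq; for instance, a quick word-frequency count (splitting on single spaces only, which is a simplification):

```shell
# One word per line, then count and rank the words
echo "the quick fox and the lazy dog and the cat" |
  tr ' ' '\n' | sort | uniq -c | sort -nr
```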
Real-World Applications
Let's look at some practical examples of text processing in real-world scenarios:
Example 1: Log Analysis
Analyzing Apache web server logs to find the most common IP addresses:
$ cat access.log | cut -d ' ' -f 1 | sort | uniq -c | sort -nr | head -5
156 192.168.1.105
142 192.168.1.107
98 192.168.1.103
87 192.168.1.115
65 192.168.1.109
This shows the top 5 IP addresses with the number of requests each made.
Example 2: CSV Data Processing
Processing a CSV file to calculate averages:
$ cat sales.csv
Region,Product,Sales
North,Widget A,5200
South,Widget A,4800
East,Widget B,6100
West,Widget B,5300
North,Widget B,5800
# Calculate average sales by product
$ grep "Widget A" sales.csv | cut -d ',' -f 3 | awk '{ sum += $1 } END { print "Widget A average:", sum/NR }'
Widget A average: 5000
$ grep "Widget B" sales.csv | cut -d ',' -f 3 | awk '{ sum += $1 } END { print "Widget B average:", sum/NR }'
Widget B average: 5733.33
Example 3: System Monitoring
Monitoring system processes and memory usage:
# Find top 5 memory-consuming processes
$ ps aux | sort -nrk 4 | head -5
# Count processes by user
$ ps aux | cut -d ' ' -f 1 | sort | uniq -c
# Find zombie processes (state Z in the STAT column)
$ ps aux | awk '$8 ~ /^Z/'
Example 4: Batch File Renaming
Using text processing to rename multiple files:
# Rename all .txt files to .md
$ for file in *.txt; do mv "$file" "${file%.txt}.md"; done
# Add prefix to all image files
$ for file in *.jpg; do mv "$file" "vacation_$file"; done
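Since mv is not easily reversible, it can help to preview a batch rename by echoing the commands before running them. A sketch in a scratch directory (names illustrative):

```shell
# Work in a throwaway directory with sample files
mkdir -p rename_demo && cd rename_demo
touch alpha.txt beta.txt

# Dry run: print the mv commands without executing them
for file in *.txt; do echo mv "$file" "${file%.txt}.md"; done

# Output looks right, so run it for real
for file in *.txt; do mv "$file" "${file%.txt}.md"; done
```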
Text Processing Workflow
When approaching a text processing task, a practical workflow is:
- Inspect the data first (cat, head, wc) to understand its structure
- Filter the lines you care about (grep)
- Transform or extract fields (sed, awk, cut, tr)
- Aggregate and summarize (sort, uniq, wc)
- Build the pipeline incrementally, checking the output after each stage
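As a small end-to-end illustration of working incrementally (the data here is invented for the example):

```shell
# Hypothetical survey responses, one per line
printf 'yes\nno\nyes\nmaybe\nyes\nno\n' > responses.txt

# First inspect: how much data is there?
wc -l responses.txt

# Then aggregate: count each distinct answer, most common first
sort responses.txt | uniq -c | sort -nr
```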
Summary
Ubuntu text processing provides a powerful toolkit for manipulating and analyzing text data directly from the command line. The core utilities—grep, sed, and awk—combined with supporting tools like sort, uniq, cut, and tr offer almost limitless possibilities for text transformation.
Key takeaways:
- grep excels at pattern matching and searching
- sed is perfect for text substitution and editing
- awk provides programming capabilities for complex text processing
- Pipelines allow you to chain commands for powerful workflows
- Regular expressions enable sophisticated pattern matching
Mastering these text processing tools and techniques will significantly enhance your productivity and capabilities when working with Ubuntu or any Linux-based system.
Exercises
- Create a script that extracts all email addresses from a text file.
- Write a command to find the 5 most frequent words in a text file.
- Process a CSV file of student grades to calculate the average score for each student.
- Create a pipeline that finds all files in your home directory that contain a specific text string.
- Write a sed command to replace all occurrences of "error" with "ERROR" in a log file, but only on lines that contain the word "critical".
Additional Resources
- The GNU Grep Manual: man grep
- The GNU Sed Manual: man sed
- The GNU Awk Manual: man awk
- The Linux Documentation Project (TLDP) Text Processing Commands
- Ubuntu Community Help Wiki
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)