Text Data Analysis in R: Understanding grep, grepl, sub and gsub

R Programming
grep
grepl
sub
gsub
regex
text analysis
Author

M. Fatih Tüzen

Published

July 9, 2024

Modified

July 9, 2024

https://carlalexander.ca/beginners-guide-regular-expressions/

Introduction

In text data analysis, being able to search for patterns, validate their existence, and perform substitutions is crucial. R provides powerful base functions like grep, grepl, sub, and gsub to handle these tasks efficiently. This blog post will delve into how these functions work, using examples ranging from simple to complex, to show how they can be leveraged for text manipulation, classification, and grouping tasks.

1. Understanding grep and grepl

What is grep?

  • Functionality: Searches for matches to a specified pattern in a vector of character strings.

  • Usage: grep(pattern, x, ...)

  • Example: Searching for specific words or patterns in text.

What is grepl?

  • Functionality: Returns a logical vector indicating whether a pattern is found in each element of a character vector.

  • Usage: grepl(pattern, x, ...)

  • Example: Checking if specific patterns exist in text data.

Differences, Advantages, and Disadvantages

  • Differences: grep returns indices or values matching the pattern, while grepl returns a logical vector.

  • Advantages: Fast pattern matching over large datasets.

  • Disadvantages: Exact matching without inherent flexibility for complex patterns.

2. Using sub and gsub for Text Substitution

What is sub?

  • Functionality: Replaces the first occurrence of a pattern in a string.

  • Usage: sub(pattern, replacement, x, ...)

  • Example: Substituting specific patterns with another string.

What is gsub?

  • Functionality: Replaces all occurrences of a pattern in a string.

  • Usage: gsub(pattern, replacement, x, ...)

  • Example: Global substitution of patterns throughout text data.

Differences, Advantages, and Disadvantages

  • Differences: sub replaces only the first occurrence, while gsub replaces all occurrences.

  • Advantages: Efficient for bulk text replacements.

  • Disadvantages: Lack of advanced pattern matching features compared to other libraries.

3. Practical Examples with a Synthetic Dataset

Example Dataset

For the purposes of this blog post, we’ll create a synthetic dataset. This dataset is a data frame that contains two columns: id and text. Each row represents a unique text entry with a corresponding identifier.

# Creating a synthetic data frame
text_data <- data.frame(
  id = 1:15,
  text = c("Cats are great pets.",
           "Dogs are loyal animals.",
           "Birds can fly high.",
           "Fish swim in water.",
           "Horses run fast.",
           "Rabbits hop quickly.",
           "Cows give milk.",
           "Sheep have wool.",
           "Goats are curious creatures.",
           "Lions are the kings of the jungle.",
           "Tigers have stripes.",
           "Elephants are large animals.",
           "Monkeys are very playful.",
           "Giraffes have long necks.",
           "Zebras have black and white stripes.")
)

Explanation of the Dataset

  • id Column: This is a simple identifier for each row, ranging from 1 to 15.

  • text Column: This contains various sentences about different animals. Each text string is unique and describes a characteristic or trait of the animal mentioned.

Applying grep, grepl, sub, and gsub

Example 1: Using grep to find specific words

# Find rows containing the word 'are'
indices <- grep("are", text_data$text, ignore.case = TRUE)
result_grep <- text_data[indices, ]
result_grep
   id                               text
1   1               Cats are great pets.
2   2            Dogs are loyal animals.
9   9       Goats are curious creatures.
10 10 Lions are the kings of the jungle.
12 12       Elephants are large animals.
13 13          Monkeys are very playful.

Explanation: grep("are", text_data$text, ignore.case = TRUE) searches for the word “are” in the text column of text_data, ignoring case, and returns the indices of the matching rows. The resulting rows will be displayed.

Example 2: Applying grepl for conditional checks

# Add a new column indicating if the word 'fly' is present

text_data$contains_fly <- grepl("fly", text_data$text)
text_data
   id                                 text contains_fly
1   1                 Cats are great pets.        FALSE
2   2              Dogs are loyal animals.        FALSE
3   3                  Birds can fly high.         TRUE
4   4                  Fish swim in water.        FALSE
5   5                     Horses run fast.        FALSE
6   6                 Rabbits hop quickly.        FALSE
7   7                      Cows give milk.        FALSE
8   8                     Sheep have wool.        FALSE
9   9         Goats are curious creatures.        FALSE
10 10   Lions are the kings of the jungle.        FALSE
11 11                 Tigers have stripes.        FALSE
12 12         Elephants are large animals.        FALSE
13 13            Monkeys are very playful.        FALSE
14 14            Giraffes have long necks.        FALSE
15 15 Zebras have black and white stripes.        FALSE

Explanation: grepl("fly", text_data$text) checks each element of the text column for the presence of the word “fly” and returns a logical vector. This vector is then added as a new column contains_fly.

Example 3: Using sub to replace a pattern in text

# Replace the first occurrence of 'a' with 'A' in the text column

text_data$text_sub <- sub(" a ", " A ", text_data$text)
text_data[,c("text","text_sub")]
                                   text                             text_sub
1                  Cats are great pets.                 Cats are great pets.
2               Dogs are loyal animals.              Dogs are loyal animals.
3                   Birds can fly high.                  Birds can fly high.
4                   Fish swim in water.                  Fish swim in water.
5                      Horses run fast.                     Horses run fast.
6                  Rabbits hop quickly.                 Rabbits hop quickly.
7                       Cows give milk.                      Cows give milk.
8                      Sheep have wool.                     Sheep have wool.
9          Goats are curious creatures.         Goats are curious creatures.
10   Lions are the kings of the jungle.   Lions are the kings of the jungle.
11                 Tigers have stripes.                 Tigers have stripes.
12         Elephants are large animals.         Elephants are large animals.
13            Monkeys are very playful.            Monkeys are very playful.
14            Giraffes have long necks.            Giraffes have long necks.
15 Zebras have black and white stripes. Zebras have black and white stripes.

Explanation: sub(" a ", " A ", text_data$text) replaces the first occurrence of ’ a ’ with ’ A ’ in each element of the text column. The resulting text is stored in a new column text_sub.

Example 4: Applying gsub for global pattern replacement

# Replace all occurrences of 'a' with 'A' in the text column

text_data$text_gsub <- gsub(" a ", " A ", text_data$text)
text_data[,c("text","text_gsub")]
                                   text                            text_gsub
1                  Cats are great pets.                 Cats are great pets.
2               Dogs are loyal animals.              Dogs are loyal animals.
3                   Birds can fly high.                  Birds can fly high.
4                   Fish swim in water.                  Fish swim in water.
5                      Horses run fast.                     Horses run fast.
6                  Rabbits hop quickly.                 Rabbits hop quickly.
7                       Cows give milk.                      Cows give milk.
8                      Sheep have wool.                     Sheep have wool.
9          Goats are curious creatures.         Goats are curious creatures.
10   Lions are the kings of the jungle.   Lions are the kings of the jungle.
11                 Tigers have stripes.                 Tigers have stripes.
12         Elephants are large animals.         Elephants are large animals.
13            Monkeys are very playful.            Monkeys are very playful.
14            Giraffes have long necks.            Giraffes have long necks.
15 Zebras have black and white stripes. Zebras have black and white stripes.

Explanation: gsub(" a ", " A ", text_data$text) replaces all occurrences of ’ a ’ with ’ A ’ in each element of the text column. The resulting text is stored in a new column text_gsub.

Example 5: Text-based Grouping and Assignment

Let’s group the texts based on the presence of the word “bird” and assign a category.

# Add a new column 'category' based on the presence of the word 'fly'

text_data$category <- ifelse(grepl("fly", text_data$text, ignore.case = TRUE), "Can Fly", "Cannot Fly")
text_data[,c("text","category")]
                                   text   category
1                  Cats are great pets. Cannot Fly
2               Dogs are loyal animals. Cannot Fly
3                   Birds can fly high.    Can Fly
4                   Fish swim in water. Cannot Fly
5                      Horses run fast. Cannot Fly
6                  Rabbits hop quickly. Cannot Fly
7                       Cows give milk. Cannot Fly
8                      Sheep have wool. Cannot Fly
9          Goats are curious creatures. Cannot Fly
10   Lions are the kings of the jungle. Cannot Fly
11                 Tigers have stripes. Cannot Fly
12         Elephants are large animals. Cannot Fly
13            Monkeys are very playful. Cannot Fly
14            Giraffes have long necks. Cannot Fly
15 Zebras have black and white stripes. Cannot Fly

Explanation: grepl("fly", text_data$text, ignore.case = TRUE) checks for the presence of the word “fly” in each element of the text column, ignoring case. The ifelse function is then used to create a new column category, assigning “Can Fly” if the word is present and “Cannot Fly” otherwise.

Additional Examples

Example 6: Using grep to find multiple patterns

# Find rows containing the words 'great' or 'loyal'
indices <- grep("great|loyal", text_data$text, ignore.case = TRUE)
text_data[indices,c("text") ]
[1] "Cats are great pets."    "Dogs are loyal animals."

Explanation: grep("great|loyal", text_data$text, ignore.case = TRUE) searches for the words “great” or “loyal” in the text column, ignoring case, and returns the indices of the matching rows. The resulting rows will be displayed.

Example 7: Using gsub for complex substitutions

# Replace all occurrences of 'animals' with 'creatures' and 'pets' with 'companions'

text_data$text_gsub_complex <- gsub("animals", "creatures", gsub("pets", "companions", text_data$text))
text_data[,c("text","text_gsub_complex")]
                                   text                    text_gsub_complex
1                  Cats are great pets.           Cats are great companions.
2               Dogs are loyal animals.            Dogs are loyal creatures.
3                   Birds can fly high.                  Birds can fly high.
4                   Fish swim in water.                  Fish swim in water.
5                      Horses run fast.                     Horses run fast.
6                  Rabbits hop quickly.                 Rabbits hop quickly.
7                       Cows give milk.                      Cows give milk.
8                      Sheep have wool.                     Sheep have wool.
9          Goats are curious creatures.         Goats are curious creatures.
10   Lions are the kings of the jungle.   Lions are the kings of the jungle.
11                 Tigers have stripes.                 Tigers have stripes.
12         Elephants are large animals.       Elephants are large creatures.
13            Monkeys are very playful.            Monkeys are very playful.
14            Giraffes have long necks.            Giraffes have long necks.
15 Zebras have black and white stripes. Zebras have black and white stripes.

Explanation: The inner gsub replaces all occurrences of ‘pets’ with ‘companions’, and the outer gsub replaces all occurrences of ‘animals’ with ‘creatures’ in each element of the text column. The resulting text is stored in a new column text_gsub_complex.

Example 8: Using grepl with multiple conditions

# Add a new column indicating if the text contains either 'large' or 'playful'

text_data$contains_large_or_playful <- grepl("large|playful", text_data$text)
text_data[,c("text","contains_large_or_playful")]
                                   text contains_large_or_playful
1                  Cats are great pets.                     FALSE
2               Dogs are loyal animals.                     FALSE
3                   Birds can fly high.                     FALSE
4                   Fish swim in water.                     FALSE
5                      Horses run fast.                     FALSE
6                  Rabbits hop quickly.                     FALSE
7                       Cows give milk.                     FALSE
8                      Sheep have wool.                     FALSE
9          Goats are curious creatures.                     FALSE
10   Lions are the kings of the jungle.                     FALSE
11                 Tigers have stripes.                     FALSE
12         Elephants are large animals.                      TRUE
13            Monkeys are very playful.                      TRUE
14            Giraffes have long necks.                     FALSE
15 Zebras have black and white stripes.                     FALSE

Explanation: grepl("large|playful", text_data$text) checks each element of the text column for the presence of the words “large” or “playful” and returns a logical vector. This vector is then added as a new column contains_large_or_playful.

4. Understanding Regular Expressions

Regular expressions (regex) are powerful tools used for pattern matching and text manipulation. They allow you to define complex search patterns using a combination of literal characters and special symbols. R’s grep, grepl, sub, and gsub functions all support the use of regular expressions.

Key Components of Regular Expressions

  • Literal Characters: These are the basic building blocks of regex. For example, cat matches the string “cat”.

  • Metacharacters: Special characters with unique meanings, such as ^, $, ., *, +, ?, |, [], (), {}

    • ^ matches the start of a string.

    • $ matches the end of a string.

    • . matches any single character except a newline.

    • * matches zero or more occurrences of the preceding element.

    • + matches one or more occurrences of the preceding element.

    • ? matches zero or one occurrence of the preceding element.

    • | denotes alternation (or).

    • [] matches any one of the characters inside the brackets.

    • () groups elements together.

    • {} specifies a specific number of occurrences.

Examples with Regular Expressions

Using the same synthetic dataset, let’s explore how to apply regular expressions with grep, grepl, sub, and gsub.

Example 1: Matching Text that Starts with a Specific Word

# Find rows where text starts with the word 'Cats'
indices <- grep("^Cats", text_data$text)
text_data[indices,c("text")]
[1] "Cats are great pets."

Explanation: grep("^Cats", text_data$text) uses the ^ metacharacter to find rows where the text starts with “Cats”.

Example 2: Matching Text that Ends with a Specific Word

# Find rows where text ends with the word 'water.'
indices <- grep("water\\.$", text_data$text)
text_data[indices,c("text")]
[1] "Fish swim in water."

Explanation: grep("water\\.$", text_data$text) uses the $ metacharacter to find rows where the text ends with “water.” The \\. is used to escape the dot character, which is a metacharacter in regex.

Example 3: Matching Text that Contains a Specific Pattern

# Find rows where text contains 'great' followed by any character and 'pets'
indices <- grep("great.pets", text_data$text)
text_data[indices,c("text")]
[1] "Cats are great pets."

Explanation: grep("great.pets", text_data$text) uses the . metacharacter to match any character between “great” and “pets”.

Example 4: Using gsub with Regular Expressions

# Replace all occurrences of words starting with 'C' with 'Animal'
text_data$text_gsub_regex <- gsub("\\bC\\w+", "Animal", text_data$text)
text_data[,c("text","text_gsub_regex")]
                                   text                      text_gsub_regex
1                  Cats are great pets.               Animal are great pets.
2               Dogs are loyal animals.              Dogs are loyal animals.
3                   Birds can fly high.                  Birds can fly high.
4                   Fish swim in water.                  Fish swim in water.
5                      Horses run fast.                     Horses run fast.
6                  Rabbits hop quickly.                 Rabbits hop quickly.
7                       Cows give milk.                    Animal give milk.
8                      Sheep have wool.                     Sheep have wool.
9          Goats are curious creatures.         Goats are curious creatures.
10   Lions are the kings of the jungle.   Lions are the kings of the jungle.
11                 Tigers have stripes.                 Tigers have stripes.
12         Elephants are large animals.         Elephants are large animals.
13            Monkeys are very playful.            Monkeys are very playful.
14            Giraffes have long necks.            Giraffes have long necks.
15 Zebras have black and white stripes. Zebras have black and white stripes.

Explanation: gsub("\\bC\\w+", "Animal", text_data$text) replaces all words starting with ‘C’ (\\b indicates a word boundary, C matches the character ‘C’, and \\w+ matches one or more word characters) with “Animal”.

Example 5: Using grepl to Check for Complex Patterns

# Add a new column indicating if the text contains a word ending with 's'
text_data$contains_s_end <- grepl("\\b\\w+s\\b", text_data$text)
text_data[,c("text","contains_s_end")]
                                   text contains_s_end
1                  Cats are great pets.           TRUE
2               Dogs are loyal animals.           TRUE
3                   Birds can fly high.           TRUE
4                   Fish swim in water.          FALSE
5                      Horses run fast.           TRUE
6                  Rabbits hop quickly.           TRUE
7                       Cows give milk.           TRUE
8                      Sheep have wool.          FALSE
9          Goats are curious creatures.           TRUE
10   Lions are the kings of the jungle.           TRUE
11                 Tigers have stripes.           TRUE
12         Elephants are large animals.           TRUE
13            Monkeys are very playful.           TRUE
14            Giraffes have long necks.           TRUE
15 Zebras have black and white stripes.           TRUE

Explanation: grepl("\\b\\w+s\\b", text_data$text) checks each element of the text column for the presence of a word ending with ‘s’. Here, \\b indicates a word boundary, \\w+ matches one or more word characters, and s matches the character ‘s’.

Conclusion

The grep, grepl, sub, and gsub functions in R are powerful tools for text data analysis. They allow for efficient searching, pattern matching, and text manipulation, making them essential for any data analyst or data scientist working with textual data. By understanding how to use these functions and leveraging regular expressions, you can perform a wide range of text processing tasks, from simple searches to complex pattern replacements and text-based classifications.