Monday, November 15, 2021

Stack Abuse: Python: Count Number of Substring Occurrences in String

Introduction

A substring is a continuous sequence of characters within a String. For instance, "substring" is a substring of "Find a substring within a string".

Strings in Python are arrays of bytes representing Unicode characters and one of the most commonly used data types to represent data in a human-readable format.

In this article we will learn how to count the number of occurrences of a specified substring within a string in Python.

Find All Occurrences of a Substring in a String Using count()

The count() method of the string class actually does just this. It returns the number of times a specified value (substring) appears in the string. It also has two optional parameters - start and end, denoting the start and end of the search space:

string.count(value, start, end)

Note: The default start is 0, and the default end is the length of the string.

Let's take a look at the usage of the method, with a representative sentence:

# Define string and substring
str1 = 'John has 1 apple, Sarah has 2 apples, Mike has 5 apples.'
substr = 'apples'

# Occurences of substring 'apples' in the string
result = str1.count(substr)
print("Number of substring occurrences:", result)

# Occurences of substring 'apples' from index 0 to index 40
start, end = 0, 40
result2 = str1.count(substr, start, end)
print("Number of substring occurrences:", result2)

This results in:

Number of substring occurrences: 2
Number of substring occurrences: 1

It is a very simple and straightforward method that works well in most cases. It's efficient, and can scale up well to large input sizes. For instance, we could load in a large piece of text and search for a common word or a stopword which is bound to be present.

You can also simply obtain a large search space to get a sense for the efficiency. Let's download 'Romeo and Juliet' by William Shakespeare, from Project Gutenberg, and retrieve the number of times 'Romeo' is mentioned:

import time
import requests

txt = requests.get('https://www.gutenberg.org/cache/epub/1513/pg1513.txt').text
print(f"Downloaded {len(txt)} bytes of text...")

start_time = time.time()
count = txt.count('Romeo')
end_time = time.time()

print(f"Time to find all occurences of 'Romeo': {end_time - start_time}s with {count} results")

This results in:

Downloaded 167333 bytes of text...
Time to find all occurences of 'Romeo': 0.0s with 153 results

Or, even if we find a much more common word, such as 'a':

start_time = time.time()
count = txt.count('a')
end_time = time.time()

print(f"Time to find all occurences of 'a': {end_time - start_time}s with {count} results")

The result is the same:

Downloaded 167333 bytes of text...
Time to find all occurences of 'Romeo': 0.0s with 8308 results

The majority of the execution time is taken by the time it takes to download the text.

Note: This method does not return the position in the string at which the substring occurs.

If you require this knowledge, either to perform additional transformational operations on the occurences besides counting them - you'll want to use a Regular Expressions to find their positions or check individual cases with startsWith().

We'll be taking a look at these two cases in the following sections.

Find All Occurrences and Positions of a Substring in a String in Python

The startswith() method returns True if the string starts with the specified value (substring) and False if it does not. Similarly to count() method, this method also has optional parameters start and end that specify the start and end positions of the search space:

string.startswith(value, start, end)

The default start value is 0 and the default end value is the length of the string.

Using this method is a bit more complex, since it requires us to use list comprehension along with the method itself, or a more traditional for loop. The startswith() method returns the beginning indices of the substring. After that we utilize list comprehension to iterate through the entire search space:

# Define string and substring
str1 = 'John has 1 apple, Sarah has 2 apples, Mike has 5 apples.'
substr = 'apples'

# Print original string and substring
print("Original string is:", str1)
print("Substring is:", substr)

# Sse startswith() and list comprehension
# Find all occurrences of a substring within a string
result = [i for i in range(len(str1)) if str1.startswith(substr, i)]

# Print the number of substring occurrences
print("Number of substring occurrences is:", len(result))

# We can also find the starting indices of substrings
print("Starting indices of substrings are: " + str(result))

This nets us the number of occurences, like last time, but also the starting positions of the strings themselves. Since we know the string in question, and thus, its length - we can easily deduce the space it ocuppies in the search string:

Original string is: John has 1 apple, Sarah has 2 apples, Mike has 5 apples.
Substring is: apples
Number of substring occurrences is: 2
Starting indices of substrings are: [30, 49]

Find All Occurrences of a Substring in a String in Python Using re.finditer()

The finditer() function is part of Python's RegEx library - re. It is most commonly used to find the occurrence of a particular pattern within a given string.

To enable the usage of this method, along with many other methods that handle RegEx expressions, we first need to import the regex library:

re.finditer(pattern, string, flags=0)

If you would like to learn more about Regular Expressions, read our Guide to Regular Expressions in Python!

The re.finditer() function returns an iterator yielding matched objects over all non-overlapping matches for the RegEx pattern in a string. The scan is performed from left to right, and matches are returned in the order they are found in. Empty matches are included as well.

Flags can be used to enable various unique features and syntax variations (for example, re.I or re.IGNORECASE flag enables case insensitive matching, re.A or re.ASCII flag enables ASCII only matching instead of the usual full UNICODE matching).

Let's replace the list comprehension from before with a Regular Expression:

import re

# Define string and substring
str1 = 'John has 1 apple, Sarah has 2 apples, Mike has 5 apples.'
substr = 'apples'

# Print original string and substring
print("Original string is:", str1)
print("Substring is:", substr)

# Use re.finditer() to find all substring occurrences
# Using list comprehension we find the start and end indices of every substring occurence
result = [(_.start(), _.end()) for _ in re.finditer(substr, str1)]

# Print number of substrings found
print("Number of substring occurrences is:", len(result))

# Print start and end indices of substring occurrences
print("The start and end indices of the substrings are: " + str(result))

This results in:

Original string is: John has 1 apple, Sarah has 2 apples, Mike has 5 apples.
Substring is: apples
Number of substring occurrences is: 2
The start and end indices of the substrings are: [(30, 36), (49, 55)]

Now, we don't have to manually add up the length of the strings to the starting indices.

Benchmarking Performance

It's worth noting that the performance will vary based on the method you choose. While in all cases, the code will end fairly quickly - it's still worth taking the performance into account on really large search spaces.

Let's use these three methods to find all instances of the character 'a' in 'Romeo and Juliet':

import re
import time
import requests

txt = requests.get('https://www.gutenberg.org/cache/epub/1513/pg1513.txt').text
print(f"Downloaded {len(txt)} bytes of text...")

start_time_1 = time.time()
result_1 = txt.count('a')
end_time_1 = time.time()

print(f"String.count(): Time to find all occurences of 'a': {end_time_1 - start_time_1}s")

start_time_2 = time.time()
result_2 = [i for i in range(len(txt)) if txt.startswith('a', i)]
end_time_2 = time.time()

print(f"List Comprehensions: Time to find all occurences of 'a': {end_time_2 - start_time_2}s")

start_time_3 = time.time()
result_3 = [(_.start(), _.end()) for _ in re.finditer('a', txt)]
end_time_3 = time.time()

print(f"Regex: Time to find all occurences of 'a': {end_time_3 - start_time_3}s")

This results in:

String.count(): Time to find all occurences of 'a': 0.0s
List Comprehensions: Time to find all occurences of 'a': 0.031008481979370117s
Regex: Time to find all occurences of 'a': 0.002000093460083008s

The count() method is definitely the most efficient one, but it doesn't let us know where the strings are. For the additional knowledge - Regular Expressions are still extremely fast for this task, and more than 10 times as efficient as our manual list comprehension loop.

Conclusion

There multiple different ways to solve this problem, some used more often then others, depending on the data you'd like to extract in the process.

In the benchmark, the count() method outperformed the other two, but it doesn't give us information on where the substrings are located. On the other hand, Regular Expressions, albeit slower, provide us with this information.

It's worth noting that all three approaches are exceptionally fast and can parse an entire literary masterpiece for a common word in a fraction of a second.



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...