Friday, December 31, 2021

Stack Abuse: Count Number of Word Occurrences in List Python

Introduction

Counting how often each word appears in a Python list is a relatively common task - especially when creating distribution data for histograms.

Say we have a list ['b', 'b', 'a'] - we have two occurrences of "b" and one of "a". This guide will show you four different ways to count the number of word occurrences in a Python list:

  • Using Pandas and Numpy
  • Using the count() Function
  • Using the Collections Module's Counter
  • Using a Loop and a Counter Variable

In practice, you'll use Pandas/Numpy, the count() method or a Counter, as they're pretty convenient to use.

Using Pandas and Numpy

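The shortest and easiest way to get value counts in an easily-manipulable format (a Pandas Series) is via Numpy and Pandas. We can wrap the list into a Numpy array, and then call the value_counts() function of the pd module (value_counts() is also available as a method on Series and DataFrame instances). Note that the top-level pd.value_counts() function is deprecated as of Pandas 2.1 in favor of pd.Series(words).value_counts():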

import numpy as np
import pandas as pd

words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']

pd.value_counts(np.array(words))

This results in a Series that contains:

hello      3
goodbye    1
bye        1
howdy      1
hi         1
dtype: int64

You can access its values field to get the counts themselves, or index to get the words themselves:

df = pd.value_counts(np.array(words))

print('Index:', df.index)
print('Values:', df.values)

This results in:

Index: Index(['hello', 'goodbye', 'bye', 'howdy', 'hi'], dtype='object')

Values: [3 1 1 1 1]

Using the count() Function

The "standard" way (no external libraries) to get the count of word occurrences in a list is by using the list object's count() method.

The count() method is a built-in list method that takes an element as its only argument and returns the number of times that element appears in the list.

The time complexity of count() is O(n), where n is the number of elements in the list.

The code below uses count() to get the number of occurrences for a word in a list:

words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']

print(f'"hello" appears {words.count("hello")} time(s)')
print(f'"howdy" appears {words.count("howdy")} time(s)')

This results in:

"hello" appears 3 time(s)
"howdy" appears 1 time(s)

The count() method offers us an easy way to get the number of word occurrences in a list for each individual word.
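Because count() scans the list once per call, collecting the counts for every distinct word takes one scan per distinct word. A minimal sketch of that idea, assuming a hypothetical helper named count_each_word (not part of the original guide), which iterates over set(words) so that each distinct word is counted only once:

```python
def count_each_word(words):
    """Return a {word: count} dict built from repeated list.count() calls.

    Hypothetical helper for illustration: iterating over set(words) avoids
    recounting duplicates, but each count() call still scans the whole list.
    """
    return {word: words.count(word) for word in set(words)}


words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']
counts = count_each_word(words)

print(counts['hello'])  # 3
print(counts['hi'])     # 1
```

This keeps the labels alongside the counts, at the cost of rescanning the list for every distinct word.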

Using the Collections Module's Counter

The Counter class can be used to, well, count instances of other objects. By passing a list into its constructor, we instantiate a Counter, a dictionary subclass that maps each element to its number of occurrences in the list.

From there, to get a single word's occurrence you can just use the word as a key for the dictionary:

from collections import Counter

words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']

word_counts = Counter(words)

print(f'"hello" appears {word_counts["hello"]} time(s)')
print(f'"howdy" appears {word_counts["howdy"]} time(s)')

This results in:

"hello" appears 3 time(s)
"howdy" appears 1 time(s)
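One practical difference from a plain dictionary is how a Counter handles words that never appear. The short sketch below illustrates this with the same list; the word 'farewell' is just an arbitrary example of a missing key:

```python
from collections import Counter

words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']
word_counts = Counter(words)

# A word that never appears yields 0; a plain dict would raise KeyError.
missing = word_counts['farewell']
print(missing)  # 0

# The lookup above does not add the missing word as a key.
print('farewell' in word_counts)  # False

# Counter subclasses dict, so it converts to a plain dict directly.
plain = dict(word_counts)
print(plain['hello'])  # 3
```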

Using a Loop and a Counter Variable

Ultimately, a brute force approach that loops through every word in the list, incrementing a counter by one whenever the word is found and returning the total count, will work!

Of course, this method is slower than the built-in alternatives, since in CPython the loop runs as interpreted Python code rather than compiled C, but it's conceptually easy to understand and implement.

The code below uses this approach in the count_occurrence() function:

def count_occurrence(words, word_to_count):
    count = 0
    for word in words:
        if word == word_to_count:
            # update counter variable
            count = count + 1
    return count


words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']
print(f'"hello" appears {count_occurrence(words, "hello")} time(s)')
print(f'"howdy" appears {count_occurrence(words, "howdy")} time(s)')

If you run this code you should see this output:

"hello" appears 3 time(s)
"howdy" appears 1 time(s)

Nice and easy!
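The same loop can be extended to count every word in a single pass by keeping one counter per word in a dictionary. This is a sketch under that assumption; count_all_occurrences is a hypothetical name introduced here for illustration:

```python
def count_all_occurrences(words):
    """Count every word in one pass, using a dict of counter variables."""
    counts = {}
    for word in words:
        # Start from 0 the first time a word is seen, then add one.
        counts[word] = counts.get(word, 0) + 1
    return counts


words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']
all_counts = count_all_occurrences(words)

print(f'"hello" appears {all_counts["hello"]} time(s)')  # "hello" appears 3 time(s)
print(f'"howdy" appears {all_counts["howdy"]} time(s)')  # "howdy" appears 1 time(s)
```

This is essentially what a Counter does for you, written out by hand.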

Most Efficient Solution?

Naturally - you'll be searching for the most efficient solution if you're dealing with large corpora of words. Let's benchmark all of these to see how they perform.

The task can be broken down into finding occurrences for all words or a single word, and we'll be doing benchmarks for both, starting with all words:

import numpy as np
import pandas as pd
import collections

def pdNumpy(words):
    def _pdNumpy():
        return pd.value_counts(np.array(words))
    return _pdNumpy

def countFunction(words):
    def _countFunction():
        counts = []
        for word in words:
            counts.append(words.count(word))
        return counts
    return _countFunction

def counterObject(words):
    def _counterObject():
        return collections.Counter(words)
    return _counterObject
    
import timeit

words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']

print("Time to execute:\n")
print("Pandas/Numpy: %ss" % timeit.Timer(pdNumpy(words)).timeit(1000))
print("count(): %ss" % timeit.Timer(countFunction(words)).timeit(1000))
print("Counter: %ss" % timeit.Timer(counterObject(words)).timeit(1000))

Which results in:

Time to execute:

Pandas/Numpy: 0.33886080000047514s
count(): 0.0009540999999444466s
Counter: 0.0019409999995332328s

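On this seven-word list, the count() method is extremely fast compared to the other variants. However, it doesn't give us the labels associated with the counts like the other two do, and because it rescans the whole list for every word, its running time grows quadratically with the list size, while a Counter counts everything in a single pass.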

If you need the labels - the Counter outperforms the inefficient process of wrapping the list in a Numpy array and then counting.

On the other hand, you can make use of DataFrame's methods for sorting or other manipulation that you can't do otherwise. Counter has some unique methods as well.
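As an illustration of what a Counter offers beyond plain lookups, the sketch below uses three of its standard features: most_common(), update(), and subtraction between counters.

```python
from collections import Counter

word_counts = Counter(['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye'])

# most_common(n) returns the n highest counts as (word, count) pairs.
top = word_counts.most_common(1)
print(top)  # [('hello', 3)]

# update() adds counts from another iterable in place.
word_counts.update(['hi', 'hi'])
print(word_counts['hi'])  # 3

# Subtracting counters keeps only the words whose count stays positive.
remaining = word_counts - Counter(['hello', 'hello'])
print(remaining['hello'])  # 1
```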

Ultimately, you can use the Counter to create a dictionary and turn the dictionary into a DataFrame as well, to leverage the speed of Counter and the versatility of DataFrames:

# collections and pd are imported in the benchmark script above
df = pd.DataFrame.from_dict([collections.Counter(words)]).T

If you don't need the labels and the list is short - count() is the way to go.
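The timings above come from a seven-word list, so it is worth checking how the approaches behave on larger input. The sketch below is one way to do that; make_words, counts_via_count and counts_via_counter are hypothetical helpers, and the list size of 2000 and vocabulary of 100 synthetic words are arbitrary assumptions. Since count() inside a loop rescans the list for every element, its work grows quadratically, so it should fall behind the Counter as the list grows. Exact timings will vary by machine, so none are shown.

```python
import random
import timeit
from collections import Counter


def make_words(size, vocabulary_size=100, seed=0):
    """Build a reproducible list of synthetic words (hypothetical helper)."""
    rng = random.Random(seed)
    vocabulary = [f'word{i}' for i in range(vocabulary_size)]
    return [rng.choice(vocabulary) for _ in range(size)]


def counts_via_count(words):
    # One full scan of the list per element: quadratic in len(words).
    return [words.count(word) for word in words]


def counts_via_counter(words):
    # One pass to build the Counter, then one lookup per element.
    word_counts = Counter(words)
    return [word_counts[word] for word in words]


words = make_words(2000)

# Both approaches must agree before their timings are worth comparing.
assert counts_via_count(words) == counts_via_counter(words)

for label, func in [('count()', counts_via_count), ('Counter', counts_via_counter)]:
    elapsed = timeit.timeit(lambda: func(words), number=5)
    print(f'{label}: {elapsed:.4f}s')
```

Increasing the size passed to make_words() makes the gap between the two approaches easier to see.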

Alternatively, if you're looking for a single word:

import numpy as np
import pandas as pd
import collections

def countFunction(words, word_to_search):
    def _countFunction():
        return words.count(word_to_search)
    return _countFunction

def counterObject(words, word_to_search):
    def _counterObject():
        return collections.Counter(words)[word_to_search]
    return _counterObject

def bruteForce(words, word_to_search):
    def _bruteForce():
        count = 0
        for word in words:
            if word == word_to_search:
                # update counter variable
                count = count + 1
        # return the single total, matching count_occurrence() above
        return count
    return _bruteForce
    
import timeit

words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']

print("Time to execute:\n")
print("count(): %ss" % timeit.Timer(countFunction(words, 'hello')).timeit(1000))
print("Counter: %ss" % timeit.Timer(counterObject(words, 'hello')).timeit(1000))
print("Brute Force: %ss" % timeit.Timer(bruteForce(words, 'hello')).timeit(1000))

Which results in:

Time to execute:

count(): 0.0001573999998072395s
Counter: 0.0019498999999996158s
Brute Force: 0.0005682000000888365s

The brute force search and count() methods outperform the Counter, mainly because the Counter inherently counts all words instead of one.

Conclusion

In this guide, we explored counting word occurrences in a Python list, assessing the efficiency of each solution and weighing when each is more suitable.



from Planet Python