Introduction
Counting how often each word occurs in a Python list is a relatively common task, especially when creating distribution data for histograms.
Say we have the list ['b', 'b', 'a']: it contains two occurrences of "b" and one of "a". This guide shows four different ways to count word occurrences in a Python list:
- Using Pandas and Numpy
- Using the count() Function
- Using the Collection Module's Counter
- Using a Loop and a Counter Variable
In practice, you'll use Pandas/Numpy, the count() function or a Counter, as they're pretty convenient to use.
Using Pandas and Numpy
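The shortest and easiest way to get value counts in an easily manipulable format (a Pandas Series) is via Numpy and Pandas. Newer Pandas releases deprecate the top-level pd.value_counts() function in favor of pd.Series(words).value_counts(), which returns the same result. We can wrap the list in a Numpy array, and then pass it to the value_counts() function of the pd module (a method of the same name is also available on Series and DataFrame instances):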
import numpy as np
import pandas as pd
words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']
pd.value_counts(np.array(words))
This results in a Series that contains:
hello 3
goodbye 1
bye 1
howdy 1
hi 1
dtype: int64
You can access its values attribute to get the counts themselves, or index to get the words themselves:
df = pd.value_counts(np.array(words))
print('Index:', df.index)
print('Values:', df.values)
This results in:
Index: Index(['hello', 'goodbye', 'bye', 'howdy', 'hi'], dtype='object')
Values: [3 1 1 1 1]
Using the count() Function
The "standard" way (no external libraries) to get the count of word occurrences in a list is by using the list object's count() method.
The count() method is built into every list; it takes an element as its only argument and returns the number of times that element appears in the list.
The complexity of count() is O(n), where n is the number of elements in the list.
The code below uses count() to get the number of occurrences of a word in a list:
words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']
print(f'"hello" appears {words.count("hello")} time(s)')
print(f'"howdy" appears {words.count("howdy")} time(s)')
This gives us the following output:
"hello" appears 3 time(s)
"howdy" appears 1 time(s)
The count() method offers us an easy way to get the number of occurrences in a list for each individual word.
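Because count() answers for one word at a time, getting a count for every distinct word means calling it once per unique word. The snippet below is a minimal sketch of that idea, assuming the same words list as above; the count_all name is hypothetical and used only for illustration.

```python
words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']

def count_all(items):
    # Hypothetical helper: call list.count() once per distinct item
    # (via set) so that repeated words are not counted again.
    return {item: items.count(item) for item in set(items)}

counts = count_all(words)
print(counts['hello'])  # 3
print(counts['bye'])    # 1
```

Each count() call still scans the whole list, so the total work grows with the list length multiplied by the number of distinct words.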
Using the Collection Module's Counter
The Counter
class instance can be used to, well, count instances of other objects. By passing a list into its constructor, we instantiate a Counter
which returns a dictionary of all the elements and their occurrences in a list.
From there, to get a single word's occurrence you can just use the word as a key for the dictionary:
from collections import Counter
words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']
word_counts = Counter(words)
print(f'"hello" appears {word_counts["hello"]} time(s)')
print(f'"howdy" appears {word_counts["howdy"]} time(s)')
This results in:
"hello" appears 3 time(s)
"howdy" appears 1 time(s)
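Beyond key lookup, a Counter has a few behaviors that plain dictionaries lack. The sketch below, which reuses the same words list and adds a made-up word 'farewell' purely for illustration, shows three of them: most_common(), a zero default for missing keys, and update().

```python
from collections import Counter

words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']
word_counts = Counter(words)

# most_common(n) returns the n highest counts as (word, count) pairs.
print(word_counts.most_common(1))  # [('hello', 3)]

# A word that never appeared yields 0 instead of raising KeyError.
print(word_counts['farewell'])  # 0

# update() folds more words into the existing tallies.
word_counts.update(['hi', 'hi'])
print(word_counts['hi'])  # 3
```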
Using a Loop and a Counter Variable
Ultimately, a brute-force approach that loops through every word in the list, increments a counter by one when the word is found, and returns the total count will work!
Of course, an explicit Python loop is slower than the built-in count() (both scan the whole list, so the cost grows with the list size); it's just conceptually easy to understand and implement.
The code below uses this approach in the count_occurrence() function:
def count_occurrence(words, word_to_count):
    count = 0
    for word in words:
        if word == word_to_count:
            # update counter variable
            count = count + 1
    return count

words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']
print(f'"hello" appears {count_occurrence(words, "hello")} time(s)')
print(f'"howdy" appears {count_occurrence(words, "howdy")} time(s)')
If you run this code you should see this output:
"hello" appears 3 time(s)
"howdy" appears 1 time(s)
Nice and easy!
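The same loop idea extends to counting every word in a single pass by keeping one counter per word in a dictionary. This is a minimal sketch under that assumption; count_all_occurrences is a hypothetical name, not something from a library.

```python
def count_all_occurrences(words):
    counts = {}
    for word in words:
        # dict.get() supplies 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    return counts

words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']
counts = count_all_occurrences(words)
print(counts)
# {'hello': 3, 'goodbye': 1, 'howdy': 1, 'hi': 1, 'bye': 1}
```

This is essentially what a Counter does for you, written out by hand.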
Most Efficient Solution?
Naturally, you'll be looking for the most efficient solution if you're dealing with large corpora of words. Let's benchmark all of these to see how they perform.
The task can be broken down into finding occurrences for all words or a single word, and we'll be doing benchmarks for both, starting with all words:
import numpy as np
import pandas as pd
import collections

def pdNumpy(words):
    def _pdNumpy():
        return pd.value_counts(np.array(words))
    return _pdNumpy

def countFunction(words):
    def _countFunction():
        counts = []
        for word in words:
            counts.append(words.count(word))
        return counts
    return _countFunction

def counterObject(words):
    def _counterObject():
        return collections.Counter(words)
    return _counterObject

import timeit
words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']
print("Time to execute:\n")
print("Pandas/Numpy: %ss" % timeit.Timer(pdNumpy(words)).timeit(1000))
print("count(): %ss" % timeit.Timer(countFunction(words)).timeit(1000))
print("Counter: %ss" % timeit.Timer(counterObject(words)).timeit(1000))
Which results in:
Time to execute:
Pandas/Numpy: 0.33886080000047514s
count(): 0.0009540999999444466s
Counter: 0.0019409999995332328s
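On this seven-word list, the count() approach is by far the fastest of the three; however, it doesn't give us the labels associated with the counts like the other two do. Keep in mind that calling count() once per element is O(n²) overall, so its advantage shrinks and eventually reverses as the list grows, whereas Counter makes a single pass.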
If you need the labels, Counter outperforms the slower process of wrapping the list in a Numpy array and then counting with Pandas.
On the other hand, the Pandas result gives you methods for sorting and other manipulation that you wouldn't otherwise have. Counter has some unique methods as well.
Ultimately, you can use a Counter to create a dictionary and turn that dictionary into a DataFrame as well, to leverage the speed of Counter and the versatility of DataFrames:
df = pd.DataFrame.from_dict([collections.Counter(words)]).T
If you don't need the labels and the list is short, count() is the way to go.
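The timings above come from a seven-word list, so it's worth repeating the comparison on something larger before drawing conclusions. The sketch below is one way to do that; the synthetic data (2,000 words drawn from a 200-word vocabulary), the seed, and the function names are all assumptions made for illustration, and the numbers you get will depend on your machine.

```python
import collections
import random
import timeit

# Hypothetical test data: 2,000 words drawn from a 200-word vocabulary.
random.seed(0)
vocabulary = [f'word{i}' for i in range(200)]
big_words = [random.choice(vocabulary) for _ in range(2000)]

def count_per_element(words):
    # Mirrors the benchmark above: one full scan of the list per element.
    return [words.count(word) for word in words]

def counter_object(words):
    # A single pass over the list.
    return collections.Counter(words)

print("count(): %ss" % timeit.timeit(lambda: count_per_element(big_words), number=5))
print("Counter: %ss" % timeit.timeit(lambda: counter_object(big_words), number=5))
```

Since the count() variant performs one scan per element, its running time should grow roughly with the square of the list length, while Counter grows linearly.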
Alternatively, if you're looking for a single word:
import numpy as np
import pandas as pd
import collections

def countFunction(words, word_to_search):
    def _countFunction():
        return words.count(word_to_search)
    return _countFunction

def counterObject(words, word_to_search):
    def _counterObject():
        return collections.Counter(words)[word_to_search]
    return _counterObject

def bruteForce(words, word_to_search):
    def _bruteForce():
        count = 0
        for word in words:
            if word == word_to_search:
                # update counter variable
                count = count + 1
        return count
    return _bruteForce

import timeit
words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']
print("Time to execute:\n")
print("count(): %ss" % timeit.Timer(countFunction(words, 'hello')).timeit(1000))
print("Counter: %ss" % timeit.Timer(counterObject(words, 'hello')).timeit(1000))
print("Brute Force: %ss" % timeit.Timer(bruteForce(words, 'hello')).timeit(1000))
Which results in:
Time to execute:
count(): 0.0001573999998072395s
Counter: 0.0019498999999996158s
Brute Force: 0.0005682000000888365s
The brute-force search and count() both outperform Counter, mainly because Counter inherently counts all words instead of one.
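That up-front cost of a Counter pays off once you look up more than a handful of words, since the list is scanned only once. Here is a small sketch of the two strategies side by side, assuming a hypothetical list of query words (including 'farewell', which is not in the list):

```python
from collections import Counter

words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']
queries = ['hello', 'howdy', 'farewell']

# Option 1: one list.count() scan per query.
by_count = {q: words.count(q) for q in queries}

# Option 2: build the Counter once, then each lookup is a dictionary access.
word_counts = Counter(words)
by_counter = {q: word_counts[q] for q in queries}

print(by_count)    # {'hello': 3, 'howdy': 1, 'farewell': 0}
print(by_counter)  # {'hello': 3, 'howdy': 1, 'farewell': 0}
```

Both options return the same counts; which one is faster depends on how many lookups you make relative to the size of the list.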
Conclusion
In this guide, we explored several ways of counting word occurrences in a Python list, assessing the efficiency of each solution and weighing when each is more suitable.