This an optimization story that should not surprise anyone using the Django ORM. But I thought I'd share because I have numbers now! The origin of this came from a real requirement. For a given parent model, I'd like to extract the value of the name column of all its child models, and the turn all these name strings into 1 MD5 checksum string.
Variants
The first attempted looked like this:
artist = Artist.objects.get(name="Bad Religion")
names = []
for song in Song.objects.filter(artist=artist):
names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
The SQL used to generate this is as follows:
SELECT "main_song"."id", "main_song"."artist_id", "main_song"."name",
"main_song"."text", "main_song"."language", "main_song"."key_phrases",
"main_song"."popularity", "main_song"."text_length", "main_song"."metadata",
"main_song"."created", "main_song"."modified",
"main_song"."has_lastfm_listeners", "main_song"."has_spotify_popularity"
FROM "main_song" WHERE "main_song"."artist_id" = 22729;
Clearly, I don't need anything but just the name column, version 2:
artist = Artist.objects.get(name="Bad Religion")
names = []
for song in Song.objects.filter(artist=artist).only("name"):
names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
Now, the SQL used is:
SELECT "main_song"."id", "main_song"."name"
FROM "main_song" WHERE "main_song"."artist_id" = 22729;
But still, since I don't really need instances of model class Song I can use the .values() method which gives back a list of dictionaries. This is version 3:
names = []
for song in Song.objects.filter(artist=a).values("name"):
names.append(song["name"])
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
This time Django figures it doesn't even need the primary key value so it looks like this:
SELECT "main_song"."name" FROM "main_song" WHERE "main_song"."artist_id" = 22729;
Last but not least; there is an even faster one. values_list(). This time it doesn't even bother to map the column name to the value in a dictionary. And since I only need 1 column's value, I can set flat=True. Version 4 looks like this:
names = []
for name in Song.objects.filter(artist=a).values_list("name", flat=True):
names.append(name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
Same SQL gets used this time as in version 3.
The benchmark
Hopefully this little benchmark script speaks for itself:
from songsearch.main.models import *
import hashlib
def f1(a):
names = []
for song in Song.objects.filter(artist=a):
names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
def f2(a):
names = []
for song in Song.objects.filter(artist=a).only("name"):
names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
def f3(a):
names = []
for song in Song.objects.filter(artist=a).values("name"):
names.append(song["name"])
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
def f4(a):
names = []
for name in Song.objects.filter(artist=a).values_list("name", flat=True):
names.append(name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
artist = Artist.objects.get(name="Bad Religion")
print(Song.objects.filter(artist=artist).count())
print(f1(artist) == f2(artist))
print(f2(artist) == f3(artist))
print(f3(artist) == f4(artist))
# Reporting
import time
import random
import statistics
functions = f1, f2, f3, f4
times = {f.__name__: [] for f in functions}
for i in range(500):
func = random.choice(functions)
t0 = time.time()
func(artist)
t1 = time.time()
times[func.__name__].append((t1 - t0) * 1000)
for name in sorted(times):
numbers = times[name]
print("FUNCTION:", name, "Used", len(numbers), "times")
print("\tBEST", min(numbers))
print("\tMEDIAN", statistics.median(numbers))
print("\tMEAN ", statistics.mean(numbers))
print("\tSTDEV ", statistics.stdev(numbers))
I ran this on my PostgreSQL 11.1 on my MacBook Pro with Django 2.1.7. So the database is on localhost.
The results
276
True
True
True
FUNCTION: f1 Used 135 times
BEST 6.309986114501953
MEDIAN 7.531881332397461
MEAN 7.834429211086697
STDEV 2.03779968066591
FUNCTION: f2 Used 135 times
BEST 3.039121627807617
MEDIAN 3.7298202514648438
MEAN 4.012803678159361
STDEV 1.8498943539073027
FUNCTION: f3 Used 110 times
BEST 0.9920597076416016
MEDIAN 1.4405250549316406
MEAN 1.5053835782137783
STDEV 0.3523240470133114
FUNCTION: f4 Used 120 times
BEST 0.9369850158691406
MEDIAN 1.3251304626464844
MEAN 1.4017681280771892
STDEV 0.3391019435930447
Discussion
I guess the hashlib.md5("".join(names).encode("utf-8")).hexdigest() stuff is a bit "off-topic" but I checked and it's roughly 300 times faster than building up the names list.
It's clearly better to ask less of Python and PostgreSQL to get a better total time. No surprise there. What was interesting was the proportion of these differences. Memorize that and you'll be better equipped if it's worth the hassle of not using the Django ORM in the most basic form.
Also, do take note that this is only relevant in when dealing with many records. The slowest variant (f1) takes, on average, 7 milliseconds.
Summarizing the difference with percentages compared to the fastest variant:
f1- 573% slowerf2- 225% slowerf3- 6% slowerf4- 0% slower
from Planet Python
via read more

No comments:
Post a Comment