Table of Contents
- Introduction
- Example Code
- Not using select_related and prefetch_related
- Adding null to a CharField or TextField
- Descending versus ascending with order_by or latest
- Forgetting that clean is not called on save
- Not including update_fields on save
- Final Thoughts
Introduction
Django is a fantastic framework for building web applications and takes a batteries included approach. With this approach, comes complexity. When you’re starting out with Django, you can introduce subtle bugs due to lack of knowledge. I wrote this post partially for myself to reference in my professional development, but also to help illuminate some mistakes that I see. As with anything, there are tradeoffs with this approach. Django contains a lot of great code that you would have to write for every web application and reinvent the wheel, but when you don’t fully understand the framework, you can write code with unintended effects. For this post, we’ll develop an example Django application that handles employee management for various organizations.
Example Code
from django.contrib.auth import get_user_model
from django.core.exceptions import ValidationError
from django.db import models
User = get_user_model()
class Organization(models.Model):
name = models.CharField(max_length=100)
datetime_created = models.DateTimeField(auto_now_add=True, editable=False)
is_active = models.BooleanField(default=True)
class Employee(models.Model):
user = models.ForeignKey(
User, on_delete=models.CASCADE, related_name="employees"
)
organization = models.ForeignKey(
Organization, on_delete=models.CASCADE, related_name="employees"
)
is_currently_employed = models.BooleanField(default=True)
reference_id = models.CharField(null=True, blank=True, max_length=255)
last_clock_in = models.DateTimeField(null=True, blank=True)
datetime_created = models.DateTimeField(auto_now_add=True, editable=False)
def clean(self):
try:
if self.last_clock_in < self.datetime_created:
raise ValidationError(
"Last clock in must occur after the employee entered"
" the system."
)
except TypeError:
# Raises TypeError if there is no last_clock_in because
# you cant compare None to datetime
pass
Not using select_related and prefetch_related
Let’s say we write some code that loops over every active organization’s employees.
for org in Organization.objects.filter(is_active=True):
for emp in org.employees.all():
if emp.is_currently_employed:
do_something(org, emp)
This loop results in a query for every employee in our database. This could lead to tens of thousands of queries, which slows down our application. However, if we add a prefetch_related
to the organization query, we minimize the amount of queries.
for org in Organization.objects.filter(is_active=True).prefetch_related(
"employees"
):
Adding these methods results in large performance improvements without much work, but adding them can be easy to forget. For a ForeignKey
or OneToOneField
, use select_related. For a reverse ForeignKey
or a ManyToManyField
, use prefetch_related. We could make this more efficient by starting at the employee table and using the database to filter out results. We still need to add select_related
since the function do_something
uses the employee’s organization. If we don’t, the loop can result in thousands of queries to the organization table.
for emp in Employee.objects.filter(
organization__is_active=True, is_currently_employed=True
).select_related("organization"):
do_something(emp.organization, emp)
Adding null to a CharField or TextField
Django’s documentation recommends not adding null=True
to a CharField
. Looking at our example code, the employee’s reference id contains null=True
. In our example application, we optionally integrate with our customer’s employee tracking system, and we use reference_id
as the integrated system’s id.
reference_id = models.CharField(null=True, blank=True, max_length=255)
Adding null=True
means the field has two “no data” values, null and the empty string. Conventionally, Django uses the empty string to represent no data. By also having null as a “no data” value, we can introduce subtle bugs. Let’s say we need to write some code to fetch data from our customer’s system.
if employee.reference_id is not None:
fetch_employee_record(employee)
Ideally, you write the if statement to handle any “no data” values by using if employee.reference_id:
, but I’ve found this does not happen in practice. Since a reference_id
could be either null or the empty string, we’ve created a bug here where the system attempts to fetch the employee record if the reference_id
is the empty string. Clearly, this doesn’t work and will cause an error in our system. According to Django’s documentation, one exception exists for adding null=True
to a CharField
. When you need to add both blank=True
and unique=True
to a CharField
, null=True
is required.
Descending versus ascending with order_by or latest
Django’s order_by defaults to ascending order. By adding a -
in front of your keywords, you tell Django to provide descending order. Let’s look at an example.
oldest_organization_first = Organization.objects.order_by("datetime_created")
newest_organization_first = Organization.objects.order_by("-datetime_created")
With the minus in front of datetime_created
, Django gives us the newest organization first. Conversely without the minus, we get the oldest organization first. Mistaking the default ascending order can result in very subtle bugs. Django querysets also come with latest, which gives us the latest object in a table based on the passed keyword fields. The latest
method defaults to descending order as opposed to order_by
which defaults to ascending order.
oldest_organization_first = Organization.objects.latest("-datetime_created")
newest_organization_first = Organization.objects.latest("datetime_created")
In multiple projects, I have introduced bugs because of the different defaults between latest
and order_by
. Write order_by
and latest
queries with caution. Let’s look at equivalent queries using latest
and order_by
.
>>> oldest_org = Organization.objects.order_by("datetime_created")[:1][0]
>>> oldest_other_org = Organization.objects.latest("-datetime_created")
>>> oldest_org == oldest_other_org
True
>>> newest_org = Organization.objects.order_by("-datetime_created")[:1][0]
>>> newest_other_org = Organization.objects.latest("datetime_created")
>>> newest_org == newest_other_org
True
Forgetting that clean is not called on save
According to Django’s documentation, model validation methods, such as clean
, validate_unique
, and clean_fields
, do not get called automatically by a model’s save
method. In our example code, the employee model contains a clean method that says the last_clock_in
shouldn’t occur before the employee entered the system.
def clean(self):
try:
if self.last_clock_in < self.datetime_created:
raise ValidationError(
"Last clock in must occur after the employee entered"
" the system."
)
except TypeError:
# Raises TypeError if there is no last_clock_in because
# you cant compare None to datetime
pass
Let’s say we have a view that updates the last_clock_in
time for our employee, and as a part of that view, we update the employee by calling save
.
from django.http import HttpResponse
from django.shortcuts import get_object_or_404
from django.views.decorators.http import require_http_methods
from example_project.helpers import parse_request
from example_project.models import Employee
@require_http_methods(["POST"])
def update_employee_last_clock_in(request, employee_pk):
clock_in_datetime = parse_request(request)
employee = get_object_or_404(Employee, pk=employee_pk)
employee.last_clock_in = clock_in_datetime
employee.save()
return HttpResponse(status=200)
In our example view, we call save
without calling clean
or full_clean
, meaning that the clock_in_datetime
passed into our view could occur before the employee’s datetime_created
and still be saved into the database. This results in invalid data entering our database. Let’s fix our bug.
employee.last_clock_in = clock_in_datetime
employee.full_clean()
employee.save()
Now if the clock_in_datetime
comes before the employee’s datetime_created
, full_clean
raises a ValidationError
, which prevents the invalid data from entering our database.
Not including update_fields on save
Django Model’s save method includes a keyword argument called update_fields. In a typical production set up for Django, people use gunicorn to run multiple Django server processes on the same machine and use celery to run background processes. When you call save
without update_fields
, the entire model updates with the values in memory. Let’s look at the actual SQL to illustrate.
>>> user = User.objects.get(id=1)
>>> user.first_name = "Steven"
>>> user.save()
UPDATE "users_user"
SET "password" = 'some_hash',
"last_login" = '2021-02-25T22:43:41.033881+00:00'::timestamptz,
"is_superuser" = false,
"username" = 'stevenapate',
"first_name" = 'Steven',
"last_name" = '',
"email" = 'steven@laac.dev',
"is_staff" = false,
"is_active" = true,
"date_joined" = '2021-02-19T21:08:50.885795+00:00'::timestamptz,
WHERE "users_user"."id" = 1
>>> user.first_name = "NotSteven"
>>> user.save(update_fields=["first_name"])
UPDATE "users_user"
SET "first_name" = 'NotSteven'
WHERE "users_user"."id" = 1
The first call to save
without update_fields
results in every field on the user model being saved. With update_fields
, only first_name
updates. In a production environment with frequent writes, calling save
without update_fields
can result in a race condition. Let’s say we have two processes running, a gunicorn worker running our Django server and a celery worker. On a set schedule, the celery worker queries an external API and potentially updates the user’s is_active
.
from celery import task
from django.contrib.auth import get_user_model
from example_project.external_api import get_user_status
User = get_user_model()
@task
def update_user_status(user_pk):
user = User.objects.get(pk=user_pk)
user_status = get_user_status(user)
if user_status == "inactive":
user.is_active = False
user.save()
The celery worker starts the task, loads the entire user object into memory, and queries the external API, but the external API takes longer than expected. While the celery worker waits for the external API, the same user connects to our gunicorn worker and submits an update to their email, changing it from steven@laac.dev to steven@stevenapate.com. After the email update commits to the database, the external API responds, and the celery worker updates the user’s is_active
to False
.
In this case, the celery worker overwrites the email update because the worker loaded the entire user object into memory before the email update committed. When the celery worker loaded the user into memory, the user’s email is steven@laac.dev. This email sits in memory until the external API responds and overwrites the email update. In the end, the row representing the user inside the database contains the old email address, steven@laac.dev, and is_active=False
. Let’s change the code to prevent this race condition.
if user_status == "inactive":
user.is_active = False
user.save(update_fields=["is_active"])
If the previous scenario occurred with the updated code, the user’s email remains steven@stevenapate.com after the celery worker updates is_active
because the update only writes the is_active
field. Only on rare occasions, such as creating a new object, should you call save
without update_fields
. For more reading on this topic, I recommend this ticket on the Django bug tracker and this pull request for Django REST Framework.
Important Note
While you can solve this problem in your code base by not calling a plain save
method, third party Django packages might contain this issue. For instance, Django REST Framework does not use update_fields
on PATCH requests. Django REST Framework is a fantastic package that I love to use, but it does not handle this issue. Keep this in mind when adding third party packages to your Django project.
Final Thoughts
I’ve made all of these mistakes, some multiple times. I hope this post sheds light on potential bugs and helps you prevent these mistakes. I enjoy using Django, and I think it’s a great framework for building web applications. However, with any big framework, the complexity can obfuscate, and until you know what to look out for, mistakes can be made.
from Planet Python
via read more
No comments:
Post a Comment