Tuesday, March 2, 2021

LAAC Technology: Five Common Django Mistakes

Table of Contents

Introduction

Django is a fantastic framework for building web applications and takes a batteries included approach. With this approach, comes complexity. When you’re starting out with Django, you can introduce subtle bugs due to lack of knowledge. I wrote this post partially for myself to reference in my professional development, but also to help illuminate some mistakes that I see. As with anything, there are tradeoffs with this approach. Django contains a lot of great code that you would have to write for every web application and reinvent the wheel, but when you don’t fully understand the framework, you can write code with unintended effects. For this post, we’ll develop an example Django application that handles employee management for various organizations.

Example Code

from django.contrib.auth import get_user_model
from django.core.exceptions import ValidationError
from django.db import models
User = get_user_model()
class Organization(models.Model):
name = models.CharField(max_length=100)
datetime_created = models.DateTimeField(auto_now_add=True, editable=False)
is_active = models.BooleanField(default=True)
class Employee(models.Model):
user = models.ForeignKey(
User, on_delete=models.CASCADE, related_name="employees"
)
organization = models.ForeignKey(
Organization, on_delete=models.CASCADE, related_name="employees"
)
is_currently_employed = models.BooleanField(default=True)
reference_id = models.CharField(null=True, blank=True, max_length=255)
last_clock_in = models.DateTimeField(null=True, blank=True)
datetime_created = models.DateTimeField(auto_now_add=True, editable=False)
def clean(self):
try:
if self.last_clock_in < self.datetime_created:
raise ValidationError(
"Last clock in must occur after the employee entered"
" the system."
)
except TypeError:
# Raises TypeError if there is no last_clock_in because
# you cant compare None to datetime
pass

Let’s say we write some code that loops over every active organization’s employees.

for org in Organization.objects.filter(is_active=True):
for emp in org.employees.all():
if emp.is_currently_employed:
do_something(org, emp)

This loop results in a query for every employee in our database. This could lead to tens of thousands of queries, which slows down our application. However, if we add a prefetch_related to the organization query, we minimize the amount of queries.

for org in Organization.objects.filter(is_active=True).prefetch_related(
"employees"
):

Adding these methods results in large performance improvements without much work, but adding them can be easy to forget. For a ForeignKey or OneToOneField, use select_related. For a reverse ForeignKey or a ManyToManyField, use prefetch_related. We could make this more efficient by starting at the employee table and using the database to filter out results. We still need to add select_related since the function do_something uses the employee’s organization. If we don’t, the loop can result in thousands of queries to the organization table.

for emp in Employee.objects.filter(
organization__is_active=True, is_currently_employed=True
).select_related("organization"):
do_something(emp.organization, emp)

Adding null to a CharField or TextField

Django’s documentation recommends not adding null=True to a CharField. Looking at our example code, the employee’s reference id contains null=True. In our example application, we optionally integrate with our customer’s employee tracking system, and we use reference_id as the integrated system’s id.

reference_id = models.CharField(null=True, blank=True, max_length=255)

Adding null=True means the field has two “no data” values, null and the empty string. Conventionally, Django uses the empty string to represent no data. By also having null as a “no data” value, we can introduce subtle bugs. Let’s say we need to write some code to fetch data from our customer’s system.

if employee.reference_id is not None:
fetch_employee_record(employee)

Ideally, you write the if statement to handle any “no data” values by using if employee.reference_id:, but I’ve found this does not happen in practice. Since a reference_id could be either null or the empty string, we’ve created a bug here where the system attempts to fetch the employee record if the reference_id is the empty string. Clearly, this doesn’t work and will cause an error in our system. According to Django’s documentation, one exception exists for adding null=True to a CharField. When you need to add both blank=True and unique=True to a CharField, null=True is required.

Descending versus ascending with order_by or latest

Django’s order_by defaults to ascending order. By adding a - in front of your keywords, you tell Django to provide descending order. Let’s look at an example.

oldest_organization_first = Organization.objects.order_by("datetime_created")
newest_organization_first = Organization.objects.order_by("-datetime_created")

With the minus in front of datetime_created, Django gives us the newest organization first. Conversely without the minus, we get the oldest organization first. Mistaking the default ascending order can result in very subtle bugs. Django querysets also come with latest, which gives us the latest object in a table based on the passed keyword fields. The latest method defaults to descending order as opposed to order_by which defaults to ascending order.

oldest_organization_first = Organization.objects.latest("-datetime_created")
newest_organization_first = Organization.objects.latest("datetime_created")

In multiple projects, I have introduced bugs because of the different defaults between latest and order_by. Write order_by and latest queries with caution. Let’s look at equivalent queries using latest and order_by.

>>> oldest_org = Organization.objects.order_by("datetime_created")[:1][0]
>>> oldest_other_org = Organization.objects.latest("-datetime_created")
>>> oldest_org == oldest_other_org
True
>>> newest_org = Organization.objects.order_by("-datetime_created")[:1][0]
>>> newest_other_org = Organization.objects.latest("datetime_created")
>>> newest_org == newest_other_org
True

Forgetting that clean is not called on save

According to Django’s documentation, model validation methods, such as clean, validate_unique, and clean_fields, do not get called automatically by a model’s save method. In our example code, the employee model contains a clean method that says the last_clock_in shouldn’t occur before the employee entered the system.

def clean(self):
try:
if self.last_clock_in < self.datetime_created:
raise ValidationError(
"Last clock in must occur after the employee entered"
" the system."
)
except TypeError:
# Raises TypeError if there is no last_clock_in because
# you cant compare None to datetime
pass

Let’s say we have a view that updates the last_clock_in time for our employee, and as a part of that view, we update the employee by calling save.

from django.http import HttpResponse
from django.shortcuts import get_object_or_404
from django.views.decorators.http import require_http_methods
from example_project.helpers import parse_request
from example_project.models import Employee
@require_http_methods(["POST"])
def update_employee_last_clock_in(request, employee_pk):
clock_in_datetime = parse_request(request)
employee = get_object_or_404(Employee, pk=employee_pk)
employee.last_clock_in = clock_in_datetime
employee.save()
return HttpResponse(status=200)

In our example view, we call save without calling clean or full_clean, meaning that the clock_in_datetime passed into our view could occur before the employee’s datetime_created and still be saved into the database. This results in invalid data entering our database. Let’s fix our bug.

employee.last_clock_in = clock_in_datetime
employee.full_clean()
employee.save()

Now if the clock_in_datetime comes before the employee’s datetime_created, full_clean raises a ValidationError, which prevents the invalid data from entering our database.

Not including update_fields on save

Django Model’s save method includes a keyword argument called update_fields. In a typical production set up for Django, people use gunicorn to run multiple Django server processes on the same machine and use celery to run background processes. When you call save without update_fields, the entire model updates with the values in memory. Let’s look at the actual SQL to illustrate.

>>> user = User.objects.get(id=1)
>>> user.first_name = "Steven"
>>> user.save()
UPDATE "users_user"
SET "password" = 'some_hash',
"last_login" = '2021-02-25T22:43:41.033881+00:00'::timestamptz,
"is_superuser" = false,
"username" = 'stevenapate',
"first_name" = 'Steven',
"last_name" = '',
"email" = 'steven@laac.dev',
"is_staff" = false,
"is_active" = true,
"date_joined" = '2021-02-19T21:08:50.885795+00:00'::timestamptz,
WHERE "users_user"."id" = 1
>>> user.first_name = "NotSteven"
>>> user.save(update_fields=["first_name"])
UPDATE "users_user"
SET "first_name" = 'NotSteven'
WHERE "users_user"."id" = 1

The first call to save without update_fields results in every field on the user model being saved. With update_fields, only first_name updates. In a production environment with frequent writes, calling save without update_fields can result in a race condition. Let’s say we have two processes running, a gunicorn worker running our Django server and a celery worker. On a set schedule, the celery worker queries an external API and potentially updates the user’s is_active.

from celery import task
from django.contrib.auth import get_user_model
from example_project.external_api import get_user_status
User = get_user_model()
@task
def update_user_status(user_pk):
user = User.objects.get(pk=user_pk)
user_status = get_user_status(user)
if user_status == "inactive":
user.is_active = False
user.save()

The celery worker starts the task, loads the entire user object into memory, and queries the external API, but the external API takes longer than expected. While the celery worker waits for the external API, the same user connects to our gunicorn worker and submits an update to their email, changing it from steven@laac.dev to steven@stevenapate.com. After the email update commits to the database, the external API responds, and the celery worker updates the user’s is_active to False.

In this case, the celery worker overwrites the email update because the worker loaded the entire user object into memory before the email update committed. When the celery worker loaded the user into memory, the user’s email is steven@laac.dev. This email sits in memory until the external API responds and overwrites the email update. In the end, the row representing the user inside the database contains the old email address, steven@laac.dev, and is_active=False. Let’s change the code to prevent this race condition.

if user_status == "inactive":
user.is_active = False
user.save(update_fields=["is_active"])

If the previous scenario occurred with the updated code, the user’s email remains steven@stevenapate.com after the celery worker updates is_active because the update only writes the is_active field. Only on rare occasions, such as creating a new object, should you call save without update_fields. For more reading on this topic, I recommend this ticket on the Django bug tracker and this pull request for Django REST Framework.

Important Note

While you can solve this problem in your code base by not calling a plain save method, third party Django packages might contain this issue. For instance, Django REST Framework does not use update_fields on PATCH requests. Django REST Framework is a fantastic package that I love to use, but it does not handle this issue. Keep this in mind when adding third party packages to your Django project.

Final Thoughts

I’ve made all of these mistakes, some multiple times. I hope this post sheds light on potential bugs and helps you prevent these mistakes. I enjoy using Django, and I think it’s a great framework for building web applications. However, with any big framework, the complexity can obfuscate, and until you know what to look out for, mistakes can be made.



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...