Sunday, February 2, 2020

Ionel Cristian Maries: Speeding up Django pagination

I assume you have already read Optimizing the Django Admin Paginator. If not, this is basically the take-away from that article:

class InfinityPaginator(Paginator):
    @property
    def count(self):
        return 99999999999

class MyAdmin(admin.ModelAdmin):
    paginator = InfinityPaginator
    show_full_result_count = False

Though the article has a trick with using a statement_timeout, I think it's pointless. In the real world you should expect to get that overt 99999999999 count all over the place. Unless you have some sort of toy project it's very likely your database will be under load. Add some user/group filtering and you'll be always hit the time limit.

What if you could make the count more realistic, but still cheap? Using a random number would be too inconsistent. Strangely enough someone decided that it's a good idea to put a count estimate idea in the postgresql wiki and, for reasons I decided to see how hard is to implement it in django, in a somewhat generalized fashion

From a series of "Just because you can, you have to try it!", behold [1]:

class EstimatedQuerySet(models.QuerySet):
    estimate_bias = 1.2
    estimate_threshold = 100

    def estimated_count(self):

        if self._result_cache is not None:
            return self.count()
        try:
            qs = self.model._base_manager.all()
            compiler = self.query.get_compiler('default')
            where, params = compiler.compile(self.query.where)
            qs = qs.extra(where=[where] if where else None, params=params)

            cursor = connections[self.db].cursor()
            query = qs.query.clone()
            query.add_annotation(Count('*'), alias='__count', is_summary=True)
            query.clear_ordering(True)
            query.select_for_update = False
            query.select_related = False
            query.select = []
            query.default_cols = False
            sql, params = query.sql_with_params()
            logger.info('Running EXPLAIN %s', sql)
            cursor.execute("EXPLAIN %s" % sql, params)
            lines = cursor.fetchall()
            logger.info('Got EXPLAIN result:\n> %s',
                        '\n>   '.join(line for line, in lines))
            marker = ' on %s ' % self.model._meta.db_table
            for line, in lines:
                if marker in line:
                    for part in line.split():
                        if part.startswith('rows='):
                            logger.info('Found size (%s) estimate in query EXPLAIN: %s',
                                        part, line)
                            count = int(int(part[5:]) * self.estimate_bias)
                            if count < self.estimate_threshold:
                                # Unreliable, will make views with lots of filtering
                                # output confusing results.
                                # Just do normal count, shouldn't be that slow.
                                # (well, not much slower than the actual query)
                                return self.count()
                            else:
                                return count

            return qs.count()
        except Exception as exc:
            logger.warning("Failed to estimate queryset count: %s", exc)
            return self.count()

Because the normal count method is unchanged you can use that QuerySet everywhere.

class MyModel(models.Model):
    ...

    objects = EstimatedQuerySet.as_manager()

Now using the estimated_count in the paginator will uncover a problem: sometimes it will underestimate. You can play with the estimate_bias but it will never work well with edge-cases (like heavy filtering).

A good compromise is to tune it for the general case and for everything else trick the pagination to always increment the page count when you're looking at the last page.

class EstimatedPaginator(Paginator):
    def validate_number(self, number):
        if number >= self.num_pages:
            # noinspection PyPropertyAccess
            self.num_pages = number + 1
        return super(EstimatedPaginator, self).validate_number(number)

    @cached_property
    def count(self):
        return self.object_list.estimated_count()

class MyAdmin(admin.ModelAdmin):
    paginator = EstimatedPaginator
    show_full_result_count = False

If you think that # noinspection PyPropertyAccess is funny it's because it is - num_pages is a cached_property and the following line destroys PyCharm's assumptions about how non-data descriptors should work.

It also goes against sane practices like not having unexpected side-effects. But alas, it gets worse. There's another problem there: there's always going to be a next page even if the current page is empty (or not full). To fix that we mess again with the internals:

def _get_page(self, objects, *args, **kwargs):
    # If page ain't full it means that it's the real last page, remove the extra.
    if len(objects) < self.per_page:
        # noinspection PyPropertyAccess
        self.num_pages -= 1
    return super(EstimatedPaginator, self)._get_page(objects, *args, **kwargs)

One could still input an out of bounds page number through in the URL but I think it's pointless to handle that.

What about that PyPropertyAccess? *

Suppose you have this:

class cached_property:
    def __init__(self, func, name=None):
        self.func = func

    def __get__(self, instance, cls=None):
        if instance is None:
            return self
        res = instance.__dict__[self.func.__name__] = self.func(instance)
        return res

class Foobar:
    @cached_property
    def foo(self):
        return "bar"

Because cached_property doesn't implement a __set__, assignments will be made through the instance's __dict__:

>>> x = Foobar()
>>> x.foo = '123'
>>> x.foo
'123'
>>> y = Foobar()
>>> y.foo += '123'
>>> y.foo
'bar123'

I suspect that PyCharm doesn't discern data vs non-data descriptors at all. Or perhaps it's a subtle hint that it's a bad idea to assign to something that doesn't implement a setter?

[1] Though you should be wondering if you want to take a look at this hard-to-test method every time you upgrade Django ...


from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...