Saturday, September 28, 2019

Peter Bengtsson: How much faster is Redis at storing a blob of JSON compared to PostgreSQL?

tl;dr; Redis is 16 times faster and reading these JSON blobs.*

In Song Search when you've found a song, it loads some affiliate links to Amazon.com. (In case you're curious it's earning me lower double-digit dollars per month). To avoid overloading the Amazon Affiliate Product API, after I've queried their API, I store that result in my own database along with some metadata. Then, the next time someone views that song page, it can read from my local database. With me so far?

Example view of affiliate links

The other caveat is that you can't store these lookups locally too long since prices change and/or results change. So if my own stored result is older than a couple of hundred days, I delete it and fetch from the network again. My current implementation uses PostgreSQL (via the Django ORM) to store this stuff. The model looks like this:

class AmazonAffiliateLookup(models.Model, TotalCountMixin):
    song = models.ForeignKey(Song, on_delete=models.CASCADE)
    matches = JSONField(null=True)
    search_index = models.CharField(max_length=100, null=True)
    lookup_seconds = models.FloatField(null=True)
    created = models.DateTimeField(auto_now_add=True, db_index=True)
    modified = models.DateTimeField(auto_now=True)

At the moment this database table is 3GB on disk.

Then, I thought, why not use Redis for this. Then I can use Redis's "natural" expiration by simply setting as expiry time when I store it and then I don't have to worry about cleaning up old stuff at all.

The way I'm using Redis in this project is as a/the cache backend and I have it configured like this:

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": REDIS_URL,
        "TIMEOUT": config("CACHE_TIMEOUT", 500),
        "KEY_PREFIX": config("CACHE_KEY_PREFIX", ""),
        "OPTIONS": {
            "COMPRESSOR": "django_redis.compressors.zlib.ZlibCompressor",
            "SERIALIZER": "django_redis.serializers.msgpack.MSGPackSerializer",
        },
    }
}

The speed difference

Perhaps unrealistic but I'm doing all this testing here on my MacBook Pro. The connection to Postgres (version 11.4) and Redis (3.2.1) are both on localhost.

Reads

The reads are the most important because hopefully, they happen 10x more than writes as several people can benefit from previous saves.

I changed my code so that it would do a read from both databases and if it was found in both, write down their time in a log file which I'll later summarize. Results are as follows:

PG:
median: 8.66ms
mean  : 11.18ms
stdev : 19.48ms

Redis:
median: 0.53ms
mean  : 0.84ms
stdev : 2.26ms

(310 measurements)

It means, when focussing on the median, Redis is 16 times faster than PostgreSQL at reading these JSON blobs.

Writes

The writes are less important but due to the synchronous nature of my Django, the unlucky user who triggers a look up that I didn't have, will have to wait for the write before the XHR request can be completed. However, when this happens, the remote network call to the Amazon Product API is bound to be much slower. Results are as follows:

PG:
median: 8.59ms
mean  : 8.58ms
stdev : 6.78ms

Redis:
median: 0.44ms
mean  : 0.49ms
stdev : 0.27ms

(137 measurements)

It means, when focussing on the median, Redis is 20 times faster than PostgreSQL at writing these JSON blobs.

Conclusion and discussion

First of all, I'm still a PostgreSQL fan-boy and have no intention of ceasing that. These times are made up of much more than just the individual databases. For example, the PostgreSQL speeds depend on the Django ORM code that makes the SQL and sends the query and then turns it into the model instance. I don't know what the proportions are between that and the actual bytes-from-PG's-disk times. But I'm not sure I care either. The tooling around the database is inevitable mostly and it's what matters to users.

Both Redis and PostgreSQL are persistent and survive server restarts and crashes etc. And you get so many more "batch related" features with PostgreSQL if you need them, such as being able to get a list of the last 10 rows added for some post-processing batch job.

I'm currently using Django's cache framework, with Redis as its backend, and it's a cache framework. It's not meant to be a persistent database. I like the idea that if I really have to I can just flush the cache and although detrimental to performance (temporarily) it shouldn't be a disaster. So I think what I'll do is store these JSON blobs in both databases. Yes, it means roughly 6GB of SSD storage but it also potentially means loading a LOT more into RAM on my limited server. That extra RAM usage pretty much sums of this whole blog post; of course it's faster if you can rely on RAM instead of disk. Now I just need to figure out how RAM I can afford myself for this piece and whether it's worth it.



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...