Monday, June 28, 2021

death and gravity: Why I wrote my own SQL query builder

Previously

This is the third article in a series about writing an SQL query builder in 150 lines of Python, why I wrote it, how I thought about it, and the decisions I had to make.

Today, we'll talk about:

  • why I decided to write my own
  • what alternatives I considered
  • why I didn't use an existing library
  • how I knew it wouldn't become a maintenance burden

This goes through the whole story in rough chronological order; if you're only interested in the list of libraries I looked at, you can find it at the end.

Background #

reader is a Python feed reader library – it provides a high-level API for building feed readers, allowing users to retrieve, store, and manage web feeds without having to deal with feed-related details.

It's mostly a hobby / learning project, so I can only spend a limited amount of time on it, sometimes quite far in between. It's not necessarily about learning technologies; rather, it is about library design, writing code that's maintainable long-term, and restraint – if I were to make the best library I could, what would it look like?

Because of this, the conclusions in this article, if any, may not be directly applicable to regular work projects. With reader, I have different constraints over different time scales, a somewhat different definition of success, and more freedom in some aspects.

However, not all projects are the same, and not all parts of a project are the same. Sometimes, this kind of long-term thinking can be useful, and it can actually be achieved through a combination of planning, saying no to or postponing some features, and strategical technical debt.

One of reader's main features is filtering articles: by article metadata, by user-set metadata (read, important, feeds tags), and by full-text search, sorted in different ways, and with cursor pagination.

In May 2019, with less than half of the above implemented, the function building the SQL query was over 100 lines, and I had already added this comment:

This needs some sort of query builder so badly.

The first prototype #

So I opened an issue, and did some research.

At some point I stumbled upon sqlbuilder.mini, which was built around an interesting insight – queries can be represented as plain datastructures:

>>> sql = [
...     'SELECT', ['first_name'],
...     'FROM', ['author'],
...     'WHERE', ['status', '==', P('new')],
... ]
>>> compile(sql)
('SELECT first_name FROM author WHERE status == %s', ['new'])

You can then modify the query directly:

>>> sql[sql.index('SELECT') + 1].append('last_name')
>>> compile(sql)
('SELECT first_name, last_name FROM author WHERE status == %s', ['new'])

Or via a wrapper that simplifies navigation:

>>> sql = Q(sql)
>>> sql.append_child(
...     ['SELECT'],  # path
...     ['age']
... )  # returns self to allow method chaining
<sqlbuilder.mini.Q object>
>>> compile(sql)
('SELECT first_name, last_name, age FROM author WHERE status == %s', ['new'])

I really liked how simple and flexible this is, and the choice of not dealing with SQL correctness or dialects – a middle ground between build-your-own-string and "proper" query builders. On the other hand, modifying things seemed too verbose (even with the wrapper), and the generated SQL is all on one line.

Surely, it would be possible to take this idea and make it look like sql.SELECT('age'), right?

So I made a prototype, with no real intention of using it, just to see how easy it is to do. The result was about 80 lines, and I was quite happy with it; my thoughts at the time:

The end result looks nice, but using it would add ~150 lines of code (that need to be tested), and it's less useful for simpler queries.

Also, it looks nice now, when I just wrote it; 6 months from now it may be hard to understand.

Afraid that I'm too happy with it, and with my curiosity satisfied, I proceeded to do just that: wait six months.

Requirements, and existing libraries #

My main concern with making my own was that over time, with various additions and fixes, the effort spent on it would be greater than getting an existing library to do what I needed.

To deal with it, I did two things.


First, I came up with detailed requirements.

Whatever I ended up using, it had to support the following features:

  • SELECT with conditional WHERE, ORDER BY, JOIN etc. (example)
  • scrolling window queries (should be possible to implement)
  • common table expressions (WITH)
  • arbitrary SQL (so I don't have to use the query builder for everything)
  • the order in which you add clauses shouldn't matter

Because reader is a library, I wanted to keep the number of (transitive) dependencies as small as possible, since any extra dependency gets passed down to the users.

Also, the solution should be easy to understand and maintain, and it should be possible to support additional SQL features, if needed.

Both to keep things simple and due to historical reasons, I didn't want to switch to an abstraction layer like SQLAlchemy Core just for query building – I needed (and still need) only SQLite support, and already had code to deal with stuff like migrations.


Second, I did a slightly more serious survey of existing libraries.

I didn't feel any of the ones I looked at was ideal, for at least one of the following reasons.

  • They came with a full abstraction layer. This isn't bad in itself, but meant I had to switch everything, eventually – using a mix would make things worse long-term.
  • They had too many features. Usually this is good, but it means there's more of everything: more features, more documentation to go through, more concepts to keep in your head, more things contributors need to know/learn.
  • They didn't make things more readable or more composable.

So I chose to do nothing, and wait until more features are implemented.

The second prototype #

By May 2020, most of the features were implemented. The function building the query was 150 lines, with part that duplicated for the search query. At some point, I tried to optimize the query by using indexes, but gave up because trying various things simply took too long.

So, a full year later, I made the prototype support all the required features and a few extra (UNION, nested queries), and tried it out on the full real-world queries.

It didn't take all that long, and the whole thing remained around 100 lines.

Deciding to use my own #

At this point, most of the work was already done, and integrating it took less than an hour.

Excluding the 136 lines of the builder itself with scrolling window query support, the code went from 1400 to 1300 lines. I took that as a win, since for the price of 36 lines I was able to reuse the filtering logic. (Now, one year later, it enabled a lot more reuse, without growing significantly.)

I ended up keeping it, because:

  • Using an existing library would take too much effort. (I'll reconsider when the requirements change, though.)
  • It is tiny, which makes it relatively easy to understand and modify. The two prototypes made me quite confident it's likely to stay that way. Because it's only used internally, I can leave out a lot of nice things that aren't actually needed.
  • It has 0 dependencies. That's even better than 1.
  • reader already great test coverage, so very little additional testing was required.

Other alternatives #

Here's a non-exhaustive list of other things I looked at. I'm only covering the libraries I actually considered using, or that are interesting in some way. There are others out there; some aren't actively maintained, some I simply missed.

Do nothing #

It's worth keeping in mind that "do nothing" is always an option – probably the first option to consider, in most cases.

There's two kinds of doing nothing: passive, where you wait for new requirements to come up – for the problem to reveal itself –, and active, where you explore options, but don't commit to anything just yet.

I ended up doing both, to a point.

Disable parts of the query #

An interesting (but quickly abandoned) idea was to not build queries; instead, have just one query, and disable parts of it with boolean or optional parameters, and hope the query planner optimizes it:

SELECT ...
FROM entries
WHERE
    (:read IS NOT NULL AND entries.read = :read) AND
    ...  -- 7 more expressions like this

There are two huge issues with this:

  • I'm not sure any optimizer is that smart (also, the query might be optimized before the parameters are passed in). Even if it were, I'm not smart enough to design indexes for a query like this.
  • It doesn't seem possible to do it for JOIN, different ORDER BY terms, or even WHERE conditions dealing an arbitrary number of parameters (e.g. for tags).

Even if all of these were possible, the result would be impossible to understand.

SQLAlchemy Core, Peewee query builder #

SQLAlchemy and Peewee are both SQL toolkits / ORMs.

SQLAlchemy has over 15 years of history, and is probably the database toolkit for Python. Hell, there's even an Architecture of Open Source Applications chapter about it.

Peewee is a bit younger (~10 years), and simple and small by design.

Both have a lot of extensions1, and can be used without the ORM part; Peewee can even generate plain SQL without defining models and so on.

In the end, both seemed too complicated, and meant I had to switch to them eventually, adding the burden of researching a use case I don't have yet. However, if I ever need multi-database support, it's likely I'll use one of them.

SQLBuilder, PyPika #

SQLBuilder (SmartSQL) and PyPika are standalone query builders – no ORM, no connection management, just SQL generation; they are similar to the Peewee query builder.

SQLBuilder doesn't seem actively maintained since 2017. Aside from this, I didn't use it because it would make a potential migration to SQLAlchemy or Peewee more difficult.

PyPika I discovered while writing this article; it is actively maintained, and has somewhat better documentation.

sqlbuilder.mini #

SQLBuilder comes with another, extremely lightweight SQL builder, sqlbuilder.mini.

As dicussed at the beginning of the article, I like the overall approach (and at ~500 lines, it's small enough to vendor), but it still seems verbose, and the generated SQL isn't very readable.

JinjaSQL, SQLpy #

These two are interesting because they use templating.

JinjaSQL is exactly what you'd expect: generate SQL using Jinja templates. I didn't use it because composition through macros would still be verbose, and a bit tricky (careful with the comma after that last column).

SQLpy is similar, but different. You put your named queries in a separate file, and access them from Python as functions. Query building happens via named parameters: if you don't pass a parameter when executing the query, the lines using that parameter aren't included in the query (as you'd expect, this comes with a lot of caveats).

Pony #

I don't think I considered Pony at the time, but it's worth mentioning: it is an ORM that's been around since 2012, is actively maintained, and has commercial support.

And it can translate this into an SQL query:

select(c for c in Customer if sum(c.orders.price) > 1000)

For reader it is overkill. It does look really interesting, though – maybe too interesting?.


That's it for now.

Learned something new today? Share this with others, it really helps!

Want more? Get updates via email or Atom feed.

This is my first planned series, and still a work in progress. This means you get a say in it. Email me your questions or comments, and I'll do my best to address them in one of the future articles.


  1. That's one of the benefits of using libraries that have been around for a while. Some extensions relevant to my project: SQLAlchemy has Alembic for migrations (from the same author) and sqlakeyset for scrolling window queries; Peewee has a lot of SQLite-specific functionality. [return]



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...