Friday, June 18, 2021

death and gravity: When your functions take the same arguments, consider using a class: counter-examples

In a previous article, I talk about this heuristic for using classes in Python:

If you have functions that take the same set of arguments, consider using a class.

Thing is, heuristics don't always work.

To make the most out of them, it helps to know what the exceptions are.

So, let's look at a few real-world examples where functions taking the same arguments don't necessarily make a class.

Counter-example: two sets of arguments #

Consider the following scenario:

We have a feed reader web application. It shows a list of feeds and a list of entries (articles), filtered in various ways.

Because we want to do the same thing from the command-line, we pull database-specific logic into functions in a separate module. The functions take a database connection and other arguments, query the database, and return the results.

def get_entries(db, feed=None, read=None, important=None): ...
def get_entry_counts(db, feed=None, read=None, important=None): ...
def search_entries(db, query, feed=None, read=None, important=None): ...
def get_feeds(db): ...

The main usage pattern is: at the start of the program, connect to the database; depending on user input, repeatedly call the functions with the same connection, but different options.
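To make that usage pattern concrete, here is a minimal sketch, assuming a SQLite database with a made-up `entries` table. The schema and the body of get_entries() are assumptions for illustration, not the application's actual code.

```python
import sqlite3

def get_entries(db, feed=None, read=None, important=None):
    # Hypothetical implementation: one WHERE clause per option that is set.
    clauses, params = [], []
    for column, value in [("feed", feed), ("read", read), ("important", important)]:
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    query = f"SELECT title FROM entries{where} ORDER BY title"
    return db.execute(query, params).fetchall()

# Connect once, at the start of the program (made-up schema and data)...
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (title TEXT, feed TEXT, read INTEGER, important INTEGER)")
db.executemany(
    "INSERT INTO entries VALUES (?, ?, ?, ?)",
    [("one", "a", 0, 0), ("two", "a", 1, 1), ("three", "b", 0, 1)],
)

# ...then call the functions repeatedly: same connection, different options.
all_entries = get_entries(db)
unread = get_entries(db, read=False)
important_in_a = get_entries(db, feed="a", important=True)
```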


Taking the heuristic to the extreme, we end up with this:

class Storage:

    def __init__(self, db, feed=None, read=None, important=None):
        self._db = db
        self._feed = feed
        self._read = read
        self._important = important

    def get_entries(self): ...
    def get_entry_counts(self): ...
    def search_entries(self, query): ...
    def get_feeds(self): ...

This is not very useful: every time we change the options, we need to create a new Storage object (or worse, have a single one and change its attributes). Also, get_feeds() doesn't even use them – but somehow leaving it out seems just as bad.
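A small sketch of what that looks like for the caller. The get_entries() body is a stand-in that only reports which options the object is bound to, and `db` is a placeholder rather than a real connection.

```python
class Storage:

    def __init__(self, db, feed=None, read=None, important=None):
        self._db = db
        self._feed = feed
        self._read = read
        self._important = important

    def get_entries(self):
        # Stand-in body: report the options this instance would filter by.
        return (self._feed, self._read, self._important)

db = object()  # placeholder for a real database connection

# Every combination of options needs its own Storage object...
unread = Storage(db, read=False).get_entries()
important = Storage(db, important=True).get_entries()

# ...or a shared object gets mutated between calls, which is easy to get wrong:
storage = Storage(db, read=False)
storage._important = True
both = storage.get_entries()  # still filtered by read=False as well
```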

What's missing is a bit of nuance: there isn't one set of arguments, there are two, and one of them changes more often than the other.

Let's take care of the obvious one first.

The database connection changes least often, so it makes sense to keep it on the storage, and pass a storage object around:

class Storage:

    def __init__(self, db):
        self._db = db

    def get_entries(self, feed=None, read=None, important=None): ...
    def get_entry_counts(self, feed=None, read=None, important=None): ...
    def search_entries(self, query, feed=None, read=None, important=None): ...
    def get_feeds(self): ...

The most important benefit of this is that it abstracts the database from the code using it, allowing you to have more than one kind of storage.

Want to store entries as files on disk? Write a FileStorage class that reads them from there. Want to test your application with various combinations of made-up entries? Write a MockStorage class that keeps the entries in a list, in memory. Whoever calls get_entries() or search_entries() doesn't have to know or care where the entries are coming from or how the search is implemented.
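For instance, a MockStorage could look roughly like the sketch below. Representing entries as plain dicts, the dict keys, and the count_unread() caller are all assumptions made for this example.

```python
class MockStorage:
    """In-memory stand-in exposing the same method names as Storage."""

    def __init__(self, entries=()):
        # Each entry is a plain dict here; the keys are an assumption.
        self._entries = list(entries)

    def get_entries(self, feed=None, read=None, important=None):
        def matches(entry):
            return (
                (feed is None or entry["feed"] == feed)
                and (read is None or entry["read"] == read)
                and (important is None or entry["important"] == important)
            )
        return [entry for entry in self._entries if matches(entry)]

    def get_feeds(self):
        return sorted({entry["feed"] for entry in self._entries})

def count_unread(storage):
    # Hypothetical caller: it relies only on the get_entries() interface,
    # so any storage implementation will do.
    return len(storage.get_entries(read=False))

storage = MockStorage([
    {"feed": "a", "title": "one", "read": False, "important": False},
    {"feed": "b", "title": "two", "read": True, "important": True},
])
unread_count = count_unread(storage)
```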

This is the data access object design pattern. In object-oriented programming terminology, a DAO provides an abstract interface that encapsulates a persistence mechanism.


OK, the above looks just about right to me – I wouldn't really change anything else.

Some arguments are still repeating, but it's useful repetition: once a user learns to filter entries with one method, they can do it with any of them. Also, people use different arguments at different times; from their perspective, it's not really repetition.

And anyway, we're already using a class...

Counter-example: data classes #

Let's add more requirements.

There's more functionality beyond storing things, and we have multiple users for that as well (web app, CLI, someone using our code as a library). So we leave Storage to do only storage, and wrap it in a Reader object that has a storage:

class Reader:

    def __init__(self, storage):
        self._storage = storage

    def get_entries(self, feed=None, read=None, important=None):
        return self._storage.get_entries(feed=feed, read=read, important=important)

    ...

    def update_feeds(self):
        # calls various storage methods multiple times:
        # get feeds to be retrieved from storage,
        # store new/modified entries
        ...  

Now, the main caller of Storage.get_entries() is Reader.get_entries(). Furthermore, the filter arguments are rarely used directly by storage methods; most of the time, they're passed to helper functions:

class Storage:

    def get_entries(self, feed=None, read=None, important=None):
        query = make_get_entries_query(feed=feed, read=read, important=important)
        ...

Problem: When we add a new entry filter option, we have to change the Reader methods, the Storage methods, and the helpers. And it's likely we'll do so in the future.

Solution: Group the arguments in a class that contains only data.

from typing import NamedTuple, Optional

class EntryFilterOptions(NamedTuple):
    feed: Optional[str] = None
    read: Optional[bool] = None
    important: Optional[bool] = None

class Storage:

    ...

    def get_entries(self, filter_options):
        query = make_get_entries_query(filter_options)
        ...

    def get_entry_counts(self, filter_options): ...
    def search_entries(self, query, filter_options): ...
    def get_feeds(self): ...

Now, regardless of how much they're passed around, there are only two places where it matters what the options are:

  • in a Reader method, which builds the EntryFilterOptions object
  • where they get used, either a helper or a Storage method
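Put together, the two places could look something like this. It's a sketch: the body of make_get_entries_query() and the SQL it produces are assumptions, and Storage.get_entries() returns the query instead of executing it to keep the example self-contained.

```python
from typing import NamedTuple, Optional

class EntryFilterOptions(NamedTuple):
    feed: Optional[str] = None
    read: Optional[bool] = None
    important: Optional[bool] = None

def make_get_entries_query(filter_options):
    # Hypothetical helper: one WHERE clause per option that is set.
    clauses = [
        f"{name} = :{name}"
        for name, value in filter_options._asdict().items()
        if value is not None
    ]
    query = "SELECT * FROM entries"
    if clauses:
        query += " WHERE " + " AND ".join(clauses)
    return query

class Storage:

    def get_entries(self, filter_options):
        # Just passes the options along; it doesn't care what they are.
        return make_get_entries_query(filter_options)

class Reader:

    def __init__(self, storage):
        self._storage = storage

    def get_entries(self, feed=None, read=None, important=None):
        # The one place where the individual options are spelled out.
        filter_options = EntryFilterOptions(feed, read, important)
        return self._storage.get_entries(filter_options)

reader = Reader(Storage())
query = reader.get_entries(read=False, important=True)
```

Adding a new filter option now means touching EntryFilterOptions, Reader.get_entries(), and the helper, but not the Storage methods in between.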

Note that while we're using the Python class syntax, EntryFilterOptions is not a class in the traditional object-oriented programming sense, since it has no behavior [1]. Sometimes, these are known as "passive data structures" or "plain old data".

A plain class or a dataclass would have been a decent choice as well. Why I chose a named tuple is a discussion for another article.
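For comparison, a roughly equivalent dataclass might look like this. Passing frozen=True is my assumption here, to keep the options read-only like the named tuple.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EntryFilterOptions:
    feed: Optional[str] = None
    read: Optional[bool] = None
    important: Optional[bool] = None

options = EntryFilterOptions(read=False)
```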

I used type hints because it's a cheap way of documenting the options, but you don't have to, not even for dataclasses.

The example above is a simplified version of the code in my feed reader library. In the real world, EntryFilterOptions groups six options, with more on the way, and the Reader and Storage get_entries() are a bit more complicated.

Why not a dict? #

Instead of defining a whole new class, we could've just used a dict like:

{'feed': ..., 'read': ..., 'important': ...}

But this has a number of drawbacks:

  • Dicts are not type-checked. TypedDict helps, but still doesn't prevent using the wrong keys at runtime.
  • Dicts don't work well with code completion. Again, TypedDict can help smarter tools like PyCharm, but not in the interactive interpreter or in IPython.
  • Dicts are mutable. For our use case, immutability is a plus: the options don't have much reason to change, and it would be quite unexpected, so it's useful to disallow it from happening.
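A quick sketch of the runtime differences. The misspelled 'importnat' key is deliberate.

```python
from typing import NamedTuple, Optional

class EntryFilterOptions(NamedTuple):
    feed: Optional[str] = None
    read: Optional[bool] = None
    important: Optional[bool] = None

# A misspelled dict key goes unnoticed when the dict is built.
as_dict = {'feed': None, 'read': False, 'importnat': True}

# The same typo with the named tuple fails right away.
typo_error = None
try:
    EntryFilterOptions(read=False, importnat=True)
except TypeError as e:
    typo_error = e

# Anyone holding a reference to the dict can change it...
as_dict['read'] = True

# ...while the named tuple refuses attribute assignment.
options = EntryFilterOptions(read=False)
mutation_error = None
try:
    options.read = True
except AttributeError as e:
    mutation_error = e
```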

Why not take **kwargs? #

Since we're talking about dicts, why not make Reader.get_entries() etc. take and pass **kwargs directly to EntryFilterOptions?

While shorter, this also breaks completion.

Furthermore, it makes the code less self-documenting: even if you look at the Reader.get_entries() source, you still don't immediately know what arguments it takes. This doesn't matter as much for internal code, but for the user-facing part of the API, we don't mind making the code more verbose if it makes it easier to use.

Also, if we later introduce another data object (say, to handle pagination options), we'll still have to write code to split the kwargs between the two.
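To give an idea of that splitting code, here's a sketch. PaginationOptions, its fields, and split_kwargs() are hypothetical names made up for this example.

```python
from typing import NamedTuple, Optional

class EntryFilterOptions(NamedTuple):
    feed: Optional[str] = None
    read: Optional[bool] = None
    important: Optional[bool] = None

class PaginationOptions(NamedTuple):
    # Hypothetical second data object.
    limit: Optional[int] = None
    starting_after: Optional[str] = None

def split_kwargs(kwargs):
    # Route each keyword argument to the data object that declares it.
    filter_kwargs = {k: v for k, v in kwargs.items() if k in EntryFilterOptions._fields}
    pagination_kwargs = {k: v for k, v in kwargs.items() if k in PaginationOptions._fields}
    unknown = set(kwargs) - set(filter_kwargs) - set(pagination_kwargs)
    if unknown:
        raise TypeError(f"unexpected keyword arguments: {sorted(unknown)}")
    return EntryFilterOptions(**filter_kwargs), PaginationOptions(**pagination_kwargs)

def get_entries(**kwargs):
    # The signature says nothing about which arguments are accepted.
    filter_options, pagination_options = split_kwargs(kwargs)
    return filter_options, pagination_options

filter_options, pagination_options = get_entries(read=False, limit=10)
```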

Why not take EntryFilterOptions? #

Why not make Reader.get_entries() take an EntryFilterOptions, then?

Because that'd make things too verbose for the user: they'd have to import EntryFilterOptions, build it, and then pass it to get_entries(). And frankly, it's not very idiomatic.
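Side by side, the two call styles look like this. VerboseReader is a hypothetical variant that exists only for the comparison, and both methods simply return the options they end up with.

```python
from typing import NamedTuple, Optional

class EntryFilterOptions(NamedTuple):
    feed: Optional[str] = None
    read: Optional[bool] = None
    important: Optional[bool] = None

class VerboseReader:
    # Hypothetical variant whose public method takes the data object.
    def get_entries(self, filter_options=EntryFilterOptions()):
        return filter_options

class Reader:
    def get_entries(self, feed=None, read=None, important=None):
        return EntryFilterOptions(feed, read, important)

# The user has to know about, import, and build EntryFilterOptions:
verbose = VerboseReader().get_entries(EntryFilterOptions(read=False))

# versus plain keyword arguments:
plain = Reader().get_entries(read=False)
```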

This difference between the Reader and Storage method signatures exists because they're used differently:

  • Reader methods are mostly called by external users in varied ways
  • Storage methods are mostly called by internal users (Reader) in a few ways

That's all for now.

Learned something new today? Share this article with others, it really helps! :)


  1. Ted Kaminski discusses this distinction in more detail in "Data, objects, and how we're railroaded into poor design".



from Planet Python