Friday, March 19, 2021

death and gravity: Deterministic hashing of Python data objects

... in which we calculate deterministic hashes for Python data objects, stable across interpreter versions and implementations.

If you're in a hurry, you can find the final version of the code at the end.

Why is this useful? #

Let's say you have a feed reader library; one of its main features is retrieving and storing web feeds (Atom, RSS, and so on).

Entries (articles) usually have an updated date, indicating the last time the entry was modified in a significant way. An entry is updated only if its updated date in the feed is newer than the one we have stored for it.

However, you notice the content of some entries changes without updated changing, so you decide to update entries whenever the content changes, regardless of updated.

After the feed is retrieved, entries are converted to data objects like these (the real ones have more attributes, though):

from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence

@dataclass
class Content:
    value: str
    language: Optional[str] = None

@dataclass
class Entry:
    id: str
    updated: Optional[datetime] = None
    content: Sequence[Content] = ()

A naive approach is to get the stored entry data and compare it with the feed version, but that's pretty inefficient.

The better solution is to use a hash function – a way to map data of arbitrary size (the message) to a fixed-size chunk of data (the hash value), such that:

  • it is quick to compute the hash value for any given message
  • the same message always results in the same hash
  • it is extremely unlikely two slightly different messages have the same hash

Then, instead of getting the full entry data, we just get its (previously computed) hash, and compare it with the hash of the feed version.

Cryptographic hash functions have more properties, but we don't need them.
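
As a rough sketch, the update logic then looks something like this (get_hash() is the function we'll end up building below; stored_updated, stored_hash, and update_entry() are hypothetical stand-ins for the storage layer):

def maybe_update(feed_entry, stored_updated, stored_hash):
    # update if the feed says the entry changed...
    newer = feed_entry.updated and (not stored_updated or feed_entry.updated > stored_updated)
    # ...or if its content hash no longer matches the stored one
    if newer or get_hash(feed_entry) != stored_hash:
        update_entry(feed_entry)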

Requirements #

Looking at our use case, we need a hash function that:

  1. supports (almost) arbitrary data objects; in our case, the various built-in types, datetimes, and dataclass instances should be enough
  2. is safe; passing an unsupported object should be an error
  3. is stable across interpreter versions and implementations, operating systems, and host machines
  4. ignores "empty" values, to allow adding new fields without the hash changing
  5. can skip some of the fields (I actually realized this is needed much later)

Problem: we need a stable hash function #

hash() #

An easy solution seems to be the built-in hash() function, which returns the integer hash of an object. However, it has a couple of issues.

By default, the hashes of str and bytes objects are randomized for security reasons (details, second note), so they're not predictable between Python processes:

$ python3 -c 'print(hash("abc"))'
-4743820898567001518
$ python3 -c 'print(hash("abc"))'
-6699381079787346150

Also, hash() only supports hashable objects; this means no lists, dicts, or non-frozen dataclasses. For my specific use case, this wasn't actually a problem, but it already puts huge constraints on how arbitrary the data objects can be.
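
Unhashable objects fail outright:

>>> hash([1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> hash({'a': 1})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'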

hashlib #

hashlib contains many different secure hash algorithms, which are by definition deterministic.

But it has one big problem – it only takes bytes:

>>> hashlib.md5(b'abc').hexdigest()
'900150983cd24fb0d6963f7d28e17f72'
>>> hashlib.md5('abc').hexdigest()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Unicode-objects must be encoded before hashing
>>> hashlib.md5(1).hexdigest()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object supporting the buffer API required

This is something we can work with, though: it changes our problem from "we need a stable hash function" to "we need a deterministic way of serializing objects to bytes".
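
Put differently, if we had a deterministic serialize() function (hypothetical for now), the hashing part would be trivial:

def get_hash(thing):
    # serialize() is the missing piece: a deterministic object -> bytes encoding
    return hashlib.md5(serialize(thing)).digest()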

Problem: we need a deterministic way of serializing objects #

pickle #

pickle can turn most Python objects into bytes.

It does have multiple protocols, and the default one can change with the Python version; but we can select one and stick with it – version 4 was added in Python 3.4, and seems good enough.

Again, the easy solution is deceptive, since it seems to work:

$ function pickle {
$1 <<EOD
import pickle
from datetime import datetime
print(pickle.dumps($3, protocol=$2))
EOD
}
$ pickle python3.6 4 '[1, "abc"]' | md5sum
02fa88b9fea0912efe731ed56906b251  -
$ pickle python3.7 4 '[1, "abc"]' | md5sum
02fa88b9fea0912efe731ed56906b251  -
$ pickle python3.8 4 '[1, "abc"]' | md5sum
02fa88b9fea0912efe731ed56906b251  -
$ pickle pypy3.6 4 '[1, "abc"]' | md5sum
02fa88b9fea0912efe731ed56906b251  -

... until it doesn't:

$ pickle python3.6 4 'datetime(1, 1, 1)' | md5sum
9c4423b791578d865d8fbeb070a1b934  -
$ pickle pypy3.6 4 'datetime(1, 1, 1)' | md5sum
3c7c834cb2f1cf4aba8be5c326bb9ddd  -

Version 0 isn't stable either, but in a different way:

$ pickle python3.6 0 'datetime(1, 1, 1)' | md5sum
01acd91b95556a09f5ff9b7495e120da  -
$ pickle pypy3.6 0 'datetime(1, 1, 1)' | md5sum
01acd91b95556a09f5ff9b7495e120da  -
$ pickle python3.7 0 'datetime(1, 1, 1)' | md5sum
a6c815eca494dbf716cd4a7e5556d779  -

Version 3 does seem to work fine across all of the above, on both macOS and Linux (I also tested it with more complicated data). But what's to say it'll remain that way?

In fairness, this is not an issue with pickle – it guarantees you'll get the same object back after unpickling, not that you'll get the same binary stream after pickling.

And it's easy to explain why: pickles are actually programs. Some relevant comments from pickletools:

"A pickle" is a program for a virtual pickle machine (PM, but more accurately called an unpickling machine). It's a sequence of opcodes, interpreted by the PM, building an arbitrarily complex Python object.

Historically, many enhancements have been made to the pickle protocol in order to do a better (faster, and/or more compact) job on [builtin scalar and container types, like ints and tuples].

As explained below [for compatibility], pickle opcodes never go away, not even when better ways to do a thing get invented. The repertoire of the PM just keeps growing over time.

This means there can be multiple pickles that unpickle to the same object [1], even within the bounds of a specific protocol.
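
A quick way to see this (across protocols here, but it can happen within one too): the byte streams differ, yet they unpickle to equal objects:

>>> a = pickle.dumps([1, "abc"], protocol=0)
>>> b = pickle.dumps([1, "abc"], protocol=4)
>>> a == b
False
>>> pickle.loads(a) == pickle.loads(b)
True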

str() and repr() #

str() and repr() might seem like valid solutions, but neither of them is.

First, they're not stable, and not guaranteed to have all the information we may care about:

str(object) returns object.__str__(), which is the "informal" or nicely printable string representation of object. [...] If object does not have a __str__() method, then str() falls back to returning repr(object).

For many types, [repr()] makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object.

More importantly, they're not safe – all Python objects have them, even some that we would not want to serialize at all:

>>> Content(object())
Content(value=<object object at 0x7f993cd5ff40>, language=None)
>>> Content(lambda: 1)
Content(value=<function <lambda> at 0x7f993e96ec10>, language=None)

json #

json might be what we need; even the pickle documentation seems to recommend it, although for different reasons.

Compared to the previous solutions, json has the opposite problem: it only supports dicts (with string/number keys only), lists, strings, and a few others.
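
Anything else fails by default (the exact message varies slightly with the Python version):

>>> json.dumps(datetime(1, 1, 1))
Traceback (most recent call last):
  ...
TypeError: Object of type datetime is not JSON serializable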

But we can turn this to our advantage by converting only the stuff we want to support to a JSON-encodable form, and the json module makes this really easy; let's start with datetimes:

def json_default(thing):
    if isinstance(thing, datetime):
        return thing.isoformat(timespec='microseconds')
    raise TypeError(f"object of type {type(thing).__name__} not serializable")
>>> json.dumps(datetime(1, 1, 1), default=json_default)
'"0001-01-01T00:00:00.000000"'

Dataclasses aren't much harder to add, thanks to asdict(), which converts dataclass instances to a dict, recursing into other dataclasses, dicts, lists and tuples:

def json_default(thing):
    try:
        return dataclasses.asdict(thing)
    except TypeError:
        pass
    if isinstance(thing, datetime):
        return thing.isoformat(timespec='microseconds')
    raise TypeError(f"object of type {type(thing).__name__} not serializable")
>>> json.dumps(Entry('id', datetime(1, 1, 1), content=[Content('value')]), default=json_default)
'{"id": "id", "updated": "0001-01-01T00:00:00.000000", "content": [{"value": "value", "language": null}]}'

You may notice the dataclass type does not appear in the result, which for our use case is actually fine: dataclasses One(value=1) and Two(value=1) will both get converted to the dict {'value': 1}, resulting in the same hash. If we wanted them to be different, we could include the type name:

>>> {'__type': type(thing).__name__, **asdict(thing)}
{'__type': 'Content', 'value': 1, 'language': None}
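
Wired into json_default(), that could look like this (a hypothetical variant, not used in the rest of the article):

def json_default(thing):
    # include the class name, so One(value=1) and Two(value=1) hash differently
    if dataclasses.is_dataclass(thing) and not isinstance(thing, type):
        return {'__type': type(thing).__name__, **dataclasses.asdict(thing)}
    if isinstance(thing, datetime):
        return thing.isoformat(timespec='microseconds')
    raise TypeError(f"object of type {type(thing).__name__} not serializable")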

To ensure the output remains stable across Python versions, we'll force all of the dumps() default arguments to known values, and require it to sort the keys in dicts:

def json_dumps(thing):
    return json.dumps(
        thing,
        default=json_default,
        ensure_ascii=False,
        sort_keys=True,
        indent=None,
        separators=(',', ':'),
    )
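
For example, keys get sorted and whitespace removed, regardless of input order:

>>> json_dumps({'b': 1, 'a': [1, 2]})
'{"a":[1,2],"b":1}'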

One more wrapper to hash the serialized value, and we're done:

def get_hash(thing):
    return hashlib.md5(json_dumps(thing).encode('utf-8')).digest()
>>> get_hash(Entry('id', datetime(1, 1, 1), content=[Content('value')])).hex()
'78b4b8120af5f832b7ecfc34db1fe02b'

Problem: we need to ignore empty values #

Say we have a dataclass like the following:

>>> @dataclass
... class Data:
...     value: int
...
>>> get_hash(Data(2)).hex()
'5d872de403edb944a7b10450eda2f46a'

Which in time evolves to get another attribute:

>>> @dataclass
... class Data:
...     value: int
...     another: str = None
...
>>> get_hash(Data(2)).hex()
'e54ad2c961239bd70da9a603cd078e18'

The old and new versions result in different dicts, so they have different hashes.

But should they? The only "real" data Data(2) contains is value=2. Likewise, there's not much of a difference in actual information between None and [] (for our use case, at least).

We can ignore "empty" values quite easily by using the asdict() dict_factory argument. I overlooked it initially, thinking it has to be a mapping class; it turns out any function that takes a key-value pair iterable works:

from collections.abc import Collection

def dict_drop_empty(pairs):
    return dict(
        (k, v)
        for k, v in pairs
        if not (
            v is None
            or not v and isinstance(v, Collection)
        )
    )
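
Called directly, it keeps meaningful falsy values like 0, but drops None and empty collections:

>>> dict_drop_empty([('a', 1), ('b', None), ('c', []), ('d', 0)])
{'a': 1, 'd': 0}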

def json_default(thing):
    try:
        return dataclasses.asdict(thing, dict_factory=dict_drop_empty)
    # ...

We now get the same hash as for the first version of the object:

>>> Data(2)
Data(value=2, another=None)
>>> get_hash(Data(2)).hex()
'5d872de403edb944a7b10450eda2f46a'

Note that we are specific about which falsy values count as empty: an empty string, list, or dict is ignored, but 0 and False are not.

collections.abc provides abstract base classes that can be used to test whether a class provides a particular interface.

isinstance(v, (str, tuple, list, dict)) might cover all our needs in the code above, but isinstance(v, Collection) also catches collections we may have failed to think about that don't inherit from them (e.g. set, collections.deque, or even numpy.ndarray).
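
For example:

>>> [isinstance(v, Collection) for v in ('', [], {}, set(), 0, False)]
[True, True, True, True, False, False]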

Problem: we need to skip some fields #

Let's look at a more advanced version of our Entry class:

@dataclass
class Entry:
    feed_url: str
    id: str
    updated: Optional[datetime] = None
    content: Sequence[Content] = ()

If the feed URL changes, does the data of the entry change?

In the context of our problem, both feed_url and id are metadata; changing them should not change the hash, since the actual data does not change.

We could deal with this by changing json_default() to remove specific keys from the asdict() result.

However, this moves class-specific logic away from the class and into json_default(), and forces us to change it whenever we add a new class or the list of metadata attributes changes; passing the list of metadata attributes as an argument to get_hash() just moves the problem elsewhere.

A better way is to allow classes to tell json_default() what attributes to remove via a well-known class attribute.

Neither of these approaches works with asdict() and nested dataclasses: asdict() is recursive, and we only get to look at the top-level class. We cannot do anything in dict_factory either, since we don't know which class the attribute pairs belong to.

To work around it, we'll implement a non-recursive version of asdict(), and rely on json.dumps() for recursion.

def dataclass_dict(thing):
    # fields() raises TypeError for anything that is not a dataclass
    fields = dataclasses.fields(thing)
    # fields() also accepts dataclass types; we only want instances
    if isinstance(thing, type):
        raise TypeError("got dataclass type, expected instance")

    exclude = getattr(thing, '_hash_exclude_', ())

    rv = {}
    for field in fields:
        if field.name in exclude:
            continue
        value = getattr(thing, field.name)
        # same "empty value" check as in dict_drop_empty()
        if value is None or not value and isinstance(value, Collection):
            continue
        rv[field.name] = value

    return rv

def json_default(thing):
    try:
        return dataclass_dict(thing)
    # ...

We're using _hash_exclude_ (with single leading and trailing underscores) to mimic the __double_underscore__ convention Python uses for special attributes; we are not using double underscores because those names are reserved for Python itself.

To use it, we just need to set the _hash_exclude_ class attribute to a tuple of the attributes to exclude from the hash:

@dataclass
class Entry:
    feed_url: str
    id: str
    updated: Optional[datetime] = None
    content: Sequence[Content] = ()
    _hash_exclude_ = ('feed_url', 'id')
>>> get_hash(Entry('feed one', 'id', datetime(1, 1, 1))).hex()
'4f97b1e8e99d3304f50cd3c4428f224e'
>>> get_hash(Entry('feed two', 'id', datetime(1, 1, 1))).hex()
'4f97b1e8e99d3304f50cd3c4428f224e'
>>> get_hash(Entry('feed one', 'id', datetime(2, 2, 2))).hex()
'0c739e158792c5a91ec1632804d905c1'

Conclusion #

By leveraging dataclasses and the json module, we've managed to get stable hashes for (almost) arbitrary Python data objects, with a decent trade-off between generality and safety, and two extra features, all in under 50 lines of code!
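
For reference, here is everything assembled from the pieces above (a consolidated sketch; the author's exact final version may differ in details):

import dataclasses
import hashlib
import json
from collections.abc import Collection
from datetime import datetime

def get_hash(thing):
    return hashlib.md5(json_dumps(thing).encode('utf-8')).digest()

def json_dumps(thing):
    # force all dumps() arguments to known values, so the output
    # doesn't depend on defaults changing between Python versions
    return json.dumps(
        thing,
        default=json_default,
        ensure_ascii=False,
        sort_keys=True,
        indent=None,
        separators=(',', ':'),
    )

def json_default(thing):
    try:
        return dataclass_dict(thing)
    except TypeError:
        pass
    if isinstance(thing, datetime):
        return thing.isoformat(timespec='microseconds')
    raise TypeError(f"object of type {type(thing).__name__} not serializable")

def dataclass_dict(thing):
    # raises TypeError for anything that is not a dataclass
    fields = dataclasses.fields(thing)
    if isinstance(thing, type):
        raise TypeError("got dataclass type, expected instance")

    exclude = getattr(thing, '_hash_exclude_', ())

    rv = {}
    for field in fields:
        if field.name in exclude:
            continue
        value = getattr(thing, field.name)
        # skip None and empty collections, so adding fields with empty
        # defaults doesn't change existing hashes
        if value is None or not value and isinstance(value, Collection):
            continue
        rv[field.name] = value

    return rv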

If you're interested in using this, also have a look at:

As always, if you found it useful and would like to see more, drop me a line :)

  1. I found this idea somewhere on Stack Overflow, but I can't seem to find that specific post again.


