Sunday, January 24, 2021

Brett Cannon: Unravelling `for` statements

As part of my series on Python's syntactic sugar, I am going to cover the for statement. As usual, I will be diving into CPython's C code, but understanding or even reading those parts of this post won't be required in order to understand how the unravelling works.

The bytecode

Let's start with a simple for statement:

for a in b:
    c
for statement example

Passing the code through the dis module gives us:

  2           0 LOAD_GLOBAL              0 (b)
              2 GET_ITER
        >>    4 FOR_ITER                 8 (to 14)
              6 STORE_FAST               0 (a)

  3           8 LOAD_GLOBAL              1 (c)
             10 POP_TOP
             12 JUMP_ABSOLUTE            4
        >>   14 LOAD_CONST               0 (None)
             16 RETURN_VALUE
Disassembly of for a in b: c

Looking at the output you may notice that the GET_ITER and FOR_ITER opcodes seem to be specific to for statements while the rest are general opcodes.

Looking up GET_ITER in the eval loop, it appears to mostly be calling PyObject_GetIter(). I happen to know that it is mostly equivalent to the built-in function iter(). Since I also know that there will probably be something to do with next(), I'm going to handle defining both functions later in this post.

Speaking of next(), the FOR_ITER opcode is semantically a call to __next__() on an object's type (as represented by (*iter->ob_type->tp_iternext)(iter) in the C code). What that means is that understanding how iter() and next() work is critical to understanding how for statements work.

The built-ins

Before covering how iter() and next() work, we should define two important terms. An iterable is a container which can return its contained items one at a time. An iterator is an object which streams those contained items. So while every iterator is conceptually an iterable (and you should always make your iterators also an iterable; it isn't much work), the reverse is not true and not all iterables are an iterator on their own.

iter()

The reason I bothered defining those words is because the purpose of iter() is to take an iterable and return its iterator. Now, what iter() considers an iterable depends on whether you give it one or two arguments.

iter(iterable)

Starting with the semantics of the function when there's a single argument, we see that the implementation calls PyObject_GetIter() to get the iterator. The pseudocode for this function, which I will explain in a moment, is:

def iter(iterable):
        type_ = type(iterable)
    if hasattr(type_, "__iter__"):  # type_->tp_iter != NULL
        iterator = type_.__iter__(iterable)
        if hasattr(type(iterator), "__next__"):  # PyIter_Check
            return iterator
        else:
            raise TypeError
    elif hasattr(type_, "__getitem__"):  # PySequence_Check
        return SeqIter(iterable)  # PySeqIter_New
    else:
        raise TypeError
Pseudocode for iter(iterable)

The first step is looking for the __iter__() special method on an iterable. Calling this is meant to return an iterator (which is explicitly checked for on the return value). At the C level, the definition of an "iterator" is an object that defines __next__(); both the iterable and iterator protocols can also be checked via their requisite collections.abc classes and issubclass() (I didn't do it this way in the pseudocode simply because the hasattr() checks are closer to how the C code is written; in actual Python code I would probably use the abstract base classes).

There is a footnote relating to __iter__() that says if the attribute on the class is set to None that it won't be used (although I have not come across any explicit code doing this check). I think this is implicitly supported due to how the implementation is written, i.e. None is not callable and lacks the appropriate methods that are being checked for.

The second step is what happens if __iter__() isn't available? In that case there's a check to see if we are dealing with a sequence by looking for the __getitem__() special method. If the object turns out to be a sequence, then an instance of PySeqIter_Type is returned whose approximate Python implementation would be:

def _seq_iter(seq):
    """Yield the items of the sequence starting at 0."""
    index = 0
    while True:
        try:
            yield seq[index]
            index += 1
        except (IndexError, StopIteration):
            return
Implementation of PySeqIter_Type

I say "approximate" because the CPython version supports pickling and I don't want to be bothered. 😁

If all of the above fails, then TypeError is ultimately raised.

iter(callable, sentinel)

If you recall earlier, I mentioned that iter() had a two-argument version. In this case iter() is rather different than what we have already discussed:

def iter(callable, sentinel):
    if not hasattr(callable, "__call__"):  # PyCallable_Check
        raise TypeError
    return CallIter(callable, sentinel)  # PyCallIter_Type
Pseudocode for iter(callable, sentinel)

As you can see there's a check to see if the first argument is callable, and if it is then an instance of the PyCallIter_Type iterator is returned. With the function being so short, the important question is what does the PyCallIter_Type iterator do?

The use-case for this form of iter() is for when you have a no-argument callable that you persistently call until a certain value is returned representing that there isn't anything more coming from the callable.

def _call_iter(callable, sentinel):
    while True:
        val = callable()
        if val == sentinel:
            return
        else:
            yield val

An example of where this could be useful is if you are reading a file a chunk at a time, stopping once b"" is returned:

with open(path, "rb") as file:
    for chunk in iter(lambda: file.read(chunk_size), b""):
        ...

Pulling it all together and doing everything appropriately, the definition of iter() is:

def iter(obj, /, sentinel=_NOTHING,):
    """Return an iterator for the object.

    If 'sentinel' is unspecified, the first argument must either be an iterable
    or a sequence. If the argument is a sequence, an iterator will be returned
    which will index into the argument starting at 0 and continue until
    IndexError or StopIteration is raised.

    With 'sentinel' specified, the first argument is expected to be a callable
    which takes no arguments. The returned iterator will execute the callable on
    each iteration until an object equal to 'sentinel' is returned.
    """
    # Python/bltinmodule.c:builtin_iter
    obj_type = builtins.type(obj)
    if sentinel is _NOTHING:
        # Python/abstract.c:PyObject_GetIter
        try:
            __iter__ = _mro_getattr(obj_type, "__iter__")
        except AttributeError:
            try:
                _mro_getattr(obj_type, "__getitem__")
            except AttributeError:
                raise TypeError(f"{obj_type.__name__!r} is not iterable")
            else:
                return _seq_iter(obj)
        else:
            iterator = __iter__(obj)
            # Python/abstract.c:PyIter_Check
            iterator_type = builtins.type(iterator)
            try:
                _mro_getattr(iterator_type, "__next__")
            except AttributeError:
                raise TypeError(
                    f"{obj_type.__name__!r}.__iter__() returned a non-iterator of type {builtins.type(__iter__)!r}"
                )
            else:
                return __iter__(obj)
    else:
        # Python/object.c:PyCallable_Check
        try:
            _mro_getattr(obj_type, "__call__")
        except AttributeError:
            raise TypeError(f"{obj_type.__name__!r} must be callable")
        else:
            return _call_iter(obj, sentinel)
Implementation of iter()

next()

To get the next value from an iterator, we pass it to the next() built-in function. It takes an iterator and can optionally take a default value. If the iterator has a value to return, then it's returned. If the iterator is exhausted, though, either StopIteration is raised or the default value is returned if it's been provided.

def next(iterator, /, default=_NOTHING):
    """Return the next value from the iterator by calling __next__().

    If a 'default' argument is provided, it is returned if StopIteration is
    raised by the iterator.
    """
    # Python/bltinmodule.c:builtin_next
    iterator_type = builtins.type(iterator)
    try:  # Python/abstract.c:PyIter_Check
        __next__ = _mro_getattr(iterator_type, "__next__")
    except AttributeError:
        raise TypeError(f"{iterator_type.__name__!r} is not an iterator")
    else:
        try:
            val = __next__(iterator)
        except StopIteration:
            if default is _NOTHING:
                raise
            else:
                return default
        else:
            return val
Implementation of next()

The semantics

For each value returned by the iterable's iterator, they are assigned to the loop's target (sometimes called the loop variant), the statements in the body are run, and this repeats until the iterator is exhausted. If there is an else clause for the for loop, then it is executed if a break statement isn't encountered. Breaking this all down gives us:

  1. Get the iterable's iterator
  2. Assign the iterator's next value to the loop target
  3. Execute the body
  4. Repeat until the iterator is done
  5. If there's an else clause and a break statement isn't hit, execute the else clause's body

Kind of sounds like a while loop with an assignment, doesn't it?

  1. Get the iterator with iter()
  2. Call next() and assign the result to the loop target
  3. As long as calling next didn't raise StopIteration, execute the body
  4. Repeat as necessary
  5. Run the else clause as appropriate

for without an else clause

Let's start with the simpler case of not having an else clause. In that case we can translate:

for a in b:
    c
Example for loop

to:

_iter = iter(b)
while True:
    try:
        a = next(_iter)
    except StopIteration:
        break
    else:
        c
del _iter
Example for loop, unravelled

If you squint a bit it sort of looks like the for loop example. It's definitely a lot more verbose and would be a pain to write out every time you wanted to iterate, but it would get the work done.

for with an else clause

Now you might have noticed that we used a break statement in our unravelling above which would cause a while loop's else clause to always be skipped. How do we change the translation to work when an else clause is present?

Let's update our example first:

for a in b:
    c
else:
    d
Example for loop with an else clause

To eliminate our use of break in our original unravelling, we need to come up with another way to denote when the iterator is out of items. A variable simply tracking whether there are more values should be enough to get us what we are after.

_iter = iter(b)
_looping = True
while _looping:
    try:
        a = next(_iter)
    except StopIteration:
        _looping = False
        continue
    else:
        c
else:
    d
del _iter, _looping

This unravelling has a rather convenient side-effect of getting to rely on the while loop's own else clause and its semantics to get what we are after for the for loop!

And that's it! We can leverage a while loop to implement for loops with no discernible semantic changes (except for the temp variables).



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...