Thursday, November 14, 2019

Logilab: Typing Mercurial with pytype

Following the recent introduction of Python type annotations (aka "type hints") in Mercurial (see, e.g. this changeset by Augie Fackler), I've been playing a bit with this and pytype.

pytype is a static type analyzer for Python code. It compares with the more popular mypy but I don't have enough perspective to make a meaningful comparison at the moment. In this post, I'll illustrate how I worked with pytype to gradually add type hints in a Mercurial module and while doing so, fix bugs!

The module I focused on is mercurial.mail, which contains mail utilities and that I know quite well. Other modules are also being worked on, this one is a good starting point because it has a limited number of "internal" dependencies, which both makes it faster to iterate with pytype and reduces side effects of other modules not being correctly typed already.

$ pytype mercurial/mail.py
Computing dependencies
Analyzing 1 sources with 36 local dependencies
ninja: Entering directory `.pytype'
[19/19] check mercurial.mail
Success: no errors found

The good news is that the module apparently already type-checks. Let's go deeper and merge the type annotations generated by pytype:

$ merge-pyi -i mercurial/mail.py out/mercurial/mail.pyi

(In practice, we'd use --as-comments option to write type hints as comments, so that the module is still usable on Python 2.)

Now we have all declarations annotated with types. Typically, we'd get many things like:

def codec2iana(cs) -> Any:
   cs = pycompat.sysbytes(email.charset.Charset(cs).input_charset.lower())
   # "latin1" normalizes to "iso8859-1", standard calls for "iso-8859-1"
   if cs.startswith(b"iso") and not cs.startswith(b"iso-"):
      return b"iso-" + cs[3:]
   return cs

The function signature has been annotated with Any (omitted for parameters, explicit for return value). This somehow means that type inference failed to find the type of that function. As it's (quite) obvious, let's change that into:

def codec2iana(cs: bytes) -> bytes:
   ...

And re-run pytype:

$ pytype mercurial/mail.py
Computing dependencies
Analyzing 1 sources with 36 local dependencies
ninja: Entering directory `.pytype'
[1/1] check mercurial.mail
FAILED: .pytype/pyi/mercurial/mail.pyi
pytype-single --imports_info .pytype/imports/mercurial.mail.imports --module-name mercurial.mail -V 3.7 -o .pytype/pyi/mercurial/mail.pyi --analyze-annotated --nofail --quick mercurial/mail.py
File "mercurial/mail.py", line 253, in codec2iana: Function Charset.__init__ was called with the wrong arguments [wrong-arg-types]
  Expected: (self, input_charset: str = ...)
  Actually passed: (self, input_charset: bytes)

For more details, see https://google.github.io/pytype/errors.html#wrong-arg-types.
ninja: build stopped: subcommand failed.

Interesting! email.charset.Charset is apparently instantiated with the wrong argument type. While it's not exactly a bug, because Python will handle bytes instead of str well in general, we can again change the signature (and code) to:

def codec2iana(cs: str) -> str:
   cs = email.charset.Charset(cs).input_charset.lower()
   # "latin1" normalizes to "iso8859-1", standard calls for "iso-8859-1"
   if cs.startswith("iso") and not cs.startswith("iso-"):
      return "iso-" + cs[3:]
   return cs

Obviously, this involves a larger refactoring in client code of this simple function, see respective changeset for details.

Another example is this function:

def _encode(ui, s, charsets) -> Any:
    '''Returns (converted) string, charset tuple.
    Finds out best charset by cycling through sendcharsets in descending
    order. Tries both encoding and fallbackencoding for input. Only as
    last resort send as is in fake ascii.
    Caveat: Do not use for mail parts containing patches!'''
    sendcharsets = charsets or _charsets(ui)
    if not isinstance(s, bytes):
        # We have unicode data, which we need to try and encode to
        # some reasonable-ish encoding. Try the encodings the user
        # wants, and fall back to garbage-in-ascii.
        for ocs in sendcharsets:
            try:
                return s.encode(pycompat.sysstr(ocs)), ocs
            except UnicodeEncodeError:
                pass
            except LookupError:
                ui.warn(_(b'ignoring invalid sendcharset: %s\n') % ocs)
        else:
            # Everything failed, ascii-armor what we've got and send it.
            return s.encode('ascii', 'backslashreplace')
    # We have a bytes of unknown encoding. We'll try and guess a valid
    # encoding, falling back to pretending we had ascii even though we
    # know that's wrong.
    try:
        s.decode('ascii')
    except UnicodeDecodeError:
        for ics in (encoding.encoding, encoding.fallbackencoding):
            ics = pycompat.sysstr(ics)
            try:
                u = s.decode(ics)
            except UnicodeDecodeError:
                continue
            for ocs in sendcharsets:
                try:
                    return u.encode(pycompat.sysstr(ocs)), ocs
                except UnicodeEncodeError:
                    pass
                except LookupError:
                    ui.warn(_(b'ignoring invalid sendcharset: %s\n') % ocs)
    # if ascii, or all conversion attempts fail, send (broken) ascii
    return s, b'us-ascii'

It quite clear from the return value (last line) that we can change its type signature to:

def _encode(ui, s:Union[bytes, str], charsets: List[bytes]) -> Tuple[bytes, bytes]
   ...

And re-run pytype:

$ pytype mercurial/mail.py
Computing dependencies
Analyzing 1 sources with 36 local dependencies
ninja: Entering directory `.pytype'
[1/1] check mercurial.mail
FAILED: .pytype/pyi/mercurial/mail.pyi
pytype-single --imports_info .pytype/imports/mercurial.mail.imports --module-name mercurial.mail -V 3.7 -o .pytype/pyi/mercurial/mail.pyi --analyze-annotated --nofail --quick mercurial/mail.py
File "mercurial/mail.py", line 342, in _encode: bad option in return type [bad-return-type]
  Expected: Tuple[bytes, bytes]
  Actually returned: bytes
[...]

For more details, see https://google.github.io/pytype/errors.html.
ninja: build stopped: subcommand failed.

That's a real bug. Line 371 contains return s.encode('ascii', 'backslashreplace') in the middle of the function. We indeed return a bytes value instead of a Tuple[bytes, bytes] as elsewhere in the function. The following changes fixes the bug:

diff --git a/mercurial/mail.py b/mercurial/mail.py
--- a/mercurial/mail.py
+++ b/mercurial/mail.py
@@ -339,7 +339,7 @@ def _encode(ui, s, charsets) -> Any:
                 ui.warn(_(b'ignoring invalid sendcharset: %s\n') % ocs)
         else:
             # Everything failed, ascii-armor what we've got and send it.
-            return s.encode('ascii', 'backslashreplace')
+            return s.encode('ascii', 'backslashreplace'), b'us-ascii'
     # We have a bytes of unknown encoding. We'll try and guess a valid
     # encoding, falling back to pretending we had ascii even though we
     # know that's wrong.

Steps by steps, by replacing Any with real types, adjusting "wrong" types like in the first example and fixing bugs, we finally get the whole module (almost) completely annotated.



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...