Following the recent introduction of Python type annotations (aka "type hints") in Mercurial (see, e.g. this changeset by Augie Fackler), I've been playing a bit with this and pytype.
pytype is a static type analyzer for Python code. It compares with the more popular mypy but I don't have enough perspective to make a meaningful comparison at the moment. In this post, I'll illustrate how I worked with pytype to gradually add type hints in a Mercurial module and while doing so, fix bugs!
The module I focused on is mercurial.mail, which contains mail utilities and that I know quite well. Other modules are also being worked on, this one is a good starting point because it has a limited number of "internal" dependencies, which both makes it faster to iterate with pytype and reduces side effects of other modules not being correctly typed already.
$ pytype mercurial/mail.py Computing dependencies Analyzing 1 sources with 36 local dependencies ninja: Entering directory `.pytype' [19/19] check mercurial.mail Success: no errors found
The good news is that the module apparently already type-checks. Let's go deeper and merge the type annotations generated by pytype:
$ merge-pyi -i mercurial/mail.py out/mercurial/mail.pyi
(In practice, we'd use --as-comments option to write type hints as comments, so that the module is still usable on Python 2.)
Now we have all declarations annotated with types. Typically, we'd get many things like:
def codec2iana(cs) -> Any: cs = pycompat.sysbytes(email.charset.Charset(cs).input_charset.lower()) # "latin1" normalizes to "iso8859-1", standard calls for "iso-8859-1" if cs.startswith(b"iso") and not cs.startswith(b"iso-"): return b"iso-" + cs[3:] return cs
The function signature has been annotated with Any (omitted for parameters, explicit for return value). This somehow means that type inference failed to find the type of that function. As it's (quite) obvious, let's change that into:
def codec2iana(cs: bytes) -> bytes: ...
And re-run pytype:
$ pytype mercurial/mail.py Computing dependencies Analyzing 1 sources with 36 local dependencies ninja: Entering directory `.pytype' [1/1] check mercurial.mail FAILED: .pytype/pyi/mercurial/mail.pyi pytype-single --imports_info .pytype/imports/mercurial.mail.imports --module-name mercurial.mail -V 3.7 -o .pytype/pyi/mercurial/mail.pyi --analyze-annotated --nofail --quick mercurial/mail.py File "mercurial/mail.py", line 253, in codec2iana: Function Charset.__init__ was called with the wrong arguments [wrong-arg-types] Expected: (self, input_charset: str = ...) Actually passed: (self, input_charset: bytes) For more details, see https://google.github.io/pytype/errors.html#wrong-arg-types. ninja: build stopped: subcommand failed.
Interesting! email.charset.Charset is apparently instantiated with the wrong argument type. While it's not exactly a bug, because Python will handle bytes instead of str well in general, we can again change the signature (and code) to:
def codec2iana(cs: str) -> str: cs = email.charset.Charset(cs).input_charset.lower() # "latin1" normalizes to "iso8859-1", standard calls for "iso-8859-1" if cs.startswith("iso") and not cs.startswith("iso-"): return "iso-" + cs[3:] return cs
Obviously, this involves a larger refactoring in client code of this simple function, see respective changeset for details.
Another example is this function:
def _encode(ui, s, charsets) -> Any: '''Returns (converted) string, charset tuple. Finds out best charset by cycling through sendcharsets in descending order. Tries both encoding and fallbackencoding for input. Only as last resort send as is in fake ascii. Caveat: Do not use for mail parts containing patches!''' sendcharsets = charsets or _charsets(ui) if not isinstance(s, bytes): # We have unicode data, which we need to try and encode to # some reasonable-ish encoding. Try the encodings the user # wants, and fall back to garbage-in-ascii. for ocs in sendcharsets: try: return s.encode(pycompat.sysstr(ocs)), ocs except UnicodeEncodeError: pass except LookupError: ui.warn(_(b'ignoring invalid sendcharset: %s\n') % ocs) else: # Everything failed, ascii-armor what we've got and send it. return s.encode('ascii', 'backslashreplace') # We have a bytes of unknown encoding. We'll try and guess a valid # encoding, falling back to pretending we had ascii even though we # know that's wrong. try: s.decode('ascii') except UnicodeDecodeError: for ics in (encoding.encoding, encoding.fallbackencoding): ics = pycompat.sysstr(ics) try: u = s.decode(ics) except UnicodeDecodeError: continue for ocs in sendcharsets: try: return u.encode(pycompat.sysstr(ocs)), ocs except UnicodeEncodeError: pass except LookupError: ui.warn(_(b'ignoring invalid sendcharset: %s\n') % ocs) # if ascii, or all conversion attempts fail, send (broken) ascii return s, b'us-ascii'
It quite clear from the return value (last line) that we can change its type signature to:
def _encode(ui, s:Union[bytes, str], charsets: List[bytes]) -> Tuple[bytes, bytes] ...
And re-run pytype:
$ pytype mercurial/mail.py Computing dependencies Analyzing 1 sources with 36 local dependencies ninja: Entering directory `.pytype' [1/1] check mercurial.mail FAILED: .pytype/pyi/mercurial/mail.pyi pytype-single --imports_info .pytype/imports/mercurial.mail.imports --module-name mercurial.mail -V 3.7 -o .pytype/pyi/mercurial/mail.pyi --analyze-annotated --nofail --quick mercurial/mail.py File "mercurial/mail.py", line 342, in _encode: bad option in return type [bad-return-type] Expected: Tuple[bytes, bytes] Actually returned: bytes [...] For more details, see https://google.github.io/pytype/errors.html. ninja: build stopped: subcommand failed.
That's a real bug. Line 371 contains return s.encode('ascii', 'backslashreplace') in the middle of the function. We indeed return a bytes value instead of a Tuple[bytes, bytes] as elsewhere in the function. The following changes fixes the bug:
diff --git a/mercurial/mail.py b/mercurial/mail.py --- a/mercurial/mail.py +++ b/mercurial/mail.py @@ -339,7 +339,7 @@ def _encode(ui, s, charsets) -> Any: ui.warn(_(b'ignoring invalid sendcharset: %s\n') % ocs) else: # Everything failed, ascii-armor what we've got and send it. - return s.encode('ascii', 'backslashreplace') + return s.encode('ascii', 'backslashreplace'), b'us-ascii' # We have a bytes of unknown encoding. We'll try and guess a valid # encoding, falling back to pretending we had ascii even though we # know that's wrong.
Steps by steps, by replacing Any with real types, adjusting "wrong" types like in the first example and fixing bugs, we finally get the whole module (almost) completely annotated.
from Planet Python
via read more
No comments:
Post a Comment