Monday, November 4, 2019

Julien Danjou: Finding definitions from a source file and a line number in Python

Finding definitions from a source file and a line number in Python

My job at Datadog keeps me busy with new and questioning challenges. I recently stumbled upon a problem that sounded easy but was more difficult than I imagined.

Here's the thing: considering a filename and a line number, can you tell which function, method or class this line of code belongs to?

I started to dig into the standard library, but I did not find anything solving this problem. It sounded like I write to do this myself.

The first steps sound easy. Open a file, read it, find the line number. Right.

Then, how do you know which functions this line is in? You don't, expect if you parse the whole file and keep tracks of function definitions. A regular expression parsing each line might be a solution?

Well, you had to be careful as function definitions can span multiple lines.

Using the AST

I decided that a good and robust strategy was not going to use manual parsing or the like, but using Python abstract syntax tree (AST) directly. By leveraging Python's own parsing code, I was sure I was not going to fail while parsing a Python source file.

This can be simply be accomplished with:

import ast

def parse_file(filename):
    with open(filename) as f:
        return ast.parse(f.read(), filename=filename)

And you're done. Are you? No, because that only works in 99.99% of the case. If your source file is using an encoding that is now ASCII or UTF-8, then the function fails. I know you think I'm crazy to think about this but I like my code to be robust.

It turns out Python has a cookie to specify the encoding in the form of # encoding: utf-8 as defined in PEP 263. Reading this cookie would help to find the encoding.

To do that, we need to open the file in binary mode, use a regular expression to match the data, and… Well, it's dull, and somebody already implemented it for us so let's use the fantastic tokenize.open function provided by Python:

import ast
import tokenize

def parse_file(filename):
    with tokenize.open(filename) as f:
        return ast.parse(f.read(), filename=filename)

That should work in 100% of the time. Until proven otherwise.

Browsing the AST

The parse_file function now returns a Python AST. If you never played with Python AST, it's a gigantic tree that represents your source code just before it is compiled down to Python bytecode.

In the tree, there should be statements and expression. In our case, we're interested in finding the function definition that is the closest to our line number. Here's an implementation of that function:

def filename_and_lineno_to_def(filename, lineno):
    candidate = None
    for item in ast.walk(parse_file(filename)):
        if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if item.lineno > lineno:
                # Ignore whatever is after our line
                continue
            if candidate:
                distance = lineno - item.lineno
                if distance < (lineno - candidate.lineno):
                    candidate = item
            else:
                candidate = item

    if candidate:
        return candidate.name

This iterates over all the node of the AST and returns the node where the line number is the closest to our definition. If we have a file that contains:

class A(object):
    X = 1
    def y(self):
        return 42

the function filename_and_lineno_to_def returns for the lines 1 to 5:

A
A
y
y
y
Return of filename_and_lineo_to_def for lines 1 to 5

It works!

Closures?

The naive approach described earlier likely works for 90% of your code, but there are some edge cases. For example, when defining function closures, the above algorithm fails. With the following code:

class A(object):
   X = 1
   def y(self):
       def foo():
           return 42
       return foo

the function filename_and_lineno_to_def returns for lines 1 to 7:

A
A
y
foo
foo
foo
foo
Return of filename_and_lineo_to_def for lines 1 to 7

Oops. Clearly, lines 6 and 7 do not belong to the foo function. Our approach is too naive to see that starting at line 6, we're back in the y method.

Interval Trees

The correct way of handling that is to consider each function definition as an interval:

Finding definitions from a source file and a line number in PythonPiece of code seen as interval.

Whatever the line number we request is, we should return the node that is responsible for the smallest interval that the line is in.

What we need in this case is a correct data structure to solve our problem: an interval tree fits perfectly our use case. It allows for searching rapidly pieces of code that match our line number.

To solve our problem we need several things:

  • A way to compute the beginning and end line numbers for a function.
  • A tree that is fed with the intervals we computed just before.
  • A way to select the best matching intervals if a line is part of several functions (closure).

Computing Function Intervals

The interval of a function is the first and last lines that compose its body. It's pretty easy to find those by walking through the function AST node:

def _compute_interval(node):
    min_lineno = node.lineno
    max_lineno = node.lineno
    for node in ast.walk(node):
        if hasattr(node, "lineno"):
            min_lineno = min(min_lineno, node.lineno)
            max_lineno = max(max_lineno, node.lineno)
    return (min_lineno, max_lineno + 1)

Given any AST node, the function returns a tuple of the first and last line number of that node.

Building The Tree

Rather than implementing an interval tree, we'll use the intervaltree library. We need to create a tree and feed it with the computed interval:

def file_to_tree(filename):
    with tokenize.open(filename) as f:
        parsed = ast.parse(f.read(), filename=filename)
    tree = intervaltree.IntervalTree()
    for node in ast.walk(parsed):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = _compute_interval(node)
            tree[start:end] = node
    return tree

Here you go: the function parses the Python file passed as an argument and converts it to its AST representation. It then walks it and feeds the interval tree with every class and function definition.

Querying the Tree

Now that the tree is built, it should be queried with the line number. This is pretty simple:

matches = file_to_tree(filename)[lineno]
if matches:
    return min(matches, key=lambda i: i.length()).data.name

The build tree might return several matches if there are several intervals containing our line number. In that case, we pick the smallest interval and return the name of the node — which is our class or function name!

Mission Success

We did it! We started with a naive approach and iterated to a final solution covering 100% of our cases. Picking the right data structure, interval trees here, helped us solving this in an intelligent approach.



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...