Tuesday, January 15, 2019

Will McGugan: PyFilesystem is greater than or equal to Pathlib

I was reading a post by Trey Hunner on why pathlib is great, where he makes the case that pathlib is a better choice than the standard library alternatives that preceded it. I wouldn't actually disagree with a word of it. He's entirely correct. You should probably be using pathlib where it fits.

Personally, however, I rarely use pathlib, because I find that for the most part, PyFilesystem is a better choice. I'd like to take some of the code examples from Trey's post and re-write them using PyFilesystem, just so we can compare.

Create a folder, move a file

The first example from Trey's post creates a folder, then moves a file into it. Here it is:

from pathlib import Path

Path('src/__pypackages__').mkdir(parents=True, exist_ok=True)
Path('.editorconfig').rename('src/.editorconfig')

The code above is straightforward, and it hides the gory platform details, which is a major benefit of pathlib over os.path.

The PyFilesystem version also does this, and the code is remarkably similar:

from fs import open_fs

with open_fs('.') as cwd:
    cwd.makedirs('src/__pypackages__', recreate=True)
    cwd.move('.editorconfig', 'src/.editorconfig')

The two lines that do the work are somewhat similar -- you can probably figure them out without looking at the docs. The first line of non-import code may need some explanation. In PyFilesystem the abstraction is not a path but a directory, so open_fs('.') returns an FS object for the current working directory. It's this object which contains the methods for making directories, moving files, and so on.
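
To make the "directory as the abstraction" idea a little more concrete, here's a minimal sketch (the paths are invented, and this isn't from Trey's post) of a few more operations hanging off the same FS object:

from fs import open_fs

with open_fs('.') as cwd:
    print(cwd.listdir('/'))                      # list the directory's contents
    cwd.makedirs('build/docs', recreate=True)    # nested directories, like mkdir -p
    cwd.writetext('build/docs/notes.txt', 'hi')  # write a text file
    print(cwd.exists('build/docs/notes.txt'))    # True
    cwd.removetree('build')                      # tidy up the example directory again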

Create a directory if it doesn't already exist, write a blank file

This next example from Trey's post creates a directory, then creates an empty file if it doesn't already exist:

from pathlib import Path


def make_editorconfig(dir_path):
    """Create .editorconfig file in given directory and return filepath."""
    path = Path(dir_path, '.editorconfig')
    if not path.exists():
        path.parent.mkdir(exist_ok=True, parents=True)
        path.touch()
    return path

This function is tricky to compare, as it does things you might not consider doing in a project with PyFilesystem, but if I were to translate it literally, it would be something like the following:

from fs import open_fs


def make_editorconfig(dir_path):
    """Create .editorconfig file in given directory and return its system path."""
    with open_fs(dir_path, create=True) as fs:
        fs.touch(".editorconfig")
        return fs.getsyspath(".editorconfig")

The reason you wouldn't write this code with PyFilesystem is that you rarely need to pass around paths. You typically pass around FS objects which represent a subdirectory. It's perhaps not the best example to demonstrate this, but the PyFilesystem code would more likely be something like the following:

from fs import open_fs

def make_editorconfig(directory_fs):
    """Create an empty .editorconfig file in the given directory."""
    directory_fs.create(".editorconfig")

with open_fs("foo", create=True) as directory_fs:
    make_editorconfig(directory_fs)

Rather than a str or a Path object, the function expects an FS object. An advantage of this is that file and directory operations are sandboxed under that directory, unlike the pathlib version, which has access to the entire filesystem. For a trivial example this won't matter, but if you have more complex code, it can prevent you from unintentionally deleting or overwriting files if there is a bug.
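
As a quick sketch of what that sandboxing looks like in practice (the src directory here is just an example, not something from Trey's post):

from fs import open_fs

with open_fs('.') as project_fs:
    project_fs.makedirs('src', recreate=True)
    # opendir returns a SubFS: an FS object rooted at the sub-directory
    src_fs = project_fs.opendir('src')
    # paths are now relative to 'src', so this creates ./src/.editorconfig
    src_fs.touch('.editorconfig')
    # the sandbox can't be escaped; a path such as '../setup.py'
    # raises fs.errors.IllegalBackReference
    # src_fs.remove('../setup.py')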

Counting files by extension

Next up, we have a short script which counts the Python files in a subdirectory using pathlib:

from pathlib import Path


extension = '.py'
count = 0
for filename in Path.cwd().rglob(f'*{extension}'):
    count += 1
print(f"{count} Python files found")

Nice and simple. PyFilesystem has glob functionality (although no rglob yet). The code looks quite similar:

from fs import open_fs

extension = '.py'

with open_fs('.') as fs:
    count = fs.glob(f"**/*{extension}").count().files
print(f"{count} Python files found")

There's no for loop in the code above, because there is built-in file-counting functionality, but otherwise it is much the same.
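
If you do want the explicit loop, the walker gives you much the same thing. A sketch (equivalent in spirit to the glob version above, not code from Trey's post):

from fs import open_fs

extension = '.py'

count = 0
with open_fs('.') as fs:
    # walk.files yields every file path below the directory;
    # filter restricts the walk to matching file names
    for path in fs.walk.files(filter=[f'*{extension}']):
        count += 1
print(f"{count} Python files found")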

I think Trey was using this example to compare performance. I haven't actually compared performance of PyFilesystem's globbing versus os.path or pathlib. That could be the subject for another post.

Write a file to the terminal if it exists

The next example is a simple one for both pathlib and PyFilesystem. Here's the pathlib version:

from pathlib import Path
import sys


directory = Path(sys.argv[1])
ignore_path = directory / '.gitignore'
if ignore_path.is_file():
    print(ignore_path.read_text(), end='')

And here's the PyFilesystem equivalent:

import sys
from fs import open_fs


with open_fs(sys.argv[1]) as fs:
    if fs.isfile(".gitignore"):
        print(fs.readtext('.gitignore'), end='')

Note that there's no equivalent of directory / '.gitignore'. You don't need to join paths in PyFilesystem as often, but when you do, you don't need to worry about platform details. All paths in PyFilesystem are a sort of idealized path with a common format.
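
When you do need to build or pick apart a path, the fs.path module works on those idealized, forward-slash paths on every platform. A quick sketch (the names are made up):

from fs.path import basename, join, splitext

path = join('src', 'project', 'config.ini')  # 'src/project/config.ini' everywhere
print(basename(path))                        # 'config.ini'
print(splitext(path)[1])                     # '.ini'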

Finding duplicates

Trey offered a fully working script to find duplicates in a subdirectory with and without pathlib. Coincidentally I'd recently added a similar example to PyFilesystem.

Here is Trey's pathlib version:

from collections import defaultdict
from hashlib import md5
from pathlib import Path


def find_files(filepath):
    for path in Path(filepath).rglob('*'):
        if path.is_file():
            yield path


file_hashes = defaultdict(list)
for path in find_files(Path.cwd()):
    file_hash = md5(path.read_bytes()).hexdigest()
    file_hashes[file_hash].append(path)

for paths in file_hashes.values():
    if len(paths) > 1:
        print("Duplicate files found:")
        print(*paths, sep='\n')

And here we have equivalent functionality with PyFilesystem:

from collections import defaultdict
from hashlib import md5
from fs import open_fs

file_hashes = defaultdict(list)
with open_fs('.') as fs:
    for path in fs.walk.files():
        file_hash = md5(fs.readbytes(path)).hexdigest()
        file_hashes[file_hash].append(path)

for paths in file_hashes.values():
    if len(paths) > 1:
        print("Duplicate files found:")
        print(*paths, sep='\n')

The PyFilesystem version compares quite favourably here (in terms of lines of code at least), mostly because there is already a built-in method that iterates over file paths.
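
The walker also takes options, so if you only wanted to hash some of the files you could narrow the walk itself rather than filter paths afterwards. A sketch (the excluded directory is just an example):

from fs import open_fs

with open_fs('.') as fs:
    # restrict the walk to Python files and skip the .git directory
    for path in fs.walk.files(filter=['*.py'], exclude_dirs=['.git']):
        print(path)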

Conclusion

First off, I would like to emphasise that I'm not suggesting you never use pathlib. It is better than the alternatives in the standard library. Pathlib also has the advantage that it is actually in the standard library, whereas PyFilesystem is a pip install fs away.

I would say that I think PyFilesystem results in cleaner code for the most part, which could just be down to the fact that I've been working with PyFilesystem for a lot longer and it 'fits my brain' better. I'll let you be the judge. Also note that as the primary author of PyFilesystem, I obviously bring a bucket-load of bias here.

There is one area, though, where I think PyFilesystem is a clear winner: the PyFilesystem code above would work virtually unaltered with files in an archive, in memory, on an FTP server, in S3, or on any of the other supported filesystems.
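
For instance, the duplicate finder above would run against any of these by changing only the open_fs call (the archive, server and bucket below are made up, and S3 support needs the fs-s3fs extension installed):

from fs import open_fs

mem_fs = open_fs('mem://')                    # an in-memory filesystem
# zip_fs = open_fs('zip://backup.zip')        # the files inside a zip archive
# ftp_fs = open_fs('ftp://user:password@ftp.example.org')
# s3_fs = open_fs('s3://my-bucket')           # requires fs-s3fs

# the same walk / readbytes calls work on any of them
mem_fs.writetext('/a.txt', 'hello')
mem_fs.writetext('/b.txt', 'hello')
print(list(mem_fs.walk.files()))              # ['/a.txt', '/b.txt']
mem_fs.close()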

I'd like to apologise to Trey Hunner if I misrepresented anything he said in his post!


