Wednesday, February 27, 2019

Stack Abuse: Introduction to the Python Pathlib Module

The Pathlib module in Python simplifies the way in working with files and folders. The Pathlib module is available from Python 3.4 and higher versions. It combines the best of Python's file system modules namely os, os.path, glob, etc.

In Python, most of the scripts involve interacting with file systems. Hence, it is important to deal with file names and paths. To achieve this, Python includes the Pathlib module which contains useful functions to perform file-related tasks. Pathlib provides a more readable and easier way to build up paths by representing filesystem paths as proper objects and enables us to write code that is portable across platforms.

In this article, we will study the Pathlib module in detail with the help of various examples.

The Concept of Path and Directory

Before moving further into details of the Pathlib module, it's important to understand 2 different concepts namely - path and directory.

The path is used to identify a file. The path provides an optional sequence of directory names terminated by the final file name including the filename extension. The filename extension provides some information about the file format/ contents. The Pathlib module can deal with absolute as well as relative paths. An absolute path begins from the root directory and specifies the complete directory tree, whereas a relative path, as the name suggests, is the path of a file relative to another file or directory (usually the current directory).

Directory represents the filesystem entry of the path and it includes file name, creation time, size, owner, etc.

The Pathlib module in Python deals with path related tasks, such as constructing new paths from names of files and from other paths, checking for various properties of paths and creating files and folders at specific paths.

How to use the Pathlib Module?

To use the pathlib module conveniently within our scripts, we import all the classes in it using:

from pathlib import *  

As a first task, let's retrieve the current working directory and home directory objects, respectively, using the code below:

current_dir = Path.cwd()  
home_dir = Path.home()  
print(current_dir)  
print(home_dir)  

We can choose to import pathlib instead of importing all the classes. In that case, all the subsequent uses of classes within the module should be prefixed with pathlib.

import pathlib

current_dir = pathlib.Path.cwd()  
home_dir = pathlib.Path.home()  
print(current_dir)  
print(home_dir)  

Why use the Pathlib Module?

If you've been working with the Python language for a while, you would be wondering what is the necessity of Pathlib module when os, os.path, glob, etc. modules are already available? This is a fully justified concern. Let's try to address this via an example.

Let's say we want to make a file called "output/output.xlsx" within the current working directory. The following code tries to achieve this using the os.path module. For this, os.getcwd and os.path.join functions are used.

import os  
outpath = os.path.join(os.getcwd(), 'output')  
outpath_file = os.path.join(outpath, 'out.xlsx')  

Alternately,

outpath_file = os.pathjoin(os.path.join(os.getcwd(), 'output'), "out.xlsx")  

Though the code works, it looks clunky and is not readable nor easy to maintain. Imagine how this code would look if we wanted to create a new file inside multiple nested directories.

The same code can be re-written using Pathlib module, as follows:

from pathlib import Path  
outpath = Path.cwd() / 'output' / 'output.xlsx'  

This format is easier to parse mentally. In Pathlib, the Path.cwd() function is used to get the current working directory and / operator is used in place of os.path.join to combine parts of the path into a compound path object. The function nesting pattern in the os.path module is replaced by the Path class of Pathlib module that represents the path by chaining methods and attributes. The clever overloading of the / operator makes the code readable and easy to maintain.

Another benefit of the method provided by the Pathlib module is that a Path object is created rather than creating a string representation of the path. This object has several handy methods that make life easier than working with raw strings that represent paths.

Performing Operations on Paths

The classic os.path module is used only for manipulating path strings. To do something with the path, for example, creating a directory, we need the os module. The os module provides a set of functions for working with files and directories, like: mkdir for creating a directory, rename to rename a directory, getsize to get the size of a directory and so on.

Let's write some of these operations using the os module and then rewrite the same code using the Pathlib module.

Sample code written using os module:

if os.path.isdir(path):  
    os.rmdir(path)

If we use Pathlib module's path objects to achieve the same functionality, the resulting code will be much more readable and easier to maintain as shown below:

if path.is_dir()  
    path.rmdir()

It is cumbersome to find path related utilies in the os module. The Pathlib module solves the problem by replacing the utilities of os module with methods on path objects. Let us understand it even better with a code:

outpath = os.path.join(os.getcwd(), 'output')  
outpath_tmp = os.path.join(os.getcwd(), 'output.tmp')  
generate_data(output_tmp)

if os.path.getsize(output_tmp):  
    os.rename(outpath_tmp, outpath)
else: # Nothing produced  
    os.remove(outpath_tmp)

Here, the function generate_data() takes a file path as a parameter and writes data to another path. However, if the file that is passed as a parameter is not changed, since the last time the generate_data() function was executed, an empty file is generated. In that case, the empty file is replaced with the previous version of the file.

The variable outpath stores the data by joining the current working directory with the filename "output". We create a temp version, as well, named as outpath.tmp. If the size of temp version is not zero, which implies that it is not an empty file, then the temp version is renamed to outpath, otherwise the temp version is removed and the old version is retained.

Using the os module, manipulating paths of filesystems as string objects become clumsy as there are multiple calls to os.path.join(), os.getcwd(), etc. To avoid this problem, the Pathlib module offers a set of classes that can be used for frequently used operations on the path, in a more readable, simple, object-oriented way.

Let's try to re-write the above code using Pathlib module.

from pathlib import Path

outpath = Path.cwd() / 'output'  
outpath_tmp = Path.cwd() / 'output_tmp'

generate_data(output_tmp)

if outpath_tmp.stat().st_size:  
    outpath_tmp.rename(outpath)
else: # Nothing produced  
    Path_tmp.unlink()

Using Pathlib, os.getcwd() becomes Path.cwd() and '/' operator is used to join paths and used in place of os.path.join. Using the Pathlib module, things can be done in a simpler way using operators and method calls.

Following are commonly used methods and it's usage:

  • Path.cwd(): Return path object representing the current working directory
  • Path.home(): Return path object representing the home directory
  • Path.stat(): return info about the path
  • Path.chmod(): change file mode and permissions
  • Path.glob(pattern): Glob the pattern given in the directory that is represented by the path, yielding matching files of any kind
  • Path.mkdir(): to create a new directory at the given path
  • Path.open(): To open the file created by the path
  • Path.rename(): Rename a file or directory to the given target
  • Path.rmdir(): Remove the empty directory
  • Path.unlink(): Remove the file or symbolic link

Generating Cross-Platform Paths

Paths use different conventions in different Operating Systems. Windows uses a backslash between folder names, whereas all other popular Operating Systems uses forward slash between folder names. If you want your python code to work, irrespective of the underlying OS, you'll need to handle the different conventions specific to the underlying platform. The Pathlib module makes working with file paths easier. In Pathlib, you can just pass a path or filename to Path() object using forward slash, irrespective of the OS. Pathlib handles the rest.

pathlib.Path.home() / 'python' / 'samples' / 'test_me.py'  

The Path() object will covert the / to the apt kind of slash, for the underlying Operating System. The pathlib.Path may represent either Windows or Posix path. Thus, Pathlib solves a lot of cross-functional bugs, by handling paths easily.

Getting Path Information

While dealing with paths, we are interested in finding the parent directory of a file/folder or in following symbolic links. Path class has several convenient methods for doing this, as different parts of a path are available as properties that include the following:

  • drive: a string that represents the drive name. For example, PureWindowsPath('c:/Program Files/CSV').drive returns "C:"
  • parts: returns a tuple that provides access to the path's components
  • name: the path component without any directory
  • parent: sequence providing access to the logical ancestors of the path
  • stem: final path component without its suffix
  • suffix: the file extension of the final component
  • anchor: the part of a path before the directory. / is used to create child paths and mimics the behavior of os.path.join.
  • joinpath: combines the path with the arguments provided
  • match(pattern): returns True/False, based on matching the path with the glob-style pattern provided

In path "/home/projects/stackabuse/python/sample.md":

  • path: - returns PosixPath('/home/projects/stackabuse/python/sample.md')
  • path.parts: - returns ('/', 'home', 'projects', 'stackabuse', 'python')
  • path.name: - returns 'sample.md'
  • path.stem: - returns 'sample'
  • path.suffix: - returns '.md'
  • path.parent: - returns PosixPath('/home/projects/stackabuse/python')
  • path.parent.parent: - returns PosixPath('/home/projects/stackabuse')
  • path.match('*.md'): returns True
  • PurePosixPath('/python').joinpath('edited_version'): returns ('home/projects/stackabuse/python/edited_version

Alternative of the Glob Module

Apart from os , os.path modules, glob module is also available in Python that provides file path related utils. glob.glob function of the glob module is used to find files matching a pattern.

from glob import glob

top_xlsx_files = glob('*.xlsx')  
all_xlsx_files = glob('**/*.xlsx', recursive=True)  

The pathlib provides glob utilities, as well:

from pathlib import Path

top_xlsx_files = Path.cwd().glob('*.xlsx')  
all_xlsx_files = Path.cwd().rglob('*.xlsx')  

The glob functionality is available with Path objects. Thus, pathlib modules make complex tasks simpler.

Reading and Writing Files using Pathlib

The following methods are used to perform basic operations like reading and writing files:

  • read_text: File is opened in text mode to read the contents of the file and close it after reading
  • read_bytes: Used to open the file in binary mode and return contents in binary form and closes the file after the same.
  • write_text: Used to open the file and writes text and closes it later
  • write_bytes: Used to write binary data to a file and closes the file, once done

Let's explore the usage of the Pathlib module for common file operations. The following example is used to read the contents of a file:

path = pathlib.Path.cwd() / 'Pathlib.md'  
path.read_text()  

Here, read_text method on Path object is used to read the contents of the file.
Below example is used to write data to a file, in text mode:

from pathlib import Path

p = Path('sample_text_file') p.write_text('Sample to write data to a file')  

Thus, in the Pathlib module, by having the path as an object enables us to perform useful actions on the objects for file system involving lots of path manipulation like creating or removing directories, looking for specific files, moving files etc.

Conclusion

To conclude, the Pathlib module provides a huge number of rich and useful features that can be used to perform a variety of path related operations. As an added advantage the library is consistent across the underlying Operating System.



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...