Tuesday, March 31, 2020

Stack Abuse: Reading and Writing MS Word Files in Python via Python-Docx Module

The MS Word utility from Microsoft Office suite is one of the most commonly used tools for writing text documents, both simple and complex. Though humans can easily read and write MS Word documents, assuming you have the Office software installed, often times you need to read text from Word documents within another application.

For instance, if you are developing a natural language processing application in Python that takes MS Word files as input, you will need to read MS Word files in Python before you can process the text. Similarly, often times you need to write text to MS Word documents as output, which could be a dynamically generated report to download, for example.

In this article, article you will see how to read and write MS Word files in Python.

Installing Python-Docx Library

Several libraries exist that can be used to read and write MS Word files in Python. However, we will be using the python-docx module owing to its ease-of-use. Execute the following pip command in your terminal to download the python-docx module as shown below:

$ pip install python-docx

Reading MS Word Files with Python-Docx Module

In this section, you will see how to read text from MS Word files via the python-docx module.

Create a new MS Word file and rename it as "my_word_file.docx". I saved the file in the root of my "E" directory, although you can save the file anywhere you want. The my_word_file.docx file should have the following content:

reading ms word files in python

To read the above file, first import the docx module and then create an object of the Document class from the docx module. Pass the path of the my_word_file.docx to the constructor of the Document class, as shown in the following script:

import docx

doc = docx.Document("E:/my_word_file.docx")

The Document class object doc can now be used to read the content of the my_word_file.docx.

Reading Paragraphs

Once you create an object of the Document class using the file path, you can access all the paragraphs in the document via the paragraphs attribute. An empty line is also read as a paragraph by the Document. Let's fetch all the paragraphs from the my_word_file.docx and then display the total number of paragraphs in the document:

all_paras = doc.paragraphs
len(all_paras)

Output:

10

Now we'll iteratively print all the paragraphs in the my_word_file.docx file:

for para in all_paras:
    print(para.text)
    print("-------")

Output:

-------
Introduction
-------

-------
Welcome to stackabuse.com
-------
The best site for learning Python and Other Programming Languages
-------
Learn to program and write code in the most efficient manner
-------

-------
Details
-------

-------
This website contains useful programming articles for Java, Python, Spring etc.
-------

The output shows all of the paragraphs in the Word file.

We can even access a specific paragraph by indexing the paragraphs property like an array. Let's print the 5th paragraph in the file:

single_para = doc.paragraphs[4]
print(single_para.text)

Output:

The best site for learning Python and Other Programming Languages

Reading Runs

A run in a word document is a continuous sequence of words having similar properties, such as similar font sizes, font shapes, and font styles. For example, if you look at the second line of the my_word_file.docx, it contains the text "Welcome to stackabuse.com", here the text "Welcome to" is in plain font, while the text "stackabuse.com" is in bold face. Hence, the text "Welcome to" is considered as one run, while the bold faced text "stackabuse.com" is considered as another run.

Similarly, "Learn to program and write code in the" and "most efficient manner" are treated as two different runs in the paragraph "Learn to program and write code in the most efficient manner".

To get all the runs in a paragraph, you can use the run property of the paragraph attribute of the doc object.

Let's read all the runs from paragraph number 5 (4th index) in our text:

single_para = doc.paragraphs[4]
for run in single_para.runs:
    print(run.text)

Output:

The best site for
learning Python
 and Other
Programming Languages

In the same way, the following script prints all the runs from the 6th paragraph of the my_word_file.docx file:

second_para = doc.paragraphs[5]
for run in second_para.runs:
    print(run.text)

Output:

Learn to program and write code in the
most efficient manner

Writing MS Word Files with Python-Docx Module

In the previous section, you saw how to read MS Word files in Python using the python-docx module. In this section, you will see how to write MS Word files via the python-docx module.

To write MS Word files, you have to create an object of the Document class with an empty constructor, or without passing a file name.

mydoc = docx.Document()

Writing Paragraphs

To write paragraphs, you can use the add_paragraph() method of the Document class object. Once you have added a paragraph, you will need to call the save() method on the Document class object. The path of the file to which you want to write your paragraph is passed as a parameter to the save() method. If the file doesn't already exist, a new file will be created, otherwise the paragraph will be appended at the end of the existing MS Word file.

The following script writes a simple paragraph to a newly created MS Word file named "my_written_file.docx".

mydoc.add_paragraph("This is first paragraph of a MS Word file.")
mydoc.save("E:/my_written_file.docx")

Once you execute the above script, you should see a new file "my_written_file.docx" in the directory that you specified in the save() method. Inside the file, you should see one paragraph which reads "This is first paragraph of a MS Word file."

Let's add another paragraph to the my_written_file.docx:

mydoc.add_paragraph("This is the second paragraph of a MS Word file.")
mydoc.save("E:/my_written_file.docx")

This second paragraph will be appended at the end of the existing content in my_written_file.docx.

Writing Runs

You can also write runs using the python-docx module. To write runs, you first have to create a handle for the paragraph to which you want to add your run. Take a look at the following example to see how it's done:

third_para = mydoc.add_paragraph("This is the third paragraph.")
third_para.add_run(" this is a section at the end of third paragraph")
mydoc.save("E:/my_written_file.docx")

In the script above we write a paragraph using the add_paragraph() method of the Document class object mydoc. The add_paragraph() method returns a handle for the newly added paragraph. To add a run to the new paragraph, you need to call the add_run() method on the paragraph handle. The text for the run is passed in the form of a string to the add_run() method. Finally, you need to call the save() method to create the actual file.

Writing Headers

You can also add headers to MS Word files. To do so, you need to call the add_heading() method. The first parameter to the add_heading() method is the text string for header, and the second parameter is the header size. The header sizes start from 0, with 0 being the top level header.

The following script adds three headers of level 0, 1, and 2 to the file my_written_file.docx:

mydoc.add_heading("This is level 1 heading", 0)
mydoc.add_heading("This is level 2 heading", 1)
mydoc.add_heading("This is level 3 heading", 2)
mydoc.save("E:/my_written_file.docx")

Adding Images

To add images to MS Word files, you can use the add_picture() method. The path to the image is passed as a parameter to the add_picture() method. You can also specify the width and height of the image using the docx.shared.Inches() attribute. The following script adds an image from the local file system to the my_written_file.docx Word file. The width and height of the image will be 5 and 7 inches, respectively:

mydoc.add_picture("E:/eiffel-tower.jpg", width=docx.shared.Inches(5), height=docx.shared.Inches(7))
mydoc.save("E:/my_written_file.docx")

After executing all the scripts in the Writing MS Word Files with Python-Docx Module section of this article, your final my_written_file.docx file should look like this:

writing ms word files in python

In the output, you can see the three paragraphs that you added to the MS word file, along with the three headers and one image.

Conclusion

The article gave a brief overview of how to read and write MS Word files using the python-docx module. The article covers how to read paragraphs and runs from within a MS Word file. Finally, the process of writing MS Word files, adding a paragraph, runs, headers, and images to MS Word files have been explained in this article.



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...