The MS Word utility from Microsoft Office suite is one of the most commonly used tools for writing text documents, both simple and complex. Though humans can easily read and write MS Word documents, assuming you have the Office software installed, often times you need to read text from Word documents within another application.
For instance, if you are developing a natural language processing application in Python that takes MS Word files as input, you will need to read MS Word files in Python before you can process the text. Similarly, often times you need to write text to MS Word documents as output, which could be a dynamically generated report to download, for example.
In this article, article you will see how to read and write MS Word files in Python.
Installing Python-Docx Library
Several libraries exist that can be used to read and write MS Word files in Python. However, we will be using the python-docx module owing to its ease-of-use. Execute the following pip
command in your terminal to download the python-docx
module as shown below:
$ pip install python-docx
Reading MS Word Files with Python-Docx Module
In this section, you will see how to read text from MS Word files via the python-docx
module.
Create a new MS Word file and rename it as "my_word_file.docx". I saved the file in the root of my "E" directory, although you can save the file anywhere you want. The my_word_file.docx file should have the following content:
To read the above file, first import the docx
module and then create an object of the Document
class from the docx
module. Pass the path of the my_word_file.docx to the constructor of the Document
class, as shown in the following script:
import docx
doc = docx.Document("E:/my_word_file.docx")
The Document
class object doc
can now be used to read the content of the my_word_file.docx.
Reading Paragraphs
Once you create an object of the Document
class using the file path, you can access all the paragraphs in the document via the paragraphs
attribute. An empty line is also read as a paragraph by the Document
. Let's fetch all the paragraphs from the my_word_file.docx and then display the total number of paragraphs in the document:
all_paras = doc.paragraphs
len(all_paras)
Output:
10
Now we'll iteratively print all the paragraphs in the my_word_file.docx file:
for para in all_paras:
print(para.text)
print("-------")
Output:
-------
Introduction
-------
-------
Welcome to stackabuse.com
-------
The best site for learning Python and Other Programming Languages
-------
Learn to program and write code in the most efficient manner
-------
-------
Details
-------
-------
This website contains useful programming articles for Java, Python, Spring etc.
-------
The output shows all of the paragraphs in the Word file.
We can even access a specific paragraph by indexing the paragraphs
property like an array. Let's print the 5th paragraph in the file:
single_para = doc.paragraphs[4]
print(single_para.text)
Output:
The best site for learning Python and Other Programming Languages
Reading Runs
A run in a word document is a continuous sequence of words having similar properties, such as similar font sizes, font shapes, and font styles. For example, if you look at the second line of the my_word_file.docx, it contains the text "Welcome to stackabuse.com", here the text "Welcome to" is in plain font, while the text "stackabuse.com" is in bold face. Hence, the text "Welcome to" is considered as one run, while the bold faced text "stackabuse.com" is considered as another run.
Similarly, "Learn to program and write code in the" and "most efficient manner" are treated as two different runs in the paragraph "Learn to program and write code in the most efficient manner".
To get all the runs in a paragraph, you can use the run
property of the paragraph
attribute of the doc
object.
Let's read all the runs from paragraph number 5 (4th index) in our text:
single_para = doc.paragraphs[4]
for run in single_para.runs:
print(run.text)
Output:
The best site for
learning Python
and Other
Programming Languages
In the same way, the following script prints all the runs from the 6th paragraph of the my_word_file.docx file:
second_para = doc.paragraphs[5]
for run in second_para.runs:
print(run.text)
Output:
Learn to program and write code in the
most efficient manner
Writing MS Word Files with Python-Docx Module
In the previous section, you saw how to read MS Word files in Python using the python-docx
module. In this section, you will see how to write MS Word files via the python-docx
module.
To write MS Word files, you have to create an object of the Document
class with an empty constructor, or without passing a file name.
mydoc = docx.Document()
Writing Paragraphs
To write paragraphs, you can use the add_paragraph()
method of the Document
class object. Once you have added a paragraph, you will need to call the save()
method on the Document
class object. The path of the file to which you want to write your paragraph is passed as a parameter to the save()
method. If the file doesn't already exist, a new file will be created, otherwise the paragraph will be appended at the end of the existing MS Word file.
The following script writes a simple paragraph to a newly created MS Word file named "my_written_file.docx".
mydoc.add_paragraph("This is first paragraph of a MS Word file.")
mydoc.save("E:/my_written_file.docx")
Once you execute the above script, you should see a new file "my_written_file.docx" in the directory that you specified in the save()
method. Inside the file, you should see one paragraph which reads "This is first paragraph of a MS Word file."
Let's add another paragraph to the my_written_file.docx:
mydoc.add_paragraph("This is the second paragraph of a MS Word file.")
mydoc.save("E:/my_written_file.docx")
This second paragraph will be appended at the end of the existing content in my_written_file.docx.
Writing Runs
You can also write runs using the python-docx
module. To write runs, you first have to create a handle for the paragraph to which you want to add your run. Take a look at the following example to see how it's done:
third_para = mydoc.add_paragraph("This is the third paragraph.")
third_para.add_run(" this is a section at the end of third paragraph")
mydoc.save("E:/my_written_file.docx")
In the script above we write a paragraph using the add_paragraph()
method of the Document
class object mydoc
. The add_paragraph()
method returns a handle for the newly added paragraph. To add a run to the new paragraph, you need to call the add_run()
method on the paragraph handle. The text for the run is passed in the form of a string to the add_run()
method. Finally, you need to call the save()
method to create the actual file.
Writing Headers
You can also add headers to MS Word files. To do so, you need to call the add_heading()
method. The first parameter to the add_heading()
method is the text string for header, and the second parameter is the header size. The header sizes start from 0, with 0 being the top level header.
The following script adds three headers of level 0, 1, and 2 to the file my_written_file.docx:
mydoc.add_heading("This is level 1 heading", 0)
mydoc.add_heading("This is level 2 heading", 1)
mydoc.add_heading("This is level 3 heading", 2)
mydoc.save("E:/my_written_file.docx")
Adding Images
To add images to MS Word files, you can use the add_picture()
method. The path to the image is passed as a parameter to the add_picture()
method. You can also specify the width and height of the image using the docx.shared.Inches()
attribute. The following script adds an image from the local file system to the my_written_file.docx Word file. The width and height of the image will be 5 and 7 inches, respectively:
mydoc.add_picture("E:/eiffel-tower.jpg", width=docx.shared.Inches(5), height=docx.shared.Inches(7))
mydoc.save("E:/my_written_file.docx")
After executing all the scripts in the Writing MS Word Files with Python-Docx Module section of this article, your final my_written_file.docx file should look like this:
In the output, you can see the three paragraphs that you added to the MS word file, along with the three headers and one image.
Conclusion
The article gave a brief overview of how to read and write MS Word files using the python-docx
module. The article covers how to read paragraphs and runs from within a MS Word file. Finally, the process of writing MS Word files, adding a paragraph, runs, headers, and images to MS Word files have been explained in this article.
from Planet Python
via read more
No comments:
Post a Comment