Tuesday, August 31, 2021

Python for Beginners: How to Extract a Date from a .txt File in Python

In this tutorial, we’ll examine the different ways you can extract a date from a .txt file using Python programming. Python is a versatile language—as you’ll discover—and there are many solutions for this problem.

First, we’ll look at using regular expression patterns to search text files for dates that fit a predefined format. We’ll learn about using the re library and creating our own regular expression searches.

We’ll also examine datetime objects and use them to convert date strings into structured date values. Lastly, we’ll see how the datefinder module simplifies the process of searching a text file for dates that haven’t been formatted, like we might find in natural language content.

Extract a Date from a .txt File using Regular Expression

Dates are written in many different formats. Sometimes people write month/day/year. Other dates might include times of the day, or the day of the week (Wednesday July 8, 2021 8:00PM).

How dates are formatted is a factor to consider before we go about extracting them from text files. 

For instance, if a date follows the month/day/year format, we can find it using a regular expression pattern. With a regular expression, or regex for short, we can search a text for strings that match a predefined pattern. 

The beauty of regular expressions is that we can use special characters to create powerful search patterns. For instance, we can craft a pattern that will find all the formatted dates in the following body of text.

minutes.txt
10/14/2021 – Meeting with the client.
07/01/2021 – Discussed marketing strategies.
12/23/2021 – Interviewed a new team lead.
01/28/2018 – Changed domain providers.
06/11/2017 – Discussed moving to a new office.

Example: Finding formatted dates with regex

import re

# open the text file and read the data
# (the with statement closes the file for us)
with open("minutes.txt", 'r') as file:
    text = file.read()

# match a regex pattern for formatted dates
matches = re.findall(r'\d+/\d+/\d+', text)

print(matches)

Output

['10/14/2021', '07/01/2021', '12/23/2021', '01/28/2018', '06/11/2017']

The regex pattern here uses special characters to define the strings we want to extract from the text file. The characters \d and + tell regex we’re looking for one or more digits in a row, and the forward slashes match the separators between them.
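One caveat is easier to see in code: because \d+ accepts any number of digits, the pattern above also matches strings that are not dates. The sketch below uses a made-up sample string (not minutes.txt) to compare it with a stricter pattern that assumes two-digit months and days and a four-digit year.

```python
import re

# hypothetical sample text, not taken from minutes.txt
sample = "Ratio 1/2/3 noted on 10/14/2021; ticket 123/4567/89 is unrelated."

# the loose pattern accepts any run of digits between the slashes
loose = re.findall(r'\d+/\d+/\d+', sample)

# the strict pattern assumes MM/DD/YYYY and uses \b (word boundary)
# so it cannot start or stop in the middle of a longer number
strict = re.findall(r'\b\d{2}/\d{2}/\d{4}\b', sample)

print(loose)   # ['1/2/3', '10/14/2021', '123/4567/89']
print(strict)  # ['10/14/2021']
```

Which pattern is appropriate depends on how consistently the dates in your files are written; the strict version would miss a date such as 7/1/2021.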

We can also use regex to find dates that are formatted in different ways. By altering our regex pattern, we can find dates that use either a forward slash (/) or a dash (-) as the separator.

This works because regex lets us list several acceptable characters inside square brackets. The character set [/-] specifies that either character, a forward slash or a dash, is an acceptable match.

apple2.txt
The first Apple II was sold on 07-10-1977. The last of the Apple II
models were discontinued on 10/15/1994.

Example: Matching dates with a regex pattern

import re

# open a text file
f = open("apple2.txt", 'r')

# extract the file's content
content = f.read()

# a regular expression pattern to match dates
pattern = r"\d{2}[/-]\d{2}[/-]\d{4}"

# find all the strings that match the pattern
dates = re.findall(pattern, content)

for date in dates:
    print(date)

f.close()

Output

07-10-1977
10/15/1994
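If the individual parts of each date are needed, named groups are one option. The following is a minimal sketch that inlines the apple2.txt text so it runs on its own; the group names month, day, and year are chosen for illustration, and the month-first order is an assumption about the file.

```python
import re

# same text as apple2.txt, inlined so the sketch is self-contained
content = ("The first Apple II was sold on 07-10-1977. The last of the Apple II\n"
           "models were discontinued on 10/15/1994.")

# named groups label each part of the match; month-first order is assumed
pattern = re.compile(r'(?P<month>\d{2})[/-](?P<day>\d{2})[/-](?P<year>\d{4})')

# finditer yields match objects, and groupdict() maps group names to text
parts = [m.groupdict() for m in pattern.finditer(content)]

for p in parts:
    print(p["year"], p["month"], p["day"])
```

This prints 1977 07 10 and 1994 10 15, one per line.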

Examining the full extent of regex’s potential is beyond the scope of this tutorial. Try experimenting with some of the following special characters to learn more about using regular expression patterns to extract a date—or other information—from a .txt file.

Special Characters in Regex

  • \s – Any whitespace character (space, tab, newline)
  • \S – Any character except whitespace
  • \d – Any digit from 0 to 9
  • \D – Any character except a digit
  • \w – Any word character: a letter, digit, or underscore [a-zA-Z0-9_]
  • \W – Any non-word character
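To get a feel for these characters, here is a small sketch that applies a few of them to a made-up line of text. The sample string is hypothetical and only meant for experimenting.

```python
import re

# hypothetical one-line sample
line = "Room 12B, 3 pm"

digits = re.findall(r'\d', line)        # every single digit
non_digits = re.findall(r'\D+', line)   # runs of non-digit characters
words = re.findall(r'\w+', line)        # runs of word characters
pieces = re.split(r'\s+', line)         # split wherever whitespace occurs

print(digits)      # ['1', '2', '3']
print(non_digits)  # ['Room ', 'B, ', ' pm']
print(words)       # ['Room', '12B', '3', 'pm']
print(pieces)      # ['Room', '12B,', '3', 'pm']
```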

Extract a Datetime Object from a .txt File

In Python we can use the datetime library for manipulating dates and working with time. The datetime library comes pre-packed with Python, so there’s no need to install it.

By using datetime objects, we have more control over string data read from text files. For example, we can use a datetime object to get a copy of the current date and time of our computer.

import datetime

now = datetime.datetime.now()
print(now)

Output

2021-07-04 20:15:49.185380
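A datetime object can also be turned back into a string in whatever layout we need. The sketch below uses strftime() with a fixed moment (the one printed above, minus the microseconds) so that the result is repeatable; the variable names are only illustrative.

```python
from datetime import datetime

# a fixed moment so the output does not change from run to run
moment = datetime(2021, 7, 4, 20, 15, 49)

us_style = moment.strftime('%m/%d/%Y')    # month/day/year
iso_style = moment.strftime('%Y-%m-%d')   # year-month-day
clock = moment.strftime('%H:%M')          # 24-hour time

print(us_style)   # 07/04/2021
print(iso_style)  # 2021-07-04
print(clock)      # 20:15
```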

In the following example, we’ll extract a date from a company .txt file that mentions a scheduled meeting. Our employer needs us to scan a group of such documents for dates. Later, we plan to add the information we gather to a SQLite database.

We’ll begin by defining a regex pattern that will match our date format. Once a match is found, we’ll use it to create a datetime object from the string data.

schedule.txt
The project begins next month. Denise has scheduled a meeting in the conference room at the Embassy Suites on 10-7-2021.

Example: Creating datetime objects from file data

import re
from datetime import datetime

# open the data file
file = open("schedule.txt", 'r')
text = file.read()

# search for a day-month-year date such as 10-7-2021
match = re.search(r'\d+-\d+-\d{4}', text)
# create a new date object from the regex match (day first, then month)
date = datetime.strptime(match.group(), '%d-%m-%Y').date()
print(f"The date of the meeting is on {date}.")
file.close()

Output

The date of the meeting is on 2021-07-10.
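Two things can go wrong with this approach: re.search() returns None when nothing matches, and a string can fit the pattern without being a real date. The following is a minimal sketch of one way to guard against both. The helper name extract_valid_dates and the sample sentence are hypothetical, and the day-month-year order is the same assumption the example above makes.

```python
import re
from datetime import datetime

def extract_valid_dates(text, fmt='%d-%m-%Y'):
    """Return date objects for every D-M-YYYY string that is a real date."""
    found = []
    # findall returns an empty list when nothing matches, so no None check is needed
    for candidate in re.findall(r'\b\d{1,2}-\d{1,2}-\d{4}\b', text):
        try:
            found.append(datetime.strptime(candidate, fmt).date())
        except ValueError:
            # the string looked like a date but is not one (e.g. month 13)
            continue
    return found

# hypothetical text: one valid date and one impossible one
sample = "Kickoff on 10-7-2021, not on 31-13-2021."
dates = extract_valid_dates(sample)
print(dates)  # [datetime.date(2021, 7, 10)]
```

Skipping invalid strings silently is a design choice; when scanning many documents, it may be more useful to log them instead.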

Extracting Dates from a Text File with the Datefinder Module

The Python datefinder module can locate dates in a body of text. Using the find_dates() function, it’s possible to search text data for many different types of dates. Datefinder returns each date it finds as a datetime object.

Unlike the other packages we’ve discussed in this guide, Python does not come with datefinder. The easiest way to install the datefinder module is to use pip from the command prompt.

pip install datefinder

With datefinder installed, we’re ready to open files and extract data. For this example, we’ll use a text document that introduces a fictitious company project. Using datefinder, we’ll extract each date from the .txt file and print its datetime object counterpart.

Feel free to save the file locally and follow along.

project_timeline.txt
PROJECT PEPPER

All team members must read the project summary by
January 4th, 2021.

The first meeting of PROJECT PEPPER begins on 01/15/2021

at 9:00am. Please find the time to read the following links by then.
created on 08-12-2021 at 05:00 PM

This project file has dates in many formats. Dates are written using dashes and forward slashes. What’s worse, the month January is written out. How can we find all these dates with Python?

Example: Using datefinder to extract dates from file data

import datefinder

# open the project schedule
file = open("project_timeline.txt",'r')

content = file.read()

# datefinder will find the dates for us
matches = list(datefinder.find_dates(content))

if len(matches) > 0:
    for date in matches:
        print(date)
else:
    print("Found no dates.")

file.close()

Output
2021-01-04 00:00:00
2021-01-15 09:00:00
2021-08-12 17:00:00

As you can see from the output, datefinder is able to find a variety of date formats in the text. Not only is the package capable of recognizing the names of months, but it also recognizes the time of day if it’s included in the text.

In another example, we’ll use the datefinder package to extract the dates and times from a .txt file that lists a popular singer’s upcoming tour.

tour_dates.txt
Saturday July 25, 2021 at 07:00 PM     Inglewood, CA
Sunday July 26, 2021 at 7 PM     Inglewood, CA
09/30/2021 7:30PM  Foxborough, MA

Example: Extracting tour dates and times from a .txt file with datefinder

import datefinder

# open the tour schedule
file = open("tour_dates.txt",'r')

content = file.read()

# datefinder will find the dates for us
matches = list(datefinder.find_dates(content))

if len(matches) > 0:
    print("TOUR DATES AND TIMES")
    print("--------------------")
    for date in matches:
        # use f string to format the text
        print(f"{date.date()}     {date.time()}")
else:
    print("Found no dates.")
file.close()

Output

TOUR DATES AND TIMES
--------------------
2021-07-25     19:00:00
2021-07-26     19:00:00
2021-09-30     19:30:00

As you can see from the examples, datefinder can find many different types of dates and times. This is useful if the dates you’re looking for don’t have a certain format, as will often be the case in natural language data.
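If installing a third-party package is not an option, a rough approximation is possible with the standard library alone. The sketch below is not how datefinder works internally; it simply tries a short, assumed list of strptime() formats against each regex candidate. The names FORMATS, CANDIDATE, and parse_candidate are hypothetical, the text is adapted from project_timeline.txt, times of day are ignored, and %B expects English month names under the default locale.

```python
import re
from datetime import datetime

# formats to try, in order; this list is an assumption about the input
FORMATS = ['%m/%d/%Y', '%m-%d-%Y', '%B %d, %Y']

# one regex alternative per family of formats
CANDIDATE = re.compile(
    r'\d{1,2}[/-]\d{1,2}[/-]\d{4}'                   # 01/15/2021 or 08-12-2021
    r'|[A-Z][a-z]+ \d{1,2}(?:st|nd|rd|th)?, \d{4}'   # January 4th, 2021
)

def parse_candidate(raw):
    """Try each known format; return a date, or None if none fits."""
    # drop ordinal suffixes such as the "th" in "4th"
    cleaned = re.sub(r'(?<=\d)(?:st|nd|rd|th)\b', '', raw)
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None

text = ("All team members must read the project summary by January 4th, 2021. "
        "The first meeting begins on 01/15/2021. created on 08-12-2021")

parsed = (parse_candidate(raw) for raw in CANDIDATE.findall(text))
found = [d for d in parsed if d is not None]

for d in found:
    print(d)
```

For this text the sketch prints 2021-01-04, 2021-01-15, and 2021-08-12. Every new format needs its own entry in both the regex and the format list, which is exactly the bookkeeping datefinder spares us.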

Summary

In this post, we’ve covered several ways to extract a date or time from a .txt file. We’ve seen the power of regular expressions to find matches in string data, and we’ve seen how to convert that data into a Python datetime object.

Finally, if the dates in your text files don’t have a specified format—as will be the case in most files with natural language content—try the datefinder module. With this Python package, it’s possible to extract dates and times from a text file that aren’t conveniently formatted ahead of time.


The post How to Extract a Date from a .txt File in Python appeared first on PythonForBeginners.com.


