Thursday, November 11, 2021

Stack Abuse: Python: Validate Email Address with Regular Expressions (RegEx)

Introduction

Regular Expressions, or RegEx for short, are expressions of patterns that can be used for text search and replace actions, validations, string splitting, and much more. These patterns consist of characters, digits and special characters, in such a form that the pattern matches certain segments of text we're searching through.

Regular Expressions are widely used for pattern matching, and various programming languages have interfaces for representing them, as well as for interacting with the match results.

In this article, we will take a look at how to validate email addresses in Python, using Regular Expressions.

If you would like to learn more about Python's interface with Regular Expressions, read our Guide to Regular Expressions in Python!

General-Purpose Email Regular Expression

It's worth noting that no single regular expression matches every possible valid email address. There are, however, expressions that match most valid email addresses.

We need to define what kind of email address format we are looking for. The most common email format is:

(username)@(domainname).(top-leveldomain)

Thus, we can boil it down to a pattern of the @ symbol dividing the prefix from the domain segment.

The prefix is the recipient's name - a string that may contain uppercase and lowercase letters, numbers, and some special characters like the . (dot), - (hyphen), and _ (underscore).

The domain consists of its name and a top-level domain divided by a . (dot) symbol. The domain name can have uppercase and lowercase letters, numbers, and - (hyphen) symbols. Additionally, the top-level domain name must be at least 2 characters long (uppercase or lowercase letters only), but can be longer.

Note: There are a lot more detailed rules regarding valid emails, such as character count, more specific characters that can be used, etc. We'll be taking a look at an extended, highly fail-proof Regular Expression as defined by RFC5322 after the general-purpose approach.

In simple terms, our email Regular Expression could look like this:

(string1)@(string2).(2+characters)

This would match correctly for email addresses such as:

name.surname@gmail.com
anonymous123@yahoo.co.uk
my_email@outlook.co

Again, using the same expression, these email addresses would fail:

johnsnow@gmail
anonymous123@...uk
myemail@outlook.

It's worth noting that the strings shouldn't contain certain special characters, lest they break the form again. Additionally, the top-level domain can't consist of dots alone. Accounting for those cases as well, we can put these rules down into a concrete expression that takes a few more cases into account than the first representation:

([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+

A special character in the prefix cannot be just before the @ symbol, nor can the prefix start with it, so we made sure that there is at least one alphanumeric character before and after every special character.
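One detail worth checking in the special-character class is where the hyphen sits. Inside square brackets, a hyphen between two characters defines a range, so [.-_] means "every character from . to _" rather than "dot, hyphen, or underscore". The short sketch below compares the two spellings; the variable names are only illustrative.

```python
import re

# A hyphen between two characters in a class defines a range:
# [.-_] spans every character from "." (0x2E) to "_" (0x5F).
range_class = re.compile(r"[.-_]")

# Placing the hyphen last keeps it literal: dot, underscore, or hyphen.
literal_class = re.compile(r"[._-]")

for ch in ["-", "@", "7", "_"]:
    in_range = bool(range_class.fullmatch(ch))
    in_literal = bool(literal_class.fullmatch(ch))
    print(repr(ch), "range class:", in_range, "literal class:", in_literal)
```

The range spelling accepts characters such as @ and digits as separators while rejecting the hyphen itself, which is why the hyphen is written last in the class.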

As for the domain, an email can contain a few top-level domains divided with a dot.

Obviously, this regex is more complicated than the first one, but it covers all of the rules we have defined for the email format. Yet again, it can probably fail to properly validate some edge case that we haven't thought of.
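To make the pieces of the expression easier to follow, here is a sketch of the same pattern written with Python's re.VERBOSE flag, which lets us spread a pattern over several lines and comment each part. It is written with the hyphen last in the separator class and with letters only in the top-level part; the name email_pattern is just an illustrative choice.

```python
import re

email_pattern = re.compile(r"""
    ([A-Za-z0-9]+[._-])*   # zero or more alphanumeric runs, each followed by one separator
    [A-Za-z0-9]+           # the prefix must end with an alphanumeric run
    @                      # divider between the prefix and the domain
    [A-Za-z0-9-]+          # domain name: letters, digits, and hyphens
    (\.[A-Za-z]{2,})+      # one or more dot-separated parts of at least 2 letters
""", re.VERBOSE)

for address in ["my_email@outlook.co", "johnsnow@gmail", "name_@gmail.com"]:
    print(address, bool(email_pattern.fullmatch(address)))
```

Only the first address passes here: the second has no top-level domain, and the third has a special character directly before the @ symbol.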

Validate Email Address with Python

The re module contains functions and classes for working with Regular Expressions in Python, so we'll import it into our script. The function that we will be using is re.fullmatch(pattern, string, flags). It returns a match object only if the whole string matches the pattern; in any other case it returns None.

Note: re.fullmatch() was introduced in Python 3.4. Before that, re.match() was used instead. On newer versions, fullmatch() is preferred.
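The difference between the two functions matters for validation. As a small illustration, the sketch below uses a deliberately simplified pattern (not the one from this article) on a string that begins with something email-like but continues with extra text:

```python
import re

# Deliberately simplified pattern, for illustration only
pattern = r"[a-z]+@[a-z]+\.[a-z]{2,}"
candidate = "user@example.com and trailing text"

# re.match() only requires the pattern to match at the start of the string
print(re.match(pattern, candidate) is not None)

# re.fullmatch() requires the pattern to cover the whole string
print(re.fullmatch(pattern, candidate) is not None)

# With re.match(), anchoring the pattern at the end gives the same strictness
print(re.match(pattern + r"\Z", candidate) is not None)
```

The first check succeeds while the other two fail, so with re.match() an explicit end anchor is needed to avoid accepting input that merely starts with a valid-looking address.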

Let's compile() the Regular Expression from before, and define a simple function that accepts an email address and uses the expression to validate it:

import re

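# The hyphen goes last in [._-] so it stays literal; the top-level part allows letters only
regex = re.compile(r'([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+')

def isValid(email):
    # fullmatch() returns a match object only if the whole string matches
    if re.fullmatch(regex, email):
        print("Valid email")
    else:
        print("Invalid email")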

The re.compile() method compiles a regex pattern into a regex object. It's mostly used for efficiency reasons, when we plan on matching the pattern more than once.
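A compiled pattern object also has its own fullmatch() method, so the pattern does not need to be passed around separately. As a possible variation, the sketch below returns a boolean instead of printing, which tends to be easier to reuse in other code. The names EMAIL_RE and is_valid_email are hypothetical, and the pattern is written with the hyphen last in the separator class.

```python
import re

# Compiled once at import time and reused for every call
EMAIL_RE = re.compile(r"([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+")

def is_valid_email(email):
    """Return True if the whole string matches EMAIL_RE, otherwise False."""
    return EMAIL_RE.fullmatch(email) is not None

addresses = ["name.surname@gmail.com", "johnsnow@gmail", "myemail@outlook."]
valid = [address for address in addresses if is_valid_email(address)]
print(valid)
```

Returning True or False leaves it up to the caller to decide whether to print a message, raise an error, or filter a list, as in the last two lines.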

Now, let's test the code on some of the examples we took a look at earlier:

isValid("name.surname@gmail.com")
isValid("anonymous123@yahoo.co.uk")
isValid("anonymous123@...uk")
isValid("...@domain.us")

This results in:

Valid email
Valid email
Invalid email
Invalid email

Awesome, we've got a functioning system!

Robust Email Regular Expression

The expression we've used above works well for the majority of cases and will work well for any reasonable application. However, if security is of higher concern, or if you enjoy writing Regular Expressions, you may opt to tighten the scope of possibilities while still allowing valid email addresses to pass.

Long expressions tend to get a bit convoluted and hard to read, and this expression is no exception:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]
|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
@
(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
|\[(?:(?:(2(5[0-5]|[0-4][0-9])
|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])
|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]
|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

This is an RFC 5322-compliant Regular Expression that covers 99.99% of input email addresses.* Explaining it in words is impractical, but visualizing it as a diagram helps a lot.

*Image and claim are courtesy of EmailRegex.com.

This actually isn't the only expression that satisfies RFC 5322. Many of them do, with varying degrees of success. A shorter version which still complies with the specification can be passed to Python's re.compile() function:

import re

regex = re.compile(r"([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\"([]!#-[^-~ \t]|(\\[\t -~]))+\")@([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\[[\t -Z^-~]*])")

def isValid(email):
    if re.fullmatch(regex, email):
        print("Valid email")
    else:
        print("Invalid email")

isValid("name.surname@gmail.com")
isValid("anonymous123@yahoo.co.uk")
isValid("anonymous123@...uk")
isValid("...@domain.us")

This also results in:

Valid email
Valid email
Invalid email
Invalid email
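The four test addresses give the same results under both expressions, but the two patterns do not accept the same set of inputs. The sketch below runs the general-purpose pattern (written with the hyphen last in the separator class) and the shorter RFC-style pattern side by side on a few less common forms. The names general and short are only illustrative.

```python
import re

# General-purpose pattern from the first part of the article
general = re.compile(r"([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+")

# Shorter RFC-style pattern from the example above
short = re.compile(r"([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\"([]!#-[^-~ \t]|(\\[\t -~]))+\")@([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\[[\t -Z^-~]*])")

samples = [
    '"john doe"@example.com',  # quoted prefix containing a space
    "user@[192.168.0.1]",      # domain written as a bracketed literal
    "johnsnow@gmail",          # no top-level domain
]

for address in samples:
    print(address, bool(general.fullmatch(address)), bool(short.fullmatch(address)))
```

All three samples are rejected by the general-purpose pattern and accepted by the shorter one, which illustrates that following the specification more closely does not necessarily mean being stricter.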

Conclusion

To wrap up this guide, let's review what we've learned. There are many ways to validate emails using Regular Expressions, mostly depending on which format we are looking for. There is no single pattern that works for all email formats; we simply need to define the rules that we want the format to follow and construct a pattern accordingly.

Each new rule reduces the degree of freedom on the accepted addresses.
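As an example of adding one more rule, the sketch below layers a character-count check on top of the general-purpose pattern. The limits used (64 characters for the prefix and 254 for the whole address) are the figures commonly cited from RFC 5321 and should be treated as assumptions here, as should the names MAX_TOTAL, MAX_PREFIX, and is_valid_email.

```python
import re

EMAIL_RE = re.compile(r"([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+")

# Assumed limits, commonly cited from RFC 5321
MAX_TOTAL = 254
MAX_PREFIX = 64

def is_valid_email(email):
    """Check the length rules first, then the pattern."""
    if len(email) > MAX_TOTAL:
        return False
    # Everything before the last @ is treated as the prefix
    prefix, _, _domain = email.rpartition("@")
    if len(prefix) > MAX_PREFIX:
        return False
    return EMAIL_RE.fullmatch(email) is not None

print(is_valid_email("a" * 64 + "@example.com"))
print(is_valid_email("a" * 65 + "@example.com"))
```

The second address differs from the first by a single character in the prefix, yet it is rejected: the extra rule narrows the set of accepted addresses without touching the expression itself.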



from Planet Python