
Stack Abuse: Face Recognition in Python with FaceNet

Introduction

Face recognition systems are used for many different applications, like automatically labeling images of your friends on Facebook or unlocking your phone by showing it your face.

If you want to try creating a face recognition system yourself, this can be done fairly simply by using a pre-trained face recognition architecture called FaceNet. This article will walk you through an implementation of a face recognition system based on FaceNet, as well as explain how FaceNet was trained to recognize faces.

FaceNet grew out of older face recognition methods, like those built on a different kind of network architecture - the Siamese network.

Methods for Face Recognition

There are various methods that can be used to carry out face recognition. Many facial recognition systems combine convolutional neural networks with a specific type of loss function referred to as triplet loss. Triplet loss is commonly paired with one-shot face recognition approaches, which make use of a neural network architecture referred to as a Siamese neural network. Let's take a closer look at Siamese neural networks and one-shot recognition problems.

First, let's quickly define some terms relevant to our discussion.

To begin with, what exactly is "one-shot learning"? One-shot learning is a classification task in which the model must learn to recognize a class from only one (or very few) labeled examples. Face recognition is an example of one-shot learning: given a single stored image of a person's face, the system must judge how likely it is that a new image shows that person rather than someone else.

Siamese networks are special kinds of networks designed to tackle the problem of one-shot learning. They accomplish this by learning a feature vector representation of the input and then comparing candidate examples against the known/recognized examples of a class.

The kinds of loss functions used in Siamese networks are called "contrastive loss" and "triplet loss". These loss functions are used to create high-quality embeddings of a person's face and they function as the backbone for many face recognition applications.

Understanding One-Shot Methods

Convolutional Neural Networks (CNNs) are one of the best and most reliable ways to recognize images - although they have some limitations that make them a poor fit for this task on their own:

[Image: a typical convolutional neural network architecture. Credit: Wikipedia]

Training a CNN typically requires a massive image dataset, so CNNs have difficulty when the training dataset is rather small - you probably won't have 1,000 images of a single person's face to train on.

An additional problem occurs when a new person is added to the database: the whole network would need to be retrained, which is inconvenient and time-intensive. Because training CNNs can be so involved, Siamese neural networks are often used to carry out face recognition instead.

The term "Siamese" refers to the fact that network architecture is actually comprised of two networks paired together. Both networks are functionally identical, having the same weights and and parameters. The two networks in the Siamese networks are simplified versions of convolutional networks, with far fewer hidden layers when compared to other CNN models used for face recognition. The goal of using the Siamese networks is to determine the similarity between two different images.

One network receives one image while the other network receives a completely different image. After the images proceed through their respective networks, a feature vector is computed for both of the images.

The encoded images are then compared by computing the distance between the first encoding and the second. If the distance between the two encodings is small (less than some specified threshold), the image is classified as showing the same person as the reference image; otherwise, it is classified as a different person.
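
To make the shared-weights idea concrete, here is a minimal Keras sketch of a Siamese pair. The small base CNN, the layer sizes, and the 128-dimensional output are illustrative assumptions rather than the architecture of any particular published system:

from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense, Lambda
from keras import backend as K

def build_base_network(input_shape=(160, 160, 3)):
    # A small CNN that maps an image to a fixed-length feature vector
    inputs = Input(shape=input_shape)
    x = Conv2D(32, (3, 3), activation='relu')(inputs)
    x = MaxPooling2D()(x)
    x = Conv2D(64, (3, 3), activation='relu')(x)
    x = GlobalAveragePooling2D()(x)
    outputs = Dense(128)(x)
    return Model(inputs, outputs)

base_network = build_base_network()

image_a = Input(shape=(160, 160, 3))
image_b = Input(shape=(160, 160, 3))

# The same base network (shared weights) encodes both images
embedding_a = base_network(image_a)
embedding_b = base_network(image_b)

# The model's output is the Euclidean distance between the two encodings
distance = Lambda(
    lambda embeddings: K.sqrt(
        K.maximum(K.sum(K.square(embeddings[0] - embeddings[1]), axis=1, keepdims=True), K.epsilon())
    )
)([embedding_a, embedding_b])

siamese_model = Model(inputs=[image_a, image_b], outputs=distance)

A small output distance means "probably the same person". Training a model like this is where the loss function comes in.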

Siamese networks traditionally used a specific kind of loss function: contrastive (sometimes called comparative) loss. Contrastive loss is a distance-based loss function that pulls similar examples close together in the embedding space while pushing dissimilar examples apart. This loss function proved viable and was used in Siamese networks for quite a while.
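
As a rough illustration, the standard formulation of contrastive loss for a single pair looks something like the following. The margin value and the convention that a label of 1 means "same person" are assumptions made for this sketch:

def contrastive_loss(distance, label, margin=1.0):
    # distance: Euclidean distance between the two embeddings
    # label:    1 if the pair shows the same person, 0 otherwise
    # margin:   how far apart "different" pairs should be pushed (illustrative value)
    same_term = label * distance ** 2
    different_term = (1 - label) * max(margin - distance, 0) ** 2
    return 0.5 * (same_term + different_term)

print(contrastive_loss(distance=0.2, label=1))  # same-person pair that is close: small loss
print(contrastive_loss(distance=0.2, label=0))  # different-person pair that is close: larger loss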

However, a new loss function was found - triplet loss - and this discovery led to the creation of face recognition architectures that work even better than Siamese networks trained with contrastive loss.

FaceNet and the Triplet Loss Model

Improving upon the architecture of Siamese networks led to the creation of architectures like FaceNet, a type of network that uses triplet loss.

Triplet loss works by pulling positive examples closer to the anchor while pushing negative examples farther away in the embedding space. Contrastive loss, by comparison, only ever looks at one pair of examples at a time, so it can't relate the distance of a matching pair to the distance of a non-matching pair the way triplet loss does. Since closeness is easier to judge relative to a third object, this extra point of comparison makes for a better analysis of the similarity between two images.

The term "triplet loss" refers to the fact that it isn't just the difference between the network's prediction and an expected output being compared, rather there are three specific images being compared. The first image is the positive image, the image that exists in the database and is used as a representative sample of the target individual. There is also an image that isn't of the person in question, the negative image. Finally, there's the image that is intended for classification (the anchor image):

[Image: the anchor, positive, and negative images compared by triplet loss]

The goal when optimizing the triplet loss function is for the distance between the encodings of the anchor image and the negative image to be greater (by some margin) than the distance between the encodings of the anchor and the positive example. To put that another way, the encodings the network learns should be close together when they belong to the same person and far apart when they belong to different people.
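
Here is a minimal NumPy sketch of that objective for a single triplet. FaceNet itself uses squared Euclidean distances between L2-normalized embeddings and a margin of 0.2; the tiny example vectors below are made up purely for illustration:

import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Squared Euclidean distances between the embeddings
    pos_dist = np.sum(np.square(anchor - positive))
    neg_dist = np.sum(np.square(anchor - negative))

    # The loss is zero once the negative is at least `margin` farther
    # from the anchor than the positive is
    return max(pos_dist - neg_dist + margin, 0.0)

anchor = np.array([0.1, 0.9, 0.0])
positive = np.array([0.2, 0.8, 0.1])   # another embedding of the same person
negative = np.array([0.9, 0.1, 0.3])   # embedding of a different person
print(triplet_loss(anchor, positive, negative))  # 0.0 - this triplet is already well separated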

The triplet loss function was used in the construction of the FaceNet system by Google, which achieved impressive accuracy at face recognition tasks.

There are two ways to validate a face recognition neural network model: a direct method and an indirect method.

The indirect method is the one traditionally used by Siamese networks. In the indirect validation approach, new pairs of images are created, and the assumption is that if the model can reliably tell whether the two images in a pair show the same person, it can correctly classify a new image.

Meanwhile, the direct approach is carried out like this (a short sketch follows the list):

  • Take an image and derive the similarity score between that image and some other randomly chosen images.
  • There should be one "same" target class and all the other classes should be different.
  • The predicted class is just the class which has the highest similarity score with the supplied image.
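
Here is a small sketch of that idea, assuming we already hold one reference embedding per known person. The names, vectors, and the choice of cosine similarity as the score are illustrative, not part of FaceNet itself:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def predict_identity(query_embedding, reference_embeddings):
    # reference_embeddings: dict mapping a person's name to one known embedding
    scores = {
        name: cosine_similarity(query_embedding, reference)
        for name, reference in reference_embeddings.items()
    }
    # The predicted class is simply the one with the highest similarity score
    return max(scores, key=scores.get)

references = {
    'person_a': np.array([0.9, 0.1, 0.2]),
    'person_b': np.array([0.1, 0.8, 0.5]),
}
query = np.array([0.85, 0.15, 0.25])
print(predict_identity(query, references))  # 'person_a'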

It was by using and refining the direct approach that FaceNet was created.

In other words, FaceNet discriminates between an anchor example, a positive example, and a negative example. Meanwhile, Siamese networks operate by comparing just two examples. This distinction is what lets FaceNet achieve high recognition accuracy, and why we can use FaceNet to create complex image embeddings that allow us to build an accurate facial recognition system with only relatively few training images.

Implementing Face Recognition with FaceNet

FaceNet can be included as part of the classifier/recognition system. Under this implementation method, for every image fed into the classifier, a face embedding is created.

However, when done this way, creating the face embeddings takes a long time. A much quicker approach is to compute all of the face embeddings beforehand and save them as a dataset that we pass into our chosen classifier.

Other Face Recognition Architectures

While FaceNet with a triplet loss function is how many state-of-the-art face recognition systems operate today, there are various other ways to build a face recognition system. We'll mention them briefly here in case you are interested in learning more about them.

Several different architectures can be used for face recognition. Other popular architectures include:

  • DeepFace - utilizes convolutional neural networks and is the product of a research team based at Facebook.
  • DeepID - is capable of carrying out verification and identification using contrastive loss functions, and it was developed by Yi Sun et al. in 2014.
  • VGGFace - was trained on an extremely large dataset and achieved state-of-the-art results when it was published in 2015.

What Approach Will We Be Using?

For this implementation, we will be using FaceNet, which was trained with a triplet loss function. We'll use a pre-trained version of the network, downloading the predefined weights. Once the weights are loaded, it is rather easy to calculate the distance/difference between two images, which can be done with a simple metric like Euclidean distance or cosine similarity.

However, our classifier will be one of Scikit-Learn's classifiers. We'll train the classifier on our saved face embeddings and then select a random image from our validation set to test the classifier on.

Implementation of One-Shot Face Recognition with FaceNet

There are multiple FaceNet models available for use with Python and Keras. We'll want to choose one to use, and one of the best-supported implementations is Hiroki Taniai's Keras FaceNet, which works with color images sized 160 x 160 pixels.

Let's download the FaceNet model and load it into our script. This is very easy to do once the weights and architecture have been downloaded. All we need to do is use Keras' load_model function:

from keras.models import load_model
facenet = load_model("facenet_keras.h5")

We can't make use of FaceNet without first detecting the faces, though. FaceNet only performs recognition; it doesn't detect faces itself. To accomplish face detection, we can make use of another type of CNN called the Multi-Task Cascaded Convolutional Neural Network (MTCNN). This network excels at detecting faces, and it already has a dedicated Python library, which can be found here.

All you need to do to install it is run this from the terminal:

$ pip install mtcnn

We'll now import the module and use it to detect and extract faces. After that we'll use the extracted faces and set them up as our training and testing data. We need to select a dataset to do our training and testing on, however. We'll be using the 14 Celebrity Faces Dataset, which can be found here.

Extracting Faces

In order to extract all the faces from our dataset, we'll create a function to handle that. Let's start by importing the libraries we need:

from os import listdir
from os.path import isdir
from PIL import Image
from numpy import savez_compressed
from numpy import asarray
import numpy as np
from mtcnn.mtcnn import MTCNN

Our function will take the name of the file to extract the face from and the size the images should be (160, 160). We'll convert the image to RGB and make it an array, then use MTCNN to extract the faces.

The position of the faces is defined with a bounding box, and we want to get the pixels from within that box, so we'll filter the array of pixels. Finally, we resize the image array to our desired size and return it:

def get_face(image_file, size=(160, 160)):
    im = Image.open(image_file)
    im = im.convert('RGB')
    im_array = np.asarray(im)

    face_extractor = MTCNN()
    detected = face_extractor.detect_faces(im_array)
    x1, y1, width, height = detected[0]['box']
    # The bounding box coordinates can sometimes be negative, so take the absolute value
    x1, y1 = abs(x1), abs(y1)
    x2, y2 = x1 + width, y1 + height

    # This actually extracts the face from the region in the bounding box
    face = im_array[y1:y2, x1:x2]

    # Resize the image
    image = Image.fromarray(face)
    image = image.resize(size)
    resized_face = asarray(image)

    return resized_face

Now that we have a function to extract a face from an image, we need to go through our dataset's folders and extract the faces from the images inside them. The dataset is split into train and val directories, each of which contains one subdirectory per person.

We'll make a function that goes through the given directory and gets all the faces within the subdirectories using the previously created get_face() function:

def extract_from_images(directory):
    faces = list()
    # Loop through all the files in the given directory
    for file in listdir(directory):
        path = directory + file
        current_face = get_face(path)

        # Append current face to the list of total faces
        faces.append(current_face)

    return faces

We can now get all the faces from a single person's folder. However, we need to do that for every folder in our dataset, not just one. There's also another problem to solve: what we have so far are just the faces/features, but to make a dataset we also need labels to accompany those features.

Making a Dataset

So now we'll write a function that ties the previous functions together. It will get the faces from every subdirectory of a given directory, and it will use each subdirectory's name as the label for the faces found inside it:

def make_dataset(directory):
    X = list()
    y = list()

    for sub in listdir(directory):
        path = directory + sub + '/'
        if not isdir(path):
            continue
        faces = extract_from_images(path)
        labels = [sub for _ in range(len(faces))]
        X.extend(faces)
        y.extend(labels)
        print('>loaded %d examples for class: %s' % (len(faces), sub))

    X = asarray(X)
    y = asarray(y)
    return X, y

Now we just need to call the function on the master train and validation directories.

X_train, y_train = make_dataset('14-celebrity-faces-dataset/data/train/')
X_test, y_test = make_dataset('14-celebrity-faces-dataset/data/val/')

We should get something like this printed as output:

>loaded 15 examples for class: anne_hathaway
>loaded 18 examples for class: arnold_schwarzenegger
>loaded 14 examples for class: ben_afflek
>loaded 15 examples for class: dwayne_johnson
>loaded 17 examples for class: elton_john
>...
>...
>loaded 6 examples for class: anne_hathaway
>loaded 5 examples for class: arnold_schwarzenegger
>loaded 5 examples for class: ben_afflek
>loaded 5 examples for class: dwayne_johnson
>loaded 5 examples for class: elton_john
>...
>...

Let's check the shape of our datasets and, assuming it looks good, save the data into a file that we can reuse later:

print(X_train.shape, y_train.shape)
savez_compressed('14-celebrity-faces-dataset.npz', X_train, y_train, X_test, y_test)

Here's the result of that print statement:

(220, 160, 160, 3) (220,)

That's what we expected to see. 220 training images total, so 220 feature instances and 220 labels.

Creating Embeddings

Now that we have all the data saved as sets of training and testing features and labels, we need to convert this data into embeddings. An embedding is a vector that the neural network produces from a face image; faces can then be compared by measuring how close their vectors are in that vector space.

Vectors that are close to one another likely belong to the same person, while vectors that are far apart likely belong to different people - think back to how distance is used in the contrastive and triplet loss functions. We'll feed these embeddings to our classifier, and it will predict which individual a face belongs to.

Creating embeddings to feed into a classifier is fairly simple. We need a function that takes in the FaceNet model and uses it to create embeddings. We'll then load the dataset we saved and pass it into that function.

Let's be sure we have all the imports we'll need to do this.

import numpy as np
from keras.models import load_model

First, we'll create the function to get the embedding for a single face. This is where we get to employ FaceNet's power. We need to standardize the pixel values, as the FaceNet implementation we're using expects standardized inputs:

def create_embedding(model, face_array):
    face_array = face_array.astype('float32')

    # Need to standardize the values, so we'll get the standard deviation and mean
    mean = face_array.mean()
    std = face_array.std()
    face_array = (face_array - mean)/std

    # Add a batch dimension so the model receives a batch containing one sample
    sample = np.expand_dims(face_array, axis=0)

    # Get the embedding from the image
    y_pred = model.predict(sample)
    return y_pred[0]

Now we just load the saved data back in and get the training and testing data from it. Here's where we load in the FaceNet model as well:

data = np.load('14-celebrity-faces-dataset.npz')

# Now get the individual variables from the data
X_train, y_train, X_test, y_test = data['arr_0'], data['arr_1'], data['arr_2'], data['arr_3']

# Load in the facenet version we're using
model = load_model('facenet_keras.h5')

We have our model, data, and function all loaded in and created so now we're ready to create our embeddings:

X_train2 = list()
for face in X_train:
    embedding = create_embedding(model, face)
    X_train2.append(embedding)
X_train2 = np.asarray(X_train2)

X_test2 = list()
for face in X_test:
    embedding = create_embedding(model, face)
    X_test2.append(embedding)
X_test2 = np.asarray(X_test2)

Now we can save the embeddings themselves as a new file.

np.savez_compressed('14-celebrity-faces-embeddings.npz', X_train2, y_train, X_test2, y_test)

Now that we have the embeddings, we can finally classify them. In fact, just to prove that these embeddings can be used to build any sort of classifier you'd like, we'll first use these same embedding functions to put together a quick distance-based comparison.

All we need to do is get the embeddings of two test images - reusing the get_face() and create_embedding() functions from earlier - and compute the Euclidean distance or cosine similarity between them:

to_embed_1 = get_face("test_image1.jpg")
to_embed_2 = get_face("test_image2.jpg")

image_test_1 = create_embedding(model, to_embed_1)
image_test_2 = create_embedding(model, to_embed_2)

image_test_1 = image_test_1.reshape(1, -1)
image_test_2 = image_test_2.reshape(1, -1)

def get_euclidean(source_image, test_image):
    # Euclidean distance between the two embedding vectors
    difference = source_image - test_image
    return np.sqrt(np.sum(np.square(difference)))

def get_sim(metric, threshold):
    if metric < threshold:
        print("They are the same person!")
    else:
        print("Insufficient similarity! Not the same person!")

euclidean_distance = get_euclidean(image_test_1, image_test_2)
get_sim(euclidean_distance, 0.35)

This method doesn't work exceptionally well as these embeddings haven't been optimized for this kind of comparison. But it proves that the embeddings can be used for face classification.

You can plug in any two images there to see what it returns. If you like, you can also try changing the metric to something like cosine similarity (or a kernel-based similarity such as sigmoid or RBF) and see if it works any better.
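
For instance, a cosine similarity version of the comparison might look like the sketch below. With cosine similarity, higher means more similar, so the comparison flips relative to a distance metric, and the 0.7 threshold is only a guess that you would need to tune on your own embeddings:

import numpy as np

def get_cosine_similarity(source_image, test_image):
    # Flatten the (1, n) embedding arrays to 1-D vectors and compare their directions
    a = source_image.ravel()
    b = test_image.ravel()
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarity = get_cosine_similarity(image_test_1, image_test_2)
if similarity > 0.7:  # illustrative threshold - tune it for your data
    print("They are the same person!")
else:
    print("Insufficient similarity! Not the same person!")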

Classifier

Finally, we're going to create the classifier, now that we have the embeddings. We can test various classifiers on the embeddings.

Once more, let's be sure that we have all the libraries and functions we need imported:

from random import choice
from numpy import load
from numpy import expand_dims
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC
from matplotlib import pyplot
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score

Now we'll load the raw face images we saved earlier - we'll use the test faces later when we visualize predictions:

# Load the raw face pixels (used later for plotting the test images)
data = load('14-celebrity-faces-dataset.npz')
X_test_faces = data['arr_2']

Next, let's load the embeddings we saved and extract the training and testing variables from them:

data = load('14-celebrity-faces-embeddings.npz')
X_train, y_train, X_test, y_test = data['arr_0'], data['arr_1'], data['arr_2'], data['arr_3']

We're going to need to normalize the embedding vectors and encode the string labels as integers so that our classifiers can work with the data. We'll create an instance of both the normalizer and a label encoder, then transform the data:

in_normalize = Normalizer(norm='l2')
encoder = LabelEncoder()
encoder.fit(y_train)

X_train = in_normalize.transform(X_train)
X_test = in_normalize.transform(X_test)
y_train = encoder.transform(y_train)
y_test = encoder.transform(y_test)

We can now select our chosen classification algorithms and fit them. Let's use a Support Vector Machine (SVC), K-Nearest Neighbors, Logistic Regression, a Decision Tree, and XGBoost:

SVC_clf = SVC(kernel='linear', probability=True)
KNN = KNeighborsClassifier(n_neighbors=1, metric='euclidean')
LogReg = LogisticRegression()
DT_clf = DecisionTreeClassifier()
XGB = XGBClassifier()

SVC_fit = SVC_clf.fit(X_train, y_train)
KNN_fit = KNN.fit(X_train, y_train)
LogReg_fit = LogReg.fit(X_train, y_train)
DT_clf_fit = DT_clf.fit(X_train, y_train)
XGB_fit = XGB.fit(X_train, y_train)

Now let's see how the classifiers performed by printing their accuracy:

classifiers = [SVC_fit, KNN_fit, LogReg_fit, DT_clf_fit, XGB_fit]

for clf in classifiers:
    print("Classifier is: " + str(clf.__class__.__name__))
    preds = clf.predict(X_test)
    print("Accuracy: " + str(accuracy_score(y_test, preds)))

Let's create a function to visualize a chosen number of predictions. It will take in the number of predictions we want to visualize and the model we want to make those predictions with:

def test_recognition(ex_num, model):
    i = 1

    for n in range(ex_num):
        # Grab a random image from the dataset
        selection = choice([i for i in range(X_test.shape[0])])
        random_face_pixels = X_test_faces[selection]
        random_face_emb = X_test[selection]
        random_face_class = y_test[selection]
        random_face_name = encoder.inverse_transform([random_face_class])

        # Now that we have a random example, let's use our chosen
        # model to get a prediction and the class probabilities
        samples = expand_dims(random_face_emb, axis=0)
        pred_class = model.predict(samples)
        pred_prob = model.predict_proba(samples)

        # Let's print out the label/name of the example along with
        # the confidence/probability that the example belongs to that class
        class_index = pred_class[0]
        class_probability = pred_prob[0, class_index] * 100
        predict_names = encoder.inverse_transform(pred_class)
        print('Predicted: %s (%.3f)' % (predict_names[0], class_probability))
        print('Expected: %s' % random_face_name[0])

        # Now place the image on the plot
        pyplot.subplot(2, 2, i)
        pyplot.imshow(random_face_pixels)
        title = '%s (%.3f)' % (predict_names[0], class_probability)
        pyplot.title(title)
        i += 1

    # Show the image
    pyplot.show()

All that remains is to call the function and pass it the number of examples we want to visualize (up to 4, given the 2 x 2 plot grid):

test_recognition(4, XGB_fit)

Here's what we get back:

Classifier is: SVC
Accuracy: 0.9714285714285714
Classifier is: KNeighborsClassifier
Accuracy: 0.9714285714285714
Classifier is: LogisticRegression
Accuracy: 0.9714285714285714
Classifier is: DecisionTreeClassifier
Accuracy: 0.6857142857142857
Classifier is: XGBClassifier
Accuracy: 0.8857142857142857

Predicted: keanu_reeves (96.227)
Expected: keanu_reeves
Predicted: anne_hathaway (96.454)
Expected: anne_hathaway
Predicted: elton_john (45.027)
Expected: sofia_vergara
Predicted: ben_afflek (98.442)
Expected: ben_afflek

[Plot: the four randomly chosen test faces with their predicted names and confidence scores]

The XGB classifier has gotten three out of four correct. The lower left image is Sofia Vergara, not Elton John.

One last thing before we're done. Let's make a voting classifier composed of all our chosen classifiers to see if it performs better than the individual classifiers:

evaluators = [SVC_clf, KNN, LogReg, DT_clf, XGB]
model = VotingClassifier(estimators=[('SVC', SVC_clf), ('KNN', KNN), ('DT', DT_clf), ('XGB', XGB), ('LogReg', LogReg)], voting='soft')
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(accuracy_score(y_test, preds))
print(f1_score(y_test, preds, average='micro'))

test_recognition(4, model)

How did we do?

# Accuracy
0.9571428571428572

Predicted: keanu_reeves (35.981)
Expected: keanu_reeves
Predicted: simon_pegg (80.455)
Expected: simon_pegg
Predicted: sofia_vergara (84.000)
Expected: sofia_vergara
Predicted: arnold_schwarzenegger (84.265)
Expected: arnold_schwarzenegger

[Plot: the four test faces with the voting classifier's predicted names and confidence scores]

Fantastic! That's all correct! After a number of tests, the voting classifier seems to consistently perform better than either the Decision Tree or XGBoost.

You may notice that the accuracy of the voting classifier is lower than that of the SVC, KNN, and Logistic Regression classifiers, but the incredibly high accuracy of those classifiers could hint at overfitting, and a voting/ensemble method is less likely to overfit.

Conclusion

FaceNet is a powerful architecture that can easily be used to create face recognition systems. As you can see, we got a fairly reliable face recognition system with not all that much code, thanks to FaceNet's pretrained weights.

If you are curious to learn more, you could try implementing a Siamese network scheme with contrastive loss and see how the two recognition systems compare.

You could also experiment with using the FaceNet architecture as part of the classifier itself. Finally, you could experiment with fine-tuning FaceNet and see how it affects the performance of the recognition system.

If you'd like to play around with the code, we've got it on GitHub.


