Sunday, September 12, 2021

Bhishan Bhandari: Training a deep learning model with custom dataset for motion transfer

Through this article, I want to show the steps I took in preparing a custom dataset and training a GAN model for motion transfer. I used Google Colab, which offers free GPU/TPU usage for research purposes for up to 12 hours at a time.

The paper of interest today is “Everybody Dance Now”: https://arxiv.org/pdf/1808.07371.pdf

The paper is in the realm of motion transfer. Given a reference (source) subject and a target subject, the idea is to transfer the motion of the source to the target. This is made possible with a GAN (Generative Adversarial Network). Please read the paper in full for more information on the method, training parameters, and experimental results. In short, there are two phases:

  1. Training

In this phase, a set of images taken from a video sequence of the target subject is used. Each image goes through a pose estimation model P, which creates a pose stick figure. This is followed by learning the mapping G alongside an adversarial discriminator D, which attempts to distinguish between real and fake correspondences. Furthermore, the discriminator looks at consecutive frames t and t+1 together, which improves temporal coherence.

  2. Transfer

In this phase, the reference (source) subject whose motion is to be transferred to the target subject is taken. The set of images from a video sequence of the source (for example, a dance video) is passed through the pose estimation network P, which gives the pose stick figures. These are then normalized so that they are consistent with the pose of the target subject. The normalized poses can finally be passed through the trained model G to produce images of the target subject in the pose of the source subject.
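Conceptually, the two phases boil down to the following outline. This is only an illustrative sketch; P, G, D and the update step are placeholders standing in for the paper's pose estimator, generator, discriminator, and adversarial update, not actual implementations.

def train_phase(target_frames, P, G, D, update):
    """Phase 1: learn the pose-to-image mapping G for the target subject."""
    for frame in target_frames:
        pose = P(frame)                      # pose stick figure for this frame
        fake = G(pose)                       # generated target image
        update(G, D, pose, fake, frame)      # adversarial update of G and D

def transfer_phase(source_frames, P, G, normalize):
    """Phase 2: re-render the source motion on the target subject."""
    outputs = []
    for frame in source_frames:
        pose = normalize(P(frame))           # source pose, normalized towards the target
        outputs.append(G(pose))              # target image in the source pose
    return outputs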

A. Preparing the data for training

  1. Target person video -> images dataset

For the purpose of training the GAN, we need a video of the target person. This can be a casual video of the target person performing various simple poses. The more diverse the actions in the video sequence, the better it is for network training; a longer video also generally leads to better results. However, I used a very short video of myself performing various poses due to the storage constraints on Google Drive (which I am using as storage for the dataset, project files, and trained models).

Target person performing some pose

Given that we have a video file, we can create the dataset by iterating through the frames and saving them. OpenCV (cv2) provides helper functions for doing this. You could also make use of ffmpeg to extract the frames from a video. Below is the ffmpeg command I used to extract images from the video.

!ffmpeg -i /content/drive/MyDrive/CV/targetperson.mp4 -vf "scale=1920:1080" -qscale:v 2 /content/drive/MyDrive/EverybodyDanceNow/dancedata/train/train_img/mv_%012d_rendered.jpg
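For completeness, here is what the OpenCV route could look like. This is a minimal sketch; the output directory and filename pattern are placeholders rather than the exact paths used above.

import os
import cv2

def extract_frames(video_path, out_dir, width=1920, height=1080):
    """Save every frame of the video as a numbered JPEG, resized to width x height."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break  # end of video
        frame = cv2.resize(frame, (width, height))
        cv2.imwrite(os.path.join(out_dir, "mv_%012d_rendered.jpg" % idx), frame)
        idx += 1
    cap.release()
    return idx

# extract_frames("targetperson.mp4", "train_img/")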

  2. Pose detection (generate stick figures -> used as input to the network)

The next step is to use a pre-trained pose estimation model to produce pose stick figures, which are used as input to the GAN (Generative Adversarial Network). A full explanation of pose estimation and GANs is beyond the scope of this article. The original repository for the project (https://github.com/carolineec/EverybodyDanceNow/tree/master/data_prep) provides scripts for generating the pose stick figures and getting the bounding box information for the face (a face GAN can be used to further enhance the result). Running the script below prepares the dataset for training by creating one folder containing the dataset images, another containing the pose stick figures, and a third containing text files with bounding box information for the face.

!python graph_train.py --keypoints_dir /content/drive/MyDrive/outputdance/ --frames_dir /content/drive/MyDrive/EverybodyDanceNow/dancedata/train/train_img/ --save_dir /content/drive/MyDrive/EverybodyDanceNow/dancedata/train/savedir --spread 120 3550 1 --facetexts

Sample image Pose
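To give an intuition for what a pose stick figure is, the following sketch draws 2D keypoints onto a blank canvas with OpenCV. The keypoint indices and limb pairs are simplified assumptions for illustration, not the actual OpenPose skeleton or the repository's format.

import cv2
import numpy as np

# Illustrative limb connections over a hypothetical 2D keypoint list.
LIMBS = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5), (1, 6), (6, 7), (6, 8)]

def draw_stick_figure(keypoints, height=1080, width=1920):
    """Render a list of (x, y) keypoints (None for missed detections) as a stick figure."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for a, b in LIMBS:
        if keypoints[a] is None or keypoints[b] is None:
            continue  # skip limbs with missing joints
        pa = tuple(int(v) for v in keypoints[a])
        pb = tuple(int(v) for v in keypoints[b])
        cv2.line(canvas, pa, pb, (0, 255, 0), thickness=4)
        cv2.circle(canvas, pa, 6, (0, 0, 255), -1)
    return canvas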

B. Preparing the source dataset

This step is the same as above, applied to the source video, i.e. the video whose motion is to be transferred to the target person. Use ffmpeg to extract the frames from the video as before.

!ffmpeg -i /content/drive/MyDrive/CV/source.mp4 -vf "scale=1920:1080" -qscale:v 2 /content/drive/MyDrive/EverybodyDanceNow/dancedata/source/images/mv_%012d_rendered.jpg

Similarly, the pose stick figures can be produced as described before, using a pre-trained pose estimation model such as OpenPose.

C. Pose Normalization

Since the source and the target person do not generally share the same body proportions and position in the frame, we need to transform the pose stick figures corresponding to the source images into the space of the target person. This is called pose normalization. It gives us a set of pose stick figures from the source video that are consistent with the pose stick figures of the target.

python graph_posenorm.py --target_keypoints /data/scratch/caroline/keypoints/wholedance_keys --source_keypoints /data/scratch/caroline/keypoints/dubstep_keypointsFOOT --target_shape 1080 1920 3 --source_shape 1080 1920 3 --source_frames /data/scratch/caroline/frames/dubstep_frames --results /data/scratch/caroline/savefolder --target_spread 30003 178780 --source_spread 200 4800 --calculate_scale_translation --facetexts
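The script above handles this step, but at its core the idea is a scale-and-translate of the source keypoints so that the source figure occupies roughly the same region of the frame as the target. The sketch below is a deliberate simplification, using a single global transform fitted from ankle-height statistics, whereas the paper interpolates the scale per frame between the closest and farthest ankle positions.

import numpy as np

def fit_scale_translation(source_ankles_y, target_ankles_y):
    """Fit a global y-scale and y-offset mapping the source ankle-height range onto the target's."""
    s_min, s_max = np.min(source_ankles_y), np.max(source_ankles_y)
    t_min, t_max = np.min(target_ankles_y), np.max(target_ankles_y)
    scale = (t_max - t_min) / (s_max - s_min)
    offset = t_min - scale * s_min
    return scale, offset

def normalize_pose(keypoints_xy, scale, offset):
    """Apply the transform to an (N, 2) array of source keypoints."""
    out = keypoints_xy.astype(np.float64).copy()
    out[:, 1] = scale * out[:, 1] + offset                                   # vertical position and height
    out[:, 0] = scale * (out[:, 0] - out[:, 0].mean()) + out[:, 0].mean()    # scale width about the centre
    return out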

D. Training the network (Generator vs Discriminator)

Now that we have the dataset, including the pose stick figures and their corresponding real frames, we can proceed with training the model. Again, for the specifics of the hyperparameters, please read the paper. In short, we pass a pose stick figure to the generator, which produces an image; the discriminator then scores that image against the corresponding real frame, and losses are computed from its output. Both networks' weights are adjusted via gradient descent on these losses until the discriminator can no longer distinguish real samples from generated ones. The result is a trained model such that, given a pose stick figure, the network generates an image that looks like the target person. I followed the same training procedure as described in the original repository for the paper.
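To make the adversarial loop concrete, here is a heavily simplified PyTorch sketch of a single training step. The real implementation is based on pix2pixHD with additional losses, so the toy networks and the plain BCE + L1 objective below are only illustrative assumptions.

import torch
import torch.nn as nn

# Toy stand-ins for the real generator and discriminator architectures.
G = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(64, 3, 3, padding=1))
D = nn.Sequential(nn.Conv2d(6, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                  nn.Conv2d(64, 1, 3, stride=2, padding=1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(pose, real):
    """pose, real: (B, 3, H, W) tensors -- a pose stick figure and the matching real frame."""
    fake = G(pose)

    # Discriminator: real (pose, frame) pairs vs. generated pairs.
    d_real = D(torch.cat([pose, real], dim=1))
    d_fake = D(torch.cat([pose, fake.detach()], dim=1))
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator: try to fool the discriminator, plus an L1 reconstruction term for stability.
    d_fake = D(torch.cat([pose, fake], dim=1))
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + nn.functional.l1_loss(fake, real)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()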

E. Output

Finally, we can pass the normalized source pose stick figures through the trained network to produce images of the target person. This completes the full pipeline: we end up with a set of generated images of the target in the poses of the source. We can then use ffmpeg or cv2 to combine the image sequence into a video, which completes the motion transfer from source to target.
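As an example of that last step, here is a minimal OpenCV sketch for stitching the generated frames back into a video; the frame pattern, output filename, and frame rate are placeholders.

import glob
import cv2

def frames_to_video(frame_glob, out_path, fps=30):
    """Write a sorted sequence of image files out as an mp4."""
    paths = sorted(glob.glob(frame_glob))
    first = cv2.imread(paths[0])
    height, width = first.shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
    for p in paths:
        writer.write(cv2.imread(p))
    writer.release()

# frames_to_video("results/*.jpg", "motion_transfer.mp4")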

This is the output:

Motion transfer from source to target

I only trained the network for 50 epochs with a very short target video because of the storage and computation constraints of the free Google Colab GPU and Drive storage.

Extras

Links:

https://github.com/carolineec/EverybodyDanceNow

https://arxiv.org/pdf/1808.07371.pdf

https://github.com/CMU-Perceptual-Computing-Lab/openpose


