Using style transfer for generating facial expressions

Hi, my intuition is that Neural Style Transfer could be used to generate any desired expression for a given face in an automated way. I'm thinking you could hire whatever actors you can afford and then improve the final videos by correcting their expressions. What do you think?

Hello! Systems powered by neural networks can indeed achieve such results (this is currently used quite a bit in TikTok and Instagram filters!)

However, if what you have in mind is Neural Style Transfer as we most often see it used, remember that it generalizes away the locations of the features in the style image it draws from (which is very desirable for many applications).

So for example, let’s take a drawing of Rudolph the reindeer and use Neural Style Transfer to apply the style of a Tom & Jerry drawing to it:


As you can see, the structures of our content image (Rudolph) get preserved, with features from our style image (Tom & Jerry) placed throughout the resulting picture.
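Concretely, the reason the style features end up scattered is that in the usual (Gatys-style) formulation the style is represented by a Gram matrix of channel correlations, which throws away spatial position entirely. A minimal numpy sketch of that property (toy feature maps, not a real network):

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map: correlations between
    channels. Spatial location is summed out, which is why style
    transfer spreads features everywhere instead of placing them."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (c * h * w)

# Toy example: shifting a feature map spatially leaves the Gram matrix
# (and hence the "style") completely unchanged.
feats = np.zeros((2, 4, 4))
feats[0, 1, 1] = 1.0   # channel 0 active at one spot
feats[1, 1, 2] = 1.0   # channel 1 active next to it
shifted = np.roll(feats, shift=2, axis=2)  # move the features elsewhere
print(np.allclose(gram_matrix(feats), gram_matrix(shifted)))  # True
```

So two images whose features sit in completely different places can have an identical style representation, which is exactly the behaviour you see below.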

So let’s see what happens if we apply the styling of the smiling-face emoji to the serious-face emoji, using that same network:

It doesn’t change the main structures of the serious emoji. Rather, in this case, it just makes the image noisier: features from all over our smiling-face image get mapped all over our serious-face image, instead of being placed where each feature would fit best. For that objective, you might want to look into GANs!

The results I’ve shown were produced without regularization. Here’s a result I got with regularization:

Interestingly, our serious face remains very similar, with just a slight tint of red in some regions. This makes sense: our style image (the smiling face) has very similar features, one of the main differences being the presence of red, so the regularized network found little besides that coloring to change in our content image. But again, even that change wasn’t localized to the right place in the final product.
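The post doesn’t say which regularizer was used, but a very common one in style transfer is total variation, which penalizes pixel-to-pixel differences and so suppresses exactly the kind of speckle seen in the unregularized result. A small illustrative sketch:

```python
import numpy as np

def total_variation(img):
    """Total variation of an (H, W, C) image: sum of absolute
    differences between neighbouring pixels. Minimising this alongside
    the style loss discourages high-frequency noise in the output."""
    dh = np.abs(np.diff(img, axis=0)).sum()
    dw = np.abs(np.diff(img, axis=1)).sum()
    return dh + dw

# A noisy image has far higher TV than a smooth one of the same mean,
# so the optimiser prefers leaving flat regions (like the emoji face)
# mostly untouched.
rng = np.random.default_rng(0)
smooth = np.full((32, 32, 3), 0.5)
noisy = smooth + rng.normal(0.0, 0.1, smooth.shape)
print(total_variation(smooth) < total_variation(noisy))  # True
```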

Interesting! Thanks for taking the time to reply.

For ‘Expression Transfer’ in the same spirit, we would need something more. Suppose there are a face-recogniser model and an expression-recogniser model: the first is tuned to recognise the landmarks of a face, and the second is trained on the output of the first to model the ‘Smile’ feature on those landmarks.

Given such a pair of models and a face, we could try to generate a ‘Smiling Face’ that is similar to the input face and has the features of ‘Smile’.
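To make the idea concrete, here is a deliberately simplified sketch of that pipeline, where the ‘expression model’ is reduced to a mean landmark-displacement field learned from (neutral, smiling) pairs. All names and data are hypothetical placeholders, assuming some landmark detector has already produced the coordinates:

```python
import numpy as np

def learn_smile_offset(neutral_sets, smiling_sets):
    """Average per-landmark displacement from neutral to smiling,
    over a (faces, landmarks, 2) batch of landmark coordinates."""
    return np.mean(smiling_sets - neutral_sets, axis=0)

def apply_expression(landmarks, offset, strength=1.0):
    """Warp a new face's landmarks toward a smile; identity is kept
    because we only add a displacement to the input geometry."""
    return landmarks + strength * offset

# Toy data: 5 landmarks (e.g. mouth corners, eyes) for 3 training faces.
rng = np.random.default_rng(1)
neutral = rng.normal(size=(3, 5, 2))
smiling = neutral + np.array([0.0, 0.1])  # every landmark shifts "up"
offset = learn_smile_offset(neutral, smiling)

# Apply the learned 'Smile' to an unseen face's landmarks.
new_face = rng.normal(size=(5, 2))
smiled = apply_expression(new_face, offset)
```

The warped landmarks would then drive the image side, e.g. as a geometric warp of the original photo or as conditioning input to a GAN, which is where the structure-aware generation the style-transfer network lacked would come from.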

Does that sound reasonable to try?