Emotion Recognition Project

Hello,

My project is about building a model to detect emotions. I am using a dataset called Aff-Wild2, whose images are tagged with the following emotions: “Neutral”, “Anger”, “Disgust”, “Fear”, “Happiness”, “Sadness”, “Surprise”, and “Other”.

The training dataset has the following number of images per class:
Anger → 16573
Disgust → 10771
Fear → 9080
Happiness → 95817
Neutral → 177720
Other → 165866
Sadness → 79853
Surprise → 31615

The validation dataset has:
Anger → 6126
Disgust → 5296
Fear → 8408
Happiness → 34511
Neutral → 82258
Other → 106444
Sadness → 25157
Surprise → 12332

I am doing transfer learning with the VGG16 architecture, fine-tuning the last layers. Training accuracy reaches 95%, but on the validation dataset it does not exceed 40%.
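For reference, here is a minimal sketch of that kind of setup, assuming Keras/TensorFlow with an ImageNet-pretrained VGG16 base where only the last convolutional block is unfrozen; the head layers, input size and learning rate are illustrative, not the exact model from this thread.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 8          # Neutral, Anger, Disgust, Fear, Happiness, Sadness, Surprise, Other
IMG_SIZE = (224, 224)    # VGG16's native input size

# Load VGG16 without its classification head.
base = tf.keras.applications.VGG16(
    weights="imagenet",
    include_top=False,
    input_shape=IMG_SIZE + (3,),
)

# Freeze everything except the last convolutional block ("fine-tuning the last layers").
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")

# Small trainable classification head on top of the mostly-frozen base.
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss="categorical_crossentropy",   # assumes one-hot encoded labels
    metrics=["accuracy"],
)
```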

My first idea is to use VGGFace weights instead of ImageNet.
My second option is to reduce the validation set to something smaller.

I’m open to any ideas!

1 Like

I will assume the first number is training accuracy. If you haven’t studied the concepts of bias and variance, you should. High training accuracy but low validation accuracy strongly suggests overfitting (high variance) in your training regime. There could also be an issue of class imbalance: 1) the class presence in the training set is unequal, and 2) the class distribution in the validation set looks different (I just eyeballed it, but you should compute the actual percentages). If that’s true, consider applying one of a few alternatives to mitigate it. If these concepts don’t make sense, maybe take a look at some of the material from the Hyperparameters course. Let us know what you find out.
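As a concrete example, the actual percentages can be computed directly from the counts posted above; this is a minimal sketch, with the dictionaries simply transcribed from the post.

```python
# Class counts transcribed from the post above.
train_counts = {
    "Anger": 16573, "Disgust": 10771, "Fear": 9080, "Happiness": 95817,
    "Neutral": 177720, "Other": 165866, "Sadness": 79853, "Surprise": 31615,
}
val_counts = {
    "Anger": 6126, "Disgust": 5296, "Fear": 8408, "Happiness": 34511,
    "Neutral": 82258, "Other": 106444, "Sadness": 25157, "Surprise": 12332,
}

def percentages(counts):
    total = sum(counts.values())
    return {cls: 100 * n / total for cls, n in counts.items()}

train_pct = percentages(train_counts)
val_pct = percentages(val_counts)

# Print side-by-side percentages to spot distribution shift between the splits.
print(f"{'class':<12}{'train %':>10}{'val %':>10}")
for cls in sorted(train_counts):
    print(f"{cls:<12}{train_pct[cls]:>10.2f}{val_pct[cls]:>10.2f}")
```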

PS: In my opinion, trying to back into good validation accuracy by using a smaller validation set just masks the problem… it won’t make your model any more useful in the real world.

3 Likes

Thanks for answering. I have kept improving the model and went through the Hyperparameters course. These are my current results:

I think I am now facing a data problem. The gap between the training and validation performance seems to be due to the lack of data for some emotions. It just so happens that the classes with the worst scores are the ones with the fewest images.
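For reference, one common way to mitigate this kind of imbalance is to weight the loss by inverse class frequency. Below is a minimal sketch assuming Keras and scikit-learn, where `y_train`, `train_dataset`, `val_dataset` and `model` are hypothetical names for the integer-encoded training labels, the two input pipelines and the compiled model.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# y_train: hypothetical array of integer class labels for the training images.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))

# Rare classes (e.g. Fear, Disgust) get a larger weight in the loss.
model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=10,
    class_weight=class_weight,
)
```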

Any suggestions?

Can I ask you one thing: are the fear and disgust images in the other group of the dataset?

Hello @Rodrigo_Cotarelo

I don’t know if this will help you, but I came across a similar project (not using VGG) that pays quite detailed attention to subject-dependent vs. subject-independent evaluation; it will probably help you pick up some more pointers in the direction you are looking for.

Emotional recognition using facial imaging.pdf (1.1 MB)

In my view, if training accuracy is 95% and validation accuracy is 40%, then, as ai_curious already mentioned, the issue is with the dataset distribution. Have you thought about swapping your training and validation sets (using the training set as validation and vice versa) to see if anything changes?

Another reason I see the discrepancy more in validation than in training is that your training dataset looks more widely spread than your validation dataset.

Before feeding the dataset into the VGG model, you should have done a data distribution analysis of your training and validation sets; that would have given you a much better understanding of the differences between the two.

Regards
DP

1 Like

Also go through these screenshots for understanding the bias and variance issue:

![Screenshot 1945-01-22 at 12.02.06 PM|690x431](upload://axYn1K8zQql4j6wfYJ2Qkmmylxe.jpeg)

Regards
DP

1 Like

Hi DP,

Thanks for answering! I appreciate it.

By “other group of dataset”, do you mean whether the photos of fear and disgust are from another dataset?

Yes. Also, I am not sure about the dataset image dimensions, and whether the images are normalised before being fed to the VGG model?

I am using the aff-wild2 dataset.

Aff-Wild2 is annotated on a per-frame basis for the seven basic expressions (i.e., happiness, surprise, anger, disgust, fear, sadness and the neutral state), twelve action units (AUs 1, 2, 4, 6, 7, 10, 12, 15, 23, 24, 25, 26), and valence and arousal. In total, Aff-Wild2 consists of 564 videos of around 2.8M frames with 554 subjects (326 male and 228 female).

I am working with the frames, which are all 112x112 aligned images.

In my opinion, the problem with fear and disgust is that they look like a lot of photos but they aren’t. Since the images are extracted from videos, one person feeling fear corresponds to a sequence of, say, 1,000 images tagged with fear. So although we have around 10,000 images for fear, that may correspond to only about 10 people actually feeling fear (if each episode of fear lasts around 1,000 frames).

That’s why I say that the variety we have for fear and disgust is not much.
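Because frames from the same video are nearly identical, one way to get a more honest estimate of variety (and of validation performance) is a subject-independent split, where all frames of a given video stay in the same split. Here is a minimal sketch assuming scikit-learn, with `frame_paths`, `labels` and `video_ids` as hypothetical per-frame arrays.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# frame_paths, labels, video_ids: hypothetical arrays, one entry per frame,
# where video_ids identifies which video (and hence subject) a frame came from.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(frame_paths, labels, groups=video_ids))

# All frames of a given video land in exactly one split, so the validation
# set measures generalisation to unseen people rather than unseen frames.
train_paths, val_paths = np.array(frame_paths)[train_idx], np.array(frame_paths)[val_idx]
train_labels, val_labels = np.array(labels)[train_idx], np.array(labels)[val_idx]
```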

Hello @Rodrigo_Cotarelo

I don’t know if it is allowed, but could you share a few sample images?

Also, based on the frame details, I would suggest resizing and then normalising your dataset before feeding it into your model architecture. That should give you better results.
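A minimal sketch of that kind of preprocessing, assuming TensorFlow/Keras, the 112x112 aligned frames mentioned earlier, and hypothetical `frame_paths`/`labels` arrays; the target size and batch size are illustrative.

```python
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import preprocess_input

IMG_SIZE = (224, 224)  # resize the 112x112 aligned frames up to VGG16's expected input

def load_and_preprocess(path, label):
    # Read, decode, resize and apply VGG16's own normalisation (BGR + mean subtraction).
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, IMG_SIZE)
    image = preprocess_input(tf.cast(image, tf.float32))
    return image, label

# frame_paths and labels are hypothetical lists describing the dataset on disk.
dataset = (
    tf.data.Dataset.from_tensor_slices((frame_paths, labels))
    .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```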

Also, can I ask how you split the data into training and validation, i.e., on what criterion?

I am not sure what you mean by this!

This I could only confirm with access to the data. I did refer to the Aff-Wild2 page, which also mentions:

Some in-the-wild databases have been recently proposed. However: i) their size is small, ii) they are not audiovisual, iii) only a small part is manually annotated, iv) they contain a small number of subjects, or v) they are not annotated for all main behavior tasks (valence arousal estimation, action unit detection and basic expression classification).
To address these, we substantially extend the largest available in-the-wild database (Aff-Wild) to study continuous emotions such as valence and arousal. Furthermore, we annotate parts of the database with basic expressions and action units. We call this database Aff-Wild2. To the best of our knowledge, AffWild2 is the only in-the-wild database containing annotations for all 3 main behavior tasks. The database is also a large scale one. It is also the first audiovisual database with annotations for AUs. All AU annotated databases do not contain audio, but only images or videos.

So based on the above, the images are only annotated; they have not been reshaped and no data normalisation has been applied. You could think about resizing the images to a higher resolution and then normalising the data using a feature extractor, scaling them from happiness to grief, for experimental purposes (these are only suggestions).

Regards
DP

Hi @Deepti_Prasad. I am back.

What do you mean by normalizing the data using a feature extractor?