Poor performance of attention-based encoder-decoder architecture for slot filling

Hi everyone,

I’m currently doing some research on methods that tackle the intent classification and slot filling problems in NLP. One of the approaches I chose to start experimenting with is proposed in the following paper:

In this paper, the network is trained jointly for both tasks. The encoder-decoder architecture can be seen below:

The encoder is a bidirectional LSTM and the decoder is another LSTM. Since the two tasks are learned jointly, there is one decoder for intent classification and another for slot label prediction. At each timestep in the decoding phase, the decoder block receives the previous hidden state, the previously generated token (or the correct token during training, i.e. teacher forcing) and a context vector obtained by attending over all of the encoder’s hidden states. There are also some points worth noting for the implementation:

  • The last hidden state of the backward LSTM in the encoder is used as the decoder’s initial hidden state (and likewise for the cell state), following the paper “Neural Machine Translation by Jointly Learning to Align and Translate”.
  • For the output layer, the current hidden state, the context vector and the previous output token are all linearly projected and passed through a maxout layer before being projected again to produce the slot label probabilities, as described in the paper above (sketched below).
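For reference, here is a minimal sketch of how I read these two points in PyTorch (the class name, layer sizes and variable names are mine, not from the paper or the repo):

```python
import torch
import torch.nn as nn

class SlotDecoderStep(nn.Module):
    """One decoding step: attention over encoder states plus a maxout output layer.
    A minimal sketch of my reading of the paper; all names and sizes are my own."""
    def __init__(self, hid, emb, n_labels, pool=2):
        super().__init__()
        self.attn_score = nn.Linear(2 * hid + hid, 1)   # score([enc_state; dec_state])
        self.rnn = nn.LSTMCell(emb + 2 * hid, hid)      # input: [prev-token emb; context]
        self.pre_out = nn.Linear(hid + 2 * hid + emb, pool * hid)  # maxout projection
        self.out = nn.Linear(hid, n_labels)
        self.pool = pool

    def forward(self, prev_emb, h, c, enc_states):
        # enc_states: (batch, src_len, 2*hid) from the bidirectional encoder
        dec = h.unsqueeze(1).expand(-1, enc_states.size(1), -1)
        attn = torch.softmax(
            self.attn_score(torch.cat([enc_states, dec], -1)).squeeze(-1), -1)
        context = torch.bmm(attn.unsqueeze(1), enc_states).squeeze(1)  # (batch, 2*hid)
        h, c = self.rnn(torch.cat([prev_emb, context], -1), (h, c))
        # maxout: project [h; context; prev_emb], then take max over the pool dimension
        t = self.pre_out(torch.cat([h, context, prev_emb], -1))
        t = t.view(t.size(0), -1, self.pool).max(-1).values
        return self.out(t), h, c

# decoder initialization: last state of the *backward* encoder LSTM
# (assuming enc_h_n / enc_c_n have shape (num_directions, batch, hid))
h0, c0 = enc_h_n[1], enc_c_n[1]
```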

Regarding implementations of this paper, among those available on GitHub, I found the following one with the highest number of stars:

However, I think the repo has some issues with the attention mechanism and with the initialization of the decoder and the network’s weights, so for now I have revised it in the following notebook:

Although this notebook isn’t well organized yet, I have applied some modifications to the Encoder, the Decoder and the training procedure.

However, as you can see, while the intent classification performance improves substantially, the slot filling F1 stays flat throughout training and evaluation (despite teacher forcing being used).

The test results are poor:

I have been looking into this for a while but still can’t figure out what is going wrong. I would appreciate any help or comments on this!

Thanks.

Hi @Minh_Tu_Canh,

Check whether you have an imbalanced dataset (e.g. 99% of examples being class 0 and 1% being class 1).

A highly non-uniform label distribution makes the training accuracy misleading, since the model overfits to the majority class, so you need to make sure your classes are balanced or appropriately weighted.
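You can check this quickly with a Counter (just a sketch; adjust the variable names to your code):

```python
from collections import Counter

# how skewed are the intent labels and the slot tags?
intent_counts = Counter(train_intents)                              # one intent per utterance
slot_counts = Counter(tag for tags in train_slots for tag in tags)  # one tag per token
print(intent_counts.most_common(5))
print(slot_counts.most_common(5))
```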

Regards
DP

Hi @Deepti_Prasad
Thank you for your response.

The ATIS dataset that I’m using is indeed imbalanced; the distribution of intent labels is highly skewed, as shown below:

In my old implementation, the accuracy and F1 score for intent classification during training are both high and exactly the same!


I’ve tried weighting the intent and slot losses in the overall loss, but it seems that the slot filling metrics haven’t improved much. Could you suggest some ideas?
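For reference, the weighting I tried looks roughly like this (the 0.5/0.5 split here is just illustrative, not my exact values):

```python
import torch.nn.functional as F

# joint loss: weighted sum of the intent loss and the slot loss
intent_loss = F.cross_entropy(intent_logits, intent_targets)
slot_loss = F.cross_entropy(
    slot_logits.view(-1, n_slot_labels),   # (batch * seq_len, n_slot_labels)
    slot_targets.view(-1),
    ignore_index=pad_idx,                  # don't penalize padding positions
)
loss = 0.5 * intent_loss + 0.5 * slot_loss
```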

Can I know how much data you have in total, and how you split it, @Minh_Tu_Canh?

Here is the link to the ATIS dataset that I’m using @Deepti_Prasad :

The training split contains 4478 utterances with 120 slot labels and 21 intent labels.

The validation split contains 500 utterances with 96 slot labels and 16 intent labels. The intent distribution is as follows:

The test split contains 893 utterances with 101 slot labels and 19 intent labels. The intent distribution is as below:

These splits are default from the ATIS dataset.

@Minh_Tu_Canh

Have you used the same code from the shared link?

@Deepti_Prasad Yes, it’s version 6 of the notebook I shared. The training log is in the last block before the Test section.

In the version above, for each timestep of slot labeling in the decoder, I use the embedding of the previously generated token (or the true token, via teacher forcing). I’ve also tried to change this scheme a bit by feeding in the embedding of the aligned input token (for self-alignment) instead of the previous output token, but the results don’t seem to improve.
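In code, the difference between the two schemes is just which embedding is fed to the decoder at step t (a simplified sketch; the names are mine):

```python
import random

teacher_forcing = self.training and random.random() < tf_ratio
prev_token = torch.full((batch_size,), sos_idx, device=device)  # start-of-sequence label
for t in range(tgt_len):
    if scheme == "prev_output":
        step_emb = self.label_emb(prev_token)        # previous gold/predicted slot label
    else:                                            # "aligned_input" (self-alignment)
        step_emb = self.word_emb(src_tokens[:, t])   # input word aligned with step t
    logits, h, c = self.decoder_step(step_emb, h, c, enc_states)
    pred = logits.argmax(-1)
    prev_token = slot_targets[:, t] if teacher_forcing else pred
```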

The first 2 training epochs are:

The last 2 training epochs are:

It seems that the slot labeling decoder is having a hard time converging (if my implemented architecture is correct).

As I suspected from looking at the code, the labelling seemed a little off, and I can see you have made some progress.

Why don’t you include both the previous output token and the embedding of the input token? It will surely give a better result than choosing only one.
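Something like this in the decoder loop (just a sketch; you would also need to enlarge the decoder LSTM’s input size accordingly):

```python
# feed both: concatenate the previous label embedding and the aligned word embedding
step_emb = torch.cat([self.label_emb(prev_token),
                      self.word_emb(src_tokens[:, t])], dim=-1)
```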

In the labelling section, for slot and intent labelling, can you check the ratio of these two?

I gave it a try, but it doesn’t seem to improve. This is the last training epoch:

What do you mean by the ratio of slot and intent labeling?

In the code, the slot labels and the intent label are extracted separately (as [:-1] and [-1]), so check how many of each of these labels there are in the whole set.

If you mean the statistics of the slot and intent labels, I mentioned them above.
If you’re specifically referring to this part of the preprocess_data() function:

I’m just splitting each line in the dataset to obtain the slot labels and the intent label of each sentence.
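The relevant part is roughly equivalent to this (in my copy of the dataset each line is the words, a tab, then the per-word slot labels with the intent as the last label; adjust if your format differs):

```python
def parse_line(line):
    # "BOS w1 ... wn EOS <TAB> O s1 ... sn intent_label"
    words_part, labels_part = line.strip().split("\t")
    words = words_part.split()
    labels = labels_part.split()
    slots, intent = labels[:-1], labels[-1]   # slot labels per word, intent at the end
    return words, slots, intent
```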

Do you mean those above?

I’ve done some revision and found out that the low F1 results for slot filling were actually micro F1 scores.


When I change to macro F1 score, the results are a lot higher:
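For reference, the only difference is the average argument (using sklearn here; y_true/y_pred are the flattened gold and predicted slot tags over all non-padding tokens):

```python
from sklearn.metrics import f1_score

micro = f1_score(y_true, y_pred, average="micro")  # F1 from global TP/FP/FN counts
macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-label F1
```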

@Minh_Tu_Canh

As you know, average='macro' tells the function to compute F1 for each label and return the unweighted average, without considering the proportion of each label in the dataset, whereas average='weighted' returns the average weighted by each label’s proportion. Micro F1 is different again: it counts TP/FP/FN globally over all predictions rather than per label.

So first check the proportion of each label in general. I think proportionality can be ignored only when the labels are evenly distributed, but if the label distribution is highly imbalanced then macro F1 wouldn’t be the right measure here, as it scores each label’s F1, recall and precision independently, without relating them to how often that label actually occurs.
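Here is a small example of how much the averages can differ on imbalanced labels:

```python
from sklearn.metrics import f1_score

y_true = [0] * 8 + [1] * 2   # 80/20 imbalance
y_pred = [0] * 9 + [1]       # one minority example missed
print(f1_score(y_true, y_pred, average="micro"))     # 0.90  (global counts)
print(f1_score(y_true, y_pred, average="macro"))     # ~0.80 (unweighted per-label mean)
print(f1_score(y_true, y_pred, average="weighted"))  # ~0.89 (support-weighted mean)
```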

Remember, I did tell you to check how your labels are split between slot filling and intent labelling, as well as the true class distribution in your dataset.

Regards
DP