I’m currently doing some research on methods that tackle the intent classification and slot filling problems in NLP. One of the approaches I chose to start experimenting with is proposed in the following paper:
In this paper, the network is trained jointly for both tasks. The encoder-decoder architecture can be seen below:
The encoder is a bidirectional LSTM and the decoder is another LSTM. Since the two tasks are learned jointly, there is one decoder for intent classification and another for slot label prediction. At each timestep of the decoding phase, the block receives the previous hidden state, the previously generated token (or the gold token during training), and a context vector computed by attending over all of the encoder’s hidden states. There are also some implementation points worth noting (a sketch of one decoder step follows these points):
The last hidden state of the backward LSTM in the encoder is used as the decoder’s initial hidden state (and likewise the last cell state as its initial cell state), following the paper “Neural Machine Translation by Jointly Learning to Align and Translate”.
For the output layer, the current hidden state, the attention context, and the previous output token are each linearly projected and passed through a maxout operation before being projected again to produce the probability distribution over slot labels, as described in the paper above.
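To make these points concrete, here is a minimal sketch of one slot-decoder step. All names and dimensions are illustrative (they are not the repo’s or the notebook’s actual code), and the attention score is simplified to a single linear layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotDecoderStep(nn.Module):
    """One decoding step: attend over the encoder states, update the LSTM cell,
    then a maxout output layer. A sketch only; dimensions are assumptions."""

    def __init__(self, emb_dim, enc_dim, dec_dim, num_slots, pieces=2):
        super().__init__()
        self.attn = nn.Linear(dec_dim + enc_dim, 1)          # simplified additive score
        self.cell = nn.LSTMCell(emb_dim + enc_dim, dec_dim)  # input: prev-token emb + context
        self.maxout = nn.Linear(dec_dim + enc_dim + emb_dim, dec_dim * pieces)
        self.pieces = pieces
        self.out = nn.Linear(dec_dim, num_slots)

    def forward(self, prev_emb, h, c, enc_outs):
        # prev_emb: (B, emb_dim); h, c: (B, dec_dim); enc_outs: (B, src_len, enc_dim)
        query = h.unsqueeze(1).expand(-1, enc_outs.size(1), -1)
        scores = self.attn(torch.cat([query, enc_outs], dim=-1)).squeeze(-1)
        context = (F.softmax(scores, dim=-1).unsqueeze(1) @ enc_outs).squeeze(1)
        h, c = self.cell(torch.cat([prev_emb, context], dim=-1), (h, c))
        # maxout: project [hidden; context; prev token], take the max over the pieces
        t = self.maxout(torch.cat([h, context, prev_emb], dim=-1))
        t = t.view(t.size(0), -1, self.pieces).max(dim=-1).values
        return self.out(t), h, c

# Initialization (first point above): for nn.LSTM(..., bidirectional=True, num_layers=1),
# h_n and c_n have shape (2, B, hid), where index 1 is the backward direction:
#     h0, c0 = h_n[1], c_n[1]   # optionally projected if the decoder dims differ
```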
Regarding implementations of this paper, among those available on GitHub, I found the following one with the highest number of stars:
However, I think the repo has some issues with the attention mechanism and with the initialization of the decoder and of the network’s weights overall. I have therefore refined the repo, for now, into the following notebook:
Although this notebook isn’t well organized yet, I applied some modifications to the Encoder, the Decoder, and the training procedure.
However, as you can see, while the intent classification performance improves substantially, the slot filling F1 stays flat throughout training and evaluation (even though teacher forcing is used).
Check whether you have an unbalanced dataset (e.g. 99% of samples being class 0 and 1% being class 1).
When the class distribution is not uniform, the model can overfit the majority class while training accuracy still looks fine, so you need to make sure your class weights are balanced.
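For example, one quick way to check the slot-label distribution and pass inverse-frequency weights to the loss (the label list and pad_id here are hypothetical stand-ins; in slot filling the ‘O’ tag usually dominates, as simulated below):

```python
from collections import Counter
import torch

# Hypothetical stand-in for the flattened gold slot ids of the training set
slot_label_ids = [0] * 950 + [1] * 30 + [2] * 20
pad_id = 3  # hypothetical padding-label id, ignored by the loss

counts = Counter(slot_label_ids)
total = sum(counts.values())
print({label: f"{100 * n / total:.1f}%" for label, n in counts.most_common()})

# Inverse-frequency class weights for the slot loss (one simple choice among many)
num_labels = max(counts) + 1
weights = torch.tensor([total / counts.get(i, 1) for i in range(num_labels)],
                       dtype=torch.float)
weights = weights * (num_labels / weights.sum())  # rescale so weights average to 1
slot_criterion = torch.nn.CrossEntropyLoss(weight=weights, ignore_index=pad_id)
```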
I’ve tried weighting the intent and slot losses in the overall loss, but the slot filling metrics haven’t improved much. Could you suggest some ideas?
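For reference, the weighting I tried looks roughly like this (stand-in shapes and tensors; in the notebook these come from the model and the batch, and alpha is just a hand-tuned coefficient):

```python
import torch
import torch.nn.functional as F

# Stand-in shapes and tensors, for illustration only
B, seq_len, num_intents, num_slots, pad_id = 4, 12, 7, 20, 0
intent_logits = torch.randn(B, num_intents)
slot_logits = torch.randn(B, seq_len, num_slots)
intent_gold = torch.randint(0, num_intents, (B,))
slot_gold = torch.randint(0, num_slots, (B, seq_len))

alpha = 0.2  # hand-tuned weighting coefficient (an assumption, not a recommendation)
intent_loss = F.cross_entropy(intent_logits, intent_gold)
slot_loss = F.cross_entropy(slot_logits.reshape(-1, num_slots),
                            slot_gold.reshape(-1), ignore_index=pad_id)
loss = alpha * intent_loss + (1.0 - alpha) * slot_loss
print(loss)
```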
In the version above, at each timestep of slot labeling in the decoder, I feed in the embedding of the previously generated token (or of the gold token, via teacher forcing). I’ve also tried changing this scheme a bit by feeding in the embedding of the aligned input token (for self-alignment) instead of the previous output token, but the results don’t seem to improve.
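Side by side, the two input schemes look roughly like this, continuing the SlotDecoderStep sketch from my first post (all names illustrative):

```python
# Scheme A: feed the embedding of the previous slot label (gold under teacher forcing)
step_in = slot_emb(prev_slot_labels)          # slot_emb: nn.Embedding(num_slots, emb_dim)

# Scheme B: feed the embedding of the aligned input word at timestep t instead
step_in = word_emb(input_ids[:, t])           # word_emb: nn.Embedding(vocab_size, emb_dim)

logits, h, c = step(step_in, h, c, enc_outs)  # step = SlotDecoderStep(...)
```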
If you mean the statistics of the slot and intent labels, I mentioned them above.
If you’re specifically referring to this part of the preprocess_data() function:
As you know, average='macro' tells the function to compute the F1 for each label and return the unweighted average, without considering the proportion of each label in the dataset, whereas average='weighted' computes the F1 for each label and returns the average weighted by each label’s proportion (and average='micro' computes a single global F1 from the total true/false positive and false negative counts).
So first check the proportion of each label in general. Proportionality can be ignored only when the labels occur in roughly equal amounts; if they are highly imbalanced, macro F1 wouldn’t be the right measure here, since it scores each label’s F1, recall, and precision independently rather than in relation to the labels’ frequencies.
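A toy example of how the averaging modes diverge on skewed labels (the numbers here are made up):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]  # heavily skewed toward label 0
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # predicts only the majority class

print(f1_score(y_true, y_pred, average='macro'))     # ~0.30: unweighted mean over labels
print(f1_score(y_true, y_pred, average='weighted'))  # ~0.71: mean weighted by label support
print(f1_score(y_true, y_pred, average='micro'))     # 0.80: from global TP/FP/FN counts
```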
Remember, I did tell you to check how your labels are split between slot filling and intent labelling, as well as the true class/feature distribution in your dataset.