Hey @vorpalsnark,
On the training dataset, I can see that it achieves up to 77% accuracy, as shown in the image below, and I believe that’s the only accuracy the notebook reports. But even if you are achieving only 70% accuracy, it’s a good start, considering we trained our model for only 10 epochs on just 220 examples.

Yes, you can definitely play with the hyper-parameters, and see if you can improve the model’s performance.
You see, the major reason behind this is that the “Empty” tag appears very often in the dataset, while the other tags appear far less often. So even if the model just predicts every token as “Empty”, it can still get quite a high accuracy. This is quite evident from the classification report presented towards the end of the notebook.
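To see why, here’s a toy sketch with made-up counts (90 “Empty” tokens vs 10 entity tokens; not the actual dataset numbers), where a “model” that only ever predicts “Empty” still scores high on accuracy:

```python
# Hypothetical skewed tag distribution, for illustration only:
# 90 "Empty" tokens vs 10 tokens carrying real entity tags.
y_true = ["Empty"] * 90 + ["Degree"] * 5 + ["Location"] * 5
y_pred = ["Empty"] * 100  # a degenerate model: every token tagged "Empty"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.9 -- high accuracy despite predicting no entities at all
```

The exact ratio differs in the real dataset, but the effect is the same: the majority tag dominates the accuracy number.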

For tags like “Years of Experience”, “Degree”, “Location”, “College Name”, etc., despite having around 100 instances each, the model gives a precision and recall of 0, i.e., it doesn’t predict any token with these tags. Note that even in this scenario, our model gets an accuracy of 77% and a weighted avg f1-score of 0.94. Things look good only until we observe the macro avg f1-score, which is only 0.23. You can read more about micro avg, macro avg and weighted avg here.
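The gap between the two averages is easy to reproduce by hand. Below is a minimal sketch (same made-up 90/10 split as before, not the notebook’s real counts) computing per-class f1, then the macro avg (plain mean over classes) and the weighted avg (mean weighted by class support):

```python
from collections import Counter

def f1_for_label(y_true, y_pred, label):
    """Per-class f1 from true/false positives and false negatives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Hypothetical skewed data: the model only ever predicts "Empty".
y_true = ["Empty"] * 90 + ["Degree"] * 5 + ["Location"] * 5
y_pred = ["Empty"] * 100

labels = sorted(set(y_true))
support = Counter(y_true)
f1s = {lbl: f1_for_label(y_true, y_pred, lbl) for lbl in labels}

# Macro avg: every class counts equally, so the two 0.0 classes drag it down.
macro_f1 = sum(f1s.values()) / len(labels)
# Weighted avg: dominated by the huge "Empty" class, so it stays high.
weighted_f1 = sum(f1s[lbl] * support[lbl] for lbl in labels) / len(y_true)

print(round(macro_f1, 2), round(weighted_f1, 2))  # macro low, weighted high
```

The same mechanism produces the 0.94 weighted vs 0.23 macro split in the notebook: the rare tags with f1 = 0 barely move the weighted avg but sink the macro avg.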
To conclude, try playing with the hyper-parameters and see if you can make a major stride in the macro avg f1-score. Since the dataset is highly skewed, metrics like accuracy and weighted avg f1-score aren’t apt for judging your model. And lastly, even with 77% accuracy, this is the kind of performance we can expect from the model.
Let us know if this helps you out.
Cheers,
Elemento