Week 2 - Emojify - Test Set has 'tab' endings => low test set accuracy?

While training the deep LSTM for the Emojify assignment, I noticed that each line of the test set (data/tesss.csv) ends with a tab character.

[Image: testPNG]

These tabs (->) do not appear in the training set (data/train_emoji.csv).

[Image: train]

They probably confuse the model. I removed them with:

X_test = np.array([x.strip() for x in X_test])

and the accuracy on the test set jumped from 70% to 85%.
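For anyone who wants to check their own copy of the data before stripping, here is a minimal sketch. The sentences are made-up stand-ins, not rows from the actual CSV, and a plain list is used in place of the NumPy array so the snippet runs on its own.

```python
# Hypothetical stand-in for the sentences loaded from the test CSV;
# in the notebook, X_test would come from the course's own loader.
X_test = ["I want to eat\t", "he did not answer\t", "so happy"]

# Count sentences that change when surrounding whitespace is stripped.
n_dirty = sum(1 for x in X_test if x != x.strip())

# Same cleanup as in the post: strip() removes leading and trailing
# whitespace, which includes tab characters.
X_clean = [x.strip() for x in X_test]

print(n_dirty, "of", len(X_test), "sentences had stray whitespace")
```

If the count is zero on your copy of the test set, the tabs are probably not what is lowering your accuracy.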

Good point, I’ll look into it.

Identifying the problem is half the solution.
Good work.

Note that a few months ago I got 85% test set accuracy without removing the tabs, so there’s something that needs further investigation.

What might have caused this: I used split(' ') instead of split() when processing the data (the latter would have removed the tab).
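The difference between the two calls can be seen on a single string. This is only an illustration with a made-up sentence, not the assignment's code:

```python
# A made-up sentence with a trailing tab, like the lines in the test file.
sentence = "I love you\t"

# Splitting on a literal space keeps the tab attached to the last word.
words_space = sentence.split(' ')

# split() with no argument splits on any run of whitespace and
# discards leading/trailing whitespace, so the tab disappears.
words_default = sentence.split()

print(words_space)    # ['I', 'love', 'you\t']
print(words_default)  # ['I', 'love', 'you']
```

With split(' '), the last token is 'you\t' rather than 'you', so it would no longer match the plain word in a vocabulary lookup.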

I found two other posts (1 2) with the same problem (test set accuracy of 70% instead of 85%). Maybe it was the same gotcha.

That’s very interesting! Thank you, @Zackk!

Raymond