Week 2 - Emojify - Test Set has 'tab' endings => low test set accuracy?

Zackk · January 29, 2023, 7:51pm

When training the deep LSTM for the Emojify assignment, I noticed that the test set has a tab character at the end of each line (data/tesss.csv).

[Screenshot: testPNG]

These tabs (->) do not appear in the training set (data/train_emoji.csv).

[Screenshot: train]

They probably confuse the model. I removed them with:

X_test = np.array([x.strip() for x in X_test])

and the accuracy on the test set jumped from 70% to 85%.
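For anyone who wants to check what the one-liner above does, here is a minimal sketch. The two sentences are made-up stand-ins for lines from the test file (not copied from data/tesss.csv), and `X_test_clean` is just an illustrative name.

```python
import numpy as np

# Hypothetical sample mimicking the test file: each sentence ends with a tab.
X_test = np.array(["I love you\t", "lets play ball\t"])

# str.strip() with no argument removes leading and trailing whitespace,
# which includes tab characters.
X_test_clean = np.array([x.strip() for x in X_test])

print(X_test_clean.tolist())  # ['I love you', 'lets play ball']
```

Note that `strip()` only touches the ends of each string, so spaces between words are left alone.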

TMosh · January 29, 2023, 8:16pm

Good point, I’ll look into it.

Girijesh · January 29, 2023, 8:22pm

Identifying the problem is half the solution.
Good work.

TMosh · January 29, 2023, 8:32pm

Note that a few months ago I got 85% test set accuracy without removing the tabs, so there’s something that needs further investigation.

Zackk · January 29, 2023, 9:31pm

What might have caused this: I used split(' ') instead of split() when processing the data. The latter splits on any whitespace, so it would have removed the tab.
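A small sketch of the difference, in case it helps someone else. The sentence and the tiny `word_to_index` dictionary below are made up for illustration. The assumption is that the assignment code looks each word up in a vocabulary, so a token that still carries a tab would not be found.

```python
# Hypothetical test-set line with a trailing tab.
sentence = "I love you\t"

# split(' ') splits only on the space character, so the tab stays
# attached to the last word.
words_space = sentence.lower().split(' ')

# split() with no argument splits on any run of whitespace and drops
# leading/trailing whitespace, so the tab disappears.
words_any = sentence.lower().split()

print(words_space)  # ['i', 'love', 'you\t']
print(words_any)    # ['i', 'love', 'you']

# Hypothetical tiny vocabulary standing in for the real word-to-index map.
word_to_index = {"i": 0, "love": 1, "you": 2}

# Tokens that would fail a vocabulary lookup.
missing = [w for w in words_space if w not in word_to_index]
print(missing)  # ['you\t']
```

With split(' ') the last word of every test sentence would miss the vocabulary, which could plausibly account for the drop in test accuracy.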

I found two other posts (1 2) with the same problem (test set accuracy of 70% instead of 85%). Maybe it was the same gotcha.

rmwkwok · January 30, 2023, 2:24am

That’s very interesting! Thank you, @Zackk!

Raymond
