C3 W4 n_grams_seqs input_sequences length is wrong

Hello, I’ve created the n_grams_seqs function, and my output matches the expected output for the one-line and three-line visual tests, but the length of input_sequences comes out as 15462 instead of the expected 15355.

In my n_grams function I iterate over each line in the corpus in a for-each loop and split the line by spaces into words. I then iterate over the words, incrementally taking an n_gram sequence and tokenizing it, then flattening the result into a single array (the tokenizer returns a 2D array, [[34],[417]], which I transform into [34, 417]). I suspect this is where I’m going wrong, but I’m really not sure, since all of the outputs look the same.

For context here is my output for the first two grader example output tests:

n_gram sequences for first example look like this:

[[34, 417],
 [34, 417, 877],
 [34, 417, 877, 166],
 [34, 417, 877, 166, 213],
 [34, 417, 877, 166, 213, 517]]
n_gram sequences for next 3 examples look like this:

[[8, 878],
 [8, 878, 134],
 [8, 878, 134, 351],
 [8, 878, 134, 351, 102],
 [8, 878, 134, 351, 102, 156],
 [8, 878, 134, 351, 102, 156, 199],
 [16, 22],
 [16, 22, 2],
 [16, 22, 2, 879],
 [16, 22, 2, 879, 61],
 [16, 22, 2, 879, 61, 30],
 [16, 22, 2, 879, 61, 30, 48],
 [16, 22, 2, 879, 61, 30, 48, 634],
 [25, 311],
 [25, 311, 635],
 [25, 311, 635, 102],
 [25, 311, 635, 102, 200],
 [25, 311, 635, 102, 200, 25],
 [25, 311, 635, 102, 200, 25, 278]]

I’d appreciate any advice on where I’m going wrong! Thanks.

Update: I got it working. I was splitting the sentence into words before converting the text to sequences; I switched to calling texts_to_sequences on the whole sentence (wrapped in a list, as the notebook says to do 2 cells above :man_facepalming:). Still, I’m not sure why my original approach was wrong or how the length of input_sequences increased. Are spaces being counted somehow?
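
One possible explanation, sketched here without Keras: splitting on a single space counts every space-separated chunk as a word, including empty strings from double spaces and punctuation-only chunks, while the tokenizer drops those. The `to_sequence` helper and the tiny `word_index` below are hypothetical stand-ins that only mimic what I understand the Tokenizer's default filtering to do, and the sample line is made up, so treat this as an illustration rather than a confirmed diagnosis of the sonnets corpus.

```python
# Hypothetical stand-in for a fitted Keras Tokenizer (assumption): replace the
# default filter characters with spaces, lowercase, split on whitespace, and
# drop any word that is not in the vocabulary.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
TABLE = str.maketrans({ch: " " for ch in FILTERS})


def to_sequence(text, word_index):
    words = text.lower().translate(TABLE).split()
    return [word_index[w] for w in words if w in word_index]


def n_grams_split_first(line, word_index):
    """Original approach: one n-gram per space-separated chunk."""
    chunks = line.split(" ")
    return [to_sequence(" ".join(chunks[: i + 1]), word_index)
            for i in range(1, len(chunks))]


def n_grams_tokenize_first(line, word_index):
    """Working approach: tokenize the whole line, then slice the result."""
    seq = to_sequence(line, word_index)
    return [seq[: i + 1] for i in range(1, len(seq))]


# Made-up vocabulary and line: note the double space and the lone dash.
word_index = {"from": 1, "fairest": 2, "creatures": 3}
line = "from fairest  creatures -"

split_first = n_grams_split_first(line, word_index)
tokenize_first = n_grams_tokenize_first(line, word_index)

print(len(split_first), split_first)
print(len(tokenize_first), tokenize_first)
```

In this sketch the split-first version emits duplicate n-grams, because the empty chunk and the dash each add a loop iteration without adding a token. A handful of lines like that in the corpus would inflate the total count while leaving the clean example lines in the visual tests looking identical.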
