C2W2 Coding project, corpus of Viterbi algorithm

I do not understand why the corpus is not a sequence of sentences but one very, very long sequence of words. It is confusing that there is only one place where a word is preceded by the start POS ('--s--'), namely before the very first word. Is there any reason to do it this way?

This is a good question :slight_smile: It is a design choice, and in this case it doesn't matter much because we have '.' as a POS (which implies the end of the current sentence and the beginning of the next one). But in a real project you should definitely think about what implications this choice has for your desired outcome, and you would want to compare the alternatives. For example, if you wanted to implement a chatbot and all your preprocessed sentences start with '--s--' but your training data does not look like that, what results would you get? On the other hand, if you have a large paragraph of text and you want to predict POS tags, would you still want '--s--' at the beginning of every sentence?

P.S. As an exercise, you could try preprocessing the text by inserting '--s--' wherever you think it is appropriate and report the results here :slight_smile:
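
If it helps to get started, here is a minimal sketch of that preprocessing step. It is not the assignment's code: it assumes the corpus has already been loaded as a list of (word, tag) pairs, that the '.' tag marks the end of a sentence, and that the start tag is paired with a placeholder word `--n--`. The function name `insert_start_tokens` is made up for illustration.

```python
START_TAG = "--s--"   # start-of-sentence tag discussed above
START_WORD = "--n--"  # placeholder word paired with the start tag (assumption)
END_TAG = "."         # tag assumed to mark the end of a sentence


def insert_start_tokens(tagged_words):
    """Return a new list with a start token placed before every sentence.

    tagged_words: list of (word, tag) pairs forming one long sequence.
    """
    result = []
    at_sentence_start = True  # the very first word also begins a sentence
    for word, tag in tagged_words:
        if at_sentence_start:
            result.append((START_WORD, START_TAG))
            at_sentence_start = False
        result.append((word, tag))
        if tag == END_TAG:
            # The next word, if there is one, begins a new sentence.
            at_sentence_start = True
    return result


# Tiny hand-made corpus, for illustration only.
corpus = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", "."),
          ("It", "PRP"), ("runs", "VBZ"), (".", ".")]
processed = insert_start_tokens(corpus)
```

You could then rebuild the transition and emission counts from `processed` and compare the tagging accuracy against the original single-sequence corpus. Note that '.' is also used as the tag for '?' and '!' in some tagsets, so check how your data marks sentence ends before relying on this rule.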