Found the issue: the test dataset CSV file was mishandled at some point, and its strings end with a \t (TAB character). The training set does not suffer from this issue.
I used .split(' ') to separate the words, which meant the last word of each sentence had a \t stuck to it and was not found in the dictionary. The solution is to always use .split() without an argument, which splits on any whitespace run and discards leading/trailing whitespace.
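A minimal sketch of the difference (the sentence below is a made-up example, not from the actual dataset):

```python
# A line as it might appear in the mishandled test CSV: trailing TAB after the last word
line = "the movie was great\t"

# .split(' ') keeps the tab glued to the last token, so it misses the dictionary lookup
print(line.split(' '))   # ['the', 'movie', 'was', 'great\t']

# .split() with no argument splits on any whitespace and drops leading/trailing runs
print(line.split())      # ['the', 'movie', 'was', 'great']
```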
With correctly split words, I’m getting 82% accuracy, which is within the expected range but, surprisingly, still below the much simpler average-vector model.