I am working on a project where I have a tagged text corpus as input. I am currently forming the vocabulary from the corpus by taking words that are repeated more than once. Is this the correct approach to form the vocab? Also I do not understand why we are adding --unk-- in the tag list for the assignment. I understand that it should be part of vocab for handling unknown words, but why a tag? Because it will never match a real tag. Where am I getting this wrong?
Hi Astha.
In the assignment you are given the vocabulary (hmm_vocab.txt) - you do not have to form it.
Regarding the second question: you do not add --unk-- to the tag list. You find the tags by counting them in your corpus (tag_counts). If you do it correctly you will get:
View these states
['#', '$', "''", '(', ')', ',', '--s--', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``']
The --unk-- “word” is included in your vocabulary (hmm_vocab.txt) to mark unknown words. In fact, unknown words are split into several categories, as in the lab:
'--unk--', '--unk_adj--', '--unk_adv--', '--unk_digit--', '--unk_noun--', '--unk_punct--', '--unk_upper--', '--unk_verb--'
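To give a feel for how a word could end up in one of these categories, here is a rough sketch. It is not the lab's actual code: the function name assign_unk, the order of the checks, and the short suffix lists are all my own assumptions for illustration.

```python
import string

# Placeholder suffix lists (assumptions for illustration only)
NOUN_SUFFIXES = ("ness", "ment", "ship", "tion", "ity")
VERB_SUFFIXES = ("ate", "ify", "ize")
ADJ_SUFFIXES = ("able", "ous", "ful", "ish", "ive")
ADV_SUFFIXES = ("ward", "wards", "wise")


def assign_unk(word):
    """Map a word that is not in the vocabulary to an unknown-word token."""
    # Character-level checks come first
    if any(ch.isdigit() for ch in word):
        return "--unk_digit--"
    if any(ch in string.punctuation for ch in word):
        return "--unk_punct--"
    if any(ch.isupper() for ch in word):
        return "--unk_upper--"
    # Then suffix checks; str.endswith accepts a tuple of suffixes
    if word.endswith(NOUN_SUFFIXES):
        return "--unk_noun--"
    if word.endswith(VERB_SUFFIXES):
        return "--unk_verb--"
    if word.endswith(ADJ_SUFFIXES):
        return "--unk_adj--"
    if word.endswith(ADV_SUFFIXES):
        return "--unk_adv--"
    # Nothing matched: fall back to the generic token
    return "--unk--"


for w in ["3rd", "co-op", "Zorgland", "blorfness", "glimify", "homeward", "qwrt"]:
    print(w, "->", assign_unk(w))
```

The point is only that each unknown word is replaced by one of these tokens before counting, so the tokens behave like ordinary vocabulary words afterwards.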
So, for example, once you complete your assignment you will get (tag, word): count entries for the “word” '--unk--' like this:
('VBG', '--unk--') 517
('JJ', '--unk--') 763
('NNS', '--unk--') 1432
('NN', '--unk--') 1307
('VBZ', '--unk--') 341
('VB', '--unk--') 245
('VBD', '--unk--') 305
('VBP', '--unk--') 60
('VBN', '--unk--') 462
('CD', '--unk--') 2
('JJS', '--unk--') 36
('FW', '--unk--') 31
('RB', '--unk--') 29
('PRP', '--unk--') 2
('MD', '--unk--') 1
('IN', '--unk--') 2
('SYM', '--unk--') 1
('UH', '--unk--') 6
('NNP', '--unk--') 2
Conversely, for the tag 'NNS' you would get, for example:
('NNS', '--unk_upper--') 350
('NNS', 'Arts') 2
('NNS', 'sales') 946
('NNS', 'cars') 138
('NNS', 'markets') 359
...
('NNS', 'officials') 493
('NNS', 'proposals') 57
('NNS', 'realities') 6
('NNS', '--unk--') 1432
('NNS', 'estimates') 90
('NNS', 'filings') 24
...
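If it helps, here is a minimal sketch of how counts like the ones above come about. The toy corpus and vocabulary are made up, the name emission_counts is my own choice for the (tag, word) dictionary, and the real assignment also handles things like the '--s--' start token and the finer --unk_...-- categories, which I leave out here.

```python
from collections import defaultdict

# Toy tagged corpus of (word, tag) pairs; invented for illustration
corpus = [
    ("The", "DT"), ("markets", "NNS"), ("rallied", "VBD"), (".", "."),
    ("The", "DT"), ("zorbs", "NNS"), ("rallied", "VBD"), (".", "."),
]

# Toy vocabulary; in the assignment this comes from hmm_vocab.txt
vocab = {"The", "markets", "rallied", ".", "--unk--"}

tag_counts = defaultdict(int)       # tag -> count
emission_counts = defaultdict(int)  # (tag, word) -> count

for word, tag in corpus:
    # Out-of-vocabulary words are replaced by the unknown token,
    # but they keep the tag they have in the corpus
    if word not in vocab:
        word = "--unk--"
    tag_counts[tag] += 1
    emission_counts[(tag, word)] += 1

# The states are just the tags that were seen in the corpus
states = sorted(tag_counts)

print(states)
print(emission_counts[("NNS", "--unk--")])
print("--unk--" in states)
```

Here "zorbs" is not in the vocabulary, so it is counted as ('NNS', '--unk--'). The token shows up only on the word side of the (tag, word) keys and never in states, which is why it does not need to be a tag.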
I hope this answers your questions.