Hello Folks,
I'm kind of overwhelmed by everything I learned after completing the NLP course, so I wanted to pause for a while and build something of my own as a project. Last week I stumbled upon the idea of building a disease prediction model, and I'd like some feedback on my approach. I scraped diseases and their symptoms from the internet; the symptoms are separated by commas and have no sequential meaning.
My questions and notes so far:
- Symptom and disease names can be multi-word. Is there a workaround in TensorFlow to tokenize multi-word phrases as single units, or some other way to handle this?
- I guess an LSTM is fairly pointless here, since the symptoms are comma-separated in no particular order. Comments?
- I augmented (multiplied) my dataset by generating various combinations of the symptoms for every disease.
- My approach so far has been to tokenize each symptom. I did not tokenize the labels; instead I created a dictionary mapping each disease name to an index, because some disease labels are multi-word and I cannot pass a token array as a label. (A rough sketch of this preprocessing is below, after this list.)
- That’s it
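To make the multi-word workaround, the combination-based augmentation, and the label dictionary concrete, here is a simplified sketch of what I mean. The data dict, symptom strings, and structure are made up for illustration; my real scraped data is much larger. The idea for multi-word symptoms is to treat each whole symptom phrase as a single vocabulary token rather than splitting on whitespace:

```python
# Minimal preprocessing sketch; raw_data is a made-up stand-in for my scraped data.
from itertools import combinations

raw_data = {
    "common cold": ["runny nose", "sore throat", "sneezing", "mild fever"],
    "migraine": ["throbbing headache", "nausea", "light sensitivity"],
}

# Symptom-level vocabulary: one id per whole symptom phrase (0 reserved for padding),
# so "sore throat" gets a single id instead of two word-level tokens.
symptom_vocab = {s: i + 1 for i, s in enumerate(
    sorted({s for symptoms in raw_data.values() for s in symptoms})
)}

# Label dictionary: one integer index per (possibly multi-word) disease name.
label_index = {d: i for i, d in enumerate(sorted(raw_data))}

# Augmentation: every combination of at least two symptoms keeps the same label.
examples, labels = [], []
for disease, symptoms in raw_data.items():
    for r in range(2, len(symptoms) + 1):
        for combo in combinations(symptoms, r):
            examples.append([symptom_vocab[s] for s in combo])
            labels.append(label_index[disease])

print(examples[0], labels[0])
```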
I pass the input through a 64-dimension embedding layer, GlobalAveragePooling1D, a Dense layer with 100 ReLU units, and an output Dense layer with 261 softmax units (one per disease).
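In Keras that looks roughly like the following. The `vocab_size` and `max_len` values are placeholders; 261 is the number of disease classes, and the inputs are the symptom-id sequences from the preprocessing step, padded to a fixed length (e.g. with `tf.keras.preprocessing.sequence.pad_sequences`):

```python
import tensorflow as tf

vocab_size = 1000   # placeholder: number of distinct symptom tokens + 1 for padding
max_len = 20        # placeholder: padded length of the symptom-id sequences
num_classes = 261   # one output per disease label

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(vocab_size, 64),          # 64-dim symptom embeddings
    tf.keras.layers.GlobalAveragePooling1D(),            # average over the symptoms
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

# Labels are integer indices from the label dictionary, hence the sparse loss.
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```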
I got 90% train and val accuracy. Is this approach right?
Another thing I wanted to do is predict symptoms from free-text utterances. Any suggestions?