Need advice on my personal project model

Hello Folks,

Kinda overwhelmed with the knowledge I have gained after completing the NLP course. I wanted to pause for a while and build a project of my own. Last week I came up with the idea of building a disease prediction model, and I would like some feedback on it. I scraped diseases and their symptoms from the internet. The symptoms for each disease are separated by commas, and their order carries no meaning.

My questions are:

  1. Symptoms and disease names can be multi-word. Is there any workaround to tokenize multi-word terms, or any other workaround in TensorFlow?
  2. I guess using an LSTM is pretty useless, since the symptoms are comma-separated in no particular order. Comments?
  3. I multiplied my dataset by using symptoms in various combinations for every disease.
  4. My approach so far has been to tokenize each symptom. I did not tokenize the labels; instead I created a dictionary that maps each disease to an index. This is because some disease labels are multi-word, so I cannot pass a token array as the label.
  5. That’s it
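
To make points 1, 3 and 4 concrete, here is a rough sketch in plain Python. It is only one possible reading of the approach: the two sample rows, the helper names (`split_symptoms`, `augment`) and the choice of "all subsets with at least two symptoms" as the way to multiply the data are assumptions for illustration, not the actual dataset or code.

```python
from itertools import combinations

# Hypothetical scraped rows: disease name -> comma-separated symptom string.
raw = {
    "common cold": "runny nose, sore throat, cough",
    "migraine": "severe headache, nausea, sensitivity to light",
}

def split_symptoms(text):
    """Split on commas so a multi-word symptom stays a single token."""
    return [s.strip().lower() for s in text.split(",") if s.strip()]

# Symptom vocabulary; index 0 is left free for padding.
symptom_index = {}
for text in raw.values():
    for symptom in split_symptoms(text):
        symptom_index.setdefault(symptom, len(symptom_index) + 1)

# Label dictionary: every disease name, multi-word or not, maps to one class index.
label_index = {disease: i for i, disease in enumerate(sorted(raw))}

def augment(symptoms, min_size=2):
    """Yield every subset of the symptoms with at least min_size members."""
    for k in range(min_size, len(symptoms) + 1):
        yield from combinations(symptoms, k)

# Each example is (list of symptom ids, disease class index).
examples = []
for disease, text in raw.items():
    for subset in augment(split_symptoms(text)):
        ids = [symptom_index[s] for s in subset]
        examples.append((ids, label_index[disease]))
```

Splitting on commas before building the vocabulary means "sore throat" gets a single id rather than two word ids, so no special multi-word tokenizer is needed for this kind of data.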

I pass the input through an embedding layer (64 dimensions), global average pooling 1D, a dense layer with 100 units and ReLU activation, and an output layer with 261 units and softmax activation.
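
In Keras, that stack might look roughly like the sketch below. The layer sizes come from the description above; `VOCAB_SIZE`, the optimizer, the loss and the toy batch are assumptions added only so the snippet runs.

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 1000   # assumption: number of distinct symptom tokens + 1 for padding
NUM_DISEASES = 261  # number of output classes, as described above

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),   # variable-length id sequences
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),      # 64-dimensional embeddings
    tf.keras.layers.GlobalAveragePooling1D(),       # averages over positions, ignores order
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(NUM_DISEASES, activation="softmax"),
])

# Assumed training setup: integer class labels with sparse categorical cross-entropy.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Toy batch of two zero-padded symptom-id sequences (hypothetical ids).
batch = np.array([[1, 2, 3, 0], [4, 5, 0, 0]], dtype="int32")
probs = model(batch).numpy()  # one probability distribution over diseases per row
```

Because the pooling layer averages the embeddings, the order of the symptoms does not affect the prediction, which fits comma-separated symptoms with no sequence. Note that in this sketch the padding zeros are averaged in as well, since no masking is applied.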

I got 90% train and val accuracy. Is this approach right?

Another thing I wanted to do was to be able to predict symptoms from utterances. Any suggestions?

Please do the following:

  1. Since this topic is not related to course content, please move it to the General Discussions sub-category.
  2. It would help readers understand better if you shared your work and dataset via a link.
  3. Do clarify the following:
    a. I multiplied my dataset by using symptoms in various combinations for every disease.
    b. Another thing I wanted to do was to be able to predict symptoms from utterances.