I want to build a deep learning NLP model with the dataset below. It contains diseases and their corresponding symptoms, like this:
Disease | Symptom_1 | Symptom_2 | Symptom_3 | Symptom_4 | Symptom_5 |
---|---|---|---|---|---|
Fungal infection | itching | skin_rash | nodal_skin_eruptions | dischromic_patches | None |
Fungal infection | skin_rash | nodal_skin_eruptions | dischromic_patches | None | None |
Fungal infection | itching | nodal_skin_eruptions | dischromic_patches | None | None |
Fungal infection | itching | skin_rash | dischromic_patches | None | None |
Fungal infection | itching | skin_rash | nodal_skin_eruptions | None | None |
Before training, I want to preprocess the data so that the symptoms are represented in a format suitable for my deep learning NLP model. I am considering two feature engineering options:
- List of Symptoms for Each Disease:
I could create a new dataset where each row corresponds to a disease, and the symptoms are listed as a string. For example:
Disease | Symptoms |
---|---|
Chronic cholestasis | itching, yellowish skin, nausea, loss of appetite, abdominal pain, yellowing of eyes |
Chronic cholestasis | itching, yellowish skin, nausea, loss of appetite, abdominal pain, yellowing of eyes |
or,
- Transformed Symptom Descriptions:
Alternatively, I could transform the symptoms into a single string description for each disease. For example: "Fungal infection. Itching. Reported signs of dischromic patches. Patient reports no patches in throat. Issues of frequent skin rash. Patient reports no spotting urination. Patient reports no stomach pain. Nodal skin eruptions over the last few days."
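To make the two options concrete, here is a minimal sketch of how I imagine producing each format from the wide table. It rests on several assumptions: the rows are hard-coded copies of the table above, the helper names (`clean_symptoms`, `to_symptom_list`, `to_description`) and the sentence templates are hypothetical, the disease name is kept separately as the label rather than written into the text, and the negated statements ("Patient reports no ...") from my example are left out.

```python
import random

# Two rows copied from the wide table above; "None" is the padding value.
rows = [
    ("Fungal infection",
     ["itching", "skin_rash", "nodal_skin_eruptions", "dischromic_patches", "None"]),
    ("Fungal infection",
     ["skin_rash", "nodal_skin_eruptions", "dischromic_patches", "None", "None"]),
]


def clean_symptoms(symptoms):
    """Drop the "None" padding and turn snake_case tokens into plain words."""
    return [s.strip().replace("_", " ")
            for s in symptoms
            if s and s.strip() != "None"]


def to_symptom_list(symptoms):
    """Option 1: a single comma-separated string of symptoms."""
    return ", ".join(clean_symptoms(symptoms))


# Hypothetical templates; the exact wording is an assumption for illustration.
TEMPLATES = [
    "Patient reports {}.",
    "Issues of {}.",
    "Reported signs of {}.",
]


def to_description(symptoms, rng):
    """Option 2: one templated sentence per symptom, joined into one string."""
    return " ".join(rng.choice(TEMPLATES).format(s)
                    for s in clean_symptoms(symptoms))


rng = random.Random(0)  # fixed seed so the generated text is reproducible
for disease, symptoms in rows:
    print(disease, "|", to_symptom_list(symptoms))
    print(disease, "|", to_description(symptoms, rng))
```

Both functions start from the same cleaned symptom list, so the only difference between the two options in this sketch is how much template text surrounds each symptom.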
My question is: which of these two feature engineering approaches would work better for my deep learning model? I would appreciate any input and insights from the community to help me make an informed decision.