I have a data matrix with patients in rows and gene expression in columns. The patients are at distinct stages of liver disease. However, I want to focus on a subset of genes from the matrix that represents the expression of a particular liver cell type.
I want to use unsupervised learning to cluster the patients with DBSCAN in scikit-learn.
The input matrix is 200 patients by 150 genes. I first transform my data with a 2logr transformation (commonly done in bioinformatics) and then run a grid search over DBSCAN parameters (various eps and min_samples combinations). However, even the best parameter combination labels every sample as noise (-1). I also tried standardizing the input with StandardScaler from scikit-learn, with the same result.
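For reference, here is a minimal sketch of the kind of grid search described above. It is not my exact code: the data is a random Gaussian stand-in with the same shape as my matrix (an assumption, not real expression data), and the `eps`/`min_samples` grids and the helper name `dbscan_grid` are illustrative. It also prints the distance to each sample's k-th nearest neighbour, which is one common way to see whether the `eps` grid is anywhere near the scale of the distances in 150 dimensions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 200 x 150 matrix (assumption: NOT real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 150))
X_scaled = StandardScaler().fit_transform(X)


def dbscan_grid(X, eps_values, min_samples_values):
    """Run DBSCAN for every (eps, min_samples) pair and summarize the labels."""
    results = []
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            n_noise = int(np.sum(labels == -1))
            results.append((eps, min_samples, n_clusters, n_noise))
    return results


# Illustrative grid; these values are assumptions, not the ones I actually used.
results = dbscan_grid(
    X_scaled,
    eps_values=[0.5, 1.0, 2.0, 5.0],
    min_samples_values=[3, 5, 10],
)
for eps, min_samples, n_clusters, n_noise in results:
    print(f"eps={eps:<4} min_samples={min_samples:<3} "
          f"clusters={n_clusters} noise={n_noise}")

# k-distance check: distance from each sample to its k-th nearest neighbour.
# n_neighbors is k + 1 because each sample is returned as its own neighbour.
k = 5
distances, _ = NearestNeighbors(n_neighbors=k + 1).fit(X_scaled).kneighbors(X_scaled)
kth = np.sort(distances[:, -1])
print(f"{k}-th neighbour distance: min={kth[0]:.2f}, median={np.median(kth):.2f}")
```

On data like this, every `eps` in the grid is far below the smallest k-th neighbour distance, so every sample comes back as noise, which is the same symptom I see. With 150 standardized features, typical pairwise Euclidean distances are on the order of sqrt(2 * 150), roughly 17, so a grid that stops at small `eps` values cannot find any dense region.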
I then ran UMAP to reduce the dimensionality of my data and ran DBSCAN on the result, which found 4 clusters with only 9 samples labelled as noise (-1). My question is whether clustering on the 2 dimensions obtained from UMAP is acceptable, given that I don't find any clusters in the original data (150 dimensions).
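To make the second pipeline concrete, here is a rough sketch of UMAP followed by DBSCAN. Several things are assumptions rather than my actual setup: the data comes from `make_blobs` as a stand-in, the UMAP settings (`n_neighbors=15`, `min_dist=0.0`) and the DBSCAN settings (`eps=0.5`, `min_samples=5`) are illustrative, and the helper `embed_2d` is hypothetical. If the `umap-learn` package is not installed, the helper falls back to PCA only so that the sketch still runs; PCA is not equivalent to UMAP. The last step scores the embedding-derived labels in the original 150-dimensional space, which is one possible sanity check on whether the clusters reflect more than the embedding itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with built-in group structure (assumption: NOT real data).
X, _ = make_blobs(n_samples=200, n_features=150, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)


def embed_2d(X, random_state=0):
    """Reduce X to 2 dimensions with UMAP, or with PCA if umap-learn is missing."""
    try:
        import umap  # provided by the umap-learn package
    except ImportError:
        # Fallback so the sketch still runs; PCA is NOT a substitute for UMAP.
        return PCA(n_components=2, random_state=random_state).fit_transform(X)
    reducer = umap.UMAP(
        n_components=2, n_neighbors=15, min_dist=0.0, random_state=random_state
    )
    return reducer.fit_transform(X)


embedding = embed_2d(X_scaled)

# Cluster in the 2-D embedding; eps is in embedding units, not expression units.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(embedding)

clustered = labels != -1
n_clusters = len(set(labels[clustered]))
n_noise = int(np.sum(~clustered))
print(f"clusters={n_clusters} noise={n_noise}")

# Sanity check: score the embedding-derived labels in the ORIGINAL space.
if n_clusters >= 2:
    score = silhouette_score(X_scaled[clustered], labels[clustered])
    print(f"silhouette in original space: {score:.3f}")
```

Repeating this with several `random_state` and `n_neighbors` values and checking whether the same patients keep landing together would be another way to judge how much the clusters depend on one particular embedding.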