As my title states, my question concerns the method Andrew Ng used, and the method we should use, for choosing a dataset to compute the entropy and information gain at each node, which ultimately determines how we build our decision tree model.

So, did Andrew choose the 10 images (5 cats and 5 dogs) for his dataset examples randomly or not? If not, if it was indeed intentional, then:

a). How large was the initial dataset from which he took only those 10 images?

b). Why 10?

c). Why 5 cats and 5 dogs? Is it important to have equal numbers of each class?

d). What percentage of our own dataset should we use to compute the entropy and information gain for the various nodes?

e). What would have happened if, besides cats and dogs, he had also had monkeys, mice, donkeys, and cows? How would the decision tree have changed?

I'm asking the last question, "e).", because in Andrew's example the tree was binary. At least up to where I've reached in the course (Decision Tree Learning → Putting it together), there's no mention of how to build a decision tree other than the binary classification one Andrew used in this example.
How do you deal with dozens of objects grouped into 6 classes? Binary classification could lead to a very big decision tree, couldn't it?

If it was a random set (in terms of the total number and how many of each kind) taken from the whole dataset, then wouldn't various scenarios like 3 cats and 7 dogs, or 6 cats and 4 dogs, change the calculated entropy and information gain, and ultimately our choice of nodes for the tree? If so, shouldn't there be a certain number of such trials, whose results somehow decide how we should connect our nodes?
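To see for myself how the class proportions change the numbers, here is a small sketch of the standard Shannon entropy formula in Python (not from the course materials, just the textbook definition):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:  # skip empty classes; 0 * log2(0) is taken as 0
            p = c / total
            h -= p * math.log2(p)
    return h

# A 5/5 split is maximally impure; skewed splits have lower entropy
print(entropy([5, 5]))  # 1.0
print(entropy([3, 7]))  # ≈ 0.881
print(entropy([6, 4]))  # ≈ 0.971
```

So yes, 3 cats and 7 dogs would give a different entropy than 5 and 5, which is part of what I'm asking about.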

I think the course team chose a very simple and manageable dataset just so they could show all the details on the slides to demo the ideas. It is NOT the case that when we train or test a model we need to pick 10 and only 10 images from our dataset.

The Course 2 Week 3 video "Model selection and training/cross validation/test sets" suggested splitting our dataset in a 60:20:20 ratio into a training, a cv, and a test set. This is one way to split it. The idea is that your cv set and your test set have to be large enough to be statistically representative. If your dataset is very small, say just 5 images, then no split will be good, because the resulting cv and test sets are too small to be representative.

If you have 6 classes, then it is no longer one binary classification problem. It is a multi-class classification problem.

In a multi-class problem, each leaf node can predict a cat, a dog, a mouse, a donkey, or a cow. The cost function will not just take the entropy of cats and dogs into account, but also the other classes. So, the formula in this slide (image not shown) has to be extended to sum over all the classes, not just two.
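That multi-class entropy is the same formula summed over every class, \(H = -\sum_k p_k \log_2 p_k\). A quick sketch (my own illustration, not from the slides):

```python
import math

def multiclass_entropy(counts):
    """Entropy over all classes: H = -sum_k p_k * log2(p_k)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)

# Binary case from the lecture: 5 cats, 5 dogs
print(multiclass_entropy([5, 5]))          # 1.0
# Five classes, two examples each: entropy = log2(5) ≈ 2.32
print(multiclass_entropy([2, 2, 2, 2, 2]))
```

Information gain is then computed exactly as in the binary case: the parent node's entropy minus the weighted average entropy of the child nodes.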

You've done AMAZING work in your reply. Thanks to you, I think I now have answers to all of my questions.
Thank you for your time, Raymond.