Prof. Andrew Ng mentions that there is a good mathematical reason why the logistic regression loss formula has a form similar to the entropy formula for decision trees. Can someone please explain that reasoning or point me to good resources for understanding the mathematical derivation?

Thank you @TMosh for the link. It makes clear, with a few examples, how information gain and entropy are used in decision trees. What is not clear to me yet is why the loss function for logistic regression takes a similar form. For one, the entropy curve has a single maximum rather than a single minimum. How would gradient descent find a minimum on such a curve? (Unless the logistic loss curve is inverted compared to the entropy curve.)

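Maximum and minimum differ only algebraically. You can convert between concave and convex curves easily, by subtracting from 1 or multiplying by -1, depending on the situation. Note also that the two curves are plotted against different things: entropy is a function of the fraction of positive examples in a node, while the logistic loss for a fixed label is a function of the model's prediction, and gradient descent minimizes the latter.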
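To make the connection concrete, here is a minimal sketch in plain Python. The function names (`binary_entropy`, `log_loss`, `expected_log_loss`) are my own for illustration, not from the course notebooks. It assumes a fraction `p` of the labels are 1 and that the model outputs the same prediction `f` for all of them.

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a node whose fraction of positive examples is p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def log_loss(f, y):
    """Logistic loss for one example: prediction f in (0, 1), label y in {0, 1}."""
    return -y * math.log(f) - (1 - y) * math.log(1 - f)

def expected_log_loss(f, p):
    """Average loss of predicting f when a fraction p of the labels are 1."""
    return p * log_loss(f, 1) + (1 - p) * log_loss(f, 0)

# Entropy as a function of p is concave, with its maximum at p = 0.5.
print(binary_entropy(0.1), binary_entropy(0.5), binary_entropy(0.9))

# Expected loss as a function of the prediction f is convex, with its minimum at f = p.
p = 0.3
grid = [i / 100 for i in range(1, 100)]
best_f = min(grid, key=lambda f: expected_log_loss(f, p))
print(best_f, expected_log_loss(best_f, p))

# At that minimum the loss equals the entropy of p (converted from bits to nats).
print(binary_entropy(p) * math.log(2))
```

So the two formulas look alike because the loss, evaluated at the best possible prediction, is the entropy. The entropy curve is traced out over `p`, while gradient descent moves along `f` (through the model parameters), where the curve has a minimum.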

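The purpose of the log() function is to create rapidly growing penalties as the errors increase: the loss is close to 0 for a confident correct prediction and grows without bound as the prediction approaches the wrong extreme.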
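As a rough illustration of how the penalty behaves, here is a small sketch for a single example whose true label is 1. The helper name `penalty_for_positive` is hypothetical and the prediction values are arbitrary.

```python
import math

def penalty_for_positive(f):
    """Logistic loss when the true label is 1 and the model predicts f in (0, 1)."""
    return -math.log(f)

# The further the prediction is from the true label of 1, the larger the penalty.
for f in (0.9, 0.5, 0.1, 0.01, 0.001):
    print(f"prediction={f:<6} penalty={penalty_for_positive(f):.3f}")
```

With the natural log, each tenfold drop in the predicted probability adds about 2.3 to the loss, and there is no upper limit as the prediction approaches 0.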