Using tanh vs. sigmoid for output layer

I don’t know if there are any “simple words” that will suffice here, but “cross entropy loss” (also sometimes called “log loss”) is a function derived from the statistical principle of maximum likelihood estimation. That idea long predates machine learning — it was formalized by R. A. Fisher in the early 20th century — so it’s not something new that just popped into somebody’s mind for neural networks. Prof Ng explains it in the Week 2 lectures, and here’s a thread from Mentor Raymond that gives a really nice explanation. Sorry, but as warned above, neither of those probably qualifies as “simple words”.
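To make the connection concrete, here is a minimal sketch (my own illustration, not Prof Ng’s exact notation) of binary cross entropy as the negative log-likelihood of a Bernoulli model: for each sample, if the label is 1 we pay `-log(p)`, and if the label is 0 we pay `-log(1 - p)`, then average over the batch.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average log loss for labels in {0, 1} and predicted probabilities in (0, 1)."""
    # Clip predictions away from 0 and 1 so log() never sees an exact zero.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Example: confident, mostly correct predictions give a small loss.
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.6])
loss = binary_cross_entropy(y, p)
```

Note that the formula assumes `y_pred` is a probability in (0, 1) — which is exactly what sigmoid produces. A tanh output can be negative, and `log` of a negative number is undefined, which is one practical reason this loss is paired with a sigmoid output layer.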

Here’s another thread that discusses this and shows some graphs.