Help! What is the base of log in logistic regression activation function?

When learning math, if we write log(y), we assume by default that it means log_e(y); that is, the base of the log is e. But I heard that it is different in deep learning. Can anyone explain what the base of log is in deep learning?
Thanks a lot.

Hi @Carrie_Young,

I believe log base e is used in DLS. I’m afraid I can’t be of much use here, as I don’t remember where, but Andrew mentions it somewhere in one of the lecture videos.

@paulinpaloalto, @kenb, can you please confirm ?

Thanks,
Mubsi


Hi, @Carrie_Young !

As a quick note, log_e, or ln, is normally used. Check the tf documentation for further explanation.


Confirmed, Mubsi. The so-called “natural logarithm” with base e=2.718 ... is used exclusively
in the DLS (if memory serves!). Base 2 logarithms have their uses in information theory, which could be part of the machine learning universe, but not here.


Yes, the notation in the ML/DL world is different from that in the math world, which looks confusing at first. In math, log means log base 10 and ln is used for the natural log. But in the ML/DL world, log always means the natural logarithm.

The reason the natural log is preferred becomes clear once you get into the algorithms: we use logs in the loss function, and the key point is that we need to take derivatives of the loss function to implement back propagation, which is the fundamental technique on which learning is built. The beauty of using base e is that the derivatives are nice and clean. If you used log_{10}, you’d get an extra constant factor every time you take a derivative. The fundamental mathematical properties are the same, meaning you would gain no advantage from using base 10 logs; it would just make a mess with all the extra constant factors flying around.

I’m not sure historically where this notational difference arose, but one theory would be that 10 or 15 years ago, a lot of the ML work was done using MATLAB and in MATLAB the function names are log for natural log and log10 for base 10 logs. E.g. Prof Ng’s original Stanford Machine Learning course, which I think was first published in 2011 or maybe 2012, used MATLAB as the implementation language, so it would have been natural (pun intended :nerd_face:) to use the same function name that MATLAB used. Of course these days we’re using python and it’s the same there: np.log is natural log and np.log10 is base 10 log.
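You can verify the numpy naming convention directly; it matches the MATLAB convention described above:

```python
import numpy as np

# np.log is the natural logarithm (base e), np.log10 is base 10
print(np.isclose(np.log(np.e), 1.0))       # True
print(np.isclose(np.log10(100.0), 2.0))    # True

# Other bases come from the change-of-base formula
print(np.isclose(np.log(8.0) / np.log(2.0), 3.0))  # True: log base 2 of 8
```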


What does “back propagation” mean - that term hasn’t been mentioned by Andrew so far in Week 3.

He explains it in the lectures, but maybe you haven’t gotten to that one yet. It is using the gradients (partial derivatives) of the cost function w.r.t. the various trainable parameters of your model to push the parameters in the direction of a better solution, meaning a solution with a lower value of the cost function.

Isn’t that gradient descent?

Yes, but there are layers to this. Back propagation is just the calculation of the derivatives and applying them. Gradient Descent is applying back propagation in an iterative process. Each iteration of GD uses BP to modify the weights.

What do you mean by “…layers”?

@ai_is_cool, you have been posting on the MLS forum, so I suppose you meant MLS Course 1 Week 3. That course does not cover neural networks, and we need to wait until Course 2 Week 2 to go into back propagation.

This post is in the DLS forum and back propagation is taught starting from DLS’s course 1.

Layers will also be covered in the MLS Course 2.

What do you mean by “…applying them”?

Please can you explain what you mean in mathematical terms?

Sorry, I accidentally forgot that you were taking MLS and not DLS, so I don’t know what Prof Ng says about all this in MLS. This is all covered in great detail in DLS.

But if you already know what Gradient Descent is and have seen the lectures on how that works, then I hope you already know that there are these steps in each iteration:

  1. Forward propagation
  2. Compute the cost
  3. Back propagation to compute the gradients
  4. Update parameters by applying the gradients
  5. Repeat from step 1 if the convergence is not good enough yet

The update parameters step looks like this:

W = W - \alpha * dW

For each parameter W, whatever the notation is for the parameters in this section of MLS.

What is happening there is that the gradient dW is the partial derivative of the cost J w.r.t. W, and by definition it points in the direction of most rapid increase of J. Of course, what we want is to go in the exact opposite direction to get the most rapid decrease of J, right? That’s why we multiply by -1 there: the vector -v points in the exact opposite direction from the vector v.

Then \alpha is a hyperparameter called the “learning rate”: it is a scaling factor that controls how big a step you take on each iteration. The cost surfaces can be pretty complex, so you don’t want to take too big a step on any iteration or you may fall off a cliff. Meaning that rather than converging with smaller values of J on each iteration, the algorithm may diverge or oscillate if \alpha is chosen too large. And too small an \alpha costs more compute, because it takes more iterations to reach a good solution. So you need to find a “Goldilocks” value of \alpha that is “just right”.

There are more sophisticated versions of Gradient Descent that use an adaptive strategy to compute a dynamic value, rather than using a fixed \alpha, but that is a more advanced topic that is probably not covered in MLS.

No problem Paul,

Thanks for taking the time to respond and explain.

Unfortunately, the terms “…forward propagation…” and “…back propagation…” have not been used by Prof. Ng so far in Course 1 Week 3 of the MLS course, so it’s still not clear to me what they mean in the context of ML GD.

According to ChatGPT, these terms are used in the context of neural networks.

I’m not sure what you mean by “… points in the direction of…”.

I’m also not sure what you mean by “…hyperparameter…” or what this vector v is.

Also, I don’t understand why the cost function is computed every iteration, as I don’t remember seeing that take place in the ML GD algorithm - just the evaluation of the partial derivatives of the cost function during each weight parameter w_j update.

Stephen.

The thread that you tagged onto here is from DLS, so it assumes you are familiar with the terminology as it is presented in DLS. Since you’re currently working on MLS, maybe the best idea is just to “hold that thought” and take DLS when you finish with MLS.