Same cost function for tanh function at output layer as sigmoid?

Can I get some insight or help on the question below?

I am building a NN model for financial assets.
The input is the return of a financial asset (it varies from some negative percentage to some positive percentage).
The output is also the return of a financial asset (it varies from some negative percentage to some positive percentage).

Since I want the output to vary from some negative number to some positive number, I would like to use tanh as the output layer activation function (GL) to get Y hat.
Can I use the same cost function L as for sigmoid?
" L = -(y*log(a) + (1-y)*log(1-a)) "

It sounds like you have a “regression” problem, meaning that your output is a continuous real number which also has the property that it can be either positive or negative. The “log loss” or “cross entropy” cost function is probably not useful at all: it is intended for classification problems meaning that there are discrete answers (“Yes/No” or a category). It’s also not clear that you would want tanh as the output activation since it flattens out away from the origin. Farther away from the origin is even better, if it’s to the right, right? Have a look at the shape of the curve. And bad if it’s to the left. So I’d say you need something more like Euclidean distance as your output metric, but I’ve never dealt with a case in which the values could also be negative. Needs some more thought!

If you want to discuss more, it would help to know what your training data looks like.

Thanks for your reply. It sounds more complicated than I thought.

I am building a supervised NN model. The inputs (X) are the daily returns of 200 assets and Y is the daily return of a target asset. So basically I want to predict a target asset’s daily return (Y hat) from the returns of 200 other assets (X). The number of training examples is around 700-900. Currently I have 2 hidden layers.

Above you asked, “Farther away from the origin is even better, if it’s to the right, right?” No. I need my Y hat to be as close to Y as possible, whether it is negative or positive. I am not trying to make Y hat a large positive number. I just need an accurate Y hat that is as close to Y as possible.

My inputs are the daily returns of 200 financial assets. They are mostly numbers from -0.1 to +0.1 (their absolute values rarely go over 10 percent). Y is also the return of the target asset. It mostly varies between -0.1 and +0.1 (its absolute value rarely goes over 10 percent).

Hope this helps.

In my opinion, what you need in the output is just a simple Dense layer, and since you want to measure how close the prediction is to the target, you could use mean_absolute_error as the loss.

In any case, even when the inputs to the model look similar (in the range -0.1 to +0.1), I would normalize the input features before feeding them to the model. That makes the model more stable and makes convergence easier.
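A minimal sketch of the normalization step described above, assuming NumPy arrays with one row per day and one column per asset. The array names, the synthetic data, and the 600/200 split are made up for illustration. The mean and standard deviation are computed on the training rows only and then reused for the held-out rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 800 days of returns for 200 assets, mostly within +/-0.1.
X = rng.normal(loc=0.0, scale=0.02, size=(800, 200))
X_train, X_val = X[:600], X[600:]

# Per-asset statistics, computed from the training rows only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-8  # small constant guards against division by zero

X_train_norm = (X_train - mu) / sigma
X_val_norm = (X_val - mu) / sigma  # reuse the training statistics
```

After this step each training column has mean close to 0 and standard deviation close to 1.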

Thanks for your reply.

I have some more questions.
What do you mean by “just a simple Dense layer”?
Do you mean sigmoid, tanh or relu as the activation function? Or no activation function at all, leaving only the linear part (AL = ZL) at the output layer?

Using mean_absolute_error as the cost function seems good to me since I want to find the closest Y hat.
Then I will need the derivative of this cost function.
I googled it and found the derivative of the loss function, dL/dA, as below.
dL/dA = +1 where Y_hat > Y
dL/dA = -1 where Y_hat < Y

Is this the derivative you have in mind when using mean_absolute_error?

A ‘simple Dense layer’ just means a normal fully connected feed forward network layer. We use that term more frequently once we get to Convolutional Nets in Course 4, because then we have several different types of layers to consider (convolutional, pooling or “dense”/“fully connected”). Since your output is a real number that can be either positive or negative, it probably will work fine just to use the linear activation only (Z = W · A + b) as the output layer. Adding sigmoid or tanh will just distort the results, since you want distance based cost. And ReLU will obviously kill your ability to do anything with negative outputs, which sounds like it is important in your case.
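To make the point about output activations concrete, here is a small NumPy sketch. The values in `Z` are hypothetical pre-activation outputs chosen only for illustration, not data from the thread.

```python
import numpy as np

# Hypothetical linear outputs Z = W · A + b of the last layer.
Z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

linear = Z                           # no activation: A = Z, any real value passes through
relu = np.maximum(Z, 0.0)            # every negative value becomes 0
sigmoid = 1.0 / (1.0 + np.exp(-Z))   # squeezed into (0, 1), never negative
tanh = np.tanh(Z)                    # squeezed into (-1, 1)

for name, a in [("linear", linear), ("relu", relu), ("sigmoid", sigmoid), ("tanh", tanh)]:
    print(f"{name:8s}", np.round(a, 4))
```

Only the linear output keeps both the sign and the magnitude of Z, which is what a distance-based cost compares against the target.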

Yes, you have the derivative described correctly for mean absolute error as the cost. Of course it’s not differentiable where Y_hat = Y (the error is exactly 0), but ReLU is not differentiable at z = 0 either. You can consider either of the “limit” values as the derivative at that point, just as we do with ReLU, and it should be fine. Give the above recipe from @albertovilla a try and let us know what happens!
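For a hand-written implementation, the cost and its derivative could look roughly like the NumPy sketch below. The function names and the sample values are illustrative assumptions. Because the cost is the mean over m examples, each per-example derivative of +1 or -1 is divided by m. Note that `np.sign` returns 0 where Y_hat equals Y, which is a slightly different choice from the one-sided limits but is also a valid value at the kink.

```python
import numpy as np

def mae_cost(y_hat, y):
    """Mean absolute error over all examples."""
    return np.mean(np.abs(y_hat - y))

def mae_grad(y_hat, y):
    """Derivative of mae_cost with respect to y_hat.

    +1/m where y_hat > y, -1/m where y_hat < y, and 0 where they are equal.
    """
    return np.sign(y_hat - y) / y.size

# Hypothetical targets and predictions on the scale of daily returns.
y = np.array([0.01, -0.03, 0.05, 0.00])
y_hat = np.array([0.02, -0.05, 0.05, -0.01])

cost = mae_cost(y_hat, y)
grad = mae_grad(y_hat, y)
print(cost, grad)
```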

Thanks Paulinpaloalto and Albertovilla!

I will try this and see how it goes.

Hi Paulinpaloalto/Albertovila,

I am currently trying to change the cost function from binary cross entropy to mean_absolute_error in the Course 2 homework with TensorFlow.

def compute_cost(logits, labels):
    cost = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true=labels, y_pred=logits, from_logits=True))
    return cost

Above is the original code, and I found tf.keras.losses.MeanAbsoluteError as the code for MAE.
How can I change this cost function?


You can just replace the function, but it would be a good idea to read the documentation for both functions carefully to make sure that their parameters are the same. The question is why you would want to do that. This is a classification problem and there is a reason why they use “cross entropy” loss for classification problems. But if your goal is to see how it works and thus convince yourself of why cross entropy is better, then it could be a worthwhile educational exercise.

Note that if you were writing the code by hand in python, then there would be a second step: you would need to implement the gradients for the new cost function as well. But when you use TensorFlow, it calculates the gradients for you, so you don’t need to do anything other than switch the cost function.
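As a rough sketch of what the swap could look like, the function below computes MAE with basic TensorFlow ops instead of the Keras loss class mentioned above. The name `compute_cost_mae` and the sample tensors are assumptions for illustration, and the shapes are not taken from the homework. The `tf.GradientTape` part shows that the gradient comes out without any hand-written derivative.

```python
import tensorflow as tf

def compute_cost_mae(predictions, labels):
    """Mean absolute error between linear outputs and real-valued targets."""
    labels = tf.cast(labels, predictions.dtype)
    return tf.reduce_mean(tf.abs(labels - predictions))

# Hypothetical predictions (a Variable so the tape tracks it) and targets.
predictions = tf.Variable([[0.02, -0.05, 0.05]])
labels = tf.constant([[0.01, -0.03, 0.02]])

with tf.GradientTape() as tape:
    cost = compute_cost_mae(predictions, labels)

# TensorFlow derives d(cost)/d(predictions) automatically.
grad = tape.gradient(cost, predictions)
print(float(cost), grad.numpy())
```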

Hi Paulinpaloalto,

As you advised above, I have been building a model using mean absolute error as the cost function, with no activation function at the last layer (AL = ZL).
But I also found that the Course 2, Week 3 TensorFlow homework doesn’t use an activation function at the last layer either.

with tf.GradientTape() as tape:
    # 1. predict
    Z3 = forward_propagation(tf.transpose(minibatch_X), parameters)

    # 2. loss
    minibatch_cost = compute_cost(Z3, tf.transpose(minibatch_Y))

How is this possible? Does tf.keras.losses.binary_crossentropy have an embedded sigmoid function?
It seems I am missing something.

Yes, the cross entropy loss functions have the ability to do the activation of the output layer for you internally (sigmoid in the binary case, softmax in the categorical case). You did read the documentation, right? Look at the description of the from_logits “named” parameter. It turns out that it is common practice to use from_logits = True in those cases: it’s less code and it’s more numerically stable, so why wouldn’t you use that method? This was also demonstrated in the code segment you showed earlier on this thread. But if you want to experiment, you can use the default value of from_logits = False and add your own sigmoid function.
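The NumPy sketch below illustrates the idea behind computing the loss from logits. It is not TensorFlow's actual code: the function names are made up, and `bce_from_logits` uses a commonly cited stable rewrite of the same formula. For ordinary logits the two routes agree, while for an extreme logit the two-step route hits log(0).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_probs(a, y):
    # Textbook form, expects probabilities a = sigmoid(z).
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def bce_from_logits(z, y):
    # Algebraically the same loss, written so it never takes log(0).
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z = np.array([-2.0, 0.5, 3.0])   # hypothetical logits
y = np.array([0.0, 1.0, 1.0])    # binary labels

two_step = bce_from_probs(sigmoid(z), y)  # sigmoid first, then the loss
one_step = bce_from_logits(z, y)          # loss directly on the logits

# Confidently wrong prediction: logit 40 with label 0.
with np.errstate(divide="ignore"):
    naive_extreme = bce_from_probs(sigmoid(np.array([40.0])), np.array([0.0]))
stable_extreme = bce_from_logits(np.array([40.0]), np.array([0.0]))

print(two_step, one_step)
print(naive_extreme, stable_extreme)
```

In float64, sigmoid(40) rounds to exactly 1.0, so the two-step version evaluates log(0) and returns infinity, while the logits version returns a finite loss of about 40.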

But my understanding from your earlier discussions on this thread is that cross entropy loss is not relevant in your case. The best strategy for your problem would be not to use a non-linear activation at the output layer at all: just feed the logits (the linear output) to your MAE cost function.

Thanks Paulin Paloalto,

After reading your reply and the documentation, everything is now clear about what “from_logits=True/False” does.

With your help, I was able to build my model with MAE cost function. Really appreciate it!