I am building an NN model for financial assets.
The input is the return of a financial asset (ranging from some negative percentage to some positive percentage).
The output is also the return of a financial asset (ranging from some negative percentage to some positive percentage).

Since I want the output to vary from some negative number to some positive number, I would like to use the tanh function as the output-layer activation (g[L]) to get Y hat.
Can I use the same cost function L as for sigmoid?
" L = -( y*log(a) + (1-y)*log(1-a) ) "

It sounds like you have a "regression" problem, meaning that your output is a continuous real number which can be either positive or negative. The "log loss" or "cross entropy" cost function is probably not useful at all: it is intended for classification problems, meaning that there are discrete answers ("Yes/No" or a category). It's also not clear that you would want tanh as the output activation, since it flattens out away from the origin. Have a look at the shape of the curve: farther away from the origin is even better if it's to the right, right? And bad if it's to the left. So I'd say you need something more like Euclidean distance as your output metric, but I've never dealt with a case in which the values could also be negative. Needs some more thought!

If you want to discuss more, it would help to know what your training data looks like.

Thanks for your reply. It sounds more complicated than I thought.

I am building a supervised NN model. The inputs (X) are the daily returns of 200 assets, and my Y is the daily return of a target asset. So basically I want to predict a target asset's daily return (Y hat) by inputting the returns of 200 other assets (X). The number of training examples is around 700-900. Currently I have 2 hidden layers.

Above you asked "Farther away from the origin is even better, if it's to the right, right?" No. I need my Y hat as close to Y as possible, whether it is negative or positive. I am not trying to make Y hat a big positive number. I just need an accurate Y hat which is closest to Y.

My inputs are the daily returns of 200 financial assets. They are mostly numbers from -0.1 to +0.1 (their absolute values barely go over 10 percent). Y is also the return of the target asset. It mostly varies between -0.1 and +0.1 (its absolute value barely goes over 10 percent).

In my opinion what you need at the output is just a simple Dense layer, and since you want to measure how close the prediction is to the target, you could use mean_absolute_error as the loss.

In any case, even though the inputs to the model look similar in scale (in the range -0.1 to +0.1), I would normalize the input features prior to feeding them to the model; that makes training more stable and convergence easier.
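A minimal sketch of that recipe in Keras (the layer sizes, epochs, and synthetic data here are my own assumptions for illustration, not from this thread):

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-ins for the real data: 800 days of returns for
# 200 input assets (X) and one target asset (y), roughly in [-0.1, 0.1]
rng = np.random.default_rng(0)
X = rng.normal(0.0, 0.02, size=(800, 200)).astype("float32")
y = rng.normal(0.0, 0.02, size=(800,)).astype("float32")

# Normalize input features (zero mean, unit variance per asset)
mean, std = X.mean(axis=0), X.std(axis=0) + 1e-8
X_norm = (X - mean) / std

model = tf.keras.Sequential([
    tf.keras.Input(shape=(200,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    # Linear output (no activation), so Y hat can be negative or positive
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mean_absolute_error")
model.fit(X_norm, y, epochs=2, batch_size=32, verbose=0)
```

The final Dense(1) with no activation is the "simple Dense layer": it outputs the raw linear value, which the MAE loss compares directly to Y.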

I have some more questions.
What do you mean by "just a simple Dense layer"?
Do you mean sigmoid, tanh or relu as the activation function? Or should I use no activation function and leave only the linear regression part (AL = ZL) at the output layer?

Using mean_absolute_error as the cost function seems good to me, since I want to find the closest Y hat.
Then I will need the derivative of this cost function.
I googled it and found the derivative of the loss function, dL/dA, as below:
dL/dA = +1 where Y_hat > Y
dL/dA = -1 where Y_hat < Y

Is this the derivative you have in mind when using mean_absolute_error?

A "simple Dense layer" just means a normal fully connected feed forward network layer. We use that term more frequently once we get to Convolutional Nets in Course 4, because then we have several different types of layers to consider (convolutional, pooling or "dense"/"fully connected"). Since your output is a real number that can be either positive or negative, it probably will work fine just to use the linear activation only (Z = W \cdot A + b) as the output layer. Adding sigmoid or tanh will just distort the results, since you want a distance-based cost. And ReLU will obviously kill your ability to do anything with negative outputs, which sounds like it is important in your case.

Yes, you have the derivative described correctly for mean absolute error as the cost. Of course it's not differentiable at the point where Y_hat = Y, but neither is ReLU at z = 0. You can consider either of the "limit" values as the derivative there, just as we do with ReLU, and it should be fine. Give the above recipe from @albertovilla a try and let us know what happens!
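That subgradient is easy to sanity-check numerically; here is a small NumPy sketch (the function names are mine):

```python
import numpy as np

def mae(y_hat, y):
    # Mean absolute error over all examples
    return np.mean(np.abs(y_hat - y))

def mae_grad(y_hat, y):
    # Subgradient of MAE w.r.t. y_hat: +1/n where y_hat > y,
    # -1/n where y_hat < y, and 0 chosen at the non-differentiable tie
    return np.sign(y_hat - y) / y_hat.size

y = np.array([0.01, -0.02, 0.03])
y_hat = np.array([0.02, -0.05, 0.03])
print(mae_grad(y_hat, y))  # approx [ 1/3, -1/3, 0 ]
```

The sign function gives exactly the +1 / -1 pattern above (scaled by 1/n because the cost averages over n examples), and picking 0 at the tie is one of the valid "limit" choices.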

You can just replace the function, but it would be a good idea to read the documentation for both functions carefully to make sure that their parameters are the same. The question is why you would want to do that. This is a classification problem, and there is a reason why they use "cross entropy" loss for classification problems. But if your goal is to see how it works and thus convince yourself of why cross entropy is better, then it could be a worthwhile educational exercise.

Note that if you were writing the code by hand in python, then there would be a second step: you would need to implement the gradients for the new cost function as well. But when you use TensorFlow, it calculates the gradients for you, so you don't need to do anything other than switch the cost function.
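For example, here is a minimal sketch of swapping in MAE under tf.GradientTape (the toy variable and tensors are my own, for illustration):

```python
import tensorflow as tf

w = tf.Variable(2.0)                  # a single trainable parameter
x = tf.constant([0.5, -1.0, 1.5])
y = tf.constant([0.9, -2.1, 3.2])

with tf.GradientTape() as tape:
    y_hat = w * x                                 # linear "model"
    loss = tf.reduce_mean(tf.abs(y_hat - y))      # swapped-in MAE cost

# TensorFlow differentiates the new cost automatically;
# no hand-written derivative is needed
grad = tape.gradient(loss, w)
```

Changing the loss is just changing the line inside the tape; everything downstream (gradients, optimizer step) works unchanged.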

As you advised me above, I have been building the model using mean absolute error as the cost function, with no activation function at the last layer (ZL = AL).
But I also found that the Course 2, Week 3 TensorFlow homework doesn't use an activation function at the last layer.

with tf.GradientTape() as tape:
    # 1. predict
    Z3 = forward_propagation(tf.transpose(minibatch_X), parameters)
    # 2. loss
    minibatch_cost = compute_cost(Z3, tf.transpose(minibatch_Y))

How is this possible? Does tf.keras.losses.binary_crossentropy have an embedded sigmoid function?
It seems I am missing something.

Yes, the cross entropy loss functions have the ability to do the activation of the output layer for you internally (sigmoid in the binary case, softmax in the categorical case). You did read the documentation, right? Look at the description of the from_logits named parameter. It turns out that it is common practice to use from_logits = True in those cases: it's less code and it's more numerically stable, so why wouldn't you use that method? This was also demonstrated in the code segment you showed earlier on this thread. But if you want to experiment, you can use the default value of from_logits = False and add your own sigmoid function.
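A quick sketch of the two equivalent ways (the toy logits and labels are made up):

```python
import tensorflow as tf

logits = tf.constant([[2.0], [-1.0]])   # raw linear outputs, no sigmoid applied
labels = tf.constant([[1.0], [0.0]])

# Option 1: let the loss apply the sigmoid internally (numerically stable)
loss_logits = tf.keras.losses.binary_crossentropy(labels, logits, from_logits=True)

# Option 2: apply sigmoid yourself and use the default from_logits=False
loss_manual = tf.keras.losses.binary_crossentropy(labels, tf.sigmoid(logits))
```

Both give the same loss values, but Option 1 fuses the sigmoid and the log into one stable computation instead of two separate steps.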

But my understanding from your earlier discussions on this thread is that cross entropy loss is not relevant in your case. The best strategy for your problem would be not to use a non-linear activation at the output layer at all: just feed the logits (the linear output) to your MAE cost function.