Hello -
I made a function that builds a NN, wrapped it in a KerasClassifier, and put that into a pipeline. I did this two ways: first with the "less numerically stable" approach pointed out in class (a 'sigmoid' output layer, without from_logits=True), and again with the "more numerically stable" approach discussed in class (a 'linear' output layer with from_logits=True). This is a binary classification problem, so the differences between the two approaches shouldn't be noticeable. However, I have a question about the output.
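To make sure I understand the two setups, here is a tiny standalone check (toy logits and labels, not from my model) showing that the two loss computations should agree:
import tensorflow as tf
z = tf.constant([[2.0], [-1.0]])  # raw logits
y = tf.constant([[1.0], [0.0]])   # labels
loss_probs = tf.keras.losses.BinaryCrossentropy()(y, tf.nn.sigmoid(z))
loss_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)(y, z)
print(float(loss_probs), float(loss_logits))  # should match up to floating point error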
The training data set is made up and unrealistically small, but it suffices for the purposes of this question. Everything needed to run this is provided. Here are the imports:
import numpy as np
from sklearn.pipeline import make_pipeline
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam
from scikeras.wrappers import KerasClassifier
Here is the data:
X_train_n = np.array([[-0.23287322, 0.29172998, 0.67238949],
[-0.23287322, 1.60451491, 0.95649772],
[-1.0712168 , 0.51052747, 0.8144436 ],
[ 2.00270966, -1.23985243, 0.33145961],
[-0.79176894, -1.23985243, -1.31636815],
[ 0.3260225 , 0.0729325 , -1.45842227]])
y_train = np.array([1, 0, 1, 1, 0, 0])
Here is the "less numerically stable" way of going about it:
def estimator_nn():
    tf.random.set_seed(7)  # fix the seed so results are reproducible
    model = Sequential([
        Dense(12, input_shape=(3,), activation='relu'),
        Dense(1, activation='sigmoid')  # output is already a probability
    ])
    model.compile(loss=BinaryCrossentropy(),
                  optimizer=Adam(0.001))
    return model
model_outside = KerasClassifier(estimator_nn(), epochs=10, verbose=0, batch_size=10)
pipe_nn = make_pipeline(model_outside)
pipe_nn.fit(X_train_n, y_train)
pipe_pred_probs_less_stable = pipe_nn.predict_proba(X_train_n)
The variable pipe_pred_probs_less_stable contains the following predicted probabilities:
array([[0.4867192 , 0.5132808 ],
[0.48584813, 0.5141519 ],
[0.49800926, 0.50199074],
[0.80937004, 0.19062999],
[0.6834421 , 0.31655788],
[0.57961285, 0.42038715]], dtype=float32)
It is my understanding from the sklearn documentation that the 0th column is the probability that a 0 classification will occur and the 1st column is the probability that a 1 will occur. This makes sense because the two columns add to 1.0, which the quick checks below confirm.
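(The second line assumes SciKeras stores the fitted Keras model in the model_ attribute; that is my understanding, but treat it as an assumption.)
print(np.allclose(pipe_pred_probs_less_stable.sum(axis=1), 1.0))  # True: each row sums to 1.0
print(model_outside.model_.predict(X_train_n, verbose=0)[:, 0])   # should match the 1st column above
The issue is when I do the same as above, but using the course's more numerically stable approach, as shown below: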
def estimator_nn():
    tf.random.set_seed(7)  # same seed as before
    model = Sequential([
        Dense(12, input_shape=(3,), activation='relu'),
        Dense(1, activation='linear')  # output is a raw logit, not a probability
    ])
    model.compile(loss=BinaryCrossentropy(from_logits=True),
                  optimizer=Adam(0.001))
    return model
model_outside = KerasClassifier(estimator_nn(), epochs=10, verbose=0, batch_size=10)
pipe_nn = make_pipeline(model_outside)
pipe_nn.fit(X_train_n, y_train)
pipe_pred_not_probs_morestable = pipe_nn.predict_proba(X_train_n)
# the output layer is linear, so apply a sigmoid to turn the values into probabilities
pipe_pred_probs_morestable = tf.nn.sigmoid(pipe_pred_not_probs_morestable).numpy()
The variable pipe_pred_probs_morestable contains the following predicted probabilities:
array([[0.72048414, 0.5132808 ],
[0.71978134, 0.5141519 ],
[0.7294901 , 0.50199074],
[0.92026275, 0.19062999],
[0.85441244, 0.31655788],
[0.7893787 , 0.42038715]], dtype=float32)
I noticed that the 0th column here does not match the 0th column from the less stable approach (the first approach, which didn't use from_logits=True), but the 1st column here does match. Also, the two columns here do not add up to 1.0.
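One guess I have not been able to confirm: maybe predict_proba stacked the single raw output z into [1 - z, z] before I applied the sigmoid, in which case the 0th column above would be sigmoid(1 - z) rather than 1 - sigmoid(z), which would explain why the rows don't sum to 1. A quick way to test that guess on the raw array (before the sigmoid):
raw = pipe_pred_not_probs_morestable
print(np.allclose(raw[:, 0], 1.0 - raw[:, 1]))  # True would support the [1 - z, z] guess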
Hopefully someone can take a look at this and give me some feedback. My questions are:
1.) Is there a reason why the more numerically stable approach gives probabilities that don't add up to 1.0? Maybe it is in my code, or maybe it is something about the approach presented in class that I don't understand.
2.) I could just ignore the 0th column in the second approach (the from_logits=True approach) and use only the 1st column, since it gives the same answer as the first approach (see the sketch after this list). But moving forward with no attempt to understand what is happening seems risky. Are there any insights into what the 0th column in the second approach is giving?
3.) Feel free to make any comments on how I am going about the overall process in my code if something looks wrong.
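Here is the workaround I mean in question 2, just a sketch of what I would do rather than something I am confident is correct:
p1 = pipe_pred_probs_morestable[:, 1]    # the column that matches the first approach
probs = np.column_stack([1.0 - p1, p1])  # rebuild a two-column array whose rows sum to 1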
Thanks.