Outputs for more/less numerically stable binary classification when using KerasClassifier & a pipeline

Hello -
I made a function for a NN that is put into a KerasClassifier and then into a pipeline. I did it first the “less numerically stable” way pointed out in class, where from_logits=True is not used, and again using the “more numerically stable” approach discussed in class (where ‘linear’ and from_logits=True are both used). This is a binary classification problem, so the differences between the approaches shouldn’t be noticeable. However, I have a question about the output.

The training data set is made up, and unrealistically small, but for the purposes of this question, it suffices. Everything needed to run this is provided. Here are the imports:

import numpy as np
from sklearn.pipeline import make_pipeline
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.activations import relu,linear,sigmoid
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam
from scikeras.wrappers import KerasClassifier

Here is the data:

X_train_n = np.array([[-0.23287322,  0.29172998,  0.67238949],
       [-0.23287322,  1.60451491,  0.95649772],
       [-1.0712168 ,  0.51052747,  0.8144436 ],
       [ 2.00270966, -1.23985243,  0.33145961],
       [-0.79176894, -1.23985243, -1.31636815],
       [ 0.3260225 ,  0.0729325 , -1.45842227]])
y_train = np.array([1, 0, 1, 1, 0, 0])

Here is the “less numerically stable way” of going about it:

def estimator_nn():
    tf.random.set_seed(7)
    # Sigmoid output: the model returns P(y=1) directly
    model = Sequential(
        [Dense(12, input_shape=(3,), activation='relu'),
         Dense(1, activation='sigmoid')])
    model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                  optimizer=tf.keras.optimizers.Adam(0.001))
    return model
    
model_outside = KerasClassifier(estimator_nn(), epochs=10, verbose=0,batch_size=10) 
pipe_nn = make_pipeline(model_outside)
pipe_nn.fit(X_train_n,y_train)

pipe_pred_probs_less_stable = pipe_nn.predict_proba(X_train_n)

The variable pipe_pred_probs_less_stable returns the following predicted probabilities.

array([[0.4867192 , 0.5132808 ],
       [0.48584813, 0.5141519 ],
       [0.49800926, 0.50199074],
       [0.80937004, 0.19062999],
       [0.6834421 , 0.31655788],
       [0.57961285, 0.42038715]], dtype=float32)

It is my understanding from the sklearn documentation that the 0th column is the probability of class 0 and the 1st column is the probability of class 1. This makes sense because the two columns add to 1.0. The issue is when I do the same as above, but using the course’s more numerically stable approach as shown below:

def estimator_nn():
    tf.random.set_seed(7)
    # Linear output: the model returns a raw logit, not a probability
    model = Sequential(
        [Dense(12, input_shape=(3,), activation='relu'),
         Dense(1, activation='linear')])
    model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.Adam(0.001))
    return model
    
model_outside = KerasClassifier(estimator_nn(), epochs=10, verbose=0,batch_size=10) 
pipe_nn = make_pipeline(model_outside)
pipe_nn.fit(X_train_n,y_train)


pipe_pred_not_probs_morestable = pipe_nn.predict_proba(X_train_n)
pipe_pred_probs_morestable = tf.nn.sigmoid(pipe_pred_not_probs_morestable).numpy()

The variable pipe_pred_probs_morestable returns the following predicted probabilities:

array([[0.72048414, 0.5132808 ],
       [0.71978134, 0.5141519 ],
       [0.7294901 , 0.50199074],
       [0.92026275, 0.19062999],
       [0.85441244, 0.31655788],
       [0.7893787 , 0.42038715]], dtype=float32)

I noticed that the 0th column here does not match the previous output using the less stable approach (the first approach that didn’t use from_logits=True), but the 1st column here does in fact match the previous output. Also, the columns here do not add up to 1.0.

Hopefully someone can take a look at this and give me some feedback. My questions are:
1.) Is there a reason why the more numerically stable approach gives probabilities that don’t add up to 1.0? Maybe it is in my code, or maybe it is something to do with the approach presented in class that I don’t understand.
2.) I could ignore the 0th column in the second approach (the from_logits=True approach) and just use the 1st column, since it gives the same answer as the first approach. But that seems risky without any attempt to understand what is happening. Are there any insights into what the 0th column in the second approach is giving?
3.) Feel free to make any comments on how I am going about the overall process in my code if something looks wrong.

Thanks.

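The columns add to 1 only if the outputs are normalized across classes, e.g. by a softmax() over one unit per class.

  • from_logits=True does not do this. It only tells the loss to apply the activation internally during training; the model's output stays a raw logit.
  • A sigmoid() activation on a single unit gives one probability per row, so there is nothing for it to normalize against.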

I do not understand what you are saying with regard to the 0th column.
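
For what it's worth, here is a minimal sketch of why the logit form is called more numerically stable. It is plain Python rather than Keras, the two function names are made up for illustration, and it is not Keras's actual implementation (as far as I know Keras clips the probabilities instead of failing). Both functions compute the same binary cross-entropy; only the second one works directly from the logit z.

```python
import math

def bce_from_probability(y, z):
    """Naive route: squash the logit with a sigmoid first, then take logs."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def bce_from_logit(y, z):
    """Same loss rewritten as max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# For a moderate logit the two routes agree.
print(bce_from_probability(1, 0.3), bce_from_logit(1, 0.3))

# For a large logit the sigmoid rounds to exactly 1.0, so log(1 - p) is log(0).
try:
    bce_from_probability(0, 40.0)
except ValueError as err:
    print("naive route failed:", err)

# The logit form never takes the log of a rounded probability.
print(bce_from_logit(0, 40.0))
```

With only 6 samples and 10 epochs the logits stay small, so the two setups should train to essentially the same model. The difference only matters once the logits get large.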

Hello Navead,

I ran your code but I do not get the same answers. More importantly, the 2 approaches do not share the same set of values in column 1, which is expected because they are 2 different approaches - one more stable and one less.

approach 1, epochs = 1000

array([[0.02734417, 0.97265583],
       [0.97730356, 0.02269642],
       [0.01604927, 0.98395073],
       [0.00136888, 0.9986311 ],
       [0.9891432 , 0.01085678],
       [0.9922021 , 0.00779789]], dtype=float32)

approach 2, epochs = 1000

array([[0.12598003, 0.9496445 ],
       [0.98571414, 0.03790256],
       [0.05116596, 0.98054796],
       [0.03631667, 0.9863259 ],
       [0.99445045, 0.01494269],
       [0.99672526, 0.0088519 ]], dtype=float32)

I am not familiar with the internals of scikeras, so I can’t comment further on your implementation. However, from my results running your code, it looks fine to me.

Cheers,
Raymond

Thank you for the reply, I appreciate it. I should have defined what I meant by the 0th column. The 0th column is the first column in the array. It would have been clearer if I had just said “the first column and the second column”.

Hi Raymond -
I reran the code that I put in the question and replicated the same results that I show in the question. I only used epochs = 10 (instead of 1000) and I made sure to use the same tf.random.set_seed(7) for each. I also reran the code without the pipeline, without KerasClassifier, without .predict_proba, and without wrapping the model in a function. For example, the “more stable way” looks like this in the simplified version:

tf.random.set_seed(7)
model = Sequential(
    [Dense(12,input_shape=(3,),activation='relu'),
     Dense(1,activation='linear') ])
model.compile(loss=BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(0.001))
model.fit(X_train_n,y_train,epochs=10)

logit = model(X_train_n)  
pred_probs_more_stable_no_function = tf.nn.sigmoid(logit).numpy()

…and I still got the same answer as in my original question. I’m not sure how/why your answers differed. However, I think my question has changed since experimenting with the code a little bit.
One difference between the code in the original question and the simplified version shown in this reply is that the original code calls .predict_proba on the object returned by the SKLearn make_pipeline function. On the pipeline, .predict returns class labels rather than probabilities, so it has to be .predict_proba. The docs are here:

So, I guess the question is: in the “more stable way”, when I use .predict_proba as done in the original question, the second column holds the probabilities and matches the results from the simplified code. But what do the numbers in the first column (i.e., the result starting with 0.72048414, then 0.71978134, etc.) represent? From the docs I can’t tell.
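
One possible reading, offered as a guess rather than something taken from the scikeras docs: if the wrapper turns a single-output model into two columns as [1 - output, output], then for a logit model .predict_proba returns [1 - z, z], and applying tf.nn.sigmoid to both columns gives [sigmoid(1 - z), sigmoid(z)]. The sketch below checks that guess against the numbers posted in the question, using plain Python. The helper names are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

# Output posted in the question for the from_logits=True model,
# i.e. the predict_proba result after tf.nn.sigmoid was applied to it.
posted = [
    (0.72048414, 0.5132808),
    (0.71978134, 0.5141519),
    (0.7294901, 0.50199074),
    (0.92026275, 0.19062999),
    (0.85441244, 0.31655788),
    (0.7893787, 0.42038715),
]

def reconstruct_first_column(p_class1):
    z = logit(p_class1)      # recover the raw logit from the second column
    return sigmoid(1.0 - z)  # assumption: the wrapper emitted 1 - z here

# Print the posted first column next to the reconstructed one.
for col0, col1 in posted:
    print(f"{col0:.6f}  {reconstruct_first_column(col1):.6f}")
```

If the two printed columns agree, the first column is sigmoid(1 - z): a well-defined number, but not the probability of class 0.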

I realize this question is probably beyond the scope of the course and I can’t expect to get a solution, but I just thought I would see what you/anyone thought of it.

In any case I at least feel confident that doing a simplified version and doing a more complex version using the make_pipeline function do give the same results, if you are looking at the correct columns.

Thanks!

Hello Navead,

If you add up all the columns, does each row give you a 1?

Raymond

Hi Raymond -
When I use the KerasClassifier(…) and make_pipeline(…) functions with .predict_proba(…), the two columns in each row of the output do not add to 1.0. But when I use just Sequential/.compile/.fit without KerasClassifier or a pipeline involved, and I use .predict(…) (not predict_proba), I get only one column, and that column matches the second column from .predict_proba. Maybe that answers what you are asking? Sorry if I’m not getting what you are asking. Basically, I just can’t figure out what the 1st column of numbers is when I use .predict_proba.

Thanks.

Hi Navead,

Would you mind sharing your code with me in a DM? I expected the columns to add up to one, but given your response, I should read your code and examine it myself first.

Thanks,
Raymond

Yes, thanks, that might help clarify things.

Did you manage to find an answer to your question? I am also curious.

Hi Daniel - I will DM you. The short answer is that the first column is kind of nonsensical data and we should just ignore it.
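
For anyone who wants both columns rather than just ignoring the first one, here is a minimal sketch in plain Python. It assumes you already have the raw logits (for example the single column returned by the simplified model earlier in the thread); the function name is made up for illustration. It applies the sigmoid once to get P(class 1) and takes the complement for P(class 0), so each row adds to 1.0 by construction.

```python
import math

def two_column_probs(logits):
    """Turn raw logits into rows of [P(class 0), P(class 1)]."""
    rows = []
    for z in logits:
        # Sigmoid written so that exp() never overflows for large |z|.
        if z >= 0:
            p1 = 1.0 / (1.0 + math.exp(-z))
        else:
            e = math.exp(z)
            p1 = e / (1.0 + e)
        rows.append([1.0 - p1, p1])
    return rows

# Example with arbitrary logits, not values from the thread.
for row in two_column_probs([-2.0, 0.0, 3.5]):
    print(row, sum(row))
```

The key point is that the sigmoid is applied to the logit only, not to a column that already holds 1 - z.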
