I have noticed that all the learning algorithms, be it simple machine learning or neural nets, start by computing the linear equation y = \vec W \cdot \vec X + b, and then a second function is applied to it. For example, in linear regression that second function is the identity, which mathematically can be written as f(y) = 1 \cdot y; in logistic regression that second function is the sigmoid, which can be written as f(y) = \frac{1}{1+e^{-y}}.

I always wonder what is so special about this formula. Why always start by fitting a line? Is it because it is easy to apply a function to a line to make it non-linear? In other words, is hitting a line with functions and bending it into any curved decision boundary easier than starting with a curved boundary and then using tools to make it straight?

Also in the following distribution,

It is clear we need a circular decision boundary here, so can we start with the circle equation ( x - h )^2 + ( y - k )^2 = r^2? In this case we would have three parameters to learn: h, k, and r. Also, if we center the data around the origin, then the function needs to learn only r; the equation in that case would be x^2 + y^2 = r^2.

Hello @tbhaxor, thanks for sharing with us your thoughts! There are a few things I want to talk about:

1. The “second function” is called the “activation function” in the context of neural networks.
2. We do not start with fitting a line. Instead, we fit the whole thing at once. For logistic regression, we fit \frac{1}{1+\exp(- (\vec{W}\cdot\vec{X} + b))}, and we DO NOT fit \vec{W}\cdot\vec{X} + b separately. Therefore, when we fit a logistic regression, the weights and bias are adjusted given the existence of the sigmoid function. I hope this point is clear.
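To make point 2 concrete, here is a minimal numpy sketch (the toy data, learning rate, and iteration count are my own assumptions, not from the course). Gradient descent runs on the composed model sigmoid(w·x + b), so the sigmoid shapes every update of w and b:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: one feature, label is 1 when the feature is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = (X[:, 0] > 0).astype(float)

w = np.zeros(1)
b = 0.0
lr = 0.5

# Gradient descent on the *composed* model sigmoid(w.x + b):
# for the log loss, dLoss/dz = p - y, where p already went through
# the sigmoid, so the weights are adjusted "given the existence of
# the sigmoid function".
for _ in range(500):
    p = sigmoid(X @ w + b)            # full model output
    grad = p - y                      # gradient w.r.t. the pre-activation z
    w -= lr * (X.T @ grad) / len(y)
    b -= lr * grad.mean()

preds = (sigmoid(X @ w + b) > 0.5).astype(float)
print((preds == y).mean())            # training accuracy on the toy data
```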

I agree with what you said about fitting a circle to the dataset and finding the optimal h, k, and r. My discussion below rephrases everything back to the way we discuss it in the Machine Learning Specialization.

This is called feature engineering. In particular, we are engineering a type of feature called polynomial features. You will see it if we expand your equation for the circle:

x^2 + y^2 - 2hx - 2ky + h^2 + k^2 - r^2 = 0

Then, replacing the coefficients with weights, it becomes

w_1x^2 + w_2y^2 + w_3x + w_4y +b

Let’s say we originally only have the features x and y; then we are engineering second-degree polynomial features, namely x^2 and y^2. In general, the second-degree polynomial features also include xy, so the most general form becomes

w_1x_1^2 + w_2x_2^2 + w_3x_1 + w_4x_2 + w_5x_1x_2 + b, if we replace the feature symbols x and y with x_1 and x_2 respectively.

Now this is a binary classification, so

y = sigmoid(w_1x_1^2 + w_2x_2^2 + w_3x_1 + w_4x_2 + w_5x_1x_2 + b)

In summary, if you can inspect your dataset the way you do, then you are informed to engineer those second-order features x_1^2 and x_2^2. If you are absolutely sure that you should set w_1 = 1, w_2 = 1, w_3 = 0, w_4 = 0, and w_5 = 0 because the data is centered at the origin as you said, then the above model assumption becomes

y = sigmoid(C + b) where C = x_1^2 + x_2^2

Then you can fit the above equation for the best b, which would be equivalent to -r^2.
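As a sketch of this in code (a hypothetical, origin-centered toy dataset; scikit-learn's PolynomialFeatures generates exactly the x_1^2, x_2^2, and x_1x_2 terms discussed above, and the logistic regression then learns the weights):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy dataset: points inside the unit circle are class 1
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1.0).astype(int)

# Engineer degree-2 polynomial features: x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# An ordinary logistic regression on the engineered features can now
# learn the circular decision boundary
clf = LogisticRegression(max_iter=1000).fit(X_poly, y)
print(clf.score(X_poly, y))
```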

However, it is more usual that we cannot inspect the dataset in your way when the number of features is large; then we would need to try some higher-order features and evaluate how much improvement those extra features deliver.

Cheers,
Raymond


Yeah, I think the same: we should start with the function related to the problem. But the way Andrew sir initially used the line in the motivation lecture confused me.

Also, I can see that feature engineering is a very vast topic that includes a lot of statistics. Currently I am not in that league to understand the details. Thanks for explaining and going further; I will revisit it in some time.

Was Andrew trying to deliver the idea of fitting a binary classification problem with the linear regression approach? If so, I think the message he was trying to deliver is that it’s better for us to use logistic regression on a binary problem. I think he was NOT suggesting that we should fit a line first and then apply the sigmoid later.

@tbhaxor, take your time. In the future, after you are more familiar with feature engineering, if you somehow, I don’t know, suddenly think of my reply or of your circle problem, please read my reply again and let me know if you have questions.

Cheers,
Raymond


I mean to say this

When I wrote “what is so important with the line equation”, I meant: why apply the “activation” function only to the line equation?

I see. Let’s call it “linear function” instead of “line equation”.

w_1x^2 + w_2y^2 + w_3x + w_4y +b

which can further be “converted” into

w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 +b where x_1 = x^2, x_2 = y^2, x_3 = x, x_4=y.

And the above is basically a linear function because we can write it as \vec{w}\cdot\vec{x} +b where \vec{w} = \begin{bmatrix} w_1 & w_2 & w_3 & w_4 \end{bmatrix} and \vec{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
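A quick numerical check of this equivalence, with a made-up point and made-up weights (all values here are hypothetical): the degree-2 polynomial form and the renamed linear form \vec{w}\cdot\vec{x} + b give the same number.

```python
import numpy as np

# Hypothetical point (x, y) and weights w1..w4 with bias b
x, y_feat = 1.5, -0.5
w = np.array([2.0, 2.0, -1.0, 0.5])
b = -3.0

# Polynomial form: w1*x^2 + w2*y^2 + w3*x + w4*y + b
poly_value = w[0]*x**2 + w[1]*y_feat**2 + w[2]*x + w[3]*y_feat + b

# Linear form: w . x_vec + b, after renaming x1=x^2, x2=y^2, x3=x, x4=y
x_vec = np.array([x**2, y_feat**2, x, y_feat])
linear_value = w @ x_vec + b

print(np.isclose(poly_value, linear_value))  # prints True: both forms agree
```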

Therefore, the “linear function” is so special.

Raymond

I think there is a typo in the equation; I am getting “Math Processing Error”.

Can you refresh this page (a few times) and see if the problem goes away?


Yes, fixed with a refresh.

Though it looks like a line equation, this is really linear algebra. W \cdot X^T + b is used to perform vectorized operations for performance. I get it now.
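For example, a small numpy sketch (random data assumed) shows the single matrix product matching a per-sample loop, which is what the vectorized form buys us:

```python
import numpy as np

# Hypothetical data: 1000 samples with 3 features each
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
W = np.array([0.5, -1.0, 2.0])
b = 0.1

# Vectorized form: one matrix product computes w.x + b for every sample
z_vec = X @ W + b

# Equivalent explicit loop over samples (much slower in Python)
z_loop = np.array([W @ x_i + b for x_i in X])

print(np.allclose(z_vec, z_loop))  # prints True: identical results
```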

Yes, “it is not a line equation, but a linear algebra evaluation”.

Also, I found a cool thing named “EDA”, where data is explored through visual representations like plots. Here we, as statisticians, understand the nature of the data: how two variables relate to each other and to the dependent variable.

Hi @tbhaxor,

I agree: feature engineering will do the trick. Of course, you can also apply a transformation, as we know it from polar coordinates, to get rid of the non-linearity in the data and solve the problem with a linear classifier.

import numpy as np
import math
import matplotlib.pyplot as plt

# Some Parameters
dim = 100
noise = 0.2

#Create inner circle
x_1 = np.cos(np.linspace(0,2*math.pi, dim)) + noise*np.random.normal(0, 1, dim)
y_1 = np.sin(np.linspace(0,2*math.pi, dim)) + noise*np.random.normal(0, 1, dim)

#Create outer circle
x_2 = 2*np.cos(np.linspace(0,2*math.pi, dim)) + noise*np.random.normal(0, 1, dim)
y_2 = 2*np.sin(np.linspace(0,2*math.pi, dim)) + noise*np.random.normal(0, 1, dim)

# Plot of Circle data
ax = plt.gca()
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.scatter(x_1,y_1)
plt.scatter(x_2,y_2)
plt.title('Before Transformation')
plt.show()

# Transformation function to polar coordinates
def pol_transform(x, y):
    alpha = np.sqrt(x*x + y*y)   # radius
    beta = np.arctan2(y, x)      # angle
    return (alpha, beta)

# Apply transformation
alpha_1, beta_1 = pol_transform(x_1,y_1)
alpha_2, beta_2 = pol_transform(x_2,y_2)

# Plot of transformed data
ax = plt.gca()
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.scatter(alpha_1,beta_1)
plt.scatter(alpha_2,beta_2)
plt.title('After Transformation')
plt.show()


After the transformation, a generalized linear model like logistic regression will do the trick to solve this classification problem.
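As a sketch (rebuilding the same kind of noisy circles as in the snippet above, with my own random seed), a plain logistic regression fitted on the polar features separates the two rings almost perfectly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rebuild two noisy concentric circles (same assumptions as the snippet above)
dim, noise = 100, 0.2
rng = np.random.default_rng(2)
t = np.linspace(0, 2*np.pi, dim)
x_1 = np.cos(t) + noise*rng.normal(size=dim)
y_1 = np.sin(t) + noise*rng.normal(size=dim)
x_2 = 2*np.cos(t) + noise*rng.normal(size=dim)
y_2 = 2*np.sin(t) + noise*rng.normal(size=dim)

# Polar transform: radius and angle
def pol_transform(x, y):
    return np.sqrt(x*x + y*y), np.arctan2(y, x)

a1, b1 = pol_transform(x_1, y_1)
a2, b2 = pol_transform(x_2, y_2)

# Stack the transformed features and fit a linear classifier
X = np.column_stack([np.concatenate([a1, a2]), np.concatenate([b1, b2])])
y = np.concatenate([np.zeros(dim), np.ones(dim)])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))  # high accuracy: the radius alone separates the rings
```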

Best regards
Christian


It is clever to convert it to the second form, because fitting a linear equation is much easier than fitting a polynomial decision boundary directly. I am impressed.


Great!

Just as additional input: of course, the kernel trick is also very suitable for problems like this; see also:
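A minimal sketch of the kernel trick on this kind of data, assuming scikit-learn's SVC with an RBF kernel (the toy data here is my own, not from the linked source): the kernel implicitly maps the points into a higher-dimensional space where a linear separator exists, without ever computing the mapped features explicitly.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical concentric-circles data, as in the earlier snippet
rng = np.random.default_rng(3)
t = np.linspace(0, 2*np.pi, 100)
inner = np.column_stack([np.cos(t), np.sin(t)]) + 0.1*rng.normal(size=(100, 2))
outer = np.column_stack([2*np.cos(t), 2*np.sin(t)]) + 0.1*rng.normal(size=(100, 2))

X = np.vstack([inner, outer])
y = np.concatenate([np.zeros(100), np.ones(100)])

# RBF-kernel SVM: non-linear boundary in the original space,
# linear boundary in the implicit feature space
clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))
```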

(Source)

Happy Learning and best regards
Christian


That has been a great discussion indeed. It sparked some thoughts.

So do we also try to find the best CV model by adding polynomial features to a logistic regression, as we do for linear regression?

Following Raymond’s excellent response to tbhaxor, I thought this tactic should also apply to logistic regression, but I don’t know if this is a common practice.


I understand your question in the sense that you would use polynomial modelling to derive features that you would then use as input for your logistic regression.

This is absolutely fine and makes total sense. As long as your features are solid and sound from a domain perspective, this is a great approach. If you can model understood cause-effect relationships in your features, go for it!

I think in reality, often “the best” model is not even needed. The pipeline needs to be good enough to solve your business problem in an efficient way. Side note: maybe “data quality” can have a stronger leverage than the “model”; see also this thread on data-centric AI.

Best regards
Christian


Usually here you have a different paradigm: deep learning basically takes care of feature engineering on its own to learn more abstract and complex patterns, e.g. with convolutional filters and hierarchical pooling.

In this thread some more thoughts are explained in a more detailed way: Do traditional algorithms perform better than CNN? - #2 by Christian_Simonis

So since a well-trained CV model has already learned so much with DL, a polynomial feature will probably not have much impact (at least I do not see it yet). Still, transforming pictures or getting additional dimensions (e.g. with LiDAR / radar / …) through sensing or data fusion can definitely help by enabling the DL model to learn better from better data.

Best regards
Christian

Hi Christian, thank you very much for your answers. Indeed, my question is more about a simple logistic regression than a DL model.
I will watch the video you referred to in your first response to get a better sense of why increasing the quality of the data is better than fine-tuning a model.

Re DL: that reminded me of one of my previous questions to the community: why bother with feature engineering for even simple LR problems when a complex enough NN can easily do the job for us? But then experienced members responded that parameter adjustment is also not that easy in a DL model. What is more interesting is that my data scientist friends also say that at work they use ML models for tabular data rather than a DL model.

Maybe the issue is the computational complexity for a not really complex problem…


Is this kernel the same as the kernel of a matrix (aka its null space)? The wiki link for the machine learning kernel method is too advanced for me right now.


Hi there,

Depending on whether you come from statistics, linear algebra, etc., the definition of a kernel is ambiguous; see also this list here:

What matters for your classification problem as discussed above (see Can we start with the circle equation as decision boundary? - #12 by Christian_Simonis) is:

The function is often referred to as a kernel or a kernel function. The word “kernel” is used in mathematics to denote a weighting function for a weighted sum or integral.

Source

Here you find a good explanation of the kernel trick:

Hope that helps, @tbhaxor!

Best regards
Christian
