Can we start with the circle equation as decision boundary?

I have seen that all learning algorithms, be it simple machine learning or neural nets, start by calculating the line equation y = \vec W \cdot \vec X + b, and then a second function is applied to it. For example, in linear regression we can say there is no second function, which mathematically can be written as f(y) = 1 \cdot y; in logistic regression that second function is the sigmoid function, which can be written as f(y) = \frac{1}{1+e^{-y}}.

I always wonder what is so special about this formula: why always start with fitting a line? Is it because it is easy to apply a function to a line to make it non-linear? In other words, is hitting a line with functions and bending it into any curved decision boundary easier than starting with a curved boundary and then using tools to make it straight?

Also in the following distribution,


It is clear we need a circular decision boundary here, so can we start with the circle equation ( x - h )^2 + ( y - k )^2 = r^2? In this case we would have 3 parameters to learn: h, k and r. Also, if we center the data around the origin, then the model needs to learn only r; the equation in that case would be x^2 + y^2 = r^2.

Hello @tbhaxor, thanks for sharing with us your thoughts! There are a few things I want to talk about:

  1. the “second function” is called the “activation function” in the context of neural networks.
  2. we do not start with fitting a line. Instead, we fit the whole thing at once. If it is a logistic regression, we fit \frac{1}{1+\exp(- (\vec{W}\cdot\vec{X} + b))}, and we DO NOT fit \vec{W}\cdot\vec{X} + b separately. Therefore, when we fit a logistic regression, the weights and bias are adjusted with the sigmoid function already in place. I hope this point is clear.

I agree with what you have said about fitting a circle to the dataset and finding the optimal h, k, and r. My discussion below rephrases everything in the way we discuss it in the Machine Learning Specialization.

This is called feature engineering. In particular, we are engineering a type of features called the polynomial features. You will see it if we expand your equation for circle:

x^2 + y^2 - 2hx - 2ky + h^2 + k^2 - r^2 = 0

If we then replace the coefficients with weights and collect the constant terms h^2 + k^2 - r^2 into b, the left-hand side becomes

w_1x^2 + w_2y^2 + w_3x + w_4y +b
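To make the correspondence concrete, here is a minimal sketch that maps a circle (h, k, r) to the weights of w_1x^2 + w_2y^2 + w_3x + w_4y + b = 0 and back. It assumes the full expansion of ( x - h )^2 + ( y - k )^2 = r^2, so the constant term is h^2 + k^2 - r^2. The function names `circle_to_weights` and `weights_to_circle` are made up for this illustration.

```python
import math


def circle_to_weights(h, k, r):
    """Return (w1, w2, w3, w4, b) for the circle (x-h)^2 + (y-k)^2 = r^2."""
    return 1.0, 1.0, -2.0 * h, -2.0 * k, h * h + k * k - r * r


def weights_to_circle(w1, w2, w3, w4, b):
    """Recover (h, k, r) from the weights.

    Only meaningful when w1 == w2 != 0; otherwise the boundary is an
    ellipse or another conic, not a circle.
    """
    if w1 == 0 or not math.isclose(w1, w2):
        raise ValueError("not a circle: need w1 == w2 != 0")
    h = -w3 / (2.0 * w1)
    k = -w4 / (2.0 * w1)
    r_squared = h * h + k * k - b / w1
    if r_squared <= 0:
        raise ValueError("no real circle for these weights")
    return h, k, math.sqrt(r_squared)


# Hypothetical circle, chosen only for the round-trip check
weights = circle_to_weights(h=1.0, k=-2.0, r=3.0)
h, k, r = weights_to_circle(*weights)
print(weights, (h, k, r))
```

Note that a learned model is free to pick w_1 and w_2 that differ, in which case the boundary is no longer a circle; the weighted form is more general than the circle equation.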

Let’s say we originally only have features x and y, then we are engineering some second degree polynomial features which are x^2 and y^2. In general, the second degree polynomial features will also include xy, so the most general form becomes

w_1x_1^2 + w_2x_2^2 + w_3x_1 + w_4x_2 + w_5x_1x_2 + b if we replace the feature symbols x and y with x_1 and x_2 respectively.

Now this is a binary classification, so

y = sigmoid(w_1x_1^2 + w_2x_2^2 + w_3x_1 + w_4x_2 + w_5x_1x_2 + b)
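As a rough illustration of this model, the sketch below fits a logistic regression on the five second-degree features using plain gradient descent in NumPy. The two noisy rings (radius 1 and 2), the seed, the learning rate and the number of steps are all assumptions picked for the example, not values from the course. Note that the weights and bias are updated through the sigmoid in every step, so nothing is fitted as a line first.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice


def make_rings(n=100, noise=0.2):
    """Two noisy concentric rings: label 0 at radius 1, label 1 at radius 2."""
    theta = np.linspace(0, 2 * np.pi, n)
    inner = np.column_stack([np.cos(theta), np.sin(theta)])
    outer = 2 * inner
    X = np.vstack([inner, outer]) + noise * rng.normal(size=(2 * n, 2))
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y


def poly2_features(X):
    """Columns: x1^2, x2^2, x1, x2, x1*x2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x2**2, x1, x2, x1 * x2])


def sigmoid(z):
    # Clipping avoids overflow warnings for very large |z|
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))


def fit_logistic(Phi, y, lr=0.1, steps=5000):
    """Batch gradient descent on the mean logistic loss."""
    w = np.zeros(Phi.shape[1])
    b = 0.0
    for _ in range(steps):
        error = sigmoid(Phi @ w + b) - y  # gradient of the loss w.r.t. z
        w -= lr * Phi.T @ error / len(y)
        b -= lr * error.mean()
    return w, b


X, y = make_rings()
Phi = poly2_features(X)
w, b = fit_logistic(Phi, y)
accuracy = np.mean((sigmoid(Phi @ w + b) > 0.5) == y)
print("weights:", w, "bias:", b, "training accuracy:", accuracy)
```

On data like this one would expect the weights on x_1^2 and x_2^2 to dominate and the bias to come out negative, which is the circle showing up in the learned parameters.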

In summary, if you can inspect your dataset the way you did, then you are informed to engineer those second-order features x_1^2 and x_2^2. If you are absolutely sure that you should set w_1 = 1, w_2 = 1, w_3 = 0, w_4 = 0, and w_5 = 0 because the data is centered at the origin as you said, then the above model assumption becomes

y = sigmoid(C + b) where C = x_1^2 + x_2^2

And then you can fit the above equation for the best b which would be equivalent to -r^2.
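A minimal sketch of this one-parameter fit, assuming two noisy rings of radius 1 and 2 centered at the origin (synthetic data invented for the example): only b is learned, and the implied radius is read off as r = \sqrt{-b}.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
n, noise = 100, 0.2

# Synthetic, origin-centered rings: label 0 at radius 1, label 1 at radius 2
theta = np.linspace(0, 2 * np.pi, n)
radius = np.concatenate([np.ones(n), 2 * np.ones(n)])
angles = np.concatenate([theta, theta])
x1 = radius * np.cos(angles) + noise * rng.normal(size=2 * n)
x2 = radius * np.sin(angles) + noise * rng.normal(size=2 * n)
y = (radius == 2).astype(float)

# The engineered feature with fixed unit weights: C = x1^2 + x2^2
C = x1**2 + x2**2

# Gradient descent on the single parameter b of y = sigmoid(C + b)
b = 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(C + b)))
    b -= 1.0 * np.mean(p - y)  # learning rate 1.0, an assumption

r = np.sqrt(-b)  # boundary where sigmoid(C + b) = 0.5, i.e. C = -b = r^2
print("b:", b, "implied radius:", r)
```

The learned radius is where the model's probability crosses 0.5, so it lands between the two rings but not necessarily at the exact midpoint.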

However, it is more common that we cannot inspect the dataset this way when the number of features is large. Then we would need to try some higher-order features and evaluate how much improvement those extra features deliver.



Yeah, I think the same: we should start with the function related to the problem. But what Andrew sir initially used in the motivation lecture confused me :sweat_smile:


Also, I can see that feature engineering is a very vast topic that includes a lot of statistics. Currently I am not in that league to understand its details. Thanks for explaining and going further; I will revisit it in some time.

Was Andrew trying to deliver the idea of fitting a binary classification problem with the linear regression approach? If so, I think the message he was trying to deliver is that it’s better for us to use logistic regression on a binary problem. I think he was NOT suggesting that we should fit a line first and then apply the sigmoid later.

@tbhaxor, take your time. In the future :wink: after you are more familiar with feature engineering, and if you somehow, I don’t know, suddenly think of my reply or think of your circle problem, please read my reply again and let me know if you have questions.



I mean to say this

When I wrote “what is so important with line equation”, I meant: why apply the “activation” function only to the line equation?

I see. Let’s call it “linear function” instead of “line equation”.

In my previous reply, I had “converted” your equation for circle into

w_1x^2 + w_2y^2 + w_3x + w_4y +b

which can further be “converted” into

w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 +b where x_1 = x^2, x_2 = y^2, x_3 = x, x_4=y.

And the above is basically a linear function because we can write it as \vec{w}\cdot\vec{x} +b where \vec{w} = \begin{bmatrix} w_1 & w_2 & w_3 & w_4 \end{bmatrix} and \vec{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
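A small numeric check of this point, using a hypothetical circle (h, k, r) chosen only for illustration: once x^2, y^2, x and y are treated as four features, the circle test is a single dot product \vec{w}\cdot\vec{x} + b, whose sign tells whether a point is inside, on, or outside the circle.

```python
import numpy as np


def features(x, y):
    """Engineered feature vector: x_1 = x^2, x_2 = y^2, x_3 = x, x_4 = y."""
    return np.array([x**2, y**2, x, y])


# Hypothetical circle used only for this check
h, k, r = 1.0, -2.0, 3.0

# Weights and bias from expanding (x-h)^2 + (y-k)^2 - r^2
w = np.array([1.0, 1.0, -2.0 * h, -2.0 * k])
b = h**2 + k**2 - r**2


def score(x, y):
    """Linear function of the engineered features: w . x + b."""
    return float(w @ features(x, y) + b)


on_circle = score(h + r, k)    # a point on the circle
inside = score(h, k)           # the center
outside = score(h + 2 * r, k)  # a point at distance 2r from the center
print(on_circle, inside, outside)
```

The score is zero on the circle, negative inside and positive outside, so in the four-dimensional feature space the circular boundary is an ordinary hyperplane.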

Therefore, the “linear function” is so special :wink:


I think there is a typo in the equation; I am getting a Math Processing Error


Can you refresh this page (a few times) and see if the problem goes away?


Yes fixed with refresh

Though it looks like a line equation, this is part of linear algebra. W \cdot X^T + b is written this way so that the computation runs as vectorized operations, for performance. I get it now.

Yes “it is not a line equation, but a linear algebra evaluation”

Also, I found a cool thing named “EDA”, where data is explored through visual representations like plots. Here we, as statisticians, understand the nature of the data: how two variables are related to each other and to the dependent variable.

Hi @tbhaxor,

in addition to @rmwkwok 's excellent answer.

I agree: feature engineering will do the trick. Of course, you can also apply a transformation, as we know it from polar coordinates, to get rid of the non-linearity in the data and solve the problem with a linear classifier.

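import numpy as np
import matplotlib.pyplot as plt

# Some parameters
dim = 100     # number of points per circle
noise = 0.2   # scale of the Gaussian noise added to each coordinate

# Angles shared by both circles
theta = np.linspace(0, 2*np.pi, dim)

# Create inner circle (radius 1)
x_1 = np.cos(theta) + noise*np.random.normal(0, 1, dim)
y_1 = np.sin(theta) + noise*np.random.normal(0, 1, dim)

# Create outer circle (radius 2)
x_2 = 2*np.cos(theta) + noise*np.random.normal(0, 1, dim)
y_2 = 2*np.sin(theta) + noise*np.random.normal(0, 1, dim)

# Plot of circle data
plt.figure()
plt.scatter(x_1, y_1, label='inner circle')
plt.scatter(x_2, y_2, label='outer circle')
plt.gca().set_aspect('equal')
plt.title('Before Transformation')
plt.legend()

# Transformation function: Cartesian (x, y) -> polar (radius, angle)
def pol_transform(x, y):
    alpha = np.sqrt(x*x + y*y)   # radius
    beta = np.arctan2(y, x)      # angle in radians
    return alpha, beta

# Apply transformation
alpha_1, beta_1 = pol_transform(x_1, y_1)
alpha_2, beta_2 = pol_transform(x_2, y_2)

# Plot of transformed data: angle on the x-axis, radius on the y-axis
plt.figure()
plt.scatter(beta_1, alpha_1, label='inner circle')
plt.scatter(beta_2, alpha_2, label='outer circle')
plt.xlabel('beta (angle)')
plt.ylabel('alpha (radius)')
plt.title('After Transformation')
plt.legend()
plt.show()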

After the transformation, a generalized linear model such as logistic regression will do the trick to solve this classification problem.
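To see why a linear model is enough after the transformation, here is a minimal sketch that skips the fitting altogether and places a single threshold on the radius. The threshold 1.5 (the midpoint between the two radii) and the fixed seed are assumptions for the illustration; a fitted logistic regression would choose its own boundary.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
dim, noise = 100, 0.2
theta = np.linspace(0, 2 * np.pi, dim)

# Same construction as the two noisy circles: radius 1 and radius 2
x_1 = np.cos(theta) + noise * rng.normal(size=dim)
y_1 = np.sin(theta) + noise * rng.normal(size=dim)
x_2 = 2 * np.cos(theta) + noise * rng.normal(size=dim)
y_2 = 2 * np.sin(theta) + noise * rng.normal(size=dim)


def pol_transform(x, y):
    """Cartesian (x, y) -> polar (radius alpha, angle beta)."""
    return np.sqrt(x * x + y * y), np.arctan2(y, x)


alpha_1, _ = pol_transform(x_1, y_1)
alpha_2, _ = pol_transform(x_2, y_2)

# In polar coordinates the boundary is the straight line alpha = threshold
threshold = 1.5  # assumed midpoint between the two radii
correct = np.sum(alpha_1 < threshold) + np.sum(alpha_2 >= threshold)
accuracy = correct / (2 * dim)
print("accuracy with a single radius threshold:", accuracy)
```

The angle beta carries no class information here, so a linear classifier on (alpha, beta) effectively only needs a weight on alpha and a bias.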

Best regards


It is clever to convert to the second form, because fitting a linear equation is the easiest thing compared to fitting a polynomial decision boundary. I am impressed.



Just as additional input: Of course the kernel trick is very suitable for problems like this as well, see also:


See also this article if you are interested.

Happy Learning and best regards


that has been a great discussion indeed. It sparked some thought.

allow me to ask a follow up question.

so do we also try to find the best cv model by adding polynomial features into a logistic regression as we do for linear regression?

Following Raymond’s excellent response to tbhaxor, I thought this tactic should also be applied to logistic regression, but I don’t know if this is common practice.


Hi @mehmet_baki_deniz,

I understand your question to mean that you would use polynomial modelling to derive features that you would then use as input for your logistic regression.

This is absolutely fine and makes total sense. As long as your features are solid and sound from a domain perspective, this is a great approach. If you can model well-understood cause-effect relationships in your features, go for it!
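As a rough sketch of how such a comparison could look, assuming “cv” in the question means a cross-validation (hold-out) set: fit one logistic regression per polynomial degree on a training split and keep the degree with the best accuracy on the held-out split. The synthetic ring data, the 300/100 split, the seed and the gradient-descent settings are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice


def make_rings(n, noise=0.2):
    """Synthetic data: n points at radius 1 (label 0), n at radius 2 (label 1)."""
    theta = rng.uniform(0, 2 * np.pi, size=2 * n)
    radius = np.concatenate([np.ones(n), 2 * np.ones(n)])
    X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
    X += noise * rng.normal(size=X.shape)
    return X, (radius == 2).astype(float)


def features(X, degree):
    """Degree 1: x1, x2. Degree 2 adds x1^2, x2^2 and x1*x2."""
    x1, x2 = X[:, 0], X[:, 1]
    cols = [x1, x2]
    if degree >= 2:
        cols += [x1**2, x2**2, x1 * x2]
    return np.column_stack(cols)


def fit_logistic(Phi, y, lr=0.1, steps=5000):
    """Batch gradient descent on the mean logistic loss."""
    w, b = np.zeros(Phi.shape[1]), 0.0
    for _ in range(steps):
        z = np.clip(Phi @ w + b, -30, 30)
        error = 1.0 / (1.0 + np.exp(-z)) - y
        w -= lr * Phi.T @ error / len(y)
        b -= lr * error.mean()
    return w, b


def accuracy(Phi, y, w, b):
    return np.mean((Phi @ w + b > 0) == (y == 1))


X, y = make_rings(200)
idx = rng.permutation(len(y))
train, val = idx[:300], idx[300:]  # 300 training points, 100 held out

cv_accuracy = {}
for degree in (1, 2):
    Phi = features(X, degree)
    w, b = fit_logistic(Phi[train], y[train])
    cv_accuracy[degree] = accuracy(Phi[val], y[val], w, b)

best_degree = max(cv_accuracy, key=cv_accuracy.get)
print(cv_accuracy, "best degree:", best_degree)
```

On ring-shaped data the degree-1 model can do little better than guessing, while the degree-2 model can represent the circular boundary, so the held-out accuracy makes the choice for us.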

I think in reality often „the best“ model is not even needed. The pipeline needs to be good enough to solve your business problem in an efficient way. Side note: Maybe „data quality“ can have a stronger leverage than the „model“, see also this thread on data-centric AI.

Best regards


Since you are asking about CV (Computer Vision).
Usually here you have a different paradigm: deep learning basically takes care of feature engineering on its own to learn more abstract and complex patterns, e.g. with convolutional filters and hierarchical pooling.

In this thread some more thoughts are explained in a more detailed way: Do traditional algorithms perform better than CNN? - #2 by Christian_Simonis

So since a well-trained CV model learned so much already with DL, probably a polynomial feature will not have much impact (at least I do not see it yet currently). Still, transforming pictures or getting additional dimensions (e.g. with LiDAR / Radar / …) with sensing or data fusion can definitely help by enabling the DL model to learn better due to better data.

Hope that answers your question, @mehmet_baki_deniz.

Best regards

hi christian. thank you very much for your answers. Indeed, my question is more about a simple logistic regression than a dl model.
I will watch the video you referred to in the first response to get a better sense about why increasing the quality of the data is better than fine tuning a model.

re DL: that reminded me of one of my previous questions to the community: why bother with feature engineering even for simple LR problems when a complex enough NN can easily do the job for us? Experienced members responded by saying that parameter adjustment is also not that easy in a DL model. What is more interesting is that my data scientist friends also say that they use ML models rather than DL models for tabular data at work.

maybe the issue is the computational complexity for a not really complex problem…


Is this kernel the same as finding the null matrix? I mean, is this kernel the same as the kernel of a matrix (aka the null space)? The wiki link for the machine learning kernel method is too advanced for me right now.


Hi there,

Depending on whether you come from statistics, linear algebra, etc., the definition of a kernel is ambiguous; see also this list here:

What matters for your classification problem as discussed above, see Can we start with the circle equation as decision boundary? - #12 by Christian_Simonis, is:

The function k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R} is often referred to as a kernel or a kernel function. The word “kernel” is used in mathematics to denote a weighting function for a weighted sum or integral.
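To connect this definition with the circle problem, here is a small sketch using the homogeneous degree-2 polynomial kernel k(x, z) = (x \cdot z)^2, picked as an illustrative example. It checks numerically that the kernel value equals an ordinary dot product of explicitly engineered second-degree features, which is the sense in which a kernel function differs from the null space of a matrix. The names `phi` and `poly_kernel` and the sample points are made up for the illustration.

```python
import numpy as np


def phi(x):
    """Explicit degree-2 feature map for a 2-D point: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])


def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2


# Two arbitrary points, chosen only for the check
x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(phi(x) @ phi(z))  # dot product in the engineered feature space
implicit = poly_kernel(x, z)       # same value without building the features
print(explicit, implicit)
```

The kernel gives the feature-space similarity directly from the original coordinates, which is what lets kernel methods work with high-degree features without ever constructing them.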


—> here you find a good explanation of the kernel trick:

Hope that helps, @tbhaxor!

Best regards
