In Week 2, Lab 4, we simply add pow(x,2) and pow(x,3) columns to the X matrix, just as we did with the linear features, and then pass it on to the gradient descent algorithm.
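For concreteness, here is a minimal sketch of how I understand the feature construction (variable names and the target are mine for illustration, not necessarily the lab's exact code):

```python
import numpy as np

x = np.arange(0, 20, 1)        # raw scalar input, shape (m,)
X = np.c_[x, x**2, x**3]       # engineered columns: x, x^2, x^3 -> shape (m, 3)
y = x**3 + 5                   # placeholder target just for illustration

# X and y are then handed to the same gradient descent routine used for linear features.
```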
Could I clarify why this works? It seems like a mistake, since GD was derived assuming a linear base function f_wb(x), with the cost function J(w,b) and the w, b updates at each GD step obtained by taking partial derivatives of that linear f_wb(x).
For instance, would the update term for w3, the parameter of the pow(x,3) feature, after taking the partial derivative with respect to w3, just reduce to (alpha/m) * sum_{i=1..m} [(f(x_i) - y_i) * x_i^3], where the x_i^3 value is simply substituted from the X matrix as if it were a linear term?
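Writing out my reading of that step, assuming the course's usual squared-error cost:

```latex
% Assuming J(w,b) = (1/2m) * sum_i (f_{w,b}(x_i) - y_i)^2 and
% f_{w,b}(x) = w_1 x + w_2 x^2 + w_3 x^3 + b, the chain rule pulls out x_i^3:
\[
\frac{\partial J}{\partial w_3}
  = \frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x_i) - y_i\bigr)\,x_i^{3}
\quad\Longrightarrow\quad
w_3 \leftarrow w_3 - \frac{\alpha}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x_i) - y_i\bigr)\,x_i^{3}
\]
```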
So, in effect, the GD algorithm doesn’t need to be told that this or that feature is a non-linear term of a specific power?
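Here is a self-contained sketch of the gradient step as I picture it (my own names, not the lab's code), where every column of X, whether it holds x, x**2, or x**3, is used in exactly the same way:

```python
import numpy as np

x = np.arange(0, 20, 1)
X = np.c_[x, x**2, x**3]       # columns x, x^2, x^3
y = x**3 + 5                   # placeholder target for illustration

def gradient_step(X, y, w, b, alpha):
    """One GD update; column j of X contributes sum(err * X[:, j]) / m to dJ/dw_j."""
    m = X.shape[0]
    err = X @ w + b - y        # f_wb(x_i) - y_i for all i
    dj_dw = (X.T @ err) / m    # same formula for every column, linear or polynomial
    dj_db = err.sum() / m
    return w - alpha * dj_dw, b - alpha * dj_db

w, b = np.zeros(X.shape[1]), 0.0
w, b = gradient_step(X, y, w, b, alpha=1e-7)   # single illustrative step
```

If that reading is right, the entry of dj_dw corresponding to w3 is exactly the sum above, with x_i^3 just read out of X[:, 2].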