Questions about Sharing Code, Relationship between Interactions and Alpha, and Scaling in Multiple Linear Regression

Sebastian_Rathmann · August 4, 2023, 5:06pm

Hello,

I hope you’re doing well. I have a few questions and would greatly appreciate your guidance.

Firstly, I’m wondering if it’s permissible to share external code that I’ve created. I’ve developed a program from scratch that uses concepts covered in the course, but it’s not related to activity answers or lab solutions. I reviewed the rules and understand that sharing activity answers or lab code is prohibited, but I believe my situation is different.

Secondly, I’d like to discuss the significance of the number of iterations in relation to the learning rate, alpha. Are these two factors closely linked? I’ve encountered an interesting scenario where I obtained very similar cost and graph patterns by running 5500 iterations with alpha = 0.0003 and 550 iterations with alpha = 0.003. Is this a potential code issue or is it a normal outcome?

Lastly, I would appreciate feedback on my scaling approach. I’ve noticed that in my code, during the scaling process, it might not correctly account for the first feature of X, because even if that value changes, the prediction result is the same. Here’s the scaling code snippet I’ve used:

for i in range(3):
    prediccion[i] = (prediccion[i] - np.min(X_train[:, i])) / (np.max(X_train[:, i]) - np.min(X_train[:, i]))

And this is the one I use to undo the scaling:

y_hat = y_hat_scaled * (np.max(y_train) - np.min(y_train)) + np.min(y_train)

TMosh · August 4, 2023, 6:10pm

That’s probably not an appropriate use of the forum. Your own repo would be a better choice.

That’s expected. If you use a larger learning rate, you need fewer iterations. The tradeoff is that if the learning rate is too large, the solution may diverge instead of converge.
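
A minimal sketch of this tradeoff, assuming a plain batch gradient descent on a made-up toy dataset (the `batch_gd` and `cost` helpers and the data are illustrative, not the code from the question). For small learning rates, what matters is roughly the product of alpha and the number of iterations, so the two runs land at nearly the same parameters:

```python
import numpy as np

def batch_gd(X, y, alpha, num_iters):
    """Plain batch gradient descent for linear regression (cost = MSE / 2)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        err = X @ w + b - y              # residuals, shape (m,)
        w -= alpha * (X.T @ err) / m     # gradient step for the weights
        b -= alpha * err.mean()          # gradient step for the bias
    return w, b

def cost(X, y, w, b):
    return np.mean((X @ w + b - y) ** 2) / 2

rng = np.random.default_rng(0)
X = rng.random((50, 2))                  # toy features, already in [0, 1]
y = X @ np.array([2.0, -1.0]) + 0.5      # toy targets

# Same product alpha * num_iters = 1.65 in both runs.
w_slow, b_slow = batch_gd(X, y, alpha=0.0003, num_iters=5500)
w_fast, b_fast = batch_gd(X, y, alpha=0.003, num_iters=550)

print("cost (alpha=0.0003, 5500 iters):", cost(X, y, w_slow, b_slow))
print("cost (alpha=0.003,   550 iters):", cost(X, y, w_fast, b_fast))
```

The match only holds while both learning rates are small enough for gradient descent to be stable; it is not a general rule for large alpha.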

Scaling by the range of the feature values is sometimes used, instead of the standard deviation. The key factor is the degree to which you want to consider outliers in the dataset. But it should be done on the entire data set, not by iterating through the examples (and not by using a hard-coded range).
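
As a rough illustration of the two options, here is a sketch that scales every feature column at once, first by its range and then by its standard deviation. The `X_train` values are made up for the example; only the use of `axis=0` (one statistic per feature, computed over the whole training set) is the point:

```python
import numpy as np

# Toy feature matrix: rows are examples, columns are features (made-up values).
X_train = np.array([
    [ 50.0, 1.0, 2.0],
    [120.0, 3.0, 5.0],
    [200.0, 4.0, 1.0],
    [ 90.0, 2.0, 9.0],
])

# Range (min-max) scaling: one min and one range per column.
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min
X_minmax = (X_train - col_min) / col_range      # every column ends up in [0, 1]

# Z-score scaling: one mean and one standard deviation per column.
col_mean = X_train.mean(axis=0)
col_std = X_train.std(axis=0)
X_zscore = (X_train - col_mean) / col_std       # every column has mean 0, std 1

print(X_minmax)
print(X_zscore)
```

Range scaling is driven entirely by the two most extreme values in each column, which is why outliers affect it more than z-score scaling.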

rmwkwok · August 4, 2023, 7:49pm

Hello @Sebastian_Rathmann,

That sounds interesting!!

Storing your code in a repository on your GitHub account is a good and professional practice, and you can then share a link to that repository. I can take a look!

Cheers,
Raymond

Sebastian_Rathmann · August 5, 2023, 9:10pm

Hello again!

First of all, I wanted to apologize for taking so long to respond. I hope you can still help me. After implementing some of the things you mentioned, translating most of my code into English, and learning to use Git from scratch, I wanted to share the link to my repository: GitHub - Sebastian-Rathmann/Multi-Linear-Regression: My first repository, in this I will publish my first code based on the Machine Learning Supervised course.

Any feedback is greatly appreciated, from issues related to “best practices” to errors that might be causing the code to not work correctly. Thank you very much for your attention!

rmwkwok · August 6, 2023, 1:00am

Hello @Sebastian_Rathmann,

Congratulations on your first GitHub repo, and thanks for the translation. It is considerate of you.

I have a few comments:

We don’t need to scale y_train. In fact, we almost never need to scale our labels, unless we do it for numerical stability. For example, if the labels are very large numbers like “the number of atoms in a solar system”, then we may want to scale them down, because a computer can only handle numbers correctly when they are within a certain range.
If you don’t scale y_train, you don’t need to scale back.
We don’t apply np.max over the whole dataset to scale each feature. We apply np.max over one feature and use the result to scale that same feature. I recommend that you cross-check your result with sklearn’s MinMaxScaler. This is how we check our own work: by using existing libraries.
This is optional. It is completely fine to loop over the features, but you can do the min-max scaling in one line with vectorization. You might want to try it.
It’s always good practice to store the scaling factors, because we will need them to scale any new test set. sklearn’s MinMaxScaler saves the factors (we should be familiar with how sklearn works). Your current approach is to keep X_train untouched so that you can re-obtain the scaling factors. That is fine, but the usual approach is to store just the scaling factors, because we don’t want to repeat their calculation.
I didn’t check your gradient descent code. However, I recommend that you, again, cross-check its result against an existing library, for example sklearn’s LinearRegression. Finding some way to cross-check our own work is an important process.
If you find that sklearn’s LinearRegression result is different from yours, you need to check whether you have run enough iterations for the model to have really converged to some stable w and b. It is actually a very good exercise, even if you have to spend hours to finally make your gradient descent agree with sklearn’s LinearRegression. Those hours won’t be wasted, because they become useful experience, and this is how people can sometimes “spot” a problem very quickly.
Again, cross-check whether your prediction result is very similar to that of sklearn’s LinearRegression.
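
The point about storing the scaling factors can be sketched as follows. `SimpleMinMaxScaler` is a hypothetical helper written for this illustration (it is not sklearn’s class, although sklearn’s MinMaxScaler also keeps fitted attributes named `data_min_` and `data_range_`), and the training values are made up:

```python
import numpy as np

class SimpleMinMaxScaler:
    """Hypothetical helper: remembers per-feature min and range from the training set."""

    def fit(self, X):
        self.data_min_ = X.min(axis=0)                    # one min per feature
        self.data_range_ = X.max(axis=0) - self.data_min_ # one range per feature
        return self

    def transform(self, X):
        # Reuses the stored training statistics; nothing is recomputed from X.
        return (X - self.data_min_) / self.data_range_

# Made-up training features (3 features per example).
X_train = np.array([
    [ 50.0,  1.0,  2.0],
    [120.0,  3.0,  5.0],
    [200.0, 20.0, 10.0],
])

scaler = SimpleMinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# A new sample is scaled with the *training* min and range.
x_new = np.array([[150.0, 10.0, 5.0]])
x_new_scaled = scaler.transform(x_new)
print(x_new_scaled)
```

Because every feature of the new sample goes through its own column’s factors, changing the first feature of `x_new` changes the first scaled value, which is a quick way to check that no feature is being skipped.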

Keep practicing!
Raymond

Sebastian_Rathmann · August 11, 2023, 5:12pm

Hi there!

I hope you’re having a great week. I wanted to share my recent progress with you and seek your expert guidance on a couple of issues that have been giving me pause. Before we dive into the details, I’d like to mention that I’ve been actively refining my repository (GitHub - Sebastian-Rathmann/Multi-Linear-Regression: My first repository, in this I will publish my first code based on the Machine Learning Supervised course), ensuring that it contains only the essential files. I’ve also taken the time to enhance the overall quality of my code. This included addressing previous shortcomings such as untranslated sections, lacking comments, and variable names that were in a mix of English and Spanish. I’m pleased to say that the code is much cleaner and more organized now.

Moving on to my concerns, I’ve been working diligently this week, focusing on your suggestions. I removed the scaling from y_train, stored the max and min of my features in variables, and vectorized the scaling process for x_train. I also extensively compared each section of my code with sklearn’s implementation. While everything appears to be in order, I’m still apprehensive and would greatly appreciate your insights to ensure my approach is correct.

There are two perplexing anomalies that I’ve encountered, which have been impeding my progress and causing me some confusion:

Sklearn Convergence: One observation that has been puzzling me is that sklearn’s implementation seems to converge much faster than my own. Despite both implementations utilizing gradient descent, sklearn appears to converge with relatively small values for parameters such as 0.0001 and only 2000 iterations. However, in my case, “achieving convergence” requires using parameters like 0.001 and around 5000 iterations. This discrepancy has raised questions in my mind about potential areas I may be overlooking or misunderstanding.

Graphical Representation: The second concern is linked to the first one, and it gives me the impression that the graph I’m using to plot the cost against the number of iterations lacks accuracy. I sense that my gradient descent is functioning properly; however, when it approaches the final stage, it alternates between values, preventing it from reaching the true minimum values. [And yes, I am aware that this could potentially be mitigated by selecting lower alpha values, but this strategy hasn’t yielded the expected results; I always find myself requiring a substantial number of iterations (100k for a reasonably accurate prediction)].
This hypothesis started when I performed 20k iterations with an alpha value of 0.001. The prediction I consistently make (150, 10, 5) returned $234,018 (while sklearn’s prediction sits at around $235k), which means it was very close to perfection. When the graph section appeared, surprisingly, it illustrated a kind of ‘L’ where the horizontal line was long and flat, suggesting that convergence had happened in previous iterations of my gradient descent.
The quandary lies in the fact that when I decrease the number of iterations, it produces values of approximately $225k, which further bolsters, in my view, the hypothesis I outlined earlier.

I would greatly appreciate your insights and guidance on these matters. If there’s a concept I’m misunderstanding or if there might be an issue in my code, please let me know. Additionally, could you please explain how I can create the graphs that were used in the “Optional Lab: Feature Scaling and Learning Rate (Multi-variable)”? Understanding this process would greatly benefit my future work.
I have attached the two graphs that I would most like to learn how to make.

Thank you once again for your invaluable assistance and support!

TMosh · August 11, 2023, 5:17pm

This means your learning rate is too high. You need to reduce the learning rate, and that means you’ll also have to increase the number of iterations.
When you’re using fixed-rate gradient descent, that’s just how it works.
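
A toy illustration of that behavior, assuming the simplest possible cost J(w) = w² (gradient 2w) rather than the regression cost from the thread. The `descend` helper is made up for this sketch:

```python
def descend(alpha, w0=1.0, steps=6):
    """Fixed-rate gradient descent on J(w) = w**2, whose gradient is 2*w."""
    path = [w0]
    w = w0
    for _ in range(steps):
        w = w - alpha * 2 * w    # each step multiplies w by (1 - 2*alpha)
        path.append(w)
    return path

smooth = descend(alpha=0.1)      # factor  0.8: approaches 0 from one side
bouncy = descend(alpha=0.9)      # factor -0.8: jumps across the minimum, slowly shrinking
diverging = descend(alpha=1.1)   # factor -1.2: jumps across the minimum and grows

for name, path in [("smooth", smooth), ("bouncy", bouncy), ("diverging", diverging)]:
    print(name, [round(w, 3) for w in path])
```

The alternating sign in the second and third runs is the one-dimensional version of a cost that keeps bouncing between values near the end of training.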

If you normalize the data set, then you can use higher learning rates, and therefore fewer iterations.

But that doesn’t address the fundamental shortcoming of gradient descent: Fixed-rate gradient descent is extremely inefficient. This is because as the gradients approach zero, the progress toward the minimum also slows down, yet it keeps consuming the same amount of computational resources to get decreasing improvements.

sklearn uses more sophisticated methods, so it converges faster and with less effort.

rmwkwok · August 12, 2023, 2:20am

Hello @Sebastian_Rathmann,

Good work! It is a bit above my expectations!

That is for certain. Your gradient descent is “Batch Gradient Descent”, whereas sklearn’s SGDRegressor is, as its name says, “SGD”, or Stochastic Gradient Descent. I don’t recall where our MLS explains the difference, but I am sure you can find it online pretty quickly. We can discuss why SGD is faster, but I hope you can get familiar with the topic and come up with some reasons yourself first.

To give us some hope, I am able to reproduce SGDRegressor’s result with 2 changes to your code:

And the changes are:

What parameters to send to SGDRegressor when we call it. I encourage you to go through the sklearn doc page for its parameters one-by-one, and determine what value should be taken to match with your gradient descent. This is a very good exercise to challenge your understanding. I have prepared two hints for you in case it is helpful:
**Click here for what parameters I have used**
I have sent 7 parameters in total into `SGDRegressor`
**Click here for exactly what parameters I have used**
```
     Gradient_Sklearn = SGDRegressor(
          max_iter=5000,
          penalty=???, 
          tol=???, 
          shuffle=???, 
          learning_rate=???, 
          eta0=???, 
          n_iter_no_change=???
      )
```

I have added three lines and changed another three lines in your grad_dec.py’s Gradient_Decent to convert your current Batch GD to SGD. I have prepared one hint for you in case it is helpful:
**Click here for the six lines**
```
 for i in range(len(y)):
     _X = X[[i], :]
     _y = y[[i]]

     dj_dw, dj_db = Compute_Gradient(_X, _y, w, b)

     w = w - alpha * dj_dw

     b = b - alpha * dj_db
```

If you look at my screenshot again, I have used different code to do the printing work, and here it is for your reference:

with np.printoptions(precision=6):
    info = [
        ['Sklearn SGDRegressor', w_sklearn, b_sklearn],
        ['My', w, b],
        ['Sklearn LinearRegression', w_lr_sklearn, b_lr_sklearn],
        # Add more rows when needed
    ]
    message = '\n'.join([f'{name.rjust(30)} parameters: w: {w}, b: {b}' for name, w, b in info])
    print(message)

Lastly, I am also comparing the results with sklearn’s LinearRegression, because it gives you the answer that is closer to the true optimal answer. If you have time after matching SGDRegressor with your GD, do try to match the two sklearn models.

I want you to match the two sklearn models only because they are faster algorithms, so you can experiment more with less waiting time. It is also because you will already have shown that your GD can produce the same result as SGDRegressor, so we can skip your GD for the time being.
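
One way to sketch this kind of cross-check without sklearn: LinearRegression fits ordinary least squares, so a direct least-squares solve with NumPy is used here as a stand-in reference (an assumption made to keep the example NumPy-only). The data, the learning rate, and the iteration count are made up for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((50, 2))                                  # toy features in [0, 1]
y = X @ np.array([3.0, -2.0]) + 1.0 + 0.1 * rng.standard_normal(50)

# Reference answer: least squares on [X, 1], solved directly (no iterations).
A = np.column_stack([X, np.ones(len(y))])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
w_ref, b_ref = theta[:-1], theta[-1]

# Batch gradient descent, run long enough to converge on this small problem.
w, b = np.zeros(2), 0.0
alpha = 0.5
for _ in range(5000):
    err = X @ w + b - y
    w -= alpha * (X.T @ err) / len(y)
    b -= alpha * err.mean()

with np.printoptions(precision=6):
    print("least squares:    w:", w_ref, "b:", b_ref)
    print("gradient descent: w:", w, "b:", b)
```

If the two rows disagree noticeably, the usual suspects are too few iterations or a learning rate that is too small for the number of iterations used.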

Lastly,

You named variables “interactions” in many places. They should be called “iterations” instead.
Understanding the difference between Batch GD and SGD is a challenge. There is also “Mini-batch GD”, and you may learn about it while learning about the first two, because they are usually introduced together.
Figuring out which parameters (and which values) I have sent to SGDRegressor to reproduce the result of your (modified) GD is a challenge that tests how well you understand the parameters. A good ML practitioner should know the parameters very well.
Matching the two sklearn models is also a challenge that tests how well you understand the parameters. A good ML practitioner should know the parameters very well.

Good luck!
Raymond

rmwkwok · August 12, 2023, 2:22am

For the graphical representation part, after you have finished the first and second challenges above, you can remake the graphs. If you still have questions about the new graphs, ask again and we can discuss them.

Cheers,
Raymond

Sebastian_Rathmann · August 15, 2023, 4:34pm

Hello Raymond,

I really appreciate your responses; they encourage me to investigate and learn. It’s great to attempt things on my own, and solving something based on your guidance always feels amazing. However, I feel like I’m falling behind in the course. While I would love to spend more time refining my model to the fullest extent, my family is also pressuring me to complete the course, both due to financial reasons and the belief that I won’t have time later. In part, they’re right. I’m currently in my last year of high school, and next month I’ll be starting preparatory courses for college, which will further limit my time.

As a result, I suppose I’ll be putting this topic on hold temporarily. Not without asking you a bit about what I’ve researched:

I’ve delved quite a bit into the difference between Batch Gradient Descent and Stochastic Gradient Descent. I think I was able to implement it in my code using the following:

random_index = random.choice(range(len(y)))
X_random = X[[random_index], :]
Y_random = y[[random_index]]

When comparing it to the hint you provided, I realized that your code doesn’t use any method for randomizing the values. In fact, it uses a nested sequence of loops, where for each iteration, you’re sending individual values of x and y one by one. The most puzzling part for me to comprehend is why use another loop. Isn’t the idea of SGD to send random values one by one instead of sending all the information at once? As I understand it, in your code, you’re not choosing a random number per iteration; instead, for each iteration, you’re “optimizing” the algorithm to go one by one instead of sending all the information at once. The question is, shouldn’t this also involve modifying the Compute_Gradient function? It’s currently set up with a loop to iterate through each parameter and then vectorize for the multiplication of features. How would this work now?

To be honest, I’m not sure if I managed to express my doubts well in words, but essentially, your code made everything work perfectly. However, I feel that in theory, you’re performing a batch gradient descent, but in an optimized manner (which is great, but not what I expected). I also find it challenging to understand the mechanism that makes it work so well.

Thank you very much!

rmwkwok · August 16, 2023, 1:39am

Hello @Sebastian_Rathmann,

You express things very clearly, and it is a pleasure working with you!

While learning ML is a challenge, time management is also a challenge and is even more critical, so please stick to your plan. This place remains accessible even after you have completed the specialization, and I always look forward to seeing your new investigations in the future.

The only difference in terms of code between Batch GD, Mini-batch GD, and Stochastic GD (SGD) is what fraction of the samples we use in one gradient descent (aka one step). By convention, each iteration (aka epoch) always consumes all samples. Consequently, the three methods have different numbers of steps per epoch. Below is an example with a training set of 2048 samples:

| Method | Number of samples per step | Number of steps per epoch |
| --- | --- | --- |
| Batch GD | 2048 | 2048/2048 = 1 |
| Mini-batch GD | 32 | 2048/32 = 64 |
| SGD | 1 | 2048/1 = 2048 |

Note again that “one gradient descent” means “one step”, and “one iteration” means “one epoch”.

Therefore, two levels of loops are needed in general. The outer loop goes over the required number of epochs, whereas the inner loop goes over the required number of steps. Batch GD is a special case in which there is always only one step per epoch, so the inner loop can be dropped.
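
The two levels of loops can be sketched as one function where only `batch_size` changes between the three methods. This is an illustrative sketch, not the code from the repository: `compute_gradient` and `gradient_descent` are made-up names, the data is random, and no shuffling is done:

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """MSE gradient for linear regression on whatever samples are passed in."""
    err = X @ w + b - y
    return X.T @ err / len(y), err.mean()

def gradient_descent(X, y, alpha, num_epochs, batch_size):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    num_steps = 0
    for _ in range(num_epochs):                    # outer loop: epochs
        for start in range(0, m, batch_size):      # inner loop: steps within one epoch
            X_step = X[start:start + batch_size]
            y_step = y[start:start + batch_size]
            dj_dw, dj_db = compute_gradient(X_step, y_step, w, b)
            w = w - alpha * dj_dw
            b = b - alpha * dj_db
            num_steps += 1
    return w, b, num_steps

rng = np.random.default_rng(0)
X = rng.random((2048, 3))                          # 2048 samples, as in the example above
y = X @ np.array([1.0, 2.0, 3.0]) + 4.0

for name, batch_size in [("Batch GD", 2048), ("Mini-batch GD", 32), ("SGD", 1)]:
    _, _, steps = gradient_descent(X, y, alpha=0.01, num_epochs=1, batch_size=batch_size)
    print(f"{name}: {steps} step(s) in one epoch")
```

With `batch_size` equal to the number of samples, the inner loop runs exactly once per epoch, which is why Batch GD is usually written with a single loop.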

Random sampling is another topic, independent of the version of GD. Generally, before we start an epoch, we shuffle the training data. This makes sure each step sees a different combination of samples. However, because Batch GD always uses 100% of the samples, it is a special case in which shuffling changes nothing, and so it can be dropped.

These are the reasons why, for Batch GD, we do not have an inner loop for steps and we do not randomize the training data.

This is also why, for my SGD, I used an inner loop for steps. I didn’t do randomization, NOT because it is wrong to do so, but because I wanted to show you that my SGD can reproduce the result of sklearn’s SGDRegressor. In sklearn’s SGDRegressor, I have set its shuffle parameter to False. Even if I set it to True for sklearn’s SGD and use randomization in my SGD, there is no guarantee that sklearn and I shuffle the data in the same way. If we shuffle differently, I can’t expect the same result.
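
A small sketch of why the shuffling order matters for reproducibility, assuming a per-sample SGD written from scratch. The `sgd` function, its `seed` argument, and the data are made up for this illustration and say nothing about how sklearn shuffles internally:

```python
import numpy as np

def sgd(X, y, alpha, num_epochs, seed=None):
    """Per-sample SGD. Shuffles the sample order each epoch only when a seed is given."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    rng = np.random.default_rng(seed) if seed is not None else None
    for _ in range(num_epochs):
        order = rng.permutation(m) if rng is not None else np.arange(m)
        for i in order:                      # one step per sample
            err = X[i] @ w + b - y[i]
            w = w - alpha * err * X[i]
            b = b - alpha * err
    return w, b

data_rng = np.random.default_rng(0)
X = data_rng.random((20, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + 0.1 * data_rng.standard_normal(20)

w_fixed_a, b_fixed_a = sgd(X, y, alpha=0.1, num_epochs=1)          # fixed order
w_fixed_b, b_fixed_b = sgd(X, y, alpha=0.1, num_epochs=1)          # fixed order again
w_seed1, b_seed1 = sgd(X, y, alpha=0.1, num_epochs=1, seed=1)      # shuffled, seed 1
w_seed2, b_seed2 = sgd(X, y, alpha=0.1, num_epochs=1, seed=2)      # shuffled, seed 2

print("fixed order, run 1:", w_fixed_a, b_fixed_a)
print("fixed order, run 2:", w_fixed_b, b_fixed_b)
print("shuffled, seed 1:  ", w_seed1, b_seed1)
print("shuffled, seed 2:  ", w_seed2, b_seed2)
```

The two fixed-order runs agree exactly, while the two shuffled runs visit the samples in different orders and end at different parameters. Two implementations can therefore only be compared step for step when they are guaranteed to see the samples in the same order.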

I would also like to give a very, very brief introduction to how Batch GD, Mini-batch GD, and SGD differ in their impact. Since it is brief, it might be inaccurate in some situations. Also, you may not fully understand it yet, but please just take my word for it for now, and think about it and criticize it when the time comes. Again, it is not completely accurate, so it is open to criticism.

SGD performs the largest number of gradient descent steps in each epoch, and so it moves faster towards the goal (at least compared to Batch GD in most cases).
The role of training data is sometimes overlooked by learners. The data decides the optimal weights. The data decides the cost surface. The data decides the next gradient descent step. Now, the three versions of GD supply different numbers of samples in each step, so they see different sets of optimal weights, different cost surfaces, and different next steps.
If we wish our model to learn to perform as well as possible with respect to the whole training set, then Batch GD is the safest approach. However, it is slow (and there are other downsides). SGD is fast, but since it performs GD on only one sample at a time, its cost surface deviates the most from the surface that Batch GD would see. And since SGD can see a very different sample at each step, its cost surface also keeps changing, which makes the steps stochastic (going back and forth, sometimes called a noisy convergence path). Therefore, by switching from Batch GD to SGD, we are trading a clean convergence path for speed. This is also why we have Mini-batch GD as something in between Batch GD and SGD. Mini-batch GD is the most commonly used approach.
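
To make the “different cost surfaces” point a little more concrete, here is a sketch that compares the gradient each single sample would give (what one SGD step sees) with the gradient over the whole set (what Batch GD sees). The data is made up and the squared-error cost with a factor of 1/2 is an assumption matching the usual course convention:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 2))                                   # 8 toy samples, 2 features
y = X @ np.array([2.0, -1.0]) + 0.5 + 0.1 * rng.standard_normal(8)
w, b = np.zeros(2), 0.0                                  # current parameters

err = X @ w + b - y                                      # one residual per sample

# Gradient w.r.t. w that each individual sample would produce in an SGD step.
per_sample_dw = err[:, None] * X                         # shape (8, 2)

# Gradient w.r.t. w over the whole training set, as used by Batch GD.
batch_dw = X.T @ err / len(y)                            # shape (2,)

with np.printoptions(precision=3, suppress=True):
    print("per-sample gradients:\n", per_sample_dw)
    print("batch gradient:", batch_dw)
    print("mean of per-sample gradients:", per_sample_dw.mean(axis=0))
```

Each row points in a somewhat different direction, which is the source of the noisy path, yet their average is exactly the batch gradient, which is why the noisy steps still tend towards the same goal.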

Lastly, when you have time in the future, please come back and tune the parameters to match your SGD with sklearn’s SGD. There are actually a lot of discussion points around those parameters. For example, do you expect that applying shuffle = True makes any difference to the final result, and why?

Really lastly, I don’t expect any work or feedback, since you have your schedule to stick to. From what I have seen so far, you have done really good work and you keep improving it. Please keep this momentum in your learning, as it will help you through a lot of ups and downs.

Good luck, keep learning, and see you later.

Cheers,
Raymond
