Why do you get a much higher loss if you use ‘Adam’ instead of ‘sgd’ with 500 epochs?
Optimisation algorithms are designed to minimise the error (the loss) when training your model. Their efficiency can be measured by their speed of convergence (how many epochs you need to reach a global optimum) and by their generalisation capabilities (how well your model reacts to new data).
In our extremely simple case, SGD converges quicker than Adam within the 500 epochs, and that’s why you get a lower loss with it. We shouldn’t draw any general conclusion from that, as you’ll find many cases where Adam is much faster and better than SGD. As a suggestion, you can also play with learning_rate, and you’ll notice that the results can change in favour of Adam.
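To make the comparison concrete, here is a minimal sketch in plain Python. It is not the original model: it assumes a one-weight linear fit on a toy dataset (y = 2x - 1), zero-initialised parameters, full-batch updates, and step sizes of 0.01 for SGD and 0.001 for Adam (the Keras defaults). The names train_sgd and train_adam are hypothetical helpers written for this illustration.

```python
# Hypothetical stand-in for the original model: fit y = w*x + b with
# mean squared error, using full-batch updates for 500 epochs.
XS = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]   # assumed toy data, y = 2x - 1
YS = [-3.0, -1.0, 1.0, 3.0, 5.0, 7.0]
EPOCHS = 500


def mse(w, b):
    """Mean squared error of the line w*x + b on the toy data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(XS, YS)) / len(XS)


def gradients(w, b):
    """Gradient of the MSE with respect to w and b."""
    n = len(XS)
    dw = sum(2.0 * (w * x + b - y) * x for x, y in zip(XS, YS)) / n
    db = sum(2.0 * (w * x + b - y) for x, y in zip(XS, YS)) / n
    return dw, db


def train_sgd(learning_rate=0.01, epochs=EPOCHS):
    """Plain gradient descent; returns the final loss."""
    w, b = 0.0, 0.0  # assumed zero initialisation
    for _ in range(epochs):
        dw, db = gradients(w, b)
        w -= learning_rate * dw
        b -= learning_rate * db
    return mse(w, b)


def train_adam(learning_rate=0.001, epochs=EPOCHS,
               beta1=0.9, beta2=0.999, eps=1e-7):
    """Adam with bias-corrected moment estimates; returns the final loss."""
    params = [0.0, 0.0]  # [w, b], assumed zero initialisation
    m = [0.0, 0.0]       # running mean of gradients
    v = [0.0, 0.0]       # running mean of squared gradients
    for t in range(1, epochs + 1):
        grads = gradients(params[0], params[1])
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            params[i] -= learning_rate * m_hat / (v_hat ** 0.5 + eps)
    return mse(params[0], params[1])


sgd_loss = train_sgd()                            # SGD, step size 0.01
adam_default_loss = train_adam()                  # Adam, step size 0.001
adam_fast_loss = train_adam(learning_rate=0.1)    # Adam, larger step size

print("sgd   lr=0.01 :", sgd_loss)
print("adam  lr=0.001:", adam_default_loss)
print("adam  lr=0.1  :", adam_fast_loss)
```

Under these assumptions, Adam's normalised update moves each parameter by roughly the learning rate per step, so 500 steps of 0.001 cannot carry the weight from 0 all the way to 2, while SGD's raw-gradient steps get there comfortably. Raising Adam's learning_rate closes that gap, which is the effect the suggestion above points at. The exact numbers will differ from a real Keras run, which uses random initialisation and mini-batches.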