Hello Jordhan,
Regarding your question about the use of 2m and m: dividing by 2m instead of m is just a convenience that simplifies the gradient descent computation. Differentiating the squared term produces a factor of 2, which cancels with the 1/2.
The 1/m averages the squared error over the number of training examples, so the value of the cost function does not grow just because there are more examples.
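If it helps to see the cancellation concretely, here is a rough sketch. It assumes a one-parameter model h(x) = theta * x and a tiny made-up dataset; the function names and numbers are mine, purely for illustration. The analytic gradient has no leftover factor of 2, and it matches a finite-difference check.

```python
def cost(theta, xs, ys):
    """Squared-error cost with the 1/(2m) scaling."""
    m = len(xs)
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient(theta, xs, ys):
    """Analytic derivative of cost: the 2 from the square cancels the 1/2."""
    m = len(xs)
    return sum((theta * x - y) * x for x, y in zip(xs, ys)) / m


def numerical_gradient(theta, xs, ys, eps=1e-6):
    """Central finite difference, used only to check the analytic form."""
    return (cost(theta + eps, xs, ys) - cost(theta - eps, xs, ys)) / (2 * eps)


# Hypothetical toy data (y = 2x) and an arbitrary starting theta.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
theta = 0.5

print(gradient(theta, xs, ys))            # analytic gradient
print(numerical_gradient(theta, xs, ys))  # should agree closely

# Duplicating the dataset leaves the cost unchanged because of the 1/m averaging.
print(cost(theta, xs, ys), cost(theta, xs * 2, ys * 2))
```

With 1/m instead of 1/(2m), the gradient would simply be twice as large, which you could absorb into the learning rate, so the minimizing theta is the same either way.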
Here’s a link that offers more insight.