In the course we are working with Jupyter Notebook, which is amazing. I have installed the extension in VS Code on my PC, so I can fiddle with it on my own and not mess up the assignments. In a real-world scenario, when you are working at a company, do they use Jupyter? Or do they use VS Code (or other programs) and make separate Python files for each function?
I was thinking of creating one Python file called model.py, for example, and defining the model in there, then making another file for training the model and another file that combines everything. So when I want to make a small change, I do have to run the whole script. Jupyter is nice, but when the code is long, it takes a bit of scrolling to find what you need. Markdown helps in that case, but I feel like splitting the code into multiple small files sounds better.
Since I am learning ML/DL right now, I would like to start developing good habits to make my life (and others') easier in the future.
You are right, we should absolutely follow software engineering best practices when putting ML models into production in any corporate environment. The code should have all the '-ilities': scalability, reliability, etc.
Data science work in a corporate environment usually follows two distinct steps. Initially we are just researching a problem, trying out different algorithms, pre-processing steps, etc., and the flexibility provided by Jupyter (or VS Code + Jupyter extension) is very handy here. But once we are sure how to tackle the problem and are ready to deploy to production, the code absolutely must be written like any other piece of software (following coding guidelines, adhering to good design principles, etc.).
There are two extremes here as well. At one extreme, you write clean code right from the start, breaking components into modules even in the R&D phase. If you do this, your life will be much simpler when it is time to productionize; this approach is usually followed by teams with a strong software background. Just make sure you retain enough flexibility, as the goal at this point is to try out different things quickly and see what leads to good results.
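To make the modular approach a bit more concrete, here is a minimal sketch of the model / training / entry-point split. Everything in it is an assumption for illustration: the file names in the comments (model.py, train.py, main.py), the names `LinearModel`, `TrainConfig`, `train`, `make_toy_data` and `main`, and the toy linear-regression task itself. It is shown as a single block so it runs as-is; in a real project each marked section would probably live in its own file and be imported.

```python
# Single-file sketch; the comments mark where each piece might live
# in a hypothetical project layout (model.py, train.py, main.py).
from dataclasses import dataclass
import random


# --- model.py (hypothetical): only the model definition ---
@dataclass
class LinearModel:
    weight: float = 0.0
    bias: float = 0.0

    def predict(self, x):
        return self.weight * x + self.bias


# --- train.py (hypothetical): only the training logic ---
@dataclass
class TrainConfig:
    # Keeping hyperparameters in one place makes it cheap to try variations.
    learning_rate: float = 0.05
    epochs: int = 500
    seed: int = 0  # used for data generation in this sketch


def mean_squared_error(model, data):
    return sum((model.predict(x) - y) ** 2 for x, y in data) / len(data)


def train(model, data, config):
    """Fit the model in place with plain gradient descent; return the final loss."""
    n = len(data)
    for _ in range(config.epochs):
        # Compute both gradients before updating either parameter.
        grad_w = sum(2 * (model.predict(x) - y) * x for x, y in data) / n
        grad_b = sum(2 * (model.predict(x) - y) for x, y in data) / n
        model.weight -= config.learning_rate * grad_w
        model.bias -= config.learning_rate * grad_b
    return mean_squared_error(model, data)


# --- main.py (hypothetical): wires everything together ---
def make_toy_data(seed, n=50):
    # Noise-free samples from y = 2x + 1, so the target parameters are known.
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 2.0 * x + 1.0) for x in xs]


def main(config=None):
    config = config or TrainConfig()
    data = make_toy_data(config.seed)
    model = LinearModel()
    loss = train(model, data, config)
    return model, loss


if __name__ == "__main__":
    model, loss = main()
    print(f"weight={model.weight:.3f} bias={model.bias:.3f} loss={loss:.6f}")
```

With a layout like this, changing the model only touches the model section, and you can still call `train(...)` or `main(TrainConfig(epochs=50))` from a notebook cell while experimenting, so you keep the flexibility mentioned above.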
The other extreme is a team of data scientists without much software background, who will do almost all of their work in Jupyter notebooks. To assist them, we have platforms like Kubeflow, which can take the code in a Jupyter notebook, create a container out of it, and deploy it onto a Kubernetes cluster. But even in this scenario, the data scientist has to make sure the code is written the way Kubeflow expects, and the results absolutely need to be reproducible.
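On the reproducibility point, one common habit (not specific to Kubeflow) is to make every source of randomness depend on an explicit seed. The sketch below uses only Python's standard `random` module; the function name `run_experiment` and what it returns are made up for illustration. Libraries such as NumPy, PyTorch, or TensorFlow have their own random number generators that would need to be seeded separately.

```python
import random


def run_experiment(seed):
    """Toy 'experiment': random initial weights plus a shuffled data order."""
    # A local generator avoids relying on hidden global random state.
    rng = random.Random(seed)
    initial_weights = [rng.gauss(0.0, 1.0) for _ in range(3)]
    data_order = list(range(10))
    rng.shuffle(data_order)
    return initial_weights, data_order


if __name__ == "__main__":
    first = run_experiment(seed=42)
    second = run_experiment(seed=42)
    print("same seed gives identical run:", first == second)
```

Whether in a notebook or a script, passing the seed in as a parameter (rather than seeding once at the top and relying on execution order) means re-running a cell or a function gives the same result each time.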
Of course, in the long run only code written following good software practices will survive. Google has an excellent paper covering these points and more.
Hope this helps.
It does, thank you for your reply!