Ok…so a fairly deep rabbit hole later…
So I thought about this problem a bit and came to the conclusion that building a Docker image where the versions of python, tensorflow, etc. match the online deep learning notebook environment would be beneficial. It would be easy to share, and once it works, it works.
So we can print some of the package versions by modifying the code in one of the online notebooks, something like this:
import sys  # gives access to the interpreter version string
import tensorflow as tf  # needed for tf.__version__ below
import numpy as np  # needed for np.__version__ below
import six
import h5py
import packaging
import opt_einsum
import keras
import matplotlib
import scipy
# import sklearn  # scikit-learn (imports as sklearn, not scikit_learn)
import jupyter
print("python " + str(sys.version))
print("tensorflow " + str(tf.__version__))
print("numpy " + np.__version__)
print("six " + six.__version__)
print("h5py " + h5py.__version__)
print("packaging " + packaging.__version__)
print("opt_einsum " + opt_einsum.__version__)
print("keras " + keras.__version__)
print("matplotlib " + matplotlib.__version__)
print("scipy " + scipy.__version__)
print("jupyter " + jupyter.__version__)
Note I say "some", as this list is not the entire transitive dependency graph of the online notebook. I wish somebody with access to the hosting environment could provide us with the actual dependency tree, but more on this later.
This block run on the CNN W2A1 notebook gives me:
python 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0]
tensorflow 2.9.1
numpy 1.22.3
six 1.15.0
h5py 3.6.0
packaging 20.9
opt_einsum v3.3.0
keras 2.9.0
matplotlib 3.5.2
scipy 1.7.3
jupyter 1.0.0
ok, so that’s a start.
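To get closer to the full picture than a hand-picked list, something along these lines could be run in the online notebook and again in the local environment, and the two outputs diffed. This is only a sketch: `installed_versions` is a helper name made up for illustration, and it assumes the online notebook lets you import `importlib.metadata` (standard library from Python 3.8 onwards). It lists installed distributions, not the dependency tree itself.

```python
# Sketch: list every installed distribution and its version, sorted by name.
from importlib import metadata


def installed_versions():
    """Return a sorted list of (name, version) for all installed distributions."""
    found = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        # Skip broken installs that have no readable metadata or no name
        name = meta.get("Name") if meta is not None else None
        if name:
            found[name.lower()] = dist.version
    return sorted(found.items())


if __name__ == "__main__":
    for name, version in installed_versions():
        # Same "name==version" shape that pip freeze prints
        print(f"{name}=={version}")
```

Saving the output from both environments to text files and diffing them should make any package that differs, or is missing on one side, stand out.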
As mentioned above, one problem on the M1/Mx Apple silicon Macs is that tensorflow 2.9.0 does not exist in any installable form in the pip etc. repositories. So off we go compiling the thing from source.
For the uninitiated, compiling tensorflow from source is not for the faint of heart. On my maxed-out M1 Max MacBook it's about a one-hour process with a plethora of possibilities for getting things wrong and breaking the build somewhere halfway through…just to restart the whole one-hour process. Oh what fun!
Anyway, so I created a Dockerfile which installs all the build prerequisites, clones the tensorflow source, makes sure to install the correct versions of all the above mentioned python packages so that we compile tensorflow against the right stuff and off we go.
After various convolutions (ha) I got the docker build to complete and to also contain all the right packages and the jupyter python package so that the docker container could start a jupyter server.
The idea here is that I can start the docker container on my machine, it will have all the “right stuff” ™ and will also run a jupyter notebook server. I can then sit in my local VSCode environment and connect to the container jupyter kernel via a port and for VSCode things should look like I was developing locally…only now I should have the right versions of everything and more or less exactly replicate the online jupyter environment.
I mean…how hard can it be?
Well turns out this seems somewhat non-trivial. I did all the above. I have the container running. When I connect VSCode to that jupyter kernel (via an exposed port from the container) I can execute code in the notebook and my version printing block from above exactly matches the versions from the online environment.
It turns out, however, that it was now boot-in-face o'clock. Even after all the above, with the correct tensorflow version, the correct numpy version, and correct versions of all the packages listed above, the results still do not match the online environment and the tests keep failing.
Output in VSCode from my code block:
my results: https://photos.app.goo.gl/ZqaVECr1xVhrD2DQ7
exception: https://photos.app.goo.gl/ZiFnRcQmK3LHzswv8
expected results: https://photos.app.goo.gl/o9s2vubaDWfC2Yd28
If you don't feel like staring at images, the first non-zero value in the Training=True matrix should be 0.40732 according to the tests, but is instead 0.40733 (with a 3 at the end), which seems to break the tests.
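For a sense of scale, here is a small sketch of how a difference in the fifth decimal behaves under a tolerance-based comparison. It assumes the tests use something like `np.isclose` / `np.allclose` with default tolerances, which is a guess, since the actual test code is not shown here. The two numbers are the rounded values quoted above, so the real difference may be somewhat smaller or larger.

```python
import numpy as np

expected = 0.40732  # value the tests expect, as quoted above
actual = 0.40733    # value produced in the local container

diff = abs(actual - expected)

# np.isclose passes when |actual - expected| <= atol + rtol * |expected|
default_ok = np.isclose(actual, expected)            # defaults: rtol=1e-5, atol=1e-8
loose_ok = np.isclose(actual, expected, atol=1e-4)   # hypothetical looser tolerance

print(f"difference: {diff:.1e}")
print("passes with default tolerances:", bool(default_ok))
print("passes with atol=1e-4:", bool(loose_ok))
```

With these two numbers the default check fails and the looser one passes, so a mismatch in the last printed digit is enough to trip a strict comparison.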
My current suspicion is that I’ve missed some crucial package which is responsible for the discrepancy. Could also be that the computer architecture actually affects the calculations somehow.
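On the architecture idea: floating point addition is not associative, so the same sum evaluated in a different order can round differently, and different CPUs or math libraries may well order their reductions differently. The sketch below only illustrates that general effect with hand-picked float32 values. It does not show that this is what happens inside tensorflow here.

```python
import platform
import numpy as np

# Worth printing on both sides: "x86_64" vs "arm64"/"aarch64" would confirm
# that the two environments really run on different CPU architectures.
print("machine:", platform.machine())

a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

left_to_right = (a + b) + c  # a + b is exactly 0, then 0 + 1 gives 1.0
right_to_left = a + (b + c)  # b + c rounds back to -1e8, so the total is 0.0

print("(a + b) + c =", left_to_right)
print("a + (b + c) =", right_to_left)
```

The example is deliberately extreme. In a real network the effect per operation is tiny, but it can accumulate over many layers into a difference in the last few printed digits.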
I would love for somebody with some deeper understanding to chime in and perhaps point me towards a likely candidate. Adding another package with a specific version at this point is not a whole lot of work. I just need to modify the Dockerfile and re-run the build, which at this point will probably no longer fail.
The upside here is that if I could get this to run, I can share the Docker image and write up a page on how to use it. This would mean that anybody with an Apple silicon Mac could run their notebooks locally with a couple of lines of code.
Anyway, any help is much appreciated. I am at the moment out of ideas on how to proceed from here.