I implemented the algorithm in numpy, as in the practice exercise for optimization methods, and did the same thing with tensorflow, but the tensorflow version was much slower.
I also made the model more complex and used another dataset. But even with multiprocessor usage activated in tensorflow, the direct numpy implementation is ~10x faster than tensorflow.
I would like to know what is slowing tensorflow down that much, and whether there is a way to avoid it.

Keep in mind also that TensorFlow's big selling point is not its efficiency, it's the ease with which you can create complex models that work reliably.

However, when implementing the tensorflow version, I just stacked 5 of the default dense layers with tanh activation and used the built-in adam optimizer and mse loss function. So I have no idea where I would use the tensorflow graph function.
Maybe it is because in numpy I coded the backprop manually, while tensorflow needs some more logic to decide on the right backprop calculations?
Maybe it is just due to some calculation overhead (I saw that an empty for loop iteration in python takes ~500 CPU cycles, while in C it is just a few)?

Oh! I thought you were re-implementing all the logic from scratch, but based on your reply, if you built the network with tf.keras.Sequential(...) and used tf.keras.losses.XXX as your loss function, then there was no need to use the tensorflow graph function.

Overhead is a possible cause, but it is only significant if the training set size is small. Is it small? Btw, we can find the overhead by varying the training set size.
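
To make the "vary the training set size" idea concrete: time the training at a few sizes and fit a straight line through the measurements. The intercept is then a rough estimate of the fixed overhead. This is only a minimal sketch, assuming training time grows roughly linearly with the number of samples. `fit_line` and `time_runs` are hypothetical helper names, and `train_fn` stands for whatever function trains your model on the first `n` samples.

```python
import time


def fit_line(sizes, times):
    """Least-squares fit of: time ~ overhead + per_sample * size."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    per_sample = sxy / sxx                    # slope: seconds per sample
    overhead = mean_y - per_sample * mean_x   # intercept: fixed cost in seconds
    return overhead, per_sample


def time_runs(train_fn, sizes):
    """Call train_fn(n) for each size n and return the wall-clock durations."""
    times = []
    for n in sizes:
        start = time.perf_counter()
        train_fn(n)  # hypothetical: trains on the first n samples
        times.append(time.perf_counter() - start)
    return times
```

Running `time_runs` once for the numpy version and once for the tensorflow version, then comparing the two intercepts and the two slopes, should show whether the gap is mostly a fixed cost or a cost per sample.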

I think it would be useful to compare these numbers:

|                             | numpy | tensorflow |
|-----------------------------|-------|------------|
| batch size                  | ?     | ?          |
| number of batches per epoch | ?     | ?          |

One case: if we do batch training in numpy but mini-batch training in tensorflow with, say, 5 batches, then I would not be surprised if numpy is faster.

Hmm, I did both. In numpy I did it from scratch, but in tensorflow I just used the default sequential model and loss function.
What I was wondering about is that numpy is 10x faster.

I used the same training data and batch size for both, and I also get very similar losses with both variants.

Here are my values:
The total training dataset has ~50000 samples, with a width of X=5 and Y=1; both are floats.
The batch size is 64.
For testing, I actually use 5 hidden layers with 18, 13, 9, 5 and 3 neurons, but the effect is the same with slightly different topologies. I assume tensorflow performs better for big CNNs, but I want to understand the bottleneck.

50000 samples and 5 features per sample. That doesn't look like much to me. The network is not big, either.
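
To put numbers on "not much" and "not big", here is a quick back-of-the-envelope calculation from the values you posted. It assumes the last, partial batch of each epoch is kept, that every dense layer has a bias term, and that the output layer has a single neuron (since Y has width 1).

```python
import math

n_samples = 50_000
batch_size = 64
# input width, the five hidden layers, then the assumed single-neuron output layer
layer_widths = [5, 18, 13, 9, 5, 3, 1]

# update steps per epoch, keeping the last partial batch
steps_per_epoch = math.ceil(n_samples / batch_size)

# weights plus biases of each dense layer
n_params = sum(
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_widths, layer_widths[1:])
)

print(steps_per_epoch)  # 782
print(n_params)         # 553
```

So each epoch is several hundred very small update steps on a model with a few hundred parameters. In that regime any fixed cost per step would weigh heavily compared with the actual arithmetic.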

I seldom train with numpy so I am not sure, but perhaps you might really want to find out tensorflow's overhead: it does things in between two epochs. Also, the first epoch is usually slower than the rest.
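
A simple way to check the first-epoch effect is to time each epoch separately and compare the first one against the average of the rest. The sketch below is framework-agnostic. `time_epochs` and `summarize` are hypothetical helper names, and `run_epoch` is any zero-argument function that trains for exactly one epoch.

```python
import time


def time_epochs(run_epoch, n_epochs):
    """Run one epoch at a time and record how long each one takes."""
    durations = []
    for _ in range(n_epochs):
        start = time.perf_counter()
        run_epoch()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (first epoch time, mean time of the remaining epochs)."""
    first = durations[0]
    rest = durations[1:]
    mean_rest = sum(rest) / len(rest) if rest else float("nan")
    return first, mean_rest


# Hypothetical usage with Keras, assuming `model`, `X` and `Y` already exist:
#   durations = time_epochs(
#       lambda: model.fit(X, Y, batch_size=64, epochs=1, verbose=0), 5)
#   print(summarize(durations))
```

Note that calling `fit` once per epoch may add some setup cost of its own to every measurement, so treat the numbers as rough. The same two helpers can wrap one epoch of the numpy training loop for a like-for-like comparison.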