You may be interested in this thread.
I've listed some data sources that provide images and annotations, which should include the label text. The number of classes is relatively large, so training may take additional time. You will also need to modify the code slightly, but I expect the basic structure of the neural network to stay similar.
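
As a rough illustration of the kind of small change I mean: the main thing that usually differs between datasets is the set of classes, so one option is to derive it from the label text in the annotations. This is only a sketch. The record layout (`image` and `label` keys) and the helper name `build_label_index` are my own assumptions, not the format of any particular data source.

```python
from collections import Counter


def build_label_index(annotations):
    """Map each distinct label text to an integer class id.

    Labels are sorted so the mapping is the same on every run.
    """
    counts = Counter(record["label"] for record in annotations)
    label_to_id = {label: i for i, label in enumerate(sorted(counts))}
    return label_to_id, counts


# Hypothetical annotation records; a real dataset will have its own schema.
annotations = [
    {"image": "img_001.jpg", "label": "cat"},
    {"image": "img_002.jpg", "label": "dog"},
    {"image": "img_003.jpg", "label": "cat"},
]

label_to_id, counts = build_label_index(annotations)

# The number of distinct labels is the size the output layer would need.
num_classes = len(label_to_id)
```

With a mapping like this, the rest of the network can probably stay as it is, and only the size of the final layer (`num_classes`) changes. The per-label counts are also handy for spotting classes with very few examples.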