Convolutional sliding windows

I understand the concept, but how would you actually implement this with a built-in network?

I have data that is 45x45 pixels, but in the app I’m making I have a user draw on a canvas that is much bigger (maybe 500x1000 or so). How would I train a model using the data (45x45), but then apply the convolutional sliding windows to the canvas, since I previously told the network to accept input sizes of 45x45?

Please go through transfer learning lectures and update this topic with further questions.

What do you mean update with further questions? This is my only question. If I trained a model using data of size A, how would I input an image of size B, greater than A, if I want to use convolutional sliding windows on image B? Also, I looked through the transfer learning lectures but they are not related to my question!

Here are 2 options based on transfer learning:

  1. Resize the input image to match the trained model's input size, i.e. 45x45. This way, you can expect performance similar to training time as long as the distribution of the resized images falls within that of the training data.
  2. Start with the trained weights of 45x45 images and refine the model based on the new target size (i.e. fine-tuning).
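The resizing in option 1 can be sketched without any framework. Here's a minimal nearest-neighbour resize, just illustrative; in practice you'd use `tf.image.resize` or PIL:

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    # Minimal nearest-neighbour resize; a stand-in for tf.image.resize / PIL.
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

canvas = np.zeros((500, 1000), dtype=np.uint8)  # hypothetical user canvas
small = resize_nearest(canvas, 45, 45)          # matches the trained input size
```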

I’m not sure you understand what I’m saying. I have data for individual math symbols, which are 45x45 pixels. In the app I am making, there is a canvas that is bigger, maybe 500x1000 pixels. Users use the canvas to write equations and expressions, with multiple characters. That is why I want to use the concept of convolutional sliding windows that Andrew talked about in the video Convolutional Implementation of Sliding Windows, so I can identify all the symbols in the whole canvas in one go.

When training, your goal is to predict the class id of a symbol given the image containing only that symbol. When deploying the app, you want to input a screen full of symbols and get the predictions for all symbols in one go. I’m afraid that’s not possible.

Here’s what I recommend:

  1. Train 2 models:
    a. A YOLO model to draw bounding boxes around the symbols on a screen full of them. You might have to do additional work to figure out the right anchor boxes.
    b. A model to perform symbol classification.
  2. In your app, isolate the symbols using the YOLO boxes and cut those sections of the screen for your classifier model to predict the symbol, one small image at a time.
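That two-step pipeline could be sketched roughly like this (the detector and classifier here are hypothetical stand-ins, not real APIs):

```python
import numpy as np

def detect_and_classify(canvas, detect_boxes, classify, size=45):
    """Step 1: get boxes from a detector; step 2: crop, resize, classify each."""
    results = []
    for (x, y, w, h) in detect_boxes(canvas):
        crop = canvas[y:y+h, x:x+w]
        # nearest-neighbour resize down/up to the classifier's 45x45 input
        rows = np.arange(size) * crop.shape[0] // size
        cols = np.arange(size) * crop.shape[1] // size
        results.append(((x, y, w, h), classify(crop[rows][:, cols])))
    return results

# toy stand-ins for the YOLO detector and the symbol classifier
boxes = lambda img: [(10, 20, 60, 60), (200, 300, 50, 45)]
label = lambda patch: int(patch.mean() > 0)
out = detect_and_classify(np.zeros((500, 1000), dtype=np.uint8), boxes, label)
```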

Please see this app


Can you help me understand why it isn’t possible?

Also, doesn’t yolo require data in the form of, in my case, the large canvas, not a single symbol, to train? I only have data for single symbols.

Can someone help me?

YOLO does support class predictions.
2 approaches come to mind:

  1. If you can get training images resembling the real-world canvases, forget your standalone classifier model and use YOLO.
  2. See if you can generate synthetic training data by placing multiple symbols on a blank canvas.
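A hedged sketch of the synthetic-data approach (all names are illustrative; `symbols` stands for your 45x45 training images, and a real generator should also avoid overlaps and vary symbol sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_canvas(symbols, labels, n=5, hw=(500, 1000)):
    """Paste n random 45x45 symbols on a blank canvas; return image + boxes for YOLO."""
    H, W = hw
    canvas = np.zeros((H, W), dtype=np.uint8)
    boxes = []
    for _ in range(n):
        i = int(rng.integers(len(symbols)))
        y = int(rng.integers(0, H - 45))
        x = int(rng.integers(0, W - 45))
        canvas[y:y+45, x:x+45] = np.maximum(canvas[y:y+45, x:x+45], symbols[i])
        boxes.append((labels[i], x, y, 45, 45))  # class id plus box, YOLO-style
    return canvas, boxes

# toy symbol set standing in for the real 45x45 dataset
syms = np.full((3, 45, 45), 255, dtype=np.uint8)
img, boxes = make_canvas(syms, labels=[0, 1, 2])
```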

The closer your training data is to your real-world data, the better the model will train, and the more meaningful the metrics you use for model selection will be.

The reason why you can’t use just a vanilla classifier for your problem is that this task requires 2 steps:

  1. Isolating the portion of the image that stands for a symbol
  2. Predicting the class of the isolated symbol.

A classifier trained on individual images is capable of performing only the 2nd step.

Have you done courses 2 and 3 of this specialization?
What did you think about the app?

The app looks cool, and is almost exactly what I’m trying to do.

My idea was to use convolutional sliding windows by running a convnet trained on 45x45 images on the whole canvas, which would give a final output matrix of each prediction for every “window”, as shown by Andrew.

This is the video I’m referencing: Convolutional Implementation of Sliding Windows - Object Detection | Coursera

There are 3 factors to consider when using convolutional sliding windows:

  1. The stride length (since you need to slide your bounding boxes across the entire image). A smaller stride length could give better results at the cost of increased compute requirements.
  2. The accuracy of the bounding boxes is likely to be poor (since their size is fixed).
  3. Either train your convolutional network using the larger box you want to examine, or fine-tune it to account for the distribution of rescaled images (from target screen to 45x45).
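To make factor 1 concrete, here's a tiny sketch of how the stride trades off against the number of windows (pure NumPy, just for counting; the convolutional implementation computes all of these positions in one forward pass rather than looping):

```python
import numpy as np

def sliding_windows(canvas, win=45, stride=15):
    """Yield every (x, y, crop) window position; a smaller stride means more windows."""
    H, W = canvas.shape
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            yield x, y, canvas[y:y+win, x:x+win]

canvas = np.zeros((500, 1000), dtype=np.uint8)
n_windows = sum(1 for _ in sliding_windows(canvas))          # 1984 windows at stride 15
n_dense = sum(1 for _ in sliding_windows(canvas, stride=5))  # far more at stride 5
```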

Why do you not want to use YOLO?

Since you haven’t answered my question on whether you’ve completed courses 2 and 3, I recommend you complete the courses to better understand the effect of difference in distributions between training / serving datasets.

I do not want to use YOLO because it would require extra real-life data, which I do not have.

Please move this topic to the General Discussions subcategory. Someone might be able to help you out.
Here’s the community user guide to get started.

If it were left to me, I'd start by generating the dataset for YOLO.

Hello @sickopickle ,

YOLO is certainly a great option, but if you want to stick with sliding windows, and given that each object drawn in your 500 x 1000 canvas is recognizable in a 45 x 45 slice, then it is very much worth a try!

I have never done this before but one way is to define a new model with the same architecture but a different input shape. Please follow this walk-through with me:

Consider we are training a model for photos of shape (16, 16, 3):

import tensorflow as tf

model1 = tf.keras.models.Sequential()
model1.add(tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(16, 16, 3)))
model1.add(tf.keras.layers.Conv2D(4, (14, 14), activation='relu'))

But now we want to predict for a photo of shape (32, 32, 3) with the above trained model. We can define a new model with the same architecture except for the input shape:

model2 = tf.keras.models.Sequential()
model2.add(tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model2.add(tf.keras.layers.Conv2D(4, (14, 14), activation='relu'))

Note from both summaries that both models have the same number of parameters (25,988).

Then we can copy the trained weights from model1 to model2:

model2.set_weights(model1.get_weights())

And finally we can pass our batch of 32 x 32 x 3 images into model2 to get an output of shape (N, 17, 17, 4), where N is the batch size, for the 17 x 17 results in each image.
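Putting the whole walk-through together as one self-contained, runnable sketch (assuming standard Keras APIs; `tf.keras.Input` is used here so the code runs unchanged on recent Keras versions):

```python
import tensorflow as tf

# Same architecture twice; only the input shape differs.
model1 = tf.keras.Sequential([
    tf.keras.Input(shape=(16, 16, 3)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.Conv2D(4, (14, 14), activation='relu'),
])
model2 = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.Conv2D(4, (14, 14), activation='relu'),
])

# Conv kernels are independent of the spatial input size, so the weights transfer.
model2.set_weights(model1.get_weights())

out = model2(tf.zeros((1, 32, 32, 3)))  # shape (1, 17, 17, 4)
```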



Thanks so much @rmwkwok ! This is exactly what I’ve been trying to ask. How would I do this for a state-of-the-art model? Would I train a (state-of-the-art) model on 45x45 images, and then copy the trained weights to another (state-of-the-art) model with an input shape of 500x1000? Also, what does the 4 represent in (N, 17, 17, 4)?

Oh, is the 4 from the second Conv2D? Also, you said “Note from both summaries that both models have the same number of parameters (25,988).” If there’s supposed to be a summary showing, I don’t see it.

That's because I didn't copy the summaries here, but if you run the code you will see them :wink:

Yes. For 4 classes, which is the number used in Andrew’s video that you shared earlier.

Yes, this will work. However, since I have never done this before, I can't say my suggestion is the simplest way, but at least it will work :wink:



Thanks so much, I’ll try to implement it!

Ok, I went straight into Google Colab with tons of confidence, and realized EfficientNet (and other state-of-the-art networks) use a Flatten layer plus softmax, and I'm not sure how to deal with that. In the example Andrew used, as well as your example, there is no softmax.
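For reference, the trick in Andrew's "Convolutional Implementation of Sliding Windows" video is to turn the Flatten + Dense (softmax) head into a Conv2D whose kernel covers the whole feature map. A hedged sketch on a toy model in the style of the earlier example (not EfficientNet itself, whose pooling head needs extra care):

```python
import numpy as np
import tensorflow as tf

# Toy classifier with the problematic Flatten + softmax head.
clf = tf.keras.Sequential([
    tf.keras.Input(shape=(16, 16, 3)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4, activation='softmax'),
])

# Fully convolutional twin: the Dense becomes a Conv2D covering the full 14x14x32
# feature map, so the same weights can slide over inputs larger than 16x16.
# (Softmax as a Conv2D activation is applied per position over the 4 channels.)
fcn = tf.keras.Sequential([
    tf.keras.Input(shape=(None, None, 3)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.Conv2D(4, (14, 14), activation='softmax'),
])

fcn.layers[0].set_weights(clf.layers[0].get_weights())
w, b = clf.layers[2].get_weights()
# TF's Flatten uses (h, w, c) order, which matches the conv kernel layout.
fcn.layers[1].set_weights([w.reshape(14, 14, 32, 4), b])
```

On a 16x16 input both models should agree; on a 32x32 input the convolutional twin produces a 17x17 grid of softmax predictions.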