It’s difficult for the human mind to visualize more than three dimensions.
One example of four dimensions might be a stack of ‘m’ color images (maybe the frames of a video), each with red, green, and blue pixel values.
So the shape might be (m, h, w, 3): m images, each h pixels high and w pixels wide, with 3 color values per pixel.

Right. The variable W is the “filters” for the convolution operation. Those will be learned through backpropagation, so we start with random values of the correct shape. In this case they are defining each filter to be 3 x 3 with 4 input channels, to match the number of channels in the input tensors. Then there are 8 of those filters (that’s the last dimension), which means the output of this convolution step will have 8 channels.
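To make the shapes concrete, here is a minimal NumPy sketch. The random generator, the seed, and the names `patch` and `out` are my own illustrative choices, not the assignment's code, and the bias term is left out. It initializes W with shape (3, 3, 4, 8) and applies all 8 filters to a single 3 x 3 x 4 patch of input:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility only

# 8 filters, each 3 x 3 with 4 input channels; the filter index is the last axis.
W = rng.standard_normal((3, 3, 4, 8))

# One 3 x 3 patch of the input with 4 channels (made-up data for illustration).
patch = rng.standard_normal((3, 3, 4))

# Each filter reduces the patch to a single number, so one output
# position ends up with 8 values, one per filter.
out = np.array([np.sum(patch * W[..., k]) for k in range(W.shape[-1])])

print(W[..., 0].shape)  # (3, 3, 4)
print(out.shape)        # (8,)
```

Sliding that patch over every position of the input gives the full output, which is why the result has 8 channels.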

Do I understand correctly that each new dimension beyond 3 is something like a multiverse? So it just shows us how many of these 3-dimensional spaces there are?

I don’t know what the “multiverse” has to do with this, but I think your last sentence is a good way to think of it. In terms of visualization, it matters which of the dimensions you consider as the “new” or “extra” dimension. In the case of the filters, it makes sense to think of the last dimension as the “new” dimension: so in the example above, you have 8 filters each of which is 3 x 3 x 4.

But take the example where the first dimension is the “samples” dimension. One very common example of this is a “batch” of images. So that will be a tensor of shape m x h x w x c. So think of it as m different 3D images, each of which is h x w x c (height, width, colors or channels).
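Here is a small sketch of that view in NumPy. The sizes are arbitrary values picked for illustration. Indexing the first dimension pulls out one 3D image from the batch:

```python
import numpy as np

m, h, w, c = 10, 64, 64, 3  # illustrative sizes, not from any particular dataset
batch = np.zeros((m, h, w, c))

first_image = batch[0]  # pick one sample along the first ("samples") dimension

print(batch.shape)        # (10, 64, 64, 3)
print(first_image.shape)  # (64, 64, 3)
print(len(batch))         # 10
```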

If you are talking about my last example of the “samples” being the first dimension, then I don’t think I agree with your interpretation. In that case, the point is for each value of the first (samples) dimension you have an image which has 3 dimensions: height, width and colors. But the common terminology is to refer to the last dimension as “channels” instead of colors, because once you get past the input layer of a ConvNet, the values aren’t really colors any more: they are just numbers representing internal state values.

Think about what an image is: it is a bunch of pixels, right? The dimensions are h (height) by w (width) of the image. At each of the h * w locations in the image you have color values. If it’s a greyscale image then the c dimension will be 1 and you have one value between 0 and 255. If it’s an RGB image, then you’ll have c = 3. If it’s a PNG file and you have the alpha channel present, then c = 4 and the values are RGBA.
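As a rough illustration, the three cases look like this as NumPy arrays. I am assuming 8-bit pixel values here (`uint8`), and the tiny 4 x 5 image size is made up:

```python
import numpy as np

h, w = 4, 5  # tiny illustrative image size

grey = np.zeros((h, w, 1), dtype=np.uint8)  # c = 1: one intensity per pixel
rgb = np.zeros((h, w, 3), dtype=np.uint8)   # c = 3: red, green, blue
rgba = np.zeros((h, w, 4), dtype=np.uint8)  # c = 4: red, green, blue, alpha

rgb[0, 0] = [255, 0, 0]  # set the top-left pixel to pure red

print(grey.shape, rgb.shape, rgba.shape)  # (4, 5, 1) (4, 5, 3) (4, 5, 4)
print(np.iinfo(np.uint8).max)             # 255
```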

Oh, I am sorry, I read that as “and” here. So it looks like the multiverse is what we get when we have more than 3 dimensions.

I think that any of these features (axes) can be in an extra space that connects with the 3 axes of one “universe” only at the 0 point.

Unfortunately there is no way to depict more than 3 dimensions on a graph.

So my point in all of this is that it matters which dimension you decide to pick as the “extra” dimension in terms of how you visualize things. With a batch of images, it makes sense to consider the “samples” dimension (the first dimension) as the “extra” dimension, as I was trying to describe above.

When you’re looking at the weights W in a Conv layer, then it makes more sense to treat the 4th dimension (number of output channels) as the “extra” dimension. The shape of W is:

f x f x nC_{in} x nC_{out}

and you look at that as a collection of nC_{out} filters, each of which has the 3D shape f x f x nC_{in}.
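If it helps, here is a deliberately naive sketch that puts the two views together: a batch with the samples dimension first, and W with the filter dimension last. The function name `conv_forward_naive` and the sizes are hypothetical, and I am assuming stride 1, no padding, and no bias to keep it short:

```python
import numpy as np

def conv_forward_naive(A_prev, W):
    """Stride-1, no-padding, no-bias convolution written with explicit loops."""
    m, h, w, n_c_in = A_prev.shape
    f, f2, n_c_in_w, n_c_out = W.shape
    assert f == f2 and n_c_in == n_c_in_w, "filter shape must match input channels"

    h_out, w_out = h - f + 1, w - f + 1
    Z = np.zeros((m, h_out, w_out, n_c_out))
    for i in range(m):                      # each sample in the batch
        for y in range(h_out):
            for x in range(w_out):
                patch = A_prev[i, y:y + f, x:x + f, :]  # shape (f, f, n_c_in)
                for k in range(n_c_out):    # each 3D filter
                    Z[i, y, x, k] = np.sum(patch * W[..., k])
    return Z

rng = np.random.default_rng(1)              # arbitrary seed
A_prev = rng.standard_normal((2, 6, 6, 4))  # m = 2 inputs, 6 x 6, 4 channels
W = rng.standard_normal((3, 3, 4, 8))       # f = 3, nC_in = 4, nC_out = 8
Z = conv_forward_naive(A_prev, W)
print(Z.shape)  # (2, 4, 4, 8)
```

The outermost loop walks the “extra” samples dimension of the input and the innermost loop walks the “extra” filters dimension of W, so the output keeps m as its first dimension and picks up nC_{out} as its last.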

At least that’s the way that makes the most sense to me. You can put on your VR goggles and dive into your multiverse in a different way, if that makes more sense to you.