A suggestion for making the shape of matrices consistent and redrawing the ANN processing chain in 3-D

Nothing to really write home about :smiling_face_with_sunglasses:, but I was thinking: we describe (and program) the processing of the ANN as matrix operations, but this seems low-level and also inherited from the conventions of linear algebra. It becomes subtle once batch processing enters the picture: one then needs fixes between the math level and the program level, like “broadcasting”. Furthermore, the individual elements of a batch appear along the 2nd dimension of the matrices, but the 2nd dimension is also used, as usual, for the W (weight) linear-transformation matrices. As an IT guy, that looks somewhat sloppy to me.

First note a subtlety that one should make clear: one can always remove or add an arbitrary number of dimensions of size 1 to an array, thus making arrays of the following shapes (for example) equivalent:

  • (a, b)
  • (a, b, 1)
  • (a, b, 1, 1)
  • etc.

This is not unlike being able to add or remove 0’s on the left of a number in standard notation and nothing changes in what the number designates: 000012 = 12.
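A minimal NumPy sketch of this equivalence (assuming NumPy as the array library, which the post itself does not name): adding or removing trailing size-1 dimensions never changes the stored numbers.

```python
import numpy as np

# The same 6 numbers, viewed with 0, 1 and 2 trailing size-1 dimensions.
x = np.arange(6).reshape(2, 3)        # shape (2, 3)
y = x.reshape(2, 3, 1)                # shape (2, 3, 1)
z = x.reshape(2, 3, 1, 1)             # shape (2, 3, 1, 1)

# np.squeeze removes all size-1 dimensions, recovering (2, 3).
assert np.array_equal(np.squeeze(z), x)
```

`reshape` and `squeeze` here are the “add 0’s on the left / remove them again” moves from the analogy above.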

But note this:

  • (), no shape at all, a scalar
  • (a), undetermined orientation (“upgrade” the array to (a, 1) or (1, a), as desired)
  • (a, 1), a column vector of a rows
  • (a, 1, 1)
  • etc.

Contrariwise:

  • (1, a), a row vector of a columns
  • (1, a, 1)
  • (1, a, 1, 1)
  • etc.

After this intro, the basic idea (it’s very pedestrian) would be to do the following:

  1. express all the matrices we encounter during processing always as 3-D matrices/arrays
  2. …with dimensions row, col, depth (in that order; the order is important), and
  3. …with the “batch examples” always arranged along the depth dimension, and
  4. with the operations transforming one array into the next disentangled from the underlying linear algebra (but of course still explained via linear algebra) as follows:
    • MIX: performs a linear transformation by multiplying each (a, 1, 1) shaped sub-array (slice) found along the depth dimension of an input array of shape (a, 1, m) with a matrix of shape (b, a) “on the left”, yielding an output array of shape (b, 1, m) composed of the individually transformed slices along depth
    • ADD: adds an array of shape (a,1) to each (a, 1, 1) shaped sub-array (slice) found along the depth dimension of an input array of shape (a, 1, m), yielding an output array of shape (a, 1, m), composed of the individually transformed slices along depth
    • APPLY: applies an arbitrary 1-argument operation to each element of an input array of shape (a, 1, m), yielding an output array of identical shape (a, 1, m)
    • NegLL: applies the two-argument negative log-likelihood computation to pairwise elements picked from two input arrays of shape (1, 1, m) along the depth dimension, yielding an output array of shape (1, 1, m)
    • MEAN: computes the mean of an input array of shape (1, 1, m) along the depth dimension, yielding an output array of shape (1, 1, 1), i.e. a scalar.
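The five operations above can be sketched in NumPy (a sketch under the assumption that binary cross-entropy is the intended negative log-likelihood; the function names mirror the post):

```python
import numpy as np

def mix(W, X):
    # W has shape (b, a); X has shape (a, 1, m).
    # Multiply each (a, 1) slice along depth by W "on the left".
    # einsum contracts over a and keeps the depth index m intact.
    return np.einsum('ba,acm->bcm', W, X)          # shape (b, 1, m)

def add(bias, X):
    # bias has shape (a, 1); X has shape (a, 1, m).
    # A trailing size-1 axis lets NumPy broadcast over depth.
    return X + bias[:, :, np.newaxis]              # shape (a, 1, m)

def apply(f, X):
    # Elementwise application of a 1-argument operation.
    return f(X)                                    # same shape as X

def neg_ll(Y_hat, Y):
    # Pairwise negative log-likelihood along depth;
    # inputs and output all have shape (1, 1, m).
    return -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

def mean(X):
    # Accumulate along depth: (1, 1, m) -> (1, 1, 1).
    return X.mean(axis=2, keepdims=True)
```

Note that only `neg_ll` and `mean` know anything about the loss; MIX, ADD and APPLY are pure per-slice transformations along depth.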

Now consider a processing chain of a 2-layer ANN like the one below. Note that the layer “widths” (i.e. the number of neurons in that layer) are denoted with:

  • \lambda[0]
  • \lambda[1]
  • \lambda[2], which in this case is \lambda[2] = 1

We can now express the above through an isometric diagram like the following (I have no software to do that, so it’s hand-drawn; there is PlotNeuralNet, but I have no idea how it works). Note the indications of dimension sizes on each array.

Now the “depth” of the arrays always expresses the size of the batch, and the matrices W^{[i]} and b^{[i]} are used to parametrize the transformations from one array to the next, but are not bound up with them.
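A self-contained NumPy sketch of that 2-layer chain (assumptions: sigmoid activations and random parameters, purely to make the shapes concrete):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = [4, 3, 1]      # layer widths lambda[0], lambda[1], lambda[2]
m = 5                # batch size, always along the depth dimension

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

A0 = rng.normal(size=(lam[0], 1, m))          # input, shape (lambda[0], 1, m)
W1 = rng.normal(size=(lam[1], lam[0]))
b1 = np.zeros((lam[1], 1, 1))
W2 = rng.normal(size=(lam[2], lam[1]))
b2 = np.zeros((lam[2], 1, 1))

# Each step transforms the (a, 1) slices along depth; the parameters
# W^{[i]}, b^{[i]} stay outside the data arrays.
Z1 = np.einsum('ba,acm->bcm', W1, A0) + b1    # (lambda[1], 1, m)
A1 = sigmoid(Z1)
Z2 = np.einsum('ba,acm->bcm', W2, A1) + b2    # (lambda[2], 1, m) = (1, 1, m)
A2 = sigmoid(Z2)
```

The batch size m rides along unchanged in the 3rd dimension through the whole chain, which is exactly the consistency the proposal is after.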


Hello David (@dtonhofer),

This seems interesting, but I had to stop after the first bullet point because it ended with a shape of (\lambda_1,1,m) which was never reused. I thought it should flow to one of your subsequent steps. Was this intended?

Also, would you mind exemplifying what \lambda_0 and \lambda_1 could mean? You suggested the last dimension could be the batch-example dimension; what about the other two? What could they be?

Cheers,
Raymond


Good catch. I made updates to the text.

But it’s nothing very complex in any case, just filing off the edges, so to speak, concerning conceptual problems that arise when one starts to think about the backpropagation algorithm a fourth time.


(b, 1, m) was yielded in your first bullet point but never used again. :wink:

However, I think your hand-drawn diagram explained everything, so this time my question is: is there any reason for you to stick to 3 dimensions? It seems 2 dimensions are sufficient for this text, because your col dimension always has length 1.

The point is that it is more consistent (engineering-wise) to always use the 3rd dimension as the “batch dimension”; it’s a visualization help.

I always picture the linear transform or the activation function as a print head sliding along the array, generating the new array, either with purely local operations (on slices along the 3rd dimension) or, in the case of MEAN, by accumulating along the 3rd dimension.

Contrariwise, the matrix-based formulation has the batch examples in the $A$s and $Z$s arranged along the second dimension, which is also the dimension along which the linear-transformation target vectors are arranged in the W. It’s the convention, but I don’t like it :joy:

“But in your diagram the 2nd dimension is always of size 1” you say. That’s right, but if we have image pixels, we could put the RGB channels in there. That would make the 2nd dimension of size 3.
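As a shape-only sketch of that RGB idea (hypothetical sizes, just to show where the channels would go): each example becomes an (n_pixels, 3) slice, and the batch still stacks along the 3rd dimension.

```python
import numpy as np

m = 8                                   # batch size
n_pixels = 100
# One example: n_pixels rows, 3 columns (R, G, B channels).
examples = [np.zeros((n_pixels, 3)) for _ in range(m)]
# Stacking along a new 3rd axis keeps the (row, col, depth) order:
batch = np.stack(examples, axis=2)      # shape (n_pixels, 3, m)
```

The 2nd dimension now carries the channels, and depth still carries the batch.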

OTOH, if our data elements have more than 2 dimensions of their own, should we move the example dimension beyond the 3rd? That would destroy the whole idea.


A stack of m 2-D pictures is 4-D: (height, width, channels, m). Likewise, a stack of geological maps of temperature, humidity, pressure, … can be (longitude, latitude, measurements, m). If time is treated as an extra dimension, we will have 5 dimensions.
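A small shape-only sketch of those two cases (hypothetical sizes, with the batch always last):

```python
import numpy as np

m = 2                                     # batch size
# One example: a (height, width, channels) RGB picture.
images = [np.zeros((32, 48, 3)) for _ in range(m)]
# Stacking the m examples along a new last axis gives 4-D:
stack4 = np.stack(images, axis=3)         # (height, width, channels, m)

# With time as an extra dimension (say 7 frames), we get 5-D:
clips = [np.zeros((32, 48, 3, 7)) for _ in range(m)]
stack5 = np.stack(clips, axis=4)          # (height, width, channels, time, m)
```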

That’s a good idea and I think part of the idea still remains, only we can’t visualize higher dimensions. :wink:

Cheers,
Raymond


This is true, but we have an advantage:

We use our high-dimensional matrices for storage only (or mostly). Unlike in mathematical geometry, where the dimensions can be linearly mixed at will as we change either the basis or the space, we do not mix, for example, x and y position, R, G, B color values, etc.

We can thus visualize our higher-dimensional matrices as “regular trees”, a structured store of numbers (there seems to be no word for a tree that has a constant number of children at each node of a level, so “regular” will do):
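A tiny sketch of that “regular tree” view: converting an array to nested lists makes the constant branching factor at each level visible.

```python
import numpy as np

# A shape-(2, 3, 2) array as a "regular tree": every node at a given
# level has the same number of children (2, then 3, then 2 leaf numbers).
x = np.arange(12).reshape(2, 3, 2)
tree = x.tolist()
# tree has 2 children, each child has 3 children,
# and each of those holds 2 leaf numbers.
```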


True!