RNN with varying nx for each sequence / time step

Hi, I am currently working through the Deep Learning Specialisation: I have just finished week 1 of course 5 in parallel and am approaching the end of the course. However, one question keeps going around in my head: would it be possible, for one of our use cases (with the data we have available), to utilise an RNN (with LSTM or GRU units)?

The use-case
We have collected panel data with, let's say, T_x = 8 waves (i.e., sequence steps, e.g. from T_1 to T_8). Some questionnaires are assessed at every time step (e.g. our outcome y; let's assume it is a categorical variable, either 1 = 'clinically relevant symptoms' or 0 = 'not clinically relevant', as measured by the cut-off for the BDI depression sum score), and some predictors are shared between all time steps (e.g., age, gender, daily activity). As far as my understanding goes, this could work fine with a many-to-many architecture where T_x = T_y. However, the number of features (n_x) is different for every wave (time step). For example, n_x^{<1>} = 43, but wave five could have more available features (n_x^{<5>} = 71) and wave 7 fewer (n_x^{<7>} = 38).

Open Question
Can RNNs (with traditional RNN, GRU, or LSTM units) handle a different number of features in the input x^{<t>} at each time step? E.g. a questionnaire on quality of relationships is assessed at time points x^{<1>} and x^{<2>}, but was dropped for all further waves; or a newly developed questionnaire on fear related to COVID-19 was introduced at x^{<8>}. So my question is: can RNNs handle this form of input data? And beyond that (I hope this might be covered in course 2 or 3), how would I wrap my data (currently in .csv format: one file per wave, so 8 .csv files, with tabular data, i.e., features in columns and n = 1..m cases in rows) into a NumPy array? I know how to easily read in all the data with e.g. a list comprehension. However, as n_x differs for each wave, I am unsure how to stack up the third dimension.
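To make the stacking problem concrete, here is a minimal sketch (the case count m and per-wave feature counts are made up for illustration, standing in for the eight per-wave CSVs) showing that NumPy refuses to stack ragged per-wave arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100                                           # number of cases (illustrative)
n_x_per_wave = [43, 50, 55, 60, 71, 48, 38, 45]   # feature counts per wave (illustrative)

# One (m, n_x_t) array per wave, standing in for the eight per-wave CSVs.
waves = [rng.normal(size=(m, n)) for n in n_x_per_wave]

# np.stack requires identical shapes, so stacking ragged waves raises a ValueError:
stacked_ok = True
try:
    X = np.stack(waves, axis=-1)                  # intended shape (m, n_x, T_x)
except ValueError:
    stacked_ok = False
```

This is exactly the point where the list-comprehension approach breaks down: each `pd.read_csv` result has a different width, so there is no single third dimension to stack over.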

Since machine learning relies pretty much on regular matrix algebra to learn the weights, it’s very difficult to handle examples that do not all have the same number of features.

All the work-arounds for this are significant compromises.

Either you ignore examples that have missing features (tossing out lots of otherwise usable data), or you exclude any cost and gradient contributions for the missing features (which slows down the computations a lot), or you substitute the average value for the missing features (deliberately falsifying the values of features that are missing, leading to results you can't really trust).

These all result in degraded performance of the learning algorithm.
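To illustrate the third compromise (average substitution), here is a minimal sketch on toy data — the matrix and its values are made up for illustration, with NaN marking the missing entries:

```python
import numpy as np

# Toy feature matrix: rows = examples, columns = features, NaN = missing.
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

# Column mean over the observed entries only.
col_means = np.nanmean(X, axis=0)

# Replace each missing entry with its feature's observed mean.
filled = np.where(np.isnan(X), col_means, X)
```

The result is a dense matrix that regular matrix algebra can digest, but — as noted above — the filled-in values are fabricated, which is precisely why the downstream results can't fully be trusted.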


Hi TMosh, thanks already for this great response.
One point I would like to add/clarify: there are no examples with missing features. All m examples have the same number of features in each wave (with almost no missing values). It is just that the number of features n_x differs between waves (not between cases / examples / observations).
However, this does not change the rest of your answer (especially that ML is pretty much based on regular (linear) matrix algebra). Maybe I am not versed enough in matrix algebra, but as far as I understand you, it is not possible to stack multiple (to be precise, T_x = 8) data frames of shape (m, n_x), which differ in their number of features n_x, into a 3D NumPy tensor of the form (n_x, m, T_x). Is that correct?
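For what it's worth, the stack itself can be forced by padding every wave's feature axis up to the largest n_x with NaN (or zeros) — a sketch with toy shapes; whether the padded values mean anything to the model is exactly the compromise discussed above:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
n_x_per_wave = [4, 7, 3]      # toy per-wave feature counts
waves = [rng.normal(size=(m, n)) for n in n_x_per_wave]

n_max = max(n_x_per_wave)

# Pad each wave's feature axis with NaN up to n_max, then stack on a new time axis.
padded = np.stack(
    [np.pad(w, ((0, 0), (0, n_max - w.shape[1])), constant_values=np.nan)
     for w in waves],
    axis=0,
)                             # shape (T_x, m, n_max)
```

This gives a regular tensor, but the NaN (or zero) slots carry no information — the network would still need masking or imputation to treat them sensibly.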

That’s going to make it very difficult to create a single training set.