TabNet attentive transformer

Hello everyone,

I am about to start a personal project to implement TabNet from this paper: https://arxiv.org/pdf/1908.07442.pdf

If any of you have read this paper, I have a specific question about the attentive transformer.
The paper says: "We employ a learnable mask \mathbf{M}[i] \in \Re^{B \times D} for soft selection of the salient features."

Here B is the batch size and D is the number of features, and each row of the mask is normalized: \sum_{j=1}^{D} \mathbf{M}[i]_{b,j} = 1.

But I am not sure I clearly understand the mask's shape.
The only interpretation that makes sense to me is that the mask is a row vector of size D concatenated B times along the batch dimension. Is that right?
And does that imply the FC layer of the attentive transformer has D units? (See the sketch below for how I currently picture it.)
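
To make the question concrete, here is a minimal PyTorch sketch of how I currently picture the attentive transformer. The class and argument names (n_a, prior) are just my placeholders, and I use softmax as a stand-in for the paper's sparsemax, since sparsemax is not in core PyTorch; each mask row still sums to 1 either way.

```python
import torch
import torch.nn as nn

class AttentiveTransformer(nn.Module):
    # Sketch only: FC (D units) -> batch norm -> prior scaling -> normalization
    def __init__(self, n_a, n_features):
        super().__init__()
        self.fc = nn.Linear(n_a, n_features)  # D units, so the output is (B, D)
        self.bn = nn.BatchNorm1d(n_features)

    def forward(self, a, prior):
        # a: (B, n_a) features from the previous decision step
        # prior: (B, D) prior scale term (all ones at the first step)
        x = self.bn(self.fc(a)) * prior
        # The paper uses sparsemax here; softmax keeps each row summing to 1
        return torch.softmax(x, dim=-1)

B, D, n_a = 4, 10, 8
at = AttentiveTransformer(n_a, D)
mask = at(torch.randn(B, n_a), torch.ones(B, D))
print(mask.shape)       # torch.Size([4, 10]) -> a (B, D) mask
print(mask.sum(dim=1))  # each of the B rows sums to 1
```

If this sketch is wrong about the shapes (in particular whether each sample gets its own mask row), that is exactly what I would like to have confirmed.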

Thank you in advance