Implementation of TSMAE model in Keras

Hi everyone,

I’m currently implementing the TSMAE model described in the paper “TSMAE: A Novel Anomaly Detection Approach for Internet of Things Time Series Data Using Memory-Augmented Autoencoder”. However, I’ve encountered multiple challenges and would appreciate insights from those more experienced.

1. Implementation Issues & NaN Loss

I have attempted to implement the model as below, but the training process becomes unstable, leading to NaN loss after some epochs. I’m unsure about the root cause.

I am aware of several issues in the implementation, most notably how q_normalized (computed in the call() method of the TSMAE class) is passed to the custom loss function. Since I'm not very familiar with Keras (or deep learning frameworks in general), I've struggled to handle this properly. I've tried multiple approaches, but none have worked without introducing further issues. Any guidance on the correct way to handle this in Keras would be greatly appreciated (I've sketched one idea I'm considering right after the code below).

import tensorflow as tf
from tensorflow.keras import layers, Model, Sequential

# Define the LSTM Encoder model
class LSTMEncoder(Model):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = layers.LSTM(hidden_size, activation='sigmoid', return_state=True)

    def call(self, x):
        # Forward pass through LSTM; only keep the final hidden state (h) as the latent representation
        _, h, _ = self.lstm(x)
        z = h  # Latent representation
        return z

# Define the LSTM Decoder model
class LSTMDecoder(tf.keras.Model):
    def __init__(self, sequence_length, latent_dim, dropout_rate=0.2):
        super(LSTMDecoder, self).__init__()
        self.sequence_length = sequence_length
        self.latent_dim = latent_dim
        self.dropout_rate = dropout_rate

        # Define the layers in the LSTM decoder
        self.lstm_decoder = Sequential([
            layers.RepeatVector(sequence_length),                   # Repeat z_hat for each time step
            layers.LSTM(sequence_length, return_sequences=True),    # First LSTM layer
            layers.Dropout(dropout_rate),                           # Dropout layer
            layers.LSTM(sequence_length, return_sequences=True),    # Second LSTM layer
            layers.Dropout(dropout_rate),                           # Dropout layer
            layers.TimeDistributed(layers.Dense(1))                 # Output layer for each time step
        ])

    def call(self, z_hat):
        # Pass the latent representation through the LSTM decoder layers
        x_hat = self.lstm_decoder(z_hat)
        # Output already has shape (batch_size, sequence_length, 1), matching X_normalized
        return x_hat


class TSMAE(Model):
    def __init__(self, input_size, hidden_size, sequence_length, latent_dim,
                 dropout_rate=0.2, N=20, E=10, lambda_threshold=0.05,
                 epsilon=1e-10, eta=0.01):
        super(TSMAE, self).__init__()
        self.encoder = LSTMEncoder(input_size, hidden_size)
        self.decoder = LSTMDecoder(sequence_length, latent_dim, dropout_rate)

        # Memory module parameters
        self.N = N  # Number of memory items
        self.E = E  # Dimension of latent representation
        self.lambda_threshold = lambda_threshold  # Sparsification threshold
        self.epsilon = epsilon  # Small value to avoid division by zero

        # Initialize M with Xavier initialization
        initializer = tf.keras.initializers.GlorotUniform()
        self.M = tf.Variable(initializer(shape=(self.N, self.E)), trainable=True, dtype=tf.float32)


    def q_normalized_method(self, inputs):
        # Encoder
        z = self.encoder(inputs)

        # Memory Module
        similarity_scores = tf.matmul(z, self.M, transpose_b=True)
        q = tf.nn.softmax(similarity_scores, axis=1)
        q_rectified = (tf.maximum(q - self.lambda_threshold, 0) * q) / abs(q - self.lambda_threshold)
        q_l1_norm = tf.reduce_sum(tf.abs(q_rectified), axis=1, keepdims=True)
        q_normalized = q_rectified / tf.maximum(q_l1_norm, self.epsilon)

        return q_normalized

    def call(self, inputs):
        # Encoder + Memory Module
        q_normalized = self.q_normalized_method(inputs)

        # Decoder
        x_hat = self.decoder(tf.matmul(q_normalized, self.M))

        return x_hat


# Parameters - Encoder
T = 140 # Number of time steps per sample
hidden_size = 10  # Size of the hidden layer (latent representation)
batch_size = 20 # Number of samples in each batch
num_features = 1 # Number of features per time step (single acquisition per action)

# Parameters - Memory Module
E = 10  # Dimension of latent representation
N = 20  # Number of memory items
lambda_threshold = 1 / N  # Sparsification threshold, lambda >= 1/N
epsilon = 1e-10  # Small value to avoid division by zero in normalization

# Parameters - Decoder
sequence_length = 140  # Length of the original sequence
latent_dim = 10        # Dimensionality of the latent representation z_hat
decoder = LSTMDecoder(sequence_length=sequence_length, latent_dim=latent_dim, dropout_rate=0.2)

# Parameter - Loss function
eta = 0.01

model = TSMAE(input_size=num_features,
              hidden_size=hidden_size,
              sequence_length=sequence_length,
              latent_dim=hidden_size)

def custom_loss(original_x, reconstructed_x):
    """
    Custom loss function combining reconstruction loss and sparsity loss.

    Args:
        original_x: The original input data (ground truth).
        reconstructed_x: The reconstructed data (x_hat).

    Returns:
        The total loss: reconstruction loss + eta * sparsity loss

    """
        
    # Reconstruction Loss (Mean Squared Error)
    reconstruction_loss = tf.reduce_mean(tf.square(original_x - reconstructed_x)) / 2.0

    # Compute q_normalized
    q_normalized = model.q_normalized_method(original_x)

    # Sparsity Loss (log sparsity penalty)
    sparsity_loss = tf.reduce_sum(-tf.math.log(1 + tf.square(q_normalized)))

    # Total loss
    total_loss = reconstruction_loss + eta * sparsity_loss

    return total_loss

# Compile model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss=custom_loss,  
              metrics=['mse'])  
# Train the model
model.fit(X_normalized, X_normalized, epochs=50, batch_size=20)
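
For completeness, the main workaround I've been considering (I'm not sure it's the idiomatic Keras way) is to move the sparsity term out of the loss function and register it from inside call() with self.add_loss(), so the compiled loss only needs x and x_hat and q_normalized never has to leave the model. A rough sketch of that idea, reusing the classes and parameters above (TSMAEWithAddLoss and model_v2 are just illustrative names):

# Sketch: register the sparsity penalty from inside call() via add_loss(),
# so the compiled loss only needs the original and reconstructed sequences
class TSMAEWithAddLoss(TSMAE):
    def call(self, inputs):
        q_normalized = self.q_normalized_method(inputs)
        x_hat = self.decoder(tf.matmul(q_normalized, self.M))

        # Same sparsity formula as in custom_loss, but attached to the model;
        # Keras adds all add_loss() terms to the compiled loss automatically
        sparsity_loss = tf.reduce_sum(-tf.math.log(1 + tf.square(q_normalized)))
        self.add_loss(eta * sparsity_loss)

        return x_hat

model_v2 = TSMAEWithAddLoss(input_size=num_features,
                            hidden_size=hidden_size,
                            sequence_length=sequence_length,
                            latent_dim=hidden_size)
model_v2.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                 loss=lambda x, x_hat: tf.reduce_mean(tf.square(x - x_hat)) / 2.0)
model_v2.fit(X_normalized, X_normalized, epochs=50, batch_size=20)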

Additionally, I have a few questions about details in the paper itself.

2. Clarification on Latent Representation (z) Processing

The paper states:

“Combining the output states of each cell yields the encoded latent representation z.”
z = {h_1, h_2, …, h_T}    (Eq. 14)

Since h_t is the output state of each LSTM cell and T is the number of timesteps, I expected z to be a matrix (one hidden state per timestep). However, the paper later describes z as a vector:

“The encoder produces the latent representation z, which has dimension ℝ^E.”

This confuses me.

  • In my current implementation, I only use the final hidden state (h_T) as the latent representation.
  • However, I’m wondering what the common way to “combine the output states” is, and which combination the paper might be referring to (I’ve sketched the two readings I can think of right after this list).
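
To make the two readings concrete, this is how I currently understand them in Keras (just a sketch with dummy data; I don't know which one the paper actually intends):

# Rough sketch of the two readings, using dummy data of shape (batch_size, T, num_features)
x_dummy = tf.zeros((batch_size, T, num_features))

# Reading A (my current implementation): keep only the final hidden state h_T
_, h_T, _ = layers.LSTM(hidden_size, return_state=True)(x_dummy)
print(h_T.shape)    # (20, 10) -> a single vector of dimension E per sample

# Reading B: keep every hidden state {h_1, ..., h_T} and combine them afterwards
h_all = layers.LSTM(hidden_size, return_sequences=True)(x_dummy)
print(h_all.shape)  # (20, 140, 10) -> a matrix per sample
z_flat = layers.Flatten()(h_all)                   # (20, 1400): concatenate all h_t
z_pool = layers.GlobalAveragePooling1D()(h_all)    # (20, 10): average over timesteps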

3. Potential Inconsistencies in the Paper’s Dimension Descriptions

Another thing that confuses me is the paper’s notation regarding input dimensions. It states:

“where x_T has dimension ℝ^T as the input at the current moment.”

Since T is defined as the number of timesteps, shouldn’t the input at each timestep have dimension ℝ^F, where F is the number of features (isn’t that how an LSTM works)? If T is the number of timesteps, it seems incorrect to use it as the dimensionality of a single input at time t. I’ve added a small shape check below to illustrate what I mean.
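
To illustrate, here is a quick shape check reflecting my understanding of how Keras feeds an LSTM (this is my reading, not the paper's):

# My understanding: a Keras LSTM expects inputs of shape (batch_size, T, F),
# so the input at a single timestep t is a vector of dimension F, not T
F = num_features   # F = 1 in my case
x_dummy = tf.zeros((batch_size, T, F))
x_t = x_dummy[:, 0, :]   # the input at one timestep
print(x_t.shape)         # (20, 1) -> each x_t lives in R^F rather than R^T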

So, in short, I would really appreciate any insights on the following (the paper is attached for reference: TSMAE_A_Novel_Anomaly_Detection_Approach_for_Internet_of_Things_Time_Series_Data.pdf, 2.1 MB):

  1. Possible reasons for the NaN loss in my implementation.
  2. The correct way to handle passing q_normalized in Keras for loss calculation.
  3. Clarification on how z should be constructed from LSTM outputs.
  4. Whether there are notation inconsistencies in the paper’s descriptions of input dimensions.