Image segmentation: dataset pipeline - customized output

Hi @rmwkwok,

How are you doing Raymond?

Following our discussions, I created a notebook to make a prototype using image segmentation.
imageSegmNN1.ipynb (40.0 KB)

My end goal is to use image segmentation (UNet) to get species probabilities. In the notebook prototype, I used:

  • X: The original data is a single predictor matrix of shape (10’000, 10’000, 4), each channel being a predictor. From this original X matrix, extractions are done on the fly to generate “3’798’601 images” of shape (512, 512, 4) that are then used as X data for training. Images of height and width 512 are extracted to match the input size of the UNet architecture.
  • Y: The original data is a single response matrix of shape (10’000, 10’000, 2), each channel being a species with values (0: present, 1: absent, 2: NA). From the original Y matrix, extractions are done on the fly to generate “3’798’601 images” of shape (512, 512, 2) that are then used as Y data.
  • Prediction objectives: Train a segmentation network (UNet) that takes the X predictors of shape (512, 512, 4) as input and outputs the probabilities of presence with shape (512, 512, 2), each channel being a species containing the probability of presence within each cell.
  • UNet architecture: I replicated the original UNet architecture, the idea being to change the output to suit my needs.

Questions

  • Dataset pipeline: I was thinking about optimizing the dataset pipeline. The process I used for extracting and generating images on the fly works on small data sets, but fails for larger ones (see point 1.3, convert_to_tensor). Furthermore, keeping 3 million generated images of 512x512x4 in memory is not optimal either. I was wondering if I should save them to disk, but this would have the disadvantage of replicating data and consuming a lot of space.

  • Output UNet: the idea is to have one layer per species with probabilities of presence. So I guess that my number of classes in the UNet should be 2 in the current example at point 3.3.3, if I am not mistaken.

  • Model dimensions: In 3.3.2, I did some tests to ensure that the shapes are right. For some reason, when I check the dims of the model in 3.3.3, the channel depths are wrong.

Any ideas or suggestions?
Thanks a lot!
Manu

Hey @Manu, I am doing great. In fact, I am busy preparing for something, so I plan to go through your notebook in detail on Friday. I glanced at it and it seems things are getting more interesting now.

Quick response:

  1. Dataset pipeline: we can use tf.data.Dataset to build the pipeline (as you have tried). The idea is, we first decide what the data source looks like - (1.1) the original raster or (1.2) the mini-raster - and how we store it - (2.1) in memory or (2.2) in files.
    tf.data.Dataset will work in all those ways, so it’s better to stick with the way the actual data is. If your source data is the full raster stored in files of a particular format, please do save them the same way. We can generate some random raster like you did in the Jupyter notebook, save it as files first, and then have those files read by the pipeline. The idea is, the closer our toy demo is to the actual case, the easier it will be for you to transfer the implementation to your real problem. Of course it’s just a recommendation, and you might choose not to follow it at this stage, but I think the minimum is to at least save the random data as files of any format so we can free the memory for the neural network, the pipeline, and the training process.
    tf.data.Dataset works this way: you start the pipeline by reading from the source, then connect it to a series of any number of custom transformations/filters defined by you, then connect it to the batch builder, and finally to a buffer that caches a certain number of batches ready for training (see the minimal sketch after this list). Other features like random sampling can be added to the pipeline. All features can be found here with some examples.

  2. Model dimensions: Your input has the shape (None, 512, 512, 4) and the output (None, 512, 512, 2), which are what you asked for in “Prediction objectives”. The ‘None’ in both shapes is reserved for the batch size, which is decided by the batch builder in the pipeline. If the batch size is 32, meaning 32 mini-rasters are processed at a time, then the actual input and output shapes will be (32, 512, 512, 4) and (32, 512, 512, 2). For now it’s normal to have None there. Let me know if you think something else is wrong and what it is.

  3. Output UNet: It looks alright, and I think we need to add a sigmoid activation to the last layer (a tiny sketch after this list shows what that could look like).

  4. Lastly, I usually use tf functions inside the pipeline, the loss function, the metric function, and when building neural network layers, and I won’t use tf functions outside of them. I think numpy is good outside, and tf is the choice inside.
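To make point 1 concrete, here is a minimal sketch of that stage order. The generator name yield_minirasters and the toy data are only placeholders, not your actual source:

import numpy as np
import tensorflow as tf

# Hypothetical source: a generator yielding (X, Y) mini-raster pairs (random toy data here)
def yield_minirasters():
    for _ in range(10):
        yield (np.random.rand(512, 512, 4).astype('float32'),
               np.random.randint(0, 3, size=(512, 512, 2)).astype('float32'))

dataset = (
    tf.data.Dataset.from_generator(                      # 1. read from the source
        yield_minirasters,
        output_signature=(
            tf.TensorSpec(shape=(512, 512, 4), dtype=tf.float32),
            tf.TensorSpec(shape=(512, 512, 2), dtype=tf.float32)))
    .map(lambda X, Y: (X, Y))                            # 2. custom transformations/filters go here
    .shuffle(100)                                        # 3. random sampling
    .batch(4)                                            # 4. batch builder
    .prefetch(tf.data.AUTOTUNE)                          # 5. buffer of batches ready for training
)

for X_batch, Y_batch in dataset.take(1):
    print(X_batch.shape, Y_batch.shape)                  # (4, 512, 512, 4) (4, 512, 512, 2)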
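And for points 2 and 3, a tiny sketch of only the last layer; the single Conv2D standing in for the UNet body is just a placeholder:

import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(512, 512, 4))
x = layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)   # stand-in for the UNet encoder/decoder
# Last layer: one channel per species; sigmoid gives independent presence probabilities in [0, 1]
outputs = layers.Conv2D(2, 1, activation='sigmoid')(x)

model = tf.keras.Model(inputs, outputs)
print(model.output_shape)   # (None, 512, 512, 2) - the None is the batch dimension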

I will check this thread from time to time before I really start to dig into your code on Friday.

Raymond


Btw, if you plan to start building the pipeline, and if in your actual data the observation data (y=0 or y=1) is very sparse, here are 2 ideas.

  1. add a filter to the pipeline to remove mini-rasters that have no 0 or 1 in their Y, because those mini-rasters won’t contribute to the training process at all (as required by our loss function); they waste your computational power and can be problematic if such mini-rasters are the majority (a sketch of such a filter follows below).

  2. the current idea is that you cut a mini-raster every certain number of pixels, which is completely fine. It is the regular way of creating mini-rasters. However, it might also be worth considering the irregular way, which is, for each y=0 or y=1 observation, to cut more than one mini-raster, positioning that observation at different parts of each mini-raster. E.g. the observation could be at the center of mini-raster #1, somewhere to the north-east in #2, and so on.

Anyway, it depends on whether your observations are sparse. If they are, (1) is necessary, and the irregular way in (2) can be thought about later. Also, if the observations are indeed sparse, I suggest you do the same for the toy random data. The idea is, the closer it is to the real case, the more we might be able to discover at this development stage.
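A minimal sketch of idea (1), assuming Y encodes 0: present, 1: absent, 2: NA as in your notebook (the function name has_observation is just a placeholder):

import numpy as np
import tensorflow as tf

def has_observation(X, Y):
    # Keep a mini-raster only if at least one Y cell is 0 (present) or 1 (absent), i.e. not everything is NA (2)
    return tf.reduce_any(Y < 2)

# Toy example: two mini-rasters, the second containing only NA
X_toy = np.random.rand(2, 512, 512, 4).astype('float32')
Y_toy = np.full((2, 512, 512, 2), 2.0, dtype='float32')
Y_toy[0, 10, 10, 0] = 1.0                              # the first mini-raster has one observation

dataset = tf.data.Dataset.from_tensor_slices((X_toy, Y_toy)).filter(has_observation)
print(sum(1 for _ in dataset))                         # 1 - the all-NA mini-raster was dropped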


I had been modifying my replies, but I will stop now. Sorry if you had been reading.

Hi Raymond,
Great, thanks a lot for your precious advice!
Looking forward to working on this tomorrow and Thursday; I will keep you posted.
Take care,
Manu

Hi @rmwkwok,

Based on your comments, I revised the whole data pipeline.

You will find the new notebook here:
imageSegmNN3 (1).ipynb (53.2 KB)

The “toy” data in TIF format to be used is zipped here; it is 17 MB so I had to use a transfer: SwissTransfer.com - Envoi sécurisé et gratuit de gros fichiers

The shapes of the mini-rasters to be processed by the UNet are:

  • X.shape (512, 512, 4), Y.shape (512, 512, 2)
  • When I run this model for real, the number of channels for X will be approximately 70 and the number of channels for Y approximately 100. So the risk of having no data in Y (0 or 1) will be quite limited given the great depth.
  • Given the great depth of Y, I will stick with the regular way, as you mentioned. It might also lessen bias, as there will be a fixed and equal number of mini-rasters processed for each day.

Looking forward to hearing your opinion,

Thanks a lot Raymond!
Manu

Hi @Manu,

Some comments/questions about the data.

  1. You are replacing X0 with normalized X1 - is this intended?
# Normalize
    X0 = X1/np.linalg.norm(X1)
    X1 = X2/np.linalg.norm(X2)
  2. I suggest normalizing X1, X2, X3, X4 so that they all range from 0 to 1, e.g. by min-max normalization. This is just to resemble what usual computer-vision CNNs do, unless you have other concerns.

Regarding the risk of having no data in Y (0 or 1): I agree. I did the following check on your toy Y, and the chance of seeing an all-NA pixel is only 5%. If you have time, please give it a try on the real data. No harm in getting a quantitative estimate.

import numpy as np
from PIL import Image

# filePath is the folder containing the toy TIF files
Y0 = np.array(Image.open(filePath + '/Y0.tif'))
Y1 = np.array(Image.open(filePath + '/Y1.tif'))
Y_list = [Y0, Y1]
Y = np.stack(Y_list, axis=2)

# pixels where every species channel is NA (value 2), relative to the total number of array entries
print((Y.min(axis=2) == 2).sum() / np.prod(Y.shape))
  3. What is the total size of the training data you have, in GB?

  4. I will give you an example of using TFRecord and tf.data.Dataset for preprocessing, storing, and reading the data.

Raymond


Hey Manu,

Here is the example using TFRecords and the TF data pipeline. TFRecords save data in a binary format; the advantage is that it reads faster. TFRecords can also be compressed.

I believe this example can give you some good ideas to expand on. Let me know if you have any questions about the example.

Raymond

import os
import glob
import numpy as np
import tensorflow as tf
from PIL import Image

# Data specification
raster_w = 1000
raster_h = 1000

source_folder = './rasterData'
tfrecord_folder = './tfrecords'
if not os.path.exists(tfrecord_folder):
    os.mkdir(tfrecord_folder)
        
# specify correct stacking order
XIDs = ['X0.tif', 'X1.tif', 'X2.tif', 'X3.tif'] 
YIDs = ['Y0.tif', 'Y1.tif']

# specify ranges for normalization
X_minmax = {
    'X0.tif': [200., 4000],
    'X1.tif': [200, 400],    
    'X2.tif': [0, 1],    
    'X3.tif': [0, 1], 
}
X_minmax = np.array(list(map(X_minmax.get, XIDs)))

# specify patches
stride = 50
filter_size = 512
row_slices = [slice(i, f) for i, f in zip(range(0, raster_h, stride), range(filter_size, raster_h, stride))]
col_slices = [slice(i, f) for i, f in zip(range(0, raster_w, stride), range(filter_size, raster_w, stride))]
patch_slices = [(r, c) for r in row_slices for c in col_slices]

# specify TF record
tfrecord_feature_description = {
    'X': tf.io.FixedLenFeature([filter_size, filter_size, len(XIDs)], tf.float32),
    'y': tf.io.FixedLenFeature([filter_size, filter_size, len(YIDs)], tf.float32),
}

def load_image(path):
    return np.array(Image.open(path))

def parse_tfrecord(example):
    example = tf.io.parse_single_example(example, tfrecord_feature_description)
    return example['X'], example['y']

def create_tfrecord_example(X, y):
    feature = {
        'X': tf.train.Feature(float_list=tf.train.FloatList(value=X.flatten())),
        'y': tf.train.Feature(float_list=tf.train.FloatList(value=y.flatten())),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

def raw_data_to_tfrecords(source_path, tfrecord_writer):
    # read data
    X = np.stack([load_image(os.path.join(source_path, i)) for i in XIDs], axis=2)
    y = np.stack([load_image(os.path.join(source_path, i)) for i in YIDs], axis=2)
    
    # normalize X
    X = (X - X_minmax[:, 0]) / (X_minmax[:, 1] - X_minmax[:, 0])
    
    # patches
    Xs = np.stack([X[s] for s in patch_slices], axis=0) # shape (100, 512, 512, 4)
    ys = np.stack([y[s] for s in patch_slices], axis=0) # shape (100, 512, 512, 2)
    
    # save as TFRecords
    for sample_X, sample_y in zip(Xs, ys):
        example = create_tfrecord_example(sample_X, sample_y)
        tfrecord_writer.write(example.SerializeToString())

# Get a list of date folders
dates = glob.glob(os.path.join(source_folder, '*'))

# Train - cv split
train_size = int(len(dates)*.7)
train_dates, cv_dates = np.array_split(dates, [train_size,])

# preprocess data and save as TFRecords
with tf.io.TFRecordWriter(os.path.join(tfrecord_folder, 'train.tfrec'), options=tf.io.TFRecordOptions(compression_type='GZIP')) as writer:
    for date in train_dates:
        raw_data_to_tfrecords(date, writer)
        
with tf.io.TFRecordWriter(os.path.join(tfrecord_folder, 'cv.tfrec'), options=tf.io.TFRecordOptions(compression_type='GZIP')) as writer:
    for date in cv_dates:
        raw_data_to_tfrecords(date, writer)
        
# Each date creates 100 samples, to shuffle samples from different dates, need 100xN in the shuffle cache
train_dataset = tf.data\
    .TFRecordDataset(os.path.join(tfrecord_folder, 'train.tfrec'), compression_type='GZIP')\
    .map(parse_tfrecord)\
    .shuffle(1000, seed = 32, reshuffle_each_iteration=True)\
    .batch(32, drop_remainder=True)\
    .prefetch(tf.data.AUTOTUNE)

cv_dataset = tf.data\
    .TFRecordDataset(os.path.join(tfrecord_folder, 'cv.tfrec'), compression_type='GZIP')\
    .map(parse_tfrecord)\
    .batch(32, drop_remainder=False)\
    .prefetch(tf.data.AUTOTUNE)

# check dimensions
for i, (X, y) in enumerate(train_dataset):
    pass
print('# batches', i+1, 'shapes of each batch', X.shape, y.shape)

@Manu, for this

# Train - cv split
train_size = int(len(dates)*.7)
train_dates, cv_dates = np.array_split(dates, [train_size,])

Your data is organized by date. I suppose your prediction will be on future dates, so the idea is to use earlier dates as the training set and later dates as the cv set.

And I have gone through your code; I think it is very clearly explained, and I can see that you have explored a lot. However, I think there is no simpler way than for me to make an example so that you can more easily leverage TFRecords and the TF data pipeline. I expect you will keep developing the pipeline, so my example is definitely not the end.

Please, if you have time, take a look at my example. Based on the explanations in your notebook, I am sure you will figure out what every step is doing. But feel free to ask any question, whether it is about my code or about the differences between your way and mine.

Lastly, my computer crashed when I tried to train, but it’s a little late now, so I will try again tomorrow. If it is because the architecture is too big for my laptop, then I will reduce it to try out the loss and the metrics.

UPDATE:
I had to reduce the architecture to run it on my laptop, but both the metrics and the loss look fine.

Raymond


Hi @rmwkwok,

Thank you very much for your incredibly brilliant inputs!!!
Amazing :wink: This is a very good lesson in humility… being used to working with R, I see that I still have a lot to learn in tf and Python…

I studied your scripts line by line; thank you so much for such condensed knowledge! It helps me a lot. I will keep researching, studying, and elaborating on this during the course of next week.

Answers to your posts:

  1. Replacing X0 with X1 normalized.
    Corrected, thank you, my mistake

  2. Normalization instead of standardization
    You are right, that makes more sense. I will implement min-max normalization on predictors that are not in the range [0, 1], being careful not to apply it to the other predictors that are already within the range [0, 1], to avoid distortion.

  3. Risk of having no data in Y (0 or 1) will be quite limited
    You are right, thank you for your test; this will occur, so I will have to implement a screening process to discard mini-rasters having only NA for Y.

  4. Total size of training data
    Approximately 500 GB. A function will extract those files by day to later be processed by TensorFlow, as in our example.

  5. Data is by date
    Indeed. What I am doing is slicing the same month (30 dates) for each of the last 20 years and using all those dates as trainingSet / validSet. This is more of a spatial analysis of the data, bounded in time by slicing the same month throughout the years to lessen temporal biases. So to predict, for example, for the month of August, I will extract the data of the August slices of the last 20 years. As you suggested - a very good idea - I could use the most recent year’s month as the validation set. For temporal analysis, I am using an LSTM network. The time series is simply the number of occurrences for each day during the last 20 years. I define zones or regions to run the LSTM. So I have 2 separate models, one spatial and one temporal; I don’t know if it would be technically possible to join them somehow later on, instead of keeping both separate.

  6. Architecture too big:
    My computer crashed as well with batch 32, and I have two RTX 3090s. I brought it down to batch 16 and it works, so that’s great. Though this worries me a bit, as this toy data set has only 4 channels for X and 2 for Y, and my real dataset will have 70 for X and 100 for Y… so I will have to see how to fine-tune the whole thing.

Thanks again Raymond for your enlightening and precious input!
Enjoy your week-end!
Manu


Hi Manu,

When I wrote my reply yesterday, I kept thinking about sharing my example. The main driving force for me is, if I know a way is better, then I don’t want to keep it from you. On the other hand, I was not sure how you would feel about that. Of course now I know that I was over-worried.

But back then, I still decided to share it, and to share the best possible one I could think of, because I gambled that we are pragmatic and all we care about is how to achieve the goal. Also, given that every time I read your notebook I can see a lot of effort, I think you don’t lack anything but maybe an example. So there is no reason for you to get stuck on just how to write Python.

That example was literally the best I could think of, and while writing it I also searched for any better way; this is the least I can do for you and for my own learning.

500GB is pretty substantial, and converting it to TFRecords will take up even more space, so be careful. However, the advantage is fast reading; otherwise reading would become a bottleneck during training, when the data is read over and over again.

For the architecture, unless you are informed otherwise, I suggest starting with fewer encoder/decoder blocks and adding them progressively, checking how much improvement each addition brings and whether it is worthwhile (see the sketch below). Starting with a smaller NN gives you a faster way to fine-tune things, to understand how many resources you will need to scale up, and to establish a baseline to compare with your GBM.
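To illustrate that progressive approach, here is a minimal sketch of a UNet whose size is controlled by a depth parameter; build_unet, depth, and base_filters are hypothetical names, and this is not your actual architecture:

from tensorflow.keras import layers, Model

def build_unet(input_shape=(512, 512, 4), n_classes=2, depth=2, base_filters=16):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    skips = []
    # Encoder: two conv layers per level, then downsample
    for d in range(depth):
        x = layers.Conv2D(base_filters * 2 ** d, 3, padding='same', activation='relu')(x)
        x = layers.Conv2D(base_filters * 2 ** d, 3, padding='same', activation='relu')(x)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    # Bottleneck
    x = layers.Conv2D(base_filters * 2 ** depth, 3, padding='same', activation='relu')(x)
    # Decoder: upsample, concatenate the matching skip, then two conv layers
    for d in reversed(range(depth)):
        x = layers.Conv2DTranspose(base_filters * 2 ** d, 2, strides=2, padding='same')(x)
        x = layers.Concatenate()([x, skips[d]])
        x = layers.Conv2D(base_filters * 2 ** d, 3, padding='same', activation='relu')(x)
        x = layers.Conv2D(base_filters * 2 ** d, 3, padding='same', activation='relu')(x)
    # One output channel per species, sigmoid for independent presence probabilities
    outputs = layers.Conv2D(n_classes, 1, activation='sigmoid')(x)
    return Model(inputs, outputs)

model = build_unet(depth=2)        # start small; increase depth only if it pays off
model.summary()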

Lastly, my example is actually nothing; the knowledge you and your organization are going to gain from this work is the real thing. This is my belief.

Have a great weekend, good luck, and feel free to let me know if there is anything technical to discuss.

Raymond :slight_smile:


3 more things:

  1. In my example, I saved 2 TFRecords - a train one and a cv one - but since you have 500GB of data, please consider saving, e.g., one TFRecord per month or one per week. That way you have more flexibility, but you will need to change the code for reading the TFRecords, because you will then be reading a list of TFRecords, not just one. There is a specific way to do that (see the sketch after this list).

  2. For merging the LSTM & the CNN, I don’t have a very good idea for now, but if I were you, I would keep them apart and analyze and compare their prediction results.

  3. If you have too much data, and if you know that there have been dramatic changes in how X and/or Y were observed over the years, then I would consider using only the most recent years of data first.
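For point 1, a minimal sketch of one way to read a folder of TFRecord files; the monthly file naming is just an assumption, and parse_tfrecord is the one defined in the example above:

import glob
import tensorflow as tf

# Hypothetical naming: one compressed TFRecord per month, e.g. ./tfrecords/train_2003-08.tfrec
train_files = sorted(glob.glob('./tfrecords/train_*.tfrec'))

# TFRecordDataset also accepts a list of files; interleave additionally mixes records from several files
train_dataset = (
    tf.data.Dataset.from_tensor_slices(train_files)
    .interleave(lambda f: tf.data.TFRecordDataset(f, compression_type='GZIP'),
                cycle_length=4, num_parallel_calls=tf.data.AUTOTUNE)
    .map(parse_tfrecord)            # parse_tfrecord as defined in the example above
    .shuffle(1000)
    .batch(16, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)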


Giving an example helped me tremendously! Experience takes time, and your example helped me better understand how people with huge experience, like you, approach such a process… especially after having tried on my own… My learning curve is exponential just from talking with you, trying things, and reading what you think about them. I don’t have enough words to thank you, because you have helped me incredibly in my learning journey.

I will follow your advice on saving one TFRecord per month or week, and on starting with a small architecture and a yearly dataset.

Thanks again and take care Raymond :slight_smile:
Talk soon
Manu
