Bike Sharing dataset in C2_W2_Lab_2_Feature_Engineering_Pipeline

Hi everyone,
I followed the suggestion at the end of the official notebook C2_W2_Lab_2_Feature_Engineering_Pipeline to try a different dataset.

Attached is the modified Jupyter notebook: C2_W2_Lab_2_Feature_Engineering_Pipeline_Bike_Sharing.ipynb (36.2 KB)

SeoulBikeData.csv is downloaded programmatically, so there is no need to create a folder locally.
I have tested it on Google Colab. At the beginning of the notebook I have also added a piece of code that detects which platform the notebook is running on (Colab or not Colab).
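
The two points above, downloading the CSV programmatically and detecting Colab, could look roughly like the sketch below. This is not the code from the attached notebook: `running_in_colab`, `fetch_csv`, and the `DATA_URL` placeholder are hypothetical names, and checking for `google.colab` in `sys.modules` is one common idiom that is assumed to be adequate here.

```python
import os
import sys
import urllib.request
from urllib.parse import urlparse


def running_in_colab() -> bool:
    """Return True if the google.colab package is already loaded (assumed Colab marker)."""
    return "google.colab" in sys.modules


def fetch_csv(url: str, dest_dir: str) -> str:
    """Download url into dest_dir (created if missing) and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    filename = os.path.basename(urlparse(url).path)
    dest_path = os.path.join(dest_dir, filename)
    if not os.path.exists(dest_path):  # skip the download on reruns
        urllib.request.urlretrieve(url, dest_path)
    return dest_path


# Hypothetical usage; DATA_URL stands for wherever SeoulBikeData.csv is hosted.
# data_root = "/content/data" if running_in_colab() else "./data"
# csv_path = fetch_csv(DATA_URL, data_root)
```

Because `fetch_csv` creates the target directory itself, no folder has to exist before the notebook runs.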
This is just the initial release. Everything can be improved.
Looking forward to your comments and feedback.
BR

Is it possible to ingest data from a CSV that is not Unicode-encoded? The source CSV file is not in Unicode, and converting it to Unicode is not “within the pipeline”, which seems weird to me.

Hi @spsh,
thanks for your question.
I had to use 'latin1' because I got an error when I tried to import SeoulBikeData.csv in the usual way (without specifying an encoding). So I changed the import encoding to 'latin1'. If you have found an alternative, please let me know and I will remove that flag.
BR

How do I do missing value imputation?

Is there any reference code where feature selection is also part of the TFX pipeline?

Hi @fabioantonini, and thank you for publishing your notebook!

Can you tell me why you chose 'latin1' as the encoding?

Also, when I use the print function like this:

with open(_data_filepath) as f:
  print(f)

I get this output:
<_io.TextIOWrapper name='./data/SeoulBikeData.csv' mode='r' encoding='UTF-8'>, which seems to indicate that the CSV file is UTF-8 encoded.

But when I run:

context.run(example_gen)

I get this error (surely the same one you encountered):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 40: invalid start byte

Can you explain why the 'utf-8' codec can't decode the file when it is reported as UTF-8 encoded?

Sorry to bother you with my Unicode questions, but I have the feeling this won't be the last time I encounter this kind of error :sweat_smile:

Thank you in advance!

BR
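
On the question above: `open()` without an `encoding` argument does not inspect the file. The `encoding='UTF-8'` shown in the `TextIOWrapper` repr is only the platform default that Python would use to decode, and nothing is decoded until the file is actually read. Byte 0xb0 is the degree sign (°) in Latin-1, while in UTF-8 it can only appear as a continuation byte, hence "invalid start byte". The sketch below reproduces this on a header line that is assumed, not verified, to match the start of SeoulBikeData.csv; the variable names are illustrative.

```python
# Assumed first bytes of the CSV header, as stored on disk in Latin-1.
header = b"Date,Rented Bike Count,Hour,Temperature(\xb0C)"

try:
    header.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as err:
    # 0xb0 cannot start a UTF-8 character, so decoding stops at that byte.
    utf8_error = err

print(utf8_error)

# Decoding the same bytes as Latin-1 succeeds and maps 0xb0 to the degree sign.
print(header.decode("latin1"))
```

If the assumption about the header holds, the failing byte sits at offset 40, which is the position reported in the error message quoted in the post.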

The Unicode problem can only be solved by reading the CSV into an object such as 'df', then creating the 'data' or 'bikerdata' folder, and finally converting that 'df' back to CSV with df.to_csv.

PS: Do not unzip the CSV into the 'data' or 'bikerdata' folder, otherwise you will face encoding issues.
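
The round trip described in this post (read with the source encoding, then write the file back out) amounts to re-encoding the CSV as UTF-8 before the pipeline reads it. Here is a minimal sketch using only the standard library. `reencode_csv` and its parameters are hypothetical names, and 'latin1' as the source encoding is an assumption taken from earlier in the thread. With pandas, the equivalent would be `pd.read_csv(src, encoding="latin1")` followed by `df.to_csv(dst, index=False)`.

```python
from pathlib import Path


def reencode_csv(src, dst, src_encoding="latin1"):
    """Rewrite the file at src as UTF-8 at dst and return the destination path."""
    text = Path(src).read_bytes().decode(src_encoding)
    dst_path = Path(dst)
    # Create the target folder (for example 'data') if it does not exist yet.
    dst_path.parent.mkdir(parents=True, exist_ok=True)
    dst_path.write_bytes(text.encode("utf-8"))
    return dst_path


# Hypothetical usage, writing the converted copy into the pipeline's data folder:
# reencode_csv("SeoulBikeData.csv", "./data/SeoulBikeData.csv")
```

Working on bytes keeps the original line endings untouched; only the character encoding changes.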

Also, thanks for the solution. It worked well.

Thank you for creating the Colab version. I had a really tough time setting up the environment on Colab and on my local machine (Windows 11). Could you please share links or materials on how best to troubleshoot environment setup? Thanks.