Bike Sharing dataset in C2_W2_Lab_2_Feature_Engineering_Pipeline

fabioantonini · September 18, 2021, 4:15am

Hi everyone,
I followed the suggestion at the end of the official notebook C2_W2_Lab_2_Feature_Engineering_Pipeline to use a different dataset.

Attached is the modified Jupyter notebook: C2_W2_Lab_2_Feature_Engineering_Pipeline_Bike_Sharing.ipynb (36.2 KB)

SeoulBikeData.csv is downloaded programmatically, so there is no need to create a local folder.
I have tested it on Google Colab. At the beginning of the notebook I also added a piece of code that detects which platform the notebook is running on (Colab or not).
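A minimal sketch of such a platform check, assuming the common heuristic that Colab preloads the `google.colab` package. This is not necessarily the code used in the attached notebook, and the `running_on_colab` helper and `DATA_ROOT` paths are hypothetical names chosen for illustration:

```python
import sys


def running_on_colab() -> bool:
    """Return True when the interpreter appears to be a Google Colab runtime."""
    # Heuristic (assumption): Colab imports the google.colab package at startup,
    # so it shows up in sys.modules there and normally nowhere else.
    return "google.colab" in sys.modules


# Hypothetical data locations; adjust to the layout the notebook actually uses.
DATA_ROOT = "/content/data" if running_on_colab() else "./data"
print(f"Running on Colab: {running_on_colab()}, data root: {DATA_ROOT}")
```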
This is just the initial release, and everything can be improved.
I look forward to your comments and feedback.
BR

spsh · November 11, 2021, 6:24pm

Is it possible to ingest data from a CSV that is not Unicode-encoded? The source CSV file is not in Unicode. Converting it to Unicode does not happen “within the pipeline”, and that seems odd to me.

fabioantonini · November 17, 2021, 5:25am

Hi @spsh,
thanks for your question.
I had to use ‘latin1’ because when I tried to import SeoulBikeData.csv in the usual way (without an encoding flag) I got an error, so I changed the import encoding to ‘latin1’. If you have found an alternative way, please let me know and I will remove that flag.
BR

Rajapradeepan_Rajend · September 22, 2022, 2:25pm

How do I do missing value imputation?

Rajapradeepan_Rajend · September 22, 2022, 2:29pm

Is there any reference code where feature selection is also part of the TFX pipeline?

fab_gen · November 3, 2022, 10:31pm

Hi @fabioantonini, and thank you for publishing your notebook!

Can you tell me why you chose ‘latin1’ as the encoding?

Also, when I use the print function like this:

with open(_data_filepath) as f:
  print(f)

This is displayed:
<_io.TextIOWrapper name='./data/SeoulBikeData.csv' mode='r' encoding='UTF-8'>, which seems to indicate that the encoding of the CSV file is UTF-8.

But when I run:

context.run(example_gen)

I get this error (surely the same one you encountered):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 40: invalid start byte

Can you explain why the 'utf-8' codec can't decode the file when it is reported as UTF-8 encoded?

Sorry to bother you with my Unicode questions, but I have the feeling this won't be the last time I encounter this kind of Unicode error.

Thank you in advance!

BR
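A small sketch that may help with the question above. `open()` does not inspect the file's bytes: the encoding it reports is only the default the platform would use for decoding, so a file can be shown as UTF-8 and still fail to decode. Byte 0xb0 is the degree sign in Latin-1. The header line below is an assumption made for illustration (a column named like `Temperature(°C)`), not a copy of the real file:

```python
import os
import tempfile

# Assumed header, for illustration only. "\u00b0" is the degree sign,
# which Latin-1 stores as the single byte 0xb0.
header = "Date,Rented Bike Count,Hour,Temperature(\u00b0C)\n"
raw = header.encode("latin1")

path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "wb") as f:
    f.write(raw)

# The reported encoding is the platform default, not a property of the file.
with open(path) as f:
    reported = f.encoding
print("open() reports:", reported)

# Decoding the same bytes as UTF-8 fails, because a lone 0xb0 is not a
# valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
    bad_position = None
except UnicodeDecodeError as err:
    utf8_ok = False
    bad_position = err.start
    print("UTF-8 decode failed:", err)

# Latin-1 maps every byte to a character, so decoding succeeds.
decoded = raw.decode("latin1")
print("Latin-1 decode:", decoded.strip())
```

With this assumed header, the failing byte sits at position 40, which would be consistent with the error message quoted above.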

Agam_Mehta · September 28, 2023, 9:48pm

One way to solve the Unicode problem is to read the CSV into an object like ‘df’ and then, after creating the ‘data’ or ‘bikerdata’ folder, convert that ‘df’ back to CSV with df.to_csv.

PS: Do not unzip the CSV directly into the ‘data’ or ‘bikerdata’ folder; otherwise you will face encoding issues.
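The steps above could be sketched with pandas as follows. This is an illustration under assumptions: the source file is taken to be Latin-1 encoded (as suggested earlier in the thread), the `reencode_csv` helper and the paths are hypothetical, and the tiny sample file stands in for the real dataset:

```python
import os
import tempfile

import pandas as pd


def reencode_csv(src_path: str, dst_path: str, src_encoding: str = "latin1") -> None:
    """Read a CSV in its original encoding and write a UTF-8 copy."""
    df = pd.read_csv(src_path, encoding=src_encoding)
    # Create the target folder (e.g. 'data') before writing into it.
    os.makedirs(os.path.dirname(dst_path) or ".", exist_ok=True)
    df.to_csv(dst_path, index=False, encoding="utf-8")


# Demo with a stand-in file; the column names are assumed for illustration.
work_dir = tempfile.mkdtemp()
src = os.path.join(work_dir, "raw.csv")
dst = os.path.join(work_dir, "data", "SeoulBikeData.csv")
with open(src, "wb") as f:
    f.write("Hour,Temperature(\u00b0C)\n0,-5.2\n".encode("latin1"))

reencode_csv(src, dst)

# The rewritten file now decodes cleanly as UTF-8.
with open(dst, "rb") as f:
    utf8_text = f.read().decode("utf-8")
print(utf8_text)
```

Pointing the pipeline's input folder at the directory containing only the rewritten file keeps the original Latin-1 copy out of the ingestion step.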

Agam_Mehta · September 28, 2023, 9:48pm

Also, thanks for the solution. It worked well.

Bestman_ezekwu_Enock · November 8, 2023, 10:00pm

Thank you for creating the Colab version. I had a really tough time setting up the environment on Colab and on my local machine (Windows 11). Could you please share links or materials on how best to troubleshoot the environment setup? Thanks.
