I am stuck on C4W4: Capstone Project Part 1 for data transformation. Is anyone else having a similar issue? The jobs are failing.

https://www.coursera.org/learn/data-modeling-transformation-serving/gradedLti/WufHL/graded-programming-assignment-4-capstone-project-part-1-etl-and-data-modeling

I cannot get the transform songs and transform JSON jobs to run successfully.


Hello @Jiachuan_Wang
If you run terraform apply without any issues yet your Glue jobs fail, my guess is that there is a bug in the Python scripts you completed in section 4.1.1, namely de-c4w4a1-extract-songs-job.py and de-c4w4a1-api-extract-job.py. To get a better understanding of the issue, you can search for AWS Glue in the AWS console, choose ETL jobs from the menu on the left-hand side, choose a job, and look at the logs for its runs.
If the issue persists, please provide more details from those logs so that we can investigate further.
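For reference, here is a minimal boto3 sketch (not part of the assignment; the job name is just one from this lab, so substitute the job you want to inspect) that prints each run's state and error message without opening the console:

import boto3

glue = boto3.client("glue")
# Substitute the name of the job you want to inspect
runs = glue.get_job_runs(JobName="de-c4w4a1-extract-songs-job")["JobRuns"]
for run in runs:
    print(run["Id"], run["JobRunState"], run.get("ErrorMessage", ""))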


Yes, you are right. After fixing a bug in the previous extract steps, it works for me!

Thank you!

@Jiachuan_Wang what ended up being the bug? I'm having a similar issue where the extract jobs succeeded but both transform jobs are failing, and I can't figure out why.

@Amir_Zare
I'm getting this error on, for example, de-c4w4a1-json-transform-job: AnalysisException: Path does not exist: s3://de-c4w4a1-sensitivedatahere-us-east-1-data-lake/landing_zone/api/users/2024-10-28

For de-c4w4a1-songs-transform-job I'm getting: AttributeError: 'DataFrame' object has no attribute 'duration'

Is this an issue with my extract jobs even though they succeeded?

Thank you!


Hello @Reginald_Bain
Apparently, you have filled in the <YOUR-CURRENT-DATE> variable in the transform_job/glue.tf file correctly, yet the first job can't find the data it needs, and the data the second one reads doesn't have the required columns. So my guess is that your issue is with the extract jobs too. After you run your extract jobs, you can check the path you see in the exception, namely s3://de-c4w4a1-sensitivedatahere-us-east-1-data-lake/landing_zone/api/users/2024-10-28, and verify that the files the transform job wants to read indeed exist there.
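For example, a minimal boto3 sketch (using the bucket and prefix from the exception above; substitute your own values) to list what actually landed there:

import boto3

s3 = boto3.client("s3")
# Bucket and prefix taken from the exception message; replace with your own
response = s3.list_objects_v2(
    Bucket="de-c4w4a1-sensitivedatahere-us-east-1-data-lake",
    Prefix="landing_zone/api/users/2024-10-28",
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

If the loop prints nothing, the extract job did not write to the path the transform job is trying to read.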

@Amir_Zare Yeah, I thought the date might be an issue, but it seems OK. If, after I run the extract jobs, the files the transform jobs need are NOT there, do you have a sense of where in the Python scripts for the extract jobs things might be going wrong? I tried redoing the extract portion from scratch and couldn't find anything obvious. Would you be able to take a look at my code? Thank you!

@Reginald_Bain Basically, you need to go back to the extract job and do the work carefully: not just replacing the None placeholders, but also paying attention to other things, such as the API endpoint and some dates. Then you will be good.

@Jiachuan_Wang Ah, I see. Throughout the courses we have only replaced things like "None" and "<BUCKET_NAME>", so I always left everything else alone. One thing I was curious about: in api_extract_job.py, should we be replacing anything in the following block? (Note: I have not included anything beyond what was already in the template files.)

ā€œ#ā€ Replace with your API URLs
api_url = args[ā€œapi_urlā€]
request_start_date = args[ā€œapi_start_dateā€]
request_end_date = args[ā€œapi_end_dateā€]
target_path = args[ā€œtarget_pathā€]
current_timestamp = datetime.now().strftime(ā€œ%Y-%m-%dā€)
print(f"Current Timestamp: {current_timestamp}")

Do we need to replace things like "JOB_NAME"? Usually those are wrapped in <> when we're supposed to replace them. I guess they decided to do this differently for the capstone project? There are a handful of places where the directions seem to tell the user to do something that appears to be already done.

Thanks … this comment saved me when I was stuck. Seeing the error in the AWS Glue console helped me quickly debug the errors. I would appreciate it if you could include this tip in the notebook.

@hravat @Jiachuan_Wang @Amir_Zare Alright, it looks like one can have a time zone problem in the extract phase.
When the Glue job calls datetime.now(), it gets the current date in UTC (the workers' clock). When manually entering the ingest date in the transform glue.tf, you have to make sure to enter the yyyy-mm-dd for the UTC date, NOT your current time zone's date. This is what was making my transform step fail to find the data from my extract step. Hope this is helpful for others.

E.g., when I got this error I had entered 2024-10-29 in that glue.tf file when it was around 10 pm my time, but that is already 2024-10-30 in UTC (on AWS). If you go to your S3 bucket, you can dig through the directories and find the date it is using.
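For illustration, a small standard-library snippet (not lab code) showing how your local date and the UTC date can differ:

from datetime import datetime, timezone

local_date = datetime.now().strftime("%Y-%m-%d")            # your machine's wall-clock date
utc_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")  # the date on a UTC clock
print(local_date, utc_date)  # late in the evening in US time zones, these differ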


Hello @Reginald_Bain
I will double-check with the team to see if the date should indeed be in the UTC time zone, and we will update the instructions accordingly.
I am sorry for the inconvenience you faced, and thank you for sharing your insight.

Thank you so much! It is great to use the UI to inspect failed Glue jobs.

+1 here. Same UTC/time zone issue.


@Amir_Zare please have the team update the instructions in the glue.tf file. This just cost me so much time to dig into, only to realize UTC was the issue when the instructions say to use Pacific Time.


Actually, never mind: after using the UTC date the jobs run longer, but they still fail. Please let me know what steps I can take to investigate. There is no error message, just 'FAILED' when requesting the job status.

Hello @uebelvan
The servers are deployed in the us-west-1 region, so the time zone we use must be Pacific, as the instructions say.
If you have used UTC, the errors you get are probably because of that. To see the exceptions, you can go to AWS Glue in the AWS console, select the job and the run, and check the logs there.
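If you prefer to stay in Python, here is a hedged sketch for pulling those logs (it assumes the default Glue logging setup, which writes driver error logs to the /aws-glue/jobs/error log group with the run ID as the stream name; <YOUR-JOB-RUN-ID> is a placeholder to fill in):

import boto3

logs = boto3.client("logs")
# Default Glue setup: error logs go to /aws-glue/jobs/error, one stream per run ID
response = logs.get_log_events(
    logGroupName="/aws-glue/jobs/error",
    logStreamName="<YOUR-JOB-RUN-ID>",
)
for event in response["events"]:
    print(event["message"])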

+1. Same issue here: the job failed when using the Pacific time zone in the transform glue.tf, whereas the files got extracted using the UTC time zone. Thanks for explaining.

Also, you can run the Glue job and check its status in the AWS Glue UI (without having to type the commands in the command line).

Thanks, everybody, for the useful comments.
Same issue here.
I had run the extraction jobs successfully but got stuck on the transformation jobs due to the time issue. I tried both "yyyy-mm-dd" and "%Y-%m-%d" formats for the ingestion date on the transform jobs, and in both cases got an error similar to:
ValueError: time data '%Y-%m-%d' does not match format '%Y-%m-%d'
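That ValueError means the literal pattern '%Y-%m-%d' was passed where an actual date string was expected, so the placeholder needs to be replaced with a real date written in that format. A minimal reproduction (standard library only, not lab code):

from datetime import datetime

try:
    # Passing the format pattern itself as the data reproduces the error above
    datetime.strptime("%Y-%m-%d", "%Y-%m-%d")
except ValueError as err:
    print(err)  # time data '%Y-%m-%d' does not match format '%Y-%m-%d'

# What the job expects instead: an actual date written in that format
print(datetime.strptime("2024-10-30", "%Y-%m-%d").date())  # 2024-10-30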

I wish the instructions were clear enough to alert us somewhere about changing the scripts. That would avoid redoing the same work.
