I am stuck on C4W4: Capstone Project Part 1 for data transformation. Is anyone else having a similar issue? The jobs are failing.

https://www.coursera.org/learn/data-modeling-transformation-serving/gradedLti/WufHL/graded-programming-assignment-4-capstone-project-part-1-etl-and-data-modeling

I cannot get the transform songs and transform JSON jobs to run successfully.


Hello @Jiachuan_Wang
If you run terraform apply without any issues yet your Glue jobs fail, my guess is that there is a bug in the Python scripts you completed in section 4.1.1, namely de-c4w4a1-extract-songs-job.py and de-c4w4a1-api-extract-job.py. To get a better understanding of the issue, you can search for AWS Glue in the AWS console, choose ETL jobs from the menu on the left-hand side, choose a job, and look at the logs for its runs.
If the issue persists, please provide more details from those logs so that we can investigate further.
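For reference, here is a minimal boto3 sketch (not part of the assignment; the job name is just one from this lab, so substitute the job you want to inspect) that prints each run's state and error message without opening the console:

import boto3

glue = boto3.client("glue")
# Substitute the name of the job you want to inspect
runs = glue.get_job_runs(JobName="de-c4w4a1-extract-songs-job")["JobRuns"]
for run in runs:
    print(run["Id"], run["JobRunState"], run.get("ErrorMessage", ""))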


Yes, you are right. After fixing a bug in the previous extract steps, it works for me!

Thank you!

@Jiachuan_Wang what ended up being the bug? I'm having a similar issue where the extract jobs succeeded but both transform jobs are failing, and I can't figure out why.

@Amir_Zare
I'm getting this error on, for example, de-c4w4a1-json-transform-job: AnalysisException: Path does not exist: s3://de-c4w4a1-sensitivedatahere-us-east-1-data-lake/landing_zone/api/users/2024-10-28

For de-c4w4a1-songs-transform-job I'm getting: AttributeError: 'DataFrame' object has no attribute 'duration'

Is this an issue with my extract jobs even though they succeeded?

Thank you!


Hello @Reginald_Bain
Apparently, you have filled in the <YOUR-CURRENT-DATE> variable in the transform_job/glue.tf file correctly, yet the first job can't find the data it needs, and the data the second one reads doesn't have the required columns. So my guess is that your issue is with the extract jobs too. After you run your extract jobs, you can check the path you see in the exception, namely s3://de-c4w4a1-sensitivedatahere-us-east-1-data-lake/landing_zone/api/users/2024-10-28, and verify that the files the transform job wants to read indeed exist there.
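For example, a minimal boto3 sketch (using the bucket and prefix from the exception above; substitute your own values) to list what actually landed there:

import boto3

s3 = boto3.client("s3")
# Bucket and prefix taken from the exception message; replace with your own
response = s3.list_objects_v2(
    Bucket="de-c4w4a1-sensitivedatahere-us-east-1-data-lake",
    Prefix="landing_zone/api/users/2024-10-28",
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

If the loop prints nothing, the extract job did not write to the path the transform job is trying to read.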

@Amir_Zare Yeah, I thought the date might be an issue, but it seems OK. If, after I run the extract jobs, the files the transform jobs need are NOT there, do you have a sense of where in the Python scripts for the extract jobs things might be going wrong? I tried redoing the extract portion from scratch and couldn't find anything obvious. Would you be able to take a look at my code? Thank you!

@Reginald_Bain Basically, you need to go back to the extract job and do the work carefully: not just replacing the None placeholders, but also paying attention to other things, such as the API endpoint and some dates. Then you will be good.

@Jiachuan_Wang Ah, I see. Throughout the courses we have only replaced things like "None" and "<BUCKET_NAME>", so I always left everything else alone. One thing I was curious about: in api_extract_job.py, should we be replacing anything in the following block? (Note: I have not included anything beyond what was already in the template files.)

ā€œ#ā€ Replace with your API URLs
api_url = args[ā€œapi_urlā€]
request_start_date = args[ā€œapi_start_dateā€]
request_end_date = args[ā€œapi_end_dateā€]
target_path = args[ā€œtarget_pathā€]
current_timestamp = datetime.now().strftime(ā€œ%Y-%m-%dā€)
print(f"Current Timestamp: {current_timestamp}")

Do we need to replace things like "JOB_NAME"? Usually those are wrapped in <> when we're supposed to replace them. I guess they decided to do this differently for the capstone project? There are a handful of places where the directions seem to tell the user to do something that appears to be already done.

Thanks … this comment saved me when I was stuck. Seeing the error in the AWS Glue console helped me quickly debug the errors. I would appreciate it if you could include this tip in the notebook.

@hravat @Jiachuan_Wang @Amir_Zare Alright, it looks like one can have a time zone problem in the extract phase.
When the Glue job calls datetime.now(), it gets the current date in UTC (the workers' clock). When manually entering the ingest date in the transform glue.tf, you have to make sure to enter the yyyy-mm-dd for the UTC date, NOT your current time zone's date. This is what was making my transform step fail to find the data from my extract step. Hope this is helpful for others.

E.g., when I got this error I had entered 2024-10-29 in that glue.tf file when it was around 10 pm my time, but that is already 2024-10-30 in UTC (on AWS). If you go to your S3 bucket, you can dig through the directories and find the date it is using.
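For illustration, a small standard-library snippet (not lab code) showing how your local date and the UTC date can differ:

from datetime import datetime, timezone

local_date = datetime.now().strftime("%Y-%m-%d")            # your machine's wall-clock date
utc_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")  # the date on a UTC clock
print(local_date, utc_date)  # late in the evening in US time zones, these differ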


Hello @Reginald_Bain
I will double-check with the team to see if the date should indeed be in the UTC time zone, and we will update the instructions accordingly.
I am sorry for the inconvenience you faced, and thank you for sharing your insight.

Thank you so much! It is great to use the UI to inspect failed Glue jobs.

+1 here. Same UTC/time zone issue.


@Amir_Zare please have the team update the instructions in the glue.tf file. This just cost me so much time to dig into, only to realize UTC was the issue when the instructions say to use Pacific Time.


Actually, never mind: after using the UTC date the jobs run longer, but they still fail. Please let me know what steps I can take to investigate. There is no error message, just 'FAILED' when requesting the job status.

Hello @uebelvan
The servers are deployed in the us-west-1 region, so the time zone we use must be Pacific, as the instructions say.
If you have used UTC, the errors you get are probably because of that. To see the exceptions, you can go to AWS Glue in the AWS console, select the job and the run, and check the logs there.
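If you prefer to stay in Python, here is a hedged sketch for pulling those logs (it assumes the default Glue logging setup, which writes driver error logs to the /aws-glue/jobs/error log group with the run ID as the stream name; <YOUR-JOB-RUN-ID> is a placeholder to fill in):

import boto3

logs = boto3.client("logs")
# Default Glue setup: error logs go to /aws-glue/jobs/error, one stream per run ID
response = logs.get_log_events(
    logGroupName="/aws-glue/jobs/error",
    logStreamName="<YOUR-JOB-RUN-ID>",
)
for event in response["events"]:
    print(event["message"])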

+1. Same issue here: the job failed when using the Pacific time zone in the transform glue.tf, whereas the files got extracted using the UTC time zone. Thanks for explaining.

Also, you can run the Glue job and check its status in the AWS Glue UI (without having to type the commands in the command line).

Thanks, everybody, for the useful comments.
Same issue here.
I had run the extraction jobs successfully but got stuck on the transformation jobs due to the time issue. I tried both "yyyy-mm-dd" and "%Y-%m-%d" formats for the ingestion date on the transform jobs, and in both cases got an error similar to:
ValueError: time data '%Y-%m-%d' does not match format '%Y-%m-%d'
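That ValueError means the literal pattern '%Y-%m-%d' was passed where an actual date string was expected, so the placeholder needs to be replaced with a real date written in that format. A minimal reproduction (standard library only, not lab code):

from datetime import datetime

try:
    # Passing the format pattern itself as the data reproduces the error above
    datetime.strptime("%Y-%m-%d", "%Y-%m-%d")
except ValueError as err:
    print(err)  # time data '%Y-%m-%d' does not match format '%Y-%m-%d'

# What the job expects instead: an actual date written in that format
print(datetime.strptime("2024-10-30", "%Y-%m-%d").date())  # 2024-10-30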

I wish the instructions were clear enough to alert us somewhere about changing the scripts. That would avoid redoing the same work.
