C4W4 Capstone project 2 - Data Quality checks


The data quality checks that run as tasks in Airflow are not working.
I had to turn these tasks off to get the DAGs to run successfully.

9433eeb05036
*** Found local files:
***   * /opt/airflow/logs/dag_id=deftunes_songs_pipeline_dag/run_id=scheduled__2020-02-01T00:00:00+00:00/task_id=dq_check_songs/attempt=2.log
*** Found logs served from host http://9433eeb05036:8793/log/dag_id=deftunes_songs_pipeline_dag/run_id=scheduled__2020-02-01T00:00:00+00:00/task_id=dq_check_songs/attempt=2.log
[2025-01-26, 19:50:54 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-01-26, 19:50:54 UTC] {base_aws.py:606} WARNING - Unable to find AWS Connection ID 'aws_default', switching to empty.
[2025-01-26, 19:50:54 UTC] {base_aws.py:180} INFO - No connection ID provided. Fallback on boto3 credential strategy (region_name='us-east-1'). See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html
[2025-01-26, 19:50:56 UTC] {credentials.py:1075} INFO - Found credentials from IAM Role: de-c4w4a2-ec2-role
[2025-01-26, 19:50:56 UTC] {glue.py:440} INFO - Submitting AWS Glue data quality ruleset evaluation run for RulesetNames ['songs_dq_ruleset']
[2025-01-26, 19:50:56 UTC] {glue.py:471} INFO - Waiting for AWS Glue data quality ruleset evaluation run RunId: dqrun-ce615f2ead4ef7c01e28a74140e56fae30855a7b to complete.
[2025-01-26, 19:52:57 UTC] {taskinstance.py:3310} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 767, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 733, in _execute_callable
    return ExecutionCallableRunner(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/baseoperator.py", line 406, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/amazon/aws/operators/glue.py", line 473, in execute
    self.hook.get_waiter("data_quality_ruleset_evaluation_run_complete").wait(
  File "/home/airflow/.local/lib/python3.11/site-packages/botocore/waiter.py", line 55, in wait
    Waiter.wait(self, **kwargs)
  File "/home/airflow/.local/lib/python3.11/site-packages/botocore/waiter.py", line 374, in wait
    raise WaiterError(
botocore.exceptions.WaiterError: Waiter data_quality_ruleset_evaluation_run_complete failed: Waiter encountered a terminal failure state: For expression "Status" we matched expected path: "FAILED"
[2025-01-26, 19:52:57 UTC] {taskinstance.py:1225} INFO - Marking task as UP_FOR_RETRY. dag_id=deftunes_songs_pipeline_dag, task_id=dq_check_songs, run_id=scheduled__2020-02-01T00:00:00+00:00, execution_date=20200201T000000, start_date=20250126T195054, end_date=20250126T195257
[2025-01-26, 19:52:57 UTC] {taskinstance.py:340} ▶ Post task execution logs
[2025-01-26, 20:23:35 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-01-26, 20:23:36 UTC] {base_aws.py:606} WARNING - Unable to find AWS Connection ID 'aws_default', switching to empty.
[2025-01-26, 20:23:36 UTC] {base_aws.py:180} INFO - No connection ID provided. Fallback on boto3 credential strategy (region_name='us-east-1'). See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html
[2025-01-26, 20:23:37 UTC] {credentials.py:1075} INFO - Found credentials from IAM Role: de-c4w4a2-ec2-role
[2025-01-26, 20:23:37 UTC] {glue.py:440} INFO - Submitting AWS Glue data quality ruleset evaluation run for RulesetNames ['songs_dq_ruleset']
[2025-01-26, 20:23:38 UTC] {glue.py:471} INFO - Waiting for AWS Glue data quality ruleset evaluation run RunId: dqrun-f594ba68f1dc7c69cf2ef9f4c6edc3f40535a098 to complete.
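The traceback above only reports that the waiter saw a `FAILED` status; it does not say why the evaluation run failed. A minimal sketch of how the reason could be looked up with boto3 follows, assuming the same IAM role and region (`us-east-1`) shown in the log. The helper name `describe_dq_run` is hypothetical; the run ID in the commented usage is the one from the first attempt in the log.

```python
def describe_dq_run(glue_client, run_id):
    """Summarise a Glue Data Quality ruleset evaluation run.

    glue_client is expected to be a boto3 Glue client (or any object
    exposing get_data_quality_ruleset_evaluation_run).
    """
    response = glue_client.get_data_quality_ruleset_evaluation_run(RunId=run_id)
    return {
        "status": response.get("Status"),            # e.g. "FAILED"
        "error": response.get("ErrorString"),        # failure reason, if Glue reports one
        "rulesets": response.get("RulesetNames", []),
        "result_ids": response.get("ResultIds", []),
    }


# Usage (assumes boto3 is installed and AWS credentials are available):
#   import boto3
#   glue = boto3.client("glue", region_name="us-east-1")
#   print(describe_dq_run(glue, "dqrun-ce615f2ead4ef7c01e28a74140e56fae30855a7b"))
```

If `error` is populated, it usually narrows the problem down faster than the Airflow log alone.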

@konutech glad you got it to work. Hope it helps

@Georgios It’s not working; he turned it off, and I have the same issue.

Mine fails at songs (in the tf file, songs has default values; I edited only sessions & users).

@zvika_sinkevich sorry for the inconvenience. Could you fill in this form for a lab refresh? That might refresh the existing data quality checks. Thanks.

@Georgios form submitted

Hello @zvika_sinkevich,
Did your jobs from steps 2.4 and 2.5 run successfully? If so, could you go to the Databases menu in AWS Glue, delete the rulesets, and then run terraform apply -target=module.data_quality again? Thanks.
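For anyone who prefers a script to the console, the ruleset deletion could be sketched with boto3 as below. This is an assumption-laden illustration, not the lab's official procedure: the helper name `delete_dq_rulesets` and the `name_filter` parameter are hypothetical, and only `songs_dq_ruleset` is a ruleset name confirmed by the log in this thread.

```python
def delete_dq_rulesets(glue_client, name_filter=lambda name: True):
    """Delete Glue Data Quality rulesets whose names pass name_filter.

    Returns the list of deleted ruleset names.
    """
    # Collect all names first so that deleting does not disturb pagination.
    names = []
    kwargs = {}
    while True:
        page = glue_client.list_data_quality_rulesets(**kwargs)
        names.extend(r["Name"] for r in page.get("Rulesets", []))
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token

    deleted = []
    for name in names:
        if name_filter(name):
            glue_client.delete_data_quality_ruleset(Name=name)
            deleted.append(name)
    return deleted


# Usage (assumes boto3 is installed and AWS credentials are available):
#   import boto3
#   glue = boto3.client("glue", region_name="us-east-1")
#   delete_dq_rulesets(glue, lambda n: n.endswith("_dq_ruleset"))
```

After the rulesets are gone, running terraform apply -target=module.data_quality from the lab's Terraform directory should recreate them.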

Now the lab is working fine - everything is working
