C4W4 Capstone project 2 - Data Quality checks

The data quality checks that run as Airflow tasks are failing.
I had to turn these tasks off to get the DAGs to run successfully.

9433eeb05036
*** Found local files:
***   * /opt/airflow/logs/dag_id=deftunes_songs_pipeline_dag/run_id=scheduled__2020-02-01T00:00:00+00:00/task_id=dq_check_songs/attempt=2.log
*** Found logs served from host http://9433eeb05036:8793/log/dag_id=deftunes_songs_pipeline_dag/run_id=scheduled__2020-02-01T00:00:00+00:00/task_id=dq_check_songs/attempt=2.log
[2025-01-26, 19:50:54 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-01-26, 19:50:54 UTC] {base_aws.py:606} WARNING - Unable to find AWS Connection ID 'aws_default', switching to empty.
[2025-01-26, 19:50:54 UTC] {base_aws.py:180} INFO - No connection ID provided. Fallback on boto3 credential strategy (region_name='us-east-1'). See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html
[2025-01-26, 19:50:56 UTC] {credentials.py:1075} INFO - Found credentials from IAM Role: de-c4w4a2-ec2-role
[2025-01-26, 19:50:56 UTC] {glue.py:440} INFO - Submitting AWS Glue data quality ruleset evaluation run for RulesetNames ['songs_dq_ruleset']
[2025-01-26, 19:50:56 UTC] {glue.py:471} INFO - Waiting for AWS Glue data quality ruleset evaluation run RunId: dqrun-ce615f2ead4ef7c01e28a74140e56fae30855a7b to complete.
[2025-01-26, 19:52:57 UTC] {taskinstance.py:3310} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 767, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 733, in _execute_callable
    return ExecutionCallableRunner(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/baseoperator.py", line 406, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/amazon/aws/operators/glue.py", line 473, in execute
    self.hook.get_waiter("data_quality_ruleset_evaluation_run_complete").wait(
  File "/home/airflow/.local/lib/python3.11/site-packages/botocore/waiter.py", line 55, in wait
    Waiter.wait(self, **kwargs)
  File "/home/airflow/.local/lib/python3.11/site-packages/botocore/waiter.py", line 374, in wait
    raise WaiterError(
botocore.exceptions.WaiterError: Waiter data_quality_ruleset_evaluation_run_complete failed: Waiter encountered a terminal failure state: For expression "Status" we matched expected path: "FAILED"
[2025-01-26, 19:52:57 UTC] {taskinstance.py:1225} INFO - Marking task as UP_FOR_RETRY. dag_id=deftunes_songs_pipeline_dag, task_id=dq_check_songs, run_id=scheduled__2020-02-01T00:00:00+00:00, execution_date=20200201T000000, start_date=20250126T195054, end_date=20250126T195257
[2025-01-26, 19:52:57 UTC] {taskinstance.py:340} ▶ Post task execution logs
[2025-01-26, 20:23:35 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-01-26, 20:23:36 UTC] {base_aws.py:606} WARNING - Unable to find AWS Connection ID 'aws_default', switching to empty.
[2025-01-26, 20:23:36 UTC] {base_aws.py:180} INFO - No connection ID provided. Fallback on boto3 credential strategy (region_name='us-east-1'). See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html
[2025-01-26, 20:23:37 UTC] {credentials.py:1075} INFO - Found credentials from IAM Role: de-c4w4a2-ec2-role
[2025-01-26, 20:23:37 UTC] {glue.py:440} INFO - Submitting AWS Glue data quality ruleset evaluation run for RulesetNames ['songs_dq_ruleset']
[2025-01-26, 20:23:38 UTC] {glue.py:471} INFO - Waiting for AWS Glue data quality ruleset evaluation run RunId: dqrun-f594ba68f1dc7c69cf2ef9f4c6edc3f40535a098 to complete.

@konutech glad you got it to work. Hope it helps

@Georgios It’s not working, he turned it off… and I have the same issue.

Mine fails at songs (in the tf file, songs has default values; I edited only sessions & users).

@zvika_sinkevich sorry for the inconvenience. Could you fill in this form for a lab refresh? That might refresh the existing data quality checks. Thanks

@Georgios form submitted

Hello @zvika_sinkevich,
Did your jobs from steps 2.4 and 2.5 succeed? If so, could you go to the Databases menu in AWS Glue, delete the rulesets, and then run terraform apply -target=module.data_quality again? Thanks
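If you prefer to do the deletion from a script instead of the console, here is a minimal sketch, assuming boto3 is available and your credentials can call the Glue data quality APIs. The function name `delete_rulesets` is made up for illustration, and `songs_dq_ruleset` is the only ruleset name confirmed in this thread; pass whatever names your lab actually uses.

```python
def delete_rulesets(glue, names):
    """Delete the named Glue data quality rulesets that currently exist.

    `glue` is a boto3 Glue client. Returns the names that were deleted.
    """
    # Collect every existing ruleset name, following NextToken pagination.
    existing = set()
    kwargs = {}
    while True:
        page = glue.list_data_quality_rulesets(**kwargs)
        existing.update(r["Name"] for r in page.get("Rulesets", []))
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token

    # Delete only the rulesets that are really there, so a missing one
    # does not raise an error.
    deleted = []
    for name in names:
        if name in existing:
            glue.delete_data_quality_ruleset(Name=name)
            deleted.append(name)
    return deleted


# Example usage (region taken from the task logs in this thread):
# import boto3
# glue = boto3.client("glue", region_name="us-east-1")
# print(delete_rulesets(glue, ["songs_dq_ruleset"]))
```

After the rulesets are gone, run terraform apply -target=module.data_quality again so Terraform recreates them.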

Now the lab is working fine - everything is working

I’m facing the same issue. What’s the problem here, and how do I fix it? @Georgios

eb534735d22d
*** Found local files:
***   * /opt/airflow/logs/dag_id=deftunes_songs_pipeline_dag/run_id=scheduled__2020-02-01T00:00:00+00:00/task_id=dq_check_songs/attempt=2.log
[2025-03-07, 09:48:58 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-03-07, 09:48:59 UTC] {base_aws.py:606} WARNING - Unable to find AWS Connection ID 'aws_default', switching to empty.
[2025-03-07, 09:48:59 UTC] {base_aws.py:180} INFO - No connection ID provided. Fallback on boto3 credential strategy (region_name='us-east-1'). See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html
[2025-03-07, 09:49:00 UTC] {credentials.py:1075} INFO - Found credentials from IAM Role: de-c4w4a2-ec2-role
[2025-03-07, 09:49:00 UTC] {taskinstance.py:3310} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 767, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 733, in _execute_callable
    return ExecutionCallableRunner(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/baseoperator.py", line 406, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/amazon/aws/operators/glue.py", line 438, in execute
    self.validate_inputs()
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/amazon/aws/operators/glue.py", line 435, in validate_inputs
    raise AirflowException(f"Following RulesetNames are not found {not_found_ruleset}")
airflow.exceptions.AirflowException: Following RulesetNames are not found ['songs_dq_ruleset']
[2025-03-07, 09:49:00 UTC] {taskinstance.py:1225} INFO - Marking task as FAILED. dag_id=deftunes_songs_pipeline_dag, task_id=dq_check_songs, run_id=scheduled__2020-02-01T00:00:00+00:00, execution_date=20200201T000000, start_date=20250307T094858, end_date=20250307T094900
[2025-03-07, 09:49:01 UTC] {taskinstance.py:340} ▶ Post task execution logs

Hello @pawanshirbhate,

Could you check that you successfully created the songs_dq_ruleset, since the error says it was not found? You can find it in the AWS console:

  1. Go to AWS Glue > Tables under Data Catalog, and choose View data quality for songs.

  2. See if it succeeded, then run the DAG again. Hope it helps.
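The same check can be done from code. This is a minimal sketch, assuming boto3 and a Glue client with permission to read rulesets; `missing_rulesets` is a hypothetical helper name, not something from the lab. It looks each ruleset up by name and reports the ones Glue does not know about, which is the condition behind the "RulesetNames are not found" error in the log above.

```python
def missing_rulesets(glue, names):
    """Return the subset of `names` that Glue has no ruleset for.

    `glue` is a boto3 Glue client.
    """
    missing = []
    for name in names:
        try:
            glue.get_data_quality_ruleset(Name=name)
        except glue.exceptions.EntityNotFoundException:
            # Glue raises this when no ruleset with that name exists.
            missing.append(name)
    return missing


# Example usage (region taken from the task logs in this thread):
# import boto3
# glue = boto3.client("glue", region_name="us-east-1")
# print(missing_rulesets(glue, ["songs_dq_ruleset"]))
```

An empty list means the ruleset exists and the DAG task should get past input validation.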

Hi @Georgios, Mine is failing under user.

Hello @sapiensush,

I couldn’t reproduce your issue. The rules look good, but I can see it fails with that value. Could you check the logs for that specific DAG for more info? Thank you
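When the Airflow log only shows the waiter hitting "FAILED", the reason is usually recorded on the Glue side. As a rough sketch, assuming boto3 and the RunId printed in the task log (the "dqrun-..." value after "Waiting for AWS Glue data quality ruleset evaluation run RunId:"), you could fetch the run and read its status and error message. `describe_dq_run` is a hypothetical helper name.

```python
def describe_dq_run(glue, run_id):
    """Summarise a Glue data quality ruleset evaluation run.

    `glue` is a boto3 Glue client; `run_id` is the dqrun-... id from the log.
    """
    run = glue.get_data_quality_ruleset_evaluation_run(RunId=run_id)
    return {
        "status": run.get("Status"),
        # ErrorString is only populated when the run failed.
        "error": run.get("ErrorString"),
        # ResultIds point at per-rule results when the run produced any.
        "result_ids": run.get("ResultIds", []),
    }


# Example usage (region taken from the task logs in this thread):
# import boto3
# glue = boto3.client("glue", region_name="us-east-1")
# print(describe_dq_run(glue, "dqrun-ce615f2ead4ef7c01e28a74140e56fae30855a7b"))
```

The "error" field is the part worth pasting here if the task keeps failing.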