Absolute novice to both Cloud and Data Engineering here. I'd appreciate any explanations and mentoring.
BATCH Model
In the batch model, is the Glue Crawler run at all? Is it part of the Glue ETL job?
(Why) does the Recommender System write to both the vector database and S3? I am confused about the direction of data flow. Is the Recommender doing the writing, or a Lambda function? Is the Recommender itself an EC2 instance or a Lambda function? At this point I am too new to deep-dive into Glue or Terraform to get the full picture.
STREAMING Model
3. Where is Kinesis getting its data from? We don't have any simulators like in Lab 2 with Apache Benchmark.
4. I don't understand the bidirectional data flow between the two Lambda functions here. What is the chronology?
I don't think I would be able to answer most of your questions. However, it is good to analyze all that information thrown at you in the first course. The idea is to learn by doing and get hands-on practice, which the book Fundamentals of Data Engineering alone cannot give you.
For the batch process, you will learn to create and use a Glue Crawler later in Course 3 with data lakes. Its job is to scan the data and extract the schema, data types, and other relevant metadata. The Recommender is the model saved in S3: it takes data from the S3 data lake and sends data to the streaming process via the S3 artifacts.
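If it helps to make the crawler less abstract, here is a minimal sketch of what defining one looks like in code. The crawler name, IAM role ARN, S3 path, and database name below are hypothetical placeholders, not values from the course labs; the actual labs use Terraform rather than hand-written boto3 calls.

```python
import json

# Hypothetical crawler configuration -- all names, the role ARN, and the
# S3 path are illustrative placeholders, not values from the course.
def build_crawler_config(name, role_arn, s3_path, database):
    """Build the request body for glue.create_crawler()."""
    return {
        "Name": name,
        "Role": role_arn,               # IAM role the crawler assumes
        "DatabaseName": database,       # Data Catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

cfg = build_crawler_config(
    "ratings-crawler",
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "s3://my-data-lake/ratings/",
    "recsys_db",
)
print(json.dumps(cfg, indent=2))

# With AWS credentials configured, you would then run something like:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**cfg)            # registers the crawler
#   glue.start_crawler(Name=cfg["Name"])  # scans S3, infers table schemas
```

The key idea the sketch shows: the crawler itself holds no data; it just points at an S3 location and writes the schemas it infers into a Data Catalog database that Glue ETL jobs and Athena can then query.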
For the streaming process, you will also learn about Kinesis in Course 2 with streaming ingestion. You will create data streams and run scripts in the terminal to send data between producers and consumers, and finally save the output in the S3 recommendations bucket.
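To answer the "where does Kinesis get data from" question concretely: a producer script builds small JSON records and pushes them into the stream. Here is a rough sketch of one such record; the stream name and event fields are assumptions for illustration, not the exact schema used in the labs.

```python
import json
import time

# Hypothetical event producer -- the event fields and stream name are
# illustrative, not the exact payload schema from the course labs.
def make_record(user_id, item_id):
    """Build one Kinesis record: a JSON payload plus a partition key."""
    payload = {
        "user_id": user_id,
        "item_id": item_id,
        "event": "click",
        "ts": int(time.time()),
    }
    return {
        "Data": json.dumps(payload).encode("utf-8"),
        "PartitionKey": str(user_id),  # determines which shard gets the record
    }

record = make_record(user_id=42, item_id="sku-123")
print(record["PartitionKey"])

# With AWS credentials configured, the producer side would be roughly:
#   import boto3
#   kinesis = boto3.client("kinesis")
#   kinesis.put_record(StreamName="recs-events", **record)
# A consumer (often a Lambda triggered by the stream) then reads these
# records, decodes the JSON, and writes results to S3.
```

So the stream's data source is simply whatever process calls `put_record`; in the labs that role is played by a script you run in the terminal.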