Absolute novice to both Cloud and Data Engineering here. I'd appreciate any explanations and mentoring.
BATCH Model
In the batch model, is the Glue Crawler run at all? Is it part of the Glue ETL job?
(Why) does the Recommender System write to both the vector database and S3? I am confused about the direction of data flow. Is the Recommender doing the writing, or a Lambda function? Is the Recommender itself an EC2 instance or a Lambda function? At this point I am too new to deep-dive into Glue or Terraform to get the full picture.
STREAMING Model
3. Where is Kinesis getting its data from? We don't have any simulators like in Lab 2 with Apache Benchmark.
4. I don't understand the bidirectional data flow between the two Lambda functions here. What is the chronology?
I don't think I would be able to answer most of your questions. However, it is good to analyze all that information thrown at you in the first course. The idea is to learn by doing and get hands-on practice, which the book Fundamentals of Data Engineering alone cannot give you.
For the batch process, you will learn to create and use a Glue Crawler later in Course 3 with data lakes. Its job is to scan the data and extract the schema, data types, and other relevant metadata. The Recommender is the model saved in S3: it takes data from the S3 data lake and sends data to the streaming process via the S3 artifacts.
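If it helps to make the crawler less abstract, here is a minimal sketch of what defining one looks like in code. The crawler name, IAM role ARN, S3 path, and database name below are hypothetical placeholders, not values from the course labs; the actual labs use Terraform rather than hand-written boto3 calls.

```python
import json

# Hypothetical crawler configuration -- all names, the role ARN, and the
# S3 path are illustrative placeholders, not values from the course.
def build_crawler_config(name, role_arn, s3_path, database):
    """Build the request body for glue.create_crawler()."""
    return {
        "Name": name,
        "Role": role_arn,               # IAM role the crawler assumes
        "DatabaseName": database,       # Data Catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

cfg = build_crawler_config(
    "ratings-crawler",
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "s3://my-data-lake/ratings/",
    "recsys_db",
)
print(json.dumps(cfg, indent=2))

# With AWS credentials configured, you would then run something like:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**cfg)            # registers the crawler
#   glue.start_crawler(Name=cfg["Name"])  # scans S3, infers table schemas
```

The key idea the sketch shows: the crawler itself holds no data; it just points at an S3 location and writes the schemas it infers into a Data Catalog database that Glue ETL jobs and Athena can then query.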
For the streaming process, you will also learn about Kinesis in Course 2 with streaming ingestion. You will create data streams and run scripts in the terminal to send data between producers and consumers, and finally save the output in the S3 recommendations bucket.
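To answer the "where does Kinesis get data from" question concretely: a producer script builds small JSON records and pushes them into the stream. Here is a rough sketch of one such record; the stream name and event fields are assumptions for illustration, not the exact schema used in the labs.

```python
import json
import time

# Hypothetical event producer -- the event fields and stream name are
# illustrative, not the exact payload schema from the course labs.
def make_record(user_id, item_id):
    """Build one Kinesis record: a JSON payload plus a partition key."""
    payload = {
        "user_id": user_id,
        "item_id": item_id,
        "event": "click",
        "ts": int(time.time()),
    }
    return {
        "Data": json.dumps(payload).encode("utf-8"),
        "PartitionKey": str(user_id),  # determines which shard gets the record
    }

record = make_record(user_id=42, item_id="sku-123")
print(record["PartitionKey"])

# With AWS credentials configured, the producer side would be roughly:
#   import boto3
#   kinesis = boto3.client("kinesis")
#   kinesis.put_record(StreamName="recs-events", **record)
# A consumer (often a Lambda triggered by the stream) then reads these
# records, decodes the JSON, and writes results to S3.
```

So the stream's data source is simply whatever process calls `put_record`; in the labs that role is played by a script you run in the terminal.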