Why use S3 instead of RDS?

Hi, in the W2 lab exercise, why is S3 used for the star database (the database created after the ETL) instead of a relational (RDS) database? What are the differences, and why is S3 preferable in this case?

Dear @trhnam,
Thanks for posting your question and welcome to the team!

Please find my comments:

1. Since this is the first course, the idea behind the lab is to show you the general workflow of an entire data engineering pipeline, i.e., ingestion -> ETL -> serving. That is what you will see in this diagram:

There you will see that we are already using a database as the source of the data. Consequently, after the ETL phase it makes more sense to show students the outcome in storage that is closer to a data lake system (in this case S3), which can then be queried using Amazon Athena.
2. By being exposed to more systems, RDS (MySQL/PostgreSQL) and S3 with Athena, you learn more of the technologies that you would actually meet in a production environment.
3. By using S3 you learn that it is well suited to storing many different types of data: files (video, Parquet, CSV, audio) and database tables (queried via Athena).
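
To make the "S3 queried via Athena" idea a bit more concrete, here is a minimal sketch of how such a query could be submitted from Python with boto3. The table name, database name, and bucket are hypothetical placeholders, not the names used in the lab, and running the query itself assumes you have AWS credentials and boto3 installed.

```python
def build_athena_request(sql, database, output_location):
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        # The Athena (Glue catalog) database that holds the table definitions.
        "QueryExecutionContext": {"Database": database},
        # Athena writes the query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query to Athena and return its execution ID."""
    # Imported here so the request can be built without boto3 or AWS access.
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
request = build_athena_request(
    sql="SELECT * FROM fact_orders LIMIT 10",
    database="star_schema_db",
    output_location="s3://example-bucket/athena-results/",
)
print(request)
```

Calling `run_query(request)` would start the query. The point of the sketch is that the data stays as files in S3 and Athena runs SQL over those files in place, whereas with RDS you would connect to a running database server.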

I hope this makes it clear for you.
Later in the course you will learn how the transformed SQL data (in a star schema) is loaded into another database (MySQL/PostgreSQL), but in this introductory lab the idea is to expose students to more of the technologies used in a production environment.