I was watching the video Lab Walkthrough - Introduction to the Lab from week 2, where you get an introductory exercise with AWS Glue. The basic pipelines goes as follow:
- ingesting from RDS
- transforming existing normalized schema to star schema
- loading into S3 bucket (from which an analyst will query using Amazon Athena)
I am not sure why choosing S3+Athena over using another relational database after the transformation. Is there any technical reason for this?
Hi Marina.
Good catch. I hadn’t thought about it myself and I’m unsure about the answer. But if were to guess, I’d imagine it to be either a cost thing or that S3 + Athena essentially gives you a data lake architecture which would be handy at some point if the structure of the data was to evolve. Only a guess though. Perhaps someone from the team might see this and give us a sure answer.
For what I’ve been reading, S3+Athena offers
- lower costs (parquet files stored in S3 and Athena is pay per query) than RDS, especially for ad-hoc queries to investigate particular issues instead of ongoing reporting efforts.
- more complex initial setup from engineer perspective compared to RDS, which is just configuring the relational database (but not dramatic if you have a data engineer and compensates later for scalability)
- better scalability when having data in the TB or PB scale than RDS
There is also some discussion about better performance on massive datasets but I am not sure honestly. Apparently, this is a common strategy after ETL jobs.
If anyone from the course or from the audience has any extra insights, comments are very much welcomed 