C3W2-Quiz discussion item for snapshots

evaldasw · January 2, 2025, 7:20pm

Module # Module-2
Link to the classroom item you are referring to: https://www.coursera.org/learn/data-storage-and-queries/assignment-submission/SDPOt/week-2-quiz
Description

I'm not sure I agree with the expected answer to the following question. In general, I don't think snapshots are a great idea for large datasets: does anyone really want several copies of terabyte-sized tables?

Given that Parquet supports schema evolution, and readers can automatically merge schemas as long as the files are backward compatible, I don't see the point of having snapshots.
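To make the schema-merging point concrete, here is a minimal sketch in plain Python. It is a toy model only, not the Parquet or Spark API: `merge_schemas` and `read_with_schema` are hypothetical helper names, and schemas are represented as simple dicts. It assumes "backward compatible" means that newer files only add columns and never change the type of an existing one.

```python
from typing import Any

# Toy model (assumption): a schema is a mapping of column name -> type name.
Schema = dict[str, str]
Row = dict[str, Any]


def merge_schemas(schemas: list[Schema]) -> Schema:
    """Union the columns of several file schemas, rejecting type conflicts."""
    merged: Schema = {}
    for schema in schemas:
        for column, dtype in schema.items():
            if column in merged and merged[column] != dtype:
                raise ValueError(f"incompatible types for column {column!r}")
            merged.setdefault(column, dtype)
    return merged


def read_with_schema(rows: list[Row], schema: Schema) -> list[Row]:
    """Project rows onto the merged schema, filling missing columns with None."""
    return [{column: row.get(column) for column in schema} for row in rows]


# Two hypothetical "files": the newer one adds a column (a compatible change).
old_schema = {"id": "int64", "name": "string"}
new_schema = {"id": "int64", "name": "string", "country": "string"}
old_rows = [{"id": 1, "name": "a"}]
new_rows = [{"id": 2, "name": "b", "country": "LT"}]

merged_schema = merge_schemas([old_schema, new_schema])
table = read_with_schema(old_rows + new_rows, merged_schema)
```

Rows from the older file simply get `None` for the added column. Note that this only covers reading files with differing schemas; it does not by itself let you read the table as it was at an earlier point in time.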

Maybe I’m missing something here; it would be great to hear other opinions.

Georgios · January 15, 2025, 6:01pm

Hello @evaldasw,
I think in practice you might not use snapshots for those reasons. Cost is one reason to avoid them, while simplicity is a reason to consider them. However, the course presents snapshots as a way to reflect the state of the stored data. I'm not sure whether you find that useful, but it helps in understanding where to use such logical abstractions. Hope it helps.
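The idea of a snapshot as "the state of the stored data" can be sketched as follows. This is a toy model under a stated assumption: snapshots are tracked as metadata that references immutable data files, which is the general approach of table formats such as Apache Iceberg and Delta Lake. The `Snapshot` and `Table` classes and the file names are hypothetical and do not correspond to either library's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: frozenset[str]  # references to immutable files, not copies


@dataclass
class Table:
    snapshots: list[Snapshot] = field(default_factory=list)

    def commit(self, added: set[str], removed: frozenset[str] = frozenset()) -> Snapshot:
        """Record a new table state; unchanged files are shared, not rewritten."""
        current = self.snapshots[-1].data_files if self.snapshots else frozenset()
        snap = Snapshot(len(self.snapshots) + 1, (current - removed) | added)
        self.snapshots.append(snap)
        return snap

    def files_as_of(self, snapshot_id: int) -> frozenset[str]:
        """Return the set of files that made up the table at a given snapshot."""
        return next(s.data_files for s in self.snapshots if s.snapshot_id == snapshot_id)


table = Table()
s1 = table.commit({"part-0001.parquet", "part-0002.parquet"})
s2 = table.commit({"part-0003.parquet"}, removed=frozenset({"part-0001.parquet"}))

# Files referenced by both snapshots exist only once in storage.
shared = s1.data_files & s2.data_files
```

In this model a new snapshot costs only the files that actually changed plus a little metadata, so keeping several snapshots is not the same as keeping several full copies of the table. Whether that trade-off is worth the extra complexity still depends on the workload.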

evaldasw · January 15, 2025, 6:36pm

Hi @Georgios,

Thanks for your feedback. Yes, using data versioning makes things slightly more complicated, especially when it also involves schema changes.

Personally, I have been using the Databricks Delta Lake library for this, but only so that multiple jobs can read from and write to the same source. However, as you say, it does incur additional complexity and performance cost, so I use it only for the data sources that really need it. It also seems slightly easier to use than Iceberg, at least from a user's perspective, but I’m still not familiar enough with how Iceberg works to say for sure.
