C3W2-Quiz discussion item for snapshots

I’m not sure I agree with the expected answer to the following question. In general, I don’t think snapshots are a great idea for large datasets. Does anyone really want to keep several copies of terabyte-sized tables?
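
To put rough numbers on that concern, here is a minimal Python sketch. The table size, snapshot count, and daily change rate are made-up assumptions, and `full_copy_storage_tb` and `incremental_storage_tb` are hypothetical helpers for illustration, not anything from the course or a library.

```python
def full_copy_storage_tb(table_tb: float, snapshots: int) -> float:
    """Storage needed if every snapshot is a complete copy of the table."""
    return table_tb * snapshots


def incremental_storage_tb(table_tb: float, snapshots: int, change_rate: float) -> float:
    """Storage needed if the first snapshot is full and each later one
    stores only the fraction of data that changed (change_rate)."""
    if snapshots == 0:
        return 0.0
    return table_tb + table_tb * change_rate * (snapshots - 1)


# Assumed figures: a 2 TB table, 30 daily snapshots, 1% of the data rewritten per day.
print(full_copy_storage_tb(2.0, 30))
print(incremental_storage_tb(2.0, 30, 0.01))
```

Under these assumed figures, full copies need 60 TB while the incremental variant needs roughly 2.6 TB, so the cost depends heavily on how the snapshots are stored.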

Given that Parquet files carry their own schema, and that readers can merge schemas across files automatically as long as the files are backward compatible, I don’t see the point of having snapshots.
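
The schema-merging idea can be sketched in plain Python. This is a toy model under my own assumptions (schemas as dicts of column name to type name, and the hypothetical helpers `merge_schemas` and `read_with_schema`), not how Parquet or any particular reader implements it.

```python
from typing import Dict, List, Optional

Schema = Dict[str, str]             # column name -> type name
Row = Dict[str, Optional[object]]   # column name -> value


def merge_schemas(schemas: List[Schema]) -> Schema:
    """Union the columns of all schemas; reject a column whose type changed."""
    merged: Schema = {}
    for schema in schemas:
        for column, dtype in schema.items():
            if column in merged and merged[column] != dtype:
                raise ValueError(f"incompatible types for column {column!r}")
            merged[column] = dtype
    return merged


def read_with_schema(rows: List[Row], schema: Schema) -> List[Row]:
    """Project every row onto the merged schema, filling missing columns with None."""
    return [{column: row.get(column) for column in schema} for row in rows]


# Two hypothetical files: the newer one adds an "email" column.
old_schema = {"id": "int64", "name": "string"}
new_schema = {"id": "int64", "name": "string", "email": "string"}
old_rows = [{"id": 1, "name": "a"}]
new_rows = [{"id": 2, "name": "b", "email": "b@example.com"}]

merged = merge_schemas([old_schema, new_schema])
rows = read_with_schema(old_rows + new_rows, merged)
print(merged)
print(rows)
```

Rows from the older file simply read back with `None` in the column that was added later, while a changed column type makes the merge fail.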

Maybe I’m missing something here. It would be great to hear other opinions.

Hello @evaldasw,
I think in practice you might not use snapshots, for those reasons. Cost is one reason to avoid them, but simplicity is a different reason to consider them. However, the course presents snapshots as a way to reflect the state of the stored data. I’m not sure if you will find it useful, but it helps in understanding where to use logical abstractions. Hope it helps.

Hi @Georgios,

Thanks for your feedback. Yes, using data versioning makes things slightly more complicated, especially when it also involves schema changes.

Personally, I have been using the Databricks Delta Lake library for this, but the only reason is to be able to write to and read from the same source with multiple jobs. However, as you say, it does incur additional complexity and a performance cost, so I use it only for the data sources that really need it. It also seems to be slightly easier to use than Iceberg, at least from a user’s perspective, but I’m not yet familiar enough with how Iceberg works to say for sure.
