Shouldn't LIMIT be used with ORDER BY in Data Preparation lab?

nocturnalgopher · January 21, 2024, 2:56am

The Data Preparation lab of the short course contains a SQL query that JOINs two large tables and filters the data with a WHERE clause, then selects the first 10000 records for further processing with a LIMIT clause.

I am not sure whether I am allowed to copy and paste the SQL code from the notebook here, but the code is under the Joining Tables and Query Optimisation heading.

I think there is a slight issue with that SQL query: in the absence of an ORDER BY clause, the rows returned by LIMIT are unspecified. This is documented in the BigQuery docs.

Given that the instructor has mentioned the importance of data lineage, I think getting a random subset of the data every time the query is run is not best practice.

I suggest adding an ORDER BY clause to the query so that, together with the filters in the WHERE clause and the LIMIT clause, the subset of data returned by the query is always well-defined.
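To illustrate the suggestion without copying the lab's query, here is a minimal sketch that uses Python's built-in sqlite3 module rather than BigQuery. The table `events`, its columns, and the helper `first_n_ordered` are hypothetical names made up for this example. The principle should carry over: sorting on a unique key before applying LIMIT pins down which rows come back.

```python
import sqlite3


def first_n_ordered(conn, n):
    """Return the n filtered rows with the smallest ids.

    ORDER BY on a unique column makes the LIMIT subset well-defined.
    """
    query = (
        "SELECT id, label FROM events "
        "WHERE label != 'skip' "
        "ORDER BY id "
        "LIMIT ?"
    )
    return conn.execute(query, (n,)).fetchall()


# Hypothetical toy data, inserted in a deliberately unsorted order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, label TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(5, "a"), (2, "b"), (9, "skip"), (1, "c"), (7, "d")],
)

subset = first_n_ordered(conn, 3)
print(subset)  # [(1, 'c'), (2, 'b'), (5, 'a')]
```

If the sort column contains ties, the rows chosen among the tied values are still unspecified, so the ORDER BY would need to include enough columns to be unique.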

Mubsi · January 23, 2024, 8:11am

Hi @nocturnalgopher,

Good catch. I shall get back to you on that.

Thanks,
Mubsi

Mubsi · January 26, 2024, 6:22am

Hi @nocturnalgopher,

The idea is to get randomly sampled data every time, which is why you don’t see the ORDER BY clause being used.

This is important because it is not ideal to always work on the same data. That would be like “hard-coding” your problem: you would only be applying and improving your techniques on one particular set of data, which might bring good results on that set. But if you change your data, you may end up getting worse results, as your entire design was built around a particular set to begin with. This is why random sampling is important.
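For readers who want to see what asking for a fresh random sample on every run can look like in SQL, here is a minimal sketch. It uses Python's built-in sqlite3 module and SQLite's RANDOM() function; the BigQuery counterpart would be RAND(). The table `readings` and the helper `random_sample` are hypothetical names for illustration only, and this is not the lab's query.

```python
import sqlite3


def random_sample(conn, n):
    """Return n ids chosen at random; the selection can differ between calls."""
    query = "SELECT id FROM readings ORDER BY RANDOM() LIMIT ?"
    return [row[0] for row in conn.execute(query, (n,))]


# Hypothetical toy table with ids 0..99.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO readings VALUES (?)", [(i,) for i in range(100)])

sample_a = random_sample(conn, 10)
sample_b = random_sample(conn, 10)
print(sample_a)
print(sample_b)
```

Each call shuffles the rows before LIMIT is applied, so the two samples will usually differ. Sorting a very large table by a random value can be expensive, so this is a sketch of the idea, not a tuned approach.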

Best,
Mubsi
