Shouldn't LIMIT be used with ORDER BY in Data Preparation lab?

There is a SQL query in the Data Preparation lab of the short course which JOINs two large tables and filters the data using the WHERE clause, before the first 10000 records are selected for further processing using the LIMIT clause.

I am not sure if I am allowed to copy&paste the SQL code from the notebook here, but the code is under the Joining Tables and Query Optimisation heading.

I think there is a slight issue with that SQL query, namely in the absence of ORDER BY clause, the rows that are returned by LIMIT are unspecified. This is documented in BigQuery docs.

Given that the instructor has mentioned the importance of data lineage, I think getting a random subset of the data every time the query is run is not best practice.

I suggest that the ORDER BY clause is added to the query so that together with the filters provided by the WHERE clause and the LIMIT clause, the subset of data returned by the query is always well-defined.

2 Likes

Hi @nocturnalgopher,

Good catch. I shall get back to you on that.

Thanks,
Mubsi

1 Like

Hi @nocturnalgopher,

The idea is to get random sampled data every time, which is why you don’t see the order by clause being used.

This is important because, it is not ideal to work on consistent data. Then it would be like “hard-coding” your problem. You will only be working on, applying your techniques, improving on a particular set of data, which might bring in good results on it. But if you change your data, you may end up getting worse results as your entire design was made around a particular set to begin with. This is why random sampling is important.

Best,
Mubsi