There is a SQL query in the Data Preparation lab of the short course which `JOIN`s two large tables and filters the data using the `WHERE` clause, before the first 10,000 records are selected for further processing using the `LIMIT` clause.
I am not sure whether I am allowed to copy and paste the SQL code from the notebook here, but it is under the "Joining Tables and Query Optimisation" heading.
I think there is a slight issue with that SQL query: in the absence of an `ORDER BY` clause, the set of rows returned with `LIMIT` is unspecified. This is documented in the BigQuery docs.
Given that the instructor has mentioned the importance of data lineage, I think getting a potentially different subset of the data every time the query is run is not best practice.
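For illustration only (this is not the code from the notebook; the table and column names below are made up), the query in question follows roughly this pattern:

```sql
-- Hypothetical sketch only: project, dataset, table, and column names
-- are placeholders, not the actual lab code.
SELECT
  t.transaction_id,
  c.customer_name,
  t.amount
FROM `my_project.my_dataset.transactions` AS t
JOIN `my_project.my_dataset.customers` AS c
  ON t.customer_id = c.customer_id
WHERE t.amount > 100
LIMIT 10000;  -- without ORDER BY, which 10,000 rows come back is not guaranteed
```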
I suggest adding an `ORDER BY` clause to the query so that, together with the filters in the `WHERE` clause and the `LIMIT` clause, the subset of data returned by the query is always well-defined.
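Assuming a query of that general shape, the fix could look something like the sketch below, ordering by a column (or columns) that uniquely identifies each row:

```sql
-- Same hypothetical query, with a deterministic ordering added.
SELECT
  t.transaction_id,
  c.customer_name,
  t.amount
FROM `my_project.my_dataset.transactions` AS t
JOIN `my_project.my_dataset.customers` AS c
  ON t.customer_id = c.customer_id
WHERE t.amount > 100
ORDER BY t.transaction_id  -- ordering by a unique key makes the LIMITed subset reproducible
LIMIT 10000;
```

If the ordering column is not unique, ties between rows could still make the result non-deterministic, so ordering by a unique key (or adding a tiebreaker column) would be the safest option.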