There is a SQL query in the Data Preparation lab of the short course that JOINs two large tables and filters the rows with a WHERE clause, before the first 10000 records are selected for further processing with a LIMIT clause.
I am not sure whether I am allowed to copy and paste the SQL code from the notebook here, but it appears under the "Joining Tables and Query Optimisation" heading.
I think there is a slight issue with that query: in the absence of an ORDER BY clause, the set of rows returned by LIMIT is unspecified. This is documented in the BigQuery docs.
Given that the instructor has emphasised the importance of data lineage, getting a potentially different subset of the data every time the query is run does not seem like best practice.
I suggest adding an ORDER BY clause to the query so that, together with the filters in the WHERE clause and the LIMIT clause, the subset of data returned by the query is always well-defined.
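For illustration, a query of roughly this shape (table and column names are my own placeholders, not the ones in the notebook) would look like:

```sql
-- Sketch only: project, dataset, table, and column names below are
-- hypothetical stand-ins, not those used in the lab notebook.
SELECT
  t1.id,
  t1.event_ts,
  t2.label
FROM `project.dataset.table_a` AS t1
JOIN `project.dataset.table_b` AS t2
  ON t1.id = t2.id
WHERE t1.event_ts >= '2023-01-01'
ORDER BY t1.id   -- makes the subset selected by LIMIT deterministic
LIMIT 10000;
```

Note that ordering by a unique key (or a combination of columns that is unique) is what guarantees a stable subset; ordering by a non-unique column alone still leaves the order of tied rows, and hence the rows at the LIMIT boundary, unspecified.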