Add One Line of SQL to Optimise Your BigQuery Tables | by Matt Chapman

Add One Line of SQL to Optimise Your BigQuery Tables | by Matt Chapman | Dec, 2023

Last updated: 2023/12/08 at 10:39 PM

Editor AI News

1 Min Read

Clustering: A simple way to group similar rows and prevent unnecessary data processing

In my previous article, I explained how to optimise SQL queries using partitioning:

Now, I’m writing the sequel! (Dad joke, anyone?)

This article will look at clustering: another powerful optimisation technique you can use in BigQuery. Like partitioning, clustering can help you write more performant queries that are quicker and cheaper to run. If you want to develop your SQL toolkit and build those higher-level Data Science skills, this is a great place to start.

In BigQuery, a clustered table is a table that keeps similar rows grouped together in physical “blocks”.

For example, picture a table called user_signups that keeps track of all the people registering an account on a fictitious website. It’s got four columns:

registration_date: the date on which the user created an account
country: the country where the user is based
tier: the user’s plan (“Free” or “Paid”)
username: the user’s username

If we wanted, we could cluster the table by country so that users from the same country are stored nearby each other in the table: