iTranslated by AI
Sampling Dates in BigQuery with GENERATE_DATE_ARRAY
I wanted to look at data over a long period, such as over a year, for tables partitioned by date in BigQuery. However, retrieving the entire period results in a high data processing volume, which is problematic, so I came up with a solution.
Sampling data using GENERATE_DATE_ARRAY
There is a function called GENERATE_DATE_ARRAY that generates an array of dates for a specified period.
The third argument allows you to specify the interval. For example, it looks like this:
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE('Asia/Tokyo'), INTERVAL 7 day), -- Start
CURRENT_DATE('Asia/Tokyo') -- End
, INTERVAL 2 day) -- Date interval
// output
2024-03-15
2024-03-17
2024-03-19
2024-03-21
By sampling dates over a certain period this way, you can reduce the data processing volume.
When using it in an actual query, it looks like the following. Since GENERATE_DATE_ARRAY returns an array, you specify this in the IN clause for the partitioned field.
SELECT *
FROM sourced_data
WHERE DATE(_PARTITIONDATE) IN UNNEST(
GENERATE_DATE_ARRAY(DATE_SUB(CURRENT_DATE('Asia/Tokyo'), INTERVAL 1 year), CURRENT_DATE('Asia/Tokyo'), INTERVAL 5 day)
)
With INTERVAL 5 day, it retrieves data once every 5 days, so the processing volume is reduced to 1/5. If you want to adjust the volume, you can just change this value.
Conversely, you could adjust it to about 4/5 by using WHERE DATE(_PARTITIONDATE) NOT IN UNNEST(....
Discussion