iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
🪒

Sampling Dates in BigQuery with GENERATE_DATE_ARRAY

に公開

I wanted to look at data over a long period, such as over a year, for tables partitioned by date in BigQuery. However, retrieving the entire period results in a high data processing volume, which is problematic, so I came up with a solution.

Sampling data using GENERATE_DATE_ARRAY

There is a function called GENERATE_DATE_ARRAY that generates an array of dates for a specified period.

The third argument allows you to specify the interval. For example, it looks like this:

SELECT GENERATE_DATE_ARRAY(
  DATE_SUB(CURRENT_DATE('Asia/Tokyo'), INTERVAL 7 day),  -- Start
  CURRENT_DATE('Asia/Tokyo') -- End
, INTERVAL 2 day) -- Date interval
// output
2024-03-15
2024-03-17
2024-03-19
2024-03-21

By sampling dates over a certain period this way, you can reduce the data processing volume.

When using it in an actual query, it looks like the following. Since GENERATE_DATE_ARRAY returns an array, you specify this in the IN clause for the partitioned field.

SELECT *
FROM sourced_data
WHERE DATE(_PARTITIONDATE) IN UNNEST(
    GENERATE_DATE_ARRAY(DATE_SUB(CURRENT_DATE('Asia/Tokyo'), INTERVAL 1 year), CURRENT_DATE('Asia/Tokyo'), INTERVAL 5 day) 
)

With INTERVAL 5 day, it retrieves data once every 5 days, so the processing volume is reduced to 1/5. If you want to adjust the volume, you can just change this value.
Conversely, you could adjust it to about 4/5 by using WHERE DATE(_PARTITIONDATE) NOT IN UNNEST(....

GitHubで編集を提案

Discussion