First row for each group
How to get first visit row for each user and resource?
In the query from your question, you should remove DESC
from ORDER BY created_at DESC;
otherwise it returns the last visit, not the first.
What is the best way to construct such query?
Another option is to use ROW_NUMBER(), as below:
SELECT
user_id,
endpoint_id,
created_at
FROM (
SELECT
user_id,
endpoint_id,
created_at,
ROW_NUMBER() OVER(PARTITION BY user_id, endpoint_id ORDER BY created_at) AS first_created
FROM [visits]
)
WHERE first_created = 1
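To try the ROW_NUMBER() approach on a handful of rows without touching BigQuery, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. The sample rows and the first_visits helper are made up for illustration, the table is written as visits rather than [visits], and SQLite 3.25 or newer is assumed for window-function support.

```python
import sqlite3

# Hypothetical sample data: (user_id, endpoint_id, created_at)
SAMPLE_VISITS = [
    (1, 10, "2016-01-03"),
    (1, 10, "2016-01-01"),
    (1, 20, "2016-01-02"),
    (2, 10, "2016-01-05"),
    (2, 10, "2016-01-04"),
]

# Same shape as the query above, adapted to SQLite syntax.
FIRST_VISIT_SQL = """
SELECT user_id, endpoint_id, created_at
FROM (
    SELECT user_id, endpoint_id, created_at,
           ROW_NUMBER() OVER (
               PARTITION BY user_id, endpoint_id
               ORDER BY created_at
           ) AS first_created
    FROM visits
) AS ranked
WHERE first_created = 1
ORDER BY user_id, endpoint_id
"""


def first_visits(rows):
    """Load rows into an in-memory table and return the first visit per group."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE visits (user_id INTEGER, endpoint_id INTEGER, created_at TEXT)"
        )
        conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
        return conn.execute(FIRST_VISIT_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in first_visits(SAMPLE_VISITS):
        print(row)
```

Because the inner ORDER BY is ascending, row number 1 is the earliest created_at in each (user_id, endpoint_id) partition; adding DESC would select the latest instead.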
... but this query will not work for big amount of data
This really depends. A Resources Exceeded error
can happen if the size of a user_id, endpoint_id
partition is big enough, because ORDER BY requires all rows of a partition to be on the same node.
If this is the case for you, you can use the trick below.
Step 1 - using JOIN
SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at
FROM [visits] AS tab1
INNER JOIN (
SELECT user_id, endpoint_id, MIN(created_at) AS min_time
FROM [visits]
GROUP BY user_id, endpoint_id
) AS tab2
ON tab1.user_id = tab2.user_id
AND tab1.endpoint_id = tab2.endpoint_id
AND tab1.created_at = tab2.min_time
Step 2 - There is still something else to take care of here: duplicate entries for the same user / resource. In that case the JOIN returns several rows sharing the same minimum created_at, so you still need to extract only one row for each partition. See the final query below:
SELECT user_id, endpoint_id, created_at
FROM (
SELECT user_id, endpoint_id, created_at,
ROW_NUMBER() OVER (PARTITION BY user_id, endpoint_id) AS rn
FROM (
SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at
FROM [visits] AS tab1
INNER JOIN (
SELECT user_id, endpoint_id, MIN(created_at) AS min_time
FROM [visits]
GROUP BY user_id, endpoint_id
) AS tab2
ON tab1.user_id = tab2.user_id
AND tab1.endpoint_id = tab2.endpoint_id
AND tab1.created_at = tab2.min_time
)
)
WHERE rn = 1
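The difference between Step 1 and Step 2 is easiest to see when the earliest visit is duplicated. The following is a rough sketch using Python's sqlite3 module in place of BigQuery; the sample rows and the run_both_steps helper are hypothetical, the table is written as visits rather than [visits], and SQLite 3.25 or newer is assumed for ROW_NUMBER().

```python
import sqlite3

# Hypothetical sample data: (user_id, endpoint_id, created_at)
SAMPLE_VISITS = [
    (1, 10, "2016-01-01"),
    (1, 10, "2016-01-01"),  # exact duplicate of the first visit
    (1, 10, "2016-01-03"),
    (2, 20, "2016-01-05"),
]

# Step 1: join each row to the minimum created_at of its group.
STEP1_SQL = """
SELECT tab1.user_id AS user_id,
       tab1.endpoint_id AS endpoint_id,
       tab1.created_at AS created_at
FROM visits AS tab1
INNER JOIN (
    SELECT user_id, endpoint_id, MIN(created_at) AS min_time
    FROM visits
    GROUP BY user_id, endpoint_id
) AS tab2
ON tab1.user_id = tab2.user_id
AND tab1.endpoint_id = tab2.endpoint_id
AND tab1.created_at = tab2.min_time
"""

# Step 2: number the surviving rows per group and keep only one of them.
STEP2_SQL = f"""
SELECT user_id, endpoint_id, created_at
FROM (
    SELECT user_id, endpoint_id, created_at,
           ROW_NUMBER() OVER (PARTITION BY user_id, endpoint_id) AS rn
    FROM ({STEP1_SQL}) AS joined
) AS numbered
WHERE rn = 1
"""


def run_both_steps(rows):
    """Return (step1_rows, step2_rows), each sorted for stable comparison."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE visits (user_id INTEGER, endpoint_id INTEGER, created_at TEXT)"
        )
        conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
        step1 = sorted(conn.execute(STEP1_SQL).fetchall())
        step2 = sorted(conn.execute(STEP2_SQL).fetchall())
        return step1, step2
    finally:
        conn.close()


if __name__ == "__main__":
    step1, step2 = run_both_steps(SAMPLE_VISITS)
    print("step 1:", step1)
    print("step 2:", step2)
```

With this data Step 1 still returns the duplicated first visit twice, while Step 2 collapses it to a single row per user / resource. The window here has no ORDER BY, so each partition holds only the rows that already share the minimum created_at.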
And of course there is the obvious and simplest case, if those three fields are the ONLY fields in the [visits] table:
SELECT user_id, endpoint_id, MIN(created_at) AS created_at
FROM [visits]
GROUP BY user_id, endpoint_id
I have another solution that avoids the use of window functions (which I believe are very slow in BigQuery), as well as subqueries (which add complexity):
select
group_column
,array_agg(t order by time_column asc limit 1)[safe_offset(0)] AS first_row
from table AS t
group by 1
array_agg returns an array containing a single struct: the first row of each group. The struct is extracted from the array with [safe_offset(0)]. You can then read individual fields from the struct with first_row.column_1, or wrap a select statement around the query to expand all of the struct's columns:
select first_row.* from (
select
group_column
,array_agg(t order by time_column asc limit 1)[safe_offset(0)] AS first_row
from table AS t
group by 1
)
You can now use QUALIFY (available in standard SQL, not legacy SQL)
for a more concise solution:
select
user_id,
endpoint_id,
created_at,
from `visits`
where true
qualify ROW_NUMBER() OVER(PARTITION BY user_id, endpoint_id ORDER BY created_at) = 1