How to Build Your GA4 Real-Time Report with Looker Studio & BigQuery

Last Updated: March 27, 2024



Introduction

You want to view your GA4 data in real time, but you’re having a hard time making up from down with the current GA4 Realtime report. You’ve come to the right place!

We’re going to explore everything necessary to have a beautiful real time Looker Studio report using the GA4 export data inside of BigQuery.

GA4 has a report that will show you “real time” data, but it’s scoped to the last 30 minutes and has a number of fatal flaws that make it less than useful. It basically serves as a glorified hit counter and does nothing to tie in performance or session-based data.

We’re going to solve that!

Is Real-Time GA4 Data For You?

This is definitely a question that we’re still asking ourselves. If you know that you absolutely need and want to be able to view your site data in real-time, definitely let us know by sending an email to hello at visionlabs.com

Building the report we’re going to go over in this guide isn’t free.

You will have some very small costs just from turning on the “real-time” (streaming) connection between GA4 and BigQuery, as well as from storing your BigQuery data and running the queries. Everything is pretty small, except for the queries and the time to build the report.

Depending on the usage of the report, the query cost could grow pretty quickly. More on that later.

We’ve identified a few use cases where it does make sense to invest the time and resources for a real-time GA4 report using BigQuery:

  1. You are a CEO who wants to have a real-time view of what’s happening generally in your store. You want to see the transactions that are coming in, what products they’re buying, and where they came from.
  2. You are a CRO monitoring website behaviors to validate that your tests are either working as expected or not. If you shipped a new product action block, then you’ll quickly be able to see if add_to_cart events suddenly drop off, for example.
  3. You are a media buyer analyzing whether new landing pages and/or campaigns are converting as expected, especially compared to past performance and the expectations the ad is setting.

If anything were to go wrong with your website and a few hours of the problem going undetected is a huge problem, then a real-time GA4 report is just the thing you need to stay frosty!

What’s Wrong With The Built In GA4 Reports?

Oh boy oh boy! I’m going to tell you!

The obvious problem is that the standard reports and explorations take 24-48 hours to populate. Google uses this time to process the data in order to display it properly. What all happens is a bit of a mystery, but we know that in some cases they will model and predict the data.

Under the Reports tab there is a page at the top titled “Realtime” which shows some activity that has happened over the last 30 minutes. This report shows:

  • How many users have visited in the last 30 minutes 
  • The first user source/medium/campaign
  • Pages viewed
  • Events that have occurred
  • Conversions that have been triggered
  • And active user properties

What none of these cards show is anything related to a session (session source/medium/campaign/landing page) or actual performance tied to those session-based dimensions.

Additionally, even once the data is processed there are some big drawbacks to the reports and explorations that are currently offered:

  1. Landing Pages
    • The dimension for landing pages does not include the hostname. The hostname is part of the URL before the page path.
    • Imagine your website is on mysite.com, you have a blog on blog.mysite.com, and a store on store.mysite.com. If someone visited the “homepage” of any of these sites, all you would see is “/” in reports, and it wouldn’t be immediately apparent which homepage was actually viewed.
  2. Page Location
    • When looking at page locations, URL query parameters are not combined into a single row. In Universal Analytics we would set up a filter to strip out the URL parameters in a view so that different UTM parameters would all be counted under that same page. That’s not the case in GA4.
  3. Scope
    • There are a few other oddities when it comes to the “scope” of data and trying to analyze across the different scopes (for example, session-based dimensions and item-based dimensions).

How Does BigQuery Solve The Problem?

Linking your GA4 property to BigQuery gives you a database full of the raw event-level data of your website. Using SQL it’s possible to isolate the exact dimensions and metrics that are valuable to the business, sessionize events, clean up all the issues with page location and landing page, and even build our own data models for any other application.

What we’re going to explore is how to use the “intraday” streaming data (that’s the real-time data stored in BigQuery) to build a Looker Studio Report.

How Much Does This Cost To Use?

There are 3 main costs that you’re going to incur while building this type of report. The first is the cost of your time and energy to build the queries, and plug those into a Looker Studio report. The second is the cost of streaming data from GA4 to BigQuery. The third is the cost of running the queries from Looker Studio. In other words, actually using and interacting with the report. 

Let’s break down the costs in more detail: 

  1. The cost of building the report
    • Depending on your proficiency with BigQuery and GA4, this could take upwards of 10 hours to put all of the pieces together. Multiply that by your hourly rate and that’s the labor cost involved to make this happen.
  2. The cost of streaming data
    • This is probably the smallest cost involved, but it needs to be mentioned. Streaming is billed at around $0.10 per gigabyte of data. Google estimates that 1 gigabyte of data is around 600,000 events. The daily export limit is 1,000,000 events, so most properties shouldn’t exceed about $0.20 per day in streaming costs.
  3. The cost of using the report
    • The queries in use are pretty heavy in computation because they use lots of analytic functions and unnesting. Running the query once or even a few times is no problem. However, every interactive component re-runs the query, so if you have 10+ people viewing and filtering the report, this cost is going to jump up. You’re going to want to closely monitor the cost and usage of the report to prevent any unexpected bills from Google Cloud.

Keep in mind that you get 1 TB of query processing free every month, and each TB after that is billed at $5, so it is still relatively inexpensive to run. We’ve also provided a query usage monitor for you to keep track of how your usage is trending.
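As a starting point for that kind of monitoring, a query along these lines reads BigQuery’s INFORMATION_SCHEMA jobs view and sums the bytes billed per day over the last week. This is a sketch: the `region-us` qualifier is an assumption, so use the region your jobs actually run in.

```sql
-- Sketch: daily query bytes billed over the last 7 days.
-- Adjust `region-us` to the region where your jobs run.
SELECT
  DATE(creation_time) AS job_date,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY job_date
ORDER BY job_date DESC;
```

If the daily TiB figure is trending toward your free 1 TB, that’s your cue to cache the report data or limit refreshes.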

Building The Session Summary Query

Let’s now take a look at how to build a query that can read the streaming, real-time events coming in from GA4 to BigQuery. This assumes that you already have this connection made between your GA4 property and Google Cloud.

Goals for the query

Here are the main objectives that we’ll be aiming for as we build out this query:

  1. Properly handle sessions and session IDs
  2. Find the first value for our given dimensions in the session
  3. Bring back the hostname and clean up URL parameters from our page location and landing page
  4. Sessionize events and conversions

Step 1. Build An Unnested Events Table

In step 1 we’re going to use a Common Table Expression (CTE) to build a flat table containing all of the fields we want to use in our report.

Later we’re going to use an analytic function to extract the “first value” of these fields as dimensions. So, as step 1, we first build a flat table containing all the fields we want; then, in step 2, we can find the first value for each field.

Extract Session ID

This is one of the most essential parts that make this whole thing work properly. If we want to see session data (Landing Page, Source, Medium, Campaign, Content, Term, Country, and Device), then we need to know which ID we’re going to group on to unify all the events belonging to the same session.

In BigQuery, the session ID is part of the event_params on every event. The ID represents a timestamp for when the visitor’s session started. We can access the session ID for every event by doing a scalar subquery on the nested event_params field.

The query looks like this:

(SELECT value.int_value FROM unnest(event_params) WHERE key = "ga_session_id") as session_id

There’s one problem with this… what if 2 people land on the website at the exact same time? To account for this (rare) occasion, we’re going to concatenate the session ID and the user pseudo ID together to form a more perfect ID that represents a session. The revised query looks like this:

CONCAT((select value.int_value from unnest(event_params) where key = "ga_session_id"),"_",user_pseudo_id) as session_id

Format The Event Timestamp For Your Timezone

This query is going to be responsible for grouping data at the session level therefore we need to know the time that the session started in order to group all events in the session properly.

The timestamp is in microseconds, so we can use the TIMESTAMP_MICROS() function to convert it to a timestamp and then format it to our desired format.

To find the date an event was triggered we can use the following:

FORMAT_TIMESTAMP('%Y-%m-%d', TIMESTAMP_MICROS(event_timestamp), 'America/New_York') AS event_date_formatted
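For a real-time report you’ll often want finer granularity than a day. The same function can bucket events by minute; the `event_minute` alias below is our own:

```sql
-- Minute-level bucket for real-time charts; the timezone shown is an example
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M', TIMESTAMP_MICROS(event_timestamp), 'America/New_York') AS event_minute
```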

Extract Session Parameters Using Scalar Subquery

Most of the parameters that we want to “sessionize” exist within the event_params field. We’ll use a scalar subquery (basically a mini query within a query) to extract the exact value out into the event level instead of being nested within the event_params.

The scalar subquery to get the “source” dimension will look like this:

ifnull((select value.string_value from unnest(event_params) where key = "source" ), "direct") as source

This should look very similar to how we extracted the session ID. This is because both the traffic information and session ID that we’re looking for exists as parameters inside of event_params.

We also are handling what to do if the value for the parameter is null.

Now repeat that process for all the fields in event_params that you want to turn into “sessionized” dimensions:

  • Source
  • Medium
  • Campaign
  • Content
  • Term
  • Page location

Country and device category aren’t event parameters; they live in the top-level geo and device records of the export schema, so select them directly as geo.country and device.category.
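Putting all of the extractions together, the step 1 CTE might look like the sketch below. The table path is illustrative; swap in your own project, dataset, and events_intraday_* table. Note that country and device category come from the top-level geo and device records of the export schema rather than event_params:

```sql
WITH unnested_events AS (
  SELECT
    -- the "more perfect" session ID: ga_session_id + user_pseudo_id
    CONCAT((SELECT value.int_value FROM UNNEST(event_params) WHERE key = "ga_session_id"), "_", user_pseudo_id) AS session_id,
    event_timestamp,
    event_name,
    ecommerce.transaction_id AS transaction_id,
    FORMAT_TIMESTAMP('%Y-%m-%d', TIMESTAMP_MICROS(event_timestamp), 'America/New_York') AS sessionDate,
    IFNULL((SELECT value.string_value FROM UNNEST(event_params) WHERE key = "source"), "direct") AS source,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = "medium") AS medium,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = "campaign") AS campaign,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = "content") AS content,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = "term") AS term,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = "page_location") AS page_location,
    device.category AS device_category,  -- top-level record, not an event param
    geo.country AS country               -- top-level record, not an event param
  FROM `my-project.analytics_123456789.events_intraday_*`  -- illustrative table path
)
SELECT * FROM unnested_events
```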

Step 2. Sessionize Event Dimensions

The goal of this step is to create another CTE for sessionized data using the data from the first CTE in step 1.

Let’s talk about sessions a little bit. Why is this step even necessary?

In Universal Analytics, one way a session could start was the presence of UTM parameters in the URL. When someone visited the site from a Facebook ad whose URL contained UTM parameters, UA would start a new session, and the session source / medium would be Facebook ads.

If the landing page had a link to another page, but that link also had UTM parameters, then as soon as you clicked that link, a new session would start. 

This is not the case with GA4. 

Sessions in GA4 are completely different. By default, they only end after 30 minutes of inactivity. So if the site visitor hits multiple UTMs, resizes their browser, or switches country while boarding a train… then a single session can have multiple values for these dimensions.

This is bad, because if you’re grouping by session ID and something like source, but they had multiple sources, then all of your metrics (page views, adds to cart, transactions, etc.) are going to be inflated. 

For this reason, we need to do some work to make sure that we are only grabbing the first value of each field per session. This is a process we call sessionization.

One other important thing to note is that you can have sessions with no page_view events. The easiest example is a page containing an embedded video. If the video is longer than 30 minutes then the session will time out. Let’s imagine the video is 60 minutes long. At 45 minutes (after the session has timed out), the person scrolls to the bottom of the page to see if there’s anything else there. That scroll or engagement event then starts a new session. The video wasn’t that good and the person closes the tab. Now you have a session with no page_view and no landing page.

Use FIRST_VALUE() analytic function to sessionize fields

Because we’ve extracted the fields into a flat table in step 1, we can now use an analytic function to grab the first instance of that field grouped by the more perfect session ID that we created. 

Essentially we’re going to look at all the events that belong to a session ID, order them from oldest to newest, and then only return the value that is oldest in the session.

An example of how to achieve this for session source:

FIRST_VALUE(source IGNORE NULLS) OVER (PARTITION BY session_id ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sessionSource

We’ll use the same logic for each of the dimensions we want to sessionize. 

After all the analytic functions, we’re going to make sure that we also have event_name, transaction_id, and session_id in our table.
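Applied across every dimension, the step 2 CTE could look like this sketch. It assumes the step 1 CTE is named unnested_events and uses a named WINDOW clause so the frame isn’t repeated on every line:

```sql
sessionized AS (
  SELECT
    session_id,
    event_name,
    transaction_id,
    FIRST_VALUE(sessionDate IGNORE NULLS) OVER w AS sessionDate,
    FIRST_VALUE(source IGNORE NULLS) OVER w AS sessionSource,
    FIRST_VALUE(medium IGNORE NULLS) OVER w AS sessionMedium,
    FIRST_VALUE(campaign IGNORE NULLS) OVER w AS sessionCampaign,
    FIRST_VALUE(content IGNORE NULLS) OVER w AS sessionContent,
    FIRST_VALUE(term IGNORE NULLS) OVER w AS sessionTerm,
    FIRST_VALUE(device_category IGNORE NULLS) OVER w AS sessionDevice,
    FIRST_VALUE(country IGNORE NULLS) OVER w AS sessionCountry,
    -- landing page approximated as the first page_location seen in the session
    FIRST_VALUE(page_location IGNORE NULLS) OVER w AS landingPage
  FROM unnested_events
  WINDOW w AS (
    PARTITION BY session_id
    ORDER BY event_timestamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  )
)
```

The IGNORE NULLS modifier matters here: it’s what lets a later event in the session supply a value when the earliest events are missing one.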

Step 3. Aggregating At The Session Level

The final step in our query is to group everything by the session dimensions. The goal is to have one row for each combination of sessionized dimensions. Then we can use COUNT(DISTINCT ...) to count the number of sessions in which an event has occurred. This is the most useful way of determining whether an event happened during a session.

This is also where we’re going to clean up the landing page and add some logic to override source / medium if the landing page contains a gclid.

Clean the landing page URL

We don’t want every unique combination of URL parameters to count as a separate landing page. We’re going to use REGEXP to clean up the parameters and group pages by their actual page path.

REGEXP_REPLACE(CONCAT(RTRIM(REGEXP_EXTRACT(landingPage, r'^(?:https?://)?([^?]+)'), '/'), '/'), r'^https?://', '') AS landingPageClean
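As a sanity check, you can run the expression standalone against a sample URL (the URL below is made up) and confirm that both the protocol and the query string are stripped while the hostname is kept:

```sql
-- Sample input:  https://store.mysite.com/products/shoe?utm_source=fb&gclid=abc
-- Expected out:  store.mysite.com/products/shoe/
SELECT
  REGEXP_REPLACE(
    CONCAT(RTRIM(REGEXP_EXTRACT(landingPage, r'^(?:https?://)?([^?]+)'), '/'), '/'),
    r'^https?://', '') AS landingPageClean
FROM UNNEST(['https://store.mysite.com/products/shoe?utm_source=fb&gclid=abc']) AS landingPage;
```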

Overwrite source / medium if from a Google Ad

There’s a known bug with the GA4 export data around attributing sessions to Google Ads when there’s a GCLID present in the URL. Basically, if the URL on the landing page doesn’t have UTMs but does have a GCLID, we want to overwrite the source / medium to be “google” and “cpc”:

if(regexp_contains(landingPage,"gclid"),"google",sessionSource) as sessionSource,
if(regexp_contains(landingPage,"gclid"),"cpc",sessionMedium) as sessionMedium

Use COUNT(DISTINCT ...) to aggregate metrics

Since our source table contains all of the event names that have happened in the session, we’re going to get a count of the number of sessions (based on session_id) where each event occurred.

These are the main events you may want to use, but you can add any event using this method to get a clean aggregation for that event name.

COUNT(DISTINCT IF(event_name = "view_item",session_id,null)) as view_item

We’ll repeat this pattern for all the standard events we want to include in the final report.

  • Page View
  • View Item List
  • View Item
  • Add to Cart
  • Begin Checkout
  • Purchase
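Spelled out for the list above (using GA4’s standard ecommerce event names), the aggregations look like this:

```sql
COUNT(DISTINCT IF(event_name = "page_view", session_id, NULL)) AS page_view,
COUNT(DISTINCT IF(event_name = "view_item_list", session_id, NULL)) AS view_item_list,
COUNT(DISTINCT IF(event_name = "view_item", session_id, NULL)) AS view_item,
COUNT(DISTINCT IF(event_name = "add_to_cart", session_id, NULL)) AS add_to_cart,
COUNT(DISTINCT IF(event_name = "begin_checkout", session_id, NULL)) AS begin_checkout,
COUNT(DISTINCT IF(event_name = "purchase", session_id, NULL)) AS purchase,
-- distinct transaction IDs give a transaction count alongside the session counts
COUNT(DISTINCT transaction_id) AS transactions
```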

Group By Session Dimensions

In the final output table we don’t want to have any of the event level data, everything should now be aggregated at the session level. Because we used an aggregation function, we also have to use GROUP BY for the session dimensions.

GROUP BY
sessionDate,
landingPage,
sessionSource,
sessionMedium,
sessionCampaign,
sessionContent,
sessionTerm,
sessionDevice,
sessionCountry
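Stitching the three steps together, the overall shape of the query is roughly this skeleton. The CTE names and table path are our own, and the elided SELECT lists are the snippets shown in the earlier steps:

```sql
WITH unnested_events AS (
  -- step 1: build the session ID, format the timestamp, extract params from event_params
  SELECT ...
  FROM `my-project.analytics_123456789.events_intraday_*`  -- illustrative table path
),
sessionized AS (
  -- step 2: FIRST_VALUE(... IGNORE NULLS) OVER (PARTITION BY session_id ...) per dimension
  SELECT ...
  FROM unnested_events
)
-- step 3: clean the landing page, apply the gclid override, aggregate with COUNT(DISTINCT ...)
SELECT ...
FROM sessionized
GROUP BY
  sessionDate, landingPage, sessionSource, sessionMedium, sessionCampaign,
  sessionContent, sessionTerm, sessionDevice, sessionCountry
```

This is the query you’ll plug into Looker Studio as a custom query data source.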

Building A Looker Studio Report 
