Introduction
BigQuery is a fully managed, serverless data warehouse on the Google Cloud Platform that lets you run analytics on terabytes of data in real time.
MoEngage <> BigQuery
The MoEngage and BigQuery integration uses MoEngage's GCS Data Exports to transfer data to Google BigQuery for further processing and analytics.
Integration
Prerequisites
You can set up a Cloud Storage Transfer Service to automatically schedule data ingestion from your Google Cloud Storage buckets into Google BigQuery.
Step 1: Set up GCS Exports
Ensure that you have already set up Data Exports to your GCS bucket by following the steps mentioned here. Once data starts flowing into GCS, move on to the next step. This is important because the exported files are used to predefine the schema of our imports.
Step 2: Prerequisites for initiating a data transfer
Before you create a Cloud Storage Transfer:
- Verify that you can access the files in your assigned buckets. Download any one .json.gz file to your local computer and uncompress it to validate its contents (see the sketch after this list).
- Verify that you have completed all actions required to enable the BigQuery Data Transfer Service.
- Create a BigQuery dataset to store your data.
- Retrieve your Cloud Storage URI.
Note: If your dataset's location is set to a value other than the US multi-region, the Cloud Storage bucket must be in the same region or contained in the same multi-region as the dataset.
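If you prefer to do the download-and-validate check programmatically, here is a minimal sketch using the google-cloud-storage Python client. The bucket and object names below are placeholders, not values from your account:

# pip install google-cloud-storage
import gzip
import json
from google.cloud import storage

# Placeholder bucket/object names; substitute your actual export path.
BUCKET = "your-bucket-name"
OBJECT = ("event-exports/your_app_name/your_connection_id/"
          "export_day=2024-01-01/export_hour=00/sample.json.gz")

client = storage.Client()  # uses your default GCP credentials
client.bucket(BUCKET).blob(OBJECT).download_to_filename("sample.json.gz")

# Uncompress and print the first record to validate the contents.
with gzip.open("sample.json.gz", "rt") as fh:
    print(json.dumps(json.loads(fh.readline()), indent=2))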
Required permissions
When you load data into BigQuery, you need permissions that allow you to load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you'll also need access to the bucket that contains your data. Ensure that you have the following required permissions:
- BigQuery: Ensure that the person creating the transfer has the following permissions in BigQuery:
  - bigquery.transfers.update permissions to create the transfer
  - Both bigquery.datasets.get and bigquery.datasets.update permissions on the target dataset
  The bigquery.admin predefined IAM role includes bigquery.transfers.update, bigquery.datasets.update, and bigquery.datasets.get permissions. For more information on IAM roles in BigQuery Data Transfer Service, see Access control.
- Cloud Storage: storage.objects.get permissions are required on the individual bucket or higher. If you are using a URI wildcard, you must also have storage.objects.list permissions. If you would like to delete the source files after each successful transfer, you also need storage.objects.delete permissions. The storage.objectAdmin predefined IAM role includes all of these permissions.
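To sanity-check the Cloud Storage side of these permissions programmatically, you can ask the bucket which of the required permissions your credentials actually hold. A minimal sketch, again with a placeholder bucket name:

# pip install google-cloud-storage
from google.cloud import storage

# Placeholder bucket name; substitute your export bucket.
bucket = storage.Client().bucket("your-bucket-name")

# test_iam_permissions returns the subset of the requested
# permissions that the caller holds on this bucket.
needed = ["storage.objects.get", "storage.objects.list", "storage.objects.delete"]
granted = bucket.test_iam_permissions(needed)
print("Missing permissions:", set(needed) - set(granted))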
Refer to Cloud Storage transfers for up-to-date information.
Step 3: Create a new table to store the incoming data
Once you have the necessary details in place, create a new table with the schema. You can follow the instructions on Create a destination table. Make sure to apply these specific settings on the Create a table screen:
- Choose "Google Cloud Storage" under "Create table from".
- Select the file from your GCS bucket. Files are exported to the following location:
gs://<your-bucket-name>/event-exports/<your_app_name>/<connection_id>/export_day=YYYY-MM-DD/export_hour=HH/<file-name>.json.gz
- The file format should be "JSONL (Newline delimited JSON)".
- Select the destination Project, Dataset, and Table.
- Under the Schema section, select the "Auto detect" option.
You can change other configurations (apart from these) as per your requirements. Once you are satisfied with your configuration, click "Create Table". A load job runs in the background to load the data into this table; if it succeeds, your new table is created.
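If you prefer to script this initial load instead of using the console, here is a minimal sketch using the google-cloud-bigquery Python client. The table ID and URI are placeholders; the job mirrors the console settings above (JSONL source, schema auto-detection):

# pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names; substitute your own project, dataset, table, and path.
table_id = "your-project.your_dataset.moengage_events"
uri = ("gs://your-bucket-name/event-exports/your_app_name/"
      "your_connection_id/export_day=2024-01-01/export_hour=00/*.json.gz")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # mirrors the "Auto detect" schema option in the console
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to complete
print(client.get_table(table_id).num_rows, "rows loaded")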
Step 4: Verify if your data has been loaded
Before setting up the Cloud Storage Transfer Service, verify that your data has been loaded correctly and that the schema is in place:
- Navigate to the Explorer dashboard.
- Click your table in the sidebar to view its schema.
- Use the Preview tab to verify the data, or run a quick query, as sketched below.
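A quick sanity-check query via the google-cloud-bigquery Python client; the table name is a placeholder:

# pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table name; substitute your own.
query = "SELECT * FROM `your-project.your_dataset.moengage_events` LIMIT 5"
for row in client.query(query).result():
    print(dict(row))  # each row should match the schema you verified above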
Once you verify your data here, you are good to go to the next step.
Step 5: Set up the Cloud Storage Data Transfer Service
Follow the steps mentioned on the Cloud Storage Transfers page for a detailed and up-to-date guide.
- Go to the BigQuery page in the Google Cloud console.
- Click Data Transfers.
- Click Create Transfer.
- On the Create Transfer page:
  - In the Source type section, for Source, choose Google Cloud Storage.
  - In the Transfer config name section, for Display name, enter a name for the transfer such as My Transfer. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.
  - In the Schedule options section, for Schedule, leave the default value (Start now) or click Start at a set time.
  - In the Destination settings section, for Destination dataset, choose the dataset you created to store your data.
  - In the Data source details section:
    - For Destination table, enter the name of the table you created to store the data in BigQuery. Destination table names support parameters.
    - For Cloud Storage URI, enter the Cloud Storage URI. Wildcards and parameters are supported. MoEngage's GCS bucket export directory format is similar to:
      [your-bucket-name]/event-exports/[your-app-name]/[your-connection-id]/export_day={run_time|"%Y-%m-%d"}/export_hour={run_time|"%H"}/*.json.gz
    - For Write preference, choose APPEND.
    - For File format, choose your data format: newline delimited JSON.
    - In the JSON, CSV section, check Ignore unknown values to accept rows that contain values that do not match the schema. Unknown values are ignored.
  - Click Save.
Note: These steps are indicative; you can change configurations as per your requirements. Also, be aware of certain limitations of these data loads.
Once you set it up, you should see your first transfer run as per your schedule. You can always check the logs if something goes wrong, or visit BigQuery's Troubleshooting section.
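If you prefer to create the transfer configuration programmatically, here is a minimal sketch using the google-cloud-bigquery-datatransfer Python client. All names are placeholders, and the parameter keys follow the Cloud Storage transfer options described above; verify them against the current Cloud Storage transfers documentation:

# pip install google-cloud-bigquery-datatransfer
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholder project/dataset/table/bucket names; substitute your own.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="your_dataset",
    display_name="My Transfer",
    data_source_id="google_cloud_storage",
    params={
        "destination_table_name_template": "moengage_events",
        "data_path_template": (
            "gs://your-bucket-name/event-exports/your_app_name/"
            'your_connection_id/export_day={run_time|"%Y-%m-%d"}/'
            'export_hour={run_time|"%H"}/*.json.gz'
        ),
        "write_disposition": "APPEND",
        "file_format": "JSON",
        "ignore_unknown_values": True,
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("your-project"),
    transfer_config=transfer_config,
)
print("Created transfer:", transfer_config.name)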
Common Schema
If the list of events you export changes frequently, we recommend creating a table with a common schema. To create a new table with a common schema:
- Choose "Empty table" under "Create table from".
- Select or enter the destination Project, Dataset, and Table.
- Under the Schema section, enable the "Edit as text" option.
- Paste the following schema:
[
{
"mode": "NULLABLE",
"name": "db_name",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "export_hour",
"type": "INTEGER"
},
{
"mode": "NULLABLE",
"name": "export_day",
"type": "DATE"
},
{
"mode": "NULLABLE",
"name": "event",
"type": "JSON"
},
{
"mode": "NULLABLE",
"name": "app_name",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "export_doc_id",
"type": "STRING"
}
]
You can change other configurations (apart from these) as per your requirements. Once you are satisfied with your configuration, click "Create Table". After this, you can continue from Step 5 to set up your Cloud Storage Data Transfer Service.
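If you prefer to create this table programmatically, here is a minimal sketch using the google-cloud-bigquery Python client. The table name is a placeholder, and the fields mirror the common schema above (mode defaults to NULLABLE):

# pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()

# Fields mirror the common schema JSON above.
schema = [
    bigquery.SchemaField("db_name", "STRING"),
    bigquery.SchemaField("export_hour", "INTEGER"),
    bigquery.SchemaField("export_day", "DATE"),
    bigquery.SchemaField("event", "JSON"),
    bigquery.SchemaField("app_name", "STRING"),
    bigquery.SchemaField("export_doc_id", "STRING"),
]

# Placeholder table name; substitute your own project, dataset, and table.
table = bigquery.Table("your-project.your_dataset.moengage_events", schema=schema)
client.create_table(table)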