Banno Usage Data
Introduction
The goal of this pipeline is to make the Banno Usage Data dataset (previously called Mixpanel dataset) available in JH-managed Google Big Query where it can be queried using the SQL interface for business purposes and to drive downstream business processes.
The usage pipeline is being updated to use the new schema. For the main table called “Events”, Unlike the old schema which is flat, the new schema consists of nested JSON column called “Properties”. The “Properties” column has various optional fields which may or may not be populated depending on the client telemetry coming in from Mixpanel. Detailed description of various schema properties collected is provided below.
Scenarios
The Banno Usage Data dataset consists of two tables, namely, events and people. Both tables hold multi-tenant data separated by a single key which is a GUID that uniquely identifies a specific Financial Institution. In addition, the name of the Financial Institution is also included in each row.
Pipelines
The Banno Usage pipeline consists of two different pipelines.
- The first pipeline is owned and managed by Mixpanel and writes raw datasets to Google Cloud Storage storage buckets in Data Hub Google project. This pipeline creates and manages the initial and subsequent incremental loading of data into the two tables in batch mode, typically once an hour. JH provides the empty dataset and permissions to it for the DAA service account. DAA exposes a REST API that is used to create and manage the pipeline. The REST API is authenticated with a project key provided to JH leadership team. Details of the pipeline design and architecture, BigQuery objects, and service account details are provided here.
- The second pipeline, built, owned and managed by Jack Henry, parses the raw data files from Google Cloud Storage (written by the previous pipeline) in realtime, flattens certain nested fields, and loads the data into Big Query tables in the Data Hub Google project.
Development
The REST API to create the Mixpanel BigQuery realtime export pipeline is documented in detail here.
The second pipeline is custom-built using .NET Cloud Functions, and it is available in the repo here.
Schema
Raw Name | Display Name | Description |
---|---|---|
$app_build_number | App Build Number | General build of the app. |
$app_version_string | App Version | Current app version. |
$bluetooth_enabled | Bluetooth | Set to true if Bluetooth is enabled, false if not. |
$bluetooth_version | Bluetooth Version | Set to “none”, “ble”, or “classic”. |
$brand | Brand | Device brand. |
$browser | Browser | Name of the browser. |
$browser_version | Browser Version | Version of the browser. |
$carrier | Carrier | Wireless carrier of the device owner. |
$city | City | The city of the event sender parsed from the IP property or the Latitude and Longitude properties. |
$current_url | Current URL | The URL of the page on which the event was tracked. |
$device | Device | Name of the event sender’s device, if they’re on mobile web. |
$device_id | Device ID | Autogenerated ID that is local to the device. Calling reset() regenerates this value. |
$duration | Duration | A property that is appended to an event to capture the elapsed time (in seconds) between a time_event and a track call. |
$google_play_services | Google Play Services | Verifies that Google Play services is installed and enabled on this device, and that the version installed on this device is no older than the one required by this client. |
$has_nfc | Has NFC | The device supports Near Field Communication (NFC). |
$has_telephone | Has Telephone | Set to true if this device has telephone functionality, false if not. |
$import | Import | Internal Mixpanel property set to true to indicate that events were sent through /import API. |
$initial_referrer | Initial Referrer | Referring URL when the user first arrived on your site. Defaults to “$direct” if the user is not referred. |
$initial_referring_domain | Initial Referring Domain | Referring domain at first arrival. Defaults to “$direct” if the user is not referred. |
$insert_id | Insert ID | A unique identifier for the event, used to deduplicate events that are accidentally sent multiple times. |
$lib_version | Library Version | Mixpanel library version. |
$manufacturer | Manufacturer | Device manufacturer. |
$model | Model | The model of the device. |
$mp_api_endpoint | API Endpoint | Mixpanel property to record the API endpoint the data was sent to:api.mixpanel.com - default ingestionapi-eu.mixpanel.com - EU data ingestionapi-in.mixpanel.com - India data ingestionapi-js.mixpanel.com - Javascript SDK |
$mp_api_timestamp_ms | API Timestamp | UTC timestamp in milliseconds when the event was received by our API. |
$os | Operating System | OS of the event sender. |
$os_version | OS Version | The current version of operating system on the device. |
$radio | Radio | Current cellular network communication standard (3G, 4G, LTE, etc). |
$referrer | Referrer | Referring URL including your own domain. Might not be present if document.referrer does not return a value. |
$referring_domain | Referring Domain | Referring domain including your own domain. Might not be present if document.referrer does not return a value. |
$region | Region | The region (state or province) of the event sender parsed from the IP property or the Latitude and Longitude properties. |
$screen_dpi | Screen DPI | Pixel density of the device screen. |
$screen_height | Screen Height | The height of the device screen in pixels / points (iOS). |
$screen_width | Screen Width | The width of the device screen in pixels / points (iOS). |
$search_engine | Search Engine | The search engine that the customer used when they arrived at your domain. |
$user_id | User ID | The identified ID of the user. Calling identify() sets this. |
$watch_model | Watch Model | The model of the iOS watch. |
$wifi | Wifi | Set to true if the user’s device has an active, available Wifi connection, false if not. |
mp_country_code | Country | The country of the event sender parsed from the IP property or the Latitude and Longitude properties. The value is stored as a 2-letter country code in the raw data and parsed into the country name in the UI. |
mp_keyword | Search Keyword | Search keywords detected on the referrer from a search engine to your domain. |
mp_lib | Mixpanel Library | The Mixpanel library that sent the event. |
mp_processing_time_ms | Time Processed (UTC) | UTC timestamp in milliseconds when the event was processed by Mixpanel servers. |
mp_sent_by_lib_version | Sent By Library Version | Mixpanel library version used to send data (not necessarily the same as the version which enqueued the data). |
utm_source, utm_medium, etc. | UTM Parameters | UTM tags derived from the URL a customer clicked to arrive at your domain. Each UTM will be collected under its own property. |
Querying the data
In the new schema, the data in the Events table can be queried and the corresponding JSON fields can be flattened as follows.
SELECT
DeviceId,
DistinctId,
EventName,
InsertId,
Time,
UserId,
InstitutionId,
InstitutionName,
Properties,
-- Nested JSON fields with $ prefix will need to be extracted using regex matching
REGEXP_EXTRACT(TO_JSON_STRING(Properties), r'"\$city":"([^"]+)"') AS city,
REGEXP_EXTRACT(TO_JSON_STRING(Properties), r'"\$region":"([^"]+)"') AS region,
-- Nested JSON fields with no $ prefix will need to be extracted using JSON extraction
JSON_EXTRACT_SCALAR(Properties, '$.languageCode') AS language_code,
JSON_EXTRACT_SCALAR(Properties, '$.mp_country_code') AS mp_country_code
FROM
sdb-dig-databroker-demoa8b4.garden_bank_banno_usage.events
LIMIT 10;