Digital History Events (Banno)

Introduction

The goal of this pipeline is to make the History Events dataset available in Google BigQuery, where it can be queried through the SQL interface for business analysis and to drive downstream business processes.

Scenarios

The History Events dataset is made up of data files built from each source topic of the History Kafka source. A single data file in JSON format, based on specific topics (currently person, user, and generic events), is written directly to Google Cloud Storage. The files are then imported into BigQuery and exposed as a native table, providing a SQL interface over the data to drive downstream business processes.

This pipeline enables three scenarios:

  • Initial load: The data files are loaded for the first time
  • Incremental load: Subsequent data file loads after the initial load
  • Historical load / backfill: Clean up existing data and reload from the beginning of time

Pipelines

Events are activity logs generated when a client interacts with our products. These events are sent to the History service and written to Kafka as JSON strings. A dedicated Kafka consumer group pulls up to 50,000 events every two minutes. Once pulled, the events are deserialized into Scala objects, processed, and transformed into a format suitable for BigQuery. Finally, the data is converted to JSONL and written to a Google Cloud Storage bucket using the Google API.
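A minimal sketch of this consume-transform-upload step, using the Kafka Java client and the Google Cloud Storage client from Scala, is shown below. The topic names, consumer group, bootstrap servers, bucket, object naming, and the pass-through transform are illustrative assumptions, not the actual pipeline configuration.

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._

import com.google.cloud.storage.{BlobInfo, StorageOptions}
import org.apache.kafka.clients.consumer.KafkaConsumer

object HistoryEventExportSketch {
  def main(args: Array[String]): Unit = {
    // Consumer settings; group id and bootstrap servers are placeholders.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "history-events-bq-export")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("max.poll.records", "50000") // cap a single pull at 50,000 events

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Arrays.asList("person", "user", "generic-events"))

    // One pull of events; the real pipeline repeats this every two minutes.
    val records = consumer.poll(Duration.ofSeconds(30)).asScala

    // Deserialize/transform each event into a BigQuery-friendly JSON row,
    // then join the rows into a single JSONL payload.
    val jsonl = records.map(r => transform(r.value())).mkString("\n")

    // Write the JSONL batch to the GCS landing bucket (bucket and object names are assumptions).
    val storage = StorageOptions.getDefaultInstance.getService
    val blob = BlobInfo.newBuilder("history-events-landing", s"events-${System.currentTimeMillis()}.jsonl").build()
    storage.create(blob, jsonl.getBytes("UTF-8"))

    consumer.commitSync()
    consumer.close()
  }

  // Placeholder for the real deserialize-and-reshape step, which maps JSON strings
  // onto Scala objects before converting them back to BigQuery-compatible JSON.
  private def transform(eventJson: String): String = eventJson
}
```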

A Data Transfer job then picks up the data file from the Google Cloud Storage bucket and imports it into Google BigQuery, exposing it as a native table. After the data file is imported, copies of the data file are retained in the Google Cloud Storage bucket for later archival and audit purposes. After the initial import into BigQuery, subsequent imports only append to the existing table; there is no additional work to de-duplicate data or purge older entries. Per the current requirements, the two jobs are time-driven rather than triggered by events such as the availability of the source file or the completion of the first job.
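In production this import is handled by the scheduled Data Transfer job; the sketch below approximates the same step with the BigQuery client library from Scala to show the append-only write disposition. The dataset and table names come from the partition table further down; the source URI is an assumption.

```scala
import com.google.cloud.bigquery.{BigQueryOptions, FormatOptions, JobInfo, LoadJobConfiguration, TableId}

object HistoryEventImportSketch {
  def main(args: Array[String]): Unit = {
    val bigquery = BigQueryOptions.getDefaultInstance.getService

    // Destination table in the jh_digital_conversations_us dataset; the GCS path is a placeholder.
    val table = TableId.of("jh_digital_conversations_us", "generic_history_events")
    val sourceUri = "gs://history-events-landing/events-*.jsonl"

    val config = LoadJobConfiguration.newBuilder(table, sourceUri)
      .setFormatOptions(FormatOptions.json())                     // newline-delimited JSON source files
      .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND) // append only; no de-duplication or purging
      .build()

    // Run the load job and wait for it to finish.
    val job = bigquery.create(JobInfo.of(config))
    job.waitFor()
  }
}
```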

Schedule

Initially, the data import process is expected to run hourly; however, the pipeline can be changed to any schedule as needed in the future.

Querying the Data

Partition Information for Tables in Dataset jh_digital_conversations_us

Table Name                Partition Column    Partition Type
generic_history_events    timestamp           MONTH
person_events             timestamp           MONTH
user_events               timestamp           MONTH
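Because the tables are partitioned by month on the timestamp column, filtering on timestamp limits a query to the matching monthly partitions. A minimal Scala sketch using the BigQuery client is shown below; the date range is an example only.

```scala
import scala.jdk.CollectionConverters._
import com.google.cloud.bigquery.{BigQueryOptions, QueryJobConfiguration}

object PartitionedQuerySketch {
  def main(args: Array[String]): Unit = {
    val bigquery = BigQueryOptions.getDefaultInstance.getService

    // Restricting on the `timestamp` partition column prunes the scan to the matching monthly partitions.
    val sql =
      """SELECT *
        |FROM `jh_digital_conversations_us.user_events`
        |WHERE timestamp >= TIMESTAMP('2026-01-01')
        |  AND timestamp <  TIMESTAMP('2026-02-01')""".stripMargin

    val results = bigquery.query(QueryJobConfiguration.newBuilder(sql).build())
    results.iterateAll().asScala.foreach(println)
  }
}
```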


Last updated Wed Mar 4 2026