Social User Cohorts Data Pipeline

Problem

A large news aggregation and discussion website needed to measure user retention, and cohort analysis was chosen for this. To run the cohort analysis, user data had to be collected, processed, stored, and visualized.

User data arrived as a stream at a very high rate. These records had to be processed and stored so that the user retention ratio could be computed through cohort analysis.

Challenges

  • Processing large volumes (terabytes) of streaming data
  • Sustaining a high per-record processing rate
  • Partitioning and clustering BigQuery tables to optimize performance and minimize query cost
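
To illustrate the partitioning and clustering challenge, here is a minimal sketch that builds a BigQuery DDL statement for an events table partitioned by day and clustered by user. The table and column names (`analytics.user_events`, `user_id`, `event_type`, `event_ts`) are assumptions made for illustration and do not come from the original project.

```python
def build_events_table_ddl(table: str = "analytics.user_events") -> str:
    """Return BigQuery DDL for a date-partitioned, clustered events table.

    All table and column names here are hypothetical placeholders.
    """
    return (
        f"CREATE TABLE IF NOT EXISTS `{table}` (\n"
        "  user_id STRING NOT NULL,\n"
        "  event_type STRING,\n"
        "  event_ts TIMESTAMP NOT NULL\n"
        ")\n"
        # Daily partitions let queries that filter on event date skip whole partitions.
        "PARTITION BY DATE(event_ts)\n"
        # Clustering co-locates rows with the same user_id inside each partition.
        "CLUSTER BY user_id, event_type"
    )


if __name__ == "__main__":
    print(build_events_table_ddl())
```

With a layout like this, a query that filters on the event date reads only the matching partitions, which is typically how partitioning reduces both latency and the bytes billed.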

Tools

  • Google PubSub
  • Google BigQuery
  • Google Dataflow
  • Google Data Studio
  • Google Cloud ML
  • Google Datalab

Solution

  • The aggregator sent user feeds to Google PubSub as a high-rate data stream
  • Google Dataflow consumed the data from Google PubSub using the Apache Beam PubSub connector
  • Data cleansing and enrichment were performed in Google Cloud Dataflow
  • Processed data was stored in Google BigQuery, which handles very large volumes of data
  • The data was further aggregated for the cohort analysis and stored in Google BigQuery as well. Google BigQuery scanned 20 TB of data within a few minutes
  • Google Data Studio templates were used to visualize the BigQuery data through the BigQuery connector
  • Google Cloud ML was used for prediction and anomaly detection on the historical data
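
The cleansing and cohort aggregation steps above can be sketched in plain Python. This is an in-memory stand-in for illustration only, not the Dataflow or BigQuery implementation used in the project. The message fields (`user_id`, `ts` as epoch seconds) and the choice of monthly cohorts are assumptions, since the source does not describe the record schema or the cohort granularity.

```python
import json
from collections import defaultdict
from datetime import date, datetime, timezone


def parse_event(raw: bytes):
    """Cleanse one message: return (user_id, activity_date), or None if invalid.

    The `user_id` and `ts` field names are hypothetical.
    """
    try:
        record = json.loads(raw.decode("utf-8"))
        user_id = str(record["user_id"]).strip()
        ts = datetime.fromtimestamp(float(record["ts"]), tz=timezone.utc)
    except (ValueError, KeyError, TypeError, OverflowError, OSError):
        return None  # drop malformed records
    if not user_id:
        return None
    return user_id, ts.date()


def month_index(d: date) -> int:
    """Map a date to a running month number so that months can be subtracted."""
    return d.year * 12 + d.month - 1


def cohort_retention(events):
    """Return {cohort_month: {months_since_first_activity: retention_ratio}}.

    A user's cohort is the month of their first observed activity.
    """
    first_seen = {}
    active_months = defaultdict(set)  # user_id -> months with any activity
    for user_id, day in events:
        m = month_index(day)
        active_months[user_id].add(m)
        first_seen[user_id] = min(m, first_seen.get(user_id, m))

    cohort_users = defaultdict(set)
    for user_id, m in first_seen.items():
        cohort_users[m].add(user_id)

    result = {}
    for cohort, users in sorted(cohort_users.items()):
        counts = defaultdict(int)  # month offset -> number of active users
        for user_id in users:
            for m in active_months[user_id]:
                counts[m - cohort] += 1
        label = f"{cohort // 12:04d}-{cohort % 12 + 1:02d}"
        result[label] = {off: counts[off] / len(users) for off in sorted(counts)}
    return result


if __name__ == "__main__":
    raw_messages = [
        json.dumps({"user_id": "a", "ts": 1704067200}).encode(),  # 2024-01-01 UTC
        b"not json",  # dropped by the cleansing step
    ]
    cleaned = [e for e in map(parse_event, raw_messages) if e is not None]
    print(cohort_retention(cleaned))
```

Each ratio is the share of a cohort's users who were active a given number of months after their first activity, so offset 0 is always 1.0. In a pipeline like the one described, the same grouping would more likely be expressed as an aggregation query over the stored events than as in-memory Python.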