Creating a streaming data pipeline for a real-time dashboard with Dataflow
The New York City taxi industry wants to monitor how well the business is doing in real time.
This streaming data pipeline captures taxi revenue, passenger count, and ride status, then visualizes the results in a dashboard.
Tasks covered across the stages:
Create a Dataflow job from a template
Subscribe to a Pub/Sub topic
Stream a Dataflow pipeline into BigQuery
Monitor a Dataflow pipeline in BigQuery
Analyze results with SQL
Visualize key metrics in Looker Studio
This project was done in Google Cloud Shell, a virtual machine that comes preloaded with Google development tools and runs on Google Cloud.
When you log in to Google Cloud and activate Cloud Shell, the project ID is loaded for you, and you use gcloud, the command-line tool for Google Cloud.
gcloud auth list: shows the active account for your project.
It shows output like this:
Credentialed accounts:
- google1623327_student@qwiklabs.net
gcloud config list project: shows the project ID for your project.
It shows output like this:
[core]
project = qwiklabs-gcp-44776a13dea667a6
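The project ID is already set for you in the lab, but if you ever need to point gcloud at a project yourself, a minimal sketch looks like this (PROJECT_ID is a placeholder for your own project ID, not the lab value above):

# Set the active project used by subsequent gcloud and bq commands
gcloud config set project PROJECT_ID

# Store the active project ID in a shell variable for reuse
export PROJECT_ID=$(gcloud config get-value project)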
The process will be split into separate Stages:
Stage 1:
- Create a BigQuery dataset from the NYC Taxi & Limousine Commission’s existing open dataset
There are two ways to do this:
i. Using the command line in Google Cloud Shell
ii. Using the Google Cloud Console, which is a GUI
- Source a pre-created Pub/Sub topic: the topic for this open dataset already exists. Pub/Sub is a global messaging service that allows seamless communication between publishers (senders) and subscribers (receivers). These senders and receivers are independent applications that communicate with one another through a shared string called a topic: a publisher application creates and sends messages to a topic, and a subscriber application creates a subscription to that topic to receive messages from it.
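The taxi topic is already provided in this lab, but to illustrate how the pieces fit together, creating a topic and subscription of your own would look roughly like this (the names my-taxi-topic and my-taxi-sub are hypothetical, not the lab's resources):

# Create a topic for publishers to send messages to
gcloud pubsub topics create my-taxi-topic

# Create a subscription so a subscriber application can receive those messages
gcloud pubsub subscriptions create my-taxi-sub --topic=my-taxi-topic

# Publish a test message, then pull it back from the subscription
gcloud pubsub topics publish my-taxi-topic --message="test ride event"
gcloud pubsub subscriptions pull my-taxi-sub --auto-ack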
Task 1: Create a BigQuery dataset called taxirides
a. On the command line in Cloud Shell, type the following commands:
1. This command creates the **taxirides** dataset
**bq --location=us-west1 mk taxirides**
2. This command creates an empty table schema that we will later stream into.
The table name is taxirides.realtime
**bq --location=us-west1 mk \
--time_partitioning_field timestamp \
--schema ride_id:string,point_idx:integer,latitude:float,longitude:float,\
timestamp:timestamp,meter_reading:float,meter_increment:float,ride_status:string,\
passenger_count:integer -t taxirides.realtime**
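To confirm the table was created with the expected schema and partitioning, you can inspect it from Cloud Shell (a quick sanity check, not part of the lab steps):

# List the tables in the taxirides dataset
bq ls taxirides

# Show the schema, partitioning, and other details of the realtime table
bq show --format=prettyjson taxirides.realtime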
b. On the Cloud Console UI
In the Google Cloud Console, on the Navigation menu, click BigQuery, then click Done on the welcome dialog.
Locate the project ID, click the View actions ellipsis (3 dots) next to it, and then click Create dataset.
In Dataset ID, type taxirides.
In Data location, click us-west1 (Oregon) and then click Create Dataset
In the Explorer pane, click expand node (to reveal the new taxirides dataset)
Click View actions (3 dots) next to the taxirides dataset, and then click Open.
Click Create table, and in Table, type realtime.
For Schema, click Edit as text and paste the schema below:
ride_id:string, point_idx:integer, latitude:float, longitude:float, timestamp:timestamp, meter_reading:float, meter_increment:float, ride_status:string, passenger_count:integer
In Partition and cluster settings, select timestamp, and then click Create table.
So far we have created a dataset and a table schema to store our streaming data (which we will send and receive through the Pub/Sub topic).
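As a final check, you can run a quick query from Cloud Shell against the new table; it should return 0 rows until the streaming pipeline starts in a later stage (again, an optional sanity check rather than a lab step):

# Count rows in the empty realtime table (expected: 0)
bq query --nouse_legacy_sql 'SELECT COUNT(*) AS ride_rows FROM taxirides.realtime'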
This is the end of Stage 1.
In subsequent posts, we will continue with Stage 2 of the data pipeline process.