Sample Apache Beam code that performs streaming ETL from Pub/Sub to BigQuery, using Dataflow as the runner. The pipeline streams active flights arriving at or departing from London Heathrow Airport via Pub/Sub, enriches the data by joining it with a dimensional table, and loads the results into a BigQuery table.
The project consists of the following Java classes for the ETL pipeline:
- com.streaming.ETL - Main class that ingests the Pub/Sub messages and the dimensional table and loads the processed data to BigQuery
- com.streaming.messageParsing - Class for parsing the ingested Pub/Sub messages into BigQuery row objects (TableRow)
- com.streaming.aggregate - Class for converting TableRows to Key/Value pairs that can be grouped by a key and aggregated
- com.streaming.join - Class for joining the aggregated Pub/Sub messages with TableRows from the dimensional table
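In the pipeline, com.streaming.aggregate turns TableRows into key/value pairs so Beam can group and count them per key. As a plain-Java sketch of that keying-and-counting logic (outside Beam, with a hypothetical "airline" field; the real class emits Beam KV pairs instead of map entries):

```java
import java.util.*;

public class AggregateSketch {
    // Group row-like records by a key field and count occurrences per key.
    // This mirrors what a Beam GroupByKey/Count over KV pairs would produce.
    static Map<String, Long> countByKey(List<? extends Map<String, ?>> rows, String keyField) {
        Map<String, Long> counts = new HashMap<>();
        for (Map<String, ?> row : rows) {
            String key = String.valueOf(row.get(keyField));
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
            Map.of("airline", "BA", "flight", "BA117"),
            Map.of("airline", "BA", "flight", "BA249"),
            Map.of("airline", "VS", "flight", "VS3"));
        Map<String, Long> counts = countByKey(rows, "airline");
        System.out.println(counts.get("BA") + " " + counts.get("VS")); // prints "2 1"
    }
}
```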
The project also contains the following supporting files:
- pom.xml - Configuration file for Apache Maven
- message_subscriber - Directory with the com.subscriber.Subscriber Java class and its pom.xml file, used to set up a Google Cloud Function that receives messages from a streaming service and publishes them to Pub/Sub
To run the pipeline, execute the following from the root directory of the Apache Beam project:
mvn compile exec:java \
  -Dexec.mainClass=com.streaming.ETL \
  -Dexec.cleanupDaemonThreads=false \
  -Dexec.args=" \
    --project=${PROJECT} \
    --region=${REGION} \
    --inputTopic=${TOPIC} \
    --runner=DataflowRunner \
    --windowSize=${WINDOW}"
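The --windowSize option sets the length of the windows into which the pipeline groups Pub/Sub messages, typically applied with Beam's Window.into(FixedWindows.of(...)). Conceptually, each event timestamp maps to the fixed window that starts at the largest multiple of the window size not exceeding it; a stdlib-only sketch of that assignment (an assumption about how --windowSize is used, not the pipeline's actual code):

```java
import java.time.Duration;
import java.time.Instant;

public class WindowSketch {
    // Return the start of the fixed window containing the given event time.
    // Mirrors the arithmetic behind Beam's FixedWindows.
    static Instant windowStart(Instant eventTime, Duration windowSize) {
        long sizeMillis = windowSize.toMillis();
        long start = (eventTime.toEpochMilli() / sizeMillis) * sizeMillis;
        return Instant.ofEpochMilli(start);
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2023-05-01T12:07:30Z");
        // With 5-minute windows, 12:07:30 falls in the window starting at 12:05:00.
        System.out.println(windowStart(t, Duration.ofMinutes(5))); // prints 2023-05-01T12:05:00Z
    }
}
```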
This project relies on the following resources:
- The Heathrow Flights streaming data source from Ably Hub
- OpenFlights dataset for the dimensional tables with airline and aircraft data
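The join step enriches each streamed flight record with attributes from the OpenFlights dimensional data. A plain-Java sketch of a key-based lookup join (the field names "airlineCode" and "airlineName" are hypothetical; the real com.streaming.join would express this with Beam transforms such as CoGroupByKey or a side input):

```java
import java.util.*;

public class JoinSketch {
    // Left-join streamed flight records with a dimensional table keyed on a
    // shared field; unmatched records pass through without enrichment.
    static List<Map<String, String>> join(List<Map<String, String>> flights,
                                          Map<String, Map<String, String>> dimByKey,
                                          String keyField) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> flight : flights) {
            Map<String, String> joined = new HashMap<>(flight);
            Map<String, String> dim = dimByKey.get(flight.get(keyField));
            if (dim != null) {
                joined.putAll(dim); // add the dimensional attributes to the record
            }
            out.add(joined);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> flights = List.of(
            Map.of("airlineCode", "BAW", "flight", "BA117"));
        Map<String, Map<String, String>> dim = Map.of(
            "BAW", Map.of("airlineName", "British Airways"));
        System.out.println(join(flights, dim, "airlineCode").get(0).get("airlineName"));
        // prints British Airways
    }
}
```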