This project demonstrates a real-time data processing pipeline built as a Go application: it reads a dataset stored in S3, processes it, and stores the results in Amazon Redshift. The application is deployed using Elastic Beanstalk.
- Create a real-time data processing pipeline with a Go application.
- Deploy the application using Elastic Beanstalk.
- Store processed data in Amazon Redshift.
- Verify data processing and storage using SQL queries in Redshift.
Ensure you have the following installed on your local machine:
- Go (the application is run locally with `go run`)
- Docker and Docker Compose (used to build and run the containerized application)
- An AWS account with access to S3, Redshift, and Elastic Beanstalk, with credentials configured locally
Download Dataset
- Download the Online Retail dataset from the UCI Machine Learning Repository.
Create an S3 Bucket
- Follow the AWS documentation to create an S3 bucket in your account.
Upload the Dataset to S3
- Upload the Online Retail CSV file to the S3 bucket you created, either through the AWS console or programmatically; a minimal upload sketch is shown below.
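If you prefer to script the upload, the sketch below shows one way to do it. It is only an illustration: it assumes the aws-sdk-go-v2 packages and uses placeholder region, bucket, key, and file names; uploading through the console or `aws s3 cp` works just as well.

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	// Placeholder values; replace them with your own region, bucket, and object key.
	const region, bucket, key = "us-east-1", "my-retail-data-bucket", "online_retail.csv"

	cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion(region))
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Open("online_retail.csv") // the dataset, saved locally as CSV
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Upload the file as a single S3 object.
	_, err = s3.NewFromConfig(cfg).PutObject(context.TODO(), &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Body:   f,
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("uploaded to s3://%s/%s", bucket, key)
}
```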
Create a Redshift Cluster
- Follow the AWS documentation to create a Redshift cluster.
Create a New Database within the Redshift Cluster
- Once your Redshift cluster is created, use the AWS Management Console or AWS CLI to create a new database within the cluster.
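If you would rather script this step than use the console, a minimal sketch is shown below. It assumes the lib/pq driver (Redshift is reachable over the Postgres wire protocol) and uses a hypothetical cluster endpoint, credentials, and database name.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Redshift speaks the Postgres wire protocol
)

func main() {
	// Hypothetical endpoint, credentials, and database name; substitute your cluster's values.
	connStr := "postgres://awsuser:password@my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev?sslmode=require"

	db, err := sql.Open("postgres", connStr)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// CREATE DATABASE must run outside a transaction, so a plain Exec is used.
	if _, err := db.Exec("CREATE DATABASE retail_db"); err != nil {
		log.Fatal(err)
	}
	log.Println("database retail_db created")
}
```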
Set the following environment variables so the application can read from S3 and push data to Redshift:

```sh
# To read from S3:
REGION=
BUCKET=
KEY= # name of the .csv file in S3

# To push data to Redshift:
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
REDSHIFT_CONN_STRING=
```
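For illustration, a filled-in configuration might look like the following. Every value here is a placeholder, and the connection-string format assumes a Postgres-compatible driver; adjust it to whatever main.go actually expects.

```sh
REGION=us-east-1
BUCKET=my-retail-data-bucket
KEY=online_retail.csv

AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
REDSHIFT_CONN_STRING=postgres://awsuser:password@my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/retail_db?sslmode=require
```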
Build and start the application with Docker Compose:

```sh
docker-compose build
docker-compose up
```

Once the service is up, trigger the two actions over HTTP:

```sh
# To print the processed data (the output appears in the Docker container's
# console, not in the terminal where curl is run):
curl "http://localhost:8080?action=print"

# To insert processed data into Redshift:
curl "http://localhost:8080?action=insert"
```
To run the application locally without Docker, fetch the dependencies and pass the same actions as flags:

```sh
go mod tidy

# To print the processed data:
go run main.go -action=print

# To insert processed data into Redshift:
go run main.go -action=insert
```
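To verify the load (the last of the project goals), run a SQL query against the cluster, either in the Redshift query editor or programmatically. The sketch below reuses `REDSHIFT_CONN_STRING`, assumes the lib/pq driver, and queries a hypothetical table name `online_retail`; substitute the table the insert step actually creates.

```go
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/lib/pq"
)

func main() {
	// Reuse the same connection string the pipeline uses for inserts.
	db, err := sql.Open("postgres", os.Getenv("REDSHIFT_CONN_STRING"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// "online_retail" is a hypothetical table name; query whichever table
	// the insert step actually populates.
	var count int
	if err := db.QueryRow("SELECT COUNT(*) FROM online_retail").Scan(&count); err != nil {
		log.Fatal(err)
	}
	log.Printf("rows stored in Redshift: %d", count)
}
```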