
Crunching Apache Parquet Files with Apache Flink

This repo includes sample code to set up Flink dataflows that process Parquet files. The CSV datasets under resources/ are the Restaurant Scores datasets downloaded from SF OpenData. For more information, please see this post.

### Generating the Avro Model Classes

If you make any changes to the Avro schema files (*.avsc) under resources/, you should regenerate the model classes:

./compile_schemas.sh
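
The generated classes are ordinary Avro specific records with a builder API. As a rough illustration (the Business class and its field names are hypothetical here; the real ones are defined by the schemas under resources/):

```java
// Hypothetical usage of a generated Avro model class; the actual class
// and field names come from the *.avsc schemas under resources/.
Business business = Business.newBuilder()
        .setBusinessId(10)
        .setName("Tiramisu Kitchen")
        .build();
```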

### Step 1: Converting the CSV Data Files to the Parquet Format

The command below converts the CSV files under resources/ and writes them in Parquet format to the /tmp/business, /tmp/violations, and /tmp/inspections directories:

mvn clean package exec:java -Dexec.mainClass="yigitbasi.nezih.ConvertToParquet"
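
A conversion like this is typically done with parquet-avro's AvroParquetWriter. The following is a minimal sketch of that approach under stated assumptions, not the exact code in ConvertToParquet: the Business class, its fields, and the output path are placeholders.

```java
import org.apache.avro.Schema;
import org.apache.hadoop.fs.Path;
// parquet-avro (org.apache.parquet coordinates; older releases used the parquet.* packages)
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class CsvToParquetSketch {
    public static void main(String[] args) throws Exception {
        // Schema of the generated Avro model class (hypothetical "Business").
        Schema schema = Business.getClassSchema();

        try (ParquetWriter<Business> writer =
                     AvroParquetWriter.<Business>builder(new Path("/tmp/business/part-0.parquet"))
                             .withSchema(schema)
                             .build()) {
            // The real converter would parse each CSV row into a record like this.
            Business record = Business.newBuilder()
                    .setBusinessId(10)
                    .setName("Tiramisu Kitchen")
                    .build();
            writer.write(record);
        }
    }
}
```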

### Step 2: Running the Flink Dataflow

Build a self-contained jar with all dependencies and run the dataflow:

mvn clean compile assembly:single
java -jar target/FlinkParquet-0.1-SNAPSHOT-jar-with-dependencies.jar
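
At the time this repo was written, the usual way to read Parquet in a Flink DataSet program was to wrap parquet-avro's AvroParquetInputFormat in Flink's Hadoop compatibility layer. Here is a hedged sketch of that pattern; the Business class and the /tmp/business path are assumptions, and this is not necessarily verbatim what this repo's dataflow does.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class FlinkParquetSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        Job job = Job.getInstance();
        // AvroParquetInputFormat emits (Void, record) pairs; the key is unused.
        HadoopInputFormat<Void, Business> input = new HadoopInputFormat<>(
                new AvroParquetInputFormat<Business>(), Void.class, Business.class, job);
        FileInputFormat.addInputPath(job, new Path("/tmp/business"));

        DataSet<Tuple2<Void, Business>> businesses = env.createInput(input);

        // Extract one field from each record and print the results.
        businesses.map(new MapFunction<Tuple2<Void, Business>, String>() {
            @Override
            public String map(Tuple2<Void, Business> t) {
                return t.f1.getName().toString();
            }
        }).print();
    }
}
```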
