An implementation of Apache Hadoop that counts the unique objects in every curatorial department of The Met Collection, using tools available on the Google Cloud Platform.
This project executes a MapReduce job with the Hadoop BigQuery Connector to count the number of unique exhibit types in every department of The Met Collection. The data are sourced from the Objects table of The Met Public Domain Art Works dataset hosted on Google BigQuery.
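The counting logic itself is simple to illustrate outside of Hadoop. The sketch below (plain Java, no Hadoop dependencies; the class and row data are hypothetical, not taken from the project source) mirrors what the MapReduce job computes at scale: the "map" step turns each record into a (department, object name) pair, and the "reduce" step collects the names for each department into a set and takes its size.

```java
import java.util.*;

// Plain-Java sketch of the job's logic: count distinct object names
// per department. Rows are {department, objectName} pairs.
public class UniqueCountSketch {

    static Map<String, Integer> countUniquePerDepartment(List<String[]> rows) {
        // "Map"/shuffle stage analogue: group object names by department.
        Map<String, Set<String>> grouped = new HashMap<>();
        for (String[] row : rows) {
            grouped.computeIfAbsent(row[0], k -> new HashSet<>()).add(row[1]);
        }
        // "Reduce" stage analogue: the count per department is the set size,
        // so duplicate object names are counted only once.
        Map<String, Integer> counts = new HashMap<>();
        grouped.forEach((dept, names) -> counts.put(dept, names.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"Drawings and Prints", "Etching"},
            new String[]{"Drawings and Prints", "Etching"},  // duplicate: counted once
            new String[]{"Drawings and Prints", "Woodcut"},
            new String[]{"Egyptian Art", "Scarab"});
        System.out.println(countUniquePerDepartment(rows));
    }
}
```

In the real job the grouping and distinct-counting are distributed across mappers and reducers, with the BigQuery connector supplying the input rows and writing the per-department counts back out.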
This project was built around Apache Maven to manage the Java app dependencies. The pom.xml file is a configuration file that contains the information required to build the package dependencies (in this case the Hadoop and BigQuery clients), as well as to relocate some of the packages. Keep in mind that the version of the Hadoop client declared in the pom.xml file should match the one run by your cluster. If in doubt, run $ hadoop version from the command line of your instance (or Dataproc cluster, if you are using GCP) to identify the Hadoop version. Finally, you can check the latest versions of your Java dependencies at the Maven Central Repository.
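As a rough sketch, the dependency section of such a pom.xml might look like the following. The coordinates are the usual ones for the Hadoop client and the Hadoop BigQuery connector, but the version strings here are placeholders, not the project's actual values: match the Hadoop version to your cluster and check Maven Central for the latest connector release.

```xml
<dependencies>
  <!-- Must match the Hadoop release on your cluster ($ hadoop version).
       "provided" because the cluster already ships the Hadoop runtime. -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.2.2</version> <!-- placeholder: match your cluster -->
    <scope>provided</scope>
  </dependency>
  <!-- Hadoop BigQuery connector; the version prefix selects the Hadoop
       major version the artifact was built against. -->
  <dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>bigquery-connector</artifactId>
    <version>hadoop3-1.2.0</version> <!-- placeholder: check Maven Central -->
  </dependency>
</dependencies>
```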
.
├── pom.xml                            # Configuration file for Apache Maven
├── src
│   └── main
│       └── java
│           └── met_objects            # Main package name
│               ├── CountArtObjects.java    # Java project source code
│               └── TextArrayWritable.java  # Subclass extending ArrayWritable into a Text-type class
└── target
    └── met-object-count-0.0.1.jar     # JAR file
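For context on the TextArrayWritable file above: ArrayWritable serializes a generic Writable[], so emitting arrays of strings from a mapper typically requires a small subclass that pins the element type to Text. The sketch below shows the standard Hadoop idiom for this; the project's actual class may differ in detail.

```java
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

// Standard idiom: a no-arg constructor passing Text.class lets Hadoop
// know how to deserialize the array's elements on the reduce side.
public class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() {
        super(Text.class);
    }
    public TextArrayWritable(Text[] values) {
        super(Text.class, values);
    }
}
```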
- Compile the Java class files on your local machine using Maven (or another Java project management tool):
  $ mvn clean package
- Copy the JAR file to the Cloud Storage bucket of your project:
  $ gsutil cp /home/usr/my_Maven_project/target/met-object-count-0.0.1.jar gs://${PROJECT}/hadoop_job_files
- Create a Dataproc cluster:
  $ gcloud dataproc clusters create ${CLUSTER_NAME} \
      --worker-machine-type n1-standard-4 \
      --num-workers 0 \
      --image-version 2.0.5-debian10 \
      --region ${REGION} \
      --max-idle=30m
- Submit a Hadoop job to the Dataproc cluster:
  $ gcloud dataproc jobs submit hadoop \
      --cluster ${CLUSTER_NAME} \
      --jar gs://${PROJECT}/hadoop_job_files/met-object-count-0.0.1.jar \
      --region ${REGION} \
      -- ${PROJECT} bigquery-public-data:the_met.objects ${OUTPUT_TABLE} gs://${PROJECT}/hadoop_job_files/output  # Hadoop job arguments
- Explore the results in the specified BigQuery output table.