DataprocDriver provides a secure, easily accessible way to connect to Apache Hive on a Dataproc cluster from anywhere via Java Database Connectivity (JDBC) through Component Gateway.
Supported Java Version: 1.8
In order to use this library, you first need to go through the following steps:
- Set up a Cloud project
- Set up authentication

  Note that we recommend using service account login rather than end user login.
- Make sure Component Gateway is enabled for the cluster(s) you are trying to connect to within the project
- Make sure Hive is running in HTTP mode for the cluster(s)
For Step 4 above, we have provided an example initialization action file at gs://hive-http-mode-init-action/hive-http-config.sh that configures Hive to run in HTTP mode during cluster creation.
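As a sketch of what Steps 3 and 4 might look like together at cluster creation time (the cluster name and region below are placeholders; this assumes the gcloud CLI is installed and configured):

```shell
# Create a Dataproc cluster with Component Gateway enabled, using the
# example init action above to configure Hive to run in HTTP mode.
# "my-cluster" and "us-central1" are placeholders -- substitute your own.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --enable-component-gateway \
    --initialization-actions=gs://hive-http-mode-init-action/hive-http-config.sh
```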
- Clone this repo

  ```
  git clone https://github.com/GoogleCloudDataproc/dataproc-jdbc-connector.git
  cd dataproc-jdbc-connector
  ```

- Build the Dataproc Driver JAR

  ```
  ./gradlew -p jdbc-driver shadowJar
  ```

  Note that this step might take around 45 minutes. The compiled Dataproc Driver JAR will be at `dataproc-jdbc-connector/jdbc-driver/build/libs/jdbc-driver-1.0-SNAPSHOT-all.jar`.

- Build the example-client JAR

  ```
  ./gradlew -p example-client shadowJar
  ```
- Dataproc Driver for Hive accepts JDBC URL strings with the prefix `jdbc:dataproc://hive/`
- Clients can pass in a database name or leave it empty to use the default database:

  `jdbc:dataproc://hive/;` or `jdbc:dataproc://hive/dbName;`
- `projectId` and `region` are required parameters. To find your project and region, refer back to Set up a Cloud project in the Before you begin section above:

  `jdbc:dataproc://hive/;projectId={pid};region={region}`
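To illustrate the URL shape, a tiny helper (hypothetical, not part of the driver) that assembles a minimal URL from the two required parameters might look like:

```java
public class DataprocUrlExample {
    // Hypothetical helper: builds a minimal Dataproc JDBC URL from the
    // required projectId and region; dbName may be "" for the default database.
    static String buildUrl(String dbName, String projectId, String region) {
        return String.format("jdbc:dataproc://hive/%s;projectId=%s;region=%s",
                dbName, projectId, region);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("", "demo-dataproc", "us-central1"));
        // jdbc:dataproc://hive/;projectId=demo-dataproc;region=us-central1
    }
}
```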
- Cluster specific parameters: `clusterName` or `clusterPoolLabel`

  Clients can pick a particular cluster by setting either `clusterName` or `clusterPoolLabel`. Note that `clusterName` has a higher priority than `clusterPoolLabel`; if you pass in both parameters, the DataprocDriver will only look for the cluster by name.

  `clusterName` is an optional parameter that allows the client to specify the name of the cluster to connect to:

  `jdbc:dataproc://hive/;projectId={pid};region={region};clusterName={cluster-name}`

  `clusterPoolLabel` is an optional parameter that supports submitting to a cluster pool. Clients can pass in cluster pool labels, and the Dataproc Driver will pick a healthy cluster (a cluster with `status.state = ACTIVE`) within the pool to connect to. Please do not use `status.state` as one of the cluster pool labels.

  Labels can be specified in the format:

  `clusterPoolLabel=label1=value1[:label2=value2]`

  Example:

  `jdbc:dataproc://hive/;projectId={pid};region={region};clusterPoolLabel=com=google:team=dataproc`
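As an illustration of the `label1=value1[:label2=value2]` format, the following sketch (hypothetical, not the driver's actual code) splits a `clusterPoolLabel` value into individual labels:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LabelParserExample {
    // Hypothetical sketch: splits "label1=value1[:label2=value2]" into a
    // map of label names to values, preserving the order given.
    static Map<String, String> parseLabels(String labelParam) {
        Map<String, String> labels = new LinkedHashMap<>();
        for (String pair : labelParam.split(":")) {
            // Split on the first '=' only, so values may themselves contain '='
            String[] kv = pair.split("=", 2);
            labels.put(kv[0], kv[1]);
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(parseLabels("com=google:team=dataproc"));
        // {com=google, team=dataproc}
    }
}
```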
- DataprocDriver also accepts a semicolon separated list of session variables, Hive configuration variables, or Hive variables that Hive supports:

  `jdbc:dataproc://hive/{dbName};projectId={pid};region={region};clusterName={name};sess_var_list?hive_conf_list#hive_var_list`
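To make the `sess_var_list?hive_conf_list#hive_var_list` structure concrete, a sketch (hypothetical, not the driver's actual parsing code) of splitting that trailing portion on its `?` and `#` separators could look like:

```java
public class VarListSplitterExample {
    // Hypothetical sketch: splits "sess_var_list?hive_conf_list#hive_var_list"
    // into its three parts; any part may be empty if its separator is absent.
    static String[] splitVarLists(String tail) {
        String hiveVars = "";
        int hash = tail.indexOf('#');
        if (hash >= 0) {
            hiveVars = tail.substring(hash + 1);
            tail = tail.substring(0, hash);
        }
        String hiveConf = "";
        int q = tail.indexOf('?');
        if (q >= 0) {
            hiveConf = tail.substring(q + 1);
            tail = tail.substring(0, q);
        }
        return new String[] {tail, hiveConf, hiveVars};
    }

    public static void main(String[] args) {
        String[] parts = splitVarLists("transportMode=http?hive.exec.parallel=true#a=1");
        System.out.println(parts[0]); // transportMode=http
        System.out.println(parts[1]); // hive.exec.parallel=true
        System.out.println(parts[2]); // a=1
    }
}
```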
DataprocDriver uses Google OAuth 2.0 APIs for authentication and authorization.
To connect to a Dataproc cluster through Component Gateway, the Dataproc JDBC Driver includes an authentication token. For security reasons, it puts the token in the `Proxy-Authorization: Bearer` header.
To get the access token set up in the Before you begin step 2 section above, DataprocDriver uses Application Default Credentials.
```
# update the HADOOP_CLASSPATH to include the Dataproc JDBC Driver JAR
export HADOOP_CLASSPATH=`hadoop classpath`:{path-to-driver-jar}jdbc-driver-1.0-SNAPSHOT-all.jar

# tell beeline the class name for the driver using the "-d" option
beeline -d "com.google.cloud.dataproc.jdbc.DataprocDriver" -u "jdbc:dataproc://hive/;projectId={pid};region={region};{other-parameters}"
```
Example:

```
export HADOOP_CLASSPATH=`hadoop classpath`:/usr/local/home/Downloads/dataproc-jdbc-connector/jdbc-driver/build/libs/jdbc-driver-1.0-SNAPSHOT-all.jar
beeline -d "com.google.cloud.dataproc.jdbc.DataprocDriver" -u "jdbc:dataproc://hive/;projectId=demo-dataproc;region=us-central1;clusterName=demo-cluster"
```
We have provided an example client -- an example usage of the JDBC driver from Java that connects to Hive using our Dataproc Driver.
```java
Connection connection = DriverManager.getConnection(
        "jdbc:dataproc://hive/default;projectId=pid;region=us-central1;clusterName=my-cluster");
try (Statement stmt = connection.createStatement()) {
    ResultSet res = stmt.executeQuery("SHOW TABLES");
    // A ResultSet cursor is advanced with next(), not hasNext()
    while (res.next()) {
        System.out.println(res.getString(1));
    }
}
```
To run the example client:

```
# build example client JAR
./gradlew -p example-client shadowJar

# run the JAR; it will prompt you to enter the JDBC URL string
java -jar example-client/build/libs/example-client-1.0-SNAPSHOT-all.jar
```
- To run all unit tests, use:

  `./gradlew test`

- To run a specific unit test, use:

  `./gradlew test --tests="com.google.cloud.dataproc.jdbc.{test name}"`

- To run the system test, use:

  `./gradlew systemTest --tests="com.google.cloud.dataproc.jdbc.DataprocSystemTest" -DprojectId="{projectId}" -Dregion="{region}" -Duser="{user}"`

  Note that projectId and region are the parameters for the specific project you are working on; user can be an arbitrary username, used as a prefix for the names of the clusters created and deleted during testing, to avoid cluster conflicts.