Batch CDAP HBase Table to Database Table Application Configuration

The built-in cdap-data-pipeline system artifact can be used to create a data pipeline application that reads from a batch source and persists the data to a sink. In this example, we will read an entire CDAP HBase table in batch mode and use a Database sink to write the table's rows to a database table.

The file config.json contains a sample application configuration that you can use to accomplish the above task. Our sample application uses these components:

  • The cdap-data-pipeline system artifact, since we want to run the pipeline in batch mode
  • Table source, to read data from the CDAP HBase table
  • Database sink, to write the data from the CDAP HBase table to a database table
  • A JAR file containing the JDBC driver for your database. With this, you will need a JSON file that describes the JDBC driver as an external plugin. See mysql-connector-java-5.1.35.json and postgresql-9.4.json as examples (a minimal sketch follows this list).
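
Such an external plugin JSON file might look like the following sketch, modeled on the PostgreSQL example (the parent artifact version range and the description text are assumptions; adjust them to match your CDAP version and the actual example files):

{
  "parents": [ "system:cdap-data-pipeline[4.0.0,10.0.0)" ],
  "plugins": [
    {
      "name": "postgresql",
      "type": "jdbc",
      "className": "org.postgresql.Driver",
      "description": "Plugin for the PostgreSQL JDBC driver"
    }
  ]
}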

You can create and start the application by using the CDAP CLI (or you can use the Cask Hydrator UI for a more visual approach).

Notes:

  • You need to fill in the configurations described below in a file such as config.json before creating the application.
  • If you want to import the config.json into the Cask Hydrator UI, you will need to modify it to include an artifact property describing the system artifact being used (see the sketch below). You can create an initial application as described here using the CLI and then clone it in the UI to develop it further.
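
For example, the top level of config.json would gain a property along these lines (a sketch; replace the version with your CDAP version):

"artifact": {
  "name": "cdap-data-pipeline",
  "version": "<version>",
  "scope": "SYSTEM"
}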

Configurations for the CDAP HBase Table Source

  1. name: The name of the HBase table to use as a source.
  2. schema: The JSON representation of the HBase table's schema (a sample follows this list).
  3. schema.row.field: Optional field name indicating that the field value should come from the row key instead of a row column. The field name specified must be present in the schema, and must not be nullable.
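
As an illustration of the schema and schema.row.field properties, a two-column table whose row key supplies an id field might use a schema like this sketch (the field names and types are hypothetical):

{
  "type": "record",
  "name": "etlSchemaBody",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "price", "type": "double" }
  ]
}

With schema.row.field set to id, the id field is populated from each row's key; note that it is declared non-nullable, as required.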

Configurations for the Database Table Sink

  1. referenceName: This will be used to uniquely identify this sink for lineage, annotating metadata, etc.
  2. connectionString: This is the JDBC connection string that includes the database name.
  3. tableName: This is the table to which you wish to write.
  4. user: This is the username used to connect to the specified database. It is required for databases that need authentication, optional for those that do not.
  5. password: This is the password used to connect to the specified database. It is required for databases that need authentication, optional for those that do not.
  6. columns: This is a comma-separated list of columns in the database table to which data from the HBase table will be written.
  7. jdbcPluginName: The name of the external JDBC plugin. This is the value of the name field in the external plugin's JSON configuration file. Defaults to 'jdbc'. Examples of external plugins are in the files mysql-connector-java-5.1.35.json and postgresql-9.4.json. Refer to the CDAP documentation on external plugins for more details.
  8. jdbcPluginType: The type of the external JDBC plugin. This is the value of the type field in the external plugin's JSON configuration file. Defaults to 'jdbc'. See the same example files and documentation as for jdbcPluginName.
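
Putting the source and sink together, the relevant portion of config.json might look like the following sketch (the stage layout can vary between CDAP versions, and the connection string, credentials, and column list are hypothetical placeholders):

{
  "config": {
    "stages": [
      {
        "name": "tableSource",
        "plugin": {
          "name": "Table",
          "type": "batchsource",
          "properties": {
            "name": "sourceTable",
            "schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"price\",\"type\":\"double\"}]}",
            "schema.row.field": "id"
          }
        }
      },
      {
        "name": "dbSink",
        "plugin": {
          "name": "Database",
          "type": "batchsink",
          "properties": {
            "referenceName": "dbSink",
            "connectionString": "jdbc:postgresql://localhost:5432/mydb",
            "tableName": "dest_db_table",
            "user": "postgres",
            "password": "<password>",
            "columns": "id,price",
            "jdbcPluginName": "postgresql",
            "jdbcPluginType": "jdbc"
          }
        }
      }
    ],
    "connections": [
      { "from": "tableSource", "to": "dbSink" }
    ]
  }
}

Note that the schema is passed as an escaped JSON string; it matches the schema sketch in the source section above.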

Creating a Hydrator Application using the CDAP CLI

Add the JDBC driver as a plugin artifact available to cdap-data-pipeline:

cdap> load artifact </path/to/driver.jar> config-file </path/to/config.json>

For example, to use PostgreSQL:

cdap> load artifact /path/to/postgresql-9.4-1203.jdbc41.jar config-file CDAPTableToDBTable/postgresql-9.4.json
Successfully added artifact with name 'postgresql'

Modify config.json to match your database settings, then create an application named dbExport (replace <version> with your CDAP version):

cdap> create app dbExport cdap-data-pipeline <version> system CDAPTableToDBTable/config.json
Successfully created application

Then start the workflow:

cdap> start workflow dbExport.DataPipelineWorkflow
Successfully started workflow 'DataPipelineWorkflow' of application 'dbExport' with stored runtime arguments '{}'

This will run the workflow once. To schedule the workflow to run periodically:

cdap> resume schedule dbExport.dataPipelineSchedule
Successfully resumed schedule 'dataPipelineSchedule' in app 'dbExport'

To verify that the data has been written to the database table, execute this SQL query on your database:

select * from dest_db_table

You have now successfully created an application that reads from a CDAP HBase Table and writes to a database table.

Suspending and Deleting

You can suspend and delete the application using the CDAP CLI:

cdap> suspend schedule dbExport.dataPipelineSchedule
Successfully suspended schedule 'dataPipelineSchedule' in app 'dbExport'

cdap> delete app dbExport
Successfully deleted application 'dbExport'

Share and Discuss!

Have a question? Discuss at the CDAP User Mailing List.

License

Copyright © 2015-2016 Cask Data, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.