Mongo to HDFS Hive migration
We keep this page for reference.
This document sums up the necessary steps to migrate data from MongoDB storage into HDFS and then use Hive to query it. First, we need to inject data into MongoDB. This can be done either by using pymongo directly or via the testClient tool from WMArchive. The latter assumes you have set up WMArchive as a service; it reads a local file and replicates it N times into MongoDB. Here is the command you need to run in your shell:
testClient.py --json=./test/data/fwjr_processing.json --ntimes=100
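If you prefer to bypass the data-service and talk to MongoDB directly, a minimal pymongo sketch could look as follows. This assumes the JSON file contains a single FWJR document; the port 8230 and the fwjr/db database/collection names are taken from the mongo2avro output shown below, so adjust them to your setup.
# direct insertion with pymongo, bypassing the WMArchive data-service (sketch)
import json
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:8230')
coll = client['fwjr']['db']  # database/collection names as in the mongo2avro output below
with open('./test/data/fwjr_processing.json') as istream:
    doc = json.load(istream)
# replicate the document 100 times; dict(doc) gives each copy its own _id
coll.insert_many([dict(doc) for _ in range(100)])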
In both cases we insert 100 copies of the given JSON file into MongoDB (the testClient route goes through the WMArchive data-service). Next we'll create an avro file which will contain all of these documents. For that purpose we'll use the mongo2avro tool:
# create output area where avro files will be stored
mkdir -p tmp/data
# create migration area where accumulated avro files will be transferred once they reach a given threshold
mkdir -p tmp/migrate
# run the migration script. It reads all records from MongoDB and puts them into a single avro file in the tmp/data area.
# Once the file reaches the threshold (256MB by default, adjustable via the --thr option) the avro file will be
# moved into the migration area.
mongo2avro --mongo=mongodb://localhost:8230 --schema=test/data/fwjr_processing.avsc --odir=./tmp/data --mdir=./tmp/migrate --verbose
# here is an example of the output of the aforementioned command
[03/Feb/2016:13:16:59] 1454523419.0 MongoStorage mongodb://localhost:8230
[03/Feb/2016:13:16:59] 1454523419.0 MongoStorage Collection(Database(MongoClient(host=['localhost:8230'], document_class=dict, tz_aware=False, connect=True, w=1), u'fwjr'), u'db')
[03/Feb/2016:13:16:59] 1454523419.0 AvroStorage avroio:test/data/fwjr_processing.avsc
[03/Feb/2016:13:17:05] 1454523425.0 mongo2avro 1000 docs ./tmp/data/20160203_181659.avro 7.2MB (7196498 bytes) RSS:66.3MB
[03/Feb/2016:13:17:05] 1454523425.0 mongo2avro wrote 1000 docs ./tmp/data/20160203_181659.avro 7.2MB (7211757 bytes)
In the end we'll have the tmp/data/20160203_181659.avro file with all 1000 records.
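To double-check the produced file, the records can be read back with any generic avro reader. Here is a small sketch using the fastavro package (not part of WMArchive, just one possible choice):
# count the records stored in the produced avro file (sketch using fastavro)
from fastavro import reader

with open('tmp/data/20160203_181659.avro', 'rb') as istream:
    nrecords = sum(1 for _ in reader(istream))
print('found %s records' % nrecords)  # expect 1000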
This data can be migrated into HDFS either by using hadoop tools, e.g.
hadoop fs -mkdir /path/hdfs/data
hadoop fs -put tmp/data/20160203_181659.avro /path/hdfs/data
or via the migrate2hdfs tool.
Once you have migrated your data to HDFS, you can use Hive to query it. Here is our initial setup:
# on HDFS we have data area
hadoop fs -ls /path/hdfs/data
# generate avro schema file
json2avsc --fin=fwjr.json --fout=fwjr.avsc
# put our avro schema file to HDFS
hadoop fs -put fwjr.avsc /path/hdfs/
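Before uploading the schema it may be worth a quick sanity check that the generated fwjr.avsc is a valid avro schema. Here is a sketch with the fastavro package (an assumption; any avro library will do):
# validate the generated avro schema before putting it on HDFS (sketch using fastavro)
from fastavro.schema import load_schema

schema = load_schema('fwjr.avsc')  # raises an exception if the schema is invalid
print('schema OK')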
Now we are ready to start Hive:
shell# hive
# create new database
hive> create database cms_wm_archive location '/path/hdfs/arch.db';
# use our database
hive> use cms_wm_archive;
# create an external table pointing to our data files
hive> create external table wmarchive
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.avro.AvroSerDe"
STORED AS INPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat"
LOCATION "/path/hdfs/data/"
TBLPROPERTIES("avro.schema.url"="hdfs:///path/hdfs/fwjr.avsc");
# query our data
hive> select jobid from wmarchive;
Voila! It works!
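The same table can also be queried programmatically, e.g. via HiveServer2 with the PyHive package (an assumption, not part of WMArchive; the hostname and port below are placeholders):
# query the wmarchive table through HiveServer2 (sketch using PyHive)
from pyhive import hive

conn = hive.Connection(host='hiveserver2.example.com', port=10000, database='cms_wm_archive')
cursor = conn.cursor()
cursor.execute('SELECT jobid FROM wmarchive LIMIT 10')
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()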