hadoop - Spark Streaming - HBase Bulk Load
I'm using Python to bulk load CSV data into an HBase table, and I'm having trouble writing the appropriate HFiles using saveAsNewAPIHadoopFile.
My code looks as follows:
def csv_to_key_value(row):
    cols = row.split(",")
    result = ((cols[0], [cols[0], "f1", "c1", cols[1]]),
              (cols[0], [cols[0], "f2", "c2", cols[2]]),
              (cols[0], [cols[0], "f3", "c3", cols[3]]))
    return result

def bulk_load(rdd):
    conf = {...}  # omitted to simplify
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)

    if not load_rdd.isEmpty():
        load_rdd.saveAsNewAPIHadoopFile("file:///tmp/hfiles" + startTime,
                                        "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
                                        conf=conf,
                                        keyConverter=keyConv,
                                        valueConverter=valueConv)
    else:
        print("Nothing to process")
When I run the code, I get the following error:
java.io.IOException: Added a key not lexically larger than previous. Current cell = 10/f1:c1/1453891407213/Minimum/vlen=1/seqid=0, lastCell = /f1:c1/1453891407212/Minimum/vlen=1/seqid=0
    at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204)
Since the error indicates that the key is the problem, I grabbed the elements from the RDD, and they are as follows (formatted for readability):
[(u'1', [u'1', 'f1', 'c1', u'a']), (u'1', [u'1', 'f2', 'c2', u'1a']), (u'1', [u'1', 'f3', 'c3', u'10']), (u'2', [u'2', 'f1', 'c1', u'b']), (u'2', [u'2', 'f2', 'c2', u'2b']), (u'2', [u'2', 'f3', 'c3', u'9']),
. . .
(u'9', [u'9', 'f1', 'c1', u'i']), (u'9', [u'9', 'f2', 'c2', u'3c']), (u'9', [u'9', 'f3', 'c3', u'2']), (u'10', [u'10', 'f1', 'c1', u'j']), (u'10', [u'10', 'f2', 'c2', u'1a']), (u'10', [u'10', 'f3', 'c3', u'1'])]
This is a perfect match for my CSV, and in the correct order. As far as I understand, in HBase a key is defined as {row, family, timestamp}. The row and family combination is unique and monotonically increasing for all entries in my data, and I have no control over the timestamp (which is the only problem I can imagine).
Can anyone advise me on how to avoid/prevent such problems?
Well, this was a silly error on my part, and I feel a bit foolish. Lexicographically, the order should be 1, 10, 2, 3... 8, 9. The easiest way to guarantee correct ordering before loading is:
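To see the ordering problem in isolation, here is a quick illustration in plain Python (not part of the original job) of how string row keys sort:

sorted(["1", "2", "9", "10"])
# ['1', '10', '2', '9']  -- "10" sorts before "2" because keys are compared as strings, not numbers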
rdd.sortByKey(True)
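For context, a minimal sketch of how the sort could slot into the bulk_load pipeline above (csv_to_key_value and the RDD are the same as in the question; this is just the transformation chain with the added sort):

load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
              .flatMap(csv_to_key_value)\
              .sortByKey(True)  # write keys in lexicographic order so HFileOutputFormat2 accepts them

Note that sortByKey triggers a shuffle, but it guarantees the monotonically increasing key order that the HFile writer requires.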
I hope this can save at least one person the headaches I had.