hadoop - Spark Streaming - HBase Bulk Load
I'm using Python to bulk load CSV data into an HBase table, and I'm having trouble writing the appropriate HFiles using saveAsNewAPIHadoopFile.
My code looks as follows:
def csv_to_key_value(row):
    cols = row.split(",")
    result = ((cols[0], [cols[0], "f1", "c1", cols[1]]),
              (cols[0], [cols[0], "f2", "c2", cols[2]]),
              (cols[0], [cols[0], "f3", "c3", cols[3]]))
    return result

def bulk_load(rdd):
    conf = {...}  # omitted to simplify
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)

    if not load_rdd.isEmpty():
        load_rdd.saveAsNewAPIHadoopFile("file:///tmp/hfiles" + startTime,
                                        "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
                                        conf=conf,
                                        keyConverter=keyConv,
                                        valueConverter=valueConv)
    else:
        print("Nothing to process")
When I run the code, I get the following error:
java.io.IOException: Added a key not lexically larger than previous. Current cell = 10/f1:c1/1453891407213/Minimum/vlen=1/seqid=0, lastCell = /f1:c1/1453891407212/Minimum/vlen=1/seqid=0
    at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204)
Since the error indicates that the key is the problem, I grabbed the elements from the RDD, and they are as follows (formatted for readability):
[(u'1', [u'1', 'f1', 'c1', u'a']), (u'1', [u'1', 'f2', 'c2', u'1a']), (u'1', [u'1', 'f3', 'c3', u'10']), (u'2', [u'2', 'f1', 'c1', u'b']), (u'2', [u'2', 'f2', 'c2', u'2b']), (u'2', [u'2', 'f3', 'c3', u'9']),
. . .
(u'9', [u'9', 'f1', 'c1', u'i']), (u'9', [u'9', 'f2', 'c2', u'3c']), (u'9', [u'9', 'f3', 'c3', u'2']), (u'10', [u'10', 'f1', 'c1', u'j']), (u'10', [u'10', 'f2', 'c2', u'1a']), (u'10', [u'10', 'f3', 'c3', u'1'])]
This is a perfect match for my CSV, and in the correct order. As far as I understand, in HBase a key is defined as {row, family, timestamp}. The row and family combination is unique and monotonically increasing for all entries in my data, and I have no control over the timestamp (which is the only problem I can imagine).
Can anyone advise me on how to avoid/prevent such problems?
Well, this was a silly error on my part, and I feel a bit foolish. Lexicographically, the order should be 1, 10, 2, 3... 8, 9. The easiest way to guarantee correct ordering before loading is:
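To see the ordering problem in isolation, here is a quick illustration in plain Python (not part of the original job) of how string row keys sort:

sorted(["1", "2", "9", "10"])
# ['1', '10', '2', '9']  -- "10" sorts before "2" because keys are compared as strings, not numbers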
rdd.sortByKey(True)
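For context, a minimal sketch of how the sort could slot into the bulk_load pipeline above (csv_to_key_value and the RDD are the same as in the question; this is just the transformation chain with the added sort):

load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
              .flatMap(csv_to_key_value)\
              .sortByKey(True)  # write keys in lexicographic order so HFileOutputFormat2 accepts them

Note that sortByKey triggers a shuffle, but it guarantees the monotonically increasing key order that the HFile writer requires.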
I hope this can save at least one person the headaches I had.