nlp - Pig: Filter multi-column table on Set
I have the following inputs:

    input = LOAD '$in_data' USING PigStorage('\t', '-schema') AS (uid:chararray, pid:int, token:chararray);
    stpwrd = LOAD '$stpwrd' USING PigStorage('\t', '-schema') AS (token:chararray);

My goal can be summarized in the following pseudo-code:

    output = FILTER input BY NOT IN(input.token, stpwrd);

which would ideally give me the rows of the input table whose input.token field is not in stpwrd.

I checked the SetDifference() UDF in DataFu (link), but I am not sure it will do the job, since it seems to require both tables to be singleton (single-column), while my input table has multiple columns.
We can achieve the objective by using a right join and then filtering out the records that are present in stpwrd. The example below illustrates the usage.

Input: input_data

    uid1    1    token1
    uid2    2    token2
    uid3    3    token3

Input: stpwrd

    token1
    token2

Pig script:

    input_data = LOAD 'input_data' USING PigStorage('\t') AS (uid:chararray, pid:int, token:chararray);
    stpwrd = LOAD 'stpwrd' USING PigStorage('\t') AS (token:chararray);
    output_data = JOIN stpwrd BY token RIGHT, input_data BY token;
    req_data = FILTER output_data BY stpwrd::token IS NULL;

Output: req_data

    (,uid3,3,token3)

Then project the required fields from the req_data alias.
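The final projection step could look like the minimal sketch below. It assumes the aliases and schema from the script above; the alias `result` and the output path 'req_output' are placeholders introduced here for illustration only.

```pig
-- Keep only the columns that came from input_data. After the filter,
-- stpwrd::token is null in every remaining row, so it is dropped here.
result = FOREACH req_data GENERATE
    input_data::uid   AS uid,
    input_data::pid   AS pid,
    input_data::token AS token;

-- 'req_output' is a hypothetical output path, not part of the original script.
STORE result INTO 'req_output' USING PigStorage('\t');
```

For the sample data this should leave only the uid3 row. The `input_data::` prefix is needed for token because both joined relations have a field with that name. The uid and pid fields are unambiguous, so the prefix on them is only there for consistency.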