nlp - Pig: Filter multi-column table on Set
I have the following inputs:
input = LOAD '$in_data' USING PigStorage('\t', '-schema') AS (uid:chararray, pid:int, token:chararray);
stpwrd = LOAD '$stpwrd' USING PigStorage('\t', '-schema') AS (token:chararray);
My goal can be summarized in the following pseudo-code:
output = FILTER input BY NOT IN(input.token, stpwrd);
which ideally gives all rows in the input table whose input.token field is not in stpwrd.
I checked the SetDifference() UDF in DataFu (link), but I am not sure it will do the job, since it seems to require both tables to be singletons, while my input table has multiple columns.
We can achieve this objective by using a right join and then filtering out the records that are present in stpwrd. The example below illustrates the usage.
Input: input_data
uid1 1 token1
uid2 2 token2
uid3 3 token3
Input: stpwrd
token1
token2
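Pig script:
input_data = LOAD 'input_data' USING PigStorage('\t') AS (uid:chararray, pid:int, token:chararray);
stpwrd = LOAD 'stpwrd' USING PigStorage('\t') AS (token:chararray);
-- Right outer join keeps every row of input_data; stpwrd::token is null where no stop word matched.
output_data = JOIN stpwrd BY token RIGHT, input_data BY token;
-- Keep only the rows whose token was not found in stpwrd.
req_data = FILTER output_data BY stpwrd::token IS NULL;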
Output: req_data
(,uid3,3,token3)
Then project the required fields from the req_data alias.
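A minimal sketch of that projection step, assuming the aliases from the script above. The name final_data is only illustrative and not part of the original answer. After the join, the fields coming from input_data are addressed with the input_data:: prefix, and the empty stpwrd::token column is dropped:

```pig
-- final_data is a hypothetical alias; req_data comes from the script above.
-- Keep only the three original input columns and drop the null stpwrd::token.
final_data = FOREACH req_data GENERATE
    input_data::uid   AS uid,
    input_data::pid   AS pid,
    input_data::token AS token;

DUMP final_data;
```

With the sample data above, this should leave the single row for uid3 without the leading empty field.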