nlp - Pig: Filter multi-column table on Set -


I have the following inputs:

    input = LOAD '$in_data' USING PigStorage('\t', '-schema') AS (
        uid:chararray,
        pid:int,
        token:chararray
    );
    stpwrd = LOAD '$stpwrd' USING PigStorage('\t', '-schema') AS (
        token:chararray
    );

My goal can be summarized in the following pseudo-code:

    output = FILTER input BY NOT IN(input.token, stpwrd);

which would ideally give the rows of the input table whose input.token field is not in stpwrd.

I checked the SetDifference() UDF in DataFu (link), but I am not sure it will do the job, since it seems to require both tables to be singletons, while my input table has multiple columns.

We can achieve the objective by using a right join and then filtering out the records that are present in stpwrd. The example below illustrates the usage.

input : input_data

    uid1    1   token1
    uid2    2   token2
    uid3    3   token3

input : stpwrd

    token1
    token2

pig script :

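    -- Load the multi-column input and the single-column stop-word list.
    input_data = LOAD 'input_data' USING PigStorage('\t') AS (uid:chararray, pid:int, token:chararray);
    stpwrd = LOAD 'stpwrd' USING PigStorage('\t') AS (token:chararray);

    -- Right outer join: every input_data row is kept; stpwrd::token is null when there is no match.
    output_data = JOIN stpwrd BY token RIGHT, input_data BY token;

    -- Keep only the rows whose token did not match any stop word.
    req_data = FILTER output_data BY stpwrd::token IS NULL;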

output : req_data

(,uid3,3,token3) 

Project the required fields from the req_data alias.
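As a minimal sketch of that projection, assuming the aliases from the script above: after the join, each field carries its source relation as a prefix, so the input columns can be selected with input_data:: and renamed. The alias final_data is a hypothetical name chosen here for illustration.

    -- Drop the empty stpwrd::token column and keep only the original input fields.
    -- final_data is a hypothetical alias, not part of the original script.
    final_data = FOREACH req_data GENERATE
        input_data::uid   AS uid,
        input_data::pid   AS pid,
        input_data::token AS token;

    DUMP final_data;

With the sample inputs above, this should leave only the uid3 row, without the leading empty field.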

