sql - Efficient way to find customer entries in two large streams of data
If I have a data stream that gives me 10 million records a day (stream A), and another that gives me 1 billion a day (stream B), what is an efficient way to see if there is overlap in the data?
More specifically, if there is a customer in stream A who visits a webpage, and the same customer visits a different webpage in stream B, how can I tell that the customer visited both webpages?
My initial thought was to put the records into a relational database and do a join, but I know that is inefficient.
What is a more efficient way to do this? How would I be able to do it using a tool like Hadoop or Spark?
A join should be an efficient way of dealing with this. You should have both data sets ordered, or have an index on customerid (an index is ordered by customerid). Because of the indexing, the SQL engine will know the sets are ordered and should be able to do the join efficiently.
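To illustrate why ordered inputs help, here is a minimal Python sketch of a sort-merge style intersection. It is not how any particular SQL engine is implemented; the function name `common_customer_ids` is hypothetical, and it assumes both inputs are already sorted by customer ID and contain no `None` values. Each input is scanned once, so the cost grows with the combined length of the two inputs rather than with their product.

```python
def common_customer_ids(sorted_a, sorted_b):
    """Yield each customer ID present in both sorted inputs, once each.

    Assumes both iterables are sorted ascending and contain no None values.
    """
    it_a, it_b = iter(sorted_a), iter(sorted_b)
    a = next(it_a, None)
    b = next(it_b, None)
    last = None  # last ID emitted, used to skip duplicates
    while a is not None and b is not None:
        if a < b:
            a = next(it_a, None)   # advance the side with the smaller ID
        elif a > b:
            b = next(it_b, None)
        else:
            if a != last:
                yield a
                last = a
            a = next(it_a, None)
            b = next(it_b, None)


# Made-up sample IDs, for illustration only.
result = list(common_customer_ids([1, 1, 2, 5], [1, 2, 2, 5, 7]))
print(result)  # [1, 2, 5]
```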
If you're only looking for instances of customerid that are in both, you might use a SQL query along the lines of:
select distinct a.customerid from a inner join b on a.customerid = b.customerid
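As a rough way to try the query out, the sketch below runs it against an in-memory SQLite database from Python. The table names `a` and `b` follow the query above; the `page` column, the index names, and the sample rows are assumptions made up for illustration, and a real deployment at this volume would need its own schema and loading strategy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table per stream, each indexed on customerid.
conn.executescript("""
    CREATE TABLE a (customerid INTEGER, page TEXT);
    CREATE TABLE b (customerid INTEGER, page TEXT);
    CREATE INDEX idx_a_customerid ON a (customerid);
    CREATE INDEX idx_b_customerid ON b (customerid);
""")

# Made-up sample rows standing in for the two streams.
conn.executemany("INSERT INTO a VALUES (?, ?)",
                 [(1, "/home"), (2, "/home"), (3, "/home")])
conn.executemany("INSERT INTO b VALUES (?, ?)",
                 [(2, "/pricing"), (3, "/pricing"), (3, "/pricing"), (4, "/pricing")])

rows = conn.execute("""
    SELECT DISTINCT a.customerid
    FROM a
    INNER JOIN b ON a.customerid = b.customerid
    ORDER BY a.customerid
""").fetchall()

common_ids = [row[0] for row in rows]
print(common_ids)  # [2, 3]
```

Prefixing the query with `EXPLAIN QUERY PLAN` in SQLite shows whether the indexes are actually used for the join.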