sql - Efficient way to find customer entries in two large streams of data


If I have a data stream that gives me 10 million records a day (stream A), and one that gives me 1 billion a day (stream B), what is an efficient way to see if there is any overlap in the data?

More specifically, if there is a customer in stream A that visits a webpage, and the same customer visits a different webpage in stream B, how can I tell that the customer visited both webpages?

My initial thought was to put the records into a relational database and do a join, but I know that is inefficient.

What is a more efficient way to do this? How would I be able to do it using a tool like Hadoop or Spark?

A join should be an efficient way of dealing with this. You should have both data sets ordered, or have an index on customerid in each (the index itself is ordered by customerid). Because of the indexing, the SQL engine will know the sets are ordered and should be able to do the join efficiently.
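To make the point about ordered inputs concrete, here is a minimal sketch of a merge join over two lists of customer IDs that are already sorted. It is only an illustration of why ordering helps, not a description of what any particular SQL engine does internally; the function name `merge_join_ids` and the sample IDs are made up for this example.

```python
def merge_join_ids(sorted_a, sorted_b):
    """Yield each customer ID present in both sorted inputs, once.

    Both inputs must already be sorted ascending by customer ID.
    Each input is read a single time, front to back.
    """
    exhausted = object()  # marker meaning "this input has no more items"
    it_a, it_b = iter(sorted_a), iter(sorted_b)
    a = next(it_a, exhausted)
    b = next(it_b, exhausted)
    last = exhausted  # last ID emitted, used to suppress duplicates

    while a is not exhausted and b is not exhausted:
        if a < b:
            a = next(it_a, exhausted)   # a is too small, advance stream A
        elif a > b:
            b = next(it_b, exhausted)   # b is too small, advance stream B
        else:
            if last is exhausted or a != last:
                yield a                 # ID found in both streams
                last = a
            a = next(it_a, exhausted)
            b = next(it_b, exhausted)


# Hypothetical sample data: customer IDs from each stream, pre-sorted.
stream_a_ids = [3, 7, 7, 12, 40]
stream_b_ids = [1, 3, 3, 12, 12, 99]
common_ids = list(merge_join_ids(stream_a_ids, stream_b_ids))
print(common_ids)  # [3, 12]
```

Because each side is scanned once, the cost grows with the combined size of the two inputs rather than with their product, which is the reason the ordering (or an ordered index) matters.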

If you're only looking for instances of customerid that appear in both, it might be a SQL query along the lines of:

select distinct a.customerid
from a
inner join b
    on a.customerid = b.customerid
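A join of this kind can be tried end to end on a small in-memory SQLite database using Python's standard `sqlite3` module. This is a sketch under assumptions: the tables `a` and `b` stand in for the two streams, the `page` column, the index names, and the sample rows are invented for illustration, and SQLite is used only because it needs no setup, not because it suits data of this volume.

```python
import sqlite3


def customers_in_both(rows_a, rows_b):
    """Return sorted customer IDs that appear in both row sets.

    rows_a and rows_b are iterables of (customerid, page) tuples.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE a (customerid INTEGER, page TEXT)")
        conn.execute("CREATE TABLE b (customerid INTEGER, page TEXT)")
        # Indexes on customerid give the engine an ordered access path.
        conn.execute("CREATE INDEX idx_a_customerid ON a (customerid)")
        conn.execute("CREATE INDEX idx_b_customerid ON b (customerid)")
        conn.executemany("INSERT INTO a VALUES (?, ?)", rows_a)
        conn.executemany("INSERT INTO b VALUES (?, ?)", rows_b)
        cursor = conn.execute(
            "SELECT DISTINCT a.customerid "
            "FROM a INNER JOIN b ON a.customerid = b.customerid "
            "ORDER BY a.customerid"
        )
        return [row[0] for row in cursor]
    finally:
        conn.close()


# Hypothetical sample visits: (customerid, page).
visits_a = [(1, "/home"), (2, "/pricing"), (2, "/docs"), (5, "/home")]
visits_b = [(2, "/blog"), (3, "/blog"), (5, "/signup"), (5, "/faq")]
both = customers_in_both(visits_a, visits_b)
print(both)  # [2, 5]
```

The `DISTINCT` matters here: customers 2 and 5 each have several matching row pairs, and without it the same ID would be returned once per pair.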
