sql - Efficient way to find customer entries in two large streams of data

August 15, 2011

If I have a data stream that gives me 10 million records a day (stream A), and another that gives me 1 billion a day (stream B), is there an efficient way to see if there is any overlap in the data?

More specifically, if a customer in stream A visits a webpage, and the same customer visits a different webpage in stream B, how can I tell that the customer visited both webpages?

My initial thought was to put the records into a relational database and do a join, but I know that would be inefficient.

What is a more efficient way to do this? How would I be able to do this using a tool like Hadoop or Spark?

A join should be an efficient way of dealing with this. You should have both data sets ordered, or have an index on customerid (an index is ordered by customerid). Because of the indexing, the SQL engine will know the sets are ordered and should be able to do the join efficiently.
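To illustrate why ordering helps, here is a minimal Python sketch of a merge-style intersection over two inputs that are already sorted by customer ID. It is not what any particular SQL engine does internally; the function name `common_customer_ids` and the use of integer IDs are assumptions made for the example. Each input is read once, front to back, so neither stream has to be held in memory.

```python
from typing import Iterable, Iterator


def common_customer_ids(sorted_a: Iterable[int], sorted_b: Iterable[int]) -> Iterator[int]:
    """Yield each customer ID found in both inputs, once.

    Assumes both inputs are sorted ascending by customer ID and that
    None never appears as an ID (it is used as the end-of-input marker).
    """
    it_a, it_b = iter(sorted_a), iter(sorted_b)
    a = next(it_a, None)
    b = next(it_b, None)
    last = None  # last ID emitted, used to skip repeated matches
    while a is not None and b is not None:
        if a < b:
            a = next(it_a, None)   # a is behind: advance stream A
        elif a > b:
            b = next(it_b, None)   # b is behind: advance stream B
        else:
            if a != last:          # emit each matching ID only once
                yield a
                last = a
            a = next(it_a, None)
            b = next(it_b, None)


if __name__ == "__main__":
    stream_a = [1, 2, 2, 5, 9]
    stream_b = [2, 2, 2, 3, 5, 5, 10]
    print(list(common_customer_ids(stream_a, stream_b)))
```

The work grows with the combined length of the two inputs, which is the property the answer relies on when it asks for ordered data or an index.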

If you're just looking for instances of customerid that are in both, you might use a SQL query along the lines of:

    select distinct a.customerid
    from a
    inner join b
        on a.customerid = b.customerid
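For a quick local check of a query like this, the following sketch uses Python's built-in `sqlite3` module with an in-memory database. The table names `a` and `b` and the `customerid` column follow the query; the `page` column, the index names, and the helper `customers_in_both` are hypothetical, and the `order by` is added only to make the result order predictable. It is a toy setup for trying the join, not a way to handle a billion rows a day.

```python
import sqlite3


def customers_in_both(rows_a, rows_b):
    """Load (customerid, page) rows into tables a and b and return shared IDs."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema: only customerid is named in the original query.
        conn.execute("create table a (customerid integer, page text)")
        conn.execute("create table b (customerid integer, page text)")
        # Index both tables on the join key, as the answer suggests.
        conn.execute("create index idx_a_customerid on a (customerid)")
        conn.execute("create index idx_b_customerid on b (customerid)")
        conn.executemany("insert into a values (?, ?)", rows_a)
        conn.executemany("insert into b values (?, ?)", rows_b)
        cursor = conn.execute(
            "select distinct a.customerid "
            "from a inner join b on a.customerid = b.customerid "
            "order by a.customerid"
        )
        return [row[0] for row in cursor]
    finally:
        conn.close()


if __name__ == "__main__":
    stream_a = [(1, "/home"), (2, "/home"), (2, "/home")]
    stream_b = [(2, "/pricing"), (3, "/pricing"), (2, "/pricing")]
    print(customers_in_both(stream_a, stream_b))
```

Prefixing the select with `explain query plan` in SQLite is one way to see whether the indexes are actually used for the join.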
