etl - Complex join with Google Dataflow


I'm a newbie, trying to understand how we might re-write a batch ETL process in Google Dataflow. I've read some of the docs and run a few examples.

I'm proposing that the new ETL process be driven by business events (i.e. a source PCollection). These would trigger the ETL process for that particular business entity. The ETL process would extract datasets from the source systems, then pass those results (PCollections) on to the next processing stage. The processing stages would involve various types of joins (including cartesian and non-key joins, e.g. date-banded).

So, a couple of questions here:

(1) Is the approach I'm proposing valid and efficient? If not, what would be better? I haven't seen any presentations on real-world complex ETL processes using Google Dataflow, only simple scenarios.

Are there any "higher-level" ETL products that are a better fit? I've been keeping an eye on Spark and Flink for a while.

Our current ETL is moderately complex, though there are 30 core tables (classic EDW dimensions and facts) and ~1000 transformation steps. The source data is complex (roughly 150 Oracle tables).

(2) How would the complex non-key joins be handled?

I'm attracted to Google Dataflow because it is an API first and foremost, and its parallel processing capabilities seem a good fit (we are being asked to move from overnight batch to incremental processing).

A worked example of Dataflow for this use case would push adoption forward!

Thanks, Mike S

It sounds like Dataflow would be a good fit. It allows you to write a pipeline that takes a PCollection of business events and performs the ETL. The pipeline can be either batch (executed periodically) or streaming (executed whenever input data arrives).

The various joins are, for the most part, relatively expressible in Dataflow. For the cartesian product, you can look at using side inputs to make the contents of one PCollection available as an input to the processing of each element in another PCollection.
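To make the side-input idea concrete, here is a minimal sketch in plain Python rather than the Dataflow SDK. It only models the behaviour: the "side" collection is materialised once and offered to the processing of every "main" element, which yields a cartesian product that an optional predicate can narrow to a non-key join. The function name, the `keep` parameter, and the sample data are all hypothetical.

```python
from datetime import date

def cross_join_with_side_input(main, side, keep=lambda m, s: True):
    """Pair every main element with every side element that passes `keep`.

    This is a plain-Python model, not a Dataflow API: `side` plays the role
    of a side input that is fully available while each main element is
    processed.
    """
    side_list = list(side)  # materialised once, like a side input
    for m in main:
        for s in side_list:
            if keep(m, s):
                yield (m, s)

# Hypothetical data: business events and date-banded rates.
events = [("order-1", date(2015, 3, 10)), ("order-2", date(2015, 7, 1))]
bands = [
    ("rate-A", date(2015, 1, 1), date(2015, 6, 30)),
    ("rate-B", date(2015, 7, 1), date(2015, 12, 31)),
]

# Full cartesian product: every event paired with every band.
all_pairs = list(cross_join_with_side_input(events, bands))

# Non-key (date-banded) join: keep a pair only if the event date is in the band.
def in_band(event, band):
    return band[1] <= event[1] <= band[2]

matched = [(e[0], b[0]) for e, b in cross_join_with_side_input(events, bands, in_band)]
```

This pattern assumes the side collection is small enough to hold in memory on each worker; when both inputs are large, a keyed join is usually the better option.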

You can also look at using GroupByKey or CoGroupByKey to implement the joins. These flatten multiple inputs and allow accessing all values with the same key in one place. You can also use Combine.perKey to compute associative and commutative combinations of all the elements associated with a key (e.g. sum, min, max, average).
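As a rough illustration of what these keyed transforms produce, the sketch below models them in plain Python. It is not the Dataflow SDK: `co_group_by_key` and `combine_per_key` are hypothetical helper names, and the customer and order data are invented for the example.

```python
from collections import defaultdict

def co_group_by_key(**tagged_inputs):
    """Model of a CoGroupByKey-style join over several keyed inputs.

    Each input is an iterable of (key, value) pairs. The result maps each
    key to a dict holding, per input tag, the list of values seen for it.
    """
    grouped = defaultdict(lambda: {tag: [] for tag in tagged_inputs})
    for tag, pairs in tagged_inputs.items():
        for key, value in pairs:
            grouped[key][tag].append(value)
    return dict(grouped)

def combine_per_key(pairs, combine_fn):
    """Model of a per-key combine: apply `combine_fn` to each key's values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: combine_fn(values) for key, values in groups.items()}

# Hypothetical keyed inputs.
customers = [("c1", "Alice"), ("c2", "Bob")]
orders = [("c1", 30), ("c1", 12), ("c3", 5)]

joined = co_group_by_key(customers=customers, orders=orders)
totals = combine_per_key(orders, sum)
```

Because every key keeps an entry for each input, even an empty one, inner, left, and full outer joins can all be derived from the grouped result by filtering on which lists are empty.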

Date-banded joins sound like a good fit for windowing, which allows you to write a pipeline that consumes windows of data (e.g. hourly windows, daily windows, or 7-day windows that slide every day).
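To show what assigning elements to such windows means, here is a small plain-Python sketch. It is a model, not the Dataflow windowing API. The function names are hypothetical, and aligning window boundaries to the Unix epoch is an assumption made for the example.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed alignment point for window boundaries

def fixed_window(ts, size):
    """Return the (start, end) of the single fixed window containing ts."""
    start = ts - ((ts - EPOCH) % size)
    return (start, start + size)

def sliding_windows(ts, size, period):
    """Return every (start, end) sliding window that contains ts.

    Windows are `size` long and a new one starts every `period`, so one
    timestamp can belong to several windows. Ends are exclusive.
    """
    start = ts - ((ts - EPOCH) % period)  # latest window start at or before ts
    windows = []
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return windows

ts = datetime(2015, 3, 10, 14, 30)
hourly = fixed_window(ts, timedelta(hours=1))
daily = fixed_window(ts, timedelta(days=1))
weekly_sliding = sliding_windows(ts, timedelta(days=7), timedelta(days=1))
```

With 7-day windows that slide every day, each timestamp lands in seven overlapping windows, so a downstream grouping sees each element once per window.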


Edit: Mention GroupByKey and CoGroupByKey.

