ETL - Complex join with Google Dataflow
I'm a newbie, trying to understand how we might re-write a batch ETL process in Google Dataflow. I've read some of the docs and run a few examples.
I'm proposing that the new ETL process would be driven by business events (i.e. a source PCollection). These would trigger the ETL process for that particular business entity. The ETL process would extract datasets from the source systems and then pass those results (PCollections) on to the next processing stage. The processing stages would involve various types of joins (including cartesian and non-key joins, e.g. date-banded).
So, a couple of questions here:
(1) Is the approach that I'm proposing valid and efficient? If not, what would be better? I haven't seen any presentations on real-world complex ETL processes using Google Dataflow, only simple scenarios.
Are there any "higher-level" ETL products that are a better fit? I've been keeping an eye on Spark and Flink for a while.
Our current ETL is moderately complex: there are only about 30 core tables (classic EDW dimensions and facts), but ~1000 transformation steps. The source data is complex (roughly 150 Oracle tables).
(2) How would the complex non-key joins be handled?
I'm attracted to Google Dataflow because it is an API first and foremost, and its parallel processing capabilities seem a good fit (we are being asked to move from overnight batch to incremental processing).
A worked example of Dataflow for this use case would really push adoption forward!
Thanks, Mike S
It sounds like Dataflow would be a good fit. It allows you to write a pipeline that takes a PCollection of business events and performs the ETL. The pipeline could be either batch (executed periodically) or streaming (executed whenever input data arrives).
The various joins are for the most part relatively expressible in Dataflow. For the cartesian product, you can look at using side inputs to make the contents of one PCollection available as an input to the processing of each element in another PCollection.
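To make the side-input idea concrete, here is a minimal sketch in plain Python. It is not the Dataflow SDK: the function name `cross_join_with_side_input` and the sample data are made up for illustration. It only models what the pattern computes, namely that the whole "side" collection is visible while each element of the "main" collection is processed.

```python
from typing import Iterable, Iterator, List, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def cross_join_with_side_input(
    main: Iterable[A], side: Iterable[B]
) -> Iterator[Tuple[A, B]]:
    """Pair every main element with every side element (cartesian product).

    Hypothetical helper, not a Dataflow API. The side collection is
    materialized once, which mirrors how a side input is made available
    in full to the per-element processing of the main collection.
    """
    side_view: List[B] = list(side)  # assumed small enough to hold in memory
    for element in main:
        for other in side_view:
            yield (element, other)


if __name__ == "__main__":
    events = ["order-1", "order-2"]   # illustrative main collection
    regions = ["EU", "US"]            # illustrative side collection
    for pair in cross_join_with_side_input(events, regions):
        print(pair)
```

The sketch assumes the side collection is the smaller of the two, since it is held in full for every element of the main collection.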
You can also look at using GroupByKey or CoGroupByKey to implement the joins. These flatten multiple inputs and allow accessing all values with the same key in one place. You can also use Combine.perKey to compute associative and commutative combinations of all the elements associated with a key (e.g., sum, min, max, average, etc.).
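As a rough illustration of what these keyed operations compute, the following plain-Python sketch models a two-input co-group and a per-key combine. The helpers `co_group_by_key` and `combine_per_key` are hypothetical stand-ins written for this example, not the SDK transforms themselves, and the order/customer data is invented.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def co_group_by_key(
    left: Iterable[Tuple[K, object]], right: Iterable[Tuple[K, object]]
) -> Dict[K, Tuple[List[object], List[object]]]:
    """Collect all values from both inputs that share a key.

    Hypothetical model of a co-group: each key maps to
    (values from left, values from right); either list may be empty.
    """
    grouped: Dict[K, Tuple[List[object], List[object]]] = defaultdict(
        lambda: ([], [])
    )
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)


def combine_per_key(
    pairs: Iterable[Tuple[K, V]], combine_fn: Callable[[V, V], V]
) -> Dict[K, V]:
    """Fold the values of each key with an associative, commutative function."""
    result: Dict[K, V] = {}
    for key, value in pairs:
        result[key] = combine_fn(result[key], value) if key in result else value
    return result


if __name__ == "__main__":
    orders = [("c1", 10), ("c2", 5), ("c1", 7)]      # (customer_id, amount)
    customers = [("c1", "Alice"), ("c3", "Carol")]   # (customer_id, name)
    print(co_group_by_key(orders, customers))
    print(combine_per_key(orders, lambda a, b: a + b))
```

Inner, left, or full outer join semantics then come down to how the empty lists in the co-grouped result are treated.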
Date-banded joins sound like they would be a good fit for windowing, which allows you to write a pipeline that consumes windows of data (e.g., hourly windows, daily windows, 7-day windows that slide every day, etc.).
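The "7-day windows that slide every day" case can be sketched in plain Python as follows. This is only a model of window assignment, not the Dataflow windowing API; `sliding_window_starts` and `group_by_window` are hypothetical names, and the sketch assumes day-granularity dates and a slide period of exactly one day.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Iterable, List, Tuple


def sliding_window_starts(event_day: date, size_days: int = 7) -> List[date]:
    """Start dates of every daily-sliding window that contains event_day.

    A window starting on day S covers S through S + size_days - 1,
    so an event falls into exactly size_days overlapping windows.
    """
    return [event_day - timedelta(days=offset) for offset in range(size_days)]


def group_by_window(
    events: Iterable[Tuple[date, str]], size_days: int = 7
) -> Dict[date, List[str]]:
    """Assign each (event_day, payload) to all of its windows.

    Returns a mapping from window start date to the payloads in that window.
    """
    windows: Dict[date, List[str]] = defaultdict(list)
    for event_day, payload in events:
        for start in sliding_window_starts(event_day, size_days):
            windows[start].append(payload)
    return dict(windows)


if __name__ == "__main__":
    events = [(date(2015, 1, 10), "a"), (date(2015, 1, 12), "b")]
    for start, payloads in sorted(group_by_window(events).items()):
        print(start, payloads)
```

Two events end up in the same group whenever their dates fall within the same 7-day span, which is the property a date-banded comparison would rely on.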
Edit: Mention GroupByKey and CoGroupByKey.