scala - Spark: Incremental collect() to a partition causes OutOfMemory in Heap -


i have following code. need print rdd console, collecting large rdd in smaller chunks collecting per partition. avoid collecting entire rdd @ once. when monitoring heap , gc log, seems nothing ever being gc'd. heap keeps on growing until hits outofmemory error. if understanding correct, in below once println statement executed collected rdds, won't needed safe gc, that's not see in gc log each call collect accumulates until oom. know why collected data not being gc'd?

 val writes = partitions.foreach { partition =>       val rddpartition = rdds.mappartitionswithindex ({          case (index, data) => if (index == partition.index) data else iterator[words]()       }, false).collect().toseq       val partialreport = report(rddpartition, reportid, datecreated)       println(partialreport.name)      } 

if dataset huge, master node cant handle , shutdown. may try writing them file (eg saveastextfile), read each file again


Comments

Popular posts from this blog

php - Wordpress website dashboard page or post editor content is not showing but front end data is showing properly -

javascript - Twitter Bootstrap - how to add some more margin between tooltip popup and element -

javascript - Get parameter of GET request -